Conceptual diagram of the Digital Retina pipeline: light enters a stylised eye, passes through translucent layers of neural circuitry rendered as cooperative neuron sheets in different colours, and emerges as a reconstructed pixel grid.
◇ Pipeline: light → 16 cooperating layers → structured noun phrases

Vision API · Beta · Free · Patent pending

We taught a 4-vCPU box to see like a VLM — without renting one.

Sixteen cooperating perceptual stages feed a learned pattern dictionary that stabilises read-out. 82.4% phrase coverage against frontier VLMs on 47 strictly held-out images. ~1.7 s per image. No GPU.

Beta tier · 50 imgs/key/day · no credit card

82.4% · out-of-sample phrase coverage
n=47 · held-out images, 2 VLM oracles
1.7 s · p50 latency, warm worker
16 · cooperating neural stages

◇ Architecture

Not a single model.
A choir of small ones.

Each layer hears one specific thing. Their combined output is stabilised by a learned dictionary — the same trick the visual cortex pulls.

Patent pending — specifics abstracted

Perceptual

OpenCLIP, YOLOv8, OCR, face + emotion, fine-grained vocabularies

Off-the-shelf vision models, batched and cache-shared. Each one names what it specifically knows.

Compositional

Per-region scoring, context-aware classifiers, caption grounding

Upstream localisations get re-scored under context-conditioned vocabularies — much finer than whole-image scoring.

Emergent

Dynamical-signature read-out via a learned pattern dictionary

A 50-dimensional signature in a structured dynamical system is matched against a labelled archetype dictionary that grows monotonically with new oracle labels.
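The read-out specifics are abstracted here, so the following is only a generic illustration of "match a signature against a labelled dictionary that only ever grows". It assumes cosine-similarity nearest-neighbour lookup, which is a stand-in and not necessarily what the system does. `ArchetypeDictionary`, `add`, and `match` are hypothetical names; only the 50-dimensional signature size comes from the text.

```python
import math

SIGNATURE_DIM = 50  # signature size stated in the text above


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ArchetypeDictionary:
    """Append-only store of (label, signature) pairs (hypothetical helper)."""

    def __init__(self):
        self._entries = []

    def __len__(self):
        return len(self._entries)

    def add(self, label, signature):
        """Add a labelled archetype; entries are never removed, so the store only grows."""
        if len(signature) != SIGNATURE_DIM:
            raise ValueError(f"expected a {SIGNATURE_DIM}-dimensional signature")
        self._entries.append((label, list(signature)))

    def match(self, signature, top_k=3):
        """Return up to top_k (label, similarity) pairs, most similar first."""
        scored = [(cosine(signature, sig), label) for label, sig in self._entries]
        scored.sort(reverse=True)
        return [(label, score) for score, label in scored[:top_k]]
```

A new oracle label would be added with `add(label, signature)`, and an incoming image's signature would be looked up with `match(signature)`. Because nothing is ever deleted, the dictionary size is non-decreasing, which is one simple reading of "grows monotonically".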

◇ What you get back

Structured. Specific.
Not just one sentence.

Each call returns the full structured read-out: VLM-style narrative, ranked concept tags with categories, YOLO objects with bounding boxes, OCR text, face count and emotions, detected scene type, dominant colours, fine-grained details, composition and texture descriptors, and image provenance (photograph / traditional-art / AI-generated).

Or use the /v1beta/models/…:generateContent shim and keep your existing Gemini SDK call unchanged.

{
  "description": "[1280x853] 3 persons, dining table, two laptops. Scene: people meeting in an office.",
  "concepts": [
    { "name": "people meeting in office", "category": "activity", "score": 0.247 }
  ],
  "objects": [
    { "class": "person", "conf": 0.93 },
    { "class": "laptop", "conf": 0.78 },
    { "class": "dining table", "conf": 0.71 }
  ],
  "places": [
    { "category": "conference center", "score": 0.24 }
  ],
  "dominant_colors": ["black", "dark brown", "gray"],
  "fine_objects": ["id badge", "laptop computer", "candid photography style"],
  "ai_detection": { "verdict": "photograph", "confidence": 0.46 },
  "elapsed_ms": 1618
}
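As a rough sketch of consuming this read-out on the client side, the Python below parses a trimmed copy of the sample response and pulls out a few fields. The field names are taken from the sample above; the `summarise` helper and its `min_conf` threshold are hypothetical and chosen only for illustration.

```python
import json

# Trimmed copy of the sample response shown above.
SAMPLE = """
{
  "concepts": [
    { "name": "people meeting in office", "category": "activity", "score": 0.247 }
  ],
  "objects": [
    { "class": "person", "conf": 0.93 },
    { "class": "laptop", "conf": 0.78 },
    { "class": "dining table", "conf": 0.71 }
  ],
  "ai_detection": { "verdict": "photograph", "confidence": 0.46 },
  "elapsed_ms": 1618
}
"""


def summarise(readout, min_conf=0.75):
    """Reduce a read-out dict to a few fields (hypothetical helper, arbitrary threshold)."""
    objects = [
        obj["class"]
        for obj in readout.get("objects", [])
        if obj.get("conf", 0.0) >= min_conf
    ]
    top_concept = max(
        readout.get("concepts", []), key=lambda c: c["score"], default=None
    )
    detection = readout.get("ai_detection", {})
    return {
        "objects": objects,
        "top_concept": top_concept["name"] if top_concept else None,
        "provenance": (detection.get("verdict"), detection.get("confidence")),
        "elapsed_s": readout.get("elapsed_ms", 0) / 1000,
    }


if __name__ == "__main__":
    print(summarise(json.loads(SAMPLE)))
```

Every lookup uses `.get` with a default, on the assumption that some sections may be absent for a given image; the page does not say which fields are guaranteed.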

◇ Drop-in usage

Swap one endpoint.
Keep your code.

If you already use the Gemini SDK, or just curl-post base64 images, point the endpoint at retina.frank.ink and keep the same request/response shape.

The Gemini-compat surface is at /v1beta/models/{model}:generateContent with the same content/parts/inline_data shape. Your existing agent framework will not notice the swap.

# Retina-native (structured JSON):
curl https://retina.frank.ink/v1/analyze \
  -H "Authorization: Bearer rk_live_..." \
  -F "file=@photo.jpg" \
  -F "hint=is there a dog?"

# Gemini-compat:
curl "https://retina.frank.ink/v1beta/models/gemini-flash-latest:generateContent" \
  -H "x-goog-api-key: rk_live_..." \
  -H "Content-Type: application/json" \
  -d '{ "contents": [ ... ] }'
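For readers who prefer Python to curl, here is a minimal sketch of preparing the Gemini-compat request with only the standard library. The body follows the public Gemini REST shape (`contents` / `parts` / `inline_data` with `mime_type` and base64 `data`), on the assumption that the shim accepts it as described. `build_body` and `build_request` are hypothetical helper names, and whether a text part is required or honoured is not stated on this page.

```python
import base64
import json
import urllib.request

BASE_URL = "https://retina.frank.ink"  # host from the text above


def build_body(image_bytes, prompt, mime_type="image/jpeg"):
    """Build a Gemini-style body with one text part and one inline image part."""
    return {
        "contents": [
            {
                "parts": [
                    {"text": prompt},
                    {
                        "inline_data": {
                            "mime_type": mime_type,
                            "data": base64.b64encode(image_bytes).decode("ascii"),
                        }
                    },
                ]
            }
        ]
    }


def build_request(api_key, body, model="gemini-flash-latest"):
    """Prepare (but do not send) the POST request for the compat endpoint."""
    url = f"{BASE_URL}/v1beta/models/{model}:generateContent"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"x-goog-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )
```

Sending is then a matter of `urllib.request.urlopen(build_request(key, build_body(data, "is there a dog?")))` and decoding the JSON response. The sketch stops short of that so it can be read and checked without a live key.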

◇ What this does not do

No compositional reasoning

Phrase coverage measures whether concepts appear in the output. It does not test whether they are bound together correctly: for 'red shirt on man', the words 'red', 'shirt', and 'man' can appear in any arrangement and still count as covered.
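To make that limitation concrete, here is a simplified bag-of-words version of a coverage score. It is an illustrative assumption, not the metric as defined in the paper: a phrase counts as covered when all of its words occur somewhere in the output. `phrase_coverage` and `tokens` are hypothetical names.

```python
import re


def tokens(text):
    """Lower-case the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def phrase_coverage(oracle_phrases, output_text):
    """Fraction of oracle phrases whose words all appear in the output, in any order."""
    if not oracle_phrases:
        return 0.0
    output_tokens = tokens(output_text)
    covered = sum(1 for phrase in oracle_phrases if tokens(phrase) <= output_tokens)
    return covered / len(oracle_phrases)


if __name__ == "__main__":
    oracle = ["red shirt", "man"]
    # Correct binding and scrambled binding receive the same score.
    print(phrase_coverage(oracle, "a man in a red shirt"))
    print(phrase_coverage(oracle, "a red man holding a shirt"))
```

Both outputs score identically under this measure even though only the first binds 'red' to 'shirt', which is exactly the gap described above.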

Narrow image distribution

Trained on stock photography, art, AI illustration, and world architecture. Out-of-distribution behaviour on medical, microscopy, satellite, and industrial-inspection imagery is unmeasured and likely poor.

Named entities are hard

Hyper-specific brand names, proper nouns, and named cultural objects are the dominant failure mode. This is a coverage gap, not an architectural failure.

◇ Try it

Sign up. 50 images per day.
Free during Beta.

Self-serve API keys via Clerk. Retina-native JSON or Gemini-compatible drop-in. Read the paper for the empirical case; bring an image to see the system answer.