
Vision API · Beta · Free · Patent pending
Sixteen cooperating perceptual stages feed a learned pattern dictionary that stabilises read-out. 82.4% phrase coverage against frontier VLMs on 47 strictly held-out images. ~1.7 s per image. No GPU.
Beta tier · 50 imgs/key/day · no credit card
◇ Architecture
Each layer hears one specific thing. Their combined output is stabilised by a learned dictionary — the same trick the visual cortex pulls.
Patent pending — specifics abstracted
Perceptual
Off-the-shelf vision models, batched and cache-shared. Each one names what it specifically knows.
Compositional
Upstream localisations get re-scored under context-conditioned vocabularies — much finer than whole-image scoring.
Emergent
A 50-dimensional signature in a structured dynamical system is matched against a labelled archetype dictionary that grows monotonically with new oracle labels.
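As a rough illustration of dictionary matching in general, and not of the patent-pending mechanism itself, the lookup can be pictured as a nearest-neighbour search over a store that only ever grows. In the sketch below, the class name, the cosine metric, the labels, and the random signatures are all assumptions made for illustration; only the 50-dimensional signature size and the grow-only behaviour come from the description above.

```python
import math
import random

SIGNATURE_DIM = 50  # signature size stated in the text


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ArchetypeDictionary:
    """Hypothetical append-only store of (label, signature) pairs."""

    def __init__(self):
        self._entries = []

    def add(self, label, signature):
        # Entries are only ever appended, so the store grows monotonically.
        if len(signature) != SIGNATURE_DIM:
            raise ValueError("expected a 50-dimensional signature")
        self._entries.append((label, list(signature)))

    def match(self, signature):
        """Return (label, similarity) for the closest stored archetype."""
        if not self._entries:
            raise LookupError("dictionary is empty")
        scored = ((label, cosine(signature, stored)) for label, stored in self._entries)
        return max(scored, key=lambda pair: pair[1])

    def __len__(self):
        return len(self._entries)


rng = random.Random(0)  # fixed seed so the toy data is reproducible
archetypes = ArchetypeDictionary()
office = [rng.gauss(0, 1) for _ in range(SIGNATURE_DIM)]
beach = [rng.gauss(0, 1) for _ in range(SIGNATURE_DIM)]
archetypes.add("office meeting", office)
archetypes.add("beach", beach)

# A lightly perturbed copy of a stored signature should match its own label.
query = [x + rng.gauss(0, 0.05) for x in office]
label, similarity = archetypes.match(query)
print(label, round(similarity, 3))
```

A linear scan is enough for a toy dictionary; a real system with many archetypes would presumably use an index, which is outside what this page describes.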
◇ What you get back
Each call returns the full structured read-out: VLM-style narrative, ranked concept tags with categories, YOLO objects with bounding boxes, OCR text, face count and emotions, detected scene type, dominant colours, fine-grained details, composition and texture descriptors, and image provenance (photograph / traditional-art / AI-generated).
Or use the /v1beta/models/…:generateContent shim and keep your existing Gemini SDK call unchanged.
{
  "description": "[1280x853] 3 persons, dining table, two laptops. Scene: people meeting in an office.",
  "concepts": [
    { "name": "people meeting in office", "category": "activity", "score": 0.247 }
  ],
  "objects": [
    { "class": "person", "conf": 0.93 },
    { "class": "laptop", "conf": 0.78 },
    { "class": "dining table", "conf": 0.71 }
  ],
  "places": [
    { "category": "conference center", "score": 0.24 }
  ],
  "dominant_colors": ["black", "dark brown", "gray"],
  "fine_objects": ["id badge", "laptop computer", "candid photography style"],
  "ai_detection": {
    "verdict": "photograph",
    "confidence": 0.46
  },
  "elapsed_ms": 1618
}
◇ Drop-in usage
If you're already using the Gemini SDK, or just posting base64 images with curl, point the endpoint at retina.frank.ink and keep the same request/response shape.
The Gemini-compat surface is at /v1beta/models/{model}:generateContent with the same contents/parts/inline_data shape. Your existing agent framework will not notice the swap.
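For callers not using an SDK, the Gemini-compat request can be assembled by hand. The sketch below builds the request with the Python standard library and stops short of sending it. The host, path, model name, and x-goog-api-key header are taken from this page; the helper name, the JPEG MIME type, and the assumption that the shim accepts a text part alongside the image are illustrative only.

```python
import base64
import json
import urllib.request

BASE_URL = "https://retina.frank.ink"  # host named on this page


def build_compat_request(image_bytes, prompt, api_key,
                         model="gemini-flash-latest", mime_type="image/jpeg"):
    """Build (but do not send) a Gemini-style generateContent request."""
    payload = {
        "contents": [{
            "parts": [
                {"text": prompt},  # assumed: the shim reads a text part as the prompt
                {"inline_data": {
                    "mime_type": mime_type,
                    "data": base64.b64encode(image_bytes).decode("ascii"),
                }},
            ]
        }]
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1beta/models/{model}:generateContent",
        data=json.dumps(payload).encode("utf-8"),
        headers={"x-goog-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


# Three placeholder bytes stand in for a real JPEG; the key is left elided.
request = build_compat_request(b"\xff\xd8\xff", "is there a dog?", "rk_live_...")
print(request.full_url)

# To send it, a caller would do something like:
#   with urllib.request.urlopen(request) as response:
#       reply = json.load(response)
```

Keeping construction separate from sending makes the payload easy to inspect or log before any quota is spent.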
# Retina-native (structured JSON):
curl https://retina.frank.ink/v1/analyze \
  -H "Authorization: Bearer rk_live_..." \
  -F "file=@photo.jpg" \
  -F "hint=is there a dog?"
# Gemini-compat:
curl "https://retina.frank.ink/v1beta/models/gemini-flash-latest:generateContent" \
  -H "x-goog-api-key: rk_live_..." \
  -H "Content-Type: application/json" \
  -d '{ "contents": [ ... ] }'
◇ What this does not do
Phrase coverage measures whether concepts appear in the output; it does not test whether they are bound to each other correctly. For 'red shirt on man', the words 'red', 'shirt', and 'man' may appear in any arrangement and still count as covered.
Trained on stock photography, art, AI illustration, and world architecture. Out-of-distribution behaviour on medical, microscopy, satellite, and industrial-inspection imagery is unmeasured and likely poor.
Hyper-specific brand names, proper nouns, and named cultural objects are the dominant failure mode. This is a coverage gap, not an architectural failure.
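To make the first limitation concrete, here is a toy coverage score. It is an assumption for illustration (plain word matching), not the metric behind the 82.4% figure; it only shows why a description with swapped bindings can score the same as a correct one.

```python
import re


def phrase_coverage(reference_terms, output_text):
    """Fraction of reference terms that occur as whole words in the output."""
    if not reference_terms:
        return 0.0
    words = set(re.findall(r"[a-z]+", output_text.lower()))
    found = sum(term.lower() in words for term in reference_terms)
    return found / len(reference_terms)


reference = ["red", "shirt", "man"]

correct = "A man wearing a red shirt stands beside a woman."
swapped = "A woman wearing a red shirt stands beside a man."
partial = "A woman stands beside a man."

print(phrase_coverage(reference, correct))  # 1.0
print(phrase_coverage(reference, swapped))  # 1.0, although the shirt is on the wrong person
print(phrase_coverage(reference, partial) < 1.0)  # True
```

Checking bindings would need a separate test, for example comparing attribute-object pairs rather than individual terms.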
◇ Try it
Self-serve API keys via Clerk. Retina-native JSON or Gemini-compatible drop-in. Read the paper for the empirical case; bring an image to see the system answer.
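For a first experiment against the Retina-native endpoint, a standard-library Python sketch might look like the following. The URL, the Bearer header, and the file and hint form fields mirror the curl example on this page; the helper names, the multipart encoding details, the JPEG MIME type, and the 0.75 confidence cut-off are assumptions for illustration. The request is built but not sent, and the summary runs on a trimmed copy of the sample response shown earlier.

```python
import json
import urllib.request
import uuid


def build_analyze_request(image_bytes, filename, api_key, hint=None,
                          mime_type="image/jpeg"):
    """Build (but do not send) a multipart POST for /v1/analyze."""
    boundary = uuid.uuid4().hex
    chunks = []
    if hint is not None:
        chunks.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="hint"'
            f"\r\n\r\n{hint}\r\n".encode("utf-8")
        )
    chunks.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{filename}"\r\nContent-Type: {mime_type}\r\n\r\n'.encode("utf-8")
        + image_bytes + b"\r\n"
    )
    chunks.append(f"--{boundary}--\r\n".encode("utf-8"))
    return urllib.request.Request(
        "https://retina.frank.ink/v1/analyze",
        data=b"".join(chunks),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
        method="POST",
    )


def summarise(result, min_conf=0.75):
    """Pick a few fields out of a read-out shaped like the sample response."""
    return {
        "objects": [o["class"] for o in result.get("objects", [])
                    if o["conf"] >= min_conf],
        "top_concept": max(result.get("concepts", []),
                           key=lambda c: c["score"], default=None),
        "verdict": result.get("ai_detection", {}).get("verdict"),
    }


# Trimmed copy of the sample response shown earlier on this page.
sample = json.loads("""
{
  "concepts": [
    {"name": "people meeting in office", "category": "activity", "score": 0.247}
  ],
  "objects": [
    {"class": "person", "conf": 0.93},
    {"class": "laptop", "conf": 0.78},
    {"class": "dining table", "conf": 0.71}
  ],
  "ai_detection": {"verdict": "photograph", "confidence": 0.46}
}
""")

request = build_analyze_request(b"placeholder image bytes", "photo.jpg",
                                "rk_live_...", hint="is there a dog?")
print(request.get_method(), request.full_url)
print(summarise(sample))

# To send it, a caller would do something like:
#   with urllib.request.urlopen(request) as response:
#       print(summarise(json.load(response)))
```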