Feat/2470a ollama vision #68
Merged
#2470 (slice a) — vision-input path for the four built-in adapters,
with programmatic image fixtures and per-provider live tests. Sibling
work (`Content` → `LlmMessage` translation, multipart `@Generable`
input via KSP, manifest-anchored capability validation) is the rest
of #2470 / #2468 / #2471, layered on top of this.
```kotlin
val png = VisionFixtures.threeSquaresPng()
val client = OllamaClient(model = "qwen3-vl:8b", temperature = 0.0)
client.chat(
    listOf(
        LlmMessage(
            role = "user",
            content = "How many squares?",
            // One base64-encoded PNG attached to the user message.
            images = listOf(ImagePart(VisionFixtures.toBase64(png), ImagePart.WireMime.Png)),
        ),
    ),
)
// → LlmResponse.Text("3")
```
Implementation:
- `LlmMessage.images: List<ImagePart>? = null`: optional, with a back-compat
default. Adapters translate it to the per-provider wire format only when it
is non-null, non-empty, and the role is "user"; otherwise the wire output is
identical to pre-#2470.
- `ImagePart(base64: String, wireMime: ImagePart.WireMime)`: `WireMime` is a
closed sealed type (Png / Jpeg / Gif / Webp); the public constructor does not
accept a string MIME type. Base64 is stored as a `String` so structural
equals/hashCode work (avoiding the `ByteArray` data-class trap).
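A minimal sketch of the data model described above, reconstructed from this description only. The declarations are hypothetical stand-ins (the real classes may carry more fields or different visibility), and `RawImage` exists purely to show the array-equality trap that the `String` field sidesteps.

```kotlin
import java.util.Base64

// Hypothetical reconstruction; not the PR's actual declarations.
data class ImagePart(val base64: String, val wireMime: WireMime) {
    // Closed set of MIME types: callers cannot pass an arbitrary string.
    sealed class WireMime(val value: String) {
        object Png : WireMime("image/png")
        object Jpeg : WireMime("image/jpeg")
        object Gif : WireMime("image/gif")
        object Webp : WireMime("image/webp")
    }
}

data class LlmMessage(
    val role: String,
    val content: String,
    val images: List<ImagePart>? = null, // null keeps existing call sites unchanged
)

// Illustration only: a data class holding a ByteArray compares the array by reference.
data class RawImage(val bytes: ByteArray)

fun byteArrayTrapDemo(): Pair<Boolean, Boolean> {
    val bytes = byteArrayOf(1, 2, 3)
    // Same content, different array instances: generated equals() says "not equal".
    val rawEqual = RawImage(bytes.copyOf()) == RawImage(bytes.copyOf())
    // Same content as a String: generated equals() compares structurally.
    val b64 = Base64.getEncoder().encodeToString(bytes)
    val partEqual = ImagePart(b64, ImagePart.WireMime.Png) == ImagePart(b64, ImagePart.WireMime.Png)
    return rawEqual to partEqual
}
```

Under these assumptions `byteArrayTrapDemo()` returns `(false, true)`, which is why structural equality on messages with images works when the payload is held as a `String`.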
Per-provider wire shapes (pinned by VisionWireFormatTest):
| Provider | User-message shape |
|-----------|-------------------------------------------------------|
| Ollama | `{role:"user", content:"text", images:["<b64>", ...]}` |
| Claude | `{role:"user", content:[{type:"text"}, {type:"image", source:{type:"base64", media_type:"image/png", data:"<b64>"}}, ...]}` |
| OpenAI | `{role:"user", content:[{type:"text"}, {type:"image_url", image_url:{url:"data:image/png;base64,<b64>"}}, ...]}` |
| DeepSeek | inherits OpenAI; most DeepSeek models lack vision and silently ignore the field (shape-tested, no live call) |
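The Claude and OpenAI rows of the table can be sketched as plain map builders. This is an illustration, not the adapters' code: `Img`, `claudeContent`, and `openAiContent` are hypothetical names, the MIME type is a plain string only to keep the sketch short, and the `text` key on the text part is an assumption based on the public provider APIs (the table abbreviates it).

```kotlin
// Hypothetical stand-in for ImagePart, used only in this sketch.
data class Img(val base64: String, val mime: String)

// Claude-style content: one text part, then one base64 "image" part per image.
fun claudeContent(text: String, images: List<Img>): List<Map<String, Any>> =
    listOf(mapOf<String, Any>("type" to "text", "text" to text)) +
        images.map { img ->
            mapOf<String, Any>(
                "type" to "image",
                "source" to mapOf(
                    "type" to "base64",
                    "media_type" to img.mime,
                    "data" to img.base64,
                ),
            )
        }

// OpenAI-style content: images travel as data URLs inside "image_url".
fun openAiContent(text: String, images: List<Img>): List<Map<String, Any>> =
    listOf(mapOf<String, Any>("type" to "text", "text" to text)) +
        images.map { img ->
            mapOf<String, Any>(
                "type" to "image_url",
                "image_url" to mapOf("url" to "data:${img.mime};base64,${img.base64}"),
            )
        }
```

Both builders keep the text part first and preserve image order, which matches the multi-image shape in the table.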
Each adapter's vision path is gated:
- role must be "user" — system/assistant/tool messages with non-null
`images` ignore the field on the wire (no provider's API carries
images on those roles).
- `images = null` or empty → exact pre-#2470 wire shape (back-compat
pinned by dedicated tests).
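The two gates above could look roughly like this for the Ollama shape. `Msg` and `ollamaMessage` are hypothetical names for this sketch, and images are reduced to bare base64 strings; the real adapter may structure this differently.

```kotlin
// Hypothetical simplified message; images are already base64 strings here.
data class Msg(val role: String, val content: String, val images: List<String>? = null)

// Builds an Ollama-style message map. The "images" key is emitted only for
// user messages with at least one image; every other case yields the same
// two-key map as a message without images.
fun ollamaMessage(msg: Msg): Map<String, Any> {
    val wire = linkedMapOf<String, Any>("role" to msg.role, "content" to msg.content)
    val images = msg.images.orEmpty()
    if (msg.role == "user" && images.isNotEmpty()) {
        wire["images"] = images
    }
    return wire
}
```

Treating `null` and an empty list the same way keeps the back-compat path to a single branch, so there is only one "no images" wire shape to pin in tests.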
`VisionFixtures` (test source set): 256×256 PNGs generated via
`BufferedImage` + `ImageIO`. Two fixtures —
`threeSquaresPng()` (red/blue/green squares, well-separated, thick
black outlines so counting is unambiguous) and `housePng()` (triangle
roof + body + door + two windows, terracotta + beige colour scheme).
Reproducible byte-for-byte; ships in source, no external assets.
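For orientation, a fixture of this kind can be produced with nothing but the JDK. The sketch below is not the PR's `VisionFixtures`; the function names, coordinates, and stroke width are assumptions chosen to give three separated, outlined squares on a 256×256 canvas.

```kotlin
import java.awt.BasicStroke
import java.awt.Color
import java.awt.image.BufferedImage
import java.io.ByteArrayOutputStream
import java.util.Base64
import javax.imageio.ImageIO

// Hypothetical sketch: draws red, blue, and green squares with black outlines
// on a white background and encodes the result as PNG bytes.
fun threeSquaresSketch(): ByteArray {
    val img = BufferedImage(256, 256, BufferedImage.TYPE_INT_RGB)
    val g = img.createGraphics()
    try {
        g.color = Color.WHITE
        g.fillRect(0, 0, 256, 256)
        listOf(Color.RED, Color.BLUE, Color.GREEN).forEachIndexed { i, fill ->
            val x = 20 + i * 80 // squares start at x = 20, 100, 180 with 20 px gaps
            g.color = fill
            g.fillRect(x, 98, 60, 60)
            g.color = Color.BLACK
            g.stroke = BasicStroke(4f)
            g.drawRect(x, 98, 60, 60)
        }
    } finally {
        g.dispose()
    }
    val out = ByteArrayOutputStream()
    ImageIO.write(img, "png", out)
    return out.toByteArray()
}

// Base64 text form, as expected by an ImagePart-style payload.
fun toBase64Sketch(bytes: ByteArray): String = Base64.getEncoder().encodeToString(bytes)
```

The output starts with the PNG magic bytes (`0x89 'P' 'N' 'G'`), which is the kind of sanity check the wire-format tests describe.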
Tests:
- VisionWireFormatTest.kt (8 cases): per-provider wire shape for both
the vision path and the no-images back-compat path; multiple images
in one message; non-user-role images filtered; PNG fixture sanity
(magic bytes + reasonable size).
- VisionLiveTest.kt (6 cases): per-provider end-to-end against:
* Ollama qwen3-vl:8b — tagged `live-llm`, runs via
`./gradlew integrationTest`
* Claude Haiku 4.5 — tagged `live-cloud-api`, runs in default
`:test`, assumeTrue skips when no key
* OpenAI gpt-4o-mini — same pattern
Cost discipline per call: 256×256 PNG (~5KB), temperature=0,
maxTokens=80, single-turn. Each test sends a fixture image with a
short text prompt, parses the text response, asserts loose keyword
match (3 / three for the squares; house / home / cottage / building
/ cabin / barn for the house). Model names overridable via env
(`AGENTSKT_TEST_OLLAMA_VISION_MODEL`, etc.) for CI flexibility.
Full unit suite: 1794 tests, 0 failures.
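The loose keyword assertion and the env-based model override used by the live tests might be sketched as follows. `visionModel` and `matchesAny` are hypothetical helper names; only the environment variable name and the keyword lists come from the description above.

```kotlin
// Returns the model named in the environment variable, or the default when
// the variable is unset or blank.
fun visionModel(envVar: String, default: String): String =
    System.getenv(envVar)?.takeIf { it.isNotBlank() } ?: default

// Loose match: true when the response contains any keyword, ignoring case.
fun matchesAny(response: String, keywords: List<String>): Boolean {
    val lower = response.lowercase()
    return keywords.any { it.lowercase() in lower }
}

val squareKeywords = listOf("3", "three")
val houseKeywords = listOf("house", "home", "cottage", "building", "cabin", "barn")
```

A live test would then resolve its model with something like `visionModel("AGENTSKT_TEST_OLLAMA_VISION_MODEL", "qwen3-vl:8b")` and assert `matchesAny(responseText, squareKeywords)`. Substring matching is deliberately forgiving, so it tolerates phrasing differences between providers at the cost of occasional false positives.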
To run the live vision tests:
- `./gradlew integrationTest --tests "*VisionLiveTest*"`: Ollama
(requires `qwen3-vl:8b` pulled locally or available via Ollama Cloud)
- `./gradlew test --tests "*VisionLiveTest*"`: Claude + OpenAI (these run
in the default `:test` task under the `live-cloud-api` tag; `assumeTrue`
skips each provider when no key is set)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- docs/multimodal.md: new "Vision input — talking to the model (#2470
slice a)" section between the existing foundation content and the
"What's still coming" list. Walks through the LlmMessage.images field,
ImagePart shape, per-provider wire-format table, back-compat +
role-gating guarantees, programmatic VisionFixtures, and the
per-provider live test how-to-run. "What's coming" list updated to flag
the #2470 slice-a/slice-b split (this commit is slice a; the
Content → LlmMessage.images loop translation is slice b).
- README.md: new "Vision input to models" bullet right after the
multimodal foundation bullet. Names all four providers and their default
test models with the cost-discipline notes.
- CHANGELOG.md `## [Unreleased]`: new "Added — Vision input across all
providers (#2470 slice a)" section ABOVE the existing multimodal
foundation section. Covers LlmMessage.images + ImagePart, per-provider
adapter rows, role-gating, fixtures, live test setup + cost discipline,
wire-format unit-test count.
No source changes. Full suite stays at 1794 / 0 failures from the prior
commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>