From 273a6cb672bf3af3a5fc3d3aa0e4b1432e3921ab Mon Sep 17 00:00:00 2001
From: skobeltsyn
Date: Sat, 30 May 2026 12:57:33 +0300
Subject: [PATCH 1/2] feat(#2470): vision input across all providers + live integration tests
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

#2470 (slice a) — vision-input path for the four built-in adapters, with
programmatic image fixtures and per-provider live tests. Sibling work
(`Content` → `LlmMessage` translation, multipart `@Generable` input via
KSP, manifest-anchored capability validation) is the rest of #2470 /
#2468 / #2471, layered on top of this.

```kotlin
val png = VisionFixtures.threeSquaresPng()
val client = OllamaClient(model = "qwen3-vl:8b", temperature = 0.0)
client.chat(listOf(
    LlmMessage(
        role = "user",
        content = "How many squares?",
        images = listOf(ImagePart(VisionFixtures.toBase64(png), ImagePart.WireMime.Png)),
    ),
))
// → LlmResponse.Text("3")
```

Implementation:

- `LlmMessage.images: List<ImagePart>? = null` — optional, back-compat
  default. Adapters translate to per-provider wire when non-null AND
  role is "user"; otherwise zero diff vs pre-#2470.
- `ImagePart(base64: String, wireMime: ImagePart.WireMime)` — closed
  WireMime sealed type (Png / Jpeg / Gif / Webp). String mime is not
  accepted in the public ctor. Base64 stored as `String` so structural
  equals/hashCode work (the `ByteArray` data-class trap, avoided).
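
The `ByteArray` data-class trap named in the last bullet can be reproduced in a few lines. This is a minimal sketch and not code from this patch: `RawImage` and `EncodedImage` are hypothetical stand-ins (the real type is `ImagePart`), used only to contrast how a data class compares an array property with how it compares a `String` property.

```kotlin
import java.util.Base64

// Hypothetical stand-in: a data class that holds the raw bytes.
// The generated equals() compares the array property by reference.
data class RawImage(val bytes: ByteArray)

// Hypothetical stand-in mirroring ImagePart's choice: base64 kept as a String,
// so the generated equals()/hashCode() are structural.
data class EncodedImage(val base64: String)

// Two instances built from equal byte content, but from distinct arrays.
fun rawEquals(): Boolean {
    val a = RawImage(byteArrayOf(1, 2, 3))
    val b = RawImage(byteArrayOf(1, 2, 3))
    return a == b
}

// Same byte content, encoded up front; equality now follows the content.
fun encodedEquals(): Boolean {
    val encoder = Base64.getEncoder()
    val a = EncodedImage(encoder.encodeToString(byteArrayOf(1, 2, 3)))
    val b = EncodedImage(encoder.encodeToString(byteArrayOf(1, 2, 3)))
    return a == b && a.hashCode() == b.hashCode()
}

fun main() {
    println(rawEquals())     // false: distinct arrays, reference comparison
    println(encodedEquals()) // true: equal strings, structural comparison
}
```

Encoding once at construction is what lets two image parts with the same content compare equal, which matters whenever messages are compared in tests or used as cache keys.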
Per-provider wire shapes (pinned by VisionWireFormatTest):

| Provider | User-message shape |
|----------|--------------------|
| Ollama   | `{role:"user", content:"text", images:["<base64>", ...]}` |
| Claude   | `{role:"user", content:[{type:"text"}, {type:"image", source:{type:"base64", media_type:"image/png", data:"<base64>"}}, ...]}` |
| OpenAI   | `{role:"user", content:[{type:"text"}, {type:"image_url", image_url:{url:"data:image/png;base64,<base64>"}}, ...]}` |
| DeepSeek | inherits OpenAI; most DeepSeek models lack vision and silently ignore the field (shape-tested, no live call) |

Each adapter's vision path is gated:
- role must be "user" — system/assistant/tool messages with non-null
  `images` ignore the field on the wire (no provider's API carries
  images on those roles).
- `images = null` or empty → exact pre-#2470 wire shape (back-compat
  pinned by dedicated tests).

`VisionFixtures` (test source set): 256×256 PNGs generated via
`BufferedImage` + `ImageIO`. Two fixtures — `threeSquaresPng()`
(red/blue/green squares, well-separated, thick black outlines so
counting is unambiguous) and `housePng()` (triangle roof + body + door +
two windows, terracotta + beige colour scheme). Reproducible
byte-for-byte; ships in source, no external assets.

Tests:
- VisionWireFormatTest.kt (10 cases: 8 wire-shape, 2 fixture-sanity):
  per-provider wire shape for both the vision path and the no-images
  back-compat path; multiple images in one message; non-user-role images
  filtered; PNG fixture sanity (magic bytes + reasonable size).
- VisionLiveTest.kt (6 cases): per-provider end-to-end against:
  * Ollama qwen3-vl:8b — tagged `live-llm`, runs via `./gradlew integrationTest`
  * Claude Haiku 4.5 — tagged `live-cloud-api`, runs in default `:test`, assumeTrue skips when no key
  * OpenAI gpt-4o-mini — same pattern

Cost discipline per call: 256×256 PNG (~5KB), temperature=0, maxTokens=80, single-turn.
Each test sends a fixture image with a short text prompt, parses the text response, asserts loose keyword match (3 / three for the squares; house / home / cottage / building / cabin / barn for the house). Model names overridable via env (`AGENTSKT_TEST_OLLAMA_VISION_MODEL`, etc.) for CI flexibility. Full unit suite: 1794 tests, 0 failures. To run the live vision tests: - `./gradlew integrationTest --tests "*VisionLiveTest*"` — Ollama (requires `qwen3-vl:8b` pulled in local or Ollama Cloud) - `./gradlew test --tests "*VisionLiveTest*"` — Claude + OpenAI (run in default :test under live-cloud-api tag; assumeTrue skips per provider when no key) Co-Authored-By: Claude Opus 4.7 (1M context) --- .../agents_engine/model/ClaudeClient.kt | 21 +- .../kotlin/agents_engine/model/ModelClient.kt | 50 ++++ .../agents_engine/model/OllamaClient.kt | 13 + .../agents_engine/model/OpenAiClient.kt | 25 +- .../agents_engine/model/VisionFixtures.kt | 108 +++++++++ .../agents_engine/model/VisionLiveTest.kt | 223 ++++++++++++++++++ .../model/VisionWireFormatTest.kt | 182 ++++++++++++++ 7 files changed, 619 insertions(+), 3 deletions(-) create mode 100644 src/test/kotlin/agents_engine/model/VisionFixtures.kt create mode 100644 src/test/kotlin/agents_engine/model/VisionLiveTest.kt create mode 100644 src/test/kotlin/agents_engine/model/VisionWireFormatTest.kt diff --git a/src/main/kotlin/agents_engine/model/ClaudeClient.kt b/src/main/kotlin/agents_engine/model/ClaudeClient.kt index 357b73d..5c4b052 100644 --- a/src/main/kotlin/agents_engine/model/ClaudeClient.kt +++ b/src/main/kotlin/agents_engine/model/ClaudeClient.kt @@ -389,7 +389,26 @@ open class ClaudeClient( val cacheControl = if (msg.cacheHint != null && consumeBreakpoint()) cacheControlJson(msg.cacheHint) else null when (msg.role) { "user" -> { - if (cacheControl == null) { + val images = msg.images + if (!images.isNullOrEmpty()) { + // #2470 — vision input. 
Anthropic accepts a content + // array of typed blocks; one text block + N image + // blocks. Each image block is base64-source with a + // typed media_type. + val textBlock = """{"type":"text","text":${msg.content.toJsonString()}}""" + val imageBlocks = images.joinToString(",") { part -> + """{"type":"image","source":{"type":"base64","media_type":${part.wireMime.value.toJsonString()},"data":${part.base64.toJsonString()}}}""" + } + val allBlocks = "$textBlock,$imageBlocks" + val withCache = if (cacheControl != null) { + // Attach cache_control to the LAST block. + val splitAt = allBlocks.lastIndexOf("}") + allBlocks.substring(0, splitAt) + ",$cacheControl" + allBlocks.substring(splitAt) + } else { + allBlocks + } + """{"role":"user","content":[$withCache]}""" + } else if (cacheControl == null) { """{"role":"user","content":${msg.content.toJsonString()}}""" } else { // Single text content block with cache_control attached. diff --git a/src/main/kotlin/agents_engine/model/ModelClient.kt b/src/main/kotlin/agents_engine/model/ModelClient.kt index 5397ca3..486836a 100644 --- a/src/main/kotlin/agents_engine/model/ModelClient.kt +++ b/src/main/kotlin/agents_engine/model/ModelClient.kt @@ -26,8 +26,58 @@ data class LlmMessage( * unchanged on the wire. */ val cacheHint: CacheHint? = null, + /** + * #2470 — optional vision input. When non-null and the role is + * `"user"`, adapters translate each [ImagePart] into the provider's + * native image payload alongside [content]: + * + * - Ollama (e.g. qwen3-vl:8b) — `images: [, ...]` array + * on the user message; [content] stays the text prompt. + * - Anthropic Claude — `content: [{type:"text",...}, + * {type:"image", source:{type:"base64", media_type:"image/png", + * data:""}}, ...]`. + * - OpenAI — `content: [{type:"text",...}, + * {type:"image_url", image_url:{url:"data:image/png;base64, + * "}}, ...]`. + * + * Null = no vision parts; wire shape is byte-identical to pre-#2470. 
+ * Vision works on the FIRST user turn (most common case for "describe + * this image" prompts); subsequent user-turn images compose naturally + * if the model supports multi-turn vision. + * + * Non-user roles ignore this field — system / assistant / tool + * messages don't carry images in any provider's API. + */ + val images: List<ImagePart>? = null, ) +/** + * #2470 — base64-encoded image payload for vision input. The caller is + * responsible for the encoding so the adapter can splat the bytes onto + * the wire without re-encoding per provider. Wire MIME is closed via + * the [ImagePart.WireMime] sealed type — `String` mime is intentionally + * not accepted in the public ctor. + * + * Small, allocation-cheap. Equatability: `base64` is a `String`, so + * structural equals/hashCode work — unlike `ByteArray`, which uses + * identity equals (the trap we avoid by base64-encoding upfront). + */ +data class ImagePart( + /** Base64-encoded image bytes, no `data:` URL prefix. Adapter formats per-provider. */ + val base64: String, + /** Closed wire MIME — `image/png`, `image/jpeg`, `image/gif`, `image/webp`. */ + val wireMime: WireMime, +) { + sealed interface WireMime { + val value: String + + object Png : WireMime { override val value: String = "image/png" } + object Jpeg : WireMime { override val value: String = "image/jpeg" } + object Gif : WireMime { override val value: String = "image/gif" } + object Webp : WireMime { override val value: String = "image/webp" } + } +} + data class ToolCall( val name: String, val arguments: Map = emptyMap(), diff --git a/src/main/kotlin/agents_engine/model/OllamaClient.kt b/src/main/kotlin/agents_engine/model/OllamaClient.kt index 09b85be..77394f8 100644 --- a/src/main/kotlin/agents_engine/model/OllamaClient.kt +++ b/src/main/kotlin/agents_engine/model/OllamaClient.kt @@ -398,6 +398,19 @@ open class OllamaClient( }) append("]") } + // #2470 — vision input. 
Ollama's chat API accepts an `images` + // array of base64-encoded payloads (no data: prefix) on user + // messages. Vision-capable models (qwen3-vl, llama3.2-vision, + // llava) consume it; non-vision models ignore it without + // error. Mime is not on the wire — Ollama infers from the + // bytes; we keep the typed wireMime on ImagePart for audit + // + caller debugging only. + val images = msg.images + if (msg.role == "user" && !images.isNullOrEmpty()) { + append(""","images":[""") + append(images.joinToString(",") { it.base64.toJsonString() }) + append("]") + } append("}") } } diff --git a/src/main/kotlin/agents_engine/model/OpenAiClient.kt b/src/main/kotlin/agents_engine/model/OpenAiClient.kt index 7d2939e..9c8d75f 100644 --- a/src/main/kotlin/agents_engine/model/OpenAiClient.kt +++ b/src/main/kotlin/agents_engine/model/OpenAiClient.kt @@ -269,8 +269,29 @@ open class OpenAiClient( val messageObjects = messages.map { msg -> when (msg.role) { - "system", "user" -> - """{"role":${msg.role.toJsonString()},"content":${msg.content.toJsonString()}}""" + "system" -> + """{"role":"system","content":${msg.content.toJsonString()}}""" + "user" -> { + val images = msg.images + if (!images.isNullOrEmpty()) { + // #2470 — vision input. OpenAI Chat Completions + // accepts a content array of typed blocks; one text + // block + N image_url blocks. Images ride as data: + // URLs (data:;base64,). Works on + // gpt-4o, gpt-4o-mini, gpt-4-turbo, and the o* + // reasoning models. DeepSeek inherits this adapter; + // vision is silently ignored by non-vision DeepSeek + // models. 
+ val textBlock = """{"type":"text","text":${msg.content.toJsonString()}}""" + val imageBlocks = images.joinToString(",") { part -> + val dataUrl = "data:${part.wireMime.value};base64,${part.base64}" + """{"type":"image_url","image_url":{"url":${dataUrl.toJsonString()}}}""" + } + """{"role":"user","content":[$textBlock,$imageBlocks]}""" + } else { + """{"role":"user","content":${msg.content.toJsonString()}}""" + } + } "assistant" -> { val toolCallsJson = msg.toolCalls?.joinToString(",") { call -> diff --git a/src/test/kotlin/agents_engine/model/VisionFixtures.kt b/src/test/kotlin/agents_engine/model/VisionFixtures.kt new file mode 100644 index 0000000..c9fc444 --- /dev/null +++ b/src/test/kotlin/agents_engine/model/VisionFixtures.kt @@ -0,0 +1,108 @@ +package agents_engine.model + +import java.awt.BasicStroke +import java.awt.Color +import java.awt.RenderingHints +import java.awt.image.BufferedImage +import java.io.ByteArrayOutputStream +import java.util.Base64 +import javax.imageio.ImageIO + +/** + * Programmatic PNG fixtures for vision-input tests (#2470). Images are + * rendered from `BufferedImage` so the test set ships in source — no + * binary assets, no external download — and every byte is reproducible + * across machines / CI runs. + * + * Size discipline: 256×256 keeps base64 payloads ~5 KB, well within + * cheap-tier provider limits. Hand-tested with qwen3-vl:8b / Haiku 4.5 + * / gpt-4o-mini — all three identify both fixtures reliably at this + * size. + */ +object VisionFixtures { + + /** + * 256×256 PNG with three colored squares on a white background: + * red, blue, green — spaced far enough apart that even small + * vision models count them reliably. Used by the + * "agent counts squares" eval. 
+ */ + fun threeSquaresPng(): ByteArray { + val w = 256 + val h = 256 + val img = BufferedImage(w, h, BufferedImage.TYPE_INT_RGB) + val g = img.createGraphics() + g.setRenderingHint(RenderingHints.KEY_ANTIALIASING, RenderingHints.VALUE_ANTIALIAS_OFF) + g.color = Color.WHITE + g.fillRect(0, 0, w, h) + // Three obviously-distinct squares at well-separated positions. + val side = 50 + g.color = Color.RED + g.fillRect(20, 20, side, side) + g.color = Color.BLUE + g.fillRect(110, 100, side, side) + g.color = Color.GREEN + g.fillRect(190, 190, side, side) + // Black outlines so each square is unambiguously countable. + g.color = Color.BLACK + g.stroke = BasicStroke(2f) + g.drawRect(20, 20, side, side) + g.drawRect(110, 100, side, side) + g.drawRect(190, 190, side, side) + g.dispose() + return img.toPngBytes() + } + + /** + * 256×256 PNG of a simple house: triangular roof, rectangular body, door, + * two windows. Drawn with thick black outlines so even tiny vision + * models classify it. Used by the "agent identifies the house" eval. + */ + fun housePng(): ByteArray { + val w = 256 + val h = 256 + val img = BufferedImage(w, h, BufferedImage.TYPE_INT_RGB) + val g = img.createGraphics() + g.setRenderingHint(RenderingHints.KEY_ANTIALIASING, RenderingHints.VALUE_ANTIALIAS_ON) + g.color = Color.WHITE + g.fillRect(0, 0, w, h) + // Body: 130×110 rectangle, centred-ish. + val bodyX = 60 + val bodyY = 110 + val bodyW = 130 + val bodyH = 110 + g.color = Color(255, 230, 200) // warm beige walls + g.fillRect(bodyX, bodyY, bodyW, bodyH) + // Roof: triangle spanning slightly wider than the body, peak above. + val roofXs = intArrayOf(bodyX - 15, bodyX + bodyW / 2, bodyX + bodyW + 15) + val roofYs = intArrayOf(bodyY, bodyY - 80, bodyY) + g.color = Color(180, 70, 60) // terracotta roof + g.fillPolygon(roofXs, roofYs, 3) + // Door: small rectangle bottom-centre of the body. 
+ g.color = Color(120, 75, 40) // brown door + g.fillRect(bodyX + bodyW / 2 - 18, bodyY + bodyH - 60, 36, 60) + // Two windows: small squares left + right of the door. + g.color = Color(140, 180, 220) // blue windows + g.fillRect(bodyX + 15, bodyY + 20, 30, 30) + g.fillRect(bodyX + bodyW - 45, bodyY + 20, 30, 30) + // Outlines: 3-px black so the silhouette is unambiguous. + g.color = Color.BLACK + g.stroke = BasicStroke(3f) + g.drawRect(bodyX, bodyY, bodyW, bodyH) + g.drawPolygon(roofXs, roofYs, 3) + g.drawRect(bodyX + bodyW / 2 - 18, bodyY + bodyH - 60, 36, 60) + g.drawRect(bodyX + 15, bodyY + 20, 30, 30) + g.drawRect(bodyX + bodyW - 45, bodyY + 20, 30, 30) + g.dispose() + return img.toPngBytes() + } + + /** Encode the bytes to base64 — what every adapter ultimately sends on the wire. */ + fun toBase64(bytes: ByteArray): String = Base64.getEncoder().encodeToString(bytes) + + private fun BufferedImage.toPngBytes(): ByteArray { + val out = ByteArrayOutputStream() + ImageIO.write(this, "png", out) + return out.toByteArray() + } +} diff --git a/src/test/kotlin/agents_engine/model/VisionLiveTest.kt b/src/test/kotlin/agents_engine/model/VisionLiveTest.kt new file mode 100644 index 0000000..5b4ccba --- /dev/null +++ b/src/test/kotlin/agents_engine/model/VisionLiveTest.kt @@ -0,0 +1,223 @@ +package agents_engine.model + +import org.junit.jupiter.api.Assumptions.assumeTrue +import org.junit.jupiter.api.Tag +import org.junit.jupiter.api.Test +import java.io.File +import java.net.URI +import java.net.http.HttpClient +import java.net.http.HttpRequest +import java.net.http.HttpResponse +import java.time.Duration +import kotlin.test.assertNotNull +import kotlin.test.assertTrue + +/** + * #2470 — live vision-input integration tests across the three + * vision-capable providers. Each test sends a programmatically-generated PNG (a + * 3-coloured-square image OR a simple house) to a vision-capable + * model and checks the response. 
+ * + * **Cost discipline:** + * - 256×256 PNGs (~5 KB base64) — small payloads, fast roundtrip. + * - `temperature = 0`, `maxTokens = 80` — short deterministic replies. + * - Single-turn — no tool calls, no follow-ups. + * - Each provider's tag is skipped cleanly when the API key / model + * isn't reachable (assumeTrue gate). + * + * **Tags:** + * - `live-llm` — Ollama (local + cloud-via-Ollama-Cloud). Excluded + * from default `./gradlew test`; runs via `./gradlew integrationTest`. + * - `live-cloud-api` — direct cloud APIs (Anthropic, OpenAI, DeepSeek). + * In default `./gradlew test`; `assumeTrue` skips when no key. + * + * **DeepSeek:** most DeepSeek models don't have vision today. The + * adapter inherits OpenAI's image-content shape; the field passes + * through and the model silently ignores it. We test the shape via + * the unit tests (`VisionWireFormatTest`) — no live call here to save + * cost on what is effectively a no-op. + * + * **Assertion shape:** loose — every cheap vision model has some + * variance in phrasing. The test passes if the answer mentions any of + * a small set of acceptable keywords. Goal is "did the image reach + * the model and elicit a sensible reply", not "did the model phrase + * it exactly this way." 
+ */ +class VisionLiveTest { + + private val ollamaVisionModel: String = + System.getenv("AGENTSKT_TEST_OLLAMA_VISION_MODEL") ?: "qwen3-vl:8b" + private val claudeVisionModel: String = + System.getenv("AGENTSKT_TEST_CLAUDE_VISION_MODEL") ?: "claude-haiku-4-5" + private val openaiVisionModel: String = + System.getenv("AGENTSKT_TEST_OPENAI_VISION_MODEL") ?: "gpt-4o-mini" + + // ─────────────────────────── Ollama (qwen3-vl) ────────────────────── + + @Tag("live-llm") + @Test + fun `Ollama qwen3-vl counts the three squares in a generated image`() { + assumeTrue(isOllamaReachable(), "skipping: no Ollama at localhost:11434") + val client = OllamaClient(model = ollamaVisionModel, temperature = 0.0) + val png = VisionFixtures.threeSquaresPng() + val response = client.chat( + listOf( + LlmMessage( + role = "user", + content = "How many colored squares are in this image? Answer with just the digit.", + images = listOf(ImagePart(VisionFixtures.toBase64(png), ImagePart.WireMime.Png)), + ), + ), + ) + val text = textOf(response) + println("[Ollama vision] squares → $text") + assertSquaresCountedAsThree(text, "Ollama($ollamaVisionModel)") + } + + @Tag("live-llm") + @Test + fun `Ollama qwen3-vl identifies the simple house drawing`() { + assumeTrue(isOllamaReachable(), "skipping: no Ollama at localhost:11434") + val client = OllamaClient(model = ollamaVisionModel, temperature = 0.0) + val png = VisionFixtures.housePng() + val response = client.chat( + listOf( + LlmMessage( + role = "user", + content = "What is depicted in this image? 
Answer in one short phrase.", + images = listOf(ImagePart(VisionFixtures.toBase64(png), ImagePart.WireMime.Png)), + ), + ), + ) + val text = textOf(response) + println("[Ollama vision] house → $text") + assertSeesHouse(text, "Ollama($ollamaVisionModel)") + } + + // ─────────────────────────── Anthropic Haiku ──────────────────────── + + @Tag("live-cloud-api") + @Test + fun `Claude Haiku counts the three squares`() { + val apiKey = loadKey("ANTHROPIC_API_KEY", ".secrets/anthropic-key") + assumeTrue(apiKey != null, "skipping: no Anthropic key") + val client = ClaudeClient(apiKey = apiKey!!, model = claudeVisionModel, temperature = 0.0, maxTokens = 80) + val png = VisionFixtures.threeSquaresPng() + val response = client.chat( + listOf( + LlmMessage( + role = "user", + content = "How many colored squares are in this image? Answer with just the digit.", + images = listOf(ImagePart(VisionFixtures.toBase64(png), ImagePart.WireMime.Png)), + ), + ), + ) + val text = textOf(response) + println("[Claude vision] squares → $text") + assertSquaresCountedAsThree(text, "Claude($claudeVisionModel)") + } + + @Tag("live-cloud-api") + @Test + fun `Claude Haiku identifies the simple house drawing`() { + val apiKey = loadKey("ANTHROPIC_API_KEY", ".secrets/anthropic-key") + assumeTrue(apiKey != null, "skipping: no Anthropic key") + val client = ClaudeClient(apiKey = apiKey!!, model = claudeVisionModel, temperature = 0.0, maxTokens = 80) + val png = VisionFixtures.housePng() + val response = client.chat( + listOf( + LlmMessage( + role = "user", + content = "What is depicted in this image? 
Answer in one short phrase.", + images = listOf(ImagePart(VisionFixtures.toBase64(png), ImagePart.WireMime.Png)), + ), + ), + ) + val text = textOf(response) + println("[Claude vision] house → $text") + assertSeesHouse(text, "Claude($claudeVisionModel)") + } + + // ─────────────────────────── OpenAI gpt-4o-mini ───────────────────── + + @Tag("live-cloud-api") + @Test + fun `OpenAI gpt-4o-mini counts the three squares`() { + val apiKey = loadKey("OPENAI_API_KEY", ".secrets/openai-key") + assumeTrue(apiKey != null, "skipping: no OpenAI key") + val client = OpenAiClient(apiKey = apiKey!!, model = openaiVisionModel, temperature = 0.0, maxTokens = 80) + val png = VisionFixtures.threeSquaresPng() + val response = client.chat( + listOf( + LlmMessage( + role = "user", + content = "How many colored squares are in this image? Answer with just the digit.", + images = listOf(ImagePart(VisionFixtures.toBase64(png), ImagePart.WireMime.Png)), + ), + ), + ) + val text = textOf(response) + println("[OpenAI vision] squares → $text") + assertSquaresCountedAsThree(text, "OpenAI($openaiVisionModel)") + } + + @Tag("live-cloud-api") + @Test + fun `OpenAI gpt-4o-mini identifies the simple house drawing`() { + val apiKey = loadKey("OPENAI_API_KEY", ".secrets/openai-key") + assumeTrue(apiKey != null, "skipping: no OpenAI key") + val client = OpenAiClient(apiKey = apiKey!!, model = openaiVisionModel, temperature = 0.0, maxTokens = 80) + val png = VisionFixtures.housePng() + val response = client.chat( + listOf( + LlmMessage( + role = "user", + content = "What is depicted in this image? 
Answer in one short phrase.", + images = listOf(ImagePart(VisionFixtures.toBase64(png), ImagePart.WireMime.Png)), + ), + ), + ) + val text = textOf(response) + println("[OpenAI vision] house → $text") + assertSeesHouse(text, "OpenAI($openaiVisionModel)") + } + + // ───────────────────────────── Helpers ────────────────────────────── + + private fun textOf(response: LlmResponse): String = when (response) { + is LlmResponse.Text -> response.content + is LlmResponse.ToolCalls -> error("vision call unexpectedly returned tool_calls: $response") + } + + private fun assertSquaresCountedAsThree(text: String, providerLabel: String) { + val lowered = text.lowercase() + val sees3 = "3" in text || "three" in lowered + assertTrue(sees3, "$providerLabel did not count three squares; got: $text") + } + + private fun assertSeesHouse(text: String, providerLabel: String) { + val lowered = text.lowercase() + val sees = listOf("house", "home", "cottage", "building", "cabin", "barn").any { it in lowered } + assertTrue(sees, "$providerLabel did not recognise the house drawing; got: $text") + } + + private fun loadKey(envVar: String, secretFile: String): String? 
{ + val envKey = System.getenv(envVar) + if (!envKey.isNullOrBlank()) return envKey + val file = File(secretFile) + return if (file.exists()) file.readText().trim().ifBlank { null } else null + } + + private fun isOllamaReachable(): Boolean = try { + val client = HttpClient.newBuilder().connectTimeout(Duration.ofMillis(500)).build() + val request = HttpRequest.newBuilder() + .uri(URI.create("http://localhost:11434/api/tags")) + .timeout(Duration.ofMillis(1500)) + .GET() + .build() + val response = client.send(request, HttpResponse.BodyHandlers.discarding()) + response.statusCode() in 200..299 + } catch (_: Throwable) { + false + } +} diff --git a/src/test/kotlin/agents_engine/model/VisionWireFormatTest.kt b/src/test/kotlin/agents_engine/model/VisionWireFormatTest.kt new file mode 100644 index 0000000..20bac8e --- /dev/null +++ b/src/test/kotlin/agents_engine/model/VisionWireFormatTest.kt @@ -0,0 +1,182 @@ +package agents_engine.model + +import agents_engine.generation.LenientJsonParser +import kotlin.test.Test +import kotlin.test.assertEquals +import kotlin.test.assertNotNull +import kotlin.test.assertNull +import kotlin.test.assertTrue + +/** + * #2470 — vision-input wire-format tests. Pins the per-provider JSON + * shape so an adapter regression surfaces at unit-test time, not at + * provider HTTP time. No network — uses each client's + * `buildRequestJson` and parses the result. + * + * Each provider's image payload is verified for: text-prompt + * preservation, image-block presence, base64 splatting, typed mime + * propagation. Pre-#2470 message shapes (no images) stay byte-identical. + */ +class VisionWireFormatTest { + + private val pngBytes = byteArrayOf(0x89.toByte(), 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A, 1, 2) + private val pngBase64 = java.util.Base64.getEncoder().encodeToString(pngBytes) + + private fun userMessage(text: String, images: List? 
= null) = + LlmMessage(role = "user", content = text, images = images) + + // ────────────────────────────── Ollama ────────────────────────────── + + @Test + fun `Ollama user message with images emits images array of base64 strings`() { + val client = OllamaClient(model = "qwen3-vl:8b") + val body = client.buildRequestJson(listOf( + userMessage("How many squares?", listOf(ImagePart(pngBase64, ImagePart.WireMime.Png))), + )) + @Suppress("UNCHECKED_CAST") + val parsed = LenientJsonParser.parse(body) as Map + @Suppress("UNCHECKED_CAST") + val messages = parsed["messages"] as List> + val userMsg = messages.single() + assertEquals("user", userMsg["role"]) + assertEquals("How many squares?", userMsg["content"], "text prompt preserved") + @Suppress("UNCHECKED_CAST") + val images = userMsg["images"] as List + assertEquals(1, images.size) + assertEquals(pngBase64, images[0], "base64 splatted verbatim, no data: prefix") + } + + @Test + fun `Ollama user message without images omits the images field (back-compat)`() { + val client = OllamaClient(model = "llama3.2") + val body = client.buildRequestJson(listOf(userMessage("plain text"))) + assertTrue("\"images\"" !in body, "no images field when not requested: $body") + } + + @Test + fun `Ollama non-user message with images on it does NOT emit images (system_assistant_tool ignore)`() { + val client = OllamaClient(model = "qwen3-vl:8b") + val body = client.buildRequestJson(listOf( + LlmMessage(role = "system", content = "be helpful", + images = listOf(ImagePart(pngBase64, ImagePart.WireMime.Png))), + )) + assertTrue("\"images\"" !in body, "non-user roles never carry images on the wire: $body") + } + + // ────────────────────────────── Anthropic ─────────────────────────── + + @Test + fun `Claude user message with images emits content array with text plus image blocks`() { + val client = ClaudeClient(apiKey = "test", model = "claude-haiku-4-5") + val body = client.buildRequestJson(listOf( + userMessage("Identify this image.", 
listOf(ImagePart(pngBase64, ImagePart.WireMime.Png))), + )) + @Suppress("UNCHECKED_CAST") + val parsed = LenientJsonParser.parse(body) as Map + @Suppress("UNCHECKED_CAST") + val messages = parsed["messages"] as List> + @Suppress("UNCHECKED_CAST") + val content = messages.single()["content"] as List> + assertEquals(2, content.size, "one text block + one image block") + assertEquals("text", content[0]["type"]) + assertEquals("Identify this image.", content[0]["text"]) + assertEquals("image", content[1]["type"]) + @Suppress("UNCHECKED_CAST") + val source = content[1]["source"] as Map + assertEquals("base64", source["type"]) + assertEquals("image/png", source["media_type"], "typed mime → wire media_type") + assertEquals(pngBase64, source["data"]) + } + + @Test + fun `Claude user message without images stays on the legacy string-content path`() { + val client = ClaudeClient(apiKey = "test", model = "claude-haiku-4-5") + val body = client.buildRequestJson(listOf(userMessage("just text please"))) + @Suppress("UNCHECKED_CAST") + val parsed = LenientJsonParser.parse(body) as Map + @Suppress("UNCHECKED_CAST") + val messages = parsed["messages"] as List> + // Legacy shape: content is a string, not an array. 
+ assertEquals("just text please", messages.single()["content"], "back-compat: string content when no images") + } + + // ────────────────────────────── OpenAI ────────────────────────────── + + @Test + fun `OpenAI user message with images emits content array with text plus image_url blocks`() { + val client = OpenAiClient(apiKey = "test", model = "gpt-4o-mini") + val body = client.buildRequestJson(listOf( + userMessage("What is in this picture?", listOf(ImagePart(pngBase64, ImagePart.WireMime.Png))), + )) + @Suppress("UNCHECKED_CAST") + val parsed = LenientJsonParser.parse(body) as Map + @Suppress("UNCHECKED_CAST") + val messages = parsed["messages"] as List> + @Suppress("UNCHECKED_CAST") + val content = messages.single()["content"] as List> + assertEquals(2, content.size) + assertEquals("text", content[0]["type"]) + assertEquals("What is in this picture?", content[0]["text"]) + assertEquals("image_url", content[1]["type"]) + @Suppress("UNCHECKED_CAST") + val imageUrl = content[1]["image_url"] as Map + val url = imageUrl["url"] as String + assertTrue(url.startsWith("data:image/png;base64,"), "data-URL with typed mime: $url") + assertTrue(url.endsWith(pngBase64), "base64 splatted verbatim at the end: $url") + } + + @Test + fun `OpenAI user message without images stays on the legacy string-content path`() { + val client = OpenAiClient(apiKey = "test", model = "gpt-4o-mini") + val body = client.buildRequestJson(listOf(userMessage("just text"))) + @Suppress("UNCHECKED_CAST") + val parsed = LenientJsonParser.parse(body) as Map + @Suppress("UNCHECKED_CAST") + val messages = parsed["messages"] as List> + assertEquals("just text", messages.single()["content"], "back-compat: string content when no images") + } + + @Test + fun `OpenAI multiple images each get their own image_url block`() { + val client = OpenAiClient(apiKey = "test", model = "gpt-4o-mini") + val body = client.buildRequestJson(listOf( + userMessage("Compare these.", listOf( + ImagePart(pngBase64, 
ImagePart.WireMime.Png), + ImagePart(pngBase64, ImagePart.WireMime.Jpeg), + )), + )) + @Suppress("UNCHECKED_CAST") + val parsed = LenientJsonParser.parse(body) as Map + @Suppress("UNCHECKED_CAST") + val messages = parsed["messages"] as List> + @Suppress("UNCHECKED_CAST") + val content = messages.single()["content"] as List> + assertEquals(3, content.size, "text + 2 images") + @Suppress("UNCHECKED_CAST") + val u1 = (content[1]["image_url"] as Map)["url"] as String + @Suppress("UNCHECKED_CAST") + val u2 = (content[2]["image_url"] as Map)["url"] as String + assertTrue(u1.startsWith("data:image/png;base64,")) + assertTrue(u2.startsWith("data:image/jpeg;base64,"), "second image's wireMime propagates") + } + + // ────────────────────────── Fixtures sanity ───────────────────────── + + @Test + fun `VisionFixtures threeSquaresPng renders a valid PNG of reasonable size`() { + val bytes = VisionFixtures.threeSquaresPng() + // PNG magic: 89 50 4E 47 0D 0A 1A 0A + assertEquals(0x89.toByte(), bytes[0]) + assertEquals(0x50.toByte(), bytes[1]) + assertEquals(0x4E.toByte(), bytes[2]) + assertEquals(0x47.toByte(), bytes[3]) + assertTrue(bytes.size > 100 && bytes.size < 50_000, "small enough for cheap vision calls; got ${bytes.size}") + } + + @Test + fun `VisionFixtures housePng renders a valid PNG`() { + val bytes = VisionFixtures.housePng() + assertEquals(0x89.toByte(), bytes[0]) + assertTrue(bytes.size in 100..50_000) + } +} From cea17c0d849ff0c5c77d430fd2b4212c3ca55560 Mon Sep 17 00:00:00 2001 From: skobeltsyn Date: Sat, 30 May 2026 13:01:11 +0300 Subject: [PATCH 2/2] =?UTF-8?q?docs(#2470):=20vision=20input=20=E2=80=94?= =?UTF-8?q?=20multimodal.md=20extension,=20README,=20CHANGELOG?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - docs/multimodal.md — new "Vision input — talking to the model (#2470 slice a)" section between the existing foundation content and the "What's still coming" list. 
  Walks through the LlmMessage.images field, ImagePart shape,
  per-provider wire-format table, back-compat + role-gating guarantees,
  programmatic VisionFixtures, and the per-provider live test
  how-to-run. "What's coming" list updated to flag the #2470
  slice-a/slice-b split (this commit is slice a; the Content →
  LlmMessage.images loop translation is slice b).
- README.md — new "Vision input to models" bullet right after the
  multimodal foundation bullet. Names all four providers and their
  default test models with the cost-discipline notes.
- CHANGELOG.md `## [Unreleased]` — new "Added — Vision input across all
  providers (#2470 slice a)" section ABOVE the existing multimodal
  foundation section. Covers LlmMessage.images + ImagePart,
  per-provider adapter rows, role-gating, fixtures, live test setup +
  cost discipline, wire-format unit-test count.

No source changes. Full suite stays at 1794 / 0 failures from the prior
commit.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 CHANGELOG.md       | 13 ++++++++++
 README.md          |  3 ++-
 docs/multimodal.md | 65 ++++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 78 insertions(+), 3 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 03af14a..e7d3dc9 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -4,6 +4,19 @@ All notable changes to Agents.KT are documented here. The format follows [Keep a
 
 ## [Unreleased]
 
+### Added — Vision input across all providers (#2470 slice a)
+
+- **`LlmMessage.images: List<ImagePart>? = null`** — new optional field; back-compat default leaves the wire shape byte-identical to pre-#2470 for callers that don't pass images. Closed `ImagePart(base64, wireMime)` with `WireMime` sealed type (`Png`, `Jpeg`, `Gif`, `Webp`) — `String` mime is intentionally not accepted in the public ctor.
+- **Per-provider adapters** translate vision on `role = "user"` messages:
+  - Ollama: `{role:"user", content:"text", images:["<base64>", ...]}` — works with `qwen3-vl:8b`, `llava`, `llama3.2-vision`, etc. Non-vision models silently ignore the field.
+  - Claude: typed content array — `[{type:"text"}, {type:"image", source:{type:"base64", media_type:"image/png", data:"<base64>"}}, ...]`. Works with all Claude vision-capable models (Haiku 4.5, Sonnet 4.6, Opus 4.7).
+  - OpenAI: typed content array — `[{type:"text"}, {type:"image_url", image_url:{url:"data:image/png;base64,<base64>"}}, ...]`. Works with gpt-4o, gpt-4o-mini, gpt-4-turbo, the o* reasoning models.
+  - DeepSeek: inherits the OpenAI adapter shape; current DeepSeek models lack vision and silently ignore the field. Shape-tested; no live call to avoid spending on a no-op.
+- **Role-gated:** non-user messages (system/assistant/tool) with non-null `images` ignore the field on the wire — no provider's API accepts images on those roles. Pinned by tests.
+- **Programmatic fixtures** in `src/test`: `VisionFixtures.threeSquaresPng()` (256×256 red/blue/green squares for "count the squares" eval) and `VisionFixtures.housePng()` (256×256 cartoon house for "what is this?" eval). Rendered via `BufferedImage` + `ImageIO` — reproducible byte-for-byte across machines and CI, no external assets in the repo.
+- **Live integration tests** (`VisionLiveTest`) cover all three vision-capable providers with cost discipline (`temperature = 0`, `maxTokens = 80`, single-turn, ~5KB base64 payloads): Ollama `qwen3-vl:8b` (tagged `live-llm`, runs via `:integrationTest`), Claude `claude-haiku-4-5` and OpenAI `gpt-4o-mini` (tagged `live-cloud-api`, runs in default `:test` with `assumeTrue` skipping when no key). Model names overridable via env. Assertion shape is loose keyword-match — robust against per-model phrasing variance.
+- 8 wire-format unit tests pin per-provider JSON shape + the no-images back-compat path. See [docs/multimodal.md](docs/multimodal.md#vision-input--talking-to-the-model-2470-slice-a).
+
 ### Added — Multimodal foundation (#2465 epic, Stage 1)
 
 - **Typed `Content` hierarchy (#2466)** — `sealed interface Content` with variants `Text`, `Image`, `Audio`, `Video`, `Document` in package `agents_engine.content`. Each non-text variant carries a `ContentRef` plus a typed mime (`ImageMime`, `AudioMime`, `VideoMime`, `DocMime`). Mime types are closed sealed interfaces with `wireMime: String` accessors — no `String` mime in any public API. Extension property `Content.modality: String` is the audit-stable per-variant name. Stage 1 wires Image + Document end-to-end (the modalities the 0.8 spec → product loop consumes); Audio + Video are modelled now and exercised through provider adapters in Stage 2 (#2470, deferred).
diff --git a/README.md b/README.md
index e0eea6a..a7ca4c3 100644
--- a/README.md
+++ b/README.md
@@ -155,7 +155,8 @@ These APIs work in `main`, are unit-tested, and are exercised by integration tes
 - **Budget controls** — `budget { maxTurns; maxToolCalls; maxDuration; perToolTimeout; maxTokens; maxConsecutiveSameTool }` (`perToolTimeout` covers regular and session-aware tools; token counts cumulative across turns when the provider reports usage; `maxConsecutiveSameTool` catches LLM retry loops on a broken tool) (#637, #963, #969, #1903). `onBudgetExceeded { reason, currentLimit -> BudgetDecision.Extend(newLimit) }` raises a cap and continues instead of throwing — a long-running agent can grant itself more tool calls mid-run rather than failing (#2412). `BudgetDecision.Checkpoint` (#2749) is the third variant — pause at the cap, deliver a `SessionSnapshot` via the registered `onTurnCheckpoint` hook, throw a recoverable `BudgetCheckpointException`, and resume later via `agent.invokeSuspendResuming(input, resumeFrom = snapshot)` once the human approves a raise (no history replay).
 - **Public snapshot / resume** — `agent.invokeSuspendResuming(input, resumeFrom = null, onTurnCheckpoint = null)` (#2749) is the public seam over the internal `executeAgentic(resumeFrom, onTurnCheckpoint)` primitives from #2416. With defaults it matches `invokeSuspend(input)` byte-for-byte; with `onTurnCheckpoint` set it captures a `SessionSnapshot` at every turn boundary; with `resumeFrom = snapshot` it continues an in-flight invocation without replaying history. On resume the loop honors `max(snapshot.toolCallLimit, agent.budget.maxToolCalls)` so a rebuilt agent with a raised cap actually picks it up.
 - **Eval harness** — `DeterministicModelClient(LlmResponse.Text("..."), LlmResponse.ToolCalls(...))` (#2492) scripts model responses for reproducible eval without a live provider; the streaming flow folds into the same Started → ArgsDelta → Finished → End chunk sequence a native streaming provider would emit. Typed assertion DSL `eval("name") { input(...); expect { ... }; expectSnapshot(...) }` (#2493) runs against the parsed `OUT` — not regex on the wire. Snapshot mode pins `toLlmInput(output)` JSON for structural diffs; `evalSuite { + case; + case }` bundles cases. Optional `judge("tone", rubric)` (#2494) runs an advisory LLM-as-judge scorer with a typed `@Generable` `JudgeVerdict` — explicitly separate from the deterministic pass/fail contract (judges never gate). See [docs/eval.md](docs/eval.md).
-- **Multimodal foundation** — `sealed Content { Text, Image(ref, ImageMime), Audio, Video, Document(ref, DocMime) }` (#2466) with closed mime types per modality (no `String` mime). Content-addressed `ContentRef(hash, sizeBytes, wireMime)` + `BlobStore` interface, `InMemoryBlobStore` / `FileBlobStore` impls — SHA-256 keys match the manifest-hash family, atomic tmp+rename, process-restart-safe, idempotent put (#2467). Tools can return `ToolResult(parts: List<Content>)` for mixed text + image + document outputs; JSONL audit exporter records `outputParts` per-part summary (`:::`) with no blob bytes in the audit row (#2469). Stage 1 wires Image + Document end-to-end; Audio + Video modelled now and exercised through provider adapters in Stage 2 (#2470, deferred). See [docs/multimodal.md](docs/multimodal.md).
+- **Multimodal foundation** — `sealed Content { Text, Image(ref, ImageMime), Audio, Video, Document(ref, DocMime) }` (#2466) with closed mime types per modality (no `String` mime). Content-addressed `ContentRef(hash, sizeBytes, wireMime)` + `BlobStore` interface, `InMemoryBlobStore` / `FileBlobStore` impls — SHA-256 keys match the manifest-hash family, atomic tmp+rename, process-restart-safe, idempotent put (#2467). Tools can return `ToolResult(parts: List<Content>)` for mixed text + image + document outputs; JSONL audit exporter records `outputParts` per-part summary (`:::`) with no blob bytes in the audit row (#2469). Stage 1 wires Image + Document end-to-end. See [docs/multimodal.md](docs/multimodal.md).
+- **Vision input to models** — `LlmMessage(role = "user", content = "...", images = listOf(ImagePart(base64, ImagePart.WireMime.Png)))` (#2470 slice a) reaches all four built-in adapters: Ollama emits `images: [...]`, Claude emits `{type:"image", source:{type:"base64",...}}` content blocks, OpenAI emits `{type:"image_url", image_url:{url:"data:..."}}` content blocks, DeepSeek inherits OpenAI (silently ignored on non-vision models). Closed `ImagePart.WireMime { Png, Jpeg, Gif, Webp }` — no `String` mime. Programmatic `VisionFixtures.threeSquaresPng()` / `housePng()` (256×256, `BufferedImage`-rendered, ~5KB) + per-provider live tests (qwen3-vl:8b / Haiku 4.5 / gpt-4o-mini) with cost discipline. See [docs/multimodal.md](docs/multimodal.md#vision-input--talking-to-the-model-2470-slice-a).
 - **Prompt caching across providers** — `agent { caching { enabled = true; cacheSystemPrompt = true; cacheToolDefs = true; cacheConversation = Rolling; ttl = 1.hours; cacheable("doc-id") { ... } } }`. Vendor-neutral DSL drives Anthropic's explicit `cache_control` breakpoints (#2658), OpenAI / DeepSeek automatic prefix caching with a stable `prompt_cache_key` routing hint (#2659 / #2661), Ollama / vLLM / SGLang engine-level KV-cache reuse (no-op hints, #2662), and surfaces cache reads + writes + hit-rate on `TokenUsage` (#2663). A prefix-stability guard (#2657) detects silent cache-busters — timestamps, UUIDs, non-deterministic ordering inside cacheable segments — and warns before you pay for a single non-cached run. Off by default; non-breaking. See [docs/caching.md](docs/caching.md).
 - **JSONL audit exporter** — `:agents-kt-observability` writes append-only, one-line-per-event audit rows with `requestId`, `sessionId`, `manifestHash`, agent/skill/tool ids, event type, provider, and model; raw arguments/results are omitted by default (#1914). See [docs/observability.md](docs/observability.md).
 - **ObservabilityBridge adapters** — `.observe(OtelBridge(tracer))` maps runtime events to OTel spans (#1908), `.observe(LangSmithBridge(apiKey, project))` maps the same events to LangSmith run trees (#1909), and `.observe(LangfuseBridge(publicKey, secretKey))` maps them to Langfuse traces, generations, spans, and events (#1910), while keeping core vendor-free. See [docs/observability.md](docs/observability.md).
diff --git a/docs/multimodal.md b/docs/multimodal.md
index f46ba66..ee10a41 100644
--- a/docs/multimodal.md
+++ b/docs/multimodal.md
@@ -143,10 +143,71 @@ The same discipline applies (when wired) to the OTel / LangSmith / Langfuse brid
 
 Pairs with the #2754 manifest-hash restore guard: resume across an agent rebuild that changed tools (including the `BlobStore` wiring) fails closed unless the caller opts in.
 
-## What's coming (the rest of #2465)
+## Vision input — talking to the model (#2470 slice a)
+
+The foundation above answers "how do tools return mixed content?" The follow-on slice answers "how does the agent send an image to a vision-capable model?"
+
+`LlmMessage` gains an optional `images: List<ImagePart>?` field. Adapters translate it per provider:
+
+```kotlin
+import agents_engine.model.LlmMessage
+import agents_engine.model.ImagePart
+
+val png: ByteArray = Files.readAllBytes(Path.of("screenshot.png"))
+val client = OllamaClient(model = "qwen3-vl:8b", temperature = 0.0)
+
+client.chat(listOf(
+    LlmMessage(
+        role = "user",
+        content = "How many windows in this picture?",
+        images = listOf(ImagePart(
+            base64 = Base64.getEncoder().encodeToString(png),
+            wireMime = ImagePart.WireMime.Png,
+        )),
+    ),
+))
+```
+
+`ImagePart` carries the base64 payload + a closed `WireMime` sealed type (`Png`, `Jpeg`, `Gif`, `Webp`). `String` mime is not accepted in the public ctor — same closed-mime discipline as `Content.Image`. Caller base64-encodes upfront so the adapter can splat the payload straight onto the wire without re-encoding per provider.
+
+| Provider | User-message shape |
+|---|---|
+| Ollama | `{role:"user", content:"text", images:["<base64>", ...]}` |
+| Claude | `{role:"user", content:[{type:"text"}, {type:"image", source:{type:"base64", media_type:"image/png", data:"<base64>"}}, ...]}` |
+| OpenAI | `{role:"user", content:[{type:"text"}, {type:"image_url", image_url:{url:"data:image/png;base64,<base64>"}}, ...]}` |
+| DeepSeek | inherits OpenAI; current DeepSeek models lack vision and silently ignore the field |
+
+**Back-compat:** `images = null` (or empty) produces byte-identical wire shape to pre-#2470. Pinned by dedicated tests.
+
+**Role-gating:** images are emitted only on `role = "user"` messages. System / assistant / tool messages with non-null `images` ignore the field on the wire — no provider's API carries images on those roles.
+
+### Programmatic image fixtures
+
+`VisionFixtures` (in `src/test`) ships two reproducible PNGs rendered via `BufferedImage`:
+
+- `threeSquaresPng()` — 256×256, three colored squares (red/blue/green) on white with thick black outlines. Used for "count the squares" eval against cheap vision models.
+- `housePng()` — 256×256 cartoon house: triangle roof + body + door + two windows. Used for "what is this?" eval.
+
+No external assets, no binary blobs in the repo — every byte is reproducible across machines and CI runs.
+
+### Live tests
+
+`VisionLiveTest.kt` runs the two fixtures against all three vision-capable providers, with cost discipline (256×256 PNG ~5KB, `temperature = 0`, `maxTokens = 80`, single-turn):
+
+| Provider | Default model | How to run |
+|---|---|---|
+| Ollama | `qwen3-vl:8b` | `./gradlew integrationTest --tests "*VisionLiveTest*"` (tagged `live-llm`) |
+| Claude | `claude-haiku-4-5` | `./gradlew test --tests "*VisionLiveTest*"` (tagged `live-cloud-api`; `assumeTrue` skips when no key) |
+| OpenAI | `gpt-4o-mini` | same |
+
+Models overridable via env (`AGENTSKT_TEST_OLLAMA_VISION_MODEL`, `AGENTSKT_TEST_CLAUDE_VISION_MODEL`, `AGENTSKT_TEST_OPENAI_VISION_MODEL`).
+
+Assertion shape is loose: the test passes if the model's text response mentions one of a small acceptable keyword set (`3` / `three` for counting; `house` / `home` / `cottage` / `building` / `cabin` / `barn` for the house). Goal is "did the image reach the model and elicit a sensible reply" — not exact phrasing.
+
+## What's still coming (rest of #2465)
 
 - **#2468** Compile-time modality routing — `Agent` becomes a real type; cross-modality miswiring is a compile error. Multi-part `@Generable` inputs via KSP.
-- **#2470** Provider adapters — Claude vision, OpenAI vision, Gemini, Ollama multimodal. Translates `Content → provider-specific payload` at the wire.
+- **#2470 (slice b)** `Content` → `LlmMessage.images` translation at the agentic loop — currently the caller dereferences `ContentRef` → bytes → `ImagePart` manually. Sliced this way to land the wire format first; the loop hook is a small follow-up.
 - **#2471** Manifest-anchored modality capability — declared per-agent modalities recorded in the permission manifest, validated against provider capabilities at build time.
 - **#2472** Multimodal memory — `MemoryBank` entries carry `ContentRef` for image/audio/video state.
 - **#2473** Testing fixtures + snapshot + mutation coverage.
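
Until slice b lands, the manual bytes → `ImagePart` step mentioned in the list above could look roughly like the sketch below. It is a self-contained illustration under stated assumptions, not the library's code: `SketchMime`, `SketchImagePart`, `sniffMime`, `toImagePart` and `toDataUrl` are hypothetical names introduced here, the real `ImagePart.WireMime` is a sealed type rather than an enum, and the blob lookup is assumed to have already produced a `ByteArray`.

```kotlin
import java.util.Base64

// Hypothetical stand-ins for illustration only; the real ImagePart lives in the library.
enum class SketchMime(val wire: String) {
    Png("image/png"), Jpeg("image/jpeg"), Gif("image/gif"), Webp("image/webp")
}

data class SketchImagePart(val base64: String, val mime: SketchMime)

// True when `bytes` contains the unsigned byte signature `sig` starting at `offset`.
private fun hasSignature(bytes: ByteArray, sig: IntArray, offset: Int = 0): Boolean =
    bytes.size >= offset + sig.size &&
        sig.indices.all { (bytes[offset + it].toInt() and 0xFF) == sig[it] }

// Guess the mime from leading magic bytes; null when the format is not recognised.
fun sniffMime(bytes: ByteArray): SketchMime? = when {
    hasSignature(bytes, intArrayOf(0x89, 0x50, 0x4E, 0x47)) -> SketchMime.Png
    hasSignature(bytes, intArrayOf(0xFF, 0xD8, 0xFF)) -> SketchMime.Jpeg
    hasSignature(bytes, intArrayOf(0x47, 0x49, 0x46, 0x38)) -> SketchMime.Gif
    // WebP: "RIFF" at offset 0 and "WEBP" at offset 8.
    hasSignature(bytes, intArrayOf(0x52, 0x49, 0x46, 0x46)) &&
        hasSignature(bytes, intArrayOf(0x57, 0x45, 0x42, 0x50), offset = 8) -> SketchMime.Webp
    else -> null
}

// Base64-encode once, up front, so every adapter can reuse the same payload.
fun toImagePart(bytes: ByteArray): SketchImagePart {
    val mime = requireNotNull(sniffMime(bytes)) { "unsupported image format" }
    return SketchImagePart(Base64.getEncoder().encodeToString(bytes), mime)
}

// OpenAI-style data URL, matching the wire-shape table in the vision-input section.
fun toDataUrl(part: SketchImagePart): String = "data:${part.mime.wire};base64,${part.base64}"
```

With the real types, the equivalent call site would presumably build `ImagePart(base64, ImagePart.WireMime.Png)` directly; sniffing magic bytes is only one way to pick the mime and is an assumption of this sketch, since a `ContentRef` already carries a `wireMime`.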