From 273a6cb672bf3af3a5fc3d3aa0e4b1432e3921ab Mon Sep 17 00:00:00 2001
From: skobeltsyn <Konstantin@skobeltsyn.com>
Date: Sat, 30 May 2026 12:57:33 +0300
Subject: [PATCH 1/2] feat(#2470): vision input across all providers + live
 integration tests
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

#2470 (slice a) — vision-input path for the four built-in adapters,
with programmatic image fixtures and per-provider live tests. Sibling
work (`Content` → `LlmMessage` translation, multipart `@Generable`
input via KSP, manifest-anchored capability validation) is the rest
of #2470 / #2468 / #2471, layered on top of this.

```kotlin
val png = VisionFixtures.threeSquaresPng()
val client = OllamaClient(model = "qwen3-vl:8b", temperature = 0.0)
client.chat(listOf(
    LlmMessage(
        role = "user",
        content = "How many squares?",
        images = listOf(ImagePart(VisionFixtures.toBase64(png), ImagePart.WireMime.Png)),
    ),
))
// → LlmResponse.Text("3")
```

Implementation:

- `LlmMessage.images: List<ImagePart>? = null` — optional, back-compat
  default. Adapters translate to the per-provider wire shape when the
  list is non-empty AND role is "user"; otherwise zero diff vs pre-#2470.
- `ImagePart(base64: String, wireMime: ImagePart.WireMime)` — closed
  WireMime sealed type (Png / Jpeg / Gif / Webp). String mime is not
  accepted in the public ctor. Base64 stored as `String` so structural
  equals/hashCode work (the `ByteArray` data-class trap, avoided).
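
The `ByteArray` data-class trap can be shown with a minimal sketch.
`BytesImage` and `Base64Image` are hypothetical stand-ins for
illustration, not types from this patch:

```kotlin
import java.util.Base64

// Hypothetical stand-ins for illustration only; neither type exists in this patch.
data class BytesImage(val bytes: ByteArray)  // generated equals() compares array references
data class Base64Image(val base64: String)   // generated equals() compares string contents

fun encode(bytes: ByteArray): String = Base64.getEncoder().encodeToString(bytes)

// Two instances built from identical bytes: not equal, because arrays use identity equality.
fun bytesImagesEqual(): Boolean =
    BytesImage(byteArrayOf(1, 2, 3)) == BytesImage(byteArrayOf(1, 2, 3))

// Two instances built from identical base64 text: equal, because String equality is structural.
fun base64ImagesEqual(): Boolean =
    Base64Image(encode(byteArrayOf(1, 2, 3))) == Base64Image(encode(byteArrayOf(1, 2, 3)))
```

Under these assumptions `bytesImagesEqual()` is false and
`base64ImagesEqual()` is true, which is why encoding upfront keeps
`ImagePart` usable in test assertions and as a map key.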

Per-provider wire shapes (pinned by VisionWireFormatTest):

| Provider  | User-message shape                                    |
|-----------|-------------------------------------------------------|
| Ollama    | `{role:"user", content:"text", images:["<b64>", ...]}` |
| Claude    | `{role:"user", content:[{type:"text"}, {type:"image", source:{type:"base64", media_type:"image/png", data:"<b64>"}}, ...]}` |
| OpenAI    | `{role:"user", content:[{type:"text"}, {type:"image_url", image_url:{url:"data:image/png;base64,<b64>"}}, ...]}` |
| DeepSeek  | inherits OpenAI; most DeepSeek models lack vision and silently ignore the field (shape-tested, no live call) |
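
As a rough illustration of the three shapes in the table, the sketch
below builds them with plain string templates. The helper names
(`quote`, `ollamaUserMessage`, `claudeImageBlock`, `openAiImageBlock`)
are assumptions made for this example; the adapters use their own
`toJsonString()` helper, and `quote` here escapes only quotes,
backslashes and newlines.

```kotlin
// Hypothetical, simplified JSON string quoting (not the project's toJsonString()).
fun quote(s: String): String = buildString {
    append('"')
    for (c in s) when (c) {
        '"' -> append("\\\"")
        '\\' -> append("\\\\")
        '\n' -> append("\\n")
        else -> append(c)
    }
    append('"')
}

// Ollama: bare base64 strings in an `images` array next to the text content.
fun ollamaUserMessage(text: String, b64: List<String>): String {
    val images = b64.joinToString(",") { quote(it) }
    return """{"role":"user","content":${quote(text)},"images":[$images]}"""
}

// Claude: an image block with a typed base64 source.
fun claudeImageBlock(mime: String, b64: String): String =
    """{"type":"image","source":{"type":"base64","media_type":${quote(mime)},"data":${quote(b64)}}}"""

// OpenAI: an image_url block whose URL is a data: URL.
fun openAiImageBlock(mime: String, b64: String): String {
    val dataUrl = "data:$mime;base64,$b64"
    return """{"type":"image_url","image_url":{"url":${quote(dataUrl)}}}"""
}
```

Only Ollama drops the MIME type from the wire; the other two carry it
either as `media_type` or inside the data URL.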

Each adapter's vision path is gated:
- role must be "user" — system/assistant/tool messages with non-null
  `images` ignore the field on the wire (no provider's API carries
  images on those roles).
- `images = null` or empty → exact pre-#2470 wire shape (back-compat
  pinned by dedicated tests).
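
The gate described above amounts to a small predicate. A minimal
sketch, using a hypothetical `Msg` type in place of `LlmMessage` and a
hypothetical `imagesForWire` helper (the adapters inline this check
rather than calling a shared function):

```kotlin
// Hypothetical reduced message type for illustration; the patch uses LlmMessage.
data class Msg(val role: String, val images: List<String>? = null)

// Returns the images that should reach the wire: only for user messages,
// and only when the list is present and non-empty.
fun imagesForWire(msg: Msg): List<String> =
    if (msg.role == "user") msg.images.orEmpty() else emptyList()

// True when the adapter should switch to the vision wire shape.
fun usesVisionPath(msg: Msg): Boolean = imagesForWire(msg).isNotEmpty()
```

With this shape, `null` and an empty list are indistinguishable on the
wire, which is what keeps the pre-#2470 output unchanged.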

`VisionFixtures` (test source set): 256×256 PNGs generated via
`BufferedImage` + `ImageIO`. Two fixtures —
`threeSquaresPng()` (red/blue/green squares, well-separated, thick
black outlines so counting is unambiguous) and `housePng()` (triangle
roof + body + door + two windows, terracotta + beige colour scheme).
Reproducible byte-for-byte; ships in source, no external assets.
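
A stripped-down sketch of the same fixture technique, together with the
magic-byte sanity check the wire-format tests rely on. `solidSquarePng`
and `hasPngMagic` are hypothetical names for this example, not part of
`VisionFixtures`:

```kotlin
import java.awt.Color
import java.awt.image.BufferedImage
import java.io.ByteArrayOutputStream
import javax.imageio.ImageIO

// Hypothetical mini-fixture: a red square on a white background, encoded as PNG.
fun solidSquarePng(size: Int = 64): ByteArray {
    val img = BufferedImage(size, size, BufferedImage.TYPE_INT_RGB)
    val g = img.createGraphics()
    try {
        g.color = Color.WHITE
        g.fillRect(0, 0, size, size)
        g.color = Color.RED
        g.fillRect(size / 4, size / 4, size / 2, size / 2)
    } finally {
        g.dispose() // release the graphics context even if drawing throws
    }
    val out = ByteArrayOutputStream()
    ImageIO.write(img, "png", out)
    return out.toByteArray()
}

// The fixed 8-byte signature every PNG file starts with.
val PNG_MAGIC = byteArrayOf(0x89.toByte(), 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A)

fun hasPngMagic(bytes: ByteArray): Boolean =
    bytes.size >= PNG_MAGIC.size &&
        bytes.copyOfRange(0, PNG_MAGIC.size).contentEquals(PNG_MAGIC)
```

`BufferedImage` and `ImageIO` do not need a display, so this style of
fixture should also run on headless CI machines.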

Tests:

- VisionWireFormatTest.kt (8 cases): per-provider wire shape for both
  the vision path and the no-images back-compat path; multiple images
  in one message; non-user-role images filtered; PNG fixture sanity
  (magic bytes + reasonable size).
- VisionLiveTest.kt (6 cases): per-provider end-to-end against:
  * Ollama qwen3-vl:8b — tagged `live-llm`, runs via
    `./gradlew integrationTest`
  * Claude Haiku 4.5 — tagged `live-cloud-api`, runs in default
    `:test`, assumeTrue skips when no key
  * OpenAI gpt-4o-mini — same pattern
  Cost discipline per call: 256×256 PNG (~5KB), temperature=0,
  maxTokens=80, single-turn. Each test sends a fixture image with a
  short text prompt, parses the text response, asserts loose keyword
  match (3 / three for the squares; house / home / cottage / building
  / cabin / barn for the house). Model names overridable via env
  (`AGENTSKT_TEST_OLLAMA_VISION_MODEL`, etc.) for CI flexibility.
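
The loose keyword match and the env override can be sketched as below.
The function names are hypothetical, and the blank-value check in
`modelFromEnv` is an assumption of this sketch (the tests use a plain
`System.getenv(...) ?: default`):

```kotlin
// Accepts a digit or the word, in any letter case.
fun mentionsThree(reply: String): Boolean =
    "3" in reply || "three" in reply.lowercase()

// Accepts any of the synonyms listed in the commit message.
fun mentionsHouse(reply: String): Boolean {
    val lowered = reply.lowercase()
    return listOf("house", "home", "cottage", "building", "cabin", "barn").any { it in lowered }
}

// Resolves a model name from the environment, falling back to a default.
// `env` is injectable so the lookup can be exercised without real env vars.
fun modelFromEnv(
    name: String,
    default: String,
    env: Map<String, String> = System.getenv(),
): String = env[name]?.takeIf { it.isNotBlank() } ?: default
```

The substring match is deliberately permissive: a reply such as "13"
would also pass `mentionsThree`, which is acceptable when the goal is
only to confirm that the image reached the model.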

Full unit suite: 1794 tests, 0 failures.

To run the live vision tests:
- `./gradlew integrationTest --tests "*VisionLiveTest*"` — Ollama
  (requires `qwen3-vl:8b` pulled locally or available via Ollama Cloud)
- `./gradlew test --tests "*VisionLiveTest*"` — Claude + OpenAI (run
  in default :test under live-cloud-api tag; assumeTrue skips per
  provider when no key)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 .../agents_engine/model/ClaudeClient.kt       |  21 +-
 .../kotlin/agents_engine/model/ModelClient.kt |  50 ++++
 .../agents_engine/model/OllamaClient.kt       |  13 +
 .../agents_engine/model/OpenAiClient.kt       |  25 +-
 .../agents_engine/model/VisionFixtures.kt     | 108 +++++++++
 .../agents_engine/model/VisionLiveTest.kt     | 223 ++++++++++++++++++
 .../model/VisionWireFormatTest.kt             | 182 ++++++++++++++
 7 files changed, 619 insertions(+), 3 deletions(-)
 create mode 100644 src/test/kotlin/agents_engine/model/VisionFixtures.kt
 create mode 100644 src/test/kotlin/agents_engine/model/VisionLiveTest.kt
 create mode 100644 src/test/kotlin/agents_engine/model/VisionWireFormatTest.kt
diff --git a/src/main/kotlin/agents_engine/model/ClaudeClient.kt b/src/main/kotlin/agents_engine/model/ClaudeClient.kt
index 357b73d..5c4b052 100644
--- a/src/main/kotlin/agents_engine/model/ClaudeClient.kt
+++ b/src/main/kotlin/agents_engine/model/ClaudeClient.kt
@@ -389,7 +389,26 @@ open class ClaudeClient(
             val cacheControl = if (msg.cacheHint != null && consumeBreakpoint()) cacheControlJson(msg.cacheHint) else null
             when (msg.role) {
                 "user" -> {
-                    if (cacheControl == null) {
+                    val images = msg.images
+                    if (!images.isNullOrEmpty()) {
+                        // #2470 — vision input. Anthropic accepts a content
+                        // array of typed blocks; one text block + N image
+                        // blocks. Each image block is base64-source with a
+                        // typed media_type.
+                        val textBlock = """{"type":"text","text":${msg.content.toJsonString()}}"""
+                        val imageBlocks = images.joinToString(",") { part ->
+                            """{"type":"image","source":{"type":"base64","media_type":${part.wireMime.value.toJsonString()},"data":${part.base64.toJsonString()}}}"""
+                        }
+                        val allBlocks = "$textBlock,$imageBlocks"
+                        val withCache = if (cacheControl != null) {
+                            // Attach cache_control to the LAST block.
+                            val splitAt = allBlocks.lastIndexOf("}")
+                            allBlocks.substring(0, splitAt) + ",$cacheControl" + allBlocks.substring(splitAt)
+                        } else {
+                            allBlocks
+                        }
+                        """{"role":"user","content":[$withCache]}"""
+                    } else if (cacheControl == null) {
                         """{"role":"user","content":${msg.content.toJsonString()}}"""
                     } else {
                         // Single text content block with cache_control attached.
diff --git a/src/main/kotlin/agents_engine/model/ModelClient.kt b/src/main/kotlin/agents_engine/model/ModelClient.kt
index 5397ca3..486836a 100644
--- a/src/main/kotlin/agents_engine/model/ModelClient.kt
+++ b/src/main/kotlin/agents_engine/model/ModelClient.kt
@@ -26,8 +26,58 @@ data class LlmMessage(
      * unchanged on the wire.
      */
     val cacheHint: CacheHint? = null,
+    /**
+     * #2470 — optional vision input. When non-null and the role is
+     * `"user"`, adapters translate each [ImagePart] into the provider's
+     * native image payload alongside [content]:
+     *
+     *   - Ollama (e.g. qwen3-vl:8b) — `images: [<base64>, ...]` array
+     *     on the user message; [content] stays the text prompt.
+     *   - Anthropic Claude — `content: [{type:"text",...},
+     *     {type:"image", source:{type:"base64", media_type:"image/png",
+     *     data:"<base64>"}}, ...]`.
+     *   - OpenAI — `content: [{type:"text",...},
+     *     {type:"image_url", image_url:{url:"data:image/png;base64,
+     *     <base64>"}}, ...]`.
+     *
+     * Null or empty = no vision parts; wire shape is byte-identical to
+     * pre-#2470. Images attach to whichever user message carries them:
+     * the first user turn is the common case ("describe this image"), and
+     * images on later user turns are sent too, for models with multi-turn vision.
+     *
+     * Non-user roles ignore this field — system / assistant / tool
+     * messages don't carry images in any provider's API.
+     */
+    val images: List<ImagePart>? = null,
 )
 
+/**
+ * #2470 — base64-encoded image payload for vision input. The caller is
+ * responsible for the encoding so the adapter can splat the bytes onto
+ * the wire without re-encoding per provider. Wire MIME is closed via
+ * the [ImagePart.WireMime] sealed type — `String` mime is intentionally
+ * not accepted in the public ctor.
+ *
+ * Small, allocation-cheap. Equatability: `base64` is a `String`, so
+ * structural equals/hashCode work — unlike `ByteArray`, which uses
+ * identity equals (the trap we avoid by base64-encoding upfront).
+ */
+data class ImagePart(
+    /** Base64-encoded image bytes, no `data:` URL prefix. Adapter formats per-provider. */
+    val base64: String,
+    /** Closed wire MIME — `image/png`, `image/jpeg`, `image/gif`, `image/webp`. */
+    val wireMime: WireMime,
+) {
+    sealed interface WireMime {
+        val value: String
+
+        object Png : WireMime { override val value: String = "image/png" }
+        object Jpeg : WireMime { override val value: String = "image/jpeg" }
+        object Gif : WireMime { override val value: String = "image/gif" }
+        object Webp : WireMime { override val value: String = "image/webp" }
+    }
+}
+
 data class ToolCall(
     val name: String,
     val arguments: Map<String, Any?> = emptyMap(),
diff --git a/src/main/kotlin/agents_engine/model/OllamaClient.kt b/src/main/kotlin/agents_engine/model/OllamaClient.kt
index 09b85be..77394f8 100644
--- a/src/main/kotlin/agents_engine/model/OllamaClient.kt
+++ b/src/main/kotlin/agents_engine/model/OllamaClient.kt
@@ -398,6 +398,19 @@ open class OllamaClient(
                     })
                     append("]")
                 }
+                // #2470 — vision input. Ollama's chat API accepts an `images`
+                // array of base64-encoded payloads (no data: prefix) on user
+                // messages. Vision-capable models (qwen3-vl, llama3.2-vision,
+                // llava) consume it; non-vision models ignore it without
+                // error. Mime is not on the wire — Ollama infers from the
+                // bytes; we keep the typed wireMime on ImagePart for audit
+                // + caller debugging only.
+                val images = msg.images
+                if (msg.role == "user" && !images.isNullOrEmpty()) {
+                    append(""","images":[""")
+                    append(images.joinToString(",") { it.base64.toJsonString() })
+                    append("]")
+                }
                 append("}")
             }
         }
diff --git a/src/main/kotlin/agents_engine/model/OpenAiClient.kt b/src/main/kotlin/agents_engine/model/OpenAiClient.kt
index 7d2939e..9c8d75f 100644
--- a/src/main/kotlin/agents_engine/model/OpenAiClient.kt
+++ b/src/main/kotlin/agents_engine/model/OpenAiClient.kt
@@ -269,8 +269,29 @@ open class OpenAiClient(
 
         val messageObjects = messages.map { msg ->
             when (msg.role) {
-                "system", "user" ->
-                    """{"role":${msg.role.toJsonString()},"content":${msg.content.toJsonString()}}"""
+                "system" ->
+                    """{"role":"system","content":${msg.content.toJsonString()}}"""
+                "user" -> {
+                    val images = msg.images
+                    if (!images.isNullOrEmpty()) {
+                        // #2470 — vision input. OpenAI Chat Completions
+                        // accepts a content array of typed blocks; one text
+                        // block + N image_url blocks. Images ride as data:
+                        // URLs (data:<wireMime>;base64,<payload>). Works on
+                        // gpt-4o, gpt-4o-mini, gpt-4-turbo, and vision-capable
+                        // o* reasoning models. DeepSeek inherits this adapter;
+                        // vision is silently ignored by non-vision DeepSeek
+                        // models.
+                        val textBlock = """{"type":"text","text":${msg.content.toJsonString()}}"""
+                        val imageBlocks = images.joinToString(",") { part ->
+                            val dataUrl = "data:${part.wireMime.value};base64,${part.base64}"
+                            """{"type":"image_url","image_url":{"url":${dataUrl.toJsonString()}}}"""
+                        }
+                        """{"role":"user","content":[$textBlock,$imageBlocks]}"""
+                    } else {
+                        """{"role":"user","content":${msg.content.toJsonString()}}"""
+                    }
+                }
 
                 "assistant" -> {
                     val toolCallsJson = msg.toolCalls?.joinToString(",") { call ->
diff --git a/src/test/kotlin/agents_engine/model/VisionFixtures.kt b/src/test/kotlin/agents_engine/model/VisionFixtures.kt
new file mode 100644
index 0000000..c9fc444
--- /dev/null
+++ b/src/test/kotlin/agents_engine/model/VisionFixtures.kt
@@ -0,0 +1,108 @@
+package agents_engine.model
+
+import java.awt.BasicStroke
+import java.awt.Color
+import java.awt.RenderingHints
+import java.awt.image.BufferedImage
+import java.io.ByteArrayOutputStream
+import java.util.Base64
+import javax.imageio.ImageIO
+
+/**
+ * Programmatic PNG fixtures for vision-input tests (#2470). Images are
+ * rendered from `BufferedImage` so the test set ships in source — no
+ * binary assets, no external download — and every byte is reproducible
+ * across machines / CI runs.
+ *
+ * Size discipline: 256×256 keeps base64 payloads ~5 KB, well within
+ * cheap-tier provider limits. Hand-tested with qwen3-vl:8b / Haiku 4.5
+ * / gpt-4o-mini — all three identify both fixtures reliably at this
+ * size.
+ */
+object VisionFixtures {
+
+    /**
+     * 256×256 PNG with three colored squares on a white background:
+     * red, blue, green — spaced far enough apart that even small
+     * vision models count them reliably. Used by the
+     * "agent counts squares" eval.
+     */
+    fun threeSquaresPng(): ByteArray {
+        val w = 256
+        val h = 256
+        val img = BufferedImage(w, h, BufferedImage.TYPE_INT_RGB)
+        val g = img.createGraphics()
+        g.setRenderingHint(RenderingHints.KEY_ANTIALIASING, RenderingHints.VALUE_ANTIALIAS_OFF)
+        g.color = Color.WHITE
+        g.fillRect(0, 0, w, h)
+        // Three obviously-distinct squares at well-separated positions.
+        val side = 50
+        g.color = Color.RED
+        g.fillRect(20, 20, side, side)
+        g.color = Color.BLUE
+        g.fillRect(110, 100, side, side)
+        g.color = Color.GREEN
+        g.fillRect(190, 190, side, side)
+        // Black outlines so each square is unambiguously countable.
+        g.color = Color.BLACK
+        g.stroke = BasicStroke(2f)
+        g.drawRect(20, 20, side, side)
+        g.drawRect(110, 100, side, side)
+        g.drawRect(190, 190, side, side)
+        g.dispose()
+        return img.toPngBytes()
+    }
+
+    /**
+     * 256×256 PNG of a simple house: triangular roof, rectangular body, door,
+     * two windows. Drawn with thick black outlines so even tiny vision
+     * models classify it. Used by the "agent identifies the house" eval.
+     */
+    fun housePng(): ByteArray {
+        val w = 256
+        val h = 256
+        val img = BufferedImage(w, h, BufferedImage.TYPE_INT_RGB)
+        val g = img.createGraphics()
+        g.setRenderingHint(RenderingHints.KEY_ANTIALIASING, RenderingHints.VALUE_ANTIALIAS_ON)
+        g.color = Color.WHITE
+        g.fillRect(0, 0, w, h)
+        // Body: 130×110 rectangle, centred-ish.
+        val bodyX = 60
+        val bodyY = 110
+        val bodyW = 130
+        val bodyH = 110
+        g.color = Color(255, 230, 200) // warm beige walls
+        g.fillRect(bodyX, bodyY, bodyW, bodyH)
+        // Roof: triangle spanning slightly wider than the body, peak above.
+        val roofXs = intArrayOf(bodyX - 15, bodyX + bodyW / 2, bodyX + bodyW + 15)
+        val roofYs = intArrayOf(bodyY, bodyY - 80, bodyY)
+        g.color = Color(180, 70, 60) // terracotta roof
+        g.fillPolygon(roofXs, roofYs, 3)
+        // Door: small rectangle bottom-centre of the body.
+        g.color = Color(120, 75, 40) // brown door
+        g.fillRect(bodyX + bodyW / 2 - 18, bodyY + bodyH - 60, 36, 60)
+        // Two windows: small squares left + right of the door.
+        g.color = Color(140, 180, 220) // blue windows
+        g.fillRect(bodyX + 15, bodyY + 20, 30, 30)
+        g.fillRect(bodyX + bodyW - 45, bodyY + 20, 30, 30)
+        // Outlines: 3-px black so the silhouette is unambiguous.
+        g.color = Color.BLACK
+        g.stroke = BasicStroke(3f)
+        g.drawRect(bodyX, bodyY, bodyW, bodyH)
+        g.drawPolygon(roofXs, roofYs, 3)
+        g.drawRect(bodyX + bodyW / 2 - 18, bodyY + bodyH - 60, 36, 60)
+        g.drawRect(bodyX + 15, bodyY + 20, 30, 30)
+        g.drawRect(bodyX + bodyW - 45, bodyY + 20, 30, 30)
+        g.dispose()
+        return img.toPngBytes()
+    }
+
+    /** Encode the bytes to base64 — what every adapter ultimately sends on the wire. */
+    fun toBase64(bytes: ByteArray): String = Base64.getEncoder().encodeToString(bytes)
+
+    private fun BufferedImage.toPngBytes(): ByteArray {
+        val out = ByteArrayOutputStream()
+        ImageIO.write(this, "png", out)
+        return out.toByteArray()
+    }
+}
diff --git a/src/test/kotlin/agents_engine/model/VisionLiveTest.kt b/src/test/kotlin/agents_engine/model/VisionLiveTest.kt
new file mode 100644
index 0000000..5b4ccba
--- /dev/null
+++ b/src/test/kotlin/agents_engine/model/VisionLiveTest.kt
@@ -0,0 +1,223 @@
+package agents_engine.model
+
+import org.junit.jupiter.api.Assumptions.assumeTrue
+import org.junit.jupiter.api.Tag
+import org.junit.jupiter.api.Test
+import java.io.File
+import java.net.URI
+import java.net.http.HttpClient
+import java.net.http.HttpRequest
+import java.net.http.HttpResponse
+import java.time.Duration
+import kotlin.test.assertNotNull
+import kotlin.test.assertTrue
+
+/**
+ * #2470 — live vision-input integration tests for the three providers
+ * with vision-capable models (Ollama, Anthropic, OpenAI). Each test
+ * sends a programmatically generated PNG (three coloured squares OR a
+ * simple house) to the model and checks the response.
+ *
+ * **Cost discipline:**
+ * - 256×256 PNGs (~5 KB base64) — small payloads, fast roundtrip.
+ * - `temperature = 0`, `maxTokens = 80` — short deterministic replies.
+ * - Single-turn — no tool calls, no follow-ups.
+ * - Each provider's tests skip cleanly when the API key is missing or
+ *   Ollama isn't reachable (assumeTrue gate).
+ *
+ * **Tags:**
+ * - `live-llm` — Ollama (local + cloud-via-Ollama-Cloud). Excluded
+ *   from default `./gradlew test`; runs via `./gradlew integrationTest`.
+ * - `live-cloud-api` — direct cloud APIs (Anthropic, OpenAI, DeepSeek).
+ *   In default `./gradlew test`; `assumeTrue` skips when no key.
+ *
+ * **DeepSeek:** most DeepSeek models don't have vision today. The
+ * adapter inherits OpenAI's image-content shape; the field passes
+ * through and the model silently ignores it. We test the shape via
+ * the unit tests (`VisionWireFormatTest`) — no live call here to save
+ * cost on what is effectively a no-op.
+ *
+ * **Assertion shape:** loose — every cheap vision model has some
+ * variance in phrasing. The test passes if the answer mentions any of
+ * a small set of acceptable keywords. Goal is "did the image reach
+ * the model and elicit a sensible reply", not "did the model phrase
+ * it exactly this way."
+ */
+class VisionLiveTest {
+
+    private val ollamaVisionModel: String =
+        System.getenv("AGENTSKT_TEST_OLLAMA_VISION_MODEL") ?: "qwen3-vl:8b"
+    private val claudeVisionModel: String =
+        System.getenv("AGENTSKT_TEST_CLAUDE_VISION_MODEL") ?: "claude-haiku-4-5"
+    private val openaiVisionModel: String =
+        System.getenv("AGENTSKT_TEST_OPENAI_VISION_MODEL") ?: "gpt-4o-mini"
+
+    // ─────────────────────────── Ollama (qwen3-vl) ──────────────────────
+
+    @Tag("live-llm")
+    @Test
+    fun `Ollama qwen3-vl counts the three squares in a generated image`() {
+        assumeTrue(isOllamaReachable(), "skipping: no Ollama at localhost:11434")
+        val client = OllamaClient(model = ollamaVisionModel, temperature = 0.0)
+        val png = VisionFixtures.threeSquaresPng()
+        val response = client.chat(
+            listOf(
+                LlmMessage(
+                    role = "user",
+                    content = "How many colored squares are in this image? Answer with just the digit.",
+                    images = listOf(ImagePart(VisionFixtures.toBase64(png), ImagePart.WireMime.Png)),
+                ),
+            ),
+        )
+        val text = textOf(response)
+        println("[Ollama vision] squares → $text")
+        assertSquaresCountedAsThree(text, "Ollama($ollamaVisionModel)")
+    }
+
+    @Tag("live-llm")
+    @Test
+    fun `Ollama qwen3-vl identifies the simple house drawing`() {
+        assumeTrue(isOllamaReachable(), "skipping: no Ollama at localhost:11434")
+        val client = OllamaClient(model = ollamaVisionModel, temperature = 0.0)
+        val png = VisionFixtures.housePng()
+        val response = client.chat(
+            listOf(
+                LlmMessage(
+                    role = "user",
+                    content = "What is depicted in this image? Answer in one short phrase.",
+                    images = listOf(ImagePart(VisionFixtures.toBase64(png), ImagePart.WireMime.Png)),
+                ),
+            ),
+        )
+        val text = textOf(response)
+        println("[Ollama vision] house → $text")
+        assertSeesHouse(text, "Ollama($ollamaVisionModel)")
+    }
+
+    // ─────────────────────────── Anthropic Haiku ────────────────────────
+
+    @Tag("live-cloud-api")
+    @Test
+    fun `Claude Haiku counts the three squares`() {
+        val apiKey = loadKey("ANTHROPIC_API_KEY", ".secrets/anthropic-key")
+        assumeTrue(apiKey != null, "skipping: no Anthropic key")
+        val client = ClaudeClient(apiKey = apiKey!!, model = claudeVisionModel, temperature = 0.0, maxTokens = 80)
+        val png = VisionFixtures.threeSquaresPng()
+        val response = client.chat(
+            listOf(
+                LlmMessage(
+                    role = "user",
+                    content = "How many colored squares are in this image? Answer with just the digit.",
+                    images = listOf(ImagePart(VisionFixtures.toBase64(png), ImagePart.WireMime.Png)),
+                ),
+            ),
+        )
+        val text = textOf(response)
+        println("[Claude vision] squares → $text")
+        assertSquaresCountedAsThree(text, "Claude($claudeVisionModel)")
+    }
+
+    @Tag("live-cloud-api")
+    @Test
+    fun `Claude Haiku identifies the simple house drawing`() {
+        val apiKey = loadKey("ANTHROPIC_API_KEY", ".secrets/anthropic-key")
+        assumeTrue(apiKey != null, "skipping: no Anthropic key")
+        val client = ClaudeClient(apiKey = apiKey!!, model = claudeVisionModel, temperature = 0.0, maxTokens = 80)
+        val png = VisionFixtures.housePng()
+        val response = client.chat(
+            listOf(
+                LlmMessage(
+                    role = "user",
+                    content = "What is depicted in this image? Answer in one short phrase.",
+                    images = listOf(ImagePart(VisionFixtures.toBase64(png), ImagePart.WireMime.Png)),
+                ),
+            ),
+        )
+        val text = textOf(response)
+        println("[Claude vision] house → $text")
+        assertSeesHouse(text, "Claude($claudeVisionModel)")
+    }
+
+    // ─────────────────────────── OpenAI gpt-4o-mini ─────────────────────
+
+    @Tag("live-cloud-api")
+    @Test
+    fun `OpenAI gpt-4o-mini counts the three squares`() {
+        val apiKey = loadKey("OPENAI_API_KEY", ".secrets/openai-key")
+        assumeTrue(apiKey != null, "skipping: no OpenAI key")
+        val client = OpenAiClient(apiKey = apiKey!!, model = openaiVisionModel, temperature = 0.0, maxTokens = 80)
+        val png = VisionFixtures.threeSquaresPng()
+        val response = client.chat(
+            listOf(
+                LlmMessage(
+                    role = "user",
+                    content = "How many colored squares are in this image? Answer with just the digit.",
+                    images = listOf(ImagePart(VisionFixtures.toBase64(png), ImagePart.WireMime.Png)),
+                ),
+            ),
+        )
+        val text = textOf(response)
+        println("[OpenAI vision] squares → $text")
+        assertSquaresCountedAsThree(text, "OpenAI($openaiVisionModel)")
+    }
+
+    @Tag("live-cloud-api")
+    @Test
+    fun `OpenAI gpt-4o-mini identifies the simple house drawing`() {
+        val apiKey = loadKey("OPENAI_API_KEY", ".secrets/openai-key")
+        assumeTrue(apiKey != null, "skipping: no OpenAI key")
+        val client = OpenAiClient(apiKey = apiKey!!, model = openaiVisionModel, temperature = 0.0, maxTokens = 80)
+        val png = VisionFixtures.housePng()
+        val response = client.chat(
+            listOf(
+                LlmMessage(
+                    role = "user",
+                    content = "What is depicted in this image? Answer in one short phrase.",
+                    images = listOf(ImagePart(VisionFixtures.toBase64(png), ImagePart.WireMime.Png)),
+                ),
+            ),
+        )
+        val text = textOf(response)
+        println("[OpenAI vision] house → $text")
+        assertSeesHouse(text, "OpenAI($openaiVisionModel)")
+    }
+
+    // ───────────────────────────── Helpers ──────────────────────────────
+
+    private fun textOf(response: LlmResponse): String = when (response) {
+        is LlmResponse.Text -> response.content
+        is LlmResponse.ToolCalls -> error("vision call unexpectedly returned tool_calls: $response")
+    }
+
+    private fun assertSquaresCountedAsThree(text: String, providerLabel: String) {
+        val lowered = text.lowercase()
+        val sees3 = "3" in text || "three" in lowered
+        assertTrue(sees3, "$providerLabel did not count three squares; got: $text")
+    }
+
+    private fun assertSeesHouse(text: String, providerLabel: String) {
+        val lowered = text.lowercase()
+        val sees = listOf("house", "home", "cottage", "building", "cabin", "barn").any { it in lowered }
+        assertTrue(sees, "$providerLabel did not recognise the house drawing; got: $text")
+    }
+
+    private fun loadKey(envVar: String, secretFile: String): String? {
+        val envKey = System.getenv(envVar)
+        if (!envKey.isNullOrBlank()) return envKey
+        val file = File(secretFile)
+        return if (file.exists()) file.readText().trim().ifBlank { null } else null
+    }
+
+    private fun isOllamaReachable(): Boolean = try {
+        val client = HttpClient.newBuilder().connectTimeout(Duration.ofMillis(500)).build()
+        val request = HttpRequest.newBuilder()
+            .uri(URI.create("http://localhost:11434/api/tags"))
+            .timeout(Duration.ofMillis(1500))
+            .GET()
+            .build()
+        val response = client.send(request, HttpResponse.BodyHandlers.discarding())
+        response.statusCode() in 200..299
+    } catch (_: Exception) {
+        false
+    }
+}
diff --git a/src/test/kotlin/agents_engine/model/VisionWireFormatTest.kt b/src/test/kotlin/agents_engine/model/VisionWireFormatTest.kt
new file mode 100644
index 0000000..20bac8e
--- /dev/null
+++ b/src/test/kotlin/agents_engine/model/VisionWireFormatTest.kt
@@ -0,0 +1,182 @@
+package agents_engine.model
+
+import agents_engine.generation.LenientJsonParser
+import kotlin.test.Test
+import kotlin.test.assertEquals
+import kotlin.test.assertNotNull
+import kotlin.test.assertNull
+import kotlin.test.assertTrue
+
+/**
+ * #2470 — vision-input wire-format tests. Pins the per-provider JSON
+ * shape so an adapter regression surfaces at unit-test time, not at
+ * provider HTTP time. No network — uses each client's
+ * `buildRequestJson` and parses the result.
+ *
+ * Each provider's image payload is verified for: text-prompt
+ * preservation, image-block presence, base64 splatting, typed mime
+ * propagation. Pre-#2470 message shapes (no images) stay byte-identical.
+ */
+class VisionWireFormatTest {
+
+    private val pngBytes = byteArrayOf(0x89.toByte(), 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A, 1, 2)
+    private val pngBase64 = java.util.Base64.getEncoder().encodeToString(pngBytes)
+
+    private fun userMessage(text: String, images: List<ImagePart>? = null) =
+        LlmMessage(role = "user", content = text, images = images)
+
+    // ────────────────────────────── Ollama ──────────────────────────────
+
+    @Test
+    fun `Ollama user message with images emits images array of base64 strings`() {
+        val client = OllamaClient(model = "qwen3-vl:8b")
+        val body = client.buildRequestJson(listOf(
+            userMessage("How many squares?", listOf(ImagePart(pngBase64, ImagePart.WireMime.Png))),
+        ))
+        @Suppress("UNCHECKED_CAST")
+        val parsed = LenientJsonParser.parse(body) as Map<String, Any?>
+        @Suppress("UNCHECKED_CAST")
+        val messages = parsed["messages"] as List<Map<String, Any?>>
+        val userMsg = messages.single()
+        assertEquals("user", userMsg["role"])
+        assertEquals("How many squares?", userMsg["content"], "text prompt preserved")
+        @Suppress("UNCHECKED_CAST")
+        val images = userMsg["images"] as List<String>
+        assertEquals(1, images.size)
+        assertEquals(pngBase64, images[0], "base64 splatted verbatim, no data: prefix")
+    }
+
+    @Test
+    fun `Ollama user message without images omits the images field (back-compat)`() {
+        val client = OllamaClient(model = "llama3.2")
+        val body = client.buildRequestJson(listOf(userMessage("plain text")))
+        assertTrue("\"images\"" !in body, "no images field when not requested: $body")
+    }
+
+    @Test
+    fun `Ollama non-user message with images on it does NOT emit images (system_assistant_tool ignore)`() {
+        val client = OllamaClient(model = "qwen3-vl:8b")
+        val body = client.buildRequestJson(listOf(
+            LlmMessage(role = "system", content = "be helpful",
+                images = listOf(ImagePart(pngBase64, ImagePart.WireMime.Png))),
+        ))
+        assertTrue("\"images\"" !in body, "non-user roles never carry images on the wire: $body")
+    }
+
+    // ────────────────────────────── Anthropic ───────────────────────────
+
+    @Test
+    fun `Claude user message with images emits content array with text plus image blocks`() {
+        val client = ClaudeClient(apiKey = "test", model = "claude-haiku-4-5")
+        val body = client.buildRequestJson(listOf(
+            userMessage("Identify this image.", listOf(ImagePart(pngBase64, ImagePart.WireMime.Png))),
+        ))
+        @Suppress("UNCHECKED_CAST")
+        val parsed = LenientJsonParser.parse(body) as Map<String, Any?>
+        @Suppress("UNCHECKED_CAST")
+        val messages = parsed["messages"] as List<Map<String, Any?>>
+        @Suppress("UNCHECKED_CAST")
+        val content = messages.single()["content"] as List<Map<String, Any?>>
+        assertEquals(2, content.size, "one text block + one image block")
+        assertEquals("text", content[0]["type"])
+        assertEquals("Identify this image.", content[0]["text"])
+        assertEquals("image", content[1]["type"])
+        @Suppress("UNCHECKED_CAST")
+        val source = content[1]["source"] as Map<String, Any?>
+        assertEquals("base64", source["type"])
+        assertEquals("image/png", source["media_type"], "typed mime → wire media_type")
+        assertEquals(pngBase64, source["data"])
+    }
+
+    @Test
+    fun `Claude user message without images stays on the legacy string-content path`() {
+        val client = ClaudeClient(apiKey = "test", model = "claude-haiku-4-5")
+        val body = client.buildRequestJson(listOf(userMessage("just text please")))
+        @Suppress("UNCHECKED_CAST")
+        val parsed = LenientJsonParser.parse(body) as Map<String, Any?>
+        @Suppress("UNCHECKED_CAST")
+        val messages = parsed["messages"] as List<Map<String, Any?>>
+        // Legacy shape: content is a string, not an array.
+        assertEquals("just text please", messages.single()["content"], "back-compat: string content when no images")
+    }
+
+    // ────────────────────────────── OpenAI ──────────────────────────────
+
+    @Test
+    fun `OpenAI user message with images emits content array with text plus image_url blocks`() {
+        val client = OpenAiClient(apiKey = "test", model = "gpt-4o-mini")
+        val body = client.buildRequestJson(listOf(
+            userMessage("What is in this picture?", listOf(ImagePart(pngBase64, ImagePart.WireMime.Png))),
+        ))
+        @Suppress("UNCHECKED_CAST")
+        val parsed = LenientJsonParser.parse(body) as Map<String, Any?>
+        @Suppress("UNCHECKED_CAST")
+        val messages = parsed["messages"] as List<Map<String, Any?>>
+        @Suppress("UNCHECKED_CAST")
+        val content = messages.single()["content"] as List<Map<String, Any?>>
+        assertEquals(2, content.size)
+        assertEquals("text", content[0]["type"])
+        assertEquals("What is in this picture?", content[0]["text"])
+        assertEquals("image_url", content[1]["type"])
+        @Suppress("UNCHECKED_CAST")
+        val imageUrl = content[1]["image_url"] as Map<String, Any?>
+        val url = imageUrl["url"] as String
+        assertTrue(url.startsWith("data:image/png;base64,"), "data-URL with typed mime: $url")
+        assertTrue(url.endsWith(pngBase64), "base64 splatted verbatim at the end: $url")
+    }
+
+    @Test
+    fun `OpenAI user message without images stays on the legacy string-content path`() {
+        val client = OpenAiClient(apiKey = "test", model = "gpt-4o-mini")
+        val body = client.buildRequestJson(listOf(userMessage("just text")))
+        @Suppress("UNCHECKED_CAST")
+        val parsed = LenientJsonParser.parse(body) as Map<String, Any?>
+        @Suppress("UNCHECKED_CAST")
+        val messages = parsed["messages"] as List<Map<String, Any?>>
+        assertEquals("just text", messages.single()["content"], "back-compat: string content when no images")
+    }
+
+    @Test
+    fun `OpenAI multiple images each get their own image_url block`() {
+        val client = OpenAiClient(apiKey = "test", model = "gpt-4o-mini")
+        val body = client.buildRequestJson(listOf(
+            userMessage("Compare these.", listOf(
+                ImagePart(pngBase64, ImagePart.WireMime.Png),
+                ImagePart(pngBase64, ImagePart.WireMime.Jpeg),
+            )),
+        ))
+        @Suppress("UNCHECKED_CAST")
+        val parsed = LenientJsonParser.parse(body) as Map<String, Any?>
+        @Suppress("UNCHECKED_CAST")
+        val messages = parsed["messages"] as List<Map<String, Any?>>
+        @Suppress("UNCHECKED_CAST")
+        val content = messages.single()["content"] as List<Map<String, Any?>>
+        assertEquals(3, content.size, "text + 2 images")
+        @Suppress("UNCHECKED_CAST")
+        val u1 = (content[1]["image_url"] as Map<String, Any?>)["url"] as String
+        @Suppress("UNCHECKED_CAST")
+        val u2 = (content[2]["image_url"] as Map<String, Any?>)["url"] as String
+        assertTrue(u1.startsWith("data:image/png;base64,"))
+        assertTrue(u2.startsWith("data:image/jpeg;base64,"), "second image's wireMime propagates")
+    }
+
+    // ────────────────────────── Fixtures sanity ─────────────────────────
+
+    @Test
+    fun `VisionFixtures threeSquaresPng renders a valid PNG of reasonable size`() {
+        val bytes = VisionFixtures.threeSquaresPng()
+        // PNG magic: 89 50 4E 47 0D 0A 1A 0A
+        assertEquals(0x89.toByte(), bytes[0])
+        assertEquals(0x50.toByte(), bytes[1])
+        assertEquals(0x4E.toByte(), bytes[2])
+        assertEquals(0x47.toByte(), bytes[3])
+        assertTrue(bytes.size > 100 && bytes.size < 50_000, "small enough for cheap vision calls; got ${bytes.size}")
+    }
+
+    @Test
+    fun `VisionFixtures housePng renders a valid PNG`() {
+        val bytes = VisionFixtures.housePng()
+        assertEquals(0x89.toByte(), bytes[0])
+        assertTrue(bytes.size in 100..50_000)
+    }
+}

From cea17c0d849ff0c5c77d430fd2b4212c3ca55560 Mon Sep 17 00:00:00 2001
From: skobeltsyn <Konstantin@skobeltsyn.com>
Date: Sat, 30 May 2026 13:01:11 +0300
Subject: [PATCH 2/2] =?UTF-8?q?docs(#2470):=20vision=20input=20=E2=80=94?=
 =?UTF-8?q?=20multimodal.md=20extension,=20README,=20CHANGELOG?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- docs/multimodal.md — new "Vision input — talking to the model (#2470
  slice a)" section between the existing foundation content and the
  "What's still coming" list. Walks through the LlmMessage.images
  field, ImagePart shape, per-provider wire-format table, back-compat
  + role-gating guarantees, programmatic VisionFixtures, and the
  per-provider live test how-to-run. "What's coming" list updated to
  flag the #2470 slice-a/slice-b split (this commit is slice a; the
  Content → LlmMessage.images loop translation is slice b).
- README.md — new "Vision input to models" bullet right after the
  multimodal foundation bullet. Names all four providers and the three
  default live-test models, with the cost-discipline notes.
- CHANGELOG.md `## [Unreleased]` — new "Added — Vision input across
  all providers (#2470 slice a)" section ABOVE the existing multimodal
  foundation section. Covers LlmMessage.images + ImagePart, per-
  provider adapter rows, role-gating, fixtures, live test setup +
  cost discipline, wire-format unit-test count.
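
The role-gating and data-URL rules documented above can be sketched in a
few lines. This is a minimal illustration, not the adapter code: `Message`,
`imagesForWire`, and `toDataUrl` are hypothetical names, and the enum is a
stand-in for the real sealed `ImagePart.WireMime`.

```kotlin
// Illustrative stand-ins only; the real types live in agents_engine.model.
enum class WireMime(val wire: String) {
    Png("image/png"), Jpeg("image/jpeg"), Gif("image/gif"), Webp("image/webp")
}

data class ImagePart(val base64: String, val wireMime: WireMime)

data class Message(val role: String, val content: String, val images: List<ImagePart>? = null)

// Role gate (assumed shape): only user messages carry images on the wire.
// A null list and an empty list both yield "no images".
fun imagesForWire(msg: Message): List<ImagePart> =
    if (msg.role == "user") msg.images.orEmpty() else emptyList()

// OpenAI-style data URL, matching the shape in the wire-format table.
fun toDataUrl(part: ImagePart): String =
    "data:${part.wireMime.wire};base64,${part.base64}"

fun main() {
    val part = ImagePart("AAAA", WireMime.Png)
    // User message: the image survives the gate.
    println(imagesForWire(Message("user", "How many squares?", listOf(part))).map(::toDataUrl))
    // System message: the image is dropped.
    println(imagesForWire(Message("system", "be helpful", listOf(part))))
}
```

Running `main` prints `[data:image/png;base64,AAAA]` followed by `[]`.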

No source changes. Full suite stays at 1794 / 0 failures from the
prior commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 CHANGELOG.md       | 13 ++++++++++
 README.md          |  3 ++-
 docs/multimodal.md | 65 ++++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 78 insertions(+), 3 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 03af14a..e7d3dc9 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -4,6 +4,19 @@ All notable changes to Agents.KT are documented here. The format follows [Keep a
 
 ## [Unreleased]
 
+### Added — Vision input across all providers (#2470 slice a)
+
+- **`LlmMessage.images: List<ImagePart>? = null`** — new optional field; back-compat default leaves the wire shape byte-identical to pre-#2470 for callers that don't pass images. Closed `ImagePart(base64, wireMime)` with `WireMime` sealed type (`Png`, `Jpeg`, `Gif`, `Webp`) — `String` mime is intentionally not accepted in the public ctor.
+- **Per-provider adapters** translate vision on `role = "user"` messages:
+  - Ollama: `{role:"user", content:"text", images:["<b64>", ...]}` — works with `qwen3-vl:8b`, `llava`, `llama3.2-vision`, etc. Non-vision models silently ignore the field.
+  - Claude: typed content array — `[{type:"text"}, {type:"image", source:{type:"base64", media_type:"image/png", data:"<b64>"}}, ...]`. Works with all Claude vision-capable models (Haiku 4.5, Sonnet 4.6, Opus 4.7).
+  - OpenAI: typed content array — `[{type:"text"}, {type:"image_url", image_url:{url:"data:image/png;base64,<b64>"}}, ...]`. Works with gpt-4o, gpt-4o-mini, gpt-4-turbo, the o* reasoning models.
+  - DeepSeek: inherits the OpenAI adapter shape; current DeepSeek models lack vision and silently ignore the field. Shape-tested; no live call to avoid spending on a no-op.
+- **Role-gated:** adapters drop `images` from non-user messages (system/assistant/tool) even when the field is non-null, because no provider's API accepts images on those roles. Pinned by tests.
+- **Programmatic fixtures** in `src/test`: `VisionFixtures.threeSquaresPng()` (256×256 red/blue/green squares for "count the squares" eval) and `VisionFixtures.housePng()` (256×256 cartoon house for "what is this?" eval). Rendered via `BufferedImage` + `ImageIO` — reproducible byte-for-byte across machines and CI, no external assets in the repo.
+- **Live integration tests** (`VisionLiveTest`) cover all three vision-capable providers with cost discipline (`temperature = 0`, `maxTokens = 80`, single-turn, ~5KB base64 payloads): Ollama `qwen3-vl:8b` (tagged `live-llm`, runs via `:integrationTest`), Claude `claude-haiku-4-5` and OpenAI `gpt-4o-mini` (tagged `live-cloud-api`, runs in default `:test` with `assumeTrue` skipping when no key). Model names overridable via env. Assertion shape is loose keyword-match — robust against per-model phrasing variance.
+- 8 wire-format unit tests pin per-provider JSON shape + the no-images back-compat path. See [docs/multimodal.md](docs/multimodal.md#vision-input--talking-to-the-model-2470-slice-a).
+
 ### Added — Multimodal foundation (#2465 epic, Stage 1)
 
 - **Typed `Content` hierarchy (#2466)** — `sealed interface Content` with variants `Text`, `Image`, `Audio`, `Video`, `Document` in package `agents_engine.content`. Each non-text variant carries a `ContentRef` plus a typed mime (`ImageMime`, `AudioMime`, `VideoMime`, `DocMime`). Mime types are closed sealed interfaces with `wireMime: String` accessors — no `String` mime in any public API. Extension property `Content.modality: String` is the audit-stable per-variant name. Stage 1 wires Image + Document end-to-end (the modalities the 0.8 spec → product loop consumes); Audio + Video are modelled now and exercised through provider adapters in Stage 2 (#2470, deferred).
diff --git a/README.md b/README.md
index e0eea6a..a7ca4c3 100644
--- a/README.md
+++ b/README.md
@@ -155,7 +155,8 @@ These APIs work in `main`, are unit-tested, and are exercised by integration tes
 - **Budget controls** — `budget { maxTurns; maxToolCalls; maxDuration; perToolTimeout; maxTokens; maxConsecutiveSameTool }` (`perToolTimeout` covers regular and session-aware tools; token counts cumulative across turns when the provider reports usage; `maxConsecutiveSameTool` catches LLM retry loops on a broken tool) (#637, #963, #969, #1903). `onBudgetExceeded { reason, currentLimit -> BudgetDecision.Extend(newLimit) }` raises a cap and continues instead of throwing — a long-running agent can grant itself more tool calls mid-run rather than failing (#2412). `BudgetDecision.Checkpoint` (#2749) is the third variant — pause at the cap, deliver a `SessionSnapshot` via the registered `onTurnCheckpoint` hook, throw a recoverable `BudgetCheckpointException`, and resume later via `agent.invokeSuspendResuming(input, resumeFrom = snapshot)` once the human approves a raise (no history replay).
 - **Public snapshot / resume** — `agent.invokeSuspendResuming(input, resumeFrom = null, onTurnCheckpoint = null)` (#2749) is the public seam over the internal `executeAgentic(resumeFrom, onTurnCheckpoint)` primitives from #2416. With defaults it matches `invokeSuspend(input)` byte-for-byte; with `onTurnCheckpoint` set it captures a `SessionSnapshot` at every turn boundary; with `resumeFrom = snapshot` it continues an in-flight invocation without replaying history. On resume the loop honors `max(snapshot.toolCallLimit, agent.budget.maxToolCalls)` so a rebuilt agent with a raised cap actually picks it up.
 - **Eval harness** — `DeterministicModelClient(LlmResponse.Text("..."), LlmResponse.ToolCalls(...))` (#2492) scripts model responses for reproducible eval without a live provider; the streaming flow folds into the same Started → ArgsDelta → Finished → End chunk sequence a native streaming provider would emit. Typed assertion DSL `eval<IN, OUT>("name") { input(...); expect { ... }; expectSnapshot(...) }` (#2493) runs against the parsed `OUT` — not regex on the wire. Snapshot mode pins `toLlmInput(output)` JSON for structural diffs; `evalSuite { + case; + case }` bundles cases. Optional `judge("tone", rubric)` (#2494) runs an advisory LLM-as-judge scorer with a typed `@Generable` `JudgeVerdict` — explicitly separate from the deterministic pass/fail contract (judges never gate). See [docs/eval.md](docs/eval.md).
-- **Multimodal foundation** — `sealed Content { Text, Image(ref, ImageMime), Audio, Video, Document(ref, DocMime) }` (#2466) with closed mime types per modality (no `String` mime). Content-addressed `ContentRef(hash, sizeBytes, wireMime)` + `BlobStore` interface, `InMemoryBlobStore` / `FileBlobStore` impls — SHA-256 keys match the manifest-hash family, atomic tmp+rename, process-restart-safe, idempotent put (#2467). Tools can return `ToolResult(parts: List<Content>)` for mixed text + image + document outputs; JSONL audit exporter records `outputParts` per-part summary (`<modality>:<hash-prefix>:<size>:<mime>`) with no blob bytes in the audit row (#2469). Stage 1 wires Image + Document end-to-end; Audio + Video modelled now and exercised through provider adapters in Stage 2 (#2470, deferred). See [docs/multimodal.md](docs/multimodal.md).
+- **Multimodal foundation** — `sealed Content { Text, Image(ref, ImageMime), Audio, Video, Document(ref, DocMime) }` (#2466) with closed mime types per modality (no `String` mime). Content-addressed `ContentRef(hash, sizeBytes, wireMime)` + `BlobStore` interface, `InMemoryBlobStore` / `FileBlobStore` impls — SHA-256 keys match the manifest-hash family, atomic tmp+rename, process-restart-safe, idempotent put (#2467). Tools can return `ToolResult(parts: List<Content>)` for mixed text + image + document outputs; JSONL audit exporter records `outputParts` per-part summary (`<modality>:<hash-prefix>:<size>:<mime>`) with no blob bytes in the audit row (#2469). Stage 1 wires Image + Document end-to-end. See [docs/multimodal.md](docs/multimodal.md).
+- **Vision input to models** — `LlmMessage(role = "user", content = "...", images = listOf(ImagePart(base64, ImagePart.WireMime.Png)))` (#2470 slice a) reaches all four built-in adapters: Ollama emits `images: [<b64>...]`, Claude emits `{type:"image", source:{type:"base64",...}}` content blocks, OpenAI emits `{type:"image_url", image_url:{url:"data:..."}}` content blocks, DeepSeek inherits OpenAI (silently ignored on non-vision models). Closed `ImagePart.WireMime { Png, Jpeg, Gif, Webp }` — no `String` mime. Programmatic `VisionFixtures.threeSquaresPng()` / `housePng()` (256×256, `BufferedImage`-rendered, ~5KB) + per-provider live tests (qwen3-vl:8b / Haiku 4.5 / gpt-4o-mini) with cost discipline. See [docs/multimodal.md](docs/multimodal.md#vision-input--talking-to-the-model-2470-slice-a).
 - **Prompt caching across providers** — `agent { caching { enabled = true; cacheSystemPrompt = true; cacheToolDefs = true; cacheConversation = Rolling; ttl = 1.hours; cacheable("doc-id") { ... } } }`. Vendor-neutral DSL drives Anthropic's explicit `cache_control` breakpoints (#2658), OpenAI / DeepSeek automatic prefix caching with a stable `prompt_cache_key` routing hint (#2659 / #2661), Ollama / vLLM / SGLang engine-level KV-cache reuse (no-op hints, #2662), and surfaces cache reads + writes + hit-rate on `TokenUsage` (#2663). A prefix-stability guard (#2657) detects silent cache-busters — timestamps, UUIDs, non-deterministic ordering inside cacheable segments — and warns before you pay for a single non-cached run. Off by default; non-breaking. See [docs/caching.md](docs/caching.md).
 - **JSONL audit exporter** — `:agents-kt-observability` writes append-only, one-line-per-event audit rows with `requestId`, `sessionId`, `manifestHash`, agent/skill/tool ids, event type, provider, and model; raw arguments/results are omitted by default (#1914). See [docs/observability.md](docs/observability.md).
 - **ObservabilityBridge adapters** — `.observe(OtelBridge(tracer))` maps runtime events to OTel spans (#1908), `.observe(LangSmithBridge(apiKey, project))` maps the same events to LangSmith run trees (#1909), and `.observe(LangfuseBridge(publicKey, secretKey))` maps them to Langfuse traces, generations, spans, and events (#1910), while keeping core vendor-free. See [docs/observability.md](docs/observability.md).
diff --git a/docs/multimodal.md b/docs/multimodal.md
index f46ba66..ee10a41 100644
--- a/docs/multimodal.md
+++ b/docs/multimodal.md
@@ -143,10 +143,71 @@ The same discipline applies (when wired) to the OTel / LangSmith / Langfuse brid
 
 Pairs with the #2754 manifest-hash restore guard: resume across an agent rebuild that changed tools (including the `BlobStore` wiring) fails closed unless the caller opts in.
 
-## What's coming (the rest of #2465)
+## Vision input — talking to the model (#2470 slice a)
+
+The foundation above answers "how do tools return mixed content?" The follow-on slice answers "how does the agent send an image to a vision-capable model?"
+
+`LlmMessage` gains an optional `images: List<ImagePart>? = null` field. Adapters translate it per provider:
+
+```kotlin
+import agents_engine.model.LlmMessage
+import agents_engine.model.ImagePart
+
+val png: ByteArray = java.io.File("screenshot.png").readBytes()
+val client = OllamaClient(model = "qwen3-vl:8b", temperature = 0.0)
+
+client.chat(listOf(
+    LlmMessage(
+        role = "user",
+        content = "How many windows in this picture?",
+        images = listOf(ImagePart(
+            base64 = java.util.Base64.getEncoder().encodeToString(png),
+            wireMime = ImagePart.WireMime.Png,
+        )),
+    ),
+))
+```
+
+`ImagePart` carries the base64 payload plus a closed `WireMime` sealed type (`Png`, `Jpeg`, `Gif`, `Webp`). A `String` mime is not accepted in the public constructor, the same closed-mime discipline as `Content.Image`. The caller base64-encodes up front so the adapter can place the payload on the wire without re-encoding per provider.
+
+| Provider | User-message shape |
+|---|---|
+| Ollama | `{role:"user", content:"text", images:["<b64>", ...]}` |
+| Claude | `{role:"user", content:[{type:"text"}, {type:"image", source:{type:"base64", media_type:"image/png", data:"<b64>"}}, ...]}` |
+| OpenAI | `{role:"user", content:[{type:"text"}, {type:"image_url", image_url:{url:"data:image/png;base64,<b64>"}}, ...]}` |
+| DeepSeek | inherits OpenAI; current DeepSeek models lack vision and silently ignore the field |
+
+**Back-compat:** `images = null` (or an empty list) produces a wire shape byte-identical to pre-#2470. Pinned by dedicated tests.
+
+**Role-gating:** images are emitted only on `role = "user"` messages. For system / assistant / tool messages the adapters drop a non-null `images` field from the wire, because no provider's API carries images on those roles.
+
+### Programmatic image fixtures
+
+`VisionFixtures` (in `src/test`) ships two reproducible PNGs rendered via `BufferedImage`:
+
+- `threeSquaresPng()` — 256×256, three colored squares (red/blue/green) on white with thick black outlines. Used for "count the squares" eval against cheap vision models.
+- `housePng()` — 256×256 cartoon house: triangle roof + body + door + two windows. Used for "what is this?" eval.
+
+No external assets, no binary blobs in the repo — every byte is reproducible across machines and CI runs.
+
+### Live tests
+
+`VisionLiveTest.kt` runs the two fixtures against all three vision-capable providers, with cost discipline (256×256 PNG ~5KB, `temperature = 0`, `maxTokens = 80`, single-turn):
+
+| Provider | Default model | How to run |
+|---|---|---|
+| Ollama | `qwen3-vl:8b` | `./gradlew integrationTest --tests "*VisionLiveTest*"` (tagged `live-llm`) |
+| Claude | `claude-haiku-4-5` | `./gradlew test --tests "*VisionLiveTest*"` (tagged `live-cloud-api`; `assumeTrue` skips when no key) |
+| OpenAI | `gpt-4o-mini` | same command and tag as Claude |
+
+Models overridable via env (`AGENTSKT_TEST_OLLAMA_VISION_MODEL`, `AGENTSKT_TEST_CLAUDE_VISION_MODEL`, `AGENTSKT_TEST_OPENAI_VISION_MODEL`).
+
+The assertion is deliberately loose: a test passes if the model's text response mentions any keyword from a small accepted set (`3` / `three` for counting; `house` / `home` / `cottage` / `building` / `cabin` / `barn` for the house). The goal is to confirm that the image reached the model and elicited a sensible reply, not to match exact phrasing.
+
+## What's still coming (rest of #2465)
 
 - **#2468** Compile-time modality routing — `Agent<Image, X>` becomes a real type; cross-modality miswiring is a compile error. Multi-part `@Generable` inputs via KSP.
-- **#2470** Provider adapters — Claude vision, OpenAI vision, Gemini, Ollama multimodal. Translates `Content → provider-specific payload` at the wire.
+- **#2470 (slice b)** `Content` → `LlmMessage.images` translation at the agentic loop — currently the caller dereferences `ContentRef` → bytes → `ImagePart` manually. Sliced this way to land the wire format first; the loop hook is a small follow-up.
 - **#2471** Manifest-anchored modality capability — declared per-agent modalities recorded in the permission manifest, validated against provider capabilities at build time.
 - **#2472** Multimodal memory — `MemoryBank` entries carry `ContentRef` for image/audio/video state.
 - **#2473** Testing fixtures + snapshot + mutation coverage.