Deep-CodeAI · Skobeltsyn · May 30, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -4,6 +4,19 @@ All notable changes to Agents.KT are documented here. The format follows [Keep a
 
 ## [Unreleased]
 
+### Added — Vision input across all providers (#2470 slice a)
+
+- **`LlmMessage.images: List<ImagePart>? = null`** — new optional field; back-compat default leaves the wire shape byte-identical to pre-#2470 for callers that don't pass images. Closed `ImagePart(base64, wireMime)` with `WireMime` sealed type (`Png`, `Jpeg`, `Gif`, `Webp`) — `String` mime is intentionally not accepted in the public ctor.
+- **Per-provider adapters** translate vision on `role = "user"` messages:
+  - Ollama: `{role:"user", content:"text", images:["<b64>", ...]}` — works with `qwen3-vl:8b`, `llava`, `llama3.2-vision`, etc. Non-vision models silently ignore the field.
+  - Claude: typed content array — `[{type:"text"}, {type:"image", source:{type:"base64", media_type:"image/png", data:"<b64>"}}, ...]`. Works with all Claude vision-capable models (Haiku 4.5, Sonnet 4.6, Opus 4.7).
+  - OpenAI: typed content array — `[{type:"text"}, {type:"image_url", image_url:{url:"data:image/png;base64,<b64>"}}, ...]`. Works with gpt-4o, gpt-4o-mini, gpt-4-turbo, the o* reasoning models.
+  - DeepSeek: inherits the OpenAI adapter shape; current DeepSeek models lack vision and silently ignore the field. Shape-tested; no live call to avoid spending on a no-op.
+- **Role-gated:** non-user messages (system/assistant/tool) with non-null `images` ignore the field on the wire — no provider's API accepts images on those roles. Pinned by tests.
+- **Programmatic fixtures** in `src/test`: `VisionFixtures.threeSquaresPng()` (256×256 red/blue/green squares for "count the squares" eval) and `VisionFixtures.housePng()` (256×256 cartoon house for "what is this?" eval). Rendered via `BufferedImage` + `ImageIO` — reproducible byte-for-byte across machines and CI, no external assets in the repo.
+- **Live integration tests** (`VisionLiveTest`) cover all three vision-capable providers with cost discipline (`temperature = 0`, `maxTokens = 80`, single-turn, ~5KB base64 payloads): Ollama `qwen3-vl:8b` (tagged `live-llm`, runs via `:integrationTest`), Claude `claude-haiku-4-5` and OpenAI `gpt-4o-mini` (tagged `live-cloud-api`, runs in default `:test` with `assumeTrue` skipping when no key). Model names overridable via env. Assertion shape is loose keyword-match — robust against per-model phrasing variance.
+- 8 wire-format unit tests pin per-provider JSON shape + the no-images back-compat path. See [docs/multimodal.md](docs/multimodal.md#vision-input--talking-to-the-model-2470-slice-a).
+
 ### Added — Multimodal foundation (#2465 epic, Stage 1)
 
 - **Typed `Content` hierarchy (#2466)** — `sealed interface Content` with variants `Text`, `Image`, `Audio`, `Video`, `Document` in package `agents_engine.content`. Each non-text variant carries a `ContentRef` plus a typed mime (`ImageMime`, `AudioMime`, `VideoMime`, `DocMime`). Mime types are closed sealed interfaces with `wireMime: String` accessors — no `String` mime in any public API. Extension property `Content.modality: String` is the audit-stable per-variant name. Stage 1 wires Image + Document end-to-end (the modalities the 0.8 spec → product loop consumes); Audio + Video are modelled now and exercised through provider adapters in Stage 2 (#2470, deferred).
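
The OpenAI adapter is described in the changelog entry above, but its source is not part of this diff. As a rough illustration of the documented wire shape, the sketch below builds a user message with `image_url` data-URL blocks. `SketchMime`, `SketchImage`, `jsonString`, and `openAiUserMessage` are hypothetical names introduced here, not the library's API; the real adapter may differ in detail.

```kotlin
// Hypothetical stand-ins for the PR's ImagePart / WireMime, so the sketch compiles alone.
enum class SketchMime(val value: String) {
    Png("image/png"), Jpeg("image/jpeg"), Gif("image/gif"), Webp("image/webp")
}
data class SketchImage(val base64: String, val mime: SketchMime)

// Minimal JSON string escaper (assumed helper; the PR uses its own toJsonString()).
fun jsonString(s: String): String = buildString {
    append('"')
    for (c in s) {
        when {
            c == '"' -> append("\\\"")
            c == '\\' -> append("\\\\")
            c == '\n' -> append("\\n")
            c == '\r' -> append("\\r")
            c == '\t' -> append("\\t")
            c < ' ' -> append("\\u" + c.code.toString(16).padStart(4, '0'))
            else -> append(c)
        }
    }
    append('"')
}

// No images: plain string content (the back-compat path).
// With images: one text block followed by one image_url block per image.
fun openAiUserMessage(text: String, images: List<SketchImage>): String {
    if (images.isEmpty()) return """{"role":"user","content":${jsonString(text)}}"""
    val blocks = buildList {
        add("""{"type":"text","text":${jsonString(text)}}""")
        for (img in images) {
            val url = "data:${img.mime.value};base64,${img.base64}"
            add("""{"type":"image_url","image_url":{"url":${jsonString(url)}}}""")
        }
    }
    return """{"role":"user","content":[${blocks.joinToString(",")}]}"""
}

fun main() {
    println(openAiUserMessage("hi", emptyList()))
    println(openAiUserMessage("hi", listOf(SketchImage("QUJD", SketchMime.Png))))
}
```

The MIME type only appears inside the data URL, which is why a closed set of values is enough to keep the URL well-formed.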

diff --git a/README.md b/README.md
@@ -155,7 +155,8 @@ These APIs work in `main`, are unit-tested, and are exercised by integration tes
 - **Budget controls** — `budget { maxTurns; maxToolCalls; maxDuration; perToolTimeout; maxTokens; maxConsecutiveSameTool }` (`perToolTimeout` covers regular and session-aware tools; token counts cumulative across turns when the provider reports usage; `maxConsecutiveSameTool` catches LLM retry loops on a broken tool) (#637, #963, #969, #1903). `onBudgetExceeded { reason, currentLimit -> BudgetDecision.Extend(newLimit) }` raises a cap and continues instead of throwing — a long-running agent can grant itself more tool calls mid-run rather than failing (#2412). `BudgetDecision.Checkpoint` (#2749) is the third variant — pause at the cap, deliver a `SessionSnapshot` via the registered `onTurnCheckpoint` hook, throw a recoverable `BudgetCheckpointException`, and resume later via `agent.invokeSuspendResuming(input, resumeFrom = snapshot)` once the human approves a raise (no history replay).
 - **Public snapshot / resume** — `agent.invokeSuspendResuming(input, resumeFrom = null, onTurnCheckpoint = null)` (#2749) is the public seam over the internal `executeAgentic(resumeFrom, onTurnCheckpoint)` primitives from #2416. With defaults it matches `invokeSuspend(input)` byte-for-byte; with `onTurnCheckpoint` set it captures a `SessionSnapshot` at every turn boundary; with `resumeFrom = snapshot` it continues an in-flight invocation without replaying history. On resume the loop honors `max(snapshot.toolCallLimit, agent.budget.maxToolCalls)` so a rebuilt agent with a raised cap actually picks it up.
 - **Eval harness** — `DeterministicModelClient(LlmResponse.Text("..."), LlmResponse.ToolCalls(...))` (#2492) scripts model responses for reproducible eval without a live provider; the streaming flow folds into the same Started → ArgsDelta → Finished → End chunk sequence a native streaming provider would emit. Typed assertion DSL `eval<IN, OUT>("name") { input(...); expect { ... }; expectSnapshot(...) }` (#2493) runs against the parsed `OUT` — not regex on the wire. Snapshot mode pins `toLlmInput(output)` JSON for structural diffs; `evalSuite { + case; + case }` bundles cases. Optional `judge("tone", rubric)` (#2494) runs an advisory LLM-as-judge scorer with a typed `@Generable` `JudgeVerdict` — explicitly separate from the deterministic pass/fail contract (judges never gate). See [docs/eval.md](docs/eval.md).
-- **Multimodal foundation** — `sealed Content { Text, Image(ref, ImageMime), Audio, Video, Document(ref, DocMime) }` (#2466) with closed mime types per modality (no `String` mime). Content-addressed `ContentRef(hash, sizeBytes, wireMime)` + `BlobStore` interface, `InMemoryBlobStore` / `FileBlobStore` impls — SHA-256 keys match the manifest-hash family, atomic tmp+rename, process-restart-safe, idempotent put (#2467). Tools can return `ToolResult(parts: List<Content>)` for mixed text + image + document outputs; JSONL audit exporter records `outputParts` per-part summary (`<modality>:<hash-prefix>:<size>:<mime>`) with no blob bytes in the audit row (#2469). Stage 1 wires Image + Document end-to-end; Audio + Video modelled now and exercised through provider adapters in Stage 2 (#2470, deferred). See [docs/multimodal.md](docs/multimodal.md).
+- **Multimodal foundation** — `sealed Content { Text, Image(ref, ImageMime), Audio, Video, Document(ref, DocMime) }` (#2466) with closed mime types per modality (no `String` mime). Content-addressed `ContentRef(hash, sizeBytes, wireMime)` + `BlobStore` interface, `InMemoryBlobStore` / `FileBlobStore` impls — SHA-256 keys match the manifest-hash family, atomic tmp+rename, process-restart-safe, idempotent put (#2467). Tools can return `ToolResult(parts: List<Content>)` for mixed text + image + document outputs; JSONL audit exporter records `outputParts` per-part summary (`<modality>:<hash-prefix>:<size>:<mime>`) with no blob bytes in the audit row (#2469). Stage 1 wires Image + Document end-to-end. See [docs/multimodal.md](docs/multimodal.md).
+- **Vision input to models** — `LlmMessage(role = "user", content = "...", images = listOf(ImagePart(base64, ImagePart.WireMime.Png)))` (#2470 slice a) reaches all four built-in adapters: Ollama emits `images: [<b64>...]`, Claude emits `{type:"image", source:{type:"base64",...}}` content blocks, OpenAI emits `{type:"image_url", image_url:{url:"data:..."}}` content blocks, DeepSeek inherits OpenAI (silently ignored on non-vision models). Closed `ImagePart.WireMime { Png, Jpeg, Gif, Webp }` — no `String` mime. Programmatic `VisionFixtures.threeSquaresPng()` / `housePng()` (256×256, `BufferedImage`-rendered, ~5KB) + per-provider live tests (qwen3-vl:8b / Haiku 4.5 / gpt-4o-mini) with cost discipline. See [docs/multimodal.md](docs/multimodal.md#vision-input--talking-to-the-model-2470-slice-a).
 - **Prompt caching across providers** — `agent { caching { enabled = true; cacheSystemPrompt = true; cacheToolDefs = true; cacheConversation = Rolling; ttl = 1.hours; cacheable("doc-id") { ... } } }`. Vendor-neutral DSL drives Anthropic's explicit `cache_control` breakpoints (#2658), OpenAI / DeepSeek automatic prefix caching with a stable `prompt_cache_key` routing hint (#2659 / #2661), Ollama / vLLM / SGLang engine-level KV-cache reuse (no-op hints, #2662), and surfaces cache reads + writes + hit-rate on `TokenUsage` (#2663). A prefix-stability guard (#2657) detects silent cache-busters — timestamps, UUIDs, non-deterministic ordering inside cacheable segments — and warns before you pay for a single non-cached run. Off by default; non-breaking. See [docs/caching.md](docs/caching.md).
 - **JSONL audit exporter** — `:agents-kt-observability` writes append-only, one-line-per-event audit rows with `requestId`, `sessionId`, `manifestHash`, agent/skill/tool ids, event type, provider, and model; raw arguments/results are omitted by default (#1914). See [docs/observability.md](docs/observability.md).
 - **ObservabilityBridge adapters** — `.observe(OtelBridge(tracer))` maps runtime events to OTel spans (#1908), `.observe(LangSmithBridge(apiKey, project))` maps the same events to LangSmith run trees (#1909), and `.observe(LangfuseBridge(publicKey, secretKey))` maps them to Langfuse traces, generations, spans, and events (#1910), while keeping core vendor-free. See [docs/observability.md](docs/observability.md).
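
The `VisionFixtures` source is not included in this diff. As a sketch of how a fixture like `threeSquaresPng()` could be rendered with `BufferedImage` and `ImageIO`, the example below draws three outlined squares on a white 256×256 canvas and encodes the result as PNG. The function name `threeSquaresPngSketch`, the square positions, and the stroke width are assumptions made for illustration only.

```kotlin
import java.awt.BasicStroke
import java.awt.Color
import java.awt.image.BufferedImage
import java.io.ByteArrayOutputStream
import java.util.Base64
import javax.imageio.ImageIO

// Hypothetical re-creation of a "three squares" fixture; layout values are assumptions.
fun threeSquaresPngSketch(): ByteArray {
    val img = BufferedImage(256, 256, BufferedImage.TYPE_INT_RGB)
    val g = img.createGraphics()
    try {
        // White background.
        g.color = Color.WHITE
        g.fillRect(0, 0, 256, 256)
        // Three 64x64 squares in a row, each with a thick black outline.
        val fills = listOf(Color.RED, Color.BLUE, Color.GREEN)
        fills.forEachIndexed { i, fill ->
            val x = 16 + i * 80
            g.color = fill
            g.fillRect(x, 96, 64, 64)
            g.color = Color.BLACK
            g.stroke = BasicStroke(4f)
            g.drawRect(x, 96, 64, 64)
        }
    } finally {
        g.dispose()
    }
    val out = ByteArrayOutputStream()
    ImageIO.write(img, "png", out)
    return out.toByteArray()
}

fun main() {
    val png = threeSquaresPngSketch()
    val b64 = Base64.getEncoder().encodeToString(png)
    println("png bytes=${png.size}, base64 chars=${b64.length}")
}
```

Anti-aliasing is left at its default (off), so the pixel values do not depend on rendering hints.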

diff --git a/docs/multimodal.md b/docs/multimodal.md
@@ -143,10 +143,75 @@ The same discipline applies (when wired) to the OTel / LangSmith / Langfuse brid
 
 Pairs with the #2754 manifest-hash restore guard: resume across an agent rebuild that changed tools (including the `BlobStore` wiring) fails closed unless the caller opts in.
 
-## What's coming (the rest of #2465)
+## Vision input — talking to the model (#2470 slice a)
+
+The foundation above answers "how do tools return mixed content?" The follow-on slice answers "how does the agent send an image to a vision-capable model?"
+
+`LlmMessage` gains an optional `images: List<ImagePart>? = null` field. Adapters translate it per provider:
+
+```kotlin
+import agents_engine.model.ImagePart
+import agents_engine.model.LlmMessage
+import agents_engine.model.OllamaClient
+import java.nio.file.Files
+import java.nio.file.Path
+import java.util.Base64
+
+val png: ByteArray = Files.readAllBytes(Path.of("screenshot.png"))
+val client = OllamaClient(model = "qwen3-vl:8b", temperature = 0.0)
+
+client.chat(listOf(
+    LlmMessage(
+        role = "user",
+        content = "How many windows in this picture?",
+        images = listOf(ImagePart(
+            base64 = Base64.getEncoder().encodeToString(png),
+            wireMime = ImagePart.WireMime.Png,
+        )),
+    ),
+))
+```
+
+`ImagePart` carries the base64 payload + a closed `WireMime` sealed type (`Png`, `Jpeg`, `Gif`, `Webp`). `String` mime is not accepted in the public ctor — same closed-mime discipline as `Content.Image`. Caller base64-encodes upfront so the adapter can splat the payload straight onto the wire without re-encoding per provider.
+
+| Provider | User-message shape |
+|---|---|
+| Ollama | `{role:"user", content:"text", images:["<b64>", ...]}` |
+| Claude | `{role:"user", content:[{type:"text"}, {type:"image", source:{type:"base64", media_type:"image/png", data:"<b64>"}}, ...]}` |
+| OpenAI | `{role:"user", content:[{type:"text"}, {type:"image_url", image_url:{url:"data:image/png;base64,<b64>"}}, ...]}` |
+| DeepSeek | inherits OpenAI; current DeepSeek models lack vision and silently ignore the field |
+
+**Back-compat:** `images = null` (or empty) produces byte-identical wire shape to pre-#2470. Pinned by dedicated tests.
+
+**Role-gating:** images are emitted only on `role = "user"` messages. System / assistant / tool messages with non-null `images` ignore the field on the wire — no provider's API carries images on those roles.
+
+### Programmatic image fixtures
+
+`VisionFixtures` (in `src/test`) ships two reproducible PNGs rendered via `BufferedImage`:
+
+- `threeSquaresPng()` — 256×256, three colored squares (red/blue/green) on white with thick black outlines. Used for "count the squares" eval against cheap vision models.
+- `housePng()` — 256×256 cartoon house: triangle roof + body + door + two windows. Used for "what is this?" eval.
+
+No external assets, no binary blobs in the repo — every byte is reproducible across machines and CI runs.
+
+### Live tests
+
+`VisionLiveTest.kt` runs the two fixtures against all three vision-capable providers, with cost discipline (256×256 PNG ~5KB, `temperature = 0`, `maxTokens = 80`, single-turn):
+
+| Provider | Default model | How to run |
+|---|---|---|
+| Ollama | `qwen3-vl:8b` | `./gradlew integrationTest --tests "*VisionLiveTest*"` (tagged `live-llm`) |
+| Claude | `claude-haiku-4-5` | `./gradlew test --tests "*VisionLiveTest*"` (tagged `live-cloud-api`; `assumeTrue` skips when no key) |
+| OpenAI | `gpt-4o-mini` | same |
+
+Models overridable via env (`AGENTSKT_TEST_OLLAMA_VISION_MODEL`, `AGENTSKT_TEST_CLAUDE_VISION_MODEL`, `AGENTSKT_TEST_OPENAI_VISION_MODEL`).
+
+Assertion shape is loose: the test passes if the model's text response mentions one of a small acceptable keyword set (`3` / `three` for counting; `house` / `home` / `cottage` / `building` / `cabin` / `barn` for the house). Goal is "did the image reach the model and elicit a sensible reply" — not exact phrasing.
+
+## What's still coming (rest of #2465)
 
 - **#2468** Compile-time modality routing — `Agent<Image, X>` becomes a real type; cross-modality miswiring is a compile error. Multi-part `@Generable` inputs via KSP.
-- **#2470** Provider adapters — Claude vision, OpenAI vision, Gemini, Ollama multimodal. Translates `Content → provider-specific payload` at the wire.
+- **#2470 (slice b)** `Content` → `LlmMessage.images` translation at the agentic loop — currently the caller dereferences `ContentRef` → bytes → `ImagePart` manually. Sliced this way to land the wire format first; the loop hook is a small follow-up.
 - **#2471** Manifest-anchored modality capability — declared per-agent modalities recorded in the permission manifest, validated against provider capabilities at build time.
 - **#2472** Multimodal memory — `MemoryBank` entries carry `ContentRef` for image/audio/video state.
 - **#2473** Testing fixtures + snapshot + mutation coverage.
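
The loose keyword assertion described in the "Live tests" section of docs/multimodal.md can be sketched as a small helper. This is an illustration only: `mentionsAny` is a hypothetical name, the real assertions in `VisionLiveTest` are not shown in this diff, and matching on whole tokens (so that `3` does not match `13`) is an assumption rather than the documented behavior.

```kotlin
// Hypothetical helper: true when the reply contains any keyword as a whole token.
fun mentionsAny(reply: String, keywords: Set<String>): Boolean {
    val tokens = reply.lowercase()
        .split(Regex("[^a-z0-9]+"))
        .filter { it.isNotEmpty() }
        .toSet()
    return keywords.any { it.lowercase() in tokens }
}

// Keyword sets taken from the documentation text.
val countKeywords = setOf("3", "three")
val houseKeywords = setOf("house", "home", "cottage", "building", "cabin", "barn")

fun main() {
    println(mentionsAny("There are three squares.", countKeywords))
    println(mentionsAny("A small cartoon House with two windows", houseKeywords))
}
```

A plain substring check would be looser still (it would accept plurals such as "houses"), at the cost of false positives on digits embedded in larger numbers.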

diff --git a/src/main/kotlin/agents_engine/model/ClaudeClient.kt b/src/main/kotlin/agents_engine/model/ClaudeClient.kt
@@ -389,7 +389,26 @@ open class ClaudeClient(
             val cacheControl = if (msg.cacheHint != null && consumeBreakpoint()) cacheControlJson(msg.cacheHint) else null
             when (msg.role) {
                 "user" -> {
-                    if (cacheControl == null) {
+                    val images = msg.images
+                    if (!images.isNullOrEmpty()) {
+                        // #2470 — vision input. Anthropic accepts a content
+                        // array of typed blocks; one text block + N image
+                        // blocks. Each image block is base64-source with a
+                        // typed media_type.
+                        val textBlock = """{"type":"text","text":${msg.content.toJsonString()}}"""
+                        val imageBlocks = images.joinToString(",") { part ->
+                            """{"type":"image","source":{"type":"base64","media_type":${part.wireMime.value.toJsonString()},"data":${part.base64.toJsonString()}}}"""
+                        }
+                        val allBlocks = "$textBlock,$imageBlocks"
+                        val withCache = if (cacheControl != null) {
+                            // Attach cache_control to the LAST block.
+                            val splitAt = allBlocks.lastIndexOf("}")
+                            allBlocks.substring(0, splitAt) + ",$cacheControl" + allBlocks.substring(splitAt)
+                        } else {
+                            allBlocks
+                        }
+                        """{"role":"user","content":[$withCache]}"""
+                    } else if (cacheControl == null) {
                         """{"role":"user","content":${msg.content.toJsonString()}}"""
                     } else {
                         // Single text content block with cache_control attached.
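
The `lastIndexOf("}")` splice in the hunk above can be checked in isolation. The sketch below assumes that `cacheControlJson(...)` returns a bare `"cache_control":{...}` key/value fragment (its source is not in this diff) and uses a hypothetical `attachToLastBlock` helper to show where the fragment lands: because the joined string ends with the last block, its final `}` is always that block's outer closing brace.

```kotlin
// Hypothetical helper mirroring the splice: insert `fragment` as the last
// member of the final JSON object in a comma-joined list of blocks.
fun attachToLastBlock(blocks: String, fragment: String): String {
    val splitAt = blocks.lastIndexOf("}")
    require(splitAt >= 0) { "no JSON object to attach to" }
    return blocks.substring(0, splitAt) + ",$fragment" + blocks.substring(splitAt)
}

fun main() {
    val text = """{"type":"text","text":"What is this?"}"""
    val image = """{"type":"image","source":{"type":"base64","media_type":"image/png","data":"QUJD"}}"""
    // Assumed fragment shape; "ephemeral" is Anthropic's documented cache_control type.
    val fragment = "\"cache_control\":{\"type\":\"ephemeral\"}"
    println(attachToLastBlock("$text,$image", fragment))
}
```

With these inputs the fragment becomes a sibling of `source` on the image block, and the text block is left untouched.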

diff --git a/src/main/kotlin/agents_engine/model/ModelClient.kt b/src/main/kotlin/agents_engine/model/ModelClient.kt
@@ -26,8 +26,58 @@ data class LlmMessage(
      * unchanged on the wire.
      */
     val cacheHint: CacheHint? = null,
+    /**
+     * #2470 — optional vision input. When non-null and the role is
+     * `"user"`, adapters translate each [ImagePart] into the provider's
+     * native image payload alongside [content]:
+     *
+     *   - Ollama (e.g. qwen3-vl:8b) — `images: [<base64>, ...]` array
+     *     on the user message; [content] stays the text prompt.
+     *   - Anthropic Claude — `content: [{type:"text",...},
+     *     {type:"image", source:{type:"base64", media_type:"image/png",
+     *     data:"<base64>"}}, ...]`.
+     *   - OpenAI — `content: [{type:"text",...},
+     *     {type:"image_url", image_url:{url:"data:image/png;base64,
+     *     <base64>"}}, ...]`.
+     *
+     * Null or empty = no vision parts; wire shape is byte-identical to
+     * pre-#2470. Images are emitted on every user message that carries
+     * them: typically the first turn ("describe this image"), and on
+     * later user turns too if the model supports multi-turn vision.
+     *
+     * Non-user roles ignore this field — system / assistant / tool
+     * messages don't carry images in any provider's API.
+     */
+    val images: List<ImagePart>? = null,
 )
 
+/**
+ * #2470 — base64-encoded image payload for vision input. The caller is
+ * responsible for the encoding so the adapter can splat the bytes onto
+ * the wire without re-encoding per provider. Wire MIME is closed via
+ * the [ImagePart.WireMime] sealed type — `String` mime is intentionally
+ * not accepted in the public ctor.
+ *
+ * Small, allocation-cheap. Equatability: `base64` is a `String`, so
+ * structural equals/hashCode work — unlike `ByteArray`, which uses
+ * identity equals (the trap we avoid by base64-encoding upfront).
+ */
+data class ImagePart(
+    /** Base64-encoded image bytes, no `data:` URL prefix. Adapter formats per-provider. */
+    val base64: String,
+    /** Closed wire MIME — `image/png`, `image/jpeg`, `image/gif`, `image/webp`. */
+    val wireMime: WireMime,
+) {
+    sealed interface WireMime {
+        val value: String
+
+        object Png : WireMime { override val value: String = "image/png" }
+        object Jpeg : WireMime { override val value: String = "image/jpeg" }
+        object Gif : WireMime { override val value: String = "image/gif" }
+        object Webp : WireMime { override val value: String = "image/webp" }
+    }
+}
+
 data class ToolCall(
     val name: String,
     val arguments: Map<String, Any?> = emptyMap(),
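
The KDoc's equality argument for `ImagePart` can be demonstrated with two throwaway data classes. `Base64Part`, `RawPart`, `SketchMime`, and `fileExtension` are hypothetical names used only for this sketch. It also shows the other benefit of a closed `WireMime`: a `when` over the sealed type is exhaustive without an `else` branch.

```kotlin
// String-backed payload: data-class equals/hashCode are structural.
data class Base64Part(val base64: String)

// ByteArray-backed payload: the generated equals compares array identity.
@Suppress("ArrayInDataClass")
data class RawPart(val bytes: ByteArray)

// Reduced stand-in for a closed MIME type (two variants for brevity).
sealed interface SketchMime {
    val value: String
    object Png : SketchMime { override val value = "image/png" }
    object Jpeg : SketchMime { override val value = "image/jpeg" }
}

// Exhaustive without `else`: adding a variant becomes a compile error here.
fun fileExtension(mime: SketchMime): String = when (mime) {
    SketchMime.Png -> "png"
    SketchMime.Jpeg -> "jpg"
}

fun main() {
    println(Base64Part("QUJD") == Base64Part("QUJD"))                   // true
    println(RawPart(byteArrayOf(1, 2)) == RawPart(byteArrayOf(1, 2)))   // false
    println(fileExtension(SketchMime.Png))                              // png
}
```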

diff --git a/src/main/kotlin/agents_engine/model/OllamaClient.kt b/src/main/kotlin/agents_engine/model/OllamaClient.kt
@@ -398,6 +398,19 @@ open class OllamaClient(
                     })
                     append("]")
                 }
+                // #2470 — vision input. Ollama's chat API accepts an `images`
+                // array of base64-encoded payloads (no data: prefix) on user
+                // messages. Vision-capable models (qwen3-vl, llama3.2-vision,
+                // llava) consume it; non-vision models ignore it without
+                // error. Mime is not on the wire — Ollama infers from the
+                // bytes; we keep the typed wireMime on ImagePart for audit
+                // + caller debugging only.
+                val images = msg.images
+                if (msg.role == "user" && !images.isNullOrEmpty()) {
+                    append(""","images":[""")
+                    append(images.joinToString(",") { it.base64.toJsonString() })
+                    append("]")
+                }
                 append("}")
             }
         }
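
The role gate and the back-compat path in the Ollama hunk can be exercised with a reduced serializer. This is a simplified sketch: `ollamaMessage` and `jsonString` are hypothetical helpers, and the real `OllamaClient` serializes additional fields that are omitted here.

```kotlin
// Minimal JSON string escaper (assumed helper; the PR uses its own toJsonString()).
fun jsonString(s: String): String = buildString {
    append('"')
    for (c in s) {
        when {
            c == '"' -> append("\\\"")
            c == '\\' -> append("\\\\")
            c == '\n' -> append("\\n")
            c == '\r' -> append("\\r")
            c == '\t' -> append("\\t")
            c < ' ' -> append("\\u" + c.code.toString(16).padStart(4, '0'))
            else -> append(c)
        }
    }
    append('"')
}

// Emits `images` only for user messages with a non-empty list; every other
// combination produces the same JSON as a message without images.
fun ollamaMessage(role: String, content: String, imagesBase64: List<String>? = null): String =
    buildString {
        append("""{"role":${jsonString(role)},"content":${jsonString(content)}""")
        if (role == "user" && !imagesBase64.isNullOrEmpty()) {
            append(""","images":[""")
            append(imagesBase64.joinToString(",") { jsonString(it) })
            append("]")
        }
        append("}")
    }

fun main() {
    println(ollamaMessage("user", "hi", listOf("QUJD", "REVG")))
    println(ollamaMessage("assistant", "hi", listOf("QUJD")))
    println(ollamaMessage("user", "hi"))
}
```

Treating `null` and an empty list identically keeps the no-images path a single code path, which is what makes a byte-identical back-compat guarantee straightforward to test.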