13 changes: 13 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,19 @@ All notable changes to Agents.KT are documented here. The format follows [Keep a

## [Unreleased]

### Added — Vision input across all providers (#2470 slice a)

- **`LlmMessage.images: List<ImagePart>? = null`** — new optional field; back-compat default leaves the wire shape byte-identical to pre-#2470 for callers that don't pass images. Closed `ImagePart(base64, wireMime)` with `WireMime` sealed type (`Png`, `Jpeg`, `Gif`, `Webp`) — `String` mime is intentionally not accepted in the public ctor.
- **Per-provider adapters** translate vision on `role = "user"` messages:
  - Ollama: `{role:"user", content:"text", images:["<b64>", ...]}` — works with `qwen3-vl:8b`, `llava`, `llama3.2-vision`, etc. Non-vision models silently ignore the field.
  - Claude: typed content array — `[{type:"text"}, {type:"image", source:{type:"base64", media_type:"image/png", data:"<b64>"}}, ...]`. Works with all Claude vision-capable models (Haiku 4.5, Sonnet 4.6, Opus 4.7).
  - OpenAI: typed content array — `[{type:"text"}, {type:"image_url", image_url:{url:"data:image/png;base64,<b64>"}}, ...]`. Works with gpt-4o, gpt-4o-mini, gpt-4-turbo, the o* reasoning models.
  - DeepSeek: inherits the OpenAI adapter shape; current DeepSeek models lack vision and silently ignore the field. Shape-tested; no live call to avoid spending on a no-op.
- **Role-gated:** non-user messages (system/assistant/tool) with non-null `images` ignore the field on the wire — no provider's API accepts images on those roles. Pinned by tests.
- **Programmatic fixtures** in `src/test`: `VisionFixtures.threeSquaresPng()` (256×256 red/blue/green squares for "count the squares" eval) and `VisionFixtures.housePng()` (256×256 cartoon house for "what is this?" eval). Rendered via `BufferedImage` + `ImageIO` — reproducible byte-for-byte across machines and CI, no external assets in the repo.
- **Live integration tests** (`VisionLiveTest`) cover all three vision-capable providers with cost discipline (`temperature = 0`, `maxTokens = 80`, single-turn, ~5KB base64 payloads): Ollama `qwen3-vl:8b` (tagged `live-llm`, runs via `:integrationTest`), Claude `claude-haiku-4-5` and OpenAI `gpt-4o-mini` (tagged `live-cloud-api`, runs in default `:test` with `assumeTrue` skipping when no key). Model names overridable via env. Assertion shape is loose keyword-match — robust against per-model phrasing variance.
- 8 wire-format unit tests pin per-provider JSON shape + the no-images back-compat path. See [docs/multimodal.md](docs/multimodal.md#vision-input--talking-to-the-model-2470-slice-a).

### Added — Multimodal foundation (#2465 epic, Stage 1)

- **Typed `Content` hierarchy (#2466)** — `sealed interface Content` with variants `Text`, `Image`, `Audio`, `Video`, `Document` in package `agents_engine.content`. Each non-text variant carries a `ContentRef` plus a typed mime (`ImageMime`, `AudioMime`, `VideoMime`, `DocMime`). Mime types are closed sealed interfaces with `wireMime: String` accessors — no `String` mime in any public API. Extension property `Content.modality: String` is the audit-stable per-variant name. Stage 1 wires Image + Document end-to-end (the modalities the 0.8 spec → product loop consumes); Audio + Video are modelled now and exercised through provider adapters in Stage 2 (#2470, deferred).
3 changes: 2 additions & 1 deletion README.md
@@ -155,7 +155,8 @@ These APIs work in `main`, are unit-tested, and are exercised by integration tes
- **Budget controls** — `budget { maxTurns; maxToolCalls; maxDuration; perToolTimeout; maxTokens; maxConsecutiveSameTool }` (`perToolTimeout` covers regular and session-aware tools; token counts cumulative across turns when the provider reports usage; `maxConsecutiveSameTool` catches LLM retry loops on a broken tool) (#637, #963, #969, #1903). `onBudgetExceeded { reason, currentLimit -> BudgetDecision.Extend(newLimit) }` raises a cap and continues instead of throwing — a long-running agent can grant itself more tool calls mid-run rather than failing (#2412). `BudgetDecision.Checkpoint` (#2749) is the third variant — pause at the cap, deliver a `SessionSnapshot` via the registered `onTurnCheckpoint` hook, throw a recoverable `BudgetCheckpointException`, and resume later via `agent.invokeSuspendResuming(input, resumeFrom = snapshot)` once the human approves a raise (no history replay).
- **Public snapshot / resume** — `agent.invokeSuspendResuming(input, resumeFrom = null, onTurnCheckpoint = null)` (#2749) is the public seam over the internal `executeAgentic(resumeFrom, onTurnCheckpoint)` primitives from #2416. With defaults it matches `invokeSuspend(input)` byte-for-byte; with `onTurnCheckpoint` set it captures a `SessionSnapshot` at every turn boundary; with `resumeFrom = snapshot` it continues an in-flight invocation without replaying history. On resume the loop honors `max(snapshot.toolCallLimit, agent.budget.maxToolCalls)` so a rebuilt agent with a raised cap actually picks it up.
- **Eval harness** — `DeterministicModelClient(LlmResponse.Text("..."), LlmResponse.ToolCalls(...))` (#2492) scripts model responses for reproducible eval without a live provider; the streaming flow folds into the same Started → ArgsDelta → Finished → End chunk sequence a native streaming provider would emit. Typed assertion DSL `eval<IN, OUT>("name") { input(...); expect { ... }; expectSnapshot(...) }` (#2493) runs against the parsed `OUT` — not regex on the wire. Snapshot mode pins `toLlmInput(output)` JSON for structural diffs; `evalSuite { + case; + case }` bundles cases. Optional `judge("tone", rubric)` (#2494) runs an advisory LLM-as-judge scorer with a typed `@Generable` `JudgeVerdict` — explicitly separate from the deterministic pass/fail contract (judges never gate). See [docs/eval.md](docs/eval.md).
- **Multimodal foundation** — `sealed Content { Text, Image(ref, ImageMime), Audio, Video, Document(ref, DocMime) }` (#2466) with closed mime types per modality (no `String` mime). Content-addressed `ContentRef(hash, sizeBytes, wireMime)` + `BlobStore` interface, `InMemoryBlobStore` / `FileBlobStore` impls — SHA-256 keys match the manifest-hash family, atomic tmp+rename, process-restart-safe, idempotent put (#2467). Tools can return `ToolResult(parts: List<Content>)` for mixed text + image + document outputs; JSONL audit exporter records `outputParts` per-part summary (`<modality>:<hash-prefix>:<size>:<mime>`) with no blob bytes in the audit row (#2469). Stage 1 wires Image + Document end-to-end; Audio + Video modelled now and exercised through provider adapters in Stage 2 (#2470, deferred). See [docs/multimodal.md](docs/multimodal.md).
- **Multimodal foundation** — `sealed Content { Text, Image(ref, ImageMime), Audio, Video, Document(ref, DocMime) }` (#2466) with closed mime types per modality (no `String` mime). Content-addressed `ContentRef(hash, sizeBytes, wireMime)` + `BlobStore` interface, `InMemoryBlobStore` / `FileBlobStore` impls — SHA-256 keys match the manifest-hash family, atomic tmp+rename, process-restart-safe, idempotent put (#2467). Tools can return `ToolResult(parts: List<Content>)` for mixed text + image + document outputs; JSONL audit exporter records `outputParts` per-part summary (`<modality>:<hash-prefix>:<size>:<mime>`) with no blob bytes in the audit row (#2469). Stage 1 wires Image + Document end-to-end. See [docs/multimodal.md](docs/multimodal.md).
- **Vision input to models** — `LlmMessage(role = "user", content = "...", images = listOf(ImagePart(base64, ImagePart.WireMime.Png)))` (#2470 slice a) reaches all four built-in adapters: Ollama emits `images: [<b64>...]`, Claude emits `{type:"image", source:{type:"base64",...}}` content blocks, OpenAI emits `{type:"image_url", image_url:{url:"data:..."}}` content blocks, DeepSeek inherits OpenAI (silently ignored on non-vision models). Closed `ImagePart.WireMime { Png, Jpeg, Gif, Webp }` — no `String` mime. Programmatic `VisionFixtures.threeSquaresPng()` / `housePng()` (256×256, `BufferedImage`-rendered, ~5KB) + per-provider live tests (qwen3-vl:8b / Haiku 4.5 / gpt-4o-mini) with cost discipline. See [docs/multimodal.md](docs/multimodal.md#vision-input--talking-to-the-model-2470-slice-a).
- **Prompt caching across providers** — `agent { caching { enabled = true; cacheSystemPrompt = true; cacheToolDefs = true; cacheConversation = Rolling; ttl = 1.hours; cacheable("doc-id") { ... } } }`. Vendor-neutral DSL drives Anthropic's explicit `cache_control` breakpoints (#2658), OpenAI / DeepSeek automatic prefix caching with a stable `prompt_cache_key` routing hint (#2659 / #2661), Ollama / vLLM / SGLang engine-level KV-cache reuse (no-op hints, #2662), and surfaces cache reads + writes + hit-rate on `TokenUsage` (#2663). A prefix-stability guard (#2657) detects silent cache-busters — timestamps, UUIDs, non-deterministic ordering inside cacheable segments — and warns before you pay for a single non-cached run. Off by default; non-breaking. See [docs/caching.md](docs/caching.md).
- **JSONL audit exporter** — `:agents-kt-observability` writes append-only, one-line-per-event audit rows with `requestId`, `sessionId`, `manifestHash`, agent/skill/tool ids, event type, provider, and model; raw arguments/results are omitted by default (#1914). See [docs/observability.md](docs/observability.md).
- **ObservabilityBridge adapters** — `.observe(OtelBridge(tracer))` maps runtime events to OTel spans (#1908), `.observe(LangSmithBridge(apiKey, project))` maps the same events to LangSmith run trees (#1909), and `.observe(LangfuseBridge(publicKey, secretKey))` maps them to Langfuse traces, generations, spans, and events (#1910), while keeping core vendor-free. See [docs/observability.md](docs/observability.md).
65 changes: 63 additions & 2 deletions docs/multimodal.md
@@ -143,10 +143,71 @@ The same discipline applies (when wired) to the OTel / LangSmith / Langfuse brid

Pairs with the #2754 manifest-hash restore guard: resume across an agent rebuild that changed tools (including the `BlobStore` wiring) fails closed unless the caller opts in.

## What's coming (the rest of #2465)
## Vision input — talking to the model (#2470 slice a)

The foundation above answers "how do tools return mixed content?" The follow-on slice answers "how does the agent send an image to a vision-capable model?"

`LlmMessage` gains an optional `images: List<ImagePart>` field. Adapters translate it per provider:

```kotlin
import agents_engine.model.ImagePart
import agents_engine.model.LlmMessage
import agents_engine.model.OllamaClient
import java.nio.file.Files
import java.nio.file.Path
import java.util.Base64

// Read the raw image bytes; the caller base64-encodes them before building the message.
val png: ByteArray = Files.readAllBytes(Path.of("screenshot.png"))
val client = OllamaClient(model = "qwen3-vl:8b", temperature = 0.0)

client.chat(listOf(
    LlmMessage(
        role = "user",
        content = "How many windows in this picture?",
        images = listOf(ImagePart(
            base64 = Base64.getEncoder().encodeToString(png),
            wireMime = ImagePart.WireMime.Png,
        )),
    ),
))
```

`ImagePart` carries the base64 payload plus a closed `WireMime` sealed type (`Png`, `Jpeg`, `Gif`, `Webp`). A `String` mime is not accepted in the public constructor, which is the same closed-mime discipline as `Content.Image`. The caller base64-encodes up front so each adapter can write the payload straight onto the wire without re-encoding per provider.
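
Because the mime type is closed, a caller that starts from a file on disk has to pick a `WireMime` variant explicitly. The sketch below shows one way that choice could look. It is an illustration only: `WireMime` here is a local stand-in that mirrors `ImagePart.WireMime` so the snippet compiles on its own, and `wireMimeForFileName` is a hypothetical helper, not part of Agents.KT.

```kotlin
// Local stand-in mirroring ImagePart.WireMime from this PR (assumption: same four variants).
sealed interface WireMime {
    val value: String

    object Png : WireMime { override val value = "image/png" }
    object Jpeg : WireMime { override val value = "image/jpeg" }
    object Gif : WireMime { override val value = "image/gif" }
    object Webp : WireMime { override val value = "image/webp" }
}

// Hypothetical helper: choose a WireMime from a file name's extension.
// Returns null for anything outside the closed set, so the caller has to
// handle unsupported formats instead of passing an arbitrary string through.
fun wireMimeForFileName(fileName: String): WireMime? =
    when (fileName.substringAfterLast('.', missingDelimiterValue = "").lowercase()) {
        "png" -> WireMime.Png
        "jpg", "jpeg" -> WireMime.Jpeg
        "gif" -> WireMime.Gif
        "webp" -> WireMime.Webp
        else -> null
    }
```

Returning `null` rather than a fallback type keeps the "no `String` mime" property intact at the call site: an unsupported extension becomes a decision the caller must make.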

| Provider | User-message shape |
|---|---|
| Ollama | `{role:"user", content:"text", images:["<b64>", ...]}` |
| Claude | `{role:"user", content:[{type:"text"}, {type:"image", source:{type:"base64", media_type:"image/png", data:"<b64>"}}, ...]}` |
| OpenAI | `{role:"user", content:[{type:"text"}, {type:"image_url", image_url:{url:"data:image/png;base64,<b64>"}}, ...]}` |
| DeepSeek | inherits OpenAI; current DeepSeek models lack vision and silently ignore the field |

**Back-compat:** `images = null` (or an empty list) produces a wire shape byte-identical to pre-#2470. Pinned by dedicated tests.

**Role-gating:** images are emitted only on `role = "user"` messages. System / assistant / tool messages with non-null `images` ignore the field on the wire — no provider's API carries images on those roles.
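
The table, the back-compat rule, and the role gate can be read together as one small serialization decision. The following is a minimal sketch of that decision for the Ollama and OpenAI shapes, assuming hand-rolled JSON building similar in spirit to the adapters. `MessageSketch`, `ImagePartSketch`, `toJsonLiteral`, `ollamaMessageJson`, and `openAiMessageJson` are all hypothetical names introduced for illustration; the real adapters live in `OllamaClient` and the OpenAI client and may differ in detail.

```kotlin
// Stand-in types so the sketch compiles on its own (assumption: simplified fields).
data class ImagePartSketch(val base64: String, val mime: String)
data class MessageSketch(
    val role: String,
    val content: String,
    val images: List<ImagePartSketch>? = null,
)

// Minimal JSON string escaping, sufficient for this illustration only.
fun String.toJsonLiteral(): String = buildString {
    append('"')
    for (ch in this@toJsonLiteral) when (ch) {
        '"' -> append("\\\"")
        '\\' -> append("\\\\")
        '\n' -> append("\\n")
        else -> append(ch)
    }
    append('"')
}

// Ollama-style shape: text stays in `content`, images go in a sibling array.
fun ollamaMessageJson(msg: MessageSketch): String = buildString {
    append("""{"role":${msg.role.toJsonLiteral()},"content":${msg.content.toJsonLiteral()}""")
    val images = msg.images
    // Role gate plus back-compat: only user messages with at least one image emit the field.
    if (msg.role == "user" && !images.isNullOrEmpty()) {
        append(""","images":[""")
        append(images.joinToString(",") { it.base64.toJsonLiteral() })
        append("]")
    }
    append("}")
}

// OpenAI-style shape: `content` becomes an array of typed blocks when images are present.
fun openAiMessageJson(msg: MessageSketch): String {
    // Same gate: null, empty, or non-user images fall back to the plain string content.
    val images = msg.images?.takeIf { msg.role == "user" && it.isNotEmpty() }
        ?: return """{"role":${msg.role.toJsonLiteral()},"content":${msg.content.toJsonLiteral()}}"""
    val text = """{"type":"text","text":${msg.content.toJsonLiteral()}}"""
    val parts = images.joinToString(",") {
        val url = "data:${it.mime};base64,${it.base64}"
        """{"type":"image_url","image_url":{"url":${url.toJsonLiteral()}}}"""
    }
    return """{"role":"user","content":[$text,$parts]}"""
}
```

In this sketch a system message that carries images serializes exactly like one without them, which is the property the role-gating tests are described as pinning.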

### Programmatic image fixtures

`VisionFixtures` (in `src/test`) ships two reproducible PNGs rendered via `BufferedImage`:

- `threeSquaresPng()` — 256×256, three colored squares (red/blue/green) on white with thick black outlines. Used for "count the squares" eval against cheap vision models.
- `housePng()` — 256×256 cartoon house: triangle roof + body + door + two windows. Used for "what is this?" eval.

No external assets, no binary blobs in the repo — every byte is reproducible across machines and CI runs.
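
For readers who have not rendered test images in code before, here is a rough sketch of how a fixture like `threeSquaresPng()` could be produced with `BufferedImage` and `ImageIO`. The real `VisionFixtures` source is not shown in this diff, so the layout, stroke width, and the names `renderThreeSquaresPng` and `decodePng` are assumptions made for illustration.

```kotlin
import java.awt.BasicStroke
import java.awt.Color
import java.awt.image.BufferedImage
import java.io.ByteArrayInputStream
import java.io.ByteArrayOutputStream
import javax.imageio.ImageIO

// Hypothetical re-creation of a "three squares" fixture: red, blue, and green
// squares in a row on a white background, each with a black outline.
fun renderThreeSquaresPng(size: Int = 256): ByteArray {
    val image = BufferedImage(size, size, BufferedImage.TYPE_INT_RGB)
    val g = image.createGraphics()
    try {
        g.color = Color.WHITE
        g.fillRect(0, 0, size, size)
        val side = size / 4                 // square edge length
        val gap = (size - 3 * side) / 4     // equal spacing around the squares
        val top = (size - side) / 2         // vertically centered
        listOf(Color.RED, Color.BLUE, Color.GREEN).forEachIndexed { i, fill ->
            val left = gap + i * (side + gap)
            g.color = fill
            g.fillRect(left, top, side, side)
            g.color = Color.BLACK
            g.stroke = BasicStroke(4f)
            g.drawRect(left, top, side, side)
        }
    } finally {
        g.dispose()
    }
    val out = ByteArrayOutputStream()
    check(ImageIO.write(image, "png", out)) { "no PNG writer available" }
    return out.toByteArray()
}

// Decodes PNG bytes back into an image, e.g. to inspect dimensions or pixels.
fun decodePng(bytes: ByteArray): BufferedImage =
    checkNotNull(ImageIO.read(ByteArrayInputStream(bytes))) { "not a decodable image" }
```

Anti-aliasing is left at its default (off) in this sketch, so the rendering involves only integer-aligned fills and strokes; that is one plausible way to keep the output stable between runs.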

### Live tests

`VisionLiveTest.kt` runs the two fixtures against all three vision-capable providers, with cost discipline (256×256 PNG ~5KB, `temperature = 0`, `maxTokens = 80`, single-turn):

| Provider | Default model | How to run |
|---|---|---|
| Ollama | `qwen3-vl:8b` | `./gradlew integrationTest --tests "*VisionLiveTest*"` (tagged `live-llm`) |
| Claude | `claude-haiku-4-5` | `./gradlew test --tests "*VisionLiveTest*"` (tagged `live-cloud-api`; `assumeTrue` skips when no key) |
| OpenAI | `gpt-4o-mini` | same as Claude (tagged `live-cloud-api`; `assumeTrue` skips when no key) |

Models overridable via env (`AGENTSKT_TEST_OLLAMA_VISION_MODEL`, `AGENTSKT_TEST_CLAUDE_VISION_MODEL`, `AGENTSKT_TEST_OPENAI_VISION_MODEL`).
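
An env override of this kind usually reduces to "read the variable, fall back to the default". The sketch below shows that pattern under stated assumptions: `resolveModel` is a hypothetical helper, the lookup is injectable purely so the logic can be exercised without touching the real environment, and treating a blank value as unset is a choice made here, not something the test suite is known to do.

```kotlin
// Hypothetical helper: resolve a model name from an environment variable,
// falling back to a default when the variable is unset or blank.
fun resolveModel(
    envVar: String,
    default: String,
    lookup: (String) -> String? = { name -> System.getenv(name) },
): String = lookup(envVar)?.takeIf { it.isNotBlank() } ?: default
```

A call such as `resolveModel("AGENTSKT_TEST_OLLAMA_VISION_MODEL", "qwen3-vl:8b")` would then yield the default model unless the variable is set.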

Assertion shape is loose: the test passes if the model's text response mentions one of a small acceptable keyword set (`3` / `three` for counting; `house` / `home` / `cottage` / `building` / `cabin` / `barn` for the house). Goal is "did the image reach the model and elicit a sensible reply" — not exact phrasing.
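
The loose assertion described above can be sketched as a case-insensitive substring check against a keyword set. The keyword sets are copied from the paragraph; the function name `mentionsAny` and the exact matching rule are assumptions, since the body of `VisionLiveTest` is not part of this diff.

```kotlin
// Hypothetical matcher: true if the reply contains any accepted keyword, ignoring case.
fun mentionsAny(reply: String, keywords: Set<String>): Boolean {
    val normalized = reply.lowercase()
    return keywords.any { it.lowercase() in normalized }
}

// Keyword sets as listed in the documentation above.
val countingKeywords = setOf("3", "three")
val houseKeywords = setOf("house", "home", "cottage", "building", "cabin", "barn")
```

Plain substring matching is deliberately permissive: a reply containing "13" would also satisfy the counting check in this sketch. That trade-off fits the stated goal of confirming that the image reached the model, not grading the answer.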

## What's still coming (rest of #2465)

- **#2468** Compile-time modality routing — `Agent<Image, X>` becomes a real type; cross-modality miswiring is a compile error. Multi-part `@Generable` inputs via KSP.
- **#2470** Provider adapters — Claude vision, OpenAI vision, Gemini, Ollama multimodal. Translates `Content` → provider-specific payload at the wire.
- **#2470 (slice b)** `Content` → `LlmMessage.images` translation at the agentic loop — currently the caller dereferences `ContentRef` → bytes → `ImagePart` manually. Sliced this way to land the wire format first; the loop hook is a small follow-up.
- **#2471** Manifest-anchored modality capability — declared per-agent modalities recorded in the permission manifest, validated against provider capabilities at build time.
- **#2472** Multimodal memory — `MemoryBank` entries carry `ContentRef` for image/audio/video state.
- **#2473** Testing fixtures + snapshot + mutation coverage.
21 changes: 20 additions & 1 deletion src/main/kotlin/agents_engine/model/ClaudeClient.kt
@@ -389,7 +389,26 @@ open class ClaudeClient(
val cacheControl = if (msg.cacheHint != null && consumeBreakpoint()) cacheControlJson(msg.cacheHint) else null
when (msg.role) {
    "user" -> {
        if (cacheControl == null) {
        val images = msg.images
        if (!images.isNullOrEmpty()) {
            // #2470 — vision input. Anthropic accepts a content
            // array of typed blocks; one text block + N image
            // blocks. Each image block is base64-source with a
            // typed media_type.
            val textBlock = """{"type":"text","text":${msg.content.toJsonString()}}"""
            val imageBlocks = images.joinToString(",") { part ->
                """{"type":"image","source":{"type":"base64","media_type":${part.wireMime.value.toJsonString()},"data":${part.base64.toJsonString()}}}"""
            }
            val allBlocks = "$textBlock,$imageBlocks"
            val withCache = if (cacheControl != null) {
                // Attach cache_control to the LAST block.
                val splitAt = allBlocks.lastIndexOf("}")
                allBlocks.substring(0, splitAt) + ",$cacheControl" + allBlocks.substring(splitAt)
            } else {
                allBlocks
            }
            """{"role":"user","content":[$withCache]}"""
        } else if (cacheControl == null) {
            """{"role":"user","content":${msg.content.toJsonString()}}"""
        } else {
            // Single text content block with cache_control attached.
50 changes: 50 additions & 0 deletions src/main/kotlin/agents_engine/model/ModelClient.kt
@@ -26,8 +26,58 @@ data class LlmMessage(
     * unchanged on the wire.
     */
    val cacheHint: CacheHint? = null,
    /**
     * #2470 — optional vision input. When non-null and the role is
     * `"user"`, adapters translate each [ImagePart] into the provider's
     * native image payload alongside [content]:
     *
     * - Ollama (e.g. qwen3-vl:8b) — `images: [<base64>, ...]` array
     * on the user message; [content] stays the text prompt.
     * - Anthropic Claude — `content: [{type:"text",...},
     * {type:"image", source:{type:"base64", media_type:"image/png",
     * data:"<base64>"}}, ...]`.
     * - OpenAI — `content: [{type:"text",...},
     * {type:"image_url", image_url:{url:"data:image/png;base64,
     * <base64>"}}, ...]`.
     *
     * Null = no vision parts; wire shape is byte-identical to pre-#2470.
     * Vision works on the FIRST user turn (most common case for "describe
     * this image" prompts); subsequent user-turn images compose naturally
     * if the model supports multi-turn vision.
     *
     * Non-user roles ignore this field — system / assistant / tool
     * messages don't carry images in any provider's API.
     */
    val images: List<ImagePart>? = null,
)

/**
 * #2470 — base64-encoded image payload for vision input. The caller is
 * responsible for the encoding so the adapter can splat the bytes onto
 * the wire without re-encoding per provider. Wire MIME is closed via
 * the [ImagePart.WireMime] sealed type — `String` mime is intentionally
 * not accepted in the public ctor.
 *
 * Small, allocation-cheap. Equatability: `base64` is a `String`, so
 * structural equals/hashCode work — unlike `ByteArray`, which uses
 * identity equals (the trap we avoid by base64-encoding upfront).
 */
data class ImagePart(
    /** Base64-encoded image bytes, no `data:` URL prefix. Adapter formats per-provider. */
    val base64: String,
    /** Closed wire MIME — `image/png`, `image/jpeg`, `image/gif`, `image/webp`. */
    val wireMime: WireMime,
) {
    sealed interface WireMime {
        val value: String

        object Png : WireMime { override val value: String = "image/png" }
        object Jpeg : WireMime { override val value: String = "image/jpeg" }
        object Gif : WireMime { override val value: String = "image/gif" }
        object Webp : WireMime { override val value: String = "image/webp" }
    }
}

data class ToolCall(
    val name: String,
    val arguments: Map<String, Any?> = emptyMap(),
13 changes: 13 additions & 0 deletions src/main/kotlin/agents_engine/model/OllamaClient.kt
@@ -398,6 +398,19 @@ open class OllamaClient(
            })
            append("]")
        }
        // #2470 — vision input. Ollama's chat API accepts an `images`
        // array of base64-encoded payloads (no data: prefix) on user
        // messages. Vision-capable models (qwen3-vl, llama3.2-vision,
        // llava) consume it; non-vision models ignore it without
        // error. Mime is not on the wire — Ollama infers from the
        // bytes; we keep the typed wireMime on ImagePart for audit
        // + caller debugging only.
        val images = msg.images
        if (msg.role == "user" && !images.isNullOrEmpty()) {
            append(""","images":[""")
            append(images.joinToString(",") { it.base64.toJsonString() })
            append("]")
        }
        append("}")
    }
}