11 changes: 11 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,17 @@ All notable changes to Agents.KT are documented here. The format follows [Keep a

## [Unreleased]

### Added — Typed agent attachments (#2470 slice b)

- **`agent.invokeWithAttachments(input, attachments)`** + suspending sibling `invokeSuspendWithAttachments` — user-facing API for vision input via typed `Content.Image`. The runtime dereferences each ref against the agent's injected `BlobStore`, base64-encodes once, and attaches `ImagePart` to the first user `LlmMessage`. Per-provider wire translation is the slice-a work — this commit routes the typed surface into it.
- **`Agent.blobStore: BlobStore?` + `blobStore(store)` DSL** — optional injection; null when the agent doesn't take attachments. Passing attachments to an agent with no `blobStore` errors fast at invoke time with a clear message — caller misconfiguration surfaces before any provider HTTP.
- **Closed mime mapping** — `ImageMime → ImagePart.WireMime` for all four variants (`Png`, `Jpeg`, `Gif`, `Webp`). No `String` conversion at any boundary.
- **Forensic-friendly errors** — when a ref's blob is missing from the store, the error names the ref's hash prefix. Helps debug snapshot resumes against partially-purged stores.
- **Non-image variants skipped in v1** — `Content.Text` / `Document` / `Audio` / `Video` flow through the attachment path as no-ops. Slice c will wire Document via provider doc-input adapters; Audio/Video land in Stage 2.
- **Empty / all-skipped attachments → null images** — no provider sees an empty array; legacy wire shape preserved.
- **Resume composition** — `attachments` argument is ignored on resume because the restored conversation already carries the original `LlmMessage.images` on the saved user turn.
- **Tests:** 8 unit cases (`AgentAttachmentsTest`) + 6 live cases (`AgentVisionLiveTest`) running the same `VisionFixtures` from slice a through the agent surface on Ollama qwen3-vl:8b, Claude Haiku 4.5, OpenAI gpt-4o-mini. See [docs/multimodal.md](docs/multimodal.md#agent-attachments--typed-contentimage-at-the-invoke-surface-2470-slice-b).

### Added — Vision input across all providers (#2470 slice a)

- **`LlmMessage.images: List<ImagePart>? = null`** — new optional field; back-compat default leaves the wire shape byte-identical to pre-#2470 for callers that don't pass images. Closed `ImagePart(base64, wireMime)` with `WireMime` sealed type (`Png`, `Jpeg`, `Gif`, `Webp`) — `String` mime is intentionally not accepted in the public ctor.
1 change: 1 addition & 0 deletions README.md
@@ -157,6 +157,7 @@ These APIs work in `main`, are unit-tested, and are exercised by integration tes
- **Eval harness** — `DeterministicModelClient(LlmResponse.Text("..."), LlmResponse.ToolCalls(...))` (#2492) scripts model responses for reproducible eval without a live provider; the streaming flow folds into the same Started → ArgsDelta → Finished → End chunk sequence a native streaming provider would emit. Typed assertion DSL `eval<IN, OUT>("name") { input(...); expect { ... }; expectSnapshot(...) }` (#2493) runs against the parsed `OUT` — not regex on the wire. Snapshot mode pins `toLlmInput(output)` JSON for structural diffs; `evalSuite { + case; + case }` bundles cases. Optional `judge("tone", rubric)` (#2494) runs an advisory LLM-as-judge scorer with a typed `@Generable` `JudgeVerdict` — explicitly separate from the deterministic pass/fail contract (judges never gate). See [docs/eval.md](docs/eval.md).
- **Multimodal foundation** — `sealed Content { Text, Image(ref, ImageMime), Audio, Video, Document(ref, DocMime) }` (#2466) with closed mime types per modality (no `String` mime). Content-addressed `ContentRef(hash, sizeBytes, wireMime)` + `BlobStore` interface, `InMemoryBlobStore` / `FileBlobStore` impls — SHA-256 keys match the manifest-hash family, atomic tmp+rename, process-restart-safe, idempotent put (#2467). Tools can return `ToolResult(parts: List<Content>)` for mixed text + image + document outputs; JSONL audit exporter records `outputParts` per-part summary (`<modality>:<hash-prefix>:<size>:<mime>`) with no blob bytes in the audit row (#2469). Stage 1 wires Image + Document end-to-end. See [docs/multimodal.md](docs/multimodal.md).
- **Vision input to models** — `LlmMessage(role = "user", content = "...", images = listOf(ImagePart(base64, ImagePart.WireMime.Png)))` (#2470 slice a) reaches all four built-in adapters: Ollama emits `images: [<b64>...]`, Claude emits `{type:"image", source:{type:"base64",...}}` content blocks, OpenAI emits `{type:"image_url", image_url:{url:"data:..."}}` content blocks, DeepSeek inherits OpenAI (silently ignored on non-vision models). Closed `ImagePart.WireMime { Png, Jpeg, Gif, Webp }` — no `String` mime. Programmatic `VisionFixtures.threeSquaresPng()` / `housePng()` (256×256, `BufferedImage`-rendered, ~5KB) + per-provider live tests (qwen3-vl:8b / Haiku 4.5 / gpt-4o-mini) with cost discipline. See [docs/multimodal.md](docs/multimodal.md#vision-input--talking-to-the-model-2470-slice-a).
- **Typed `Content.Image` at the agent surface** — `agent.invokeWithAttachments("describe", attachments = listOf(Content.Image(ref, ImageMime.Png)))` (#2470 slice b). Inject a `BlobStore` via `blobStore(store)` in the agent DSL; the runtime dereferences each `Content.Image` against the store, base64-encodes once, and attaches `ImagePart` to the first user message. Closed `ImageMime → ImagePart.WireMime` mapping covers all four variants. Misconfiguration errors fast (no `blobStore` configured, missing blob for a ref). Composes with snapshot/resume — refs travel in the snapshot; the same store dereferences on resume. Suspending sibling `invokeSuspendWithAttachments`. Live tests across all three vision providers via the agent surface. See [docs/multimodal.md](docs/multimodal.md#agent-attachments--typed-contentimage-at-the-invoke-surface-2470-slice-b).
- **Prompt caching across providers** — `agent { caching { enabled = true; cacheSystemPrompt = true; cacheToolDefs = true; cacheConversation = Rolling; ttl = 1.hours; cacheable("doc-id") { ... } } }`. Vendor-neutral DSL drives Anthropic's explicit `cache_control` breakpoints (#2658), OpenAI / DeepSeek automatic prefix caching with a stable `prompt_cache_key` routing hint (#2659 / #2661), Ollama / vLLM / SGLang engine-level KV-cache reuse (no-op hints, #2662), and surfaces cache reads + writes + hit-rate on `TokenUsage` (#2663). A prefix-stability guard (#2657) detects silent cache-busters — timestamps, UUIDs, non-deterministic ordering inside cacheable segments — and warns before you pay for a single non-cached run. Off by default; non-breaking. See [docs/caching.md](docs/caching.md).
- **JSONL audit exporter** — `:agents-kt-observability` writes append-only, one-line-per-event audit rows with `requestId`, `sessionId`, `manifestHash`, agent/skill/tool ids, event type, provider, and model; raw arguments/results are omitted by default (#1914). See [docs/observability.md](docs/observability.md).
- **ObservabilityBridge adapters** — `.observe(OtelBridge(tracer))` maps runtime events to OTel spans (#1908), `.observe(LangSmithBridge(apiKey, project))` maps the same events to LangSmith run trees (#1909), and `.observe(LangfuseBridge(publicKey, secretKey))` maps them to Langfuse traces, generations, spans, and events (#1910), while keeping core vendor-free. See [docs/observability.md](docs/observability.md).
50 changes: 49 additions & 1 deletion docs/multimodal.md
@@ -204,10 +204,58 @@ Models overridable via env (`AGENTSKT_TEST_OLLAMA_VISION_MODEL`, `AGENTSKT_TEST_

Assertion shape is loose: the test passes if the model's text response mentions one of a small acceptable keyword set (`3` / `three` for counting; `house` / `home` / `cottage` / `building` / `cabin` / `barn` for the house). Goal is "did the image reach the model and elicit a sensible reply" — not exact phrasing.
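
A minimal sketch of that loose assertion, assuming a plain case-insensitive substring match. `mentionsAny` is a hypothetical helper written for illustration; the matching logic in the real live tests is not shown here and may differ.

```kotlin
// Hypothetical helper: true when the reply contains at least one accepted keyword.
// Matching is a case-insensitive substring check, so it is deliberately permissive.
fun mentionsAny(reply: String, keywords: Set<String>): Boolean {
    val normalized = reply.lowercase()
    return keywords.any { normalized.contains(it.lowercase()) }
}

fun main() {
    val countKeywords = setOf("3", "three")
    val houseKeywords = setOf("house", "home", "cottage", "building", "cabin", "barn")

    println(mentionsAny("I see three red squares.", countKeywords))        // true
    println(mentionsAny("A small Cottage with a red roof.", houseKeywords)) // true
    println(mentionsAny("A blue circle.", countKeywords))                   // false
}
```

Substring matching also accepts words such as "homework" for `home`; that is acceptable when the only question is whether the image reached the model.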

## Agent attachments — typed `Content.Image` at the invoke surface (#2470 slice b)

Slice a wires the per-provider wire format on `LlmMessage.images`. Slice b puts a clean user-facing API on top: the caller passes typed `Content.Image` (carrying a `ContentRef`) at the agent's invoke surface; the runtime dereferences against an injected `BlobStore`, base64-encodes once, and attaches `ImagePart` to the first user message.

```kotlin
import agents_engine.content.Content
import agents_engine.content.ImageMime
import agents_engine.content.FileBlobStore
import java.nio.file.Path

val store = FileBlobStore(Path.of("snapshots/blobs"))

val agent = agent<String, String>("vision") {
    model { ollama("qwen3-vl:8b") }
    blobStore(store)
    skills { skill<String, String>("describe", "") { tools() } }
}

// pngBytes: a ByteArray holding the PNG to attach (for example, read from disk).
val ref = store.put(pngBytes, ImageMime.Png.wireMime)
val reply = agent.invokeWithAttachments(
    "What is in this image?",
    attachments = listOf(Content.Image(ref, ImageMime.Png)),
)
```
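
The put, dereference, and encode steps behind that call can be sketched without the library. Everything below is a hypothetical stand-in (`SketchRef`, `SketchBlobStore`, `SketchImagePart`, `SketchMime`), not the Agents.KT types; it only assumes what this document states: refs are keyed by SHA-256, `put` is idempotent, and each blob is base64-encoded once.

```kotlin
import java.security.MessageDigest
import java.util.Base64

// Hypothetical stand-ins for illustration; the real types live in Agents.KT.
enum class SketchMime { Png, Jpeg, Gif, Webp }

data class SketchRef(val hash: String, val sizeBytes: Long, val wireMime: SketchMime)

data class SketchImagePart(val base64: String, val wireMime: SketchMime)

class SketchBlobStore {
    private val blobs = HashMap<String, ByteArray>()

    // Content-addressed put: the key is the SHA-256 of the bytes,
    // so storing the same bytes twice yields the same ref.
    fun put(bytes: ByteArray, wireMime: SketchMime): SketchRef {
        val hash = MessageDigest.getInstance("SHA-256")
            .digest(bytes)
            .joinToString("") { "%02x".format(it) }
        blobs.putIfAbsent(hash, bytes.copyOf())
        return SketchRef(hash, bytes.size.toLong(), wireMime)
    }

    fun get(ref: SketchRef): ByteArray? = blobs[ref.hash]
}

// Dereference the ref and base64-encode the bytes exactly once.
// The error text is an assumption; only "names the hash prefix" comes from the docs.
fun toImagePart(store: SketchBlobStore, ref: SketchRef): SketchImagePart {
    val bytes = store.get(ref) ?: error("Missing blob for ref ${ref.hash.take(12)}")
    return SketchImagePart(Base64.getEncoder().encodeToString(bytes), ref.wireMime)
}

fun main() {
    val store = SketchBlobStore()
    val ref = store.put(byteArrayOf(1, 2, 3), SketchMime.Png)
    println(store.put(byteArrayOf(1, 2, 3), SketchMime.Png) == ref) // true
    println(toImagePart(store, ref).base64)                         // AQID
}
```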

### What the runtime guarantees

- **First user message only.** Attachments ride on the initial user turn. Multi-turn vision is composable but each invocation owns its own first-turn attachments.
- **Closed-mime mapping.** `ImageMime → ImagePart.WireMime` for all four variants — `Png`, `Jpeg`, `Gif`, `Webp`. No String conversions anywhere.
- **Fail-fast on misconfiguration.** Passing attachments to an agent with no `blobStore` configured errors fast at invoke time with `Agent '<name>' has attachments but no blobStore`. A ref pointing at a missing blob errors fast with the ref's hash prefix in the message for forensics.
- **Skip non-image variants.** `Content.Text` / `Content.Document` / `Content.Audio` / `Content.Video` are silently skipped in v1. Slice c will wire Document via provider doc-input adapters; Audio and Video land in Stage 2.
- **Back-compat.** `agent.invokeSuspend(input)` (without attachments) stays byte-identical on the wire. The attachments path is purely additive — opt in via the new `invokeWithAttachments` / `invokeSuspendWithAttachments` entry points.
- **Snapshot/resume composition.** On resume the saved user turn already carries the original `LlmMessage.images`; the runtime ignores the `attachments` argument because the conversation was restored intact.
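
The guarantees above can be read as one small resolution function. This is a sketch under stated assumptions, not the runtime's code: all `*Sketch` types and `resolveImages` are hypothetical, the store is modelled as a nullable `Map`, the exception type and the missing-blob message are guesses, and the order of the checks is assumed.

```kotlin
import java.util.Base64

// Hypothetical stand-ins for the closed mime types and Content variants.
enum class ImageMimeSketch { Png, Jpeg, Gif, Webp }
enum class WireMimeSketch { Png, Jpeg, Gif, Webp }

sealed interface ContentSketch {
    data class Text(val text: String) : ContentSketch
    data class Image(val hash: String, val mime: ImageMimeSketch) : ContentSketch
    data class Document(val hash: String) : ContentSketch
}

data class ImagePartSketch(val base64: String, val wireMime: WireMimeSketch)

// Closed mapping: the `when` is exhaustive, so a new variant fails to compile until mapped.
fun ImageMimeSketch.toWire(): WireMimeSketch = when (this) {
    ImageMimeSketch.Png -> WireMimeSketch.Png
    ImageMimeSketch.Jpeg -> WireMimeSketch.Jpeg
    ImageMimeSketch.Gif -> WireMimeSketch.Gif
    ImageMimeSketch.Webp -> WireMimeSketch.Webp
}

fun resolveImages(
    agentName: String,
    store: Map<String, ByteArray>?, // null models "no blobStore configured"
    attachments: List<ContentSketch>,
): List<ImagePartSketch>? {
    // Assumption: an empty list is treated as "no attachments" before the store check.
    if (attachments.isEmpty()) return null
    val blobs = store ?: error("Agent '$agentName' has attachments but no blobStore")
    val parts = attachments
        .filterIsInstance<ContentSketch.Image>() // non-image variants are skipped
        .map { image ->
            val bytes = blobs[image.hash]
                ?: error("Missing blob for ref ${image.hash.take(12)}") // hypothetical wording
            ImagePartSketch(Base64.getEncoder().encodeToString(bytes), image.mime.toWire())
        }
    // All-skipped input yields null rather than an empty list.
    return parts.ifEmpty { null }
}

fun main() {
    val blobs = mapOf("abc123" to byteArrayOf(1, 2, 3))
    val image = ContentSketch.Image("abc123", ImageMimeSketch.Png)

    println(resolveImages("vision", blobs, listOf(image)))                     // one Png part, base64 AQID
    println(resolveImages("vision", blobs, listOf(ContentSketch.Text("hi")))) // null
    println(runCatching { resolveImages("vision", null, listOf(image)) }
        .exceptionOrNull()?.message) // Agent 'vision' has attachments but no blobStore
}
```

Returning `null` instead of an empty list is what keeps the wire shape unchanged for callers whose attachments were all skipped.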

### Suspending vs blocking

| Entry point | Use when |
|---|---|
| `agent.invokeSuspendWithAttachments(input, attachments)` | Inside coroutine scopes — composition operators, structured concurrency. |
| `agent.invokeWithAttachments(input, attachments)` | Outside coroutine scopes — quick scripts, REPL, blocking glue. Thin `runBlocking` shim. |

Mirrors the existing `invokeSuspend` / `invoke` split.
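
The shape of that split, sketched with a hypothetical `SketchAgent` that takes plain strings instead of `Content` values. Only `runBlocking` and `delay` are real `kotlinx.coroutines` calls; the class and its return value are illustrative assumptions.

```kotlin
import kotlinx.coroutines.delay
import kotlinx.coroutines.runBlocking

// Hypothetical agent stand-in: the suspending function does the work,
// the blocking function is a thin shim over it.
class SketchAgent {
    suspend fun invokeSuspendWithAttachments(input: String, attachments: List<String>): String {
        delay(10) // stands in for the provider round trip
        return "$input (${attachments.size} attachment(s))"
    }

    // Blocks the calling thread until the suspending call completes.
    fun invokeWithAttachments(input: String, attachments: List<String>): String =
        runBlocking { invokeSuspendWithAttachments(input, attachments) }
}

fun main() {
    val agent = SketchAgent()
    println(agent.invokeWithAttachments("describe", listOf("ref-1"))) // describe (1 attachment(s))
}
```

Because `runBlocking` holds the calling thread for the whole call, code that is already inside a coroutine would normally call the suspending variant directly.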

### Live tests

`AgentVisionLiveTest.kt` runs the two `VisionFixtures` (`threeSquaresPng()` + `housePng()`) through the agent surface on all three vision-capable providers. Same cost discipline as slice a — 256×256 PNG, `temperature = 0`, `maxTokens = 80`, single-turn. Tagged `live-llm` (Ollama) / `live-cloud-api` (Claude + OpenAI); `assumeTrue` skips per-provider when no key.

The slice-b live tests complement the slice-a tests: the slice-a `VisionLiveTest` exercises the raw `ModelClient`; slice-b's `AgentVisionLiveTest` exercises the full agent loop including BlobStore deref.

## What's still coming (rest of #2465)

- **#2468** Compile-time modality routing — `Agent<Image, X>` becomes a real type; cross-modality miswiring is a compile error. Multi-part `@Generable` inputs via KSP.
- **#2470 (slice b)** `Content` → `LlmMessage.images` translation at the agentic loop — currently the caller dereferences `ContentRef` → bytes → `ImagePart` manually. Sliced this way to land the wire format first; the loop hook is a small follow-up.
- **#2470 slice c** Document/Audio/Video provider-input adapters — currently only images flow through the wire; Document/Audio/Video Content variants are skipped on the attachment path.
- **#2471** Manifest-anchored modality capability — declared per-agent modalities recorded in the permission manifest, validated against provider capabilities at build time.
- **#2472** Multimodal memory — `MemoryBank` entries carry `ContentRef` for image/audio/video state.
- **#2473** Testing fixtures + snapshot + mutation coverage.
92 changes: 92 additions & 0 deletions src/main/kotlin/agents_engine/core/Agent.kt
@@ -191,6 +191,18 @@ class Agent<IN, OUT>(
        private set
    var skillChosenListener: ((name: String) -> Unit)? = null
        private set
    /**
     * #2470 slice b: optional [agents_engine.content.BlobStore] for
     * dereferencing `Content.Image` attachments at the agent invoke
     * surface. When the caller passes `attachments = listOf(Content.Image(
     * ref, mime))`, the runtime reads the bytes from this store and builds
     * the corresponding [agents_engine.model.ImagePart] for the first user
     * LlmMessage. Null when the agent doesn't accept image attachments;
     * passing attachments to such an agent errors fast at invoke time
     * with a clear message.
     */
    var blobStore: agents_engine.content.BlobStore? = null
        private set
    var memoryBank: MemoryBank? = null
        private set
    var routerRationaleListener: ((rationale: String) -> Unit)? = null
@@ -547,6 +559,36 @@ class Agent<IN, OUT>(
        }
    }

    /**
     * #2470 slice b: inject a [agents_engine.content.BlobStore] so the
     * agent can dereference `Content.Image` attachments at invoke time.
     *
     * ```kotlin
     * val store = FileBlobStore(Path.of("blobs"))
     * val agent = agent<String, String>("vision") {
     *     model { ollama("qwen3-vl:8b") }
     *     blobStore(store)
     *     skills { skill<String, String>("describe", "") { tools() } }
     * }
     *
     * val ref = store.put(pngBytes, ImageMime.Png.wireMime)
     * val out = agent.invokeWithAttachments(
     *     "What is in this image?",
     *     attachments = listOf(Content.Image(ref, ImageMime.Png)),
     * )
     * ```
     *
     * The runtime reads the blob from this store, base64-encodes once,
     * and attaches it to the first user LlmMessage as
     * `images: List<ImagePart>`. Per-provider wire translation is the
     * #2470 slice-a work in `OllamaClient` / `ClaudeClient` /
     * `OpenAiClient`.
     */
    fun blobStore(store: agents_engine.content.BlobStore) {
        checkNotFrozen()
        blobStore = store
    }

    fun tools(block: ToolsBuilder.() -> Unit) {
        checkNotFrozen()
        val builder = ToolsBuilder()
@@ -602,6 +644,46 @@ class Agent<IN, OUT>(
            invokeSuspendForSession(input, emitter = null) { /* no-op */ }
        }

    /**
     * #2470 slice b: suspending entry point with image attachments. The
     * caller passes `attachments = listOf(Content.Image(ref, mime), ...)`;
     * the runtime dereferences each ref against [blobStore], base64-encodes
     * each blob once, and attaches the images to the first user LlmMessage.
     * Per-provider wire translation is the slice-a work (Ollama / Claude /
     * OpenAI all already implement the wire format for `LlmMessage.images`).
     *
     * Errors fast with a clear message when:
     * - [blobStore] is null but [attachments] are passed
     * - A ref's blob is missing from the store (purged / rewired)
     *
     * Text / Document / Audio / Video variants in [attachments] are
     * silently skipped in v1. Document is planned for slice c of #2470
     * via provider doc-input adapters; Audio / Video land in Stage 2.
     */
    suspend fun invokeSuspendWithAttachments(
        input: IN,
        attachments: List<agents_engine.content.Content>,
    ): OUT =
        withAgentRuntimeContext(newRuntimeContext()) {
            invokeSuspendForSession(
                input = input,
                emitter = null,
                attachments = attachments,
            ) { /* no-op */ }
        }

    /**
     * #2470 slice b: blocking shim over [invokeSuspendWithAttachments]
     * for callers outside coroutine scopes. Mirrors the [invoke] /
     * [invokeSuspend] split.
     */
    fun invokeWithAttachments(
        input: IN,
        attachments: List<agents_engine.content.Content>,
    ): OUT = kotlinx.coroutines.runBlocking {
        invokeSuspendWithAttachments(input, attachments)
    }

    /**
     * #2749: public snapshot/resume seam.
     *
@@ -716,6 +798,15 @@ class Agent<IN, OUT>(
         * #2754: opt out of the snapshot manifest-hash restore guard.
         */
        allowManifestMismatch: Boolean = false,
        /**
         * #2470 slice b: image attachments to ride on the FIRST user
         * LlmMessage. Runtime dereferences each `Content.Image` against
         * [Agent.blobStore] (errors fast when null) and renders into
         * [agents_engine.model.ImagePart]. Non-image variants in the
         * list (Text / Document / Audio / Video) are skipped in v1:
         * Document is planned for slice c, Audio / Video for Stage 2.
         * Null = no attachments; wire shape unchanged.
         */
        attachments: List<agents_engine.content.Content>? = null,
        onSkillCompleted: (agents_engine.model.TokenUsage?) -> Unit = { /* no-op */ },
        onSkillStarted: (String) -> Unit,
    ): OUT {
@@ -744,6 +835,7 @@ class Agent<IN, OUT>(
                onTurnCheckpoint = onTurnCheckpoint,
                resumeWith = resumeWith,
                allowManifestMismatch = allowManifestMismatch,
                attachments = attachments,
            )
            // #1740: surface cumulative usage on the way out. Non-agentic
            // skills don't go through executeAgentic, so onSkillCompleted