diff --git a/CHANGELOG.md b/CHANGELOG.md index e7d3dc9..90f4318 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -4,6 +4,17 @@ All notable changes to Agents.KT are documented here. The format follows [Keep a ## [Unreleased] +### Added — Typed agent attachments (#2470 slice b) + +- **`agent.invokeWithAttachments(input, attachments)`** + suspending sibling `invokeSuspendWithAttachments` — user-facing API for vision input via typed `Content.Image`. The runtime dereferences each ref against the agent's injected `BlobStore`, base64-encodes once, and attaches `ImagePart` to the first user `LlmMessage`. Per-provider wire translation is the slice-a work — this commit routes the typed surface into it. +- **`Agent.blobStore: BlobStore?` + `blobStore(store)` DSL** — optional injection; null when the agent doesn't take attachments. Passing attachments to an agent with no `blobStore` errors fast at invoke time with a clear message — caller misconfiguration surfaces before any provider HTTP. +- **Closed mime mapping** — `ImageMime → ImagePart.WireMime` for all four variants (`Png`, `Jpeg`, `Gif`, `Webp`). No `String` conversion at any boundary. +- **Forensic-friendly errors** — when a ref's blob is missing from the store, the error names the ref's hash prefix. Helps debug snapshot resumes against partially-purged stores. +- **Non-image variants skipped in v1** — `Content.Text` / `Document` / `Audio` / `Video` flow through the attachment path as no-ops. Slice c will wire Document via provider doc-input adapters; Audio/Video land in Stage 2. +- **Empty / all-skipped attachments → null images** — no provider sees an empty array; legacy wire shape preserved. +- **Resume composition** — `attachments` argument is ignored on resume because the restored conversation already carries the original `LlmMessage.images` on the saved user turn. 
+- **Tests:** 8 unit cases (`AgentAttachmentsTest`) + 6 live cases (`AgentVisionLiveTest`) running the same `VisionFixtures` from slice a through the agent surface on Ollama qwen3-vl:8b, Claude Haiku 4.5, OpenAI gpt-4o-mini. See [docs/multimodal.md](docs/multimodal.md#agent-attachments--typed-contentimage-at-the-invoke-surface-2470-slice-b). + ### Added — Vision input across all providers (#2470 slice a) - **`LlmMessage.images: List? = null`** — new optional field; back-compat default leaves the wire shape byte-identical to pre-#2470 for callers that don't pass images. Closed `ImagePart(base64, wireMime)` with `WireMime` sealed type (`Png`, `Jpeg`, `Gif`, `Webp`) — `String` mime is intentionally not accepted in the public ctor. diff --git a/README.md b/README.md index a7ca4c3..0984ff5 100644 --- a/README.md +++ b/README.md @@ -157,6 +157,7 @@ These APIs work in `main`, are unit-tested, and are exercised by integration tes - **Eval harness** — `DeterministicModelClient(LlmResponse.Text("..."), LlmResponse.ToolCalls(...))` (#2492) scripts model responses for reproducible eval without a live provider; the streaming flow folds into the same Started → ArgsDelta → Finished → End chunk sequence a native streaming provider would emit. Typed assertion DSL `eval("name") { input(...); expect { ... }; expectSnapshot(...) }` (#2493) runs against the parsed `OUT` — not regex on the wire. Snapshot mode pins `toLlmInput(output)` JSON for structural diffs; `evalSuite { + case; + case }` bundles cases. Optional `judge("tone", rubric)` (#2494) runs an advisory LLM-as-judge scorer with a typed `@Generable` `JudgeVerdict` — explicitly separate from the deterministic pass/fail contract (judges never gate). See [docs/eval.md](docs/eval.md). - **Multimodal foundation** — `sealed Content { Text, Image(ref, ImageMime), Audio, Video, Document(ref, DocMime) }` (#2466) with closed mime types per modality (no `String` mime). 
Content-addressed `ContentRef(hash, sizeBytes, wireMime)` + `BlobStore` interface, `InMemoryBlobStore` / `FileBlobStore` impls — SHA-256 keys match the manifest-hash family, atomic tmp+rename, process-restart-safe, idempotent put (#2467). Tools can return `ToolResult(parts: List)` for mixed text + image + document outputs; JSONL audit exporter records `outputParts` per-part summary (`:::`) with no blob bytes in the audit row (#2469). Stage 1 wires Image + Document end-to-end. See [docs/multimodal.md](docs/multimodal.md). - **Vision input to models** — `LlmMessage(role = "user", content = "...", images = listOf(ImagePart(base64, ImagePart.WireMime.Png)))` (#2470 slice a) reaches all four built-in adapters: Ollama emits `images: [...]`, Claude emits `{type:"image", source:{type:"base64",...}}` content blocks, OpenAI emits `{type:"image_url", image_url:{url:"data:..."}}` content blocks, DeepSeek inherits OpenAI (silently ignored on non-vision models). Closed `ImagePart.WireMime { Png, Jpeg, Gif, Webp }` — no `String` mime. Programmatic `VisionFixtures.threeSquaresPng()` / `housePng()` (256×256, `BufferedImage`-rendered, ~5KB) + per-provider live tests (qwen3-vl:8b / Haiku 4.5 / gpt-4o-mini) with cost discipline. See [docs/multimodal.md](docs/multimodal.md#vision-input--talking-to-the-model-2470-slice-a). +- **Typed `Content.Image` at the agent surface** — `agent.invokeWithAttachments("describe", attachments = listOf(Content.Image(ref, ImageMime.Png)))` (#2470 slice b). Inject a `BlobStore` via `blobStore(store)` in the agent DSL; the runtime dereferences each `Content.Image` against the store, base64-encodes once, and attaches `ImagePart` to the first user message. Closed `ImageMime → ImagePart.WireMime` mapping covers all four variants. Misconfiguration errors fast (no `blobStore` configured, missing blob for a ref). Composes with snapshot/resume: the restored conversation already carries the original `LlmMessage.images` on the saved user turn, so the `attachments` argument is ignored on resume. Suspending sibling `invokeSuspendWithAttachments`. 
Live tests across all three vision providers via the agent surface. See [docs/multimodal.md](docs/multimodal.md#agent-attachments--typed-contentimage-at-the-invoke-surface-2470-slice-b). - **Prompt caching across providers** — `agent { caching { enabled = true; cacheSystemPrompt = true; cacheToolDefs = true; cacheConversation = Rolling; ttl = 1.hours; cacheable("doc-id") { ... } } }`. Vendor-neutral DSL drives Anthropic's explicit `cache_control` breakpoints (#2658), OpenAI / DeepSeek automatic prefix caching with a stable `prompt_cache_key` routing hint (#2659 / #2661), Ollama / vLLM / SGLang engine-level KV-cache reuse (no-op hints, #2662), and surfaces cache reads + writes + hit-rate on `TokenUsage` (#2663). A prefix-stability guard (#2657) detects silent cache-busters — timestamps, UUIDs, non-deterministic ordering inside cacheable segments — and warns before you pay for a single non-cached run. Off by default; non-breaking. See [docs/caching.md](docs/caching.md). - **JSONL audit exporter** — `:agents-kt-observability` writes append-only, one-line-per-event audit rows with `requestId`, `sessionId`, `manifestHash`, agent/skill/tool ids, event type, provider, and model; raw arguments/results are omitted by default (#1914). See [docs/observability.md](docs/observability.md). - **ObservabilityBridge adapters** — `.observe(OtelBridge(tracer))` maps runtime events to OTel spans (#1908), `.observe(LangSmithBridge(apiKey, project))` maps the same events to LangSmith run trees (#1909), and `.observe(LangfuseBridge(publicKey, secretKey))` maps them to Langfuse traces, generations, spans, and events (#1910), while keeping core vendor-free. See [docs/observability.md](docs/observability.md). 
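The dereference path that the slice-b README entry describes (ref → bytes → base64 → `ImagePart`, with fail-fast checks and a `null` result when nothing image-shaped survives) can be sketched in isolation. This is a minimal illustration under stated assumptions, not the code in `AgenticLoop.kt`: `InMemoryStore`, `resolveImages`, and the cut-down `Content`, `ContentRef`, `ImageMime`, and `ImagePart` types below are hypothetical stand-ins, and the real `ContentRef` and `BlobStore.put` also carry a wire mime that is omitted here.

```kotlin
import java.security.MessageDigest
import java.util.Base64

// Hypothetical stand-ins for illustration only; the real types live in
// agents_engine.content / agents_engine.model and may differ in shape.
enum class ImageMime(val wireMime: String) {
    Png("image/png"), Jpeg("image/jpeg"), Gif("image/gif"), Webp("image/webp")
}

data class ContentRef(val hash: String, val sizeBytes: Long)

sealed interface Content {
    data class Text(val text: String) : Content
    data class Image(val ref: ContentRef, val mime: ImageMime) : Content
}

data class ImagePart(val base64: String, val wireMime: String)

class InMemoryStore {
    private val blobs = HashMap<String, ByteArray>()

    // Content-addressed put: the key is the SHA-256 hex digest of the bytes,
    // so putting the same bytes twice yields the same ref.
    fun put(bytes: ByteArray): ContentRef {
        val hash = MessageDigest.getInstance("SHA-256").digest(bytes)
            .joinToString("") { "%02x".format(it) }
        blobs[hash] = bytes
        return ContentRef(hash, bytes.size.toLong())
    }

    fun get(ref: ContentRef): ByteArray? = blobs[ref.hash]
}

// Assumed helper name. Returns null (never an empty list) when there is
// nothing to attach, so the message keeps its attachment-free shape.
fun resolveImages(agentName: String, store: InMemoryStore?, attachments: List<Content>?): List<ImagePart>? {
    if (attachments.isNullOrEmpty()) return null
    // Any non-empty list needs a store, even if it turns out to hold no images.
    val blobStore = requireNotNull(store) { "Agent '$agentName' has attachments but no blobStore" }
    return attachments.mapNotNull { content ->
        when (content) {
            is Content.Image -> {
                val bytes = blobStore.get(content.ref)
                    ?: error("no entry for ContentRef(hash=${content.ref.hash.take(12)})")
                ImagePart(Base64.getEncoder().encodeToString(bytes), content.mime.wireMime)
            }
            is Content.Text -> null // non-image variants are skipped
        }
    }.takeIf { it.isNotEmpty() }
}

fun main() {
    val store = InMemoryStore()
    val ref = store.put(byteArrayOf(1, 2, 3))
    val mixed = listOf(Content.Text("skipped"), Content.Image(ref, ImageMime.Png))
    println(resolveImages("vision", store, mixed))                              // one ImagePart
    println(resolveImages("vision", store, listOf(Content.Text("only text")))) // null
}
```

In this sketch the store check runs before the variant filter, mirroring the order in the diff: a list holding only `Content.Text` still needs a configured store, and then collapses to `null`.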
diff --git a/docs/multimodal.md b/docs/multimodal.md index ee10a41..dd18af3 100644 --- a/docs/multimodal.md +++ b/docs/multimodal.md @@ -204,10 +204,58 @@ Models overridable via env (`AGENTSKT_TEST_OLLAMA_VISION_MODEL`, `AGENTSKT_TEST_ Assertion shape is loose: the test passes if the model's text response mentions one of a small acceptable keyword set (`3` / `three` for counting; `house` / `home` / `cottage` / `building` / `cabin` / `barn` for the house). Goal is "did the image reach the model and elicit a sensible reply" — not exact phrasing. +## Agent attachments — typed `Content.Image` at the invoke surface (#2470 slice b) + +Slice a wires the per-provider wire format on `LlmMessage.images`. Slice b puts a clean user-facing API on top: the caller passes typed `Content.Image` (carrying a `ContentRef`) at the agent's invoke surface; the runtime dereferences against an injected `BlobStore`, base64-encodes once, and attaches `ImagePart` to the first user message. + +```kotlin +import agents_engine.content.Content +import agents_engine.content.ImageMime +import agents_engine.content.FileBlobStore + +val store = FileBlobStore(Path.of("snapshots/blobs")) + +val agent = agent("vision") { + model { ollama("qwen3-vl:8b") } + blobStore(store) + skills { skill("describe", "") { tools() } } +} + +val ref = store.put(pngBytes, ImageMime.Png.wireMime) +val reply = agent.invokeWithAttachments( + "What is in this image?", + attachments = listOf(Content.Image(ref, ImageMime.Png)), +) +``` + +### What the runtime guarantees + +- **First user message only.** Attachments ride on the initial user turn. Multi-turn vision is composable but each invocation owns its own first-turn attachments. +- **Closed-mime mapping.** `ImageMime → ImagePart.WireMime` for all four variants — `Png`, `Jpeg`, `Gif`, `Webp`. No String conversions anywhere. 
+- **Fail-fast on misconfiguration.** Passing attachments to an agent with no `blobStore` configured errors fast at invoke time with `Agent '<name>' has attachments but no blobStore`. A ref pointing at a missing blob errors fast with the ref's hash prefix in the message for forensics.
+- **Skip non-image variants.** `Content.Text` / `Content.Document` / `Content.Audio` / `Content.Video` are silently skipped in v1. Slice c will wire Document via provider doc-input adapters; Audio/Video land in Stage 2.
+- **Back-compat.** `agent.invokeSuspend(input)` (without attachments) stays byte-identical on the wire. The attachments path is purely additive — opt in via the new `invokeWithAttachments` / `invokeSuspendWithAttachments` entry points.
+- **Snapshot/resume composition.** On resume the saved user turn already carries the original `LlmMessage.images`; the runtime ignores the `attachments` argument because the conversation was restored intact.
+
+### Suspending vs blocking
+
+| Entry point | Use when |
+|---|---|
+| `agent.invokeSuspendWithAttachments(input, attachments)` | Inside coroutine scopes — composition operators, structured concurrency. |
+| `agent.invokeWithAttachments(input, attachments)` | Outside coroutine scopes — quick scripts, REPL, blocking glue. Thin `runBlocking` shim. |
+
+Mirrors the existing `invokeSuspend` / `invoke` split.
+
+### Live tests
+
+`AgentVisionLiveTest.kt` runs the two `VisionFixtures` (`threeSquaresPng()` + `housePng()`) through the agent surface on all three vision-capable providers. Same cost discipline as slice a — 256×256 PNG, `temperature = 0`, `maxTokens = 80` (set on the Claude and OpenAI cases), single-turn. Tagged `live-llm` (Ollama) / `live-cloud-api` (Claude + OpenAI); `assumeTrue` skips per-provider when no key is available or, for Ollama, when the local server is unreachable.
+
+The slice-b live tests complement the slice-a tests: the slice-a `VisionLiveTest` exercises the raw `ModelClient`; slice-b's `AgentVisionLiveTest` exercises the full agent loop including BlobStore deref.
+ ## What's still coming (rest of #2465) - **#2468** Compile-time modality routing — `Agent` becomes a real type; cross-modality miswiring is a compile error. Multi-part `@Generable` inputs via KSP. -- **#2470 (slice b)** `Content` → `LlmMessage.images` translation at the agentic loop — currently the caller dereferences `ContentRef` → bytes → `ImagePart` manually. Sliced this way to land the wire format first; the loop hook is a small follow-up. +- **#2470 slice c** Document/Audio/Video provider-input adapters — currently only images flow through the wire; Document/Audio/Video Content variants are skipped on the attachment path. - **#2471** Manifest-anchored modality capability — declared per-agent modalities recorded in the permission manifest, validated against provider capabilities at build time. - **#2472** Multimodal memory — `MemoryBank` entries carry `ContentRef` for image/audio/video state. - **#2473** Testing fixtures + snapshot + mutation coverage. diff --git a/src/main/kotlin/agents_engine/core/Agent.kt b/src/main/kotlin/agents_engine/core/Agent.kt index 497e6e2..df10f68 100644 --- a/src/main/kotlin/agents_engine/core/Agent.kt +++ b/src/main/kotlin/agents_engine/core/Agent.kt @@ -191,6 +191,18 @@ class Agent( private set var skillChosenListener: ((name: String) -> Unit)? = null private set + /** + * #2470 slice b — optional [agents_engine.content.BlobStore] for + * dereferencing `Content.Image` attachments at the agent invoke + * surface. When the caller passes `attachments = listOf(Content.Image( + * ref, mime))`, the runtime reads the bytes from this store and builds + * the corresponding [agents_engine.model.ImagePart] for the first user + * LlmMessage. Null when the agent doesn't accept image attachments — + * passing attachments to such an agent errors fast at invoke time + * with a clear message. + */ + var blobStore: agents_engine.content.BlobStore? = null + private set var memoryBank: MemoryBank? 
= null private set var routerRationaleListener: ((rationale: String) -> Unit)? = null @@ -547,6 +559,36 @@ class Agent( } } + /** + * #2470 slice b — inject a [agents_engine.content.BlobStore] so the + * agent can dereference `Content.Image` attachments at invoke time. + * + * ```kotlin + * val store = FileBlobStore(Path.of("blobs")) + * val agent = agent("vision") { + * model { ollama("qwen3-vl:8b") } + * blobStore(store) + * skills { skill("describe", "") { tools() } } + * } + * + * val ref = store.put(pngBytes, ImageMime.Png.wireMime) + * val out = agent.invokeWithAttachments( + * "What is in this image?", + * attachments = listOf(Content.Image(ref, ImageMime.Png)), + * ) + * ``` + * + * The runtime reads the blob from this store, base64-encodes once, + * and attaches it to the first user LlmMessage as + * `images: List`. Per-provider wire translation is the + * #2470 slice-a work in `OllamaClient` / `ClaudeClient` / + * `OpenAiClient`. + */ + fun blobStore(store: agents_engine.content.BlobStore) { + checkNotFrozen() + blobStore = store + } + fun tools(block: ToolsBuilder.() -> Unit) { checkNotFrozen() val builder = ToolsBuilder() @@ -602,6 +644,46 @@ class Agent( invokeSuspendForSession(input, emitter = null) { /* no-op */ } } + /** + * #2470 slice b — suspending entry point with image attachments. The + * caller passes `attachments = listOf(Content.Image(ref, mime), ...)`; + * the runtime dereferences each ref against [blobStore], base64-encodes + * once, and attaches them to the first user LlmMessage. Per-provider + * wire translation is the slice-a work (Ollama / Claude / OpenAI all + * already implement the wire format for `LlmMessage.images`). 
+ * + * Errors fast with a clear message when: + * - [blobStore] is null but [attachments] are passed + * - A ref's blob is missing from the store (purged / rewired) + * + * Document / Audio / Video variants in [attachments] are silently + * skipped in v1 — they'll be wired through provider doc/audio/video + * adapters in later slices of #2470. + */ + suspend fun invokeSuspendWithAttachments( + input: IN, + attachments: List, + ): OUT = + withAgentRuntimeContext(newRuntimeContext()) { + invokeSuspendForSession( + input = input, + emitter = null, + attachments = attachments, + ) { /* no-op */ } + } + + /** + * #2470 slice b — blocking shim over [invokeSuspendWithAttachments] + * for callers outside coroutine scopes. Mirrors the [invoke] / + * [invokeSuspend] split. + */ + fun invokeWithAttachments( + input: IN, + attachments: List, + ): OUT = kotlinx.coroutines.runBlocking { + invokeSuspendWithAttachments(input, attachments) + } + /** * #2749 — public snapshot/resume seam. * @@ -716,6 +798,15 @@ class Agent( * #2754 — opt out of the snapshot manifest-hash restore guard. */ allowManifestMismatch: Boolean = false, + /** + * #2470 slice b — image attachments to ride on the FIRST user + * LlmMessage. Runtime dereferences each `Content.Image` against + * [Agent.blobStore] (errors fast when null) and renders into + * [agents_engine.model.ImagePart]. Non-image variants in the + * list (Document / Audio / Video) are deferred — Stage 2 with + * provider adapters. Null = no attachments; wire shape unchanged. + */ + attachments: List? = null, onSkillCompleted: (agents_engine.model.TokenUsage?) -> Unit = { /* no-op */ }, onSkillStarted: (String) -> Unit, ): OUT { @@ -744,6 +835,7 @@ class Agent( onTurnCheckpoint = onTurnCheckpoint, resumeWith = resumeWith, allowManifestMismatch = allowManifestMismatch, + attachments = attachments, ) // #1740: surface cumulative usage on the way out. 
Non-agentic // skills don't go through executeAgentic, so onSkillCompleted diff --git a/src/main/kotlin/agents_engine/model/AgenticLoop.kt b/src/main/kotlin/agents_engine/model/AgenticLoop.kt index 566460d..5402e99 100644 --- a/src/main/kotlin/agents_engine/model/AgenticLoop.kt +++ b/src/main/kotlin/agents_engine/model/AgenticLoop.kt @@ -130,6 +130,17 @@ internal suspend fun executeAgentic( * snapshot; ignored otherwise. */ resumeWith: Any? = null, + /** + * #2470 slice b — image attachments for the FIRST user LlmMessage. + * Each `Content.Image` is dereferenced against [Agent.blobStore] + * (errors fast when null), base64-encoded once, and rendered into + * an [ImagePart]. Non-image content variants are skipped — Document + * / Audio / Video flow through the wire only once #2470 slice c + * (provider doc/audio/video adapters) ships. Ignored on resume (the + * snapshot's restored conversation already carries the original + * attachments on the saved user turn). + */ + attachments: List? = null, ): AgenticResult { val config = requireNotNull(agent.modelConfig) { "Agent '${agent.name}' has no model configured. Add a model { } block." @@ -318,7 +329,54 @@ internal suspend fun executeAgentic( // User: serialized input. Typed @Generable inputs become JSON; primitives // and Strings render literally; non-Generable types fall back to toString. // See #937 / GenerableSupport.toLlmInput. - messages.add(LlmMessage("user", toLlmInput(input))) + // + // #2470 slice b — when the caller passes `attachments`, dereference + // each `Content.Image` against the agent's BlobStore, base64-encode + // once, and ride along on this first user message as `images: List< + // ImagePart>`. The slice-a per-provider adapters translate that to + // the right wire shape (Ollama `images: [...]`, Claude image blocks, + // OpenAI image_url blocks). Non-image content variants (Document / + // Audio / Video) skipped — provider doc/audio/video paths land in + // later slices. 
Null or empty lists are a fast-path no-op. + val attachedImages: List<ImagePart>? = if (attachments.isNullOrEmpty()) { + null + } else { + val store = agent.blobStore + require(store != null) { + "Agent '${agent.name}' has attachments but no blobStore — call `blobStore(store)` " + + "inside the agent { } block so Content.Image refs can be dereferenced." + } + attachments.mapNotNull { content -> + when (content) { + is agents_engine.content.Content.Image -> { + val bytes = store.get(content.ref) + ?: error( + "BlobStore on agent '${agent.name}' has no entry for ContentRef(" + + "hash=${content.ref.hash.take(12)}…, size=${content.ref.sizeBytes}); " + + "did the store get rewired or the blob purged?", + ) + ImagePart( + base64 = java.util.Base64.getEncoder().encodeToString(bytes), + wireMime = when (content.mime) { + agents_engine.content.ImageMime.Png -> ImagePart.WireMime.Png + agents_engine.content.ImageMime.Jpeg -> ImagePart.WireMime.Jpeg + agents_engine.content.ImageMime.Gif -> ImagePart.WireMime.Gif + agents_engine.content.ImageMime.Webp -> ImagePart.WireMime.Webp + }, + ) + } + is agents_engine.content.Content.Text, + is agents_engine.content.Content.Audio, + is agents_engine.content.Content.Video, + is agents_engine.content.Content.Document -> { + // Not an image — skip in v1. Slice c (provider doc/ + // audio/video paths) covers the rest. 
+ null + } + } + }.takeIf { it.isNotEmpty() } + } + messages.add(LlmMessage("user", toLlmInput(input), images = attachedImages)) } var turns = resumeFrom?.turns ?: 0 diff --git a/src/test/kotlin/agents_engine/core/AgentAttachmentsTest.kt b/src/test/kotlin/agents_engine/core/AgentAttachmentsTest.kt new file mode 100644 index 0000000..beb7a34 --- /dev/null +++ b/src/test/kotlin/agents_engine/core/AgentAttachmentsTest.kt @@ -0,0 +1,174 @@ +package agents_engine.core + +import agents_engine.content.Content +import agents_engine.content.ImageMime +import agents_engine.content.InMemoryBlobStore +import agents_engine.model.LlmMessage +import agents_engine.model.LlmResponse +import agents_engine.model.ModelClient +import org.junit.jupiter.api.assertThrows +import kotlin.test.Test +import kotlin.test.assertEquals +import kotlin.test.assertNotNull +import kotlin.test.assertTrue + +/** + * #2470 slice b — `agent.invokeWithAttachments(input, attachments)` + * dereferences Content.Image against the agent's BlobStore and rides + * along as `LlmMessage.images` on the first user message. Pins: + * + * 1. attachments = listOf(Content.Image(ref, mime)) → first user + * LlmMessage carries an ImagePart with the dereferenced base64 + + * typed wire mime. + * 2. The text input is unchanged — toLlmInput(input) still controls + * the message.content. + * 3. Closed mime mapping (ImageMime → ImagePart.WireMime) for all four + * variants. + * 4. Multiple images compose; ordering preserved. + * 5. Non-image Content variants (Document/Audio/Video/Text) are + * skipped in v1 — no provider-doc/audio/video path yet. + * 6. attachments without a configured BlobStore fails fast with a + * clear error. + * 7. Ref pointing at a missing blob fails fast with a forensic-friendly + * error message. + * 8. invokeSuspend (legacy entry, no attachments) stays byte-identical: + * user message carries `images = null`. 
+ */ +class AgentAttachmentsTest { + + private val redPng = byteArrayOf( + 0x89.toByte(), 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A, // PNG magic + 1, 2, 3, 4, 5, + ) + + private fun captureFirstUserMessage( + configure: Agent.() -> Unit = { }, + attachments: List<Content>? = null, + ): LlmMessage { + val captured = mutableListOf<List<LlmMessage>>() + val mock = ModelClient { msgs -> captured += msgs.toList(); LlmResponse.Text("done") } + val a = agent("a") { + model { ollama("t"); client = mock } + skills { skill("s", "") { tools() } } + configure() + } + if (attachments != null) a.invokeWithAttachments("hi", attachments) else a("hi") + return captured.first().last { it.role == "user" } + } + + @Test + fun `invokeWithAttachments dereferences Content Image into ImagePart on first user message`() { + val store = InMemoryBlobStore() + val ref = store.put(redPng, ImageMime.Png.wireMime) + val userMsg = captureFirstUserMessage( + configure = { blobStore(store) }, + attachments = listOf(Content.Image(ref, ImageMime.Png)), + ) + val images = assertNotNull(userMsg.images, "first user msg carries images") + assertEquals(1, images.size) + val expectedBase64 = java.util.Base64.getEncoder().encodeToString(redPng) + assertEquals(expectedBase64, images[0].base64) + assertEquals("image/png", images[0].wireMime.value) + assertEquals("hi", userMsg.content, "text input untouched by attachments") + } + + @Test + fun `closed ImageMime maps to closed ImagePart WireMime for all four variants`() { + val store = InMemoryBlobStore() + val jpegBytes = byteArrayOf(1, 2, 3) + val gifBytes = byteArrayOf(4, 5, 6) + val webpBytes = byteArrayOf(7, 8, 9) + val pngBytes = byteArrayOf(10, 11, 12) + val refs = listOf( + Content.Image(store.put(pngBytes, ImageMime.Png.wireMime), ImageMime.Png), + Content.Image(store.put(jpegBytes, ImageMime.Jpeg.wireMime), ImageMime.Jpeg), + Content.Image(store.put(gifBytes, ImageMime.Gif.wireMime), ImageMime.Gif), + Content.Image(store.put(webpBytes, ImageMime.Webp.wireMime), ImageMime.Webp), 
+ ) + val userMsg = captureFirstUserMessage( + configure = { blobStore(store) }, + attachments = refs, + ) + val images = assertNotNull(userMsg.images) + assertEquals(4, images.size) + assertEquals("image/png", images[0].wireMime.value) + assertEquals("image/jpeg", images[1].wireMime.value) + assertEquals("image/gif", images[2].wireMime.value) + assertEquals("image/webp", images[3].wireMime.value) + } + + @Test + fun `multiple images compose in order — non-image content variants are skipped in v1`() { + val store = InMemoryBlobStore() + val ref1 = store.put(byteArrayOf(1, 2), ImageMime.Png.wireMime) + val ref2 = store.put(byteArrayOf(3, 4), ImageMime.Png.wireMime) + val userMsg = captureFirstUserMessage( + configure = { blobStore(store) }, + attachments = listOf( + Content.Text("ignored — text variant"), + Content.Image(ref1, ImageMime.Png), + Content.Image(ref2, ImageMime.Png), + // Document / Audio / Video variants need refs we don't make + // here; the production runtime would skip them too in v1. 
+ ), + ) + val images = assertNotNull(userMsg.images) + assertEquals(2, images.size, "two images, text skipped") + assertEquals(java.util.Base64.getEncoder().encodeToString(byteArrayOf(1, 2)), images[0].base64) + assertEquals(java.util.Base64.getEncoder().encodeToString(byteArrayOf(3, 4)), images[1].base64) + } + + @Test + fun `attachments without a blobStore fail fast with a clear message`() { + val ref = InMemoryBlobStore().put(byteArrayOf(1, 2), ImageMime.Png.wireMime) + val ex = assertThrows { + captureFirstUserMessage( + configure = { /* no blobStore */ }, + attachments = listOf(Content.Image(ref, ImageMime.Png)), + ) + } + val msg = ex.message ?: "" + assertTrue("blobStore" in msg, "error names the missing config: $msg") + } + + @Test + fun `ref pointing at a missing blob fails fast with hash context for forensics`() { + val store = InMemoryBlobStore() + val ref = store.put(byteArrayOf(1, 2, 3), ImageMime.Png.wireMime) + store.delete(ref) + val ex = assertThrows { + captureFirstUserMessage( + configure = { blobStore(store) }, + attachments = listOf(Content.Image(ref, ImageMime.Png)), + ) + } + val msg = ex.message ?: "" + assertTrue(ref.hash.take(8) in msg, "error names the ref's hash: $msg") + } + + @Test + fun `invokeSuspend without attachments stays byte-identical (back-compat)`() { + val userMsg = captureFirstUserMessage() + assertEquals("hi", userMsg.content) + assertEquals(null, userMsg.images, "no images = null field; wire shape unchanged") + } + + @Test + fun `empty attachments list is treated as no attachments`() { + val userMsg = captureFirstUserMessage( + configure = { blobStore(InMemoryBlobStore()) }, + attachments = emptyList(), + ) + assertEquals(null, userMsg.images, "empty list short-circuits the deref path") + } + + @Test + fun `attachments list with only non-image variants results in null images (no providers see empty array)`() { + val store = InMemoryBlobStore() + val userMsg = captureFirstUserMessage( + configure = { blobStore(store) }, + 
attachments = listOf(Content.Text("only text")), + ) + assertEquals(null, userMsg.images, "skipping all variants → null, not empty list") + } +} diff --git a/src/test/kotlin/agents_engine/core/AgentVisionLiveTest.kt b/src/test/kotlin/agents_engine/core/AgentVisionLiveTest.kt new file mode 100644 index 0000000..057ed27 --- /dev/null +++ b/src/test/kotlin/agents_engine/core/AgentVisionLiveTest.kt @@ -0,0 +1,231 @@ +package agents_engine.core + +import agents_engine.content.Content +import agents_engine.content.ImageMime +import agents_engine.content.InMemoryBlobStore +import agents_engine.model.VisionFixtures +import kotlinx.coroutines.runBlocking +import org.junit.jupiter.api.Assumptions.assumeTrue +import org.junit.jupiter.api.Tag +import org.junit.jupiter.api.Test +import java.io.File +import java.net.URI +import java.net.http.HttpClient +import java.net.http.HttpRequest +import java.net.http.HttpResponse +import java.time.Duration +import kotlin.test.assertTrue + +/** + * #2470 slice b — live tests that exercise `agent.invokeWithAttachments` + * end-to-end across the three vision-capable providers. Companion to + * `VisionLiveTest` (slice a) which hits the raw `ModelClient`; this + * test pushes the same fixtures through the agent surface so the + * BlobStore-deref path is exercised on live providers. + * + * Cost discipline matches slice a: 256×256 PNGs, `temperature = 0`, + * `maxTokens = 80`, single-turn. Each provider gated by + * `assumeTrue` so the suite skips cleanly without keys / reachable + * Ollama. 
+ * + * Model overrides: + * AGENTSKT_TEST_OLLAMA_VISION_MODEL (default qwen3-vl:8b) + * AGENTSKT_TEST_CLAUDE_VISION_MODEL (default claude-haiku-4-5) + * AGENTSKT_TEST_OPENAI_VISION_MODEL (default gpt-4o-mini) + */ +class AgentVisionLiveTest { + + private val ollamaModel = System.getenv("AGENTSKT_TEST_OLLAMA_VISION_MODEL") ?: "qwen3-vl:8b" + private val claudeModel = System.getenv("AGENTSKT_TEST_CLAUDE_VISION_MODEL") ?: "claude-haiku-4-5" + private val openaiModel = System.getenv("AGENTSKT_TEST_OPENAI_VISION_MODEL") ?: "gpt-4o-mini" + + // ───────────────────────── Ollama ───────────────────────── + + @Tag("live-llm") + @Test + fun `Ollama agent invokeWithAttachments counts the three squares`() { + assumeTrue(isOllamaReachable(), "skipping: no Ollama at localhost:11434") + val store = InMemoryBlobStore() + val ref = store.put(VisionFixtures.threeSquaresPng(), ImageMime.Png.wireMime) + val a = agent("ollama-vision") { + model { ollama(ollamaModel); temperature = 0.0 } + blobStore(store) + skills { skill("describe", "") { tools() } } + } + val reply = runBlocking { + a.invokeSuspendWithAttachments( + input = "How many colored squares are in this image? Answer with just the digit.", + attachments = listOf(Content.Image(ref, ImageMime.Png)), + ) + } + println("[Ollama agent vision] squares → $reply") + assertSquaresCountedAsThree(reply, "Ollama($ollamaModel) via agent") + } + + @Tag("live-llm") + @Test + fun `Ollama agent invokeWithAttachments identifies the house drawing`() { + assumeTrue(isOllamaReachable(), "skipping: no Ollama at localhost:11434") + val store = InMemoryBlobStore() + val ref = store.put(VisionFixtures.housePng(), ImageMime.Png.wireMime) + val a = agent("ollama-vision") { + model { ollama(ollamaModel); temperature = 0.0 } + blobStore(store) + skills { skill("describe", "") { tools() } } + } + val reply = runBlocking { + a.invokeSuspendWithAttachments( + input = "What is depicted in this image? 
Answer in one short phrase.", + attachments = listOf(Content.Image(ref, ImageMime.Png)), + ) + } + println("[Ollama agent vision] house → $reply") + assertSeesHouse(reply, "Ollama($ollamaModel) via agent") + } + + // ───────────────────────── Anthropic ───────────────────────── + + @Tag("live-cloud-api") + @Test + fun `Claude agent invokeWithAttachments counts the three squares`() { + val apiKey = loadKey("ANTHROPIC_API_KEY", ".secrets/anthropic-key") + assumeTrue(apiKey != null, "skipping: no Anthropic key") + val store = InMemoryBlobStore() + val ref = store.put(VisionFixtures.threeSquaresPng(), ImageMime.Png.wireMime) + val a = agent("claude-vision") { + model { + claude(claudeModel) + this.apiKey = apiKey + temperature = 0.0 + maxTokens = 80 + } + blobStore(store) + skills { skill("describe", "") { tools() } } + } + val reply = runBlocking { + a.invokeSuspendWithAttachments( + input = "How many colored squares are in this image? Answer with just the digit.", + attachments = listOf(Content.Image(ref, ImageMime.Png)), + ) + } + println("[Claude agent vision] squares → $reply") + assertSquaresCountedAsThree(reply, "Claude($claudeModel) via agent") + } + + @Tag("live-cloud-api") + @Test + fun `Claude agent invokeWithAttachments identifies the house drawing`() { + val apiKey = loadKey("ANTHROPIC_API_KEY", ".secrets/anthropic-key") + assumeTrue(apiKey != null, "skipping: no Anthropic key") + val store = InMemoryBlobStore() + val ref = store.put(VisionFixtures.housePng(), ImageMime.Png.wireMime) + val a = agent("claude-vision") { + model { + claude(claudeModel) + this.apiKey = apiKey + temperature = 0.0 + maxTokens = 80 + } + blobStore(store) + skills { skill("describe", "") { tools() } } + } + val reply = runBlocking { + a.invokeSuspendWithAttachments( + input = "What is depicted in this image? 
Answer in one short phrase.", + attachments = listOf(Content.Image(ref, ImageMime.Png)), + ) + } + println("[Claude agent vision] house → $reply") + assertSeesHouse(reply, "Claude($claudeModel) via agent") + } + + // ───────────────────────── OpenAI ───────────────────────── + + @Tag("live-cloud-api") + @Test + fun `OpenAI agent invokeWithAttachments counts the three squares`() { + val apiKey = loadKey("OPENAI_API_KEY", ".secrets/openai-key") + assumeTrue(apiKey != null, "skipping: no OpenAI key") + val store = InMemoryBlobStore() + val ref = store.put(VisionFixtures.threeSquaresPng(), ImageMime.Png.wireMime) + val a = agent("openai-vision") { + model { + openai(openaiModel) + this.apiKey = apiKey + temperature = 0.0 + maxTokens = 80 + } + blobStore(store) + skills { skill("describe", "") { tools() } } + } + val reply = runBlocking { + a.invokeSuspendWithAttachments( + input = "How many colored squares are in this image? Answer with just the digit.", + attachments = listOf(Content.Image(ref, ImageMime.Png)), + ) + } + println("[OpenAI agent vision] squares → $reply") + assertSquaresCountedAsThree(reply, "OpenAI($openaiModel) via agent") + } + + @Tag("live-cloud-api") + @Test + fun `OpenAI agent invokeWithAttachments identifies the house drawing`() { + val apiKey = loadKey("OPENAI_API_KEY", ".secrets/openai-key") + assumeTrue(apiKey != null, "skipping: no OpenAI key") + val store = InMemoryBlobStore() + val ref = store.put(VisionFixtures.housePng(), ImageMime.Png.wireMime) + val a = agent("openai-vision") { + model { + openai(openaiModel) + this.apiKey = apiKey + temperature = 0.0 + maxTokens = 80 + } + blobStore(store) + skills { skill("describe", "") { tools() } } + } + val reply = runBlocking { + a.invokeSuspendWithAttachments( + input = "What is depicted in this image? 
Answer in one short phrase.", + attachments = listOf(Content.Image(ref, ImageMime.Png)), + ) + } + println("[OpenAI agent vision] house → $reply") + assertSeesHouse(reply, "OpenAI($openaiModel) via agent") + } + + // ───────────────────────── Helpers ───────────────────────── + + private fun assertSquaresCountedAsThree(reply: String, providerLabel: String) { + val lowered = reply.lowercase() + val sees3 = "3" in reply || "three" in lowered + assertTrue(sees3, "$providerLabel did not count three squares; got: $reply") + } + + private fun assertSeesHouse(reply: String, providerLabel: String) { + val lowered = reply.lowercase() + val sees = listOf("house", "home", "cottage", "building", "cabin", "barn").any { it in lowered } + assertTrue(sees, "$providerLabel did not recognise the house drawing; got: $reply") + } + + private fun loadKey(envVar: String, secretFile: String): String? { + val envKey = System.getenv(envVar) + if (!envKey.isNullOrBlank()) return envKey + val file = File(secretFile) + return if (file.exists()) file.readText().trim().ifBlank { null } else null + } + + private fun isOllamaReachable(): Boolean = try { + val client = HttpClient.newBuilder().connectTimeout(Duration.ofMillis(500)).build() + val request = HttpRequest.newBuilder() + .uri(URI.create("http://localhost:11434/api/tags")) + .timeout(Duration.ofMillis(1500)) + .GET() + .build() + val response = client.send(request, HttpResponse.BodyHandlers.discarding()) + response.statusCode() in 200..299 + } catch (_: Throwable) { + false + } +}
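The loose keyword checks in the helpers above can be exercised without a live provider. The sketch below lifts them into free functions so their behavior is easy to probe; `mentionsThree` and `mentionsHouse` are hypothetical names for illustration, not part of the test class.

```kotlin
// Hypothetical free-function versions of the private assertion helpers in
// AgentVisionLiveTest; same matching rules, no JUnit dependency.
fun mentionsThree(reply: String): Boolean =
    "3" in reply || "three" in reply.lowercase()

private val houseWords = listOf("house", "home", "cottage", "building", "cabin", "barn")

fun mentionsHouse(reply: String): Boolean {
    val lowered = reply.lowercase()
    return houseWords.any { it in lowered }
}

fun main() {
    println(mentionsThree("There are Three squares."))        // case-insensitive word match
    println(mentionsThree("I count 13 shapes."))              // also matches: plain substring check
    println(mentionsHouse("A small COTTAGE with a red roof")) // matches via "cottage"
    println(mentionsHouse("A boat on a lake"))                // no keyword present
}
```

Because these are plain substring checks, a reply such as "13" also satisfies the squares assertion. That fits the stated goal of confirming the image reached the model and drew a sensible reply, but it is not a strict count check.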