Feat/2465 multimodal foundation #67
Merged
First three subtickets of the 0.8 multimodal epic (#2465), shipped
together as a coherent foundation. No provider rendering and no KSP
routing in this commit — those are the sibling tickets (#2470,
#2468) and depend on what this commit establishes.
```kotlin
val store = InMemoryBlobStore() // or FileBlobStore(snapshotsDir / "blobs")
val pngRef = store.put(pngBytes, ImageMime.Png.wireMime)

tool("screenshot", "Take a screenshot") { args ->
    val bytes = takeScreenshot(args["url"] as String)
    val ref = store.put(bytes, ImageMime.Png.wireMime)
    ToolResult(
        Content.Text("Captured page."),
        Content.Image(ref, ImageMime.Png),
    )
}
```
`#2466 — Typed Content hierarchy + typed mime`:
- `agents_engine/content/Content.kt`. `sealed interface Content` with
variants `Text`, `Image`, `Audio`, `Video`, `Document`. Stage 1
wires Image + Document through the rest of the stack (the modalities
the 0.8 spec→product loop actually consumes); Audio + Video are
modelled now and exercised end-to-end through provider adapters in
Stage 2.
- Mime types are CLOSED sealed interfaces per modality — `ImageMime`,
`AudioMime`, `VideoMime`, `DocMime`. Each variant exposes a
`wireMime: String` for adapter serialisation but the public API
never accepts `String` mime. Extend by adding a variant.
- Non-text variants carry a `ContentRef`, not `ByteArray`. Avoids the
data-class equals/hashCode gotcha with byte arrays AND keeps
`Content` snapshot-safe (the #2386 / #2754 snapshot machinery
never inlines blobs).
- Extension property `Content.modality: String` is the audit-stable
per-variant name. Used by the JSONL audit exporter to write
per-part rows.
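As a rough illustration of the shape described in the #2466 bullets, the hierarchy might look like the sketch below. It is not the code from `Content.kt`: the field names, the `Jpeg` and `Pdf` variants, and the modality strings are assumptions made for the example, and `Audio` / `Video` are left out for brevity.

```kotlin
// Sketch only. Type names follow the PR description; field names,
// the Jpeg/Pdf variants and the modality strings are assumptions.

/** Reference to a blob held elsewhere; content never carries raw bytes. */
data class ContentRef(val hash: String, val sizeBytes: Long, val wireMime: String)

/** Closed mime set for images. Png appears in the PR; Jpeg is assumed. */
sealed interface ImageMime {
    val wireMime: String
    object Png : ImageMime { override val wireMime = "image/png" }
    object Jpeg : ImageMime { override val wireMime = "image/jpeg" }
}

/** Closed mime set for documents. Pdf is an assumed variant. */
sealed interface DocMime {
    val wireMime: String
    object Pdf : DocMime { override val wireMime = "application/pdf" }
}

sealed interface Content {
    data class Text(val text: String) : Content
    data class Image(val ref: ContentRef, val mime: ImageMime) : Content
    data class Document(val ref: ContentRef, val mime: DocMime) : Content
    // Audio and Video would follow the same ref + typed-mime shape.
}

/** Stable per-variant name. The compiler forces a branch for every variant. */
val Content.modality: String
    get() = when (this) {
        is Content.Text -> "text"
        is Content.Image -> "image"
        is Content.Document -> "document"
    }
```

Because every non-text variant holds a `ContentRef` rather than a `ByteArray`, the generated data-class `equals` compares hash, size and mime by value, so two `Content.Image` values pointing at the same blob compare equal.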
`#2467 — ContentRef + BlobStore + persistence`:
- `agents_engine/content/ContentRef.kt`. `data class ContentRef(hash,
sizeBytes, wireMime)`. Hash is SHA-256 hex — matches the
manifest-hash family used elsewhere (#1912, #2754), so the audit
story has a single hash algorithm.
- `interface BlobStore { put, get, open, exists, delete }`. Idempotent
put: putting the same bytes twice returns the same `ContentRef`,
and the store keeps a single copy of the blob.
- `InMemoryBlobStore` — test / single-JVM. Defensive byte-array
copies on put + get so consumer mutation can't corrupt the store.
- `FileBlobStore(dir)` — one file per blob, filename = SHA-256 hex.
Survives process restart (fresh instance on the same dir sees
prior puts). Atomic via tmp + rename, matching the #2753 pattern
from `FileSnapshotStore`.
- Public `computeContentHash(bytes): String` for byte-level comparison
without a store.
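The storage behaviour listed under #2467 can be approximated as follows. This is a minimal sketch under stated assumptions, not the shipped classes: the `Sketch*` names are hypothetical, only `put` and `get` are shown, and the method signatures are guessed from the usage example at the top of this description.

```kotlin
import java.nio.file.Files
import java.nio.file.Path
import java.nio.file.StandardCopyOption
import java.security.MessageDigest
import java.util.concurrent.ConcurrentHashMap

data class ContentRef(val hash: String, val sizeBytes: Long, val wireMime: String)

/** SHA-256 of the bytes as lowercase hex (64 characters). */
fun computeContentHash(bytes: ByteArray): String =
    MessageDigest.getInstance("SHA-256").digest(bytes)
        .joinToString("") { "%02x".format(it) }

/** Hypothetical in-memory store: copies on put and get so callers cannot mutate stored bytes. */
class SketchInMemoryBlobStore {
    private val blobs = ConcurrentHashMap<String, ByteArray>()

    fun put(bytes: ByteArray, wireMime: String): ContentRef {
        val hash = computeContentHash(bytes)
        blobs.getOrPut(hash) { bytes.copyOf() } // same bytes, same key: one copy kept
        return ContentRef(hash, bytes.size.toLong(), wireMime)
    }

    fun get(ref: ContentRef): ByteArray? = blobs[ref.hash]?.copyOf()
}

/** Hypothetical file store: one file per blob, named by hash, written via temp file + rename. */
class SketchFileBlobStore(private val dir: Path) {
    init { Files.createDirectories(dir) }

    fun put(bytes: ByteArray, wireMime: String): ContentRef {
        val hash = computeContentHash(bytes)
        val target = dir.resolve(hash)
        if (!Files.exists(target)) {
            // Write to a temp file in the same directory, then rename into place,
            // so a reader never observes a half-written blob.
            val tmp = Files.createTempFile(dir, hash, ".tmp")
            Files.write(tmp, bytes)
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE)
        }
        return ContentRef(hash, bytes.size.toLong(), wireMime)
    }

    fun get(ref: ContentRef): ByteArray? {
        val target = dir.resolve(ref.hash)
        return if (Files.exists(target)) Files.readAllBytes(target) else null
    }
}
```

Keying by content hash is what makes `put` idempotent in both variants: a second `put` of identical bytes resolves to the same key (or filename) and returns an equal `ContentRef`. A fresh `SketchFileBlobStore` on the same directory sees earlier blobs because the filename alone identifies the content.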
`#2469 — Multimodal ToolResult + audit wiring`:
- `agents_engine/content/ToolResult.kt`. `data class ToolResult(parts:
List<Content>)`. The tool executor already returns `Any?`, so a
`ToolResult` is just another return value: no ToolDef signature
change, and tools that return strings keep working byte-for-byte.
Requires at least one part (empty list fails fast).
- AgenticLoop's tool-message rendering detects `ToolResult` and
renders parts as `<text>\n[modality: <wireMime>] (<hash-prefix>,
<size>B)` placeholders for the LLM context. Provider-specific
multipart rendering is #2470 (deferred); the placeholder is good
enough until vision-capable adapters land.
- `untrustedOutput` (#642) still wraps the rendered text summary
in the JSON envelope — multimodal results compose with the
trust boundary.
- JSONL audit exporter (#1914) gains a new `outputParts` field on
audit rows. For `ToolResult` returns, emits one summary string per
part: `<modality>:<hash-prefix>:<sizeBytes>:<wireMime>`. Text parts
surface as `text:inline:<charCount>:text/plain`. **Blob bytes
never enter the audit row.** `outputType` still names the wrapper
type so column-positioned consumers see a stable shape. Field is
null for non-multimodal returns — legacy audit rows unchanged.
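To make the two string formats above concrete, here is a hedged sketch of how a `ToolResult` could be turned into the LLM placeholder text and into `outputParts` entries. The function names `renderForLlm` and `auditParts`, the 12-character hash prefix, the vararg constructor, and the trimmed-down `Content` type are all assumptions for illustration; the PR does not state the prefix length or the helper names.

```kotlin
// Sketch only: a reduced Content type with just Text and Image.
data class ContentRef(val hash: String, val sizeBytes: Long, val wireMime: String)

sealed interface Content {
    data class Text(val text: String) : Content
    data class Image(val ref: ContentRef) : Content
}

data class ToolResult(val parts: List<Content>) {
    // Assumed convenience constructor, matching the ToolResult(a, b) call style.
    constructor(vararg parts: Content) : this(parts.toList())

    init { require(parts.isNotEmpty()) { "ToolResult needs at least one part" } }
}

// Assumed prefix length; the PR only says "hash-prefix".
const val HASH_PREFIX_LEN = 12

/** Text shown to the LLM: text parts verbatim, blobs as one-line placeholders. */
fun renderForLlm(result: ToolResult): String =
    result.parts.joinToString("\n") { part ->
        when (part) {
            is Content.Text -> part.text
            is Content.Image ->
                "[image: ${part.ref.wireMime}] " +
                    "(${part.ref.hash.take(HASH_PREFIX_LEN)}, ${part.ref.sizeBytes}B)"
        }
    }

/** One summary string per part; only metadata, never blob bytes. */
fun auditParts(result: ToolResult): List<String> =
    result.parts.map { part ->
        when (part) {
            is Content.Text -> "text:inline:${part.text.length}:text/plain"
            is Content.Image ->
                "image:${part.ref.hash.take(HASH_PREFIX_LEN)}:" +
                    "${part.ref.sizeBytes}:${part.ref.wireMime}"
        }
    }
```

Both functions read only from the `ContentRef`, which is why the audit row can describe an image without a `BlobStore` lookup and without the bytes ever being available to leak.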
Composition with existing surfaces:
- Snapshot/resume (#2386 / #2754) — refs travel with snapshots; blobs
stay in the `BlobStore`. A resumed snapshot dereferences refs
against the same store. No inlined-blob explosion.
- Manifest-hash restore guard (#2754) — applies unchanged.
- `untrustedOutput` (#642) — applies to the text-summary rendering.
Tests:
- ContentAndRefTest.kt (8 cases): hash determinism, InMemoryBlobStore
round-trip + dedupe, defensive copies, exists/delete,
FileBlobStore process-restart safety + dedupe (one file on disk),
modality stability, mime wire forms.
- ToolResultIntegrationTest.kt (3 cases): tool returning ToolResult
end-to-end with text + image; empty ToolResult fails fast;
`PipelineEvent.ToolCalled.result` carries the typed `ToolResult`
for bridge consumers.
- JsonlAuditExporterTest.kt: schema-pinning EXPECTED_FIELDS updated
to include `outputParts`; new test "multimodal ToolResult writes
outputParts" pins the per-part summary format and asserts that
neither argument values nor image bytes ever enter the audit row.
Deferred (carried as siblings, not this commit's scope):
- #2468 Compile-time modality routing via KSP
- #2470 Provider adapters (Claude/OpenAI/Gemini/Ollama) for
multipart `Content` → provider payload
- #2471 Manifest-anchored modality capability validation
- #2472 Multimodal memory (ContentRef-backed MemoryBank entries)
- #2473 Multimodal testing fixtures
Full suite: 1792 tests across 7 modules, 0 failures.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ct, README, CHANGELOG
- docs/multimodal.md (new): user-facing multimodal doc. Three pieces
walked through: typed Content variants + closed mime types,
ContentRef + BlobStore (InMemory + File) with hash-family rationale
and process-restart safety, ToolResult with the v1 placeholder
rendering + audit-row discipline. What's coming section names the
five sibling tickets (#2468 KSP routing, #2470 provider adapters,
#2471 manifest-anchored capability, #2472 multimodal memory, #2473
testing fixtures). Stage 1 vs Stage 2 split explicit.
- src/main/resources/internals-agent/content/Multimodal.md (new):
IDE-side LLM adjunct covering all three pieces. Signatures, hash
family rationale, idempotent put semantics, audit-row column format,
snapshot composition, deferral list.
- README.md: adds a "Multimodal foundation" bullet under "Implemented
today" right after the eval harness bullet. Names all three
sub-tickets and the Stage 1 / Stage 2 split.
- CHANGELOG.md `## [Unreleased]`: opens with three paragraph entries
under "Multimodal foundation (#2465 epic, Stage 1)" covering #2466 /
#2467 / #2469 with their AC and composition story. Calls out the
EXPECTED_FIELDS schema-pin update so audit-row consumers see the
wire-format change. Eval harness section preserved below.
No source changes. Full suite stays at 1792 / 0 failures from the
prior commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>