Deep-CodeAI · Skobeltsyn · May 30, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -4,6 +4,12 @@ All notable changes to Agents.KT are documented here. The format follows [Keep a
 
 ## [Unreleased]
 
+### Added — Eval harness (#2491 epic, feature-complete)
+
+- **`DeterministicModelClient` (#2492)** — `agents_engine.testing.DeterministicModelClient(scripted: List<LlmResponse>)` (or vararg ctor) hands back pre-scripted responses one per `chat` call. No network, byte-deterministic. `requests` records every message list the agent built up; `remaining()` reports unconsumed responses. Exhaustion throws `DeterministicScriptExhausted(callIndex, scriptSize, lastMessages)`. Streaming uses the default `ModelClient.chatStream` wrap. Out of scope for v1: record-from-live HTTP capture (mentioned in the ticket — needs an HTTP-fixture story we'll write when there's demand) and per-token chunk replay.
+- **`eval { }` DSL (#2493)** — `agents_engine.testing.eval<IN, OUT>("name") { input(...); expect { ... } }` builds a typed eval case. Three expectation styles: `expect("label") { predicate }` (typed predicate over `OUT`), `expectSnapshot(snapshot = "...")` (pin canonical `toLlmInput(output)` JSON; diff on regression), `expectFieldEquals(field, value)` (single-field substring on rendered JSON). Multiple expects compose — all must pass. `EvalResult.failureMessage` is null on pass, structured on fail with per-expectation diagnostics. `evalSuite("name") { + case; + case }.runAll(agent)` bundles cases; type-homogeneous over the agent type at call time (mixed-shape suite is a compile error). Composes with `DeterministicModelClient` for fully reproducible end-to-end agentic-loop eval against typed `OUT`. See [docs/eval.md](docs/eval.md).
+- **LLM-as-judge scorer (#2494)** — `agents_engine.testing.JudgeRubric(criteria, scoreRange, judgeModel)` + `@Generable JudgeVerdict(score, rationale)`. Opt-in via `eval { ... judge("tone", rubric) }`. Verdicts surface on `EvalResult.judgeVerdicts: Map<String, JudgeOutcome>` keyed by label; sealed `JudgeOutcome { Scored(verdict) | Errored(detail) }` so parse failures or out-of-range scores surface without aborting. `EvalResult.passed` is structurally restricted to deterministic `outcomes` + `invocationError` — judges NEVER gate pass/fail. `EvalResult.judgeSummary` renders `[advisory] <label>: <score> — <rationale>` lines for test reports. The judge model is independent of the production agent's model: use `DeterministicModelClient` for unit tests (so the judge itself is reproducible) or a pinned cloud model for live eval. Judges don't run when the agent itself fails (no output to score). See [docs/eval.md](docs/eval.md).
+
 ## [0.6.4] — 2026-05-30
 
 **"Trust patch."** Outside auditor reviewed 0.6.3 at 7.5/10 with the verdict *"useful hardening release, but not a repositioning release."* 0.6.4 is the deliberate response: boring on features, focused on closing every real boundary gap the audit found. The tagline:

diff --git a/README.md b/README.md
@@ -154,6 +154,7 @@ These APIs work in `main`, are unit-tested, and are exercised by integration tes
 - **Tool error recovery** — per-tool `onError`, per-skill default, agent default; built-in `escalate` and `throwException` agents. See [docs/error-recovery.md](docs/error-recovery.md).
 - **Budget controls** — `budget { maxTurns; maxToolCalls; maxDuration; perToolTimeout; maxTokens; maxConsecutiveSameTool }` (`perToolTimeout` covers regular and session-aware tools; token counts cumulative across turns when the provider reports usage; `maxConsecutiveSameTool` catches LLM retry loops on a broken tool) (#637, #963, #969, #1903). `onBudgetExceeded { reason, currentLimit -> BudgetDecision.Extend(newLimit) }` raises a cap and continues instead of throwing — a long-running agent can grant itself more tool calls mid-run rather than failing (#2412). `BudgetDecision.Checkpoint` (#2749) is the third variant — pause at the cap, deliver a `SessionSnapshot` via the registered `onTurnCheckpoint` hook, throw a recoverable `BudgetCheckpointException`, and resume later via `agent.invokeSuspendResuming(input, resumeFrom = snapshot)` once the human approves a raise (no history replay).
 - **Public snapshot / resume** — `agent.invokeSuspendResuming(input, resumeFrom = null, onTurnCheckpoint = null)` (#2749) is the public seam over the internal `executeAgentic(resumeFrom, onTurnCheckpoint)` primitives from #2416. With defaults it matches `invokeSuspend(input)` byte-for-byte; with `onTurnCheckpoint` set it captures a `SessionSnapshot` at every turn boundary; with `resumeFrom = snapshot` it continues an in-flight invocation without replaying history. On resume the loop honors `max(snapshot.toolCallLimit, agent.budget.maxToolCalls)` so a rebuilt agent with a raised cap actually picks it up.
+- **Eval harness** — `DeterministicModelClient(LlmResponse.Text("..."), LlmResponse.ToolCalls(...))` (#2492) scripts model responses for reproducible eval without a live provider; the streaming flow folds into the same Started → ArgsDelta → Finished → End chunk sequence a native streaming provider would emit. Typed assertion DSL `eval<IN, OUT>("name") { input(...); expect { ... }; expectSnapshot(...) }` (#2493) runs against the parsed `OUT` — not regex on the wire. Snapshot mode pins `toLlmInput(output)` JSON for structural diffs; `evalSuite { + case; + case }` bundles cases. Optional `judge("tone", rubric)` (#2494) runs an advisory LLM-as-judge scorer with a typed `@Generable` `JudgeVerdict` — explicitly separate from the deterministic pass/fail contract (judges never gate). See [docs/eval.md](docs/eval.md).
 - **Prompt caching across providers** — `agent { caching { enabled = true; cacheSystemPrompt = true; cacheToolDefs = true; cacheConversation = Rolling; ttl = 1.hours; cacheable("doc-id") { ... } } }`. Vendor-neutral DSL drives Anthropic's explicit `cache_control` breakpoints (#2658), OpenAI / DeepSeek automatic prefix caching with a stable `prompt_cache_key` routing hint (#2659 / #2661), Ollama / vLLM / SGLang engine-level KV-cache reuse (no-op hints, #2662), and surfaces cache reads + writes + hit-rate on `TokenUsage` (#2663). A prefix-stability guard (#2657) detects silent cache-busters — timestamps, UUIDs, non-deterministic ordering inside cacheable segments — and warns before you pay for a single non-cached run. Off by default; non-breaking. See [docs/caching.md](docs/caching.md).
 - **JSONL audit exporter** — `:agents-kt-observability` writes append-only, one-line-per-event audit rows with `requestId`, `sessionId`, `manifestHash`, agent/skill/tool ids, event type, provider, and model; raw arguments/results are omitted by default (#1914). See [docs/observability.md](docs/observability.md).
 - **ObservabilityBridge adapters** — `.observe(OtelBridge(tracer))` maps runtime events to OTel spans (#1908), `.observe(LangSmithBridge(apiKey, project))` maps the same events to LangSmith run trees (#1909), and `.observe(LangfuseBridge(publicKey, secretKey))` maps them to Langfuse traces, generations, spans, and events (#1910), while keeping core vendor-free. See [docs/observability.md](docs/observability.md).

diff --git a/docs/eval.md b/docs/eval.md
@@ -0,0 +1,224 @@
+[← Back to README](../README.md)
+
+# Eval harness
+
+Three pieces ship today, layered:
+
+- **`DeterministicModelClient`** (#2492) — a `ModelClient` that scripts responses, no network. Pairs with any agent so you can run the full agentic loop deterministically.
+- **`eval { }` DSL** (#2493) — declarative cases with typed assertions over the agent's `OUT`. Supports per-field checks, full structural snapshots, and grouped suites.
+- **LLM-as-judge** (#2494) — opt-in advisory scorer for criteria that resist deterministic assertion (tone, relevance, completeness). Typed rubric, structured `JudgeVerdict`, explicitly separate from the deterministic pass/fail contract.
+
+All three live in package `agents_engine.testing` and ship in the main module — usable from any consumer's test source set without an extra artifact.
+
+---
+
+## `DeterministicModelClient`
+
+Hands back a pre-scripted sequence of `LlmResponse`s, one per `chat` call, so the agent's loop runs end-to-end against the script. On the streaming side, the default `ModelClient.chatStream` wraps `chat`, so consumers see the same Started → ArgsDelta → Finished → End chunk sequence a native streaming provider would emit.
+
+```kotlin
+import agents_engine.testing.DeterministicModelClient
+import agents_engine.model.LlmResponse
+import agents_engine.model.ToolCall
+
+val mock = DeterministicModelClient(
+    LlmResponse.ToolCalls(listOf(ToolCall("lookup", mapOf("id" to "42")))),
+    LlmResponse.Text("found 42"),
+)
+val agent = agent<String, String>("test") {
+    model { ollama("t"); client = mock }
+    tools { tool("lookup", "lookup") { args -> "value-${args["id"]}" } }
+    skills { skill<String, String>("s", "") { tools("lookup") } }
+}
+
+agent("what is 42?")    // → "found 42"
+mock.remaining()        // → 0 (both scripted responses consumed)
+mock.requests           // List<List<LlmMessage>> — every `chat` call's input
+```
+
+### What you get
+
+- **Byte-determinism.** Two runs against the same script + same agent + same input produce identical output.
+- **Request history.** `mock.requests` records every message list the agent built up across turns. Useful for asserting on conversation shape.
+- **Clear exhaustion errors.** If the agent calls `chat` more times than there are scripted responses, the client throws `DeterministicScriptExhausted(callIndex, scriptSize, lastMessages)` naming the offending turn.
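+
+The contract described above can be modelled in a few lines. The sketch below is a self-contained stand-in, not the library's implementation: `ScriptedClient`, `ScriptExhausted`, and the plain `String` message and response types are hypothetical names chosen for illustration. Only `requests`, `remaining()`, and the three exhaustion fields mirror names documented here.
+
+```kotlin
+// Hypothetical stand-in: models the documented contract, not the real DeterministicModelClient.
+class ScriptExhausted(
+    val callIndex: Int,
+    val scriptSize: Int,
+    val lastMessages: List<String>,
+) : IllegalStateException("chat call #$callIndex exceeds script of size $scriptSize")
+
+class ScriptedClient(vararg scripted: String) {
+    private val script = scripted.toList()
+    private val recorded = mutableListOf<List<String>>()
+
+    // Every message list passed to a successful chat call, in call order.
+    val requests: List<List<String>> get() = recorded
+
+    // Number of scripted responses not yet handed out.
+    fun remaining(): Int = script.size - recorded.size
+
+    fun chat(messages: List<String>): String {
+        val index = recorded.size
+        // Fail loudly, naming the offending call, once the script runs out.
+        if (index >= script.size) throw ScriptExhausted(index, script.size, messages)
+        recorded.add(messages.toList())
+        return script[index]
+    }
+}
+
+fun main() {
+    val client = ScriptedClient("tool:lookup", "found 42")
+    println(client.chat(listOf("what is 42?")))
+    println(client.chat(listOf("what is 42?", "value-42")))
+    println(client.remaining())
+    try {
+        client.chat(listOf("one call too many"))
+    } catch (e: ScriptExhausted) {
+        println("exhausted at call ${e.callIndex} of ${e.scriptSize}")
+    }
+}
+```
+
+Because the script is an immutable list indexed by call count, two runs with the same inputs walk the same path, which is the property the real client relies on for reproducible evals.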
+
+### Out of scope (v1)
+
+- **Record-from-live.** The #2492 ticket mentions "record-once/replay-many." That needs an HTTP-fixture story we'll write when there's demand. For now: hand-script the responses or compose with a recording-decorator pattern in your own test code.
+- **Per-token streaming chunks.** `chatStream` uses the default chunk-from-chat wrap — good enough for asserting on the streaming `AgentEvent` shape, not useful for testing provider-specific mid-stream edge cases.
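+
+The recording-decorator pattern mentioned above might look roughly like the sketch below. It is an illustration under assumptions: `ChatClient` is a hypothetical single-method interface standing in for `ModelClient` (whose real signature is not reproduced here), and `RecordingClient` is a name invented for this example.
+
+```kotlin
+// Hypothetical minimal interface; the real ModelClient signature is not reproduced here.
+fun interface ChatClient {
+    fun chat(messages: List<String>): String
+}
+
+// Decorator: forwards to a delegate and keeps each response so it can be replayed later.
+class RecordingClient(private val delegate: ChatClient) : ChatClient {
+    private val captured = mutableListOf<String>()
+
+    // Responses returned by the delegate so far, in call order.
+    val recorded: List<String> get() = captured
+
+    override fun chat(messages: List<String>): String =
+        delegate.chat(messages).also { captured.add(it) }
+}
+
+fun main() {
+    // Stand-in for a live provider; in practice this would be a real client.
+    val live = ChatClient { messages -> "echo: ${messages.last()}" }
+    val recorder = RecordingClient(live)
+
+    recorder.chat(listOf("hi"))
+    recorder.chat(listOf("hi", "again"))
+
+    println(recorder.recorded)   // [echo: hi, echo: again]
+}
+```
+
+The captured responses could then be pasted into a scripted client by hand, which keeps the replay side deterministic while the harness itself stays free of HTTP fixtures.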
+
+---
+
+## LLM-as-judge (advisory)
+
+For criteria that resist deterministic assertion — tone, relevance, completeness — opt into a `judge`. The judge runs after the agent succeeds, scores the (input, output) pair with a typed `@Generable` verdict, and surfaces on `EvalResult.judgeVerdicts`. **Judges never gate the case's pass/fail** — only deterministic `expect { }` blocks do.
+
+```kotlin
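+import agents_engine.model.LlmResponse
+import agents_engine.testing.DeterministicModelClient
+import agents_engine.testing.JudgeRubric
+import agents_engine.testing.eval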
+
+val toneRubric = JudgeRubric(
+    criteria = "Tone: warm, professional, no jargon.",
+    judgeModel = DeterministicModelClient(
+        LlmResponse.Text("""{"score":8,"rationale":"clear and warm"}"""),
+    ),
+)
+
+val case = eval<String, Review>("repo-review") {
+    input(spec)
+    expect("approved") { it.approved }       // ← gates pass/fail
+    judge("tone", toneRubric)                // ← advisory only
+}
+
+val result = case.run(reviewAgent)
+result.passed                                // depends ONLY on `expect` blocks
+result.judgeVerdicts["tone"]                 // JudgeOutcome.Scored(JudgeVerdict)
+println(result.judgeSummary)
+// [advisory] tone: 8 — clear and warm
+```
+
+### Why opt-in and advisory
+
+LLM judges are themselves nondeterministic and prompt-sensitive. Treating them as gating regression checks would import the same flakiness the deterministic harness is designed to eliminate. The split is intentional:
+
+- **Deterministic `expect`** ⇒ pass/fail contract. Reproducible across runs.
+- **`judge`** ⇒ qualitative score for the report. Useful as a quality trend over time; never as a fail signal.
+
+### Pinning the judge model
+
+The `judgeModel` in `JudgeRubric` is a regular `ModelClient`:
+
+- **Unit tests:** use `DeterministicModelClient` with a scripted verdict JSON. The judge call itself becomes reproducible.
+- **Live eval:** use a pinned cloud model — explicit version + low temperature. Even then, drift between runs is expected; that's why the judge is advisory.
+
+### Failure modes
+
+`EvalResult.judgeVerdicts` carries `JudgeOutcome` for each registered judge — a sealed type:
+
+| Variant | When |
+|---|---|
+| `JudgeOutcome.Scored(verdict: JudgeVerdict)` | Judge model returned valid JSON; score in range. |
+| `JudgeOutcome.Errored(errorDetail: String)` | Judge model returned non-JSON, or returned a score outside `rubric.scoreRange`. |
+
+Both surface in the report. Neither affects `EvalResult.passed`.
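+
+The two variants and their independence from pass/fail can be sketched as follows. This is a self-contained illustration, not the library's code: `Verdict`, `Outcome`, `classify`, and `passed` are hypothetical stand-ins, a `null` verdict models a reply that failed to parse, and the `1..10` range is an assumed value.
+
+```kotlin
+// Hypothetical stand-ins mirroring the documented shapes; not the library's definitions.
+data class Verdict(val score: Int, val rationale: String)
+
+sealed interface Outcome {
+    data class Scored(val verdict: Verdict) : Outcome
+    data class Errored(val errorDetail: String) : Outcome
+}
+
+// Classify a judge reply. A null verdict models a reply that was not valid verdict JSON.
+fun classify(parsed: Verdict?, scoreRange: IntRange): Outcome = when {
+    parsed == null -> Outcome.Errored("judge reply was not valid verdict JSON")
+    parsed.score !in scoreRange -> Outcome.Errored("score ${parsed.score} outside $scoreRange")
+    else -> Outcome.Scored(parsed)
+}
+
+// Pass/fail looks only at deterministic outcomes and the invocation error; judges are ignored.
+fun passed(expectOutcomes: List<Boolean>, invocationError: Throwable?): Boolean =
+    invocationError == null && expectOutcomes.all { it }
+
+fun main() {
+    val range = 1..10   // assumed range, for illustration only
+    println(classify(Verdict(8, "clear and warm"), range))
+    println(classify(Verdict(42, "too generous"), range))
+    println(classify(null, range))
+    println(passed(listOf(true, true), invocationError = null))
+}
+```
+
+Keeping judge outcomes out of the `passed` signature entirely, rather than filtering them at runtime, is one way to make "judges never gate" a structural guarantee.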
+
+### Judges and agent failures
+
+If the agent invocation itself throws (`EvalResult.invocationError` is set), no judges run — there's no output to score. The `judgeVerdicts` map is empty in that case.
+
+---
+
+## `eval { }` DSL
+
+Declarative cases with typed predicates over the agent's `OUT`.
+
+```kotlin
+import agents_engine.testing.eval
+
+val case = eval<String, Review>("repo-review") {
+    input("review this repository")
+    expect("nonempty risks") { it.risks.isNotEmpty() }
+    expect("at least 3 risks") { it.risks.size >= 3 }
+}
+
+val result = case.run(reviewAgent)
+assertTrue(result.passed) { result.failureMessage }
+```
+
+### Three expectation styles
+
+```kotlin
+// 1. Typed predicate — runs against the parsed OUT, not a string.
+expect("approved") { it.approved == true }
+
+// 2. Snapshot — pins the canonical toLlmInput(output) JSON.
+expectSnapshot(snapshot = """{"text":"Hello","approved":true}""")
+
+// 3. Single-field substring on the rendered JSON — quick for one field.
+expectFieldEquals("approved", true)
+```
+
+All three compose: multiple `expect` blocks must all pass for the case to pass. The failure message names every failing label and renders the typed output for diagnosis.
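+
+The all-must-pass rule and the labelled failure message can be modelled with a short stand-in. Everything here is hypothetical and for illustration only: `Expectation` and `failureMessage` are invented names, and the message layout is an approximation rather than the exact format the harness prints.
+
+```kotlin
+// Hypothetical stand-in for labelled, typed expectations; not the library's eval DSL.
+class Expectation<OUT>(val label: String, val predicate: (OUT) -> Boolean)
+
+// Returns null when every expectation holds, otherwise a message naming each failing label.
+fun <OUT> failureMessage(
+    caseName: String,
+    output: OUT,
+    expectations: List<Expectation<OUT>>,
+): String? {
+    val failed = expectations.filterNot { it.predicate(output) }
+    if (failed.isEmpty()) return null
+    return buildString {
+        append("eval case \"$caseName\" failed:")
+        for (e in failed) append("\n  - ${e.label}: failed for output: $output")
+    }
+}
+
+fun main() {
+    val checks = listOf(
+        Expectation<String>("starts with hello") { it.startsWith("hello") },
+        Expectation<String>("starts with goodbye") { it.startsWith("goodbye") },
+    )
+    // One of two expectations fails, so the case fails and names that label.
+    println(failureMessage("multi-fail", "hello world", checks))
+    // With only the passing expectation, the message is null.
+    println(failureMessage("all-pass", "hello world", checks.take(1)))
+}
+```
+
+Evaluating every expectation instead of stopping at the first failure is what lets a single run report all failing labels at once.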
+
+### Suite mode
+
+Group cases:
+
+```kotlin
+import agents_engine.testing.eval
+import agents_engine.testing.evalSuite
+
+class GreetingEvalTest {
+    @Test
+    fun `greeting suite`() {
+        val suite = evalSuite("greeting") {
+            + eval<String, String>("nonempty") {
+                input("hi")
+                expect("nonempty") { it.isNotEmpty() }
+            }
+            + eval<String, String>("polite") {
+                input("hi")
+                expect("contains hello") { "hello" in it.lowercase() }
+            }
+        }
+        val result = suite.runAll(greetingAgent)
+        assertTrue(result.passed) { result.failureSummary }
+    }
+}
+```
+
+Suites are **type-homogeneous over the agent type at call time** — `EvalSuite.runAll<IN, OUT>(agent: Agent<IN, OUT>)` binds the case types at the call site. A mixed-shape suite is a compile error.
+
+### Failure shape
+
+`EvalResult.failureMessage` is `null` on pass, structured on fail:
+
+```
+eval case "multi-fail" failed:
+  - starts with goodbye: [starts with goodbye] failed for output: "hello world"
+```
+
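+When the agent throws during invocation, the result carries `invocationError` and the message names the exception. Use as `assertTrue(result.passed, result.failureMessage)`, which compiles against both JUnit 5 and kotlin-test; the lazy-message form `assertTrue(result.passed) { result.failureMessage }` is accepted by JUnit 5 only.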
+
+---
+
+## Composition: deterministic eval end-to-end
+
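+```kotlin
+import agents_engine.model.LlmResponse
+import agents_engine.testing.DeterministicModelClient
+import agents_engine.testing.eval
+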
+class RepoReviewEvalTest {
+    @Test
+    fun `repo review hits the audit criteria`() {
+        val mock = DeterministicModelClient(
+            LlmResponse.Text("""{"text":"All good","approved":true,"risks":[]}"""),
+        )
+        val agent = agent<String, Review>("review") {
+            model { ollama("test"); client = mock }
+            skills { skill<String, Review>("review", "") { tools() } }
+        }
+        val case = eval<String, Review>("approved-no-risks") {
+            input("review the repo")
+            expect("approved") { it.approved }
+            expect("no risks") { it.risks.isEmpty() }
+        }
+        val result = case.run(agent)
+        assertTrue(result.passed, result.failureMessage)
+    }
+}
+```
+
+The combination of `DeterministicModelClient` + `eval { }` gives you:
+
+- No network, no live LLM, no nondeterminism.
+- Typed assertions against the agent's `OUT` (not regex on the wire).
+- Pinning the model's response in source — when the prompt or schema changes, you update the script *and* the snapshot in the same diff.
+
+For real-model regression coverage there's the existing `live-llm` / `live-cloud-api` tagged tests; those are nondeterministic by design and out of scope for the eval harness.
+
+---
+
+## Related docs
+
+- [`docs/testing.md`](testing.md) — existing testing conventions (task names, integration test setup, mutation testing).
+- [`docs/observability.md`](observability.md) — the bridges that consume `AgentEvent` and `PipelineEvent` — useful when you're asserting on the streaming flow during eval.
+
+Sources: `agents_engine/testing/DeterministicModelClient.kt`, `agents_engine/testing/EvalDsl.kt`.
+
+Tests: `DeterministicModelClientTest.kt`, `EvalDslTest.kt`.