6 changes: 6 additions & 0 deletions CHANGELOG.md
Expand Up @@ -4,6 +4,12 @@ All notable changes to Agents.KT are documented here. The format follows [Keep a

## [Unreleased]

### Added — Eval harness (#2491 epic, feature-complete)

- **`DeterministicModelClient` (#2492)** — `agents_engine.testing.DeterministicModelClient(scripted: List<LlmResponse>)` (or vararg ctor) hands back pre-scripted responses one per `chat` call. No network, byte-deterministic. `requests` records every message list the agent built up; `remaining()` reports unconsumed responses. Exhaustion throws `DeterministicScriptExhausted(callIndex, scriptSize, lastMessages)`. Streaming uses the default `ModelClient.chatStream` wrap. Out of scope for v1: record-from-live HTTP capture (mentioned in the ticket — needs an HTTP-fixture story we'll write when there's demand) and per-token chunk replay.
- **`eval { }` DSL (#2493)** — `agents_engine.testing.eval<IN, OUT>("name") { input(...); expect { ... } }` builds a typed eval case. Three expectation styles: `expect("label") { predicate }` (typed predicate over `OUT`), `expectSnapshot(snapshot = "...")` (pin canonical `toLlmInput(output)` JSON; diff on regression), `expectFieldEquals(field, value)` (single-field substring on rendered JSON). Multiple expects compose — all must pass. `EvalResult.failureMessage` is null on pass, structured on fail with per-expectation diagnostics. `evalSuite("name") { + case; + case }.runAll(agent)` bundles cases; type-homogeneous over the agent type at call time (mixed-shape suite is a compile error). Composes with `DeterministicModelClient` for fully reproducible end-to-end agentic-loop eval against typed `OUT`. See [docs/eval.md](docs/eval.md).
- **LLM-as-judge scorer (#2494)** — `agents_engine.testing.JudgeRubric(criteria, scoreRange, judgeModel)` + `@Generable JudgeVerdict(score, rationale)`. Opt-in via `eval { ... judge("tone", rubric) }`. Verdicts surface on `EvalResult.judgeVerdicts: Map<String, JudgeOutcome>` keyed by label; sealed `JudgeOutcome { Scored(verdict) | Errored(detail) }` so parse failures or out-of-range scores surface without aborting. `EvalResult.passed` is structurally restricted to deterministic `outcomes` + `invocationError` — judges NEVER gate pass/fail. `EvalResult.judgeSummary` renders `[advisory] <label>: <score> — <rationale>` lines for test reports. The judge model is independent of the production agent's model: use `DeterministicModelClient` for unit tests (so the judge itself is reproducible) or a pinned cloud model for live eval. Judges don't run when the agent itself fails (no output to score). See [docs/eval.md](docs/eval.md).

## [0.6.4] — 2026-05-30

**"Trust patch."** Outside auditor reviewed 0.6.3 at 7.5/10 with the verdict *"useful hardening release, but not a repositioning release."* 0.6.4 is the deliberate response: boring on features, focused on closing every real boundary gap the audit found. The tagline:
1 change: 1 addition & 0 deletions README.md
Expand Up @@ -154,6 +154,7 @@ These APIs work in `main`, are unit-tested, and are exercised by integration tes
- **Tool error recovery** — per-tool `onError`, per-skill default, agent default; built-in `escalate` and `throwException` agents. See [docs/error-recovery.md](docs/error-recovery.md).
- **Budget controls** — `budget { maxTurns; maxToolCalls; maxDuration; perToolTimeout; maxTokens; maxConsecutiveSameTool }` (`perToolTimeout` covers regular and session-aware tools; token counts cumulative across turns when the provider reports usage; `maxConsecutiveSameTool` catches LLM retry loops on a broken tool) (#637, #963, #969, #1903). `onBudgetExceeded { reason, currentLimit -> BudgetDecision.Extend(newLimit) }` raises a cap and continues instead of throwing — a long-running agent can grant itself more tool calls mid-run rather than failing (#2412). `BudgetDecision.Checkpoint` (#2749) is the third variant — pause at the cap, deliver a `SessionSnapshot` via the registered `onTurnCheckpoint` hook, throw a recoverable `BudgetCheckpointException`, and resume later via `agent.invokeSuspendResuming(input, resumeFrom = snapshot)` once the human approves a raise (no history replay).
- **Public snapshot / resume** — `agent.invokeSuspendResuming(input, resumeFrom = null, onTurnCheckpoint = null)` (#2749) is the public seam over the internal `executeAgentic(resumeFrom, onTurnCheckpoint)` primitives from #2416. With defaults it matches `invokeSuspend(input)` byte-for-byte; with `onTurnCheckpoint` set it captures a `SessionSnapshot` at every turn boundary; with `resumeFrom = snapshot` it continues an in-flight invocation without replaying history. On resume the loop honors `max(snapshot.toolCallLimit, agent.budget.maxToolCalls)` so a rebuilt agent with a raised cap actually picks it up.
- **Eval harness** — `DeterministicModelClient(LlmResponse.Text("..."), LlmResponse.ToolCalls(...))` (#2492) scripts model responses for reproducible eval without a live provider; the streaming flow folds into the same Started → ArgsDelta → Finished → End chunk sequence a native streaming provider would emit. Typed assertion DSL `eval<IN, OUT>("name") { input(...); expect { ... }; expectSnapshot(...) }` (#2493) runs against the parsed `OUT` — not regex on the wire. Snapshot mode pins `toLlmInput(output)` JSON for structural diffs; `evalSuite { + case; + case }` bundles cases. Optional `judge("tone", rubric)` (#2494) runs an advisory LLM-as-judge scorer with a typed `@Generable` `JudgeVerdict` — explicitly separate from the deterministic pass/fail contract (judges never gate). See [docs/eval.md](docs/eval.md).
- **Prompt caching across providers** — `agent { caching { enabled = true; cacheSystemPrompt = true; cacheToolDefs = true; cacheConversation = Rolling; ttl = 1.hours; cacheable("doc-id") { ... } } }`. Vendor-neutral DSL drives Anthropic's explicit `cache_control` breakpoints (#2658), OpenAI / DeepSeek automatic prefix caching with a stable `prompt_cache_key` routing hint (#2659 / #2661), Ollama / vLLM / SGLang engine-level KV-cache reuse (no-op hints, #2662), and surfaces cache reads + writes + hit-rate on `TokenUsage` (#2663). A prefix-stability guard (#2657) detects silent cache-busters — timestamps, UUIDs, non-deterministic ordering inside cacheable segments — and warns before you pay for a single non-cached run. Off by default; non-breaking. See [docs/caching.md](docs/caching.md).
- **JSONL audit exporter** — `:agents-kt-observability` writes append-only, one-line-per-event audit rows with `requestId`, `sessionId`, `manifestHash`, agent/skill/tool ids, event type, provider, and model; raw arguments/results are omitted by default (#1914). See [docs/observability.md](docs/observability.md).
- **ObservabilityBridge adapters** — `.observe(OtelBridge(tracer))` maps runtime events to OTel spans (#1908), `.observe(LangSmithBridge(apiKey, project))` maps the same events to LangSmith run trees (#1909), and `.observe(LangfuseBridge(publicKey, secretKey))` maps them to Langfuse traces, generations, spans, and events (#1910), while keeping core vendor-free. See [docs/observability.md](docs/observability.md).
224 changes: 224 additions & 0 deletions docs/eval.md
@@ -0,0 +1,224 @@
[← Back to README](../README.md)

# Eval harness

Three pieces ship today, layered:

- **`DeterministicModelClient`** (#2492) — a `ModelClient` that scripts responses, no network. Pairs with any agent so you can run the full agentic loop deterministically.
- **`eval { }` DSL** (#2493) — declarative cases with typed assertions over the agent's `OUT`. Supports per-field checks, full structural snapshots, and grouped suites.
- **LLM-as-judge** (#2494) — opt-in advisory scorer for criteria that resist deterministic assertion (tone, relevance, completeness). Typed rubric, structured `JudgeVerdict`, explicitly separate from the deterministic pass/fail contract.

All three live in package `agents_engine.testing` and ship in the main module — usable from any consumer's test source set without an extra artifact.

---

## `DeterministicModelClient`

Hands back a pre-scripted sequence of `LlmResponse`s, one per `chat` call. The agent's loop runs end-to-end against the script, and the streaming side emits the same Started → ArgsDelta → Finished → End chunk sequence (the default `ModelClient.chatStream` wraps `chat`).

```kotlin
import agents_engine.testing.DeterministicModelClient
import agents_engine.model.LlmResponse
import agents_engine.model.ToolCall

val mock = DeterministicModelClient(
    LlmResponse.ToolCalls(listOf(ToolCall("lookup", mapOf("id" to "42")))),
    LlmResponse.Text("found 42"),
)
val agent = agent<String, String>("test") {
    model { ollama("t"); client = mock }
    tools { tool("lookup", "lookup") { args -> "value-${args["id"]}" } }
    skills { skill<String, String>("s", "") { tools("lookup") } }
}

agent("what is 42?") // → "found 42"
mock.remaining() // → 0 (both scripted responses consumed)
mock.requests // List<List<LlmMessage>> — every `chat` call's input
```

### What you get

- **Byte-determinism.** Two runs against the same script + same agent + same input produce identical output.
- **Request history.** `mock.requests` records every message list the agent built up across turns. Useful for asserting on conversation shape.
- **Clear exhaustion errors.** If the agent calls `chat` more times than there are scripted responses, the client throws `DeterministicScriptExhausted(callIndex, scriptSize, lastMessages)` naming the offending turn.
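
The three guarantees above follow from a simple shape: a cursor over a fixed list, a log of inputs, and a dedicated exception when the cursor runs off the end. The sketch below illustrates that shape with plain strings. `ScriptedClient` and `ScriptExhausted` are hypothetical stand-ins written for this example, not the library's `DeterministicModelClient` or `DeterministicScriptExhausted`, and whether the real client records the request that triggers exhaustion is an assumption here.

```kotlin
// Hypothetical stand-in for the exhaustion error; carries the offending call index.
class ScriptExhausted(val callIndex: Int, val scriptSize: Int) :
    IllegalStateException("chat call #$callIndex exceeds a script of $scriptSize responses")

// Hypothetical stand-in for a scripted client: one response per chat call, in order.
class ScriptedClient(private val script: List<String>) {
    private val recorded = mutableListOf<List<String>>()
    private var cursor = 0

    // Every message list passed to chat, in call order.
    val requests: List<List<String>> get() = recorded

    // Number of scripted responses not yet handed out.
    fun remaining(): Int = script.size - cursor

    fun chat(messages: List<String>): String {
        recorded.add(messages.toList()) // assumption: the failing call is recorded too
        if (cursor >= script.size) throw ScriptExhausted(cursor, script.size)
        return script[cursor++]
    }
}

fun main() {
    val client = ScriptedClient(listOf("first", "second"))
    println(client.chat(listOf("hi")))   // first
    println(client.remaining())          // 1
    client.chat(listOf("hi", "again"))
    val failure = runCatching { client.chat(listOf("one too many")) }.exceptionOrNull()
    println(failure is ScriptExhausted)  // true
}
```

Because nothing in the client depends on time, randomness, or I/O, two runs with the same script and the same inputs cannot diverge.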

### Out of scope (v1)

- **Record-from-live.** The #2492 ticket mentions "record-once/replay-many." That needs an HTTP-fixture story we'll write when there's demand. For now: hand-script the responses or compose with a recording-decorator pattern in your own test code.
- **Per-token streaming chunks.** `chatStream` uses the default chunk-from-chat wrap — good enough for asserting on the streaming `AgentEvent` shape, not useful for testing provider-specific mid-stream edge cases.
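
One way to read "recording-decorator pattern in your own test code": wrap whatever client you already have, capture its responses once, and feed them to a scripted client on later runs. The sketch below shows the idea with a hypothetical `ChatClient` interface over plain strings; `RecordingClient` and `ReplayClient` are illustrative names, not library types, and persisting the capture to disk is left out.

```kotlin
// Hypothetical minimal client interface used only for this sketch.
fun interface ChatClient {
    fun chat(messages: List<String>): String
}

// Decorator: forwards every call to the delegate and keeps each response.
class RecordingClient(private val delegate: ChatClient) : ChatClient {
    private val captured = mutableListOf<String>()
    val responses: List<String> get() = captured

    override fun chat(messages: List<String>): String =
        delegate.chat(messages).also { captured.add(it) }
}

// Replays previously captured responses in order, ignoring the input.
class ReplayClient(recorded: List<String>) : ChatClient {
    private val pending = ArrayDeque(recorded)

    override fun chat(messages: List<String>): String =
        pending.removeFirstOrNull() ?: error("replay script exhausted")
}

fun main() {
    // Pretend provider; in a real test this would be the live client.
    val live = ChatClient { messages -> "echo: ${messages.last()}" }
    val recorder = RecordingClient(live)
    recorder.chat(listOf("hello"))

    val replay = ReplayClient(recorder.responses)
    println(replay.chat(listOf("anything")))  // echo: hello
}
```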

---

## LLM-as-judge (advisory)

For criteria that resist deterministic assertion — tone, relevance, completeness — opt into a `judge`. The judge runs after the agent succeeds, scores the (input, output) pair with a typed `@Generable` verdict, and surfaces on `EvalResult.judgeVerdicts`. **Judges never gate the case's pass/fail** — only deterministic `expect { }` blocks do.

```kotlin
import agents_engine.model.LlmResponse
import agents_engine.testing.DeterministicModelClient
import agents_engine.testing.JudgeRubric
import agents_engine.testing.eval

val toneRubric = JudgeRubric(
    criteria = "Tone: warm, professional, no jargon.",
    judgeModel = DeterministicModelClient(
        LlmResponse.Text("""{"score":8,"rationale":"clear and warm"}"""),
    ),
)

val case = eval<String, Review>("repo-review") {
    input(spec)                          // `spec`: a String input assumed to be defined by the test
    expect("approved") { it.approved }   // ← gates pass/fail
    judge("tone", toneRubric)            // ← advisory only
}

val result = case.run(reviewAgent)
result.passed // depends ONLY on `expect` blocks
result.judgeVerdicts["tone"] // JudgeOutcome.Scored(JudgeVerdict)
println(result.judgeSummary)
// [advisory] tone: 8 — clear and warm
```

### Why opt-in and advisory

LLM judges are themselves nondeterministic and prompt-sensitive. Treating them as gating regression checks would import the same flakiness the deterministic harness is designed to eliminate. The split is intentional:

- **Deterministic `expect`** ⇒ pass/fail contract. Reproducible across runs.
- **`judge`** ⇒ qualitative score for the report. Useful as a quality trend over time; never as a fail signal.

### Pinning the judge model

The `judgeModel` in `JudgeRubric` is a regular `ModelClient`:

- **Unit tests:** use `DeterministicModelClient` with a scripted verdict JSON. The judge call itself becomes reproducible.
- **Live eval:** use a pinned cloud model — explicit version + low temperature. Even then, drift between runs is expected; that's why the judge is advisory.

### Failure modes

`EvalResult.judgeVerdicts` carries `JudgeOutcome` for each registered judge — a sealed type:

| Variant | When |
|---|---|
| `JudgeOutcome.Scored(verdict: JudgeVerdict)` | Judge model returned valid JSON; score in range. |
| `JudgeOutcome.Errored(errorDetail: String)` | Judge model returned non-JSON, or returned a score outside `rubric.scoreRange`. |

Both surface in the report. Neither affects `EvalResult.passed`.
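
The two variants in the table can be pictured as the result of a small validation step over the judge's raw reply. The sketch below is only an illustration: `Outcome`, `Verdict`, and `classify` are hypothetical names, the score is assumed to be an integer checked against an `IntRange`, and the regex extraction stands in for whatever real JSON parsing the library performs.

```kotlin
// Hypothetical mirror of the verdict and outcome types, for illustration only.
data class Verdict(val score: Int, val rationale: String)

sealed interface Outcome {
    data class Scored(val verdict: Verdict) : Outcome
    data class Errored(val errorDetail: String) : Outcome
}

// Crude field extraction; a real implementation would use a JSON parser.
val scorePattern = Regex("\"score\"\\s*:\\s*(-?\\d+)")
val rationalePattern = Regex("\"rationale\"\\s*:\\s*\"([^\"]*)\"")

// Turns a raw judge reply into an outcome instead of throwing.
fun classify(raw: String, scoreRange: IntRange): Outcome {
    val score = scorePattern.find(raw)?.groupValues?.get(1)?.toIntOrNull()
        ?: return Outcome.Errored("no integer score in judge reply: $raw")
    if (score !in scoreRange) {
        return Outcome.Errored("score $score outside $scoreRange")
    }
    val rationale = rationalePattern.find(raw)?.groupValues?.get(1).orEmpty()
    return Outcome.Scored(Verdict(score, rationale))
}

fun main() {
    val range = 1..10
    println(classify("""{"score":8,"rationale":"clear and warm"}""", range))  // a Scored outcome
    println(classify("""{"score":42,"rationale":"too generous"}""", range))   // an Errored outcome
    println(classify("I think it was fine", range))                           // an Errored outcome
}
```

Returning a value for every reply, rather than throwing, is what lets a malformed verdict show up in the report without aborting the case.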

### Judges and agent failures

If the agent invocation itself throws (`EvalResult.invocationError` is set), no judges run — there's no output to score. The `judgeVerdicts` map is empty in that case.
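
The ordering described above (invoke, then deterministic checks, then judges, with judges skipped on failure) can be sketched in a few lines. Everything below is a hypothetical model built from plain lambdas: `RunResult` and `runCase` are not library types, and judges are reduced to functions returning an integer score.

```kotlin
// Hypothetical result type; note that judge scores never feed into `passed`.
class RunResult<OUT>(
    val output: OUT?,
    val invocationError: Throwable?,
    val failedLabels: List<String>,
    val judgeScores: Map<String, Int>,
) {
    val passed: Boolean get() = invocationError == null && failedLabels.isEmpty()
}

fun <IN, OUT> runCase(
    input: IN,
    agent: (IN) -> OUT,
    expectations: Map<String, (OUT) -> Boolean>,
    judges: Map<String, (IN, OUT) -> Int>,
): RunResult<OUT> {
    val output = try {
        agent(input)
    } catch (e: Exception) {
        // No output to score: skip expectations and judges entirely.
        return RunResult(null, e, emptyList(), emptyMap())
    }
    val failed = expectations.filterNot { (_, predicate) -> predicate(output) }.keys.toList()
    val scores = judges.mapValues { (_, judge) -> judge(input, output) }
    return RunResult(output, null, failed, scores)
}

fun main() {
    val expectations = mapOf("nonempty" to { s: String -> s.isNotEmpty() })
    val judges = mapOf("tone" to { _: String, _: String -> 2 })

    val ok = runCase("hi", { "hello" }, expectations, judges)
    println("${ok.passed} ${ok.judgeScores}")          // true {tone=2}

    val broken = runCase("hi", { error("provider down") }, expectations, judges)
    println("${broken.passed} ${broken.judgeScores}")  // false {}
}
```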

---

## `eval { }` DSL

Declarative cases with typed predicates over the agent's `OUT`.

```kotlin
import agents_engine.testing.eval

val case = eval<String, Review>("repo-review") {
    input("review this repository")   // IN is String, so the input is a plain string
    expect("nonempty risks") { it.risks.isNotEmpty() }
    expect("at least 3 risks") { it.risks.size >= 3 }
}

val result = case.run(reviewAgent)
assertTrue(result.passed) { result.failureMessage }
```

### Three expectation styles

```kotlin
// 1. Typed predicate — runs against the parsed OUT, not a string.
expect("approved") { it.approved == true }

// 2. Snapshot — pins the canonical toLlmInput(output) JSON.
expectSnapshot(snapshot = """{"text":"Hello","approved":true}""")

// 3. Single-field substring on the rendered JSON — quick for one field.
expectFieldEquals("approved", true)
```

All three compose: multiple `expect` blocks must all pass for the case to pass. The failure message names every failing label and renders the typed output for diagnosis.
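
The "all must pass, every failure is named" rule amounts to evaluating each labelled predicate against the typed output and collecting the labels that return false. The sketch below shows that rule in isolation. `Expectation`, `CaseResult`, and `runExpectations` are hypothetical names for this example, `Review` is a stand-in data class, and the message format is illustrative rather than the library's exact output.

```kotlin
// Hypothetical labelled predicate over a typed output.
class Expectation<OUT>(val label: String, val predicate: (OUT) -> Boolean)

// Hypothetical result: passes only when no label failed.
class CaseResult(val failedLabels: List<String>, private val rendered: String) {
    val passed: Boolean get() = failedLabels.isEmpty()
    val failureMessage: String?
        get() = if (passed) null
        else failedLabels.joinToString(separator = "\n") { "- $it failed for output: $rendered" }
}

// Evaluates every expectation; does not stop at the first failure.
fun <OUT> runExpectations(output: OUT, expectations: List<Expectation<OUT>>): CaseResult =
    CaseResult(
        failedLabels = expectations.filterNot { it.predicate(output) }.map { it.label },
        rendered = output.toString(),
    )

// Stand-in output type for the example.
data class Review(val approved: Boolean, val risks: List<String>)

fun main() {
    val expectations = listOf(
        Expectation<Review>("approved") { it.approved },
        Expectation<Review>("at least 3 risks") { it.risks.size >= 3 },
    )
    val result = runExpectations(Review(approved = true, risks = listOf("leaky token")), expectations)
    println(result.passed)          // false
    println(result.failureMessage)  // names the "at least 3 risks" expectation
}
```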

### Suite mode

Group cases:

```kotlin
import agents_engine.testing.evalSuite

class GreetingEvalTest {
    @Test
    fun `greeting suite`() {
        val suite = evalSuite("greeting") {
            + eval<String, String>("nonempty") {
                input("hi")
                expect("nonempty") { it.isNotEmpty() }
            }
            + eval<String, String>("polite") {
                input("hi")
                expect("contains hello") { "hello" in it.lowercase() }
            }
        }
        val result = suite.runAll(greetingAgent)
        assertTrue(result.passed) { result.failureSummary }
    }
}
```

Suites are **type-homogeneous over the agent type at call time** — `EvalSuite.runAll<IN, OUT>(agent: Agent<IN, OUT>)` binds the case types at the call site. A mixed-shape suite is a compile error.

### Failure shape

`EvalResult.failureMessage` is `null` on pass, structured on fail:

```
eval case "multi-fail" failed:
- starts with goodbye: [starts with goodbye] failed for output: "hello world"
```

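When the agent throws during invocation, the result carries `invocationError` and the message names the exception. With JUnit 5, assert with `assertTrue(result.passed) { result.failureMessage }`; with kotlin-test, whose `assertTrue` takes the message as a plain argument, use `assertTrue(result.passed, result.failureMessage)`.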

---

## Composition: deterministic eval end-to-end

```kotlin
class RepoReviewEvalTest {
    @Test
    fun `repo review hits the audit criteria`() {
        val mock = DeterministicModelClient(
            LlmResponse.Text("""{"text":"All good","approved":true,"risks":[]}"""),
        )
        val agent = agent<String, Review>("review") {
            model { ollama("test"); client = mock }
            skills { skill<String, Review>("review", "") { tools() } }
        }
        val case = eval<String, Review>("approved-no-risks") {
            input("review the repo")
            expect("approved") { it.approved }
            expect("no risks") { it.risks.isEmpty() }
        }
        val result = case.run(agent)
        assertTrue(result.passed, result.failureMessage)
    }
}
```

The combination of `DeterministicModelClient` + `eval { }` gives you:

- No network, no live LLM, no nondeterminism.
- Typed assertions against the agent's `OUT` (not regex on the wire).
- Pinning the model's response in source — when the prompt or schema changes, you update the script *and* the snapshot in the same diff.

For real-model regression coverage there's the existing `live-llm` / `live-cloud-api` tagged tests; those are nondeterministic by design and out of scope for the eval harness.

---

## Related docs

- [`docs/testing.md`](testing.md) — existing testing conventions (task names, integration test setup, mutation testing).
- [`docs/observability.md`](observability.md) — the bridges that consume `AgentEvent` and `PipelineEvent` — useful when you're asserting on the streaming flow during eval.

Sources: `agents_engine/testing/DeterministicModelClient.kt`, `agents_engine/testing/EvalDsl.kt`.

Tests: `DeterministicModelClientTest.kt`, `EvalDslTest.kt`.