diff --git a/CHANGELOG.md b/CHANGELOG.md
index 4033de5..86f9065 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -4,6 +4,12 @@ All notable changes to Agents.KT are documented here. The format follows [Keep a
 ## [Unreleased]
+### Added — Eval harness (#2491 epic, feature-complete)
+
+- **`DeterministicModelClient` (#2492)** — `agents_engine.testing.DeterministicModelClient(scripted: List)` (or vararg ctor) hands back pre-scripted responses one per `chat` call. No network, byte-deterministic. `requests` records every message list the agent built up; `remaining()` reports unconsumed responses. Exhaustion throws `DeterministicScriptExhausted(callIndex, scriptSize, lastMessages)`. Streaming uses the default `ModelClient.chatStream` wrap. Out of scope for v1: record-from-live HTTP capture (mentioned in the ticket — needs an HTTP-fixture story we'll write when there's demand) and per-token chunk replay.
+- **`eval { }` DSL (#2493)** — `agents_engine.testing.eval("name") { input(...); expect { ... } }` builds a typed eval case. Three expectation styles: `expect("label") { predicate }` (typed predicate over `OUT`), `expectSnapshot(snapshot = "...")` (pin canonical `toLlmInput(output)` JSON; diff on regression), `expectFieldEquals(field, value)` (single-field substring on rendered JSON). Multiple expects compose — all must pass. `EvalResult.failureMessage` is null on pass, structured on fail with per-expectation diagnostics. `evalSuite("name") { + case; + case }.runAll(agent)` bundles cases; type-homogeneous over the agent type at call time (mixed-shape suite is a compile error). Composes with `DeterministicModelClient` for fully reproducible end-to-end agentic-loop eval against typed `OUT`. See [docs/eval.md](docs/eval.md).
+- **LLM-as-judge scorer (#2494)** — `agents_engine.testing.JudgeRubric(criteria, scoreRange, judgeModel)` + `@Generable JudgeVerdict(score, rationale)`. Opt-in via `eval { ... judge("tone", rubric) }`.
Verdicts surface on `EvalResult.judgeVerdicts: Map` keyed by label; sealed `JudgeOutcome { Scored(verdict) | Errored(detail) }` so parse failures or out-of-range scores surface without aborting. `EvalResult.passed` is structurally restricted to deterministic `outcomes` + `invocationError` — judges NEVER gate pass/fail. `EvalResult.judgeSummary` renders `[advisory]