Feat/2491 eval harness #66
Merged
#2491 epic — first two children landed together (#2492 + #2493).
Reproducible eval without live providers, with typed assertions over
the agent's `OUT`.
```kotlin
val mock = DeterministicModelClient(
    LlmResponse.ToolCalls(listOf(ToolCall("lookup", mapOf("id" to "42")))),
    LlmResponse.Text("found 42"),
)
val agent = agent<String, String>("test") {
    model { ollama("t"); client = mock }
    tools { tool("lookup", "lookup") { args -> "value-${args["id"]}" } }
    skills { skill<String, String>("s", "") { tools("lookup") } }
}
val case = eval<String, String>("answer-contains-42") {
    input("what is forty-two?")
    expect("nonempty") { it.isNotEmpty() }
    expect("mentions 42") { "42" in it }
}
val result = case.run(agent)
assertTrue(result.passed) { result.failureMessage }
```
`#2492 — DeterministicModelClient`:
- `agents_engine/testing/DeterministicModelClient.kt`. A `ModelClient`
that scripts responses in order, one per `chat` call.
- Construction: `DeterministicModelClient(LlmResponse, LlmResponse, ...)`
or `DeterministicModelClient(scripted: List<LlmResponse>)`.
- `requests` exposes the full message-list-per-call history so tests
can assert on the agent's conversation shape across turns.
- `remaining()` reports unconsumed responses — lets tests pin "agent
consumed exactly N turns."
- Exhaustion throws `DeterministicScriptExhausted(callIndex, scriptSize,
lastMessages)` so a test failure clearly names "turn N had no
scripted response."
- Streaming uses the default `ModelClient.chatStream` wrap (scripted
responses fold into the same Started → ArgsDelta → Finished → End
chunk sequence a native streaming provider would emit).
- Out of scope for v1: record-from-live capture (mentioned in the
ticket; needs an HTTP-fixture story we'll write when there's demand).
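The scripting contract described above (one response per call, recorded requests, `remaining()`, a named exhaustion error) can be sketched in a few lines. This is a minimal, self-contained illustration rather than the contents of `DeterministicModelClient.kt`: `Reply`, `ScriptedClient`, and `ScriptExhausted` are hypothetical stand-ins, and messages are simplified to plain strings.

```kotlin
// Hypothetical stand-in for the library's response type (illustration only).
sealed interface Reply {
    data class Text(val content: String) : Reply
    data class ToolCalls(val names: List<String>) : Reply
}

// Names the turn that had no scripted response, mirroring the exhaustion contract.
class ScriptExhausted(val callIndex: Int, val scriptSize: Int) :
    IllegalStateException("turn $callIndex had no scripted response (script size $scriptSize)")

class ScriptedClient(private val script: List<Reply>) {
    constructor(vararg replies: Reply) : this(replies.toList())

    private val recorded = mutableListOf<List<String>>()

    /** One entry per chat call: the message list sent on that turn. */
    val requests: List<List<String>> get() = recorded

    /** Number of scripted replies not yet consumed. */
    fun remaining(): Int = script.size - recorded.size

    fun chat(messages: List<String>): Reply {
        val index = recorded.size
        if (index >= script.size) throw ScriptExhausted(index, script.size)
        recorded.add(messages.toList()) // copy so later caller mutation cannot rewrite history
        return script[index]
    }
}

fun main() {
    val client = ScriptedClient(Reply.ToolCalls(listOf("lookup")), Reply.Text("found 42"))
    println(client.chat(listOf("what is forty-two?")))
    println(client.chat(listOf("what is forty-two?", "tool result: value-42")))
    println("remaining = ${client.remaining()}")
    try {
        client.chat(listOf("one more turn"))
    } catch (e: ScriptExhausted) {
        println(e.message)
    }
}
```

Because the reply is chosen purely by call index, two runs against the same script see identical responses, which is the property the byte-determinism test relies on.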
`#2493 — eval { } DSL`:
- `agents_engine/testing/EvalDsl.kt`. Builder-DSL for typed eval cases.
- `eval<IN, OUT>("name") { input(...); expect { ... } }` — typed
predicates over `OUT`, not string matching.
- `expectSnapshot(snapshot)` — pins the rendered `toLlmInput(output)`
JSON against a known string. Diff on regression.
- `expectFieldEquals(fieldPath, expected)` — single-field JSON
substring check, no full snapshot.
- Multiple `expect` blocks compose — all must pass; failure reports
name each failing label and render the typed output for diagnosis.
- Agent invocation exceptions captured as hard failures (the case
can't evaluate without an output).
- `evalSuite("name") { + case; + case; ... }.runAll(agent)` bundles
cases. Suite is type-homogeneous over the agent type at the call
site, so a mixed-shape suite is a compile error (good — catches
copy-paste bugs).
- `EvalResult.failureMessage` is null on pass, structured on fail —
drops straight into `assertTrue(result.passed) { result.failureMessage }`
in JUnit tests.
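To make the composition rules above concrete, here is a rough sketch of a labelled-predicate eval case. It is not the code in `EvalDsl.kt`: `evalCase`, `CaseBuilder`, `Case`, and `CaseResult` are hypothetical names, and the agent is modelled as a plain `(IN) -> OUT` function. It assumes only what the notes state: every `expect` must pass, failures name their labels, invocation exceptions are hard failures, and `failureMessage` is null on pass.

```kotlin
// Hypothetical stand-ins; the real DSL lives in the library's EvalDsl.kt.
class CaseResult<OUT>(val output: OUT?, val failedLabels: List<String>, val error: Throwable?) {
    val passed: Boolean get() = error == null && failedLabels.isEmpty()

    /** Null on pass; otherwise names the failing labels or the invocation error. */
    val failureMessage: String? get() = when {
        error != null -> "invocation failed: ${error?.message}"
        failedLabels.isNotEmpty() -> "failed: ${failedLabels.joinToString()}; output = $output"
        else -> null
    }
}

class Case<IN, OUT>(
    val name: String,
    private val input: IN,
    private val checks: List<Pair<String, (OUT) -> Boolean>>,
) {
    fun run(agent: (IN) -> OUT): CaseResult<OUT> {
        // An exception from the agent is a hard failure: there is no output to evaluate.
        val output = try {
            agent(input)
        } catch (e: Exception) {
            return CaseResult(null, emptyList(), e)
        }
        val failed = checks.filterNot { (_, predicate) -> predicate(output) }.map { it.first }
        return CaseResult(output, failed, null)
    }
}

class CaseBuilder<IN, OUT> {
    private var input: IN? = null
    private var hasInput = false
    private val checks = mutableListOf<Pair<String, (OUT) -> Boolean>>()

    fun input(value: IN) { input = value; hasInput = true }
    fun expect(label: String, predicate: (OUT) -> Boolean) { checks.add(label to predicate) }

    @Suppress("UNCHECKED_CAST")
    fun build(name: String): Case<IN, OUT> {
        require(hasInput) { "$name: input(...) is required" }
        require(checks.isNotEmpty()) { "$name: at least one expect(...) is required" }
        return Case(name, input as IN, checks.toList())
    }
}

fun <IN, OUT> evalCase(name: String, block: CaseBuilder<IN, OUT>.() -> Unit): Case<IN, OUT> =
    CaseBuilder<IN, OUT>().apply(block).build(name)

fun main() {
    val case = evalCase<String, String>("answer-contains-42") {
        input("what is forty-two?")
        expect("nonempty") { it.isNotEmpty() }
        expect("mentions 42") { "42" in it }
    }
    println(case.run { "found 42" }.failureMessage)
    println(case.run { "no idea" }.failureMessage)
}
```

Keeping the predicates typed over `OUT` means a check such as `{ it.approved }` is verified by the compiler, which string matching on rendered output cannot offer.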
Tests:
- DeterministicModelClientTest.kt (6 cases): scripted text response;
multi-turn tool round-trip; requests recording; exhaustion error;
remaining(); byte-determinism across two runs against the same
script.
- EvalDslTest.kt (10 cases): passing predicate; multi-expect (mix of
pass/fail); invocation error capture; snapshot pass; snapshot fail
with typed diff; expectFieldEquals; suite mode; input(...) required;
expect(...) required.
Full suite: 1772 tests across 7 modules, 0 failures.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…E, CHANGELOG
- docs/eval.md (new) — user-facing eval doc. DeterministicModelClient
walked through with the request-history + remaining()
+ exhaustion-error contract; the three expectation styles (typed
predicate / snapshot / expectFieldEquals); suite mode with the
type-homogeneity constraint; failure shape; the
no-network-end-to-end composition pattern; and the v1-scope
deferrals (record-from-live, per-token chunks).
- src/main/resources/internals-agent/testing/EvalHarness.md (new) —
IDE-side LLM adjunct covering both files in one place (eval-harness
is conceptually a single unit). Signatures, composition story,
failure modes, scope.
- README.md — adds an "Eval harness" bullet under "Implemented today"
between the public snapshot/resume and prompt-caching bullets.
- CHANGELOG.md `## [Unreleased]` — opens with two entries under
"Eval harness (#2491 epic, in progress)" — #2492
DeterministicModelClient and #2493 eval { } DSL — with the v1
scope notes inline.
No source changes. Full suite stays at 1772 / 0 failures from the
prior commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…/fail
#2494 — completes the #2491 eval epic. Adds an opt-in judge for
criteria that resist deterministic assertion (tone, relevance,
completeness). Explicitly advisory by design.
```kotlin
val rubric = JudgeRubric(
    criteria = "Tone: warm, professional, no jargon.",
    judgeModel = DeterministicModelClient(
        LlmResponse.Text("""{"score":8,"rationale":"clear and warm"}"""),
    ),
)
val case = eval<String, Review>("repo-review") {
    input(spec)
    expect("approved") { it.approved }  // ← gates pass/fail
    judge("tone", rubric)               // ← advisory; never gates
}
val result = case.run(agent)
result.passed                 // ← depends ONLY on expect blocks
result.judgeVerdicts["tone"]  // ← JudgeOutcome.Scored(JudgeVerdict)
result.judgeSummary           // ← "[advisory] tone: 8 — clear and warm"
```
Implementation:
- `agents_engine/testing/LlmJudge.kt` (new):
* `JudgeRubric(criteria, scoreRange = 0..10, judgeModel)` — typed
rubric config. The judge model is independent of the production
agent's model — for unit tests use `DeterministicModelClient`;
for live eval use a pinned cloud model.
* `JudgeVerdict(score: Int, rationale: String)` — `@Generable` so
the judge model returns structured JSON that the framework parses
through the existing `fromLlmOutput` pipeline, so a free-text judge
prompt never yields a free-text verdict.
* Internal `LlmJudge(rubric).score(input, output)` — renders a
system prompt + user message ("Input: X, Output: Y"), invokes
the judge model, parses the verdict, validates `score` is in
`rubric.scoreRange`, returns the typed verdict.
- `agents_engine/testing/EvalDsl.kt` (extended):
* `EvalCaseBuilder.judge(label, rubric)` — registers an advisory
scorer. Duplicate labels fail fast at builder time.
* `EvalCase` carries an immutable `judges: List<JudgeBinding>`.
Runs each after the agent succeeds; judges do NOT run when the
agent invocation itself fails (no output to score).
* `EvalResult.judgeVerdicts: Map<String, JudgeOutcome>` — captured
verdicts keyed by label. Sealed `JudgeOutcome { Scored(verdict)
| Errored(detail) }` — parse failures or out-of-range scores
surface as `Errored` but never gate `passed`.
* `EvalResult.passed` and `EvalResult.failureMessage` consider
ONLY deterministic `outcomes` and `invocationError`. Judges are
structurally excluded from the gating contract.
* `EvalResult.judgeSummary: String` — multi-line `[advisory]
<label>: <score> — <rationale>` per-judge summary for test
reports. Marked `[advisory]` so report consumers don't confuse
judges with the deterministic pass/fail.
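The advisory contract above can be illustrated with a small self-contained sketch. The names here (`Verdict`, `Outcome`, `Rubric`, `scoreWith`, `Result`) are hypothetical stand-ins, not the types in `LlmJudge.kt`, and the judge model is reduced to a function that returns an already-parsed verdict. The sketch assumes only the behaviour stated in the notes: out-of-range scores and judge failures become `Errored`, and `passed` never consults the verdicts.

```kotlin
// Hypothetical stand-ins for the judge types (illustration only).
data class Verdict(val score: Int, val rationale: String)

sealed interface Outcome {
    data class Scored(val verdict: Verdict) : Outcome
    data class Errored(val detail: String) : Outcome
}

// The judge is modelled as a plain function from prompt to parsed verdict.
class Rubric(
    val criteria: String,
    val scoreRange: IntRange = 0..10,
    val judge: (String) -> Verdict,
)

fun scoreWith(rubric: Rubric, input: String, output: String): Outcome =
    try {
        val verdict = rubric.judge("Criteria: ${rubric.criteria}\nInput: $input\nOutput: $output")
        if (verdict.score in rubric.scoreRange) Outcome.Scored(verdict)
        else Outcome.Errored("score ${verdict.score} outside ${rubric.scoreRange}")
    } catch (e: Exception) {
        // Parse or transport failures are reported, never thrown into the eval run.
        Outcome.Errored("judge failed: ${e.message}")
    }

class Result(val failedExpectations: List<String>, val judgeVerdicts: Map<String, Outcome>) {
    // Gating looks at deterministic expectations only; verdicts are structurally excluded.
    val passed: Boolean get() = failedExpectations.isEmpty()
}

fun main() {
    val low = Rubric("Tone: warm, professional, no jargon.") { Verdict(2, "curt") }
    val broken = Rubric("Relevance") { Verdict(11, "off the scale") }
    val verdicts = mapOf(
        "tone" to scoreWith(low, "spec", "LGTM"),
        "relevance" to scoreWith(broken, "spec", "LGTM"),
    )
    val result = Result(failedExpectations = emptyList(), judgeVerdicts = verdicts)
    println("passed = ${result.passed}")
    verdicts.forEach { (label, outcome) -> println("$label: $outcome") }
}
```

In this sketch a low score and an out-of-range score both leave `passed` true, since `passed` has no code path that reads `judgeVerdicts`.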
Tests (LlmJudgeTest.kt — 8 cases):
- Verdict captured on `judgeVerdicts[label]` as Scored
- Low judge score does NOT fail the case (advisory only)
- Judge parse error surfaces as Errored, doesn't gate pass/fail
- Out-of-range score surfaces as Errored
- Multiple judges per case keyed by label
- judgeSummary renders the [advisory] marker
- Judges do not run on agent invocation failure
- Duplicate judge labels fail fast at builder time
Eval epic (#2491) now feature-complete: deterministic mocks (#2492)
+ typed eval DSL (#2493) + advisory judge (#2494) ship as one
cohesive `agents_engine.testing` package.
Full suite: 1780 tests across 7 modules, 0 failures.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…HANGELOG
- docs/eval.md — adds a new "LLM-as-judge (advisory)" section after
DeterministicModelClient and before the eval DSL. Walks through: the
example with tone scoring; why advisory + opt-in (LLM judges are
themselves nondeterministic; gating them imports flakiness); pinning
the judge model (DeterministicModelClient for unit tests, pinned cloud
model for live eval); the sealed JudgeOutcome Scored/Errored failure
modes; what happens when the agent itself fails (no judges run).
Header summary updated to "three pieces."
- src/main/resources/internals-agent/testing/EvalHarness.md — adjunct
description string updated to cover all three pieces and the
judge-doesn't-gate constraint explicitly. Code-shape block adds the
judge() DSL line + JudgeOutcome / JudgeRubric / JudgeVerdict types.
Failure-modes section gains the judge errored case.
- README.md — extends the "Eval harness" bullet with the optional
judge(...) sentence and the explicit "judges never gate" callout.
- CHANGELOG.md — adds a third entry under "Eval harness (#2491 epic)"
for #2494 with the advisory-only semantics, sealed JudgeOutcome,
judge-model pinning, and agent-failure interaction. Header changes
from "in progress" to "feature-complete."
No source changes. Full suite stays at 1780 / 0 failures from the
prior commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>