Feat/2491 eval harness by Skobeltsyn · Pull Request #66 · Deep-CodeAI/Agents.KT

Skobeltsyn · 2026-05-30T09:20:50Z

No description provided.

#2491 epic: first two children landed together (#2492 + #2493). Reproducible eval without live providers, with typed assertions over the agent's `OUT`.

```kotlin
val mock = DeterministicModelClient(
    LlmResponse.ToolCalls(listOf(ToolCall("lookup", mapOf("id" to "42")))),
    LlmResponse.Text("found 42"),
)
val agent = agent<String, String>("test") {
    model { ollama("t"); client = mock }
    tools { tool("lookup", "lookup") { args -> "value-${args["id"]}" } }
    skills { skill<String, String>("s", "") { tools("lookup") } }
}
val case = eval<String, String>("answer-contains-42") {
    input("what is forty-two?")
    expect("nonempty") { it.isNotEmpty() }
    expect("mentions 42") { "42" in it }
}
val result = case.run(agent)
assertTrue(result.passed) { result.failureMessage }
```

`#2492 — DeterministicModelClient`:
- `agents_engine/testing/DeterministicModelClient.kt`. A `ModelClient` that scripts responses in order, one per `chat` call.
- Construction: `DeterministicModelClient(LlmResponse, LlmResponse, ...)` or `DeterministicModelClient(scripted: List<LlmResponse>)`.
- `requests` exposes the full message-list-per-call history so tests can assert on the agent's conversation shape across turns.
- `remaining()` reports unconsumed responses, which lets tests pin "agent consumed exactly N turns."
- Exhaustion throws `DeterministicScriptExhausted(callIndex, scriptSize, lastMessages)` so a test failure clearly names "turn N had no scripted response."
- Streaming uses the default `ModelClient.chatStream` wrap (scripted responses fold into the same Started → ArgsDelta → Finished → End chunk sequence a native streaming provider would emit).
- Out of scope for v1: record-from-live capture (mentioned in the ticket; needs an HTTP-fixture story we'll write when there's demand).

`#2493 — eval { } DSL`:
- `agents_engine/testing/EvalDsl.kt`. Builder-DSL for typed eval cases.
- `eval<IN, OUT>("name") { input(...); expect { ... } }`: typed predicates over `OUT`, not string matching.
- `expectSnapshot(snapshot)`: pins the rendered `toLlmInput(output)` JSON against a known string. Diff on regression.
- `expectFieldEquals(fieldPath, expected)`: single-field JSON substring check, no full snapshot.
- Multiple `expect` blocks compose: all must pass; failure reports name each failing label and render the typed output for diagnosis.
- Agent invocation exceptions are captured as hard failures (the case can't evaluate without an output).
- `evalSuite("name") { + case; + case; ... }.runAll(agent)` bundles cases. A suite is type-homogeneous over the agent type at the call site, so a mixed-shape suite is a compile error (good, since it catches copy-paste bugs).
- `EvalResult.failureMessage` is null on pass, structured on fail, so it drops straight into `assertTrue(result.passed) { result.failureMessage }` in JUnit tests.

Tests:
- DeterministicModelClientTest.kt (6 cases): scripted text response; multi-turn tool round-trip; requests recording; exhaustion error; remaining(); byte-determinism across two runs against the same script.
- EvalDslTest.kt (10 cases): passing predicate; multi-expect (mix of pass/fail); invocation error capture; snapshot pass; snapshot fail with typed diff; expectFieldEquals; suite mode; input(...) required; expect(...) required.

Full suite: 1772 tests across 7 modules, 0 failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
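The scripted-client contract described for #2492 (responses consumed in order, per-call request history, `remaining()`, a named exhaustion error) can be illustrated with a small standalone sketch. `Reply`, `ScriptedClient`, and `ScriptExhausted` are hypothetical stand-ins written for this example; they are not the Agents.KT types, and the real `ModelClient` interface likely differs (messages are modeled here as plain strings).

```kotlin
// Hypothetical stand-ins for illustration; not the Agents.KT API.
sealed interface Reply {
    data class Text(val content: String) : Reply
    data class ToolCall(val name: String, val args: Map<String, String>) : Reply
}

// Raised when a call arrives after the script has been fully consumed.
class ScriptExhausted(val callIndex: Int, val scriptSize: Int) :
    IllegalStateException("turn $callIndex had no scripted response (script size $scriptSize)")

class ScriptedClient(private val script: List<Reply>) {
    private val recorded = mutableListOf<List<String>>()

    /** Message list passed to each successful call, in call order. */
    val requests: List<List<String>> get() = recorded

    /** Number of scripted replies not yet consumed. */
    fun remaining(): Int = script.size - recorded.size

    fun chat(messages: List<String>): Reply {
        val index = recorded.size
        if (index >= script.size) throw ScriptExhausted(index, script.size)
        // Copy the list so later caller-side mutation cannot rewrite history.
        recorded.add(messages.toList())
        return script[index]
    }
}

fun main() {
    val client = ScriptedClient(
        listOf(
            Reply.ToolCall("lookup", mapOf("id" to "42")),
            Reply.Text("found 42"),
        ),
    )
    println(client.chat(listOf("what is forty-two?")))
    println(client.chat(listOf("what is forty-two?", "tool: value-42")))
    println("remaining = ${client.remaining()}")
    // A third call has no scripted reply and fails with a message naming the turn.
    println(runCatching { client.chat(listOf("one more")) }.exceptionOrNull()?.message)
}
```

Because the script is the only source of replies, two runs against the same script see identical responses, which is the property the byte-determinism test above relies on.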

…E, CHANGELOG

- docs/eval.md (new): user-facing eval doc. DeterministicModelClient walked through with the request-history + remaining() + exhaustion-error contract; the three expectation styles (typed predicate / snapshot / expectFieldEquals); suite mode with the type-homogeneity constraint; failure shape; the no-network-end-to-end composition pattern; and the v1-scope deferrals (record-from-live, per-token chunks).
- src/main/resources/internals-agent/testing/EvalHarness.md (new): IDE-side LLM adjunct covering both files in one place (eval-harness is conceptually a single unit). Signatures, composition story, failure modes, scope.
- README.md: adds an "Eval harness" bullet under "Implemented today" between the public snapshot/resume and prompt-caching bullets.
- CHANGELOG.md `## [Unreleased]`: opens with two entries under "Eval harness (#2491 epic, in progress)", #2492 DeterministicModelClient and #2493 eval { } DSL, with the v1 scope notes inline.

No source changes. Full suite stays at 1772 / 0 failures from the prior commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
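The typed-predicate style and failure shape that docs/eval.md covers (labeled expectations, every failing label named, a null message on pass, invocation errors as hard failures) can be sketched in miniature. `MiniEval`, `MiniResult`, `miniEval`, and `evaluate` are hypothetical names invented for this example, the agent is modeled as a plain function, and the message format is an assumption; none of this is the `EvalDsl.kt` implementation.

```kotlin
// Hypothetical miniature of a typed eval case; not the Agents.KT implementation.
class MiniResult<OUT>(
    val output: OUT?,
    val failedLabels: List<String>,
    val invocationError: Throwable?,
) {
    val passed: Boolean get() = invocationError == null && failedLabels.isEmpty()

    // Null on pass; otherwise names the failing labels and renders the output.
    val failureMessage: String? get() = when {
        invocationError != null -> "invocation failed: ${invocationError.message}"
        failedLabels.isNotEmpty() -> "failed: ${failedLabels.joinToString()}; output=$output"
        else -> null
    }
}

class MiniEval<IN, OUT>(val name: String) {
    private var inputProvider: (() -> IN)? = null
    private val checks = mutableListOf<Pair<String, (OUT) -> Boolean>>()

    fun input(value: IN) { inputProvider = { value } }
    fun expect(label: String, predicate: (OUT) -> Boolean) { checks.add(label to predicate) }

    fun evaluate(agent: (IN) -> OUT): MiniResult<OUT> {
        val provide = checkNotNull(inputProvider) { "input(...) is required" }
        check(checks.isNotEmpty()) { "expect(...) is required" }
        val output = try {
            agent(provide())
        } catch (e: Exception) {
            // Without an output there is nothing to assert on: hard failure.
            return MiniResult(null, emptyList(), e)
        }
        // Run every predicate so the report lists all failing labels, not just the first.
        val failed = checks.filterNot { (_, predicate) -> predicate(output) }.map { it.first }
        return MiniResult(output, failed, null)
    }
}

fun <IN, OUT> miniEval(name: String, block: MiniEval<IN, OUT>.() -> Unit): MiniEval<IN, OUT> =
    MiniEval<IN, OUT>(name).apply(block)

fun main() {
    val case = miniEval<String, String>("answer-contains-42") {
        input("what is forty-two?")
        expect("nonempty") { it.isNotEmpty() }
        expect("mentions 42") { "42" in it }
        expect("shouts") { it == it.uppercase() }
    }
    val result = case.evaluate { "found 42" }
    println(result.passed)
    println(result.failureMessage)
}
```

Evaluating all predicates instead of stopping at the first failure costs little for pure predicates and gives a single report that names every broken expectation.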

…/fail

#2494: completes the #2491 eval epic. Adds an opt-in judge for criteria that resist deterministic assertion (tone, relevance, completeness). Explicitly advisory by design.

```kotlin
val rubric = JudgeRubric(
    criteria = "Tone: warm, professional, no jargon.",
    judgeModel = DeterministicModelClient(
        LlmResponse.Text("""{"score":8,"rationale":"clear and warm"}"""),
    ),
)
val case = eval<String, Review>("repo-review") {
    input(spec)
    expect("approved") { it.approved }   // ← gates pass/fail
    judge("tone", rubric)                // ← advisory; never gates
}
val result = case.run(agent)
result.passed                  // ← depends ONLY on expect blocks
result.judgeVerdicts["tone"]   // ← JudgeOutcome.Scored(JudgeVerdict)
result.judgeSummary            // ← "[advisory] tone: 8 — clear and warm"
```

Implementation:
- `agents_engine/testing/LlmJudge.kt` (new):
  * `JudgeRubric(criteria, scoreRange = 0..10, judgeModel)`: typed rubric config. The judge model is independent of the production agent's model; for unit tests use `DeterministicModelClient`, for live eval use a pinned cloud model.
  * `JudgeVerdict(score: Int, rationale: String)`: `@Generable` so the judge model returns structured JSON that the framework parses through the existing `fromLlmOutput` pipeline. No free-text judge prompts → free-text verdicts.
  * Internal `LlmJudge(rubric).score(input, output)`: renders a system prompt + user message ("Input: X, Output: Y"), invokes the judge model, parses the verdict, validates `score` is in `rubric.scoreRange`, returns the typed verdict.
- `agents_engine/testing/EvalDsl.kt` (extended):
  * `EvalCaseBuilder.judge(label, rubric)`: registers an advisory scorer. Duplicate labels fail fast at builder time.
  * `EvalCase` carries an immutable `judges: List<JudgeBinding>`. Runs each after the agent succeeds; judges do NOT run when the agent invocation itself fails (no output to score).
  * `EvalResult.judgeVerdicts: Map<String, JudgeOutcome>`: captured verdicts keyed by label. Sealed `JudgeOutcome { Scored(verdict) | Errored(detail) }`; parse failures or out-of-range scores surface as `Errored` but never gate `passed`.
  * `EvalResult.passed` and `EvalResult.failureMessage` consider ONLY deterministic `outcomes` and `invocationError`. Judges are structurally excluded from the gating contract.
  * `EvalResult.judgeSummary: String`: multi-line `[advisory] <label>: <score> — <rationale>` per-judge summary for test reports. Marked `[advisory]` so report consumers don't confuse judges with the deterministic pass/fail.

Tests (LlmJudgeTest.kt, 8 cases):
- Verdict captured on `judgeVerdicts[label]` as Scored
- Low judge score does NOT fail the case (advisory only)
- Judge parse error surfaces as Errored, doesn't gate pass/fail
- Out-of-range score surfaces as Errored
- Multiple judges per case keyed by label
- judgeSummary renders the [advisory] marker
- Judges do not run on agent invocation failure
- Duplicate judge labels fail fast at builder time

Eval epic (#2491) now feature-complete: deterministic mocks (#2492) + typed eval DSL (#2493) + advisory judge (#2494) ship as one cohesive `agents_engine.testing` package.

Full suite: 1780 tests across 7 modules, 0 failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
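The advisory contract above (judge failures and out-of-range scores become `Errored`, and `passed` never reads judge results) can be shown in a small standalone sketch. `Verdict`, `Outcome`, `judge`, and `Report` are hypothetical names for this example, the judge model is reduced to a function returning a verdict, and the `Errored` summary line format is an assumption; only the `Scored` summary format follows the commit message.

```kotlin
// Hypothetical miniature of advisory judging; illustrative only, not LlmJudge.kt.
data class Verdict(val score: Int, val rationale: String)

sealed interface Outcome {
    data class Scored(val verdict: Verdict) : Outcome
    data class Errored(val detail: String) : Outcome
}

/** Runs a judge and converts any exception or out-of-range score into Errored. */
fun judge(scoreRange: IntRange, judgeModel: () -> Verdict): Outcome =
    try {
        val verdict = judgeModel()
        if (verdict.score in scoreRange) Outcome.Scored(verdict)
        else Outcome.Errored("score ${verdict.score} outside $scoreRange")
    } catch (e: Exception) {
        Outcome.Errored("judge failed: ${e.message}")
    }

class Report(val expectationsPassed: Boolean, val judgeVerdicts: Map<String, Outcome>) {
    // Gating reads only the deterministic expectations; judges are never consulted.
    val passed: Boolean get() = expectationsPassed

    // One "[advisory] ..." line per judge, in registration order.
    val judgeSummary: String
        get() = judgeVerdicts.entries.joinToString("\n") { (label, outcome) ->
            when (outcome) {
                is Outcome.Scored ->
                    "[advisory] $label: ${outcome.verdict.score} — ${outcome.verdict.rationale}"
                is Outcome.Errored ->
                    "[advisory] $label: errored (${outcome.detail})" // assumed format
            }
        }
}

fun main() {
    val report = Report(
        expectationsPassed = true,
        judgeVerdicts = mapOf(
            "tone" to judge(0..10) { Verdict(2, "curt") },            // low score
            "range" to judge(0..10) { Verdict(11, "too generous") },   // out of range
            "parse" to judge(0..10) { throw IllegalArgumentException("not JSON") },
        ),
    )
    println(report.passed) // still true: low or errored judges do not gate
    println(report.judgeSummary)
}
```

Keeping `passed` a function of the deterministic outcomes alone means a flaky or misconfigured judge can degrade the report but cannot turn a green run red.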

…HANGELOG

- docs/eval.md: adds a new "LLM-as-judge (advisory)" section after DeterministicModelClient and before the eval DSL. Walks through: the example with tone scoring; why advisory + opt-in (LLM judges are themselves nondeterministic; gating them imports flakiness); pinning the judge model (DeterministicModelClient for unit tests, pinned cloud model for live eval); the sealed JudgeOutcome Scored/Errored failure modes; what happens when the agent itself fails (no judges run). Header summary updated to "three pieces."
- src/main/resources/internals-agent/testing/EvalHarness.md: adjunct description string updated to cover all three pieces and the judge-doesn't-gate constraint explicitly. Code-shape block adds the judge() DSL line + JudgeOutcome / JudgeRubric / JudgeVerdict types. Failure-modes section gains the judge errored case.
- README.md: extends the "Eval harness" bullet with the optional judge(...) sentence and the explicit "judges never gate" callout.
- CHANGELOG.md: adds a third entry under "Eval harness (#2491 epic)" for #2494 with the advisory-only semantics, sealed JudgeOutcome, judge-model pinning, and agent-failure interaction. Header changes from "in progress" to "feature-complete."

No source changes. Full suite stays at 1780 / 0 failures from the prior commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Skobeltsyn and others added 4 commits May 30, 2026 11:18

Skobeltsyn merged commit 510585c into main May 30, 2026
3 checks passed

Skobeltsyn merged 4 commits into main from feat/2491-eval-harness
