
Feat/2491 eval harness #66

Merged
Skobeltsyn merged 4 commits into main from feat/2491-eval-harness on May 30, 2026

Conversation

@Skobeltsyn
Contributor

No description provided.

Skobeltsyn and others added 4 commits May 30, 2026 11:18
#2491 epic — first two children landed together (#2492 + #2493).
Reproducible eval without live providers, with typed assertions over
the agent's `OUT`.

```kotlin
val mock = DeterministicModelClient(
    LlmResponse.ToolCalls(listOf(ToolCall("lookup", mapOf("id" to "42")))),
    LlmResponse.Text("found 42"),
)
val agent = agent<String, String>("test") {
    model { ollama("t"); client = mock }
    tools { tool("lookup", "lookup") { args -> "value-${args["id"]}" } }
    skills { skill<String, String>("s", "") { tools("lookup") } }
}

val case = eval<String, String>("answer-contains-42") {
    input("what is forty-two?")
    expect("nonempty") { it.isNotEmpty() }
    expect("mentions 42") { "42" in it }
}
val result = case.run(agent)
assertTrue(result.passed) { result.failureMessage }
```

`#2492 — DeterministicModelClient`:

- `agents_engine/testing/DeterministicModelClient.kt`. A `ModelClient`
  that scripts responses in order, one per `chat` call.
- Construction: `DeterministicModelClient(LlmResponse, LlmResponse, ...)`
  or `DeterministicModelClient(scripted: List<LlmResponse>)`.
- `requests` exposes the full message-list-per-call history so tests
  can assert on the agent's conversation shape across turns.
- `remaining()` reports unconsumed responses — lets tests pin "agent
  consumed exactly N turns."
- Exhaustion throws `DeterministicScriptExhausted(callIndex, scriptSize,
  lastMessages)` so a test failure clearly names "turn N had no
  scripted response."
- Streaming uses the default `ModelClient.chatStream` wrap (scripted
  responses fold into the same Started → ArgsDelta → Finished → End
  chunk sequence a native streaming provider would emit).
- Out of scope for v1: record-from-live capture (mentioned in the
  ticket; needs an HTTP-fixture story we'll write when there's demand).
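
The scripted-client contract described above (responses served in order, request history, `remaining()`, a named exhaustion error) can be illustrated with a minimal, self-contained sketch. Everything here is a hypothetical stand-in: `Reply`, `ScriptedClient`, and `ScriptExhausted` are illustrative names, messages are plain strings, and none of it is the actual `DeterministicModelClient` code.

```kotlin
// Hypothetical stand-in for LlmResponse; the real type lives in agents_engine.
sealed interface Reply {
    data class Text(val content: String) : Reply
    data class ToolCall(val name: String, val args: Map<String, String>) : Reply
}

// Assumed error shape, loosely modelled on the (callIndex, scriptSize, lastMessages) triple.
class ScriptExhausted(
    val callIndex: Int,
    val scriptSize: Int,
    val lastMessages: List<String>,
) : IllegalStateException("turn $callIndex had no scripted response (script size $scriptSize)")

class ScriptedClient(private val script: List<Reply>) {
    constructor(vararg replies: Reply) : this(replies.toList())

    private val recorded = mutableListOf<List<String>>()

    /** Message list seen on each successful call, in call order. */
    val requests: List<List<String>> get() = recorded

    /** Number of scripted replies not yet consumed. */
    fun remaining(): Int = script.size - recorded.size

    /** Serves the next scripted reply, or throws once the script is used up. */
    fun chat(messages: List<String>): Reply {
        val index = recorded.size
        if (index >= script.size) throw ScriptExhausted(index, script.size, messages)
        recorded.add(messages.toList()) // copy so later mutation cannot rewrite history
        return script[index]
    }
}
```

Because the only state is the script and a call counter, two runs against the same script return the same replies in the same order, which is the property the determinism test relies on.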

`#2493 — eval { } DSL`:

- `agents_engine/testing/EvalDsl.kt`. Builder-DSL for typed eval cases.
- `eval<IN, OUT>("name") { input(...); expect { ... } }` — typed
  predicates over `OUT`, not string matching.
- `expectSnapshot(snapshot)` — pins the rendered `toLlmInput(output)`
  JSON against a known string. Diff on regression.
- `expectFieldEquals(fieldPath, expected)` — single-field JSON
  substring check, no full snapshot.
- Multiple `expect` blocks compose: all must pass. Failure reports
  name each failing label and render the typed output for diagnosis.
- Agent invocation exceptions captured as hard failures (the case
  can't evaluate without an output).
- `evalSuite("name") { + case; + case; ... }.runAll(agent)` bundles
  cases. Suite is type-homogeneous over the agent type at the call
  site, so a mixed-shape suite is a compile error (good — catches
  copy-paste bugs).
- `EvalResult.failureMessage` is null on pass, structured on fail —
  drops straight into `assertTrue(result.passed) { result.failureMessage }`
  in JUnit tests.
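
How labelled expectations compose into one gate can be sketched in a few lines. This is a miniature under stated assumptions, not the `EvalDsl.kt` implementation: `miniEval`, `MiniCase`, `MiniCaseBuilder`, and `MiniResult` are hypothetical names, the "agent" is a plain function, and output rendering is just `toString()`.

```kotlin
data class MiniResult(
    val name: String,
    val failedLabels: List<String>,
    val invocationError: String?,
    val renderedOutput: String?,
) {
    val passed: Boolean get() = invocationError == null && failedLabels.isEmpty()

    /** Null on pass; names the failing labels (or the invocation error) on fail. */
    val failureMessage: String? get() = when {
        passed -> null
        invocationError != null -> "$name: $invocationError"
        else -> "$name failed [${failedLabels.joinToString()}]; output was: $renderedOutput"
    }
}

class MiniCase<I, O>(
    val name: String,
    private val input: I,
    private val checks: List<Pair<String, (O) -> Boolean>>,
) {
    fun run(agent: (I) -> O): MiniResult {
        val output = try {
            agent(input)
        } catch (e: Exception) {
            // No output to evaluate, so the invocation error is itself a hard failure.
            return MiniResult(name, emptyList(), "invocation failed: ${e.message}", null)
        }
        // Every predicate runs; all failing labels are collected, not just the first.
        val failed = checks.filterNot { (_, predicate) -> predicate(output) }.map { it.first }
        return MiniResult(name, failed, null, output.toString())
    }
}

class MiniCaseBuilder<I, O> {
    private var inputValue: I? = null
    private var hasInput = false
    private val checks = mutableListOf<Pair<String, (O) -> Boolean>>()

    fun input(value: I) { inputValue = value; hasInput = true }
    fun expect(label: String, predicate: (O) -> Boolean) { checks.add(label to predicate) }

    @Suppress("UNCHECKED_CAST")
    fun build(name: String): MiniCase<I, O> {
        require(hasInput) { "input(...) is required" }
        require(checks.isNotEmpty()) { "at least one expect(...) is required" }
        return MiniCase(name, inputValue as I, checks.toList())
    }
}

fun <I, O> miniEval(name: String, block: MiniCaseBuilder<I, O>.() -> Unit): MiniCase<I, O> =
    MiniCaseBuilder<I, O>().apply(block).build(name)
```

Typing the predicates over `O` means a check such as `{ "42" in it }` is compiled against the agent's output type, so a case written for a different output shape fails at compile time rather than at run time.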

Tests:
- DeterministicModelClientTest.kt (6 cases): scripted text response;
  multi-turn tool round-trip; requests recording; exhaustion error;
  remaining(); byte-determinism across two runs against the same
  script.
- EvalDslTest.kt (10 cases): passing predicate; multi-expect (mix of
  pass/fail); invocation error capture; snapshot pass; snapshot fail
  with typed diff; expectFieldEquals; suite mode; input(...) required;
  expect(...) required.

Full suite: 1772 tests across 7 modules, 0 failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…E, CHANGELOG

- docs/eval.md (new) — user-facing eval doc. DeterministicModelClient
  walked through with the request-history + remaining()
  + exhaustion-error contract; the three expectation styles (typed
  predicate / snapshot / expectFieldEquals); suite mode with the
  type-homogeneity constraint; failure shape; the
  no-network-end-to-end composition pattern; and the v1-scope
  deferrals (record-from-live, per-token chunks).
- src/main/resources/internals-agent/testing/EvalHarness.md (new) —
  IDE-side LLM adjunct covering both files in one place (eval-harness
  is conceptually a single unit). Signatures, composition story,
  failure modes, scope.
- README.md — adds an "Eval harness" bullet under "Implemented today"
  between the public snapshot/resume and prompt-caching bullets.
- CHANGELOG.md `## [Unreleased]` — opens with two entries under
  "Eval harness (#2491 epic, in progress)" — #2492
  DeterministicModelClient and #2493 eval { } DSL — with the v1
  scope notes inline.

No source changes. Full suite stays at 1772 / 0 failures from the
prior commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…/fail

#2494 — completes the #2491 eval epic. Adds an opt-in judge for
criteria that resist deterministic assertion (tone, relevance,
completeness). Explicitly advisory by design.

```kotlin
val rubric = JudgeRubric(
    criteria = "Tone: warm, professional, no jargon.",
    judgeModel = DeterministicModelClient(
        LlmResponse.Text("""{"score":8,"rationale":"clear and warm"}"""),
    ),
)
val case = eval<String, Review>("repo-review") {
    input(spec)
    expect("approved") { it.approved }          // ← gates pass/fail
    judge("tone", rubric)                       // ← advisory; never gates
}
val result = case.run(agent)
result.passed                                   // ← depends ONLY on expect blocks
result.judgeVerdicts["tone"]                    // ← JudgeOutcome.Scored(JudgeVerdict)
result.judgeSummary                             // ← "[advisory] tone: 8 — clear and warm"
```

Implementation:

- `agents_engine/testing/LlmJudge.kt` (new):
  * `JudgeRubric(criteria, scoreRange = 0..10, judgeModel)` — typed
    rubric config. The judge model is independent of the production
    agent's model — for unit tests use `DeterministicModelClient`;
    for live eval use a pinned cloud model.
  * `JudgeVerdict(score: Int, rationale: String)` is `@Generable`, so
    the judge model returns structured JSON that the framework parses
    through the existing `fromLlmOutput` pipeline. This avoids the
    pattern of free-text judge prompts yielding free-text verdicts.
  * Internal `LlmJudge(rubric).score(input, output)` — renders a
    system prompt + user message ("Input: X, Output: Y"), invokes
    the judge model, parses the verdict, validates `score` is in
    `rubric.scoreRange`, returns the typed verdict.

- `agents_engine/testing/EvalDsl.kt` (extended):
  * `EvalCaseBuilder.judge(label, rubric)` — registers an advisory
    scorer. Duplicate labels fail fast at builder time.
  * `EvalCase` carries an immutable `judges: List<JudgeBinding>`.
    Each judge runs after the agent succeeds; judges do NOT run when
    the agent invocation itself fails (no output to score).
  * `EvalResult.judgeVerdicts: Map<String, JudgeOutcome>` — captured
    verdicts keyed by label. Sealed `JudgeOutcome { Scored(verdict)
    | Errored(detail) }` — parse failures or out-of-range scores
    surface as `Errored` but never gate `passed`.
  * `EvalResult.passed` and `EvalResult.failureMessage` consider
    ONLY deterministic `outcomes` and `invocationError`. Judges are
    structurally excluded from the gating contract.
  * `EvalResult.judgeSummary: String` — multi-line `[advisory]
    <label>: <score> — <rationale>` per-judge summary for test
    reports. Marked `[advisory]` so report consumers don't confuse
    judges with the deterministic pass/fail.
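
The "advisory, never gating" contract above can be pictured with a small self-contained sketch. The names `Verdict`, `Outcome`, `scoreAdvisory`, and `CaseResult` are hypothetical stand-ins for the shapes listed in this commit, the judge is modelled as a plain lambda, and the summary separator is simplified; the real `LlmJudge.kt` may differ in every detail.

```kotlin
data class Verdict(val score: Int, val rationale: String)

sealed interface Outcome {
    data class Scored(val verdict: Verdict) : Outcome
    data class Errored(val detail: String) : Outcome
}

/** Runs a judge and turns parse failures or out-of-range scores into data, not exceptions. */
fun scoreAdvisory(scoreRange: IntRange, judge: () -> Verdict): Outcome =
    try {
        val verdict = judge()
        if (verdict.score in scoreRange) Outcome.Scored(verdict)
        else Outcome.Errored("score ${verdict.score} outside $scoreRange")
    } catch (e: Exception) {
        Outcome.Errored("judge failed: ${e.message}")
    }

data class CaseResult(
    val failedExpectations: List<String>,
    val judgeVerdicts: Map<String, Outcome>,
) {
    // Only deterministic expectations feed the gate; judgeVerdicts is never read here.
    val passed: Boolean get() = failedExpectations.isEmpty()

    /** One line per judge, each marked as advisory for report readers. */
    val judgeSummary: String
        get() = judgeVerdicts.entries.joinToString("\n") { (label, outcome) ->
            when (outcome) {
                is Outcome.Scored ->
                    "[advisory] $label: ${outcome.verdict.score} - ${outcome.verdict.rationale}"
                is Outcome.Errored ->
                    "[advisory] $label: errored - ${outcome.detail}"
            }
        }
}
```

Keeping `passed` free of any reference to `judgeVerdicts` is what makes the exclusion structural: a low score, a malformed reply, or an out-of-range value can change the report but cannot change the gate.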

Tests (LlmJudgeTest.kt — 8 cases):
- Verdict captured on `judgeVerdicts[label]` as Scored
- Low judge score does NOT fail the case (advisory only)
- Judge parse error surfaces as Errored, doesn't gate pass/fail
- Out-of-range score surfaces as Errored
- Multiple judges per case keyed by label
- judgeSummary renders the [advisory] marker
- Judges do not run on agent invocation failure
- Duplicate judge labels fail fast at builder time

Eval epic (#2491) now feature-complete: deterministic mocks (#2492)
+ typed eval DSL (#2493) + advisory judge (#2494) ship as one
cohesive `agents_engine.testing` package.

Full suite: 1780 tests across 7 modules, 0 failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…HANGELOG

- docs/eval.md — adds a new "LLM-as-judge (advisory)" section after
  DeterministicModelClient and before the eval DSL. Walks through:
  the example with tone scoring; why advisory + opt-in (LLM judges
  are themselves nondeterministic; gating them imports flakiness);
  pinning the judge model (DeterministicModelClient for unit tests,
  pinned cloud model for live eval); the sealed JudgeOutcome
  Scored/Errored failure modes; what happens when the agent itself
  fails (no judges run). Header summary updated to "three pieces."
- src/main/resources/internals-agent/testing/EvalHarness.md — adjunct
  description string updated to cover all three pieces and the
  judge-doesn't-gate constraint explicitly. Code-shape block adds
  the judge() DSL line + JudgeOutcome / JudgeRubric / JudgeVerdict
  types. Failure-modes section gains the judge errored case.
- README.md — extends the "Eval harness" bullet with the optional
  judge(...) sentence and the explicit "judges never gate" callout.
- CHANGELOG.md — adds a third entry under "Eval harness (#2491 epic)"
  for #2494 with the advisory-only semantics, sealed JudgeOutcome,
  judge-model pinning, and agent-failure interaction. Header changes
  from "in progress" to "feature-complete."

No source changes. Full suite stays at 1780 / 0 failures from the
prior commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Skobeltsyn Skobeltsyn merged commit 510585c into main May 30, 2026
3 checks passed