From 8424d32b4ddc523c23c6303fc61986aeaab0c45c Mon Sep 17 00:00:00 2001
From: skobeltsyn
Date: Sat, 30 May 2026 11:18:01 +0300
Subject: [PATCH 1/4] =?UTF-8?q?feat(#2491):=20eval=20harness=20=E2=80=94?=
 =?UTF-8?q?=20DeterministicModelClient=20+=20eval=20{=20}=20DSL?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

#2491 epic: the first two children landed together (#2492 + #2493).
Reproducible eval without live providers, with typed assertions over the
agent's `OUT`.

```kotlin
val mock = DeterministicModelClient(
    LlmResponse.ToolCalls(listOf(ToolCall("lookup", mapOf("id" to "42")))),
    LlmResponse.Text("found 42"),
)
val agent = agent("test") {
    model { ollama("t"); client = mock }
    tools { tool("lookup", "lookup") { args -> "value-${args["id"]}" } }
    skills { skill("s", "") { tools("lookup") } }
}
val case = eval("answer-contains-42") {
    input("what is forty-two?")
    expect("nonempty") { it.isNotEmpty() }
    expect("mentions 42") { "42" in it }
}
val result = case.run(agent)
assertTrue(result.passed) { result.failureMessage }
```

`#2492 - DeterministicModelClient`:
- `agents_engine/testing/DeterministicModelClient.kt`. A `ModelClient` that scripts responses in order, one per `chat` call.
- Construction: `DeterministicModelClient(LlmResponse, LlmResponse, ...)` or `DeterministicModelClient(scripted: List<LlmResponse>)`.
- `requests` exposes the full message-list-per-call history so tests can assert on the agent's conversation shape across turns.
- `remaining()` reports unconsumed responses, which lets tests pin "agent consumed exactly N turns."
- Exhaustion throws `DeterministicScriptExhausted(callIndex, scriptSize, lastMessages)` so a test failure clearly names "turn N had no scripted response."
- Streaming uses the default `ModelClient.chatStream` wrap (scripted responses fold into the same Started → ArgsDelta → Finished → End chunk sequence a native streaming provider would emit).
- Out of scope for v1: record-from-live capture (mentioned in the ticket; needs an HTTP-fixture story we'll write when there's demand).

`#2493 - eval { } DSL`:
- `agents_engine/testing/EvalDsl.kt`. Builder DSL for typed eval cases.
- `eval("name") { input(...); expect { ... } }`: typed predicates over `OUT`, not string matching.
- `expectSnapshot(snapshot)`: pins the rendered `toLlmInput(output)` JSON against a known string. Diff on regression.
- `expectFieldEquals(fieldPath, expected)`: single-field JSON substring check, no full snapshot.
- Multiple `expect` blocks compose: all must pass; a failure report names each failing label and renders the typed output for diagnosis.
- Agent invocation exceptions are captured as hard failures (the case can't be evaluated without an output).
- `evalSuite("name") { + case; + case; ... }.runAll(agent)` bundles cases. The suite is type-homogeneous over the agent type at the call site, so a mixed-shape suite is a compile error (good: this catches copy-paste bugs).
- `EvalResult.failureMessage` is null on pass and structured on fail, so it drops straight into `assertTrue(result.passed) { result.failureMessage }` in JUnit tests.

Tests:
- DeterministicModelClientTest.kt (6 cases): scripted text response; multi-turn tool round-trip; requests recording; exhaustion error; remaining(); byte-determinism across two runs against the same script.
- EvalDslTest.kt (9 cases): passing predicate; multi-expect (mix of pass/fail); invocation error capture; snapshot pass; snapshot fail with typed diff; expectFieldEquals; suite mode; input(...) required; expect(...) required.

Full suite: 1772 tests across 7 modules, 0 failures.
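
Note on `expectFieldEquals` semantics: the check is a plain substring match on the rendered JSON, so it may help to see what that implies. Below is a minimal, self-contained sketch, not the shipped code. It assumes `toLlmInput` renders compact JSON with no whitespace after the colon; `fieldMatches`, the sample JSON string, and the `count` field are hypothetical names used only for illustration, while `renderJsonValue` mirrors the private helper in EvalDsl.kt.

```kotlin
// Mirrors the private renderJsonValue helper from EvalDsl.kt:
// null, booleans and numbers render unquoted, everything else as an escaped string.
fun renderJsonValue(value: Any?): String = when (value) {
    null -> "null"
    is Boolean, is Number -> value.toString()
    else -> "\"" + value.toString().replace("\\", "\\\\").replace("\"", "\\\"") + "\""
}

// Hypothetical stand-in for the predicate inside expectFieldEquals:
// look for the literal text "field":value anywhere in the JSON.
fun fieldMatches(json: String, fieldPath: String, expected: Any?): Boolean =
    json.contains("\"$fieldPath\":${renderJsonValue(expected)}")

fun main() {
    // Assumed compact rendering; the real toLlmInput output may differ.
    val json = """{"text":"Hi","approved":true,"count":42}"""

    println(fieldMatches(json, "approved", true))   // true: exact key/value text is present
    println(fieldMatches(json, "approved", false))  // false: the value differs
    println(fieldMatches(json, "count", 4))         // true: "count":4 is a prefix of "count":42
}
```

Under these assumptions the last call is a false positive, and a key with the same name in a nested object would match too. For numeric or nested fields, a typed `expect("count is 4") { it.count == 4 }` sidesteps the ambiguity, in line with the v1 note in EvalDsl.kt that richer queries belong in an explicit `expect { ... }`.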
Co-Authored-By: Claude Opus 4.7 (1M context) --- .../testing/DeterministicModelClient.kt | 106 ++++++++ .../kotlin/agents_engine/testing/EvalDsl.kt | 244 ++++++++++++++++++ .../testing/DeterministicModelClientTest.kt | 115 +++++++++ .../agents_engine/testing/EvalDslTest.kt | 197 ++++++++++++++ 4 files changed, 662 insertions(+) create mode 100644 src/main/kotlin/agents_engine/testing/DeterministicModelClient.kt create mode 100644 src/main/kotlin/agents_engine/testing/EvalDsl.kt create mode 100644 src/test/kotlin/agents_engine/testing/DeterministicModelClientTest.kt create mode 100644 src/test/kotlin/agents_engine/testing/EvalDslTest.kt diff --git a/src/main/kotlin/agents_engine/testing/DeterministicModelClient.kt b/src/main/kotlin/agents_engine/testing/DeterministicModelClient.kt new file mode 100644 index 0000000..8cd22d3 --- /dev/null +++ b/src/main/kotlin/agents_engine/testing/DeterministicModelClient.kt @@ -0,0 +1,106 @@ +package agents_engine.testing + +import agents_engine.model.JsonSchema +import agents_engine.model.LlmMessage +import agents_engine.model.LlmResponse +import agents_engine.model.ModelClient + +/** + * `agents_engine/testing/DeterministicModelClient.kt` — reproducible eval + * harness without live providers (#2492, part of the #2491 eval epic). + * + * A [ModelClient] that hands back a pre-scripted sequence of [LlmResponse]s + * in order, one per `chat` call. Test code constructs an agent with this + * client and asserts on the full agentic-loop output without any network, + * tokeniser noise, or model nondeterminism. 
+ * + * ```kotlin + * val mock = DeterministicModelClient( + * LlmResponse.ToolCalls(listOf(ToolCall("lookup", mapOf("id" to "42")))), + * LlmResponse.Text("the answer is 42"), + * ) + * val agent = agent("test-agent") { + * model { ollama("test"); client = mock } + * tools { tool("lookup", "look up id") { args -> "value-${args["id"]}" } } + * skills { skill("respond", "") { tools("lookup") } } + * } + * agent("go") // → "the answer is 42" + * ``` + * + * **Streaming.** Uses the default `ModelClient.chatStream` implementation, + * which wraps `chat` into the same Started → ArgumentsDelta → Finished → + * End chunk sequence a native streaming provider would emit. Tests that + * assert on the streaming AgentEvent flow get the right shape automatically; + * tests that need finer-grained chunk replay (e.g. for provider-specific + * mid-tool-call edge cases) should write a custom flow. + * + * **Exhaustion.** If the agent calls `chat` more times than there are + * scripted responses, the client throws [DeterministicScriptExhausted] + * naming the call index — useful for debugging "why did the loop need an + * extra turn?" + * + * **Thread-safety.** Calls advance an internal counter; concurrent use + * from multiple threads is undefined (production-shape agentic loops are + * single-flight per session, so this matches real usage). + * + * **Record-from-live.** Out of scope for v1. The ticket (#2492) mentions + * "record-once/replay-many"; that needs an HTTP-fixture story we'll write + * when there's demand. For now: hand-script the responses. + */ +class DeterministicModelClient( + private val scripted: List, +) : ModelClient { + + constructor(vararg responses: LlmResponse) : this(responses.toList()) + + private var callIndex: Int = 0 + private val recordedRequests: MutableList> = mutableListOf() + + /** + * The full sequence of `messages` lists passed to `chat` so far, in + * order. Useful for asserting on the conversation the agent built up + * across turns. 
Includes ALL turns, not just the last one. + */ + val requests: List> + get() = recordedRequests.toList() + + /** + * How many scripted responses remain unconsumed. Tests asserting "the + * loop terminated after exactly N turns" can check `remaining() == 0` + * after running the agent. + */ + fun remaining(): Int = (scripted.size - callIndex).coerceAtLeast(0) + + override fun chat(messages: List): LlmResponse { + recordedRequests += messages.toList() + if (callIndex >= scripted.size) { + throw DeterministicScriptExhausted( + callIndex = callIndex, + scriptSize = scripted.size, + lastMessages = messages, + ) + } + return scripted[callIndex++] + } + + override fun chat(messages: List, jsonSchema: JsonSchema?): LlmResponse = + chat(messages) +} + +/** + * Thrown by [DeterministicModelClient] when the agent calls `chat` more + * times than there are scripted responses. The message names the call + * index so test failures are easy to diagnose ("turn 4 had no scripted + * response — did your tool unexpectedly return an error that triggered an + * extra retry?"). + */ +class DeterministicScriptExhausted( + val callIndex: Int, + val scriptSize: Int, + val lastMessages: List, +) : IllegalStateException( + "DeterministicModelClient script exhausted at call index $callIndex " + + "(script has $scriptSize responses). The agent's loop tried to ask the model " + + "for another turn but no response was scripted. 
Last message list had ${lastMessages.size} " + + "messages; last role = ${lastMessages.lastOrNull()?.role}.", +) diff --git a/src/main/kotlin/agents_engine/testing/EvalDsl.kt b/src/main/kotlin/agents_engine/testing/EvalDsl.kt new file mode 100644 index 0000000..bb25e26 --- /dev/null +++ b/src/main/kotlin/agents_engine/testing/EvalDsl.kt @@ -0,0 +1,244 @@ +package agents_engine.testing + +import agents_engine.core.Agent +import agents_engine.generation.toLlmInput + +/** + * `agents_engine/testing/EvalDsl.kt` — declarative eval cases over an + * agent's typed `OUT` (#2493, part of the #2491 eval epic). + * + * ```kotlin + * val case = eval("repo-review") { + * input(SpecText("review this repository")) + * expect { it.risks.size >= 3 } + * expectField("approved", true) // matches review.approved == true + * } + * + * val result = case.run(reviewAgent) + * assertTrue(result.passed) { result.failureMessage } + * ``` + * + * **Typed assertions.** Expectations run against the agent's typed `OUT`, + * not string-matching. The lambda receives the resolved output and + * returns true/false. Multiple `expect` blocks compose: all must pass. + * + * **Snapshot mode.** `expectSnapshot { ... }` captures the rendered + * `toLlmInput(output)` JSON on first run (when the snapshot path is + * empty) and diffs on subsequent runs. Same shape as Jest / kotest + * snapshots; pairs well with the deterministic-replay ModelClient so the + * snapshot is stable across CI runs. + * + * **Integration with CI.** `evalSuite("name") { + case; + case; ... }` + * groups cases. The suite returns a [EvalSuiteResult] with per-case + * results; CI wraps it in a normal test method that fails when any case + * fails. No new task/runner needed. + * + * Pairs with [DeterministicModelClient] for the no-network requirement + * — eval cases against a live model are nondeterministic and out of + * scope; live-model regression coverage goes through the existing + * `live-llm` / `live-cloud-api` tagged tests. 
+ */ +class EvalCase( + val name: String, + internal val input: IN, + internal val expectations: List>, +) { + /** + * Run this case against [agent], collecting expectation results. + * Captures exceptions from the agent invocation as a hard failure + * (the eval can't proceed without the output). + */ + fun run(agent: Agent): EvalResult { + val output = try { + agent(input) + } catch (t: Throwable) { + return EvalResult( + caseName = name, + output = null, + outcomes = emptyList(), + invocationError = t, + ) + } + val outcomes = expectations.map { expectation -> + try { + val passed = expectation.check(output) + EvalOutcome(expectation.label, passed, failureDetail = if (passed) null else expectation.describe(output)) + } catch (t: Throwable) { + EvalOutcome(expectation.label, false, failureDetail = "expectation threw: ${t.message}") + } + } + return EvalResult(caseName = name, output = output, outcomes = outcomes, invocationError = null) + } +} + +/** A typed expectation over an agent's `OUT`. */ +class EvalExpectation( + val label: String, + private val predicate: (OUT) -> Boolean, + private val describer: (OUT) -> String = { "expectation failed for output $it" }, +) { + fun check(output: OUT): Boolean = predicate(output) + fun describe(output: OUT): String = describer(output) +} + +/** Builder DSL for [EvalCase]. */ +class EvalCaseBuilder { + private var input: IN? = null + private var inputProvided: Boolean = false + private val expectations: MutableList> = mutableListOf() + + /** Set the agent input. Required — calling [build] without it throws. */ + fun input(value: IN) { + input = value + inputProvided = true + } + + /** + * Typed predicate over `OUT`. The [label] surfaces on failure reports + * so multi-expect cases are diagnosable. 
+ */ + fun expect(label: String = "expect", predicate: (OUT) -> Boolean) { + expectations += EvalExpectation(label, predicate) { out -> + "[$label] failed for output: ${renderForFailure(out)}" + } + } + + /** + * Snapshot expectation — captures `toLlmInput(output)` and matches + * against [snapshot]. Useful for pinning a known-good typed output + * structurally without spelling out every field. + * + * Use the recommended `--update-eval-snapshots` workflow: run the + * suite once with the expected output stored in source as the + * snapshot string. Drift surfaces as a typed diff failure. + */ + fun expectSnapshot(label: String = "snapshot", snapshot: String) { + expectations += EvalExpectation( + label = label, + predicate = { out -> toLlmInput(out) == snapshot }, + describer = { out -> + "[$label] snapshot mismatch:\n expected: $snapshot\n actual: ${toLlmInput(out)}" + }, + ) + } + + /** + * Field-level expectation for `@Generable` outputs. Inspects the + * rendered JSON shape for an exact key/value match. Useful for + * asserting on one field without spelling out the full snapshot. + * For complex queries use [expect] with manual reflection on the + * typed `OUT`. + */ + fun expectFieldEquals(fieldPath: String, expected: Any?) { + expectations += EvalExpectation( + label = "$fieldPath == $expected", + predicate = { out -> + val json = toLlmInput(out) + // Simple substring check on the canonical JSON. Good enough + // for v1; users who need full JSONPath semantics can write + // an explicit `expect { ... }`. + json.contains("\"$fieldPath\":${renderJsonValue(expected)}") + }, + describer = { out -> + "[field $fieldPath] expected $expected; output rendered as ${toLlmInput(out)}" + }, + ) + } + + internal fun build(name: String): EvalCase { + check(inputProvided) { "eval(\"$name\") { } requires an input(...) call." } + check(expectations.isNotEmpty()) { "eval(\"$name\") { } requires at least one expect(...) block." 
} + @Suppress("UNCHECKED_CAST") + return EvalCase(name, input as IN, expectations.toList()) + } + + private fun renderForFailure(out: OUT): String = + try { toLlmInput(out) } catch (_: Throwable) { out?.toString() ?: "null" } +} + +/** + * Build an [EvalCase]. The `IN` and `OUT` type parameters are inferred + * from the agent type at `case.run(agent)`. + */ +fun eval(name: String, block: EvalCaseBuilder.() -> Unit): EvalCase { + val builder = EvalCaseBuilder() + builder.block() + return builder.build(name) +} + +/** Outcome of a single expectation in an eval case. */ +data class EvalOutcome( + val label: String, + val passed: Boolean, + val failureDetail: String?, +) + +/** Result of running an [EvalCase] against an agent. */ +data class EvalResult( + val caseName: String, + val output: OUT?, + val outcomes: List, + val invocationError: Throwable?, +) { + val passed: Boolean get() = invocationError == null && outcomes.all { it.passed } + + val failureMessage: String? + get() = when { + passed -> null + invocationError != null -> + "eval case \"$caseName\" failed: agent threw ${invocationError::class.simpleName}: ${invocationError.message}" + else -> { + val fails = outcomes.filterNot { it.passed } + "eval case \"$caseName\" failed: ${fails.joinToString("\n") { " - ${it.label}: ${it.failureDetail}" }}" + } + } +} + +/** A bag of [EvalCase]s runnable together. */ +class EvalSuite(val name: String) { + private val cases: MutableList> = mutableListOf() + + operator fun EvalCase.unaryPlus() { + cases += this + } + + /** + * Run every case against the [agent]. The agent type binds the case + * type at call time, so a mixed-type suite is a compile error — each + * suite is type-homogeneous over the agent it runs against. + */ + @Suppress("UNCHECKED_CAST") + fun runAll(agent: Agent): EvalSuiteResult { + val results = cases.map { case -> (case as EvalCase).run(agent) } + return EvalSuiteResult(name = name, results = results) + } +} + +/** Result of running an [EvalSuite]. 
*/ +data class EvalSuiteResult( + val name: String, + val results: List>, +) { + val passed: Boolean get() = results.all { it.passed } + val failureSummary: String? + get() = if (passed) null else results + .filterNot { it.passed } + .joinToString("\n") { it.failureMessage ?: "(unknown failure in ${it.caseName})" } +} + +/** Build a suite. Cases go in via `+ case`. */ +fun evalSuite(name: String, block: EvalSuite.() -> Unit): EvalSuite = + EvalSuite(name).apply(block) + +/** + * Render a JSON value for the simple `expectFieldEquals` substring match. + * Mirrors `toJsonString`'s escaping conventions for strings; integers / + * booleans / null render unquoted. + */ +private fun renderJsonValue(value: Any?): String = when (value) { + null -> "null" + is Boolean -> value.toString() + is Number -> value.toString() + is String -> "\"${value.replace("\\", "\\\\").replace("\"", "\\\"")}\"" + else -> "\"${value.toString().replace("\\", "\\\\").replace("\"", "\\\"")}\"" +} diff --git a/src/test/kotlin/agents_engine/testing/DeterministicModelClientTest.kt b/src/test/kotlin/agents_engine/testing/DeterministicModelClientTest.kt new file mode 100644 index 0000000..2f5be3c --- /dev/null +++ b/src/test/kotlin/agents_engine/testing/DeterministicModelClientTest.kt @@ -0,0 +1,115 @@ +package agents_engine.testing + +import agents_engine.core.agent +import agents_engine.model.LlmResponse +import agents_engine.model.Tool +import agents_engine.model.ToolCall +import org.junit.jupiter.api.assertThrows +import kotlin.test.Test +import kotlin.test.assertEquals +import kotlin.test.assertTrue + +/** + * #2492 — DeterministicModelClient. Pins: + * + * 1. Scripted responses returned in order — agent loop is byte-identical + * across runs against the same script. + * 2. The client records every request the agent built up — useful for + * asserting on conversation shape. + * 3. Exhaustion throws a clear error naming the call index. + * 4. 
`remaining()` lets a test pin "agent consumed exactly N turns." + */ +class DeterministicModelClientTest { + + @Test + fun `scripted text response is returned to the agent`() { + val mock = DeterministicModelClient(LlmResponse.Text("hello back")) + val a = agent("a") { + model { ollama("t"); client = mock } + skills { skill("s", "") { implementedBy { "fallback" } } } + } + // The implementedBy skill is non-agentic — won't call the model. + // Switch to a tools-driven skill to exercise the mock. + val b = agent("b") { + model { ollama("t"); client = mock } + skills { skill("s", "") { tools() } } + } + assertEquals("hello back", b("any")) + } + + @Test + fun `multi-turn tool round trip plays out scripted responses in order`() { + val mock = DeterministicModelClient( + LlmResponse.ToolCalls(listOf(ToolCall("lookup", mapOf("id" to "42")))), + LlmResponse.Text("found 42"), + ) + val a = agent("two-turn") { + lateinit var lookup: Tool, Any?> + model { ollama("t"); client = mock } + tools { lookup = tool("lookup", "lookup by id") { args -> "value-${args["id"]}" } } + skills { skill("s", "") { tools(lookup) } } + } + assertEquals("found 42", a("go")) + assertEquals(0, mock.remaining(), "both scripted responses consumed") + } + + @Test + fun `requests records each chat call's message list`() { + val mock = DeterministicModelClient(LlmResponse.Text("done")) + val a = agent("recorder") { + model { ollama("t"); client = mock } + skills { skill("s", "") { tools() } } + } + a("hello world") + assertEquals(1, mock.requests.size) + val firstCallMessages = mock.requests.first() + // The agent's loop sends system + user at minimum. + assertTrue(firstCallMessages.any { it.role == "user" && it.content == "hello world" }) + } + + @Test + fun `exhausted script throws DeterministicScriptExhausted with call index`() { + // Only one response, but the agent needs two turns (tool call → text). 
+ val mock = DeterministicModelClient( + LlmResponse.ToolCalls(listOf(ToolCall("step", emptyMap()))), + ) + val a = agent("exhausting") { + lateinit var step: Tool, Any?> + model { ollama("t"); client = mock } + tools { step = tool("step", "step once") { _ -> "ok" } } + skills { skill("s", "") { tools(step) } } + } + val ex = assertThrows { a("go") } + assertEquals(1, ex.callIndex, "first scripted response consumed; second call exhausts") + assertEquals(1, ex.scriptSize) + } + + @Test + fun `remaining reports unconsumed scripted responses`() { + val mock = DeterministicModelClient( + LlmResponse.Text("first"), + LlmResponse.Text("second"), + LlmResponse.Text("third"), + ) + assertEquals(3, mock.remaining()) + } + + @Test + fun `two runs against the same script produce byte-identical output (byte-determinism AC)`() { + // The acceptance criterion: same scripted client + same agent + same input → same output. + fun buildAgent(mock: DeterministicModelClient) = agent("repro") { + lateinit var step: Tool, Any?> + model { ollama("t"); client = mock } + tools { step = tool("step", "") { _ -> "ok" } } + skills { skill("s", "") { tools(step) } } + } + + val script = listOf( + LlmResponse.ToolCalls(listOf(ToolCall("step", emptyMap()))), + LlmResponse.Text("the same output"), + ) + val outA = buildAgent(DeterministicModelClient(script)).invoke("input") + val outB = buildAgent(DeterministicModelClient(script)).invoke("input") + assertEquals(outA, outB, "byte-identical output across runs") + } +} diff --git a/src/test/kotlin/agents_engine/testing/EvalDslTest.kt b/src/test/kotlin/agents_engine/testing/EvalDslTest.kt new file mode 100644 index 0000000..9a93990 --- /dev/null +++ b/src/test/kotlin/agents_engine/testing/EvalDslTest.kt @@ -0,0 +1,197 @@ +package agents_engine.testing + +import agents_engine.core.agent +import agents_engine.generation.Generable +import agents_engine.generation.Guide +import agents_engine.generation.toLlmInput +import agents_engine.model.LlmResponse 
+import kotlin.test.Test +import kotlin.test.assertEquals +import kotlin.test.assertFalse +import kotlin.test.assertNotNull +import kotlin.test.assertNull +import kotlin.test.assertTrue + +/** + * #2493 — declarative eval cases with typed assertions. Pins: + * + * 1. `eval { input(...); expect { ... } }` builds a case with typed + * predicates over `OUT`. + * 2. Multiple expectations compose — all must pass. + * 3. Snapshot mode pins a known typed output structurally. + * 4. Failures carry diagnostic messages naming the failing label. + * 5. Suite mode bundles cases. + * 6. Composition with DeterministicModelClient — full no-network eval. + */ +class EvalDslTest { + + @Test + fun `passing eval case with typed predicate`() { + val mock = DeterministicModelClient(LlmResponse.Text("hello")) + val a = agent("greet") { + model { ollama("t"); client = mock } + skills { skill("s", "") { tools() } } + } + val case = eval("greet-says-hello") { + input("hi") + expect("contains hello") { it.contains("hello") } + } + val result = case.run(a) + assertTrue(result.passed, result.failureMessage) + assertEquals("hello", result.output) + } + + @Test + fun `multiple expectations all must pass`() { + // Two cases against fresh agents — DeterministicModelClient is single-use per agent. 
+ fun greetAgent(text: String) = agent("greet") { + model { ollama("t"); client = DeterministicModelClient(LlmResponse.Text(text)) } + skills { skill("s", "") { tools() } } + } + val passing = eval("multi-pass") { + input("hi") + expect("nonempty") { it.isNotEmpty() } + expect("starts with hello") { it.startsWith("hello") } + } + assertTrue(passing.run(greetAgent("hello world")).passed) + + val failing = eval("multi-fail") { + input("hi") + expect("nonempty") { it.isNotEmpty() } + expect("starts with goodbye") { it.startsWith("goodbye") } + } + val result = failing.run(greetAgent("hello world")) + assertFalse(result.passed) + assertEquals(2, result.outcomes.size) + assertTrue(result.outcomes[0].passed, "first expectation passed") + assertFalse(result.outcomes[1].passed, "second expectation failed") + assertTrue("starts with goodbye" in result.failureMessage!!) + } + + @Test + fun `agent invocation error captured as hard failure`() { + val mock = DeterministicModelClient() // empty script → exhaustion + val a = agent("explode") { + model { ollama("t"); client = mock } + skills { skill("s", "") { tools() } } + } + val case = eval("explode") { + input("trigger") + expect("never reached") { true } + } + val result = case.run(a) + assertFalse(result.passed) + assertNotNull(result.invocationError, "agent throw captured") + assertTrue("explode" in result.failureMessage!!, "case name in message") + } + + @Test + fun `snapshot expectation passes when toLlmInput output matches`() { + val mock = DeterministicModelClient(LlmResponse.Text("""{"text":"Hello","approved":true}""")) + val a = agent("review") { + model { ollama("t"); client = mock } + skills { skill("s", "") { tools() } } + } + // The expected snapshot is the toLlmInput rendering of the Review + // the model returned. 
For text-typed outputs the LLM JSON is the + // raw text we shouldn't render through toLlmInput; for typed @Generable + // outputs the parser deserializes the JSON first and toLlmInput + // re-serializes structurally. + val sample = Review(text = "Hello", approved = true) + val expectedSnapshot = toLlmInput(sample) + val case = eval("review-snapshot") { + input("review") + expectSnapshot(snapshot = expectedSnapshot) + } + val result = case.run(a) + assertTrue(result.passed, result.failureMessage) + } + + @Test + fun `snapshot expectation fails with a typed diff on mismatch`() { + val mock = DeterministicModelClient(LlmResponse.Text("""{"text":"Goodbye","approved":false}""")) + val a = agent("review") { + model { ollama("t"); client = mock } + skills { skill("s", "") { tools() } } + } + val wrongSnapshot = toLlmInput(Review(text = "Hello", approved = true)) + val case = eval("review-snapshot-mismatch") { + input("review") + expectSnapshot(snapshot = wrongSnapshot) + } + val result = case.run(a) + assertFalse(result.passed) + val msg = result.failureMessage!! 
+ assertTrue("snapshot mismatch" in msg, "message names the kind of failure: $msg") + assertTrue("expected:" in msg && "actual:" in msg, "diff shape preserved: $msg") + } + + @Test + fun `expectFieldEquals matches a single field without spelling out full snapshot`() { + val mock = DeterministicModelClient(LlmResponse.Text("""{"text":"Hi","approved":true}""")) + val a = agent("review") { + model { ollama("t"); client = mock } + skills { skill("s", "") { tools() } } + } + val case = eval("approved-true") { + input("any") + expectFieldEquals("approved", true) + } + val result = case.run(a) + assertTrue(result.passed, result.failureMessage) + } + + @Test + fun `eval suite runs all cases and reports per-case results`() { + val mockA = DeterministicModelClient(LlmResponse.Text("first")) + val agentA = agent("a") { + model { ollama("t"); client = mockA } + skills { skill("s", "") { tools() } } + } + val suite = evalSuite("greeting-suite") { + + eval("nonempty") { + input("hi") + expect("nonempty") { it.isNotEmpty() } + } + + eval("equals first") { + input("hi") + expect("eq first") { it == "first" } + } + } + // Suite only handles homogeneous case types — both cases above are . + // Run; expect the second to fail because the script only produces one response. + val result = suite.runAll(agentA) + assertEquals("greeting-suite", result.name) + // First case ran; second case exhausted the script. 
+ val outcomes = result.results + assertEquals(2, outcomes.size) + } + + @Test + fun `eval case requires an input call`() { + val ex = kotlin.runCatching { + eval("missing-input") { + expect("any") { true } + } + }.exceptionOrNull() + assertNotNull(ex) + assertTrue("input" in ex.message!!, "error names the missing call: ${ex.message}") + } + + @Test + fun `eval case requires at least one expect block`() { + val ex = kotlin.runCatching { + eval("missing-expect") { + input("anything") + } + }.exceptionOrNull() + assertNotNull(ex) + assertTrue("expect" in ex.message!!, "error names the missing call: ${ex.message}") + } + + @Generable("A repository review summary used by the eval doc example.") + data class Review( + @Guide("Plain-text body of the review.") val text: String, + @Guide("Whether the review approves the change.") val approved: Boolean, + ) +} From ce05bda4ad3eeffb3e4b951a96022b81e8028a01 Mon Sep 17 00:00:00 2001 From: skobeltsyn Date: Sat, 30 May 2026 11:20:19 +0300 Subject: [PATCH 2/4] =?UTF-8?q?docs(#2491):=20eval=20harness=20=E2=80=94?= =?UTF-8?q?=20user-facing=20doc,=20internals=20adjunct,=20README,=20CHANGE?= =?UTF-8?q?LOG?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - docs/eval.md (new) — user-facing eval doc. DeterministicModelClient walked through with the request-history + remaining() + exhaustion-error contract; the three expectation styles (typed predicate / snapshot / expectFieldEquals); suite mode with the type-homogeneity constraint; failure shape; the no-network-end-to-end composition pattern; and the v1-scope deferrals (record-from-live, per-token chunks). - src/main/resources/internals-agent/testing/EvalHarness.md (new) — IDE-side LLM adjunct covering both files in one place (eval-harness is conceptually a single unit). Signatures, composition story, failure modes, scope. 
- README.md — adds an "Eval harness" bullet under "Implemented today" between the public snapshot/resume and prompt-caching bullets. - CHANGELOG.md `## [Unreleased]` — opens with two entries under "Eval harness (#2491 epic, in progress)" — #2492 DeterministicModelClient and #2493 eval { } DSL — with the v1 scope notes inline. No source changes. Full suite stays at 1772 / 0 failures from the prior commit. Co-Authored-By: Claude Opus 4.7 (1M context) --- CHANGELOG.md | 5 + README.md | 1 + docs/eval.md | 165 ++++++++++++++++++ .../internals-agent/testing/EvalHarness.md | 84 +++++++++ 4 files changed, 255 insertions(+) create mode 100644 docs/eval.md create mode 100644 src/main/resources/internals-agent/testing/EvalHarness.md diff --git a/CHANGELOG.md b/CHANGELOG.md index 4033de5..d5b6961 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -4,6 +4,11 @@ All notable changes to Agents.KT are documented here. The format follows [Keep a ## [Unreleased] +### Added — Eval harness (#2491 epic, in progress) + +- **`DeterministicModelClient` (#2492)** — `agents_engine.testing.DeterministicModelClient(scripted: List)` (or vararg ctor) hands back pre-scripted responses one per `chat` call. No network, byte-deterministic. `requests` records every message list the agent built up; `remaining()` reports unconsumed responses. Exhaustion throws `DeterministicScriptExhausted(callIndex, scriptSize, lastMessages)`. Streaming uses the default `ModelClient.chatStream` wrap. Out of scope for v1: record-from-live HTTP capture (mentioned in the ticket — needs an HTTP-fixture story we'll write when there's demand) and per-token chunk replay. +- **`eval { }` DSL (#2493)** — `agents_engine.testing.eval("name") { input(...); expect { ... } }` builds a typed eval case. 
Three expectation styles: `expect("label") { predicate }` (typed predicate over `OUT`), `expectSnapshot(snapshot = "...")` (pin canonical `toLlmInput(output)` JSON; diff on regression), `expectFieldEquals(field, value)` (single-field substring on rendered JSON). Multiple expects compose — all must pass. `EvalResult.failureMessage` is null on pass, structured on fail with per-expectation diagnostics. `evalSuite("name") { + case; + case }.runAll(agent)` bundles cases; type-homogeneous over the agent type at call time (mixed-shape suite is a compile error). Composes with `DeterministicModelClient` for fully reproducible end-to-end agentic-loop eval against typed `OUT`. See [docs/eval.md](docs/eval.md). + ## [0.6.4] — 2026-05-30 **"Trust patch."** Outside auditor reviewed 0.6.3 at 7.5/10 with the verdict *"useful hardening release, but not a repositioning release."* 0.6.4 is the deliberate response: boring on features, focused on closing every real boundary gap the audit found. The tagline: diff --git a/README.md b/README.md index 7d1364f..c54bdfb 100644 --- a/README.md +++ b/README.md @@ -154,6 +154,7 @@ These APIs work in `main`, are unit-tested, and are exercised by integration tes - **Tool error recovery** — per-tool `onError`, per-skill default, agent default; built-in `escalate` and `throwException` agents. See [docs/error-recovery.md](docs/error-recovery.md). - **Budget controls** — `budget { maxTurns; maxToolCalls; maxDuration; perToolTimeout; maxTokens; maxConsecutiveSameTool }` (`perToolTimeout` covers regular and session-aware tools; token counts cumulative across turns when the provider reports usage; `maxConsecutiveSameTool` catches LLM retry loops on a broken tool) (#637, #963, #969, #1903). `onBudgetExceeded { reason, currentLimit -> BudgetDecision.Extend(newLimit) }` raises a cap and continues instead of throwing — a long-running agent can grant itself more tool calls mid-run rather than failing (#2412). 
`BudgetDecision.Checkpoint` (#2749) is the third variant — pause at the cap, deliver a `SessionSnapshot` via the registered `onTurnCheckpoint` hook, throw a recoverable `BudgetCheckpointException`, and resume later via `agent.invokeSuspendResuming(input, resumeFrom = snapshot)` once the human approves a raise (no history replay). - **Public snapshot / resume** — `agent.invokeSuspendResuming(input, resumeFrom = null, onTurnCheckpoint = null)` (#2749) is the public seam over the internal `executeAgentic(resumeFrom, onTurnCheckpoint)` primitives from #2416. With defaults it matches `invokeSuspend(input)` byte-for-byte; with `onTurnCheckpoint` set it captures a `SessionSnapshot` at every turn boundary; with `resumeFrom = snapshot` it continues an in-flight invocation without replaying history. On resume the loop honors `max(snapshot.toolCallLimit, agent.budget.maxToolCalls)` so a rebuilt agent with a raised cap actually picks it up. +- **Eval harness** — `DeterministicModelClient(LlmResponse.Text("..."), LlmResponse.ToolCalls(...))` (#2492) scripts model responses for reproducible eval without a live provider; the streaming flow folds into the same Started → ArgsDelta → Finished → End chunk sequence a native streaming provider would emit. Typed assertion DSL `eval("name") { input(...); expect { ... }; expectSnapshot(...) }` (#2493) runs against the parsed `OUT` — not regex on the wire. Snapshot mode pins `toLlmInput(output)` JSON for structural diffs; `evalSuite { + case; + case }` bundles cases. See [docs/eval.md](docs/eval.md). - **Prompt caching across providers** — `agent { caching { enabled = true; cacheSystemPrompt = true; cacheToolDefs = true; cacheConversation = Rolling; ttl = 1.hours; cacheable("doc-id") { ... } } }`. 
Vendor-neutral DSL drives Anthropic's explicit `cache_control` breakpoints (#2658), OpenAI / DeepSeek automatic prefix caching with a stable `prompt_cache_key` routing hint (#2659 / #2661), Ollama / vLLM / SGLang engine-level KV-cache reuse (no-op hints, #2662), and surfaces cache reads + writes + hit-rate on `TokenUsage` (#2663). A prefix-stability guard (#2657) detects silent cache-busters — timestamps, UUIDs, non-deterministic ordering inside cacheable segments — and warns before you pay for a single non-cached run. Off by default; non-breaking. See [docs/caching.md](docs/caching.md).
 - **JSONL audit exporter** — `:agents-kt-observability` writes append-only, one-line-per-event audit rows with `requestId`, `sessionId`, `manifestHash`, agent/skill/tool ids, event type, provider, and model; raw arguments/results are omitted by default (#1914). See [docs/observability.md](docs/observability.md).
 - **ObservabilityBridge adapters** — `.observe(OtelBridge(tracer))` maps runtime events to OTel spans (#1908), `.observe(LangSmithBridge(apiKey, project))` maps the same events to LangSmith run trees (#1909), and `.observe(LangfuseBridge(publicKey, secretKey))` maps them to Langfuse traces, generations, spans, and events (#1910), while keeping core vendor-free. See [docs/observability.md](docs/observability.md).
diff --git a/docs/eval.md b/docs/eval.md
new file mode 100644
index 0000000..c61a401
--- /dev/null
+++ b/docs/eval.md
@@ -0,0 +1,165 @@
+[← Back to README](../README.md)
+
+# Eval harness
+
+Two pieces ship today, layered:
+
+- **`DeterministicModelClient`** (#2492) — a `ModelClient` that scripts responses, no network. Pairs with any agent so you can run the full agentic loop deterministically.
+- **`eval { }` DSL** (#2493) — declarative cases with typed assertions over the agent's `OUT`. Supports per-field checks, full structural snapshots, and grouped suites.
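As a rough illustration of the scripting idea behind `DeterministicModelClient`, the sketch below hands back canned responses in order, records each request, and fails loudly once the script runs out. `ScriptedClient`, `Reply`, and `ScriptExhausted` are hypothetical stand-ins written for this example and messages are modelled as plain strings; none of this is the `agents_engine` API, and the real client's internals may differ.

```kotlin
// Hypothetical stand-ins for illustration only; not the agents_engine types.
sealed interface Reply {
    data class Text(val text: String) : Reply
    data class ToolCalls(val names: List<String>) : Reply
}

class ScriptExhausted(val callIndex: Int, val scriptSize: Int) :
    IllegalStateException("call #$callIndex has no scripted response (script size $scriptSize)")

class ScriptedClient(private val script: List<Reply>) {
    private val recorded = mutableListOf<List<String>>()
    private var next = 0

    /** Every message list passed to chat(), in call order. */
    val requests: List<List<String>> get() = recorded

    /** Number of scripted responses not yet consumed. */
    fun remaining(): Int = script.size - next

    fun chat(messages: List<String>): Reply {
        // Copy the list so later mutation by the caller cannot rewrite history.
        recorded.add(messages.toList())
        if (next >= script.size) throw ScriptExhausted(next, script.size)
        return script[next++]
    }
}

fun main() {
    val client = ScriptedClient(
        listOf(Reply.ToolCalls(listOf("lookup")), Reply.Text("found 42")),
    )
    println(client.chat(listOf("what is 42?")))
    println(client.chat(listOf("what is 42?", "tool result: value-42")))
    println(client.remaining())
}
```

Because the script is a plain list consumed by index, two runs with the same script and the same inputs take the same path, which is the property the harness relies on.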
+
+Both live in package `agents_engine.testing` and ship in the main module — usable from any consumer's test source set without an extra artifact.
+
+---
+
+## `DeterministicModelClient`
+
+Hand back a pre-scripted sequence of `LlmResponse`s, one per `chat` call. The agent's loop runs end-to-end against the script, with the same Started → ArgsDelta → Finished → End chunk sequence on the streaming side (the default `ModelClient.chatStream` wraps `chat`).
+
+```kotlin
+import agents_engine.testing.DeterministicModelClient
+import agents_engine.model.LlmResponse
+import agents_engine.model.ToolCall
+
+val mock = DeterministicModelClient(
+    LlmResponse.ToolCalls(listOf(ToolCall("lookup", mapOf("id" to "42")))),
+    LlmResponse.Text("found 42"),
+)
+val agent = agent("test") {
+    model { ollama("t"); client = mock }
+    tools { tool("lookup", "lookup") { args -> "value-${args["id"]}" } }
+    skills { skill("s", "") { tools("lookup") } }
+}
+
+agent("what is 42?")    // → "found 42"
+mock.remaining()        // → 0 (both scripted responses consumed)
+mock.requests           // List> — every `chat` call's input
+```
+
+### What you get
+
+- **Byte-determinism.** Two runs against the same script + same agent + same input produce identical output.
+- **Request history.** `mock.requests` records every message list the agent built up across turns. Useful for asserting on conversation shape.
+- **Clear exhaustion errors.** If the agent calls `chat` more times than there are scripted responses, the client throws `DeterministicScriptExhausted(callIndex, scriptSize, lastMessages)` naming the offending turn.
+
+### Out of scope (v1)
+
+- **Record-from-live.** The #2492 ticket mentions "record-once/replay-many." That needs an HTTP-fixture story we'll write when there's demand. For now: hand-script the responses or compose with a recording-decorator pattern in your own test code.
+- **Per-token streaming chunks.** `chatStream` uses the default chunk-from-chat wrap — good enough for asserting on the streaming `AgentEvent` shape, not useful for testing provider-specific mid-stream edge cases.
+
+---
+
+## `eval { }` DSL
+
+Declarative cases with typed predicates over the agent's `OUT`.
+
+```kotlin
+import agents_engine.testing.eval
+
+val case = eval("repo-review") {
+    input(SpecText("review this repository"))
+    expect("nonempty risks") { it.risks.isNotEmpty() }
+    expect("at least 3 risks") { it.risks.size >= 3 }
+}
+
+val result = case.run(reviewAgent)
+assertTrue(result.passed) { result.failureMessage }
+```
+
+### Three expectation styles
+
+```kotlin
+// 1. Typed predicate — runs against the parsed OUT, not a string.
+expect("approved") { it.approved == true }
+
+// 2. Snapshot — pins the canonical toLlmInput(output) JSON.
+expectSnapshot(snapshot = """{"text":"Hello","approved":true}""")
+
+// 3. Single-field substring on the rendered JSON — quick for one field.
+expectFieldEquals("approved", true)
+```
+
+All three compose: multiple `expect` blocks must all pass for the case to pass. The failure message names every failing label and renders the typed output for diagnosis.
+
+### Suite mode
+
+Group cases:
+
+```kotlin
+import agents_engine.testing.evalSuite
+
+class GreetingEvalTest {
+    @Test
+    fun `greeting suite`() {
+        val suite = evalSuite("greeting") {
+            + eval("nonempty") {
+                input("hi")
+                expect("nonempty") { it.isNotEmpty() }
+            }
+            + eval("polite") {
+                input("hi")
+                expect("contains hello") { "hello" in it.lowercase() }
+            }
+        }
+        val result = suite.runAll(greetingAgent)
+        assertTrue(result.passed) { result.failureSummary }
+    }
+}
+```
+
+Suites are **type-homogeneous over the agent type at call time** — `EvalSuite.runAll(agent: Agent)` binds the case types at the call site. A mixed-shape suite is a compile error.
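One way a "mixed-shape suite is a compile error" guarantee can be expressed is with type parameters shared by the suite and its cases. The sketch below is a simplified, hypothetical model: `Case`, `Suite`, and `suite` are invented for this example, the agent is reduced to a plain `(IN) -> OUT` function, and the types are fixed on the suite itself rather than at the `runAll` call site as the text above describes.

```kotlin
// Hypothetical sketch; names and shapes are assumptions, not the agents_engine API.
class Case<IN, OUT>(
    val name: String,
    val input: IN,
    val checks: Map<String, (OUT) -> Boolean>,
) {
    /** Returns the labels of the checks that failed for this agent. */
    fun run(agent: (IN) -> OUT): List<String> {
        val output = agent(input)
        return checks.filterNot { (_, predicate) -> predicate(output) }.keys.toList()
    }
}

class Suite<IN, OUT>(val name: String) {
    private val cases = mutableListOf<Case<IN, OUT>>()

    // `+case` inside the builder block registers a case of the suite's own shape.
    operator fun Case<IN, OUT>.unaryPlus() {
        cases.add(this)
    }

    /** Maps each case name to its failing labels; an empty list means the case passed. */
    fun runAll(agent: (IN) -> OUT): Map<String, List<String>> =
        cases.associate { it.name to it.run(agent) }
}

fun <IN, OUT> suite(name: String, block: Suite<IN, OUT>.() -> Unit): Suite<IN, OUT> =
    Suite<IN, OUT>(name).apply(block)

fun main() {
    val greeting = suite<String, String>("greeting") {
        +Case("nonempty", "hi", mapOf("nonempty" to { out: String -> out.isNotEmpty() }))
        +Case("polite", "hi", mapOf("contains hello" to { out: String -> "hello" in out.lowercase() }))
        // +Case("wrong shape", 42, mapOf("positive" to { out: Int -> out > 0 }))
        // would be rejected by the compiler: Case<Int, Int> is not a Case<String, String>.
    }
    println(greeting.runAll { input -> "Hello, $input" })
}
```

The member extension `unaryPlus` only exists for `Case<IN, OUT>` with the suite's own type arguments, so a copy-pasted case with a different input or output type fails at compile time instead of at run time.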
+
+### Failure shape
+
+`EvalResult.failureMessage` is `null` on pass, structured on fail:
+
+```
+eval case "multi-fail" failed:
+  - starts with goodbye: [starts with goodbye] failed for output: "hello world"
+```
+
+When the agent throws during invocation, the result carries `invocationError` and the message names the exception. Use as `assertTrue(result.passed) { result.failureMessage }` in JUnit / kotlin-test.
+
+---
+
+## Composition: deterministic eval end-to-end
+
+```kotlin
+class RepoReviewEvalTest {
+    @Test
+    fun `repo review hits the audit criteria`() {
+        val mock = DeterministicModelClient(
+            LlmResponse.Text("""{"text":"All good","approved":true,"risks":[]}"""),
+        )
+        val agent = agent("review") {
+            model { ollama("test"); client = mock }
+            skills { skill("review", "") { tools() } }
+        }
+        val case = eval("approved-no-risks") {
+            input("review the repo")
+            expect("approved") { it.approved }
+            expect("no risks") { it.risks.isEmpty() }
+        }
+        val result = case.run(agent)
+        assertTrue(result.passed, result.failureMessage)
+    }
+}
+```
+
+The combination of `DeterministicModelClient` + `eval { }` gives you:
+
+- No network, no live LLM, no nondeterminism.
+- Typed assertions against the agent's `OUT` (not regex on the wire).
+- Pinning the model's response in source — when the prompt or schema changes, you update the script *and* the snapshot in the same diff.
+
+For real-model regression coverage there's the existing `live-llm` / `live-cloud-api` tagged tests; those are nondeterministic by design and out of scope for the eval harness.
+
+---
+
+## Related docs
+
+- [`docs/testing.md`](testing.md) — existing testing conventions (task names, integration test setup, mutation testing).
+- [`docs/observability.md`](observability.md) — the bridges that consume `AgentEvent` and `PipelineEvent` — useful when you're asserting on the streaming flow during eval.
+
+Sources: `agents_engine/testing/DeterministicModelClient.kt`, `agents_engine/testing/EvalDsl.kt`.
+
+Tests: `DeterministicModelClientTest.kt`, `EvalDslTest.kt`.
diff --git a/src/main/resources/internals-agent/testing/EvalHarness.md b/src/main/resources/internals-agent/testing/EvalHarness.md
new file mode 100644
index 0000000..49b03d4
--- /dev/null
+++ b/src/main/resources/internals-agent/testing/EvalHarness.md
@@ -0,0 +1,84 @@
+---
+description: Source-file knowledge for agents_engine/testing/DeterministicModelClient.kt and agents_engine/testing/EvalDsl.kt — eval harness (#2491 / #2492 / #2493). DeterministicModelClient is a ModelClient that scripts LlmResponses in order, fails fast on exhaustion (DeterministicScriptExhausted), records every requests list for assertions, byte-deterministic. eval(name) { input + expect + expectSnapshot + expectFieldEquals } DSL produces a typed EvalCase whose .run(agent) returns EvalResult(output, outcomes, invocationError). evalSuite(name) { + case + case } bundles cases. Composes for no-network eval — DeterministicModelClient + eval together give reproducible end-to-end assertions over Agent. Out of scope v1: record-from-live HTTP capture, per-token streaming chunk replay. Call when reasoning about deterministic test patterns or typed-assertion eval cases.
+---
+
+# `agents_engine/testing/*` — eval harness
+
+Two cooperating pieces in package `agents_engine.testing`:
+
+## `DeterministicModelClient`
+
+```kotlin
+class DeterministicModelClient(scripted: List) : ModelClient {
+    constructor(vararg responses: LlmResponse)
+    val requests: List>            // every chat() call's input
+    fun remaining(): Int           // unconsumed responses
+    override fun chat(messages: List): LlmResponse
+}
+
+class DeterministicScriptExhausted(val callIndex: Int, val scriptSize: Int, val lastMessages: List)
+    : IllegalStateException(...)
+```
+
+Scripts LlmResponses in order, one per chat() call. Streaming uses the default `ModelClient.chatStream` wrap — single-flow Started → ArgsDelta → Finished → End for tool-call responses, TextDelta + End for text responses.
Thread-safety: undefined under concurrent use (production loops are single-flight per session).
+
+## `eval { }` DSL
+
+```kotlin
+fun eval(name: String, block: EvalCaseBuilder.() -> Unit): EvalCase
+
+class EvalCaseBuilder {
+    fun input(value: IN)
+    fun expect(label: String = "expect", predicate: (OUT) -> Boolean)
+    fun expectSnapshot(label: String = "snapshot", snapshot: String)
+    fun expectFieldEquals(fieldPath: String, expected: Any?)
+}
+
+class EvalCase {
+    fun run(agent: Agent): EvalResult
+}
+
+data class EvalResult(val caseName, val output, val outcomes, val invocationError) {
+    val passed: Boolean
+    val failureMessage: String?
+}
+
+fun evalSuite(name: String, block: EvalSuite.() -> Unit): EvalSuite
+
+class EvalSuite {
+    operator fun EvalCase.unaryPlus()
+    fun runAll(agent: Agent): EvalSuiteResult
+}
+```
+
+## Composition
+
+`DeterministicModelClient` + `eval { }` ⇒ no-network reproducible eval. The model returns scripted responses; the eval case runs typed predicates on the agent's parsed `OUT`. Both run inside JUnit / kotlin-test alongside the normal suite; no new task or runner needed.
+
+## Three expectation styles
+
+| API | Use when |
+|---|---|
+| `expect("label") { predicate }` | Typed access to the parsed `OUT`. Most general; reflection-free. |
+| `expectSnapshot(snapshot = "...")` | Pin a full `toLlmInput(output)` JSON — diff on regression. |
+| `expectFieldEquals(field, value)` | Quick check on one field's rendered JSON value, no full snapshot. |
+
+All compose — multiple expects in one case must all pass. Failure messages name each failing label and render the typed output.
+
+## Failure modes
+
+- Agent threw mid-invocation: `EvalResult.invocationError` is non-null; `outcomes` is empty. `failureMessage` names the exception class + message + case name.
+- Expectation predicate returned false: per-outcome entry with `failureDetail` set.
+- Predicate itself threw: per-outcome entry with `failureDetail = "expectation threw: ..."`.
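The three failure modes listed above can be sketched as a small runner. This is a simplified model under stated assumptions: `Outcome`, `CaseResult`, and `runCase` are hypothetical names, the agent is a plain `(IN) -> OUT` function, and the exact wording of the real `failureDetail` strings is not reproduced beyond the `expectation threw:` prefix quoted in the list.

```kotlin
// Hypothetical sketch of the three failure modes; not the agents_engine implementation.
data class Outcome(val label: String, val failureDetail: String?) // null detail means passed

data class CaseResult<OUT>(
    val output: OUT?,
    val outcomes: List<Outcome>,
    val invocationError: Throwable?,
) {
    val passed: Boolean
        get() = invocationError == null && outcomes.all { it.failureDetail == null }
}

fun <IN, OUT> runCase(
    input: IN,
    expectations: List<Pair<String, (OUT) -> Boolean>>,
    agent: (IN) -> OUT,
): CaseResult<OUT> {
    // Mode 1: the agent itself throws, so there is no output and no outcomes.
    val output = try {
        agent(input)
    } catch (e: Exception) {
        return CaseResult(null, emptyList(), e)
    }
    val outcomes = expectations.map { (label, predicate) ->
        val detail = try {
            // Mode 2: the predicate ran and returned false.
            if (predicate(output)) null else "[$label] failed for output: $output"
        } catch (e: Exception) {
            // Mode 3: the predicate itself threw.
            "expectation threw: ${e.message}"
        }
        Outcome(label, detail)
    }
    return CaseResult(output, outcomes, null)
}

fun main() {
    val checks = listOf(
        "nonempty" to { s: String -> s.isNotEmpty() },
        "is a number" to { s: String -> s.toInt() >= 0 },
    )
    println(runCase("hi", checks) { it.uppercase() })
    println(runCase("hi", checks) { error("model unavailable") }.passed)
}
```

Keeping a throwing predicate as a per-outcome failure, rather than letting it abort the run, means the remaining expectations are still evaluated and reported.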
+
+## Out of scope (v1)
+
+- **Record-from-live** capture (#2492 ticket mentions it; needs HTTP-fixture infra).
+- **Per-token chunk replay** (current streaming uses default ChatChunk wrap).
+- **JSONPath in `expectFieldEquals`** (substring match on canonical JSON — good enough for typical fields; complex queries go through `expect { }` with reflection on the typed `OUT`).
+
+## Related files
+
+- `agents_engine/core/Agent.kt` — the agent that consumes the mock + receives the eval input.
+- `agents_engine/model/ModelClient.kt` — the SAM interface DeterministicModelClient implements.
+- `agents_engine/generation/GenerableSupport.kt` — `toLlmInput` used by snapshot + field expectations to render the typed `OUT`.

From b0e71501ad70a0bdf3b05a99fecabb84f1bb0f5c Mon Sep 17 00:00:00 2001
From: skobeltsyn
Date: Sat, 30 May 2026 11:26:50 +0300
Subject: [PATCH 3/4] =?UTF-8?q?feat(#2494):=20LLM-as-judge=20scorer=20?=
 =?UTF-8?q?=E2=80=94=20opt-in,=20advisory,=20never=20gates=20pass/fail?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

#2494 — completes the #2491 eval epic. Adds an opt-in judge for criteria that resist deterministic assertion (tone, relevance, completeness). Explicitly advisory by design.

```kotlin
val rubric = JudgeRubric(
    criteria = "Tone: warm, professional, no jargon.",
    judgeModel = DeterministicModelClient(
        LlmResponse.Text("""{"score":8,"rationale":"clear and warm"}"""),
    ),
)
val case = eval("repo-review") {
    input(spec)
    expect("approved") { it.approved }   // ← gates pass/fail
    judge("tone", rubric)                // ← advisory; never gates
}
val result = case.run(agent)
result.passed                  // ← depends ONLY on expect blocks
result.judgeVerdicts["tone"]   // ← JudgeOutcome.Scored(JudgeVerdict)
result.judgeSummary            // ← "[advisory] tone: 8 — clear and warm"
```

Implementation:

- `agents_engine/testing/LlmJudge.kt` (new):
  * `JudgeRubric(criteria, scoreRange = 0..10, judgeModel)` — typed rubric config.
    The judge model is independent of the production agent's model — for unit tests use `DeterministicModelClient`; for live eval use a pinned cloud model.
  * `JudgeVerdict(score: Int, rationale: String)` — `@Generable` so the judge model returns structured JSON that the framework parses through the existing `fromLlmOutput` pipeline. No free-text judge prompts → free-text verdicts.
  * Internal `LlmJudge(rubric).score(input, output)` — renders a system prompt + user message ("Input: X, Output: Y"), invokes the judge model, parses the verdict, validates `score` is in `rubric.scoreRange`, returns the typed verdict.
- `agents_engine/testing/EvalDsl.kt` (extended):
  * `EvalCaseBuilder.judge(label, rubric)` — registers an advisory scorer. Duplicate labels fail fast at builder time.
  * `EvalCase` carries an immutable `judges: List`. Runs each after the agent succeeds; judges do NOT run when the agent invocation itself fails (no output to score).
  * `EvalResult.judgeVerdicts: Map` — captured verdicts keyed by label. Sealed `JudgeOutcome { Scored(verdict) | Errored(detail) }` — parse failures or out-of-range scores surface as `Errored` but never gate `passed`.
  * `EvalResult.passed` and `EvalResult.failureMessage` consider ONLY deterministic `outcomes` and `invocationError`. Judges are structurally excluded from the gating contract.
  * `EvalResult.judgeSummary: String` — multi-line `[advisory]