feat(core): support structured llm-grader context #1342
Merged
Summary
AgentV can now express Dexter-style semantic grading directly with `llm-grader` instead of routing through a code-grader shim. Eval suites can share source metadata once, structured task data stays in the existing `input` message model, and reusable llm-grader prompt files can receive metadata and rubric entries as template variables.

This keeps the feature AgentV-native and non-breaking: all new YAML fields are optional, wire-format fields stay snake_case, and existing `governance` inheritance remains supported.

Design notes

- Suite-level `metadata:` is inherited into each test's `metadata`; per-test metadata merges over suite metadata.
- Structured task data uses the existing `input` field. Object-valued `input` now expands to a single user message with JSON object content, matching the existing code-grader pattern of receiving canonical `input` messages.
- `llm-grader` custom prompts now receive structured `metadata` and `rubrics` variables in freeform, rubric, and score-range modes.
- Rubric entries accept `criteria:` as an alias for canonical `outcome:`, so native Dexter rows like `{ operator, criteria }` do not need dataset-specific transformation.
- No `input_object`/`inputObject` field or template variable is introduced.

av-zk0.3 handoff

Use suite-level metadata once in the financial-research-agent eval:
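
The inheritance and merge rules for this metadata can be sketched in TypeScript. This is a minimal illustration under stated assumptions, not AgentV's implementation: `mergeMetadata`, the `tags` field, and the `notes` field are hypothetical, de-duping by JSON text is an assumed comparison rule, and the `source_*` values are copied from the UAT output in this PR.

```typescript
// Hypothetical sketch of the suite/test metadata merge rules; not AgentV's code.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };
type JsonObject = { [key: string]: Json };

const isObject = (v: Json | undefined): v is JsonObject =>
  typeof v === "object" && v !== null && !Array.isArray(v);

function mergeMetadata(suite: JsonObject, test: JsonObject): JsonObject {
  const merged: JsonObject = { ...suite };
  for (const [key, testValue] of Object.entries(test)) {
    const suiteValue = suite[key];
    if (Array.isArray(suiteValue) && Array.isArray(testValue)) {
      // Arrays concatenate suite-first; duplicates (compared by JSON text) are dropped.
      const seen = new Set<string>();
      merged[key] = [...suiteValue, ...testValue].filter((item) => {
        const id = JSON.stringify(item);
        if (seen.has(id)) return false;
        seen.add(id);
        return true;
      });
    } else if (isObject(suiteValue) && isObject(testValue)) {
      // Nested objects merge recursively.
      merged[key] = mergeMetadata(suiteValue, testValue);
    } else {
      // Per-test scalars (and mismatched types) override the suite value.
      merged[key] = testValue;
    }
  }
  return merged;
}

const suiteMetadata: JsonObject = {
  source_repo: "https://github.com/virattt/dexter",
  source_commit: "8d9419829f443f84b804d033bb2c3b1fbd788629",
  source_file: "src/evals/dataset/finance_agent.csv",
  tags: ["finance"], // hypothetical field
  notes: { priority: "low" }, // hypothetical field
};
const testMetadata: JsonObject = {
  tags: ["finance", "aapl"],
  notes: { priority: "high" },
};

const mergedMetadata = mergeMetadata(suiteMetadata, testMetadata);
console.log(JSON.stringify(mergedMetadata, null, 2));
```

In this sketch the test keeps all three `source_*` fields from the suite, `tags` becomes `["finance", "aapl"]`, and `notes.priority` becomes `"high"`.
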

Each test inherits this into `metadata`. Per-test `metadata` merges over it: arrays concatenate suite-first with de-duping, nested objects merge recursively, and per-test scalar values override suite scalars. Top-level `governance:` still overrides `metadata.governance` for the governance-specific path.

Use a reusable grader prompt file by referencing it explicitly:

If the full grader config is shared, keep it under suite-level `assertions` or an assertion include. If rubrics vary per test, place the `llm-grader` assertion on each test while pointing all of them at the same `file://...` prompt.

Put structured task input in `input`:

Available llm-grader template variables for the prompt file:

- `{{input}}`, `{{output}}`, `{{expected_output}}`, `{{criteria}}`
- `{{metadata}}`, `{{metadata_json}}`, `{{rubrics}}`, `{{rubrics_json}}`
- Non-`_json` structured values are formatted JSON; `_json` values are compact JSON. Missing values render as an empty string.
- When `input` contains a JSON object, `{{input}}` renders that object as formatted JSON.

Suggested av-zk0.3 validation:

- Run `bun run build` before CLI checks because the CLI imports `@agentv/core` from `dist`.
- Confirm the parsed eval has inherited `metadata`, object-valued `input` rendered into `question`, and rubric `outcome` populated from `criteria`.
- Run the eval with `llm-grader` and inspect exported JSONL: `scores[].type` should be `llm-grader`, and the grader prompt should include the expected source metadata, structured input text, and Dexter rubric entries.

Red/green UAT

Red on `origin/main`: a Dexter-like suite with top-level `metadata`, object-valued `input`, and rubric `criteria` did not surface the suite metadata and skipped the rubric because it was missing canonical `outcome`.

Green on this branch: the same shape parses with inherited source metadata, object-valued input preserved as canonical user message content and rendered into `question`, and a rubric outcome populated from `criteria`:

```json
{
  "metadata": {
    "source_repo": "https://github.com/virattt/dexter",
    "source_commit": "8d9419829f443f84b804d033bb2c3b1fbd788629",
    "source_file": "src/evals/dataset/finance_agent.csv"
  },
  "input": {
    "role": "user",
    "content": { "company": "Apple", "ticker": "AAPL" }
  },
  "question": "{\n \"company\": \"Apple\",\n \"ticker\": \"AAPL\"\n}",
  "rubric": {
    "id": "rubric-1",
    "outcome": "Uses the provided ticker.",
    "operator": "correctness",
    "weight": 1,
    "required": true
  }
}
```

Verification

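
The green shape from the red/green UAT can be reproduced with a small standalone TypeScript sketch. It is illustrative only: `expandInput` and `normalizeRubric` are hypothetical names, the `rubric-N` id, `weight: 1`, and `required: true` defaults are assumptions read off the UAT output rather than confirmed parser defaults, and preferring `outcome` over `criteria` when both are present is also an assumption.

```typescript
// Hypothetical sketch of the two parsing rules; not AgentV's parser.
type Message = { role: string; content: unknown };
type RawRubric = {
  id?: string;
  outcome?: string;
  criteria?: string;
  operator?: string;
  weight?: number;
  required?: boolean;
};
type Rubric = {
  id: string;
  outcome: string;
  operator?: string;
  weight: number;
  required: boolean;
};

// Object-valued input expands to a single user message with the object as content.
function expandInput(input: string | Record<string, unknown> | Message[]): Message[] {
  if (Array.isArray(input)) return input; // already canonical messages
  return [{ role: "user", content: input }];
}

// `criteria` is accepted as an alias for canonical `outcome`.
function normalizeRubric(raw: RawRubric, index: number): Rubric | undefined {
  const outcome = raw.outcome ?? raw.criteria;
  if (outcome === undefined) return undefined; // entry would be skipped
  return {
    id: raw.id ?? `rubric-${index + 1}`, // assumed default id
    outcome,
    operator: raw.operator,
    weight: raw.weight ?? 1, // assumed default
    required: raw.required ?? true, // assumed default
  };
}

const messages = expandInput({ company: "Apple", ticker: "AAPL" });
const rubric = normalizeRubric(
  { operator: "correctness", criteria: "Uses the provided ticker." },
  0,
);
console.log(JSON.stringify({ input: messages[0], rubric }, null, 2));
```

A row with neither `outcome` nor `criteria` returns `undefined` here, which corresponds to the skipped-rubric behavior described for `origin/main`.
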
- `bun test packages/core/test/evaluation/yaml-parser-metadata.test.ts packages/core/test/evaluation/evaluators_variables.test.ts packages/core/test/evaluation/graders/prompt-resolution.test.ts`
- `bun run lint`
- `bun run typecheck`
- `bun run test`
- `bun run validate:examples`
- `git diff --check`

Post-Deploy Monitoring & Validation

No additional production monitoring is required; this changes local eval parsing and grader prompt assembly only. The validation window is the first CI run plus the av-zk0.3 Dexter follow-up. Healthy signals: the Dexter eval YAML parses without rubric warnings, prompt material includes `metadata`/`rubrics` and structured task input via `{{input}}`, and result JSONL uses `llm-grader` scores. Failure signals: missing-outcome warnings, unresolved-template-variable warnings, or skipped rubric entries; mitigation is to revert this PR or temporarily keep the code-grader shim while av-zk0.3 is adjusted.
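
When checking the healthy signals by hand, it may help to see how the template variables are expected to render. The following is a minimal sketch, not AgentV's prompt resolver: `renderGraderPrompt` and `GraderContext` are hypothetical, two-space indentation for formatted JSON is an assumption, and leaving unknown placeholders in place while collecting their names is an assumed stand-in for the unresolved-template-variable warning.

```typescript
// Hypothetical sketch of llm-grader template rendering; not AgentV's resolver.
type GraderContext = {
  input?: unknown;
  output?: unknown;
  expected_output?: unknown;
  criteria?: unknown;
  metadata?: unknown;
  rubrics?: unknown;
};

// Structured values render as formatted JSON; missing values render as "".
const formatted = (v: unknown): string =>
  v === undefined || v === null ? "" : typeof v === "string" ? v : JSON.stringify(v, null, 2);

// `_json` variables render as compact JSON; missing values render as "".
const compact = (v: unknown): string =>
  v === undefined || v === null ? "" : JSON.stringify(v);

function renderGraderPrompt(
  template: string,
  ctx: GraderContext,
): { text: string; unresolved: string[] } {
  const vars = new Map<string, string>([
    ["input", formatted(ctx.input)],
    ["output", formatted(ctx.output)],
    ["expected_output", formatted(ctx.expected_output)],
    ["criteria", formatted(ctx.criteria)],
    ["metadata", formatted(ctx.metadata)],
    ["metadata_json", compact(ctx.metadata)],
    ["rubrics", formatted(ctx.rubrics)],
    ["rubrics_json", compact(ctx.rubrics)],
  ]);
  const unresolved: string[] = [];
  const text = template.replace(
    /\{\{\s*([A-Za-z_]\w*)\s*\}\}/g,
    (match: string, name: string) => {
      const value = vars.get(name);
      if (value === undefined) {
        unresolved.push(name); // unknown variable: keep the placeholder visible
        return match;
      }
      return value;
    },
  );
  return { text, unresolved };
}

const rendered = renderGraderPrompt(
  "Task:\n{{input}}\n\nSource: {{metadata_json}}\n\nRubrics:\n{{rubrics}}",
  {
    input: { company: "Apple", ticker: "AAPL" },
    metadata: { source_file: "src/evals/dataset/finance_agent.csv" },
    rubrics: [{ id: "rubric-1", outcome: "Uses the provided ticker." }],
  },
);
console.log(rendered.text);
```

An empty `unresolved` list together with non-empty metadata, rubric, and input text in the rendered prompt corresponds to the healthy signals listed above.
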