
feat(core): support structured llm-grader context#1342

Merged: christso merged 3 commits into main from feat/llm-grader-structured-input on Jun 10, 2026
christso commented Jun 10, 2026

Collaborator

Summary

AgentV can now express Dexter-style semantic grading directly with llm-grader instead of routing through a code-grader shim. Eval suites can share source metadata once, structured task data stays in the existing input message model, and reusable llm-grader prompt files can receive metadata and rubric entries as template variables.

This keeps the feature AgentV-native and non-breaking: all new YAML fields are optional, wire-format fields stay snake_case, and existing governance inheritance remains supported.

Design notes

  • Suite-level metadata: is inherited into each test's metadata; per-test metadata merges over suite metadata.
  • Structured task input reuses the existing input field. Object-valued input now expands to a single user message with JSON object content, matching the existing code-grader pattern of receiving canonical input messages.
  • llm-grader custom prompts now receive structured metadata and rubrics variables in freeform, rubric, and score-range modes.
  • Rubric objects accept criteria: as an alias for canonical outcome:, so native Dexter rows like { operator, criteria } do not need dataset-specific transformation.
  • No separate input_object / inputObject field or template variable is introduced.
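The object-valued input expansion described in the design notes can be sketched roughly as follows. This is a minimal illustration, not AgentV's actual parser: the `InputMessage` type and the `expandInput` and `renderInputText` helpers are hypothetical names introduced here, and only the resulting shape (a single user message whose content is the JSON object, rendered as formatted JSON) comes from this PR.

```typescript
// Hypothetical types and helpers for illustration; not AgentV's real API.
type JsonObject = { [key: string]: unknown };
type InputMessage = { role: "user"; content: string | JsonObject };

function isPlainObject(value: unknown): value is JsonObject {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Expand a test's `input` (string or object) into one canonical user message.
function expandInput(input: string | JsonObject): InputMessage {
  return { role: "user", content: input };
}

// Render message content as text: objects become formatted (2-space) JSON,
// strings pass through unchanged.
function renderInputText(message: InputMessage): string {
  return isPlainObject(message.content)
    ? JSON.stringify(message.content, null, 2)
    : message.content;
}

const message = expandInput({ company: "Apple", ticker: "AAPL" });
const question = renderInputText(message);
console.log(question);
```

Keeping the object inside the existing message model is what lets the design avoid a separate input_object field: the structured value and its text rendering are both derived from the same message.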

av-zk0.3 handoff

Use suite-level metadata once in the financial-research-agent eval:

metadata:
  source_repo: https://github.com/virattt/dexter
  source_commit: 8d9419829f443f84b804d033bb2c3b1fbd788629
  source_file: src/evals/dataset/finance_agent.csv

Each test inherits this into metadata. Per-test metadata merges over it: arrays concatenate suite-first with de-duping, nested objects merge recursively, and per-test scalar values override suite scalars. Top-level governance: still overrides metadata.governance for the governance-specific path.
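As a rough sketch of the merge rules above (not the actual AgentV implementation), the function below concatenates arrays suite-first with de-duping, merges nested objects recursively, and lets per-test scalars win. The name `mergeMetadata`, the de-dupe-by-serialized-value strategy, and the sample keys `tags`, `owner`, and `difficulty` are assumptions made for illustration; the top-level governance: override path is deliberately not modeled.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };
type JsonObject = { [key: string]: Json };

function isObject(value: Json | undefined): value is JsonObject {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Hypothetical helper: merge per-test metadata over suite metadata.
function mergeMetadata(suite: JsonObject, test: JsonObject): JsonObject {
  const merged: JsonObject = { ...suite };
  for (const [key, testValue] of Object.entries(test)) {
    const suiteValue = merged[key];
    if (Array.isArray(suiteValue) && Array.isArray(testValue)) {
      // Arrays: suite entries first, then test entries, dropping duplicates.
      // Assumption: duplicates are detected by comparing serialized values.
      const seen = new Set<string>();
      merged[key] = [...suiteValue, ...testValue].filter((item) => {
        const serialized = JSON.stringify(item);
        if (seen.has(serialized)) return false;
        seen.add(serialized);
        return true;
      });
    } else if (isObject(suiteValue) && isObject(testValue)) {
      // Nested objects: merge recursively.
      merged[key] = mergeMetadata(suiteValue, testValue);
    } else {
      // Scalars and mismatched types: the per-test value overrides.
      merged[key] = testValue;
    }
  }
  return merged;
}

const suiteMetadata: JsonObject = {
  source_file: "src/evals/dataset/finance_agent.csv",
  tags: ["finance", "dexter"],
  owner: { team: "evals", tier: "gold" },
  difficulty: "easy",
};
const testMetadata: JsonObject = {
  tags: ["dexter", "aapl"],
  owner: { tier: "silver" },
  difficulty: "hard",
};
const mergedMetadata = mergeMetadata(suiteMetadata, testMetadata);
console.log(JSON.stringify(mergedMetadata, null, 2));
```

In this sketch a test that declares no metadata simply receives a shallow copy of the suite metadata, which matches the inheritance behavior described above.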

Use a reusable grader prompt file by referencing it explicitly:

assertions:
  - type: llm-grader
    prompt: file://prompts/dexter-grader.md

If the full grader config is shared, keep it under suite-level assertions or an assertion include. If rubrics vary per test, place the llm-grader assertion on each test while pointing all of them at the same file://... prompt.

Put structured task input in input:

tests:
  - id: apple-finance
    input:
      company: Apple
      ticker: AAPL
    assertions:
      - type: llm-grader
        prompt: file://prompts/dexter-grader.md
        rubrics:
          - operator: correctness
            criteria: Uses the provided ticker.
          - operator: contradiction
            criteria: Does not contradict the source data.

Available llm-grader template variables for the prompt file:

  • Existing text variables: {{input}}, {{output}}, {{expected_output}}, {{criteria}}
  • Structured variables: {{metadata}}, {{metadata_json}}, {{rubrics}}, {{rubrics_json}}
  • Non-_json structured values are formatted JSON; _json values are compact JSON. Missing values render as an empty string.
  • When input contains a JSON object, {{input}} renders that object as formatted JSON.
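The variable formatting rules listed above could look roughly like this in code. The sketch assumes a simple `{{name}}` substitution; the `structured` and `renderTemplate` helpers are hypothetical, and the real prompt resolver may differ, for example by warning on unresolved variables instead of silently rendering them as empty.

```typescript
// Format a structured value: formatted JSON for `name`, compact JSON for
// `name_json`, and an empty string when the value is missing.
function structured(value: unknown, compact: boolean): string {
  if (value === undefined || value === null) return "";
  return compact ? JSON.stringify(value) : JSON.stringify(value, null, 2);
}

// Hypothetical renderer: replace each {{name}} with its value, or "" if absent.
function renderTemplate(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (_match, name: string) => vars[name] ?? "");
}

const metadata = { source_file: "src/evals/dataset/finance_agent.csv" };
const rubrics = [{ operator: "correctness", criteria: "Uses the provided ticker." }];

const vars: Record<string, string> = {
  input: structured({ company: "Apple", ticker: "AAPL" }, false),
  metadata: structured(metadata, false),
  metadata_json: structured(metadata, true),
  rubrics: structured(rubrics, false),
  rubrics_json: structured(rubrics, true),
  expected_output: structured(undefined, false), // missing value renders as ""
};

const prompt = renderTemplate(
  "Source: {{metadata_json}}\nExpected: [{{expected_output}}]",
  vars,
);
console.log(prompt);
```

The compact `_json` variants are convenient when the prompt embeds the value inline, while the formatted variants read better as standalone sections of a grader prompt.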

Suggested av-zk0.3 validation:

  1. Pull this AgentV change and run bun run build before CLI checks because the CLI imports @agentv/core from dist.
  2. Load the financial-research-agent eval YAML and confirm the parsed test has inherited metadata, object-valued input rendered into question, and rubric outcome populated from criteria.
  3. Run a live eval using llm-grader and inspect exported JSONL: scores[].type should be llm-grader, and the grader prompt should include the expected source metadata, structured input text, and Dexter rubric entries.
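For step 3, a small script can scan the exported JSONL for rows whose scores are not all llm-grader. This is only a sketch: it relies solely on the `scores[].type` field mentioned above, treats every other part of the row shape as unknown, and the `findNonLlmGraderRows` helper and inline sample rows are made up for illustration.

```typescript
// Only `scores[].type` is taken from the PR description; other fields are ignored.
type Score = { type?: string };
type ResultRow = { scores?: Score[] };

// Return 1-based positions (among non-empty lines) of rows that have no scores
// or at least one score whose type is not "llm-grader".
function findNonLlmGraderRows(jsonl: string): number[] {
  const flagged: number[] = [];
  const lines = jsonl.split("\n").filter((line) => line.trim() !== "");
  lines.forEach((line, index) => {
    const row = JSON.parse(line) as ResultRow;
    const scores = row.scores ?? [];
    const allLlmGrader =
      scores.length > 0 && scores.every((score) => score.type === "llm-grader");
    if (!allLlmGrader) flagged.push(index + 1);
  });
  return flagged;
}

// Hypothetical sample standing in for an exported results file.
const sample = [
  '{"scores":[{"type":"llm-grader"}]}',
  '{"scores":[{"type":"code-grader"}]}',
  '{"scores":[]}',
].join("\n");

const flaggedRows = findNonLlmGraderRows(sample);
console.log(flaggedRows);
```

An empty result means every row was scored by llm-grader; any flagged row suggests the eval is still routed through the code-grader shim or produced no scores.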

Red/green UAT

Red on origin/main: a Dexter-like suite with top-level metadata, object-valued input, and rubric criteria did not surface the suite metadata and skipped the rubric because it was missing canonical outcome.

Green on this branch: the same shape parses with inherited source metadata, object-valued input preserved as canonical user message content and rendered into question, and a rubric outcome populated from criteria:

{
  "metadata": {
    "source_repo": "https://github.com/virattt/dexter",
    "source_commit": "8d9419829f443f84b804d033bb2c3b1fbd788629",
    "source_file": "src/evals/dataset/finance_agent.csv"
  },
  "input": {
    "role": "user",
    "content": { "company": "Apple", "ticker": "AAPL" }
  },
  "question": "{\n  \"company\": \"Apple\",\n  \"ticker\": \"AAPL\"\n}",
  "rubric": {
    "id": "rubric-1",
    "outcome": "Uses the provided ticker.",
    "operator": "correctness",
    "weight": 1,
    "required": true
  }
}
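The rubric portion of the green output can be reproduced with a small normalization sketch. The `normalizeRubric` helper is hypothetical; the defaults (`id` of `rubric-N`, `weight` 1, `required` true) are inferred from the UAT output above rather than from the parser source, and giving `outcome` precedence when both fields are present is an assumption.

```typescript
type RawRubric = {
  id?: string;
  outcome?: string;
  criteria?: string; // alias for `outcome`
  operator?: string;
  weight?: number;
  required?: boolean;
};

type Rubric = {
  id: string;
  outcome: string;
  operator?: string;
  weight: number;
  required: boolean;
};

// Hypothetical helper: resolve the `criteria` alias and fill in defaults.
// Returns undefined when neither `outcome` nor `criteria` is usable, which
// corresponds to the rubric being skipped.
function normalizeRubric(raw: RawRubric, index: number): Rubric | undefined {
  const outcome = raw.outcome ?? raw.criteria; // assumption: `outcome` wins
  if (outcome === undefined || outcome.trim() === "") return undefined;
  return {
    id: raw.id ?? `rubric-${index + 1}`,
    outcome,
    ...(raw.operator !== undefined ? { operator: raw.operator } : {}),
    weight: raw.weight ?? 1,
    required: raw.required ?? true,
  };
}

const normalized = normalizeRubric(
  { operator: "correctness", criteria: "Uses the provided ticker." },
  0,
);
console.log(JSON.stringify(normalized, null, 2));
```

Resolving the alias at parse time means downstream grader code only ever sees canonical `outcome`, so Dexter rows need no dataset-specific transformation.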

Verification

  • bun test packages/core/test/evaluation/yaml-parser-metadata.test.ts packages/core/test/evaluation/evaluators_variables.test.ts packages/core/test/evaluation/graders/prompt-resolution.test.ts
  • bun run lint
  • bun run typecheck
  • bun run test
  • bun run validate:examples
  • git diff --check
  • Manual parser UAT for suite metadata + object-valued input + rubric criteria alias

Post-Deploy Monitoring & Validation

No additional production monitoring is required; this changes local eval parsing and grader prompt assembly only. The validation window is the first CI run plus the av-zk0.3 Dexter follow-up.

  • Healthy signals: the Dexter eval YAML parses without rubric warnings, prompt material includes metadata/rubrics and structured task input via {{input}}, and result JSONL uses llm-grader scores.
  • Failure signals: missing-outcome warnings, unresolved-template-variable warnings, or skipped rubric entries.
  • Mitigation: revert this PR or temporarily keep the code-grader shim while av-zk0.3 is adjusted.




cloudflare-workers-and-pages Bot commented Jun 10, 2026


Deploying agentv with Cloudflare Pages

Latest commit: 02bd698
Status: ✅  Deploy successful!
Preview URL: https://7c4d1c28.agentv.pages.dev
Branch Preview URL: https://feat-llm-grader-structured-i.agentv.pages.dev


christso changed the title from "feat(core): add structured llm-grader inputs" to "feat(core): support structured llm-grader context" on Jun 10, 2026
christso merged commit e43a4d4 into main on Jun 10, 2026
8 checks passed
christso deleted the feat/llm-grader-structured-input branch on June 10, 2026 04:45