feat(eval-writer): eval-generator subagent — bootstrap EVAL.yaml from existing artifacts #747

@christso

Description

Objective

Add an eval-generator subagent to agentv-eval-writer that auto-generates a draft EVAL.yaml from an existing artifact (skill file, prompt template, agent config). This removes the cold-start friction for evaluation — users can bootstrap evals from what they already have instead of hand-authoring from scratch.

Design Latitude

Location: plugins/agentv-dev/skills/agentv-eval-writer/agents/eval-generator.md

Process:

  1. Read the artifact (skill file, prompt template, etc.)
  2. Identify 6-8 realistic test scenarios that exercise the artifact's stated purpose
  3. Define 3-5 named assertions per test (deterministic where possible, llm-grader where not)
  4. Generate an EVAL.yaml with tests: and assertions: following the AgentEvals schema

Output: A draft EVAL.yaml that the human reviews before using.
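To make the expected shape concrete, here is a minimal sketch of what a generated draft might look like. The top-level `tests:` and `assertions:` keys come from this issue; every other field name, value, and the test content itself are illustrative assumptions, not a confirmed rendering of the AgentEvals schema.

```yaml
# DRAFT — generated by eval-generator. Review before use.
# Field names other than `tests:` and `assertions:` are assumptions.
tests:
  - name: summarize_changelog
    input: "Summarize the attached changelog in three bullet points."
    assertions:
      - name: RETURNS_BULLET_LIST     # deterministic check
        type: regex
        value: "^\\s*[-*] "
      - name: COVERS_BREAKING_CHANGE  # semantic check
        type: llm-grader
        prompt: "Does the summary mention the breaking API change?"
```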

Assertion quality heuristics:

  • Prefer deterministic assertions (contains, regex, is-json) over llm-grader wherever possible
  • Use llm-grader only when semantic understanding is genuinely required
  • Name assertions descriptively in SCREAMING_SNAKE_CASE (e.g., IDENTIFIES_NULL_BUG, SUGGESTS_FIX)
  • Include both positive ("should do X") and negative ("should not do Y") test cases
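The heuristics above can be illustrated with a hypothetical assertion set for a code-review skill. Assertion names `IDENTIFIES_NULL_BUG` and `SUGGESTS_FIX` are taken from the issue; the assertion types `contains`, `regex`, `is-json`, and `llm-grader` are the ones named above, while the exact YAML field layout is an assumption.

```yaml
# Hypothetical assertions for a code-review skill, applying the heuristics.
assertions:
  # Prefer deterministic assertions where possible:
  - name: IDENTIFIES_NULL_BUG         # positive: "should do X"
    type: contains
    value: "null"
  - name: OUTPUT_IS_JSON
    type: is-json
  # llm-grader only where semantic understanding is genuinely required:
  - name: SUGGESTS_FIX
    type: llm-grader
    prompt: "Does the review propose a concrete fix for the null dereference?"
  # Negative case: "should not do Y"
  - name: DOES_NOT_REWRITE_WHOLE_FILE
    type: llm-grader
    prompt: "Does the response avoid rewriting the entire file?"
```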

Integration with agentv-eval-writer: This is a specialized entry point for "generate from existing artifact," not a replacement for the eval-writer's general authoring capabilities.

Acceptance Signals

  • agents/eval-generator.md exists in agentv-eval-writer
  • Given a SKILL.md, produces a valid EVAL.yaml with 6-8 test scenarios and assertions
  • Generated EVAL.yaml passes agentv validate
  • Works for skill files, prompt templates, and agent configs
  • Output includes a comment or note indicating human review is needed

Non-Goals

  • Not a replacement for hand-authored evals — generated evals are drafts that need human review
  • Not a test case generator for existing EVAL.yaml files (that's a different workflow)
  • Does not auto-run the generated eval — the human reviews first
  • Does not guarantee assertion quality — the human validates that the eval measures what matters

Context

The autoresearch pattern requires an eval definition before the optimization loop can begin. In practice, writing the eval is the highest-friction step — users have the artifact but not the eval. The eval-generator reduces this from "write an eval from scratch" to "review and refine a generated draft."

pi-autoresearch takes a similar approach: the skill instructs the agent to set up the benchmark script and metrics before starting the loop. The eval-generator formalizes this as a reusable subagent.

Related

  • agentv-eval-writer skill (parent skill this extends)
