Objective
Add an eval-generator subagent to agentv-eval-writer that auto-generates a draft EVAL.yaml from an existing artifact (skill file, prompt template, agent config). This removes the cold-start friction for evaluation — users can bootstrap evals from what they already have instead of hand-authoring from scratch.
Design Latitude
Location: plugins/agentv-dev/skills/agentv-eval-writer/agents/eval-generator.md
Process:
- Read the artifact (skill file, prompt template, etc.)
- Identify 6-8 realistic test scenarios that exercise the artifact's stated purpose
- Define 3-5 named assertions per test (deterministic where possible, `llm-grader` where not)
- Generate an EVAL.yaml with `tests:` and `assertions:` following the AgentEvals schema
Output: A draft EVAL.yaml that the human reviews before using.
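The process above could produce a draft shaped roughly like the sketch below. The `tests:` and `assertions:` keys follow the issue's description of the AgentEvals schema; the remaining field names (`name`, `input`, `type`, `value`, `criteria`) and the sample content are illustrative assumptions, not schema-verified:

```yaml
# DRAFT — generated by eval-generator; requires human review before use.
# Field names other than tests:/assertions: are illustrative.
tests:
  - name: reviews_snippet_with_null_bug
    input: |
      Review this function:
      def get_name(user):
          return user.name.upper()
    assertions:
      - name: IDENTIFIES_NULL_BUG
        type: contains      # deterministic check preferred
        value: "None"
      - name: SUGGESTS_FIX
        type: llm-grader    # semantic judgment genuinely required here
        criteria: Proposes a concrete guard against a missing user or name.
```

The leading comment satisfies the acceptance signal that the output flags the need for human review.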
Assertion quality heuristics:
- Prefer deterministic assertions (`contains`, `regex`, `is-json`) over `llm-grader` wherever possible
- Use `llm-grader` only when semantic understanding is genuinely required
- Name assertions descriptively in SCREAMING_SNAKE_CASE (e.g., `IDENTIFIES_NULL_BUG`, `SUGGESTS_FIX`)
- Include both positive ("should do X") and negative ("should not do Y") test cases
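Applied together, these heuristics might yield a positive/negative assertion pair like the following sketch (the `type`, `value`, and `criteria` fields are assumed, not confirmed against the schema):

```yaml
# Positive case: deterministic, so no LLM call is needed.
- name: IDENTIFIES_NULL_BUG
  type: regex
  value: "(None|null|AttributeError)"
# Negative case: semantic, so llm-grader is justified.
- name: DOES_NOT_REWRITE_UNRELATED_CODE
  type: llm-grader
  criteria: The response does not suggest changes outside the reviewed function.
```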
Integration with agentv-eval-writer: This is a specialized entry point for "generate from existing artifact," not a replacement for the eval-writer's general authoring capabilities.
Acceptance Signals
- `agents/eval-generator.md` exists in agentv-eval-writer
- Given a SKILL.md, produces a valid EVAL.yaml with 6-8 test scenarios and assertions
- Generated EVAL.yaml passes `agentv validate`
- Works for skill files, prompt templates, and agent configs
- Output includes a comment or note indicating human review is needed
Non-Goals
- Not a replacement for hand-authored evals — generated evals are drafts that need human review
- Not a test case generator for existing EVAL.yaml files (that's a different workflow)
- Does not auto-run the generated eval — the human reviews first
- Does not guarantee assertion quality — the human validates that the eval measures what matters
Context
The autoresearch pattern requires an eval definition before the optimization loop can begin. In practice, writing the eval is the highest-friction step — users have the artifact but not the eval. The eval-generator reduces this from "write an eval from scratch" to "review and refine a generated draft."
pi-autoresearch takes a similar approach: the skill instructs the agent to set up the benchmark script and metrics before starting the loop. The eval-generator formalizes this as a reusable subagent.
Related
- agentv-eval-writer skill (parent skill this extends)