feat(eval-writer): eval-generator subagent — bootstrap EVAL.yaml from existing artifacts #747

@christso

Description

Objective

Add an eval-generator subagent to agentv-eval-writer that auto-generates a draft EVAL.yaml from an existing artifact (skill file, prompt template, agent config). This removes the cold-start friction for evaluation — users can bootstrap evals from what they already have instead of hand-authoring from scratch.

Design Latitude

Location: plugins/agentv-dev/skills/agentv-eval-writer/agents/eval-generator.md

Process:

  1. Read the artifact (skill file, prompt template, etc.)
  2. Identify 6-8 realistic test scenarios that exercise the artifact's stated purpose
  3. Define 3-5 named assertions per test (deterministic where possible, llm-grader where not)
  4. Generate an EVAL.yaml with tests: and assertions: following the AgentEvals schema

Output: A draft EVAL.yaml that the human reviews before using.
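To make the expected shape concrete, here is a minimal sketch of what a generated draft might look like. The top-level `tests:` and `assertions:` keys come from this issue; every other field name, value, and the test content itself are illustrative assumptions, not a confirmed rendering of the AgentEvals schema.

```yaml
# DRAFT — generated by eval-generator. Review before use.
# Field names other than `tests:` and `assertions:` are assumptions.
tests:
  - name: summarize_changelog
    input: "Summarize the attached changelog in three bullet points."
    assertions:
      - name: RETURNS_BULLET_LIST     # deterministic check
        type: regex
        value: "^\\s*[-*] "
      - name: COVERS_BREAKING_CHANGE  # semantic check
        type: llm-grader
        prompt: "Does the summary mention the breaking API change?"
```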

Assertion quality heuristics:

  • Prefer deterministic assertions (contains, regex, is-json) over llm-grader wherever possible
  • Use llm-grader only when semantic understanding is genuinely required
  • Name assertions descriptively in SCREAMING_SNAKE_CASE (e.g., IDENTIFIES_NULL_BUG, SUGGESTS_FIX)
  • Include both positive ("should do X") and negative ("should not do Y") test cases
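The heuristics above can be illustrated with a hypothetical assertion set for a code-review skill. Assertion names `IDENTIFIES_NULL_BUG` and `SUGGESTS_FIX` are taken from the issue; the assertion types `contains`, `regex`, `is-json`, and `llm-grader` are the ones named above, while the exact YAML field layout is an assumption.

```yaml
# Hypothetical assertions for a code-review skill, applying the heuristics.
assertions:
  # Prefer deterministic assertions where possible:
  - name: IDENTIFIES_NULL_BUG         # positive: "should do X"
    type: contains
    value: "null"
  - name: OUTPUT_IS_JSON
    type: is-json
  # llm-grader only where semantic understanding is genuinely required:
  - name: SUGGESTS_FIX
    type: llm-grader
    prompt: "Does the review propose a concrete fix for the null dereference?"
  # Negative case: "should not do Y"
  - name: DOES_NOT_REWRITE_WHOLE_FILE
    type: llm-grader
    prompt: "Does the response avoid rewriting the entire file?"
```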

Integration with agentv-eval-writer: This is a specialized entry point for "generate from existing artifact," not a replacement for the eval-writer's general authoring capabilities.

Acceptance Signals

  • agents/eval-generator.md exists in agentv-eval-writer
  • Given a SKILL.md, produces a valid EVAL.yaml with 6-8 test scenarios and assertions
  • Generated EVAL.yaml passes agentv validate
  • Works for skill files, prompt templates, and agent configs
  • Output includes a comment or note indicating human review is needed

Non-Goals

  • Not a replacement for hand-authored evals — generated evals are drafts that need human review
  • Not a test case generator for existing EVAL.yaml files (that's a different workflow)
  • Does not auto-run the generated eval — the human reviews first
  • Does not guarantee assertion quality — the human validates that the eval measures what matters

Context

The autoresearch pattern requires an eval definition before the optimization loop can begin. In practice, writing the eval is the highest-friction step — users have the artifact but not the eval. The eval-generator reduces this from "write an eval from scratch" to "review and refine a generated draft."

pi-autoresearch takes a similar approach: the skill instructs the agent to set up the benchmark script and metrics before starting the loop. The eval-generator formalizes this as a reusable subagent.

Related

  • agentv-eval-writer skill (parent skill this extends)
