feat: improve eval authoring guidance based on autopilot-dev learnings (#829)

## Summary

During the autopilot-dev plugin implementation (#824), trial and error surfaced several eval authoring patterns that should be documented or enforced by agentv tooling. Each learning is a case where a human had to step in and fix an eval that an agent had written incorrectly.

## Learnings

### 1. Workspace setup must copy skills to all provider discovery paths

The `before_all` setup hook must copy skills to `.claude/skills/`, `.agents/skills/`, AND `.pi/skills/`. Missing `.claude/skills/` caused `skill-trigger` assertions to fail for `claude-cli` targets even though the skills existed in `.agents/skills/`.

**Action:** Document this in eval authoring docs. Consider having `agentv validate` warn when a workspace template has a setup hook that copies to `.agents/` but not `.claude/`.
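As a starting point for the docs, here is a minimal sketch of a setup step that copies one skills tree into all three discovery paths named above. The function name `copy_skills` and its arguments are hypothetical, and Python is used only for illustration; this is not agentv's hook API.

```python
import shutil
from pathlib import Path

# Discovery paths named in this issue; other providers may look elsewhere.
SKILL_DIRS = (".claude/skills", ".agents/skills", ".pi/skills")


def copy_skills(source: Path, workspace: Path) -> list[Path]:
    """Copy the skills tree at `source` into every discovery path under `workspace`."""
    targets = []
    for rel in SKILL_DIRS:
        target = workspace / rel
        # dirs_exist_ok lets the step be re-run against an existing workspace.
        shutil.copytree(source, target, dirs_exist_ok=True)
        targets.append(target)
    return targets
```

Keeping the paths in one tuple means a new provider only needs one extra entry, and a validator could check setup hooks against the same list.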

### 2. Ship/claim evals should not depend on GitHub remote or real git operations

Eval workspaces are sandboxed: there is no GitHub remote, no PRs, and no issue tracker. Tests that ask agents to "merge the PR" or "claim issue #42 on GitHub" fail because the agent tries to reach GitHub and cannot.

**Action:** Document that workspace evals should test decision-making discipline (risk classification, scope assessment), not git infrastructure operations. Provide example patterns in eval authoring guide.

### 3. Agents check git diff and reject mismatched claims

When a workspace has a simulated commit that doesn't match what the test prompt describes, agents inspect `git diff` and correctly flag the mismatch. This is good agent behavior but makes simulated scenarios fragile.

**Action:** Document that if test prompts reference specific code changes, the workspace must contain those exact changes. Alternatively, frame prompts as hypothetical ("Here is what the PR changes: ...") rather than as factual claims the agent can verify.
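One way an eval author could catch this before running is a pre-flight check that the snippets a prompt mentions actually appear in the workspace diff. The sketch below is an assumption about how such a check might look; `missing_changes` is a hypothetical helper, not part of agentv.

```python
def missing_changes(diff_text: str, expected_snippets: list[str]) -> list[str]:
    """Return the expected snippets that appear on no added line of a unified diff."""
    added = [
        line[1:]
        for line in diff_text.splitlines()
        # Keep "+" lines but skip the "+++ b/file" header lines.
        if line.startswith("+") and not line.startswith("+++")
    ]
    return [s for s in expected_snippets if not any(s in line for line in added)]
```

The diff text could come from `git diff` run inside the workspace; a non-empty result means the prompt claims a change the agent will not find.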

### 4. Eval YAML glob pattern required for multi-file runs

`agentv eval run --target X evals/plugin-dir/` fails; the files must be passed as a quoted glob, `"evals/plugin-dir/*.eval.yaml"`. Agents hit this usability issue when trying to run all evals for a plugin.

**Action:** Support directory paths in the `eval run` command, automatically expanding to `<dir>/**/*.eval.yaml`.
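The proposed expansion can be sketched as follows. This is an illustration in Python under the assumption that a directory argument maps to `<dir>/**/*.eval.yaml` and anything else passes through unchanged; `expand_eval_paths` is a hypothetical name and does not reflect agentv's actual CLI code.

```python
from pathlib import Path


def expand_eval_paths(arg: str) -> list[str]:
    """Expand a directory argument to its eval files; pass other arguments through."""
    path = Path(arg)
    if path.is_dir():
        # "**" matches the directory itself and all nested subdirectories.
        return sorted(str(p) for p in path.glob("**/*.eval.yaml"))
    return [arg]
```

Passing non-directory arguments through untouched keeps existing glob and single-file invocations working as they do today.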

### 5. Combined runs produce one run directory per invocation

Running `agentv eval run "evals/plugin/*.eval.yaml"` produces a single run directory containing all 17 tests, whereas running 5 separate `agentv eval run` commands produces 5 separate run directories. For result management, one combined run per target is cleaner.

**Action:** Document this in eval authoring guide. The auto-push feature (#826) should handle this transparently.

### 6. Claude CLI SessionStart hook errors kill subsequent tests

When running 17 tests in sequence with the `claude-cli` target, a `SessionStart` hook error on one test caused all subsequent tests to fail with the same error. The first 4 tests passed, then the remaining 13 all errored.

**Action:** Investigate whether the `claude-cli` provider should retry on hook errors, or whether there is a session cleanup issue between tests.

## Related

- #824 — autopilot-dev plugin PR (where these were discovered)
- #826 — auto-push eval results to git repo
- #827 — target segment bug in artifact paths
