Description
Summary
During the autopilot-dev plugin implementation (#824), several eval authoring pitfalls surfaced through trial and error. They should be documented or enforced by agentv tooling. Each learning below is a case where a human had to intervene to fix an eval that an agent had written incorrectly.
Learnings
1. Workspace setup must copy skills to all provider discovery paths
The before_all setup hook must copy skills to .claude/skills/, .agents/skills/, AND .pi/skills/. Missing .claude/skills/ caused skill-trigger assertions to fail for claude-cli targets even though the skills existed in .agents/skills/.
Action: Document this in eval authoring docs. Consider having agentv validate warn when a workspace template has a setup hook that copies to .agents/ but not .claude/.
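As a rough illustration of the setup step and the proposed warning, here is a minimal Python sketch. It is not agentv's implementation: the function names copy_skills and missing_discovery_dirs are hypothetical, and the only facts taken from this issue are the three discovery paths.

```python
import shutil
from pathlib import Path

# Discovery paths named in this issue; other providers may need more.
DISCOVERY_DIRS = (".claude/skills", ".agents/skills", ".pi/skills")


def copy_skills(skills_src: Path, workspace: Path) -> None:
    """Copy the skills tree into every provider discovery path (hypothetical helper)."""
    for rel in DISCOVERY_DIRS:
        # copytree creates missing parent directories for the destination.
        shutil.copytree(skills_src, workspace / rel, dirs_exist_ok=True)


def missing_discovery_dirs(workspace: Path) -> list[str]:
    """Return the discovery paths that are absent or empty in the workspace."""
    missing = []
    for rel in DISCOVERY_DIRS:
        target = workspace / rel
        if not target.is_dir() or not any(target.iterdir()):
            missing.append(rel)
    return missing
```

A validation step could call missing_discovery_dirs after setup and warn on any non-empty result, which would have caught the missing .claude/skills/ case described above.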
2. Ship/claim evals should not depend on GitHub remote or real git operations
Eval workspaces are sandboxed: they have no GitHub remote, no PRs, and no issue tracker. Tests that ask agents to "merge the PR" or "claim issue #42 on GitHub" fail because the agent tries to reach GitHub and cannot.
Action: Document that workspace evals should test decision-making discipline (risk classification, scope assessment), not git infrastructure operations. Provide example patterns in eval authoring guide.
3. Agents check git diff and reject mismatched claims
When a workspace has a simulated commit that doesn't match what the test prompt describes, agents inspect git diff and correctly flag the mismatch. This is good agent behavior but makes simulated scenarios fragile.
Action: Document that if test prompts reference specific code changes, the workspace must contain those exact changes. Alternatively, frame prompts as hypothetical ("Here is what the PR changes: ...") rather than as factual claims the agent can verify.
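One way an eval author might catch such mismatches before running a test is to compare the files a prompt claims were changed against the files the simulated commit actually touched. The sketch below is an assumption-laden illustration, not part of agentv: both function names are hypothetical, and it assumes the simulated change is the most recent commit in the workspace.

```python
import subprocess
from pathlib import Path


def changed_files(workspace: Path) -> set[str]:
    """List files touched by the most recent commit (assumes at least two commits exist)."""
    out = subprocess.run(
        ["git", "diff", "--name-only", "HEAD~1", "HEAD"],
        cwd=workspace,
        capture_output=True,
        text=True,
        check=True,
    )
    return {line for line in out.stdout.splitlines() if line}


def unbacked_claims(claimed: set[str], actual: set[str]) -> set[str]:
    """Return files the prompt mentions that the workspace commit does not change."""
    return claimed - actual
```

If unbacked_claims returns anything, either the workspace needs the missing changes or the prompt should be reworded as a hypothetical.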
4. Eval YAML glob pattern required for multi-file runs
Running agentv eval run --target X evals/plugin-dir/ fails; the path must be given as the glob "evals/plugin-dir/*.eval.yaml". This is a usability issue that agents hit when trying to run all evals for a plugin.
Action: Support directory paths in eval run command, automatically expanding to <dir>/**/*.eval.yaml.
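The proposed directory expansion could look roughly like the following Python sketch. It only illustrates the intended behavior and makes no claim about how agentv resolves paths internally; expand_eval_paths is a hypothetical name.

```python
import glob
from pathlib import Path

EVAL_SUFFIX = ".eval.yaml"


def expand_eval_paths(arg: str) -> list[Path]:
    """Resolve a CLI path argument to a sorted list of eval files."""
    path = Path(arg)
    if path.is_dir():
        # Equivalent to <dir>/**/*.eval.yaml: matches at any depth under the directory.
        return sorted(path.rglob(f"*{EVAL_SUFFIX}"))
    if path.is_file():
        return [path]
    # Otherwise treat the argument as a glob pattern, as the command requires today.
    return sorted(Path(p) for p in glob.glob(arg, recursive=True))
```

With this behavior, evals/plugin-dir/ and "evals/plugin-dir/**/*.eval.yaml" would resolve to the same set of files, so existing glob invocations keep working.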
5. Combined runs produce one run directory per invocation
Running agentv eval run "evals/plugin/*.eval.yaml" produces a single run directory with all 17 tests. Running 5 separate agentv eval run commands produces 5 separate run directories. For result management, one run per target is cleaner.
Action: Document this in eval authoring guide. The auto-push feature (#826) should handle this transparently.
6. Claude CLI SessionStart hook errors kill subsequent tests
When running 17 tests in sequence with claude-cli target, a SessionStart hook error on one test caused all subsequent tests to fail with the same error. The first 4 tests passed, then all remaining 13 errored.
Action: Investigate whether claude-cli provider should retry on hook errors, or whether there's a session cleanup issue between tests.
Related
- feat(plugin): add hivespec plugin (phase-based delivery lifecycle for agent swarms) #824 — autopilot-dev plugin PR (where these were discovered)
- feat: auto-push eval results to configurable git repo (needs design) #826 — auto-push eval results to git repo
- bug: target segment not removed from artifact paths despite PR #801 #827 — target segment bug in artifact paths