feat: improve eval authoring guidance based on autopilot-dev learnings #829

@christso

Summary

During the autopilot-dev plugin implementation (#824), trial and error surfaced several eval authoring patterns that should be documented or enforced by agentv tooling. Each learning is a case where a human had to intervene to fix an eval that an agent had written incorrectly.

Learnings

1. Workspace setup must copy skills to all provider discovery paths

The before_all setup hook must copy skills to all three of .claude/skills/, .agents/skills/, and .pi/skills/. A missing .claude/skills/ directory caused skill-trigger assertions to fail for claude-cli targets even though the skills existed in .agents/skills/.

Action: Document this in the eval authoring docs. Consider having agentv validate emit a warning when a workspace template has a setup hook that copies to .agents/ but not .claude/.
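A minimal sketch of what such a setup script and validation check could look like, assuming the hook is free to run an arbitrary script. The function names `copy_skills` and `missing_discovery_dirs` are hypothetical and not part of agentv; only the three discovery paths come from the learning above.

```python
import shutil
from pathlib import Path

# Discovery paths named in the learning above.
DISCOVERY_DIRS = (".claude/skills", ".agents/skills", ".pi/skills")


def copy_skills(skills_src: Path, workspace: Path) -> list[Path]:
    """Copy skills_src into every provider discovery path under workspace."""
    copied = []
    for rel in DISCOVERY_DIRS:
        dest = workspace / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        # dirs_exist_ok lets the hook be re-run against an existing workspace.
        shutil.copytree(skills_src, dest, dirs_exist_ok=True)
        copied.append(dest)
    return copied


def missing_discovery_dirs(workspace: Path) -> list[str]:
    """Return discovery paths that are absent; a validator could warn on these."""
    return [rel for rel in DISCOVERY_DIRS if not (workspace / rel).is_dir()]
```

Driving both the copy and the check from one list of paths keeps them in sync if another provider path is added later.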

2. Ship/claim evals should not depend on GitHub remote or real git operations

Eval workspaces are sandboxed: they have no GitHub remote, no PRs, and no issue tracker. Tests that ask agents to "merge the PR" or "claim issue #42 on GitHub" fail because the agent tries to interact with GitHub and finds nothing to interact with.

Action: Document that workspace evals should test decision-making discipline (risk classification, scope assessment), not git infrastructure operations. Provide example patterns in the eval authoring guide.

3. Agents check git diff and reject mismatched claims

When a workspace has a simulated commit that doesn't match what the test prompt describes, agents inspect git diff and correctly flag the mismatch. This is good agent behavior but makes simulated scenarios fragile.

Action: Document that if test prompts reference specific code changes, the workspace must contain those exact changes. Alternatively, frame prompts as hypotheticals ("Here is what the PR changes: ...") rather than as factual claims the agent can verify.
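One way an eval author might check the first option before committing a test is to compare the files a prompt claims were changed against what the workspace's simulated commit actually touches. This is only a sketch: `changed_files` and `unbacked_claims` are hypothetical helpers, and the default `HEAD~1..HEAD` range assumes the simulated change is the most recent commit.

```python
import subprocess
from pathlib import Path


def changed_files(repo: Path, base: str = "HEAD~1", head: str = "HEAD") -> set[str]:
    """Files touched between two commits, via `git diff --name-only`."""
    out = subprocess.run(
        ["git", "diff", "--name-only", base, head],
        cwd=repo, check=True, capture_output=True, text=True,
    ).stdout
    return {line.strip() for line in out.splitlines() if line.strip()}


def unbacked_claims(claimed: set[str], actual: set[str]) -> set[str]:
    """Files the prompt says changed that the workspace diff does not contain."""
    return claimed - actual
```

A non-empty result from `unbacked_claims` means the agent is likely to flag the mismatch, so either the workspace or the prompt needs adjusting.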

4. Eval YAML glob pattern required for multi-file runs

agentv eval run --target X evals/plugin-dir/ fails; the command only works when given the glob "evals/plugin-dir/*.eval.yaml". This is a usability issue that agents hit when trying to run all evals for a plugin.

Action: Support directory paths in the eval run command, automatically expanding them to <dir>/**/*.eval.yaml.
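The proposed expansion could behave roughly like the sketch below. It is written in Python for illustration only and says nothing about how agentv resolves paths internally; `expand_eval_paths` is a hypothetical name.

```python
import glob
from pathlib import Path


def expand_eval_paths(arg: str) -> list[str]:
    """Expand a directory to <dir>/**/*.eval.yaml; treat anything else as a glob."""
    path = Path(arg)
    if path.is_dir():
        # "**" also matches zero directories, so top-level files are included.
        matches = path.glob("**/*.eval.yaml")
    else:
        matches = (Path(p) for p in glob.glob(arg, recursive=True))
    return sorted(str(p) for p in matches)
```

With this behavior, passing evals/plugin-dir/ and passing "evals/plugin-dir/**/*.eval.yaml" would select the same files, while existing glob arguments keep working unchanged.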

5. Combined runs produce one run directory per invocation

Running agentv eval run "evals/plugin/*.eval.yaml" produces a single run directory with all 17 tests. Running 5 separate agentv eval run commands produces 5 separate run directories. For result management, one run per target is cleaner.

Action: Document this in the eval authoring guide. The auto-push feature (#826) should handle this transparently.
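To illustrate the "one run per target" shape, a wrapper script could build a single invocation per target that covers every eval file. This sketch uses only the `agentv eval run --target` form quoted above; `build_run_commands` is a hypothetical helper, and the glob is passed through unexpanded on the assumption that agentv resolves it itself, as the quoted pattern in this issue suggests.

```python
def build_run_commands(targets: list[str], pattern: str) -> list[list[str]]:
    """One `agentv eval run` argv per target, each covering every eval file."""
    return [["agentv", "eval", "run", "--target", t, pattern] for t in targets]


# One combined invocation per target, so each target yields one run directory.
commands = build_run_commands(["claude-cli"], "evals/plugin/*.eval.yaml")
```

Each argv list can be handed to `subprocess.run` as-is; without a shell in between, the pattern reaches agentv exactly as written.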

6. Claude CLI SessionStart hook errors kill subsequent tests

When running 17 tests in sequence with claude-cli target, a SessionStart hook error on one test caused all subsequent tests to fail with the same error. The first 4 tests passed, then all remaining 13 errored.

Action: Investigate whether claude-cli provider should retry on hook errors, or whether there's a session cleanup issue between tests.
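If the investigation points toward retrying, the intended behavior might look like the sketch below: a hook error is retried for the affected test and, if it persists, recorded against that test alone rather than cascading to the rest. Everything here is hypothetical, including `HookError` and `run_isolated`; it does not describe the claude-cli provider's current code and does not address a possible session cleanup problem.

```python
from typing import Callable


class HookError(RuntimeError):
    """Hypothetical marker for a SessionStart hook failure."""


def run_isolated(
    tests: list[str],
    run_one: Callable[[str], str],
    retries: int = 1,
) -> dict[str, str]:
    """Run each test independently, retrying hook errors up to `retries` times."""
    results: dict[str, str] = {}
    for name in tests:
        for attempt in range(retries + 1):
            try:
                results[name] = run_one(name)
                break
            except HookError as exc:
                # Record the failure only after the last attempt, then move on
                # so later tests still get a chance to run.
                if attempt == retries:
                    results[name] = f"error: {exc}"
    return results
```

The `run_one` callable stands in for whatever starts a fresh session per test; if the failures come from state leaking between sessions, retrying alone would not be enough.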
