Closed
Description
Summary
When multiple eval files are run in a single `agentv eval run` invocation against the `claude-cli` target, the provider crashes after about 4 tests with a SessionStart hook error. All subsequent tests fail with the same error.
Reproduction
# This fails after ~4 tests:
agentv eval run --target claude-cli --workers 1 "evals/hivespec/*.eval.yaml"
# This works — each invocation gets a fresh session:
agentv eval run --target claude-cli --workers 1 evals/hivespec/hs-claim.eval.yaml
agentv eval run --target claude-cli --workers 1 evals/hivespec/hs-explore.eval.yaml
# etc.
Error
{"type":"system","subtype":"hook_started","hook_id":"...","hook_name":"SessionStart:startup","hook_event":"SessionStart","session_id":"..."}
Claude CLI exits with code 1. The error is a raw JSON event from the session hook, not a proper error message.
Observed behavior
- Tests 1-4: all PASS (1.000)
- Tests 5-17: all ERROR with the same SessionStart hook failure
- When the same 17 tests are run as 5 separate invocations (one per eval file), all complete successfully
Expected behavior
All 17 tests should complete when run in a single invocation. The claude-cli provider should clean up sessions between tests and handle hook errors gracefully.
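As a rough illustration of what "handle hook errors gracefully" could mean, the sketch below turns the raw JSON event into a readable message. Everything here is hypothetical: `HookEvent`, `parseEventLine`, and `describeCliFailure` are illustrative names, not agentv or Claude CLI APIs, and the field names are taken only from the event shown in the Error section.

```typescript
// Hypothetical shape, based only on the fields visible in the error output above.
interface HookEvent {
  type: string;
  subtype?: string;
  hook_name?: string;
  hook_event?: string;
  session_id?: string;
}

// Returns the parsed event if `line` is a JSON object with a string `type`, else null.
function parseEventLine(line: string): HookEvent | null {
  try {
    const value: unknown = JSON.parse(line);
    if (
      typeof value === "object" &&
      value !== null &&
      typeof (value as { type?: unknown }).type === "string"
    ) {
      return value as HookEvent;
    }
  } catch {
    // Not JSON: treat the line as ordinary output.
  }
  return null;
}

// Builds a human-readable error from the exit code and the last line of CLI output.
function describeCliFailure(exitCode: number, lastOutputLine: string): string {
  const event = parseEventLine(lastOutputLine.trim());
  if (event && event.hook_event === "SessionStart") {
    const hook = event.hook_name ?? "SessionStart";
    const session = event.session_id ?? "unknown";
    return `Claude CLI exited with code ${exitCode} during the ${hook} hook (session ${session})`;
  }
  return `Claude CLI exited with code ${exitCode}: ${lastOutputLine}`;
}
```

With a helper like this, the test report would name the failing hook and session instead of echoing the JSON event verbatim.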
Investigation starting points
- Claude CLI provider: `packages/core/src/evaluation/providers/` — find the claude-cli provider implementation
- Check how sessions are created and torn down between test cases
- Check if there's a session cleanup or process kill between tests
- Compare with pi-cli provider which handles 17 sequential tests without issues
Acceptance signals
- `agentv eval run --target claude-cli --workers 1 "evals/hivespec/*.eval.yaml"` completes all 17 tests without SessionStart errors
- Each test gets a fresh Claude CLI session (no state leakage between tests)
- If a SessionStart hook fails, the provider retries with a fresh session before marking the test as an execution error
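The retry behaviour in the last signal could look roughly like the sketch below. It is not agentv's provider interface: `RunResult`, `isSessionStartFailure`, and `runWithSessionRetry` are hypothetical names, the runner is assumed to be synchronous and to start a fresh Claude CLI session on every call, and a SessionStart failure is assumed to be recognizable from the error text.

```typescript
// Hypothetical result type; agentv's real provider types may differ.
type RunResult = { ok: true; output: string } | { ok: false; error: string };

// Assumption: the error text contains the hook event name when SessionStart fails.
function isSessionStartFailure(error: string): boolean {
  return error.includes("SessionStart");
}

// Runs one test case. Only a SessionStart failure triggers a retry, and each
// retry calls the runner again so that it gets a fresh session.
function runWithSessionRetry(
  runInFreshSession: () => RunResult,
  maxRetries = 1,
): RunResult {
  let result = runInFreshSession();
  let retries = 0;
  while (!result.ok && isSessionStartFailure(result.error) && retries < maxRetries) {
    retries++;
    result = runInFreshSession();
  }
  // Either a success, a non-hook failure, or the last failed retry.
  return result;
}
```

Other failures are returned without retrying, so genuine test errors are still reported after a single attempt.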
Non-goals
- Parallel Claude CLI sessions (workers > 1) — that's a separate concern
- Changing how Claude CLI hooks work — the fix should be in agentv's provider layer