Skip to content

bug: claude-cli provider crashes after ~4 tests in combined eval run (SessionStart hook error) #830

@christso

Description

@christso

Summary

When running multiple eval files in a single agentv eval run invocation against the claude-cli target, the provider crashes after ~4 tests with a SessionStart hook error. All subsequent tests fail with the same error.

Reproduction

# This fails after ~4 tests:
agentv eval run --target claude-cli --workers 1 "evals/hivespec/*.eval.yaml"

# This works — each invocation gets a fresh session:
agentv eval run --target claude-cli --workers 1 evals/hivespec/hs-claim.eval.yaml
agentv eval run --target claude-cli --workers 1 evals/hivespec/hs-explore.eval.yaml
# etc.

Error

{"type":"system","subtype":"hook_started","hook_id":"...","hook_name":"SessionStart:startup","hook_event":"SessionStart","session_id":"..."}

Claude CLI exits with code 1. The error is a raw JSON event from the session hook, not a proper error message.

Observed behavior

  • Tests 1-4: all PASS (1.000)
  • Tests 5-17: all ERROR with the same SessionStart hook failure
  • When the same 17 tests are run as 5 separate invocations (one per eval file), all complete successfully

Expected behavior

All 17 tests should complete when run in a single invocation. The claude-cli provider should clean up sessions between tests and handle hook errors gracefully.

Investigation starting points

  1. Claude CLI provider: `packages/core/src/evaluation/providers/` — find the claude-cli provider implementation
  2. Check how sessions are created and torn down between test cases
  3. Check if there's a session cleanup or process kill between tests
  4. Compare with pi-cli provider which handles 17 sequential tests without issues

Acceptance signals

  • `agentv eval run --target claude-cli --workers 1 "evals/hivespec/*.eval.yaml"` completes all 17 tests without SessionStart errors
  • Each test gets a fresh Claude CLI session (no state leakage between tests)
  • If a SessionStart hook fails, the provider retries with a fresh session before marking the test as an execution error

Non-goals

  • Parallel Claude CLI sessions (workers > 1) — that's a separate concern
  • Changing how Claude CLI hooks work — the fix should be in agentv's provider layer

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions