Eval System for Code Review Agents

This document describes how the evaluation system ensures quality and consistency across the code-review agent toolkit.

The system follows recommendations from Anthropic's "Demystifying Evals for AI Agents": use deterministic (code-based) graders for everything they can handle, reserve model-based graders for what genuinely requires judgment, and calibrate both against human review.

Architecture

┌──────────────────────────────────────────────────┐
│              User Workflows                      │
│  /code-review  /review-agent  /apply-fixes       │
└──────────────────┬───────────────────────────────┘
                   │
        ┌──────────┼──────────┐
        ▼          ▼          ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Layer 1  │ │ Layer 2  │ │ Layer 3  │
│ Hooks    │ │ Agents   │ │ Human    │
│ (determ.)│ │ (model)  │ │ (review) │
└──────────┘ └──────────┘ └──────────┘

Grader Layers

Layer 1: Deterministic (hooks)

Fast, free, deterministic checks that run automatically via PostToolUse hooks:

| Hook | What it checks |
| --- | --- |
| `js-fp-review.sh` | Array mutations, global state mutations, Object.assign, parameter mutations |
| `token-efficiency-review.sh` | File length >500 lines, CLAUDE.md >5000 chars, function length >50 lines |
| `eval-compliance-check.sh` | Agent/skill file structure, output format, severity levels |

Hooks are advisory only — they warn but never block. They catch mechanical issues cheaply before the model-based agents spend tokens on full analysis.
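
To illustrate the advisory-only behavior, here is a minimal Python sketch of a threshold check. The real hooks are shell scripts whose implementation is not shown in this document. The function name `advisory_warnings` is hypothetical, and only the thresholds come from the table above.

```python
from typing import List

# Thresholds taken from the hook table; everything else here is an assumption.
MAX_FILE_LINES = 500
MAX_CLAUDE_MD_CHARS = 5000


def advisory_warnings(path: str, text: str) -> List[str]:
    """Return warnings for a file's contents. Never raises and never blocks."""
    warnings = []
    line_count = len(text.splitlines())
    if line_count > MAX_FILE_LINES:
        warnings.append(f"{path}: {line_count} lines exceeds {MAX_FILE_LINES}")
    if path.endswith("CLAUDE.md") and len(text) > MAX_CLAUDE_MD_CHARS:
        warnings.append(f"{path}: {len(text)} chars exceeds {MAX_CLAUDE_MD_CHARS}")
    return warnings
```

A caller would print the returned warnings and exit successfully either way, which is what keeps the check advisory rather than blocking.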

Layer 2: Model-based (agents)

Layer 2 consists of nineteen specialized agents that require LLM judgment. The full roster is documented in docs/agent_info.md. The ten agents below have eval fixture coverage:

| Agent | Focus |
| --- | --- |
| test-review | Test quality, coverage, assertion quality |
| structure-review | SRP, DRY, coupling, organization |
| naming-review | Naming clarity, conventions, magic values |
| domain-review | Business logic placement, boundary violations |
| complexity-review | Cyclomatic complexity, nesting, function size |
| claude-setup-review | CLAUDE.md completeness and accuracy |
| token-efficiency-review | Token optimization (full analysis beyond hook) |
| security-review | Injection, auth, data exposure, crypto |
| js-fp-review | Mutation detection (full analysis beyond hook) |
| svelte-review | Svelte reactivity, closure state leaks, store subscriptions |

Each agent outputs a structured result:

{
  "agentName": "<name>",
  "status": "pass|warn|fail|skip",
  "issues": [
    {
      "severity": "error|warning|suggestion",
      "file": "<path>",
      "line": 0,
      "message": "<description>",
      "suggestedFix": "<fix>"
    }
  ],
  "summary": "<summary>"
}
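
Because every agent shares this shape, a deterministic check can validate a result before any grading happens. The sketch below is one possible validator, not the toolkit's own. `validate_result` is a hypothetical name, and treating every issue field (including `suggestedFix`) as required is an assumption based on the shape shown above.

```python
from typing import Any, Dict, List

VALID_STATUSES = {"pass", "warn", "fail", "skip"}
VALID_SEVERITIES = {"error", "warning", "suggestion"}
# Assumption: all five issue fields are mandatory.
ISSUE_FIELDS = ("severity", "file", "line", "message", "suggestedFix")


def validate_result(result: Dict[str, Any]) -> List[str]:
    """Return schema problems; an empty list means the result is well-formed."""
    problems = []
    for key in ("agentName", "status", "issues", "summary"):
        if key not in result:
            problems.append(f"missing field: {key}")
    if "status" in result and result["status"] not in VALID_STATUSES:
        problems.append(f"invalid status: {result['status']}")
    for index, issue in enumerate(result.get("issues", [])):
        for field in ISSUE_FIELDS:
            if field not in issue:
                problems.append(f"issues[{index}] missing field: {field}")
        severity = issue.get("severity")
        if severity is not None and severity not in VALID_SEVERITIES:
            problems.append(f"issues[{index}] invalid severity: {severity}")
    return problems
```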

Layer 3: Human review

The user reviews agent findings and decides which fixes to apply. The /apply-fixes command automates fix application, but the user controls which correction prompts are included.

Workflows

`/code-review` — Full review

See Code Review Process for the full nine-step pipeline: target selection, pre-flight gates, static analysis pre-pass, parallel agent dispatch, ACCEPTED-RISKS suppression, health scoring, the auto-fix loop (up to 5 iterations), correction prompts, and the .review-passed gate file.

`/review-agent <name>` — Single agent

Files → Agent Definition → Review → Result

1. Load agent definition from agents/<name>.md
2. Determine target files
3. Run review following agent instructions
4. Report findings

`/apply-fixes <dir>` — Fix application

Prompts → Repo Rules → Apply Fix → Validate → Report

1. Load correction prompt JSON files from directory
2. Load repository rules (CLAUDE.md, .clinerules, etc.)
3. Apply each fix respecting repo conventions
4. Run validation (lint/build/tests) after each fix
5. Report results (applied, failed, validation failed)
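
The apply-validate-report loop can be sketched as follows. This is an illustration under assumptions, not the command's implementation. `apply_fixes`, `apply_fix`, and `validate` are hypothetical names, and prompts are treated as opaque identifiers because the correction prompt format is not described here.

```python
from typing import Callable, Dict, Iterable, List


def apply_fixes(
    prompts: Iterable[str],
    apply_fix: Callable[[str], None],
    validate: Callable[[], bool],
) -> Dict[str, List[str]]:
    """Apply each fix, validate after each one, and bucket the outcomes."""
    report: Dict[str, List[str]] = {
        "applied": [],
        "failed": [],
        "validation_failed": [],
    }
    for prompt in prompts:
        try:
            apply_fix(prompt)
        except Exception:
            # The fix could not be applied at all, so validation is skipped.
            report["failed"].append(prompt)
            continue
        # Validation (lint/build/tests) runs after every individual fix.
        if validate():
            report["applied"].append(prompt)
        else:
            report["validation_failed"].append(prompt)
    return report
```

Validating after each fix, rather than once at the end, makes it possible to attribute a broken build to the specific fix that caused it.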

How Hooks and Agents Complement Each Other

The hooks (js-fp-review.sh, token-efficiency-review.sh) provide instant feedback on the most common, mechanically detectable issues. The corresponding agents (js-fp-review, token-efficiency-review) provide deeper analysis that requires LLM judgment — for example, understanding whether a mutation is intentional based on surrounding context, or whether a long function is justified by its complexity.

Hook (instant, free)          Agent (thorough, costs tokens)
─────────────────────         ──────────────────────────────
.push() detected              Is the push on a local copy?
file >500 lines               Is the file a generated file?
Object.assign(obj, ...)       Is obj freshly created above?
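
The left-hand column amounts to pattern matching with no awareness of context. A minimal Python sketch of that idea follows. The patterns and the name `flag_mutations` are hypothetical, since the actual rules in js-fp-review.sh are not shown in this document.

```python
import re
from typing import List, Tuple

# Hypothetical patterns standing in for the hook's real rules.
MUTATION_PATTERNS = [
    (re.compile(r"\.push\("), "array mutation via .push()"),
    (re.compile(r"\bObject\.assign\("), "Object.assign call"),
]


def flag_mutations(source: str) -> List[Tuple[int, str]]:
    """Return (line number, label) for every line matching a mutation pattern."""
    findings = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        for pattern, label in MUTATION_PATTERNS:
            if pattern.search(line):
                findings.append((line_number, label))
    return findings
```

Given `const copy = [...items];` followed by `copy.push(next);`, this check still flags the push even though it operates on a local copy. Deciding that the mutation is harmless is the contextual question left to the agent.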

Eval Compliance

Two mechanisms ensure that new agents and skills follow the toolkit's established patterns:

`/agent-audit` skill (manual)

Reads every agent, skill, and hook file and checks for:

- Structured output format
- Severity definitions
- Detection rules and scope boundaries
- Numbered steps and argument parsing
- Advisory-only hook behavior

Outputs a compliance report with PASS/WARN/FAIL per item.

`eval-compliance-check.sh` hook (automatic)

Fires on Write/Edit to agent or skill files. Provides real-time advisory warnings when:

- A review agent is missing output format or severity definitions
- A skill is missing numbered steps or argument parsing
- A review-related skill has no report section
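
As a rough illustration of the first warning, the sketch below scans an agent definition for the output-format fields and severity names used elsewhere in this document. The matching strategy (case-insensitive substring search) and the name `compliance_warnings` are assumptions. The hook's real rules are not documented here.

```python
from typing import List

# Hypothetical markers derived from the structured result shape and severity levels.
AGENT_REQUIREMENTS = {
    "output format": ("agentName", "status", "issues", "summary"),
    "severity definitions": ("error", "warning", "suggestion"),
}


def compliance_warnings(agent_markdown: str) -> List[str]:
    """Return advisory warnings for requirements the agent file does not mention."""
    warnings = []
    lowered = agent_markdown.lower()
    for requirement, markers in AGENT_REQUIREMENTS.items():
        missing = [marker for marker in markers if marker.lower() not in lowered]
        if missing:
            warnings.append(f"missing {requirement}: {', '.join(missing)}")
    return warnings
```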

Eval Fixtures

The evals/ directory contains a test corpus for validating agent accuracy:

evals/
├── fixtures/           # 54+ code samples (checked in)
│   ├── fp-*.ts         # js-fp-review (6 files)
│   ├── sec-*.ts        # security-review (5 files)
│   ├── test-*.test.ts  # test-review (6 files)
│   ├── cx-*.ts         # complexity-review (5 files)
│   ├── nm-*.ts         # naming-review (5 files)
│   ├── st-*.ts         # structure-review (5 files)
│   ├── dm-*.ts         # domain-review (5 files)
│   ├── te-*.md/.ts     # token-efficiency-review (5 files)
│   ├── sv-*.svelte.ts  # svelte-review (8 files)
│   └── cs-*/           # claude-setup-review (4 directories)
├── expected/           # Reference solutions (checked in)
│   └── <fixture-stem>.json
├── transcripts/        # Auto-created by runner (gitignored)
└── reports/            # Auto-created by runner (gitignored)

Each fixture is a small (20-80 line), focused code sample with a known-good or known-bad pattern. Reference solutions define expected status, issue count ranges, severity ranges, and keyword checks.

Reference solution schema

{
  "fixture": "fp-array-mutations.ts",
  "description": "Array mutations js-fp-review should catch",
  "applicableAgents": ["js-fp-review"],
  "agents": {
    "js-fp-review": {
      "expectedStatus": "fail",
      "issueCount": { "min": 3, "max": 6 },
      "severities": { "error": { "min": 1, "max": 3 } },
      "mustMention": ["push", "sort"]
    }
  }
}
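
Grading against a reference solution is itself a deterministic check. The following sketch shows one way the four expectations could be evaluated for a single agent entry. `grade` and `in_range` are hypothetical names, and matching `mustMention` keywords case-insensitively against issue messages is an assumption, since the runner's matching rules are not specified here.

```python
from collections import Counter
from typing import Any, Dict, List


def in_range(value: int, bounds: Dict[str, int]) -> bool:
    """Check min <= value <= max, treating a missing bound as unbounded."""
    return bounds.get("min", 0) <= value <= bounds.get("max", value)


def grade(result: Dict[str, Any], expected: Dict[str, Any]) -> List[str]:
    """Return failed expectations; an empty list means the agent passed."""
    failures = []
    if result["status"] != expected["expectedStatus"]:
        failures.append(f"status {result['status']} != {expected['expectedStatus']}")
    issues = result["issues"]
    if "issueCount" in expected and not in_range(len(issues), expected["issueCount"]):
        failures.append(f"issue count {len(issues)} out of range")
    counts = Counter(issue["severity"] for issue in issues)
    for severity, bounds in expected.get("severities", {}).items():
        if not in_range(counts[severity], bounds):
            failures.append(f"{severity} count {counts[severity]} out of range")
    # Assumption: keywords are searched in the concatenated issue messages.
    text = " ".join(issue["message"] for issue in issues).lower()
    for keyword in expected.get("mustMention", []):
        if keyword.lower() not in text:
            failures.append(f"missing keyword: {keyword}")
    return failures
```

Here `expected` is the object under `agents["js-fp-review"]` in the schema above, and `result` is the agent's structured output.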

`/agent-eval` command

Run agents against fixtures and grade results:

/agent-eval                                  # run all agents against all fixtures
/agent-eval --agent js-fp-review             # run one agent
/agent-eval --fixture fp-array-mutations.ts  # run one fixture
/agent-eval --trials 3                       # multi-trial with pass@k scoring

The runner resolves the toolkit root via symlink (for installed projects) and saves transcripts for trend analysis. It detects eval saturation when 3 consecutive runs produce identical grades.
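
How the runner computes pass@k and saturation is not detailed here. As a sketch, the commonly used unbiased pass@k estimator and a three-run saturation check could look like this. Both function names are hypothetical, and comparing grades as ordered sequences is an assumption.

```python
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k trials drawn from n passes, given c passes."""
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any draw of k includes a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def is_saturated(run_grades: Sequence[Sequence[str]], window: int = 3) -> bool:
    """True when the last `window` runs produced identical grades."""
    if len(run_grades) < window:
        return False
    recent = [tuple(grades) for grades in run_grades[-window:]]
    return all(grades == recent[0] for grades in recent)
```

With `--trials 3` and one passing trial, `pass_at_k(3, 1, 1)` is 1/3 while `pass_at_k(3, 1, 3)` is 1.0, which is why the choice of k matters when reading the report.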

Adding a New Agent

1. Create agents/<name>.md with:
   - JSON output format (status, issues, summary)
   - Severity definitions (error, warning, suggestion)
   - Detection rules and thresholds (inline, not in a config file)
   - File scope (which file types the agent applies to)
   - Scope boundaries (what to ignore)
2. Optionally add a hook in hooks/<name>.sh for deterministic checks
3. Run /agent-audit to verify compliance
4. Add eval fixtures in evals/fixtures/ (2-3 pass, 2-3 fail) and reference solutions in evals/expected/
5. Run /agent-eval --agent <name> to validate accuracy
