# Evalbuff

Evalbuff is an automated system that iteratively improves a coding agent's performance by optimizing project documentation. It runs overnight, discovers what an agent gets wrong, writes docs to fix those gaps, and keeps only the changes that measurably improve scores.

## The Idea

Most coding agents read project documentation before making changes. Better docs lead to better code. But writing good docs is hard — you don't know what an agent needs to know until you watch it fail.

Evalbuff closes this loop automatically:

1. **Run** a coding agent on real eval tasks (reconstructing git commits)
2. **Judge** the output with AI judges that apply living quality criteria
3. **Analyze** failures — feed the judge's weaknesses to a doc-writer agent
4. **Test** whether a proposed doc edit actually improves the agent's score
5. **Keep** doc changes that help, revert ones that don't
6. **Repeat** until the budget runs out or scores plateau

The result: a `docs/` directory and `AGENTS.md` table of contents that encode exactly what the agent needs to know to perform well on your codebase. Any agent that reads project docs benefits — Claude Code, Codex, Codebuff, or anything else with a CLI.

## Why Documentation?

We chose documentation as the improvement lever because:

- **Agent-agnostic.** Every modern coding agent reads project docs. Improving docs improves all agents, not just one.
- **Interpretable.** Unlike fine-tuning weights or tweaking system prompts, docs are human-readable. You can review what evalbuff learned and decide if it makes sense.
- **Composable.** Doc improvements stack. A doc about error handling patterns doesn't conflict with a doc about naming conventions.
- **Persistent.** Docs live in the repo and benefit every future session, not just the current one.

## Living Quality Criteria

Evalbuff uses a leveling system so it doesn't try to optimize everything at once:

| Level | Criteria Added | When |
|-------|---------------|------|
| L1 | Correctness, Completeness, Basic Style | Start |
| L2 | + Pattern Consistency | After L1 avg >= 8.0 over 10 tasks |
| L3 | + Test Quality | After L2 avg >= 8.0 over 10 tasks |
| L4 | + Optimal Design | After L3 avg >= 8.0 over 10 tasks |
| L5 | + Fluency | After L4 avg >= 8.0 over 10 tasks |

This prevents the system from penalizing an agent for style issues when it can't even get the code to compile. Criteria are injected directly into the AI judge prompts.
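
The promotion rule can be written as a small function. This is an assumed shape for illustration; the real logic lives in `criteria.ts`.

```typescript
// Sketch of the promotion rule from the table above (assumed shape;
// the real logic lives in criteria.ts).
function nextLevel(level: number, scores: number[]): number {
  if (level >= 5) return level // L5 is the final level
  if (scores.length < 10) return level // need at least 10 tasks at this level
  const last10 = scores.slice(-10)
  const avg = last10.reduce((sum, s) => sum + s, 0) / last10.length
  return avg >= 8.0 ? level + 1 : level // promote on avg >= 8.0
}
```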

## Architecture

```
┌─────────────────────────────────────────────────────┐
│                    Orchestrator                     │
│                  (run-evalbuff.ts)                  │
│                                                     │
│  for each eval task:                                │
│    1. Clone repo into isolated temp dir             │
│    2. Copy current docs/ into the clone             │
│    3. Run agent CLI on the task prompt              │
│    4. Judge the diff against ground truth           │
│    5. If score < threshold:                         │
│       a. Analyze failure → propose doc edit         │
│       b. Re-run agent with new doc                  │
│       c. Re-judge → keep doc if score improved      │
│    6. Update criteria level if scores are high      │
│    7. Log entry to JSONL, save state                │
│                                                     │
│  Generate morning report                            │
└─────────────────────────────────────────────────────┘
```

### Components

| File | Role |
|------|------|
| `run-evalbuff.ts` | Main orchestrator loop with budget caps and resumable state |
| `cli-runner.ts` | Agent-agnostic CLI runner — spawns any agent command, captures git diff |
| `judge.ts` | AI judging system (GPT-5.1 + Gemini) with criteria injection |
| `docs-optimizer.ts` | Failure analysis, doc writing, doc application, score comparison |
| `criteria.ts` | Living quality criteria with L1-L5 promotion logic |
| `morning-report.ts` | Generates markdown summary from overnight JSONL log |
| `test-repo-utils.ts` | Creates isolated git repos per eval task |
| `agent-runner.ts` | BuffBench-style agent runner (for Codebuff SDK agents) |
| `types.ts` | Shared types (EvalCommitV2, EvalDataV2, etc.) |

## Usage

### Command Line

```bash
bun run evals/evalbuff/run-evalbuff.ts \
  --repo /path/to/target-repo \
  --agent "claude -p" \
  --evals evals/buffbench/eval-codebuff.json,evals/buffbench/eval-manifold.json \
  --max-iterations 50 \
  --max-cost 50 \
  --score-threshold 7.0 \
  --agent-timeout 300000
```

Or via the workspace script:

```bash
bun run --filter @codebuff/evals run-evalbuff -- \
  --repo /path/to/target-repo \
  --agent "codex exec --full-auto" \
  --evals evals/buffbench/eval-codebuff.json
```

### Arguments

| Argument | Default | Description |
|----------|---------|-------------|
| `--repo` | required | Path to the target repo where docs/ will be written |
| `--agent` | required | Agent CLI command (prompt is appended as last arg) |
| `--evals` | required | Comma-separated paths to eval JSON files |
| `--max-iterations` | 50 | Stop after this many tasks |
| `--max-cost` | 50 | Stop after spending this many USD (estimated) |
| `--score-threshold` | 7.0 | Only attempt doc edits for scores below this |
| `--agent-timeout` | 300000 | Per-task agent timeout in ms (5 min default) |
| `--criteria` | auto | Path to criteria JSON (auto-created if omitted) |

### Overnight Run

For an overnight run, set generous limits and let it go:

```bash
nohup bun run evals/evalbuff/run-evalbuff.ts \
  --repo /path/to/repo \
  --agent "claude -p" \
  --evals evals/buffbench/eval-codebuff.json \
  --max-iterations 200 \
  --max-cost 100 \
  > evalbuff-overnight.log 2>&1 &
```

Check results in the morning:
- `<repo>/evalbuff-report-YYYY-MM-DD.md` — morning report
- `<repo>/evalbuff-log.jsonl` — detailed per-task log
- `<repo>/docs/` — the docs that were kept
- `<repo>/AGENTS.md` — table of contents

### Resumable

Evalbuff saves state to `evalbuff-state.json` in the target repo. If interrupted, re-running with the same arguments will skip completed tasks and continue where it left off.
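
The skip logic amounts to filtering out completed task IDs. A minimal sketch, assuming a hypothetical state shape (the real `evalbuff-state.json` schema may differ):

```typescript
// Assumed state shape for illustration; the real evalbuff-state.json
// schema may differ.
type EvalbuffState = {
  completedTaskIds: string[]
  criteriaLevel: number
  totalCostUsd: number
}

function pendingTasks(allTaskIds: string[], state: EvalbuffState | null): string[] {
  if (state === null) return allTaskIds // fresh run: nothing to skip
  const done = new Set(state.completedTaskIds)
  return allTaskIds.filter((id) => !done.has(id))
}
```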

## How It Decides What Docs to Write

When an agent scores below the threshold on a task, evalbuff:

1. **Feeds the judge's weaknesses** to a doc-writer LLM agent
2. The doc writer sees: the task prompt, ground truth diff, agent's diff, judge analysis, and all current docs
3. It produces a **targeted doc file** — specific to the gap between what the agent did and what it should have done
4. The doc is written to `docs/<suggested-path>.md` and `AGENTS.md` is updated

The doc writer is instructed to be specific and actionable — referencing concrete file paths, function names, and patterns. Generic advice like "follow best practices" is explicitly rejected.
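
Assembling that context for the doc writer might look roughly like this. All field and function names here are hypothetical, not `docs-optimizer.ts`'s actual types:

```typescript
// Hypothetical sketch of the context handed to the doc writer.
// Field and function names are illustrative, not docs-optimizer.ts's actual API.
interface DocWriterContext {
  taskPrompt: string
  groundTruthDiff: string
  agentDiff: string
  judgeAnalysis: string // the judge's listed weaknesses
  currentDocs: Record<string, string> // doc path -> contents
}

function buildDocWriterPrompt(ctx: DocWriterContext): string {
  const docs = Object.entries(ctx.currentDocs)
    .map(([path, body]) => `### ${path}\n${body}`)
    .join("\n\n")
  return [
    `Task prompt:\n${ctx.taskPrompt}`,
    `Ground truth diff:\n${ctx.groundTruthDiff}`,
    `Agent diff:\n${ctx.agentDiff}`,
    `Judge analysis:\n${ctx.judgeAnalysis}`,
    `Current docs:\n${docs}`,
    "Write one targeted doc. Reference concrete file paths, function names, and patterns; generic advice will be rejected.",
  ].join("\n\n")
}
```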

## What Gets Produced

After a run, the target repo will contain:

```
target-repo/
├── docs/
│   ├── patterns/
│   │   └── error-handling.md      # Evalbuff-generated
│   ├── conventions/
│   │   └── naming.md              # Evalbuff-generated
│   └── architecture/
│       └── data-flow.md           # Evalbuff-generated
├── AGENTS.md                      # Table of contents
├── evalbuff-state.json            # Resumable state
├── evalbuff-log.jsonl             # Per-task log
├── evalbuff-criteria.json         # Current criteria level
└── evalbuff-report-2026-03-25.md  # Morning report
```

### Morning Report

The morning report includes:
- Summary table (iterations, cost, duration, score deltas)
- Doc changes table (which docs were tried, score impact, kept/reverted)
- Error log
- Score trajectory visualization

## Eval Data Format

Evalbuff reuses BuffBench's `EvalDataV2` format. Eval tasks are real git commits from open source repos, turned into prompts:

```json
{
  "repoUrl": "https://github.com/org/repo",
  "evalCommits": [
    {
      "id": "task-abc123",
      "sha": "abc123",
      "parentSha": "def456",
      "prompt": "Add error handling to the API endpoint...",
      "fileDiffs": [{ "path": "src/api.ts", "diff": "..." }],
      "supplementalFiles": ["src/types.ts"]
    }
  ]
}
```
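
In TypeScript terms, the format corresponds to types like the following. This is a sketch only; the canonical `EvalDataV2` and `EvalCommitV2` definitions live in `types.ts`:

```typescript
// Types mirroring the JSON above (a sketch; the canonical EvalDataV2 /
// EvalCommitV2 definitions live in types.ts).
interface FileDiff {
  path: string
  diff: string
}

interface EvalCommitV2 {
  id: string
  sha: string
  parentSha: string
  prompt: string
  fileDiffs: FileDiff[]
  supplementalFiles: string[]
}

interface EvalDataV2 {
  repoUrl: string
  evalCommits: EvalCommitV2[]
}

// Minimal structural check before handing a parsed file to the orchestrator.
function looksLikeEvalDataV2(x: unknown): x is EvalDataV2 {
  const d = x as EvalDataV2
  return (
    typeof d?.repoUrl === "string" &&
    Array.isArray(d?.evalCommits) &&
    d.evalCommits.every((c) => typeof c.sha === "string" && typeof c.prompt === "string")
  )
}
```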

Generate new evals with BuffBench's eval generation tools, then point evalbuff at the JSON files.

## Relationship to BuffBench

BuffBench benchmarks agents against each other. Evalbuff improves a single agent's performance over time.

|   | BuffBench | Evalbuff |
|---|-----------|----------|
| **Goal** | Compare agents | Improve an agent |
| **Output** | Scores + rankings | Documentation |
| **Loop** | Single pass | Iterative |
| **Judges** | 3 (GPT, Gemini, Claude) | 2 (GPT, Gemini) |
| **Agent coupling** | Codebuff SDK | Any CLI agent |

Evalbuff was deep-copied from BuffBench and modified — they share types and eval data format but are independent codebases.