Commit ebaf37b

jahoomaclaude committed

Add evalbuff: iterative agent improvement via docs optimization

Evalbuff is an automated overnight loop that improves coding agent performance by optimizing project documentation. It runs eval tasks, judges outputs with living quality criteria (L1-L5), analyzes failures, proposes targeted doc edits, and keeps only changes that measurably improve scores. Agent-agnostic — works with any CLI coding agent.

Key components:

- cli-runner: agent-agnostic CLI runner (shells out to any command)
- criteria: living quality criteria with L1-L5 promotion logic
- judge: modified from BuffBench with criteria injection
- docs-optimizer: failure analysis + doc writing + score comparison
- morning-report: markdown summary from overnight JSONL log
- run-evalbuff: main orchestrator with budget caps and resumable state

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 83b334c commit ebaf37b

35 files changed: +3813 −0 lines

evals/evalbuff/README.md

Lines changed: 214 additions & 0 deletions
# Evalbuff

Evalbuff is an automated system that iteratively improves a coding agent's performance by optimizing project documentation. It runs overnight, discovers what an agent gets wrong, writes docs to fix those gaps, and keeps only the changes that measurably improve scores.

## The Idea

Most coding agents read project documentation before making changes. Better docs lead to better code. But writing good docs is hard — you don't know what an agent needs to know until you watch it fail.

Evalbuff closes this loop automatically:

1. **Run** a coding agent on real eval tasks (reconstructing git commits)
2. **Judge** the output with AI judges that apply living quality criteria
3. **Analyze** failures — feed the judge's weaknesses to a doc-writer agent
4. **Test** whether a proposed doc edit actually improves the agent's score
5. **Keep** doc changes that help, revert ones that don't
6. **Repeat** until the budget runs out or scores plateau

The result: a `docs/` directory and `AGENTS.md` table of contents that encode exactly what the agent needs to know to perform well on your codebase. Any agent that reads project docs benefits — Claude Code, Codex, Codebuff, or anything else with a CLI.
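The stop and keep/revert decisions in the loop above can be sketched in a few lines. This is a minimal illustration, not the actual `run-evalbuff.ts` API; the names `shouldStop` and `keepChange` are hypothetical:

```typescript
// Illustrative sketch of evalbuff's budget and keep/revert checks.
interface Budget {
  maxIterations: number
  maxCost: number // estimated USD
}

// Step 6: stop once either budget cap is hit.
function shouldStop(iteration: number, spentUsd: number, budget: Budget): boolean {
  return iteration >= budget.maxIterations || spentUsd >= budget.maxCost
}

// Step 5: keep a doc change only if the re-judged score improved.
function keepChange(scoreBefore: number, scoreAfter: number): boolean {
  return scoreAfter > scoreBefore
}
```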
## Why Documentation?

We chose documentation as the improvement lever because:

- **Agent-agnostic.** Every modern coding agent reads project docs. Improving docs improves all agents, not just one.
- **Interpretable.** Unlike fine-tuning weights or tweaking system prompts, docs are human-readable. You can review what evalbuff learned and decide if it makes sense.
- **Composable.** Doc improvements stack. A doc about error handling patterns doesn't conflict with a doc about naming conventions.
- **Persistent.** Docs live in the repo and benefit every future session, not just the current one.

## Living Quality Criteria

Evalbuff uses a leveling system so it doesn't try to optimize everything at once:

| Level | Criteria Added | When |
|-------|----------------|------|
| L1 | Correctness, Completeness, Basic Style | Start |
| L2 | + Pattern Consistency | After L1 avg >= 8.0 over 10 tasks |
| L3 | + Test Quality | After L2 avg >= 8.0 over 10 tasks |
| L4 | + Optimal Design | After L3 avg >= 8.0 over 10 tasks |
| L5 | + Fluency | After L4 avg >= 8.0 over 10 tasks |

This prevents the system from penalizing an agent for style issues when it can't even get the code to compile. Criteria are injected directly into the AI judge prompts.
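The promotion rule in the table can be sketched as a pure function. This is an assumption about shape only; the real logic lives in `criteria.ts` and may differ in details:

```typescript
// Illustrative sketch of the L1-L5 promotion rule: promote one level
// once the average score over the last 10 tasks reaches 8.0.
const WINDOW = 10
const PROMOTE_AVG = 8.0
const MAX_LEVEL = 5

function nextLevel(level: number, recentScores: number[]): number {
  // Never promote past L5, and require a full window of scores.
  if (level >= MAX_LEVEL || recentScores.length < WINDOW) return level
  const window = recentScores.slice(-WINDOW)
  const avg = window.reduce((a, b) => a + b, 0) / window.length
  return avg >= PROMOTE_AVG ? level + 1 : level
}
```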
## Architecture

```
┌─────────────────────────────────────────────────────┐
│                    Orchestrator                     │
│                 (run-evalbuff.ts)                   │
│                                                     │
│  for each eval task:                                │
│    1. Clone repo into isolated temp dir             │
│    2. Copy current docs/ into the clone             │
│    3. Run agent CLI on the task prompt              │
│    4. Judge the diff against ground truth           │
│    5. If score < threshold:                         │
│       a. Analyze failure → propose doc edit         │
│       b. Re-run agent with new doc                  │
│       c. Re-judge → keep doc if score improved      │
│    6. Update criteria level if scores are high      │
│    7. Log entry to JSONL, save state                │
│                                                     │
│  Generate morning report                            │
└─────────────────────────────────────────────────────┘
```

### Components

| File | Role |
|------|------|
| `run-evalbuff.ts` | Main orchestrator loop with budget caps and resumable state |
| `cli-runner.ts` | Agent-agnostic CLI runner — spawns any agent command, captures git diff |
| `judge.ts` | AI judging system (GPT-5.1 + Gemini) with criteria injection |
| `docs-optimizer.ts` | Failure analysis, doc writing, doc application, score comparison |
| `criteria.ts` | Living quality criteria with L1-L5 promotion logic |
| `morning-report.ts` | Generates markdown summary from overnight JSONL log |
| `test-repo-utils.ts` | Creates isolated git repos per eval task |
| `agent-runner.ts` | BuffBench-style agent runner (for Codebuff SDK agents) |
| `types.ts` | Shared types (EvalCommitV2, EvalDataV2, etc.) |
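An agent-agnostic runner like `cli-runner.ts` hinges on one small contract: the configured agent command is split into argv and the task prompt is appended as the final argument. A minimal sketch, assuming whitespace-delimited commands (the real runner may handle quoting more carefully):

```typescript
// Illustrative: turn an agent command string plus a task prompt into argv.
// The resulting array would be handed to a process spawner with cwd set
// to the isolated repo clone; spawning is omitted here.
function buildAgentArgv(agentCommand: string, prompt: string): string[] {
  const parts = agentCommand.split(/\s+/).filter(Boolean)
  return [...parts, prompt]
}
```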
## Usage

### Command Line

```bash
bun run evals/evalbuff/run-evalbuff.ts \
  --repo /path/to/target-repo \
  --agent "claude -p" \
  --evals evals/buffbench/eval-codebuff.json,evals/buffbench/eval-manifold.json \
  --max-iterations 50 \
  --max-cost 50 \
  --score-threshold 7.0 \
  --agent-timeout 300000
```

Or via the workspace script:

```bash
bun run --filter @codebuff/evals run-evalbuff -- \
  --repo /path/to/target-repo \
  --agent "codex exec --full-auto" \
  --evals evals/buffbench/eval-codebuff.json
```

### Arguments

| Argument | Default | Description |
|----------|---------|-------------|
| `--repo` | required | Path to the target repo where docs/ will be written |
| `--agent` | required | Agent CLI command (prompt is appended as last arg) |
| `--evals` | required | Comma-separated paths to eval JSON files |
| `--max-iterations` | 50 | Stop after this many tasks |
| `--max-cost` | 50 | Stop after spending this many USD (estimated) |
| `--score-threshold` | 7.0 | Only attempt doc edits for scores below this |
| `--agent-timeout` | 300000 | Per-task agent timeout in ms (5 min default) |
| `--criteria` | auto | Path to criteria JSON (auto-created if omitted) |
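The defaults in the table can be captured in a small parser. This sketch is hypothetical (the actual parser in `run-evalbuff.ts` is not shown), but it demonstrates the required/default split above:

```typescript
// Illustrative argument parsing with the documented defaults.
interface EvalbuffArgs {
  repo: string
  agent: string
  evals: string[]
  maxIterations: number
  maxCost: number
  scoreThreshold: number
  agentTimeout: number
}

function parseArgs(argv: string[]): EvalbuffArgs {
  // Look up the value following a flag, if present.
  const get = (flag: string): string | undefined => {
    const i = argv.indexOf(flag)
    return i >= 0 ? argv[i + 1] : undefined
  }
  const repo = get('--repo')
  const agent = get('--agent')
  const evals = get('--evals')
  if (!repo || !agent || !evals) {
    throw new Error('--repo, --agent and --evals are required')
  }
  return {
    repo,
    agent,
    evals: evals.split(','),
    maxIterations: Number(get('--max-iterations') ?? 50),
    maxCost: Number(get('--max-cost') ?? 50),
    scoreThreshold: Number(get('--score-threshold') ?? 7.0),
    agentTimeout: Number(get('--agent-timeout') ?? 300000),
  }
}
```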
### Overnight Run

For an overnight run, set generous limits and let it go:

```bash
nohup bun run evals/evalbuff/run-evalbuff.ts \
  --repo /path/to/repo \
  --agent "claude -p" \
  --evals evals/buffbench/eval-codebuff.json \
  --max-iterations 200 \
  --max-cost 100 \
  > evalbuff-overnight.log 2>&1 &
```

Check results in the morning:

- `<repo>/evalbuff-report-YYYY-MM-DD.md` — morning report
- `<repo>/evalbuff-log.jsonl` — detailed per-task log
- `<repo>/docs/` — the docs that were kept
- `<repo>/AGENTS.md` — table of contents

### Resumable

Evalbuff saves state to `evalbuff-state.json` in the target repo. If interrupted, re-running with the same arguments will skip completed tasks and continue where it left off.
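Skip-completed resumption only needs the set of finished task ids. A minimal sketch, assuming a state shape like the one below (field names are illustrative, not the actual `evalbuff-state.json` schema):

```typescript
// Illustrative resumable-state filtering: given all task ids and the
// saved state, return only the tasks that still need to run.
interface EvalbuffState {
  completedTaskIds: string[]
  totalCost: number
}

function pendingTasks(allTaskIds: string[], state: EvalbuffState): string[] {
  const done = new Set(state.completedTaskIds)
  return allTaskIds.filter((id) => !done.has(id))
}
```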
## How It Decides What Docs to Write

When an agent scores below the threshold on a task, evalbuff:

1. **Feeds the judge's weaknesses** to a doc-writer LLM agent
2. The doc writer sees: the task prompt, ground truth diff, agent's diff, judge analysis, and all current docs
3. It produces a **targeted doc file** — specific to the gap between what the agent did and what it should have done
4. The doc is written to `docs/<suggested-path>.md` and `AGENTS.md` is updated

The doc writer is instructed to be specific and actionable — referencing concrete file paths, function names, and patterns. Generic advice like "follow best practices" is explicitly rejected.
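The context listed in step 2 might be assembled like this. The structure and instruction wording are hypothetical; the real prompt construction in `docs-optimizer.ts` is not shown:

```typescript
// Illustrative: bundle everything the doc writer sees into one prompt.
interface DocWriterContext {
  taskPrompt: string
  groundTruthDiff: string
  agentDiff: string
  judgeAnalysis: string
  currentDocs: Record<string, string> // path -> contents
}

function buildDocWriterPrompt(ctx: DocWriterContext): string {
  const docs = Object.entries(ctx.currentDocs)
    .map(([path, body]) => `--- ${path} ---\n${body}`)
    .join('\n')
  return [
    `Task: ${ctx.taskPrompt}`,
    `Ground truth diff:\n${ctx.groundTruthDiff}`,
    `Agent diff:\n${ctx.agentDiff}`,
    `Judge analysis:\n${ctx.judgeAnalysis}`,
    `Current docs:\n${docs}`,
    'Write one targeted doc file. Reference concrete file paths, function names, and patterns; generic advice is rejected.',
  ].join('\n\n')
}
```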
## What Gets Produced

After a run, the target repo will contain:

```
target-repo/
├── docs/
│   ├── patterns/
│   │   └── error-handling.md          # Evalbuff-generated
│   ├── conventions/
│   │   └── naming.md                  # Evalbuff-generated
│   └── architecture/
│       └── data-flow.md               # Evalbuff-generated
├── AGENTS.md                          # Table of contents
├── evalbuff-state.json                # Resumable state
├── evalbuff-log.jsonl                 # Per-task log
├── evalbuff-criteria.json             # Current criteria level
└── evalbuff-report-2026-03-25.md      # Morning report
```

### Morning Report

The morning report includes:

- Summary table (iterations, cost, duration, score deltas)
- Doc changes table (which docs were tried, score impact, kept/reverted)
- Error log
- Score trajectory visualization
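The summary numbers come from the JSONL log, one JSON object per line. A sketch of the aggregation, assuming illustrative field names (`morning-report.ts` may track more fields):

```typescript
// Illustrative: derive report summary stats from the overnight JSONL log.
interface LogEntry {
  taskId: string
  score: number
  cost: number
}

function summarize(jsonl: string): { iterations: number; totalCost: number; avgScore: number } {
  const entries: LogEntry[] = jsonl
    .split('\n')
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line))
  const totalCost = entries.reduce((sum, e) => sum + e.cost, 0)
  const avgScore = entries.length
    ? entries.reduce((sum, e) => sum + e.score, 0) / entries.length
    : 0
  return { iterations: entries.length, totalCost, avgScore }
}
```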
## Eval Data Format

Evalbuff reuses BuffBench's `EvalDataV2` format. Eval tasks are real git commits from open source repos, turned into prompts:

```json
{
  "repoUrl": "https://github.com/org/repo",
  "evalCommits": [
    {
      "id": "task-abc123",
      "sha": "abc123",
      "parentSha": "def456",
      "prompt": "Add error handling to the API endpoint...",
      "fileDiffs": [{ "path": "src/api.ts", "diff": "..." }],
      "supplementalFiles": ["src/types.ts"]
    }
  ]
}
```

Generate new evals with BuffBench's eval generation tools, then point evalbuff at the JSON files.
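In TypeScript terms, the JSON example above corresponds to shapes like these. The field names match the sample; the real `EvalDataV2` in `types.ts` may carry additional fields:

```typescript
// Illustrative types mirroring the eval data JSON sample.
interface FileDiff {
  path: string
  diff: string
}

interface EvalCommitV2 {
  id: string
  sha: string
  parentSha: string
  prompt: string
  fileDiffs: FileDiff[]
  supplementalFiles: string[]
}

interface EvalDataV2 {
  repoUrl: string
  evalCommits: EvalCommitV2[]
}

// Loose runtime check on the top-level shape before running a file.
function isEvalDataV2(x: any): x is EvalDataV2 {
  return typeof x?.repoUrl === 'string' && Array.isArray(x?.evalCommits)
}
```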
## Relationship to BuffBench

BuffBench benchmarks agents against each other. Evalbuff improves a single agent's performance over time.

| | BuffBench | Evalbuff |
|---|-----------|----------|
| **Goal** | Compare agents | Improve an agent |
| **Output** | Scores + rankings | Documentation |
| **Loop** | Single pass | Iterative |
| **Judges** | 3 (GPT, Gemini, Claude) | 2 (GPT, Gemini) |
| **Agent coupling** | Codebuff SDK | Any CLI agent |

Evalbuff was deep-copied from BuffBench and modified — they share types and eval data format but are independent codebases.
