1 change: 1 addition & 0 deletions AGENTS.md
@@ -42,3 +42,4 @@ Make an efficient learning agent that can do anything.
- [`docs/environment-variables.md`](docs/environment-variables.md) — Env var rules, DI helpers, loading order
- [`docs/agents-and-tools.md`](docs/agents-and-tools.md) — Agent system, shell shims, tool definitions
- [`docs/patterns/handle-steps-generators.md`](docs/patterns/handle-steps-generators.md) — handleSteps generator patterns and spawn_agents tool calls
- [`docs/evalbuff/interpreting-task-prompts.md`](docs/evalbuff/interpreting-task-prompts.md) — Interpreting eval-generated task prompts; check ground truth before implementing
63 changes: 63 additions & 0 deletions docs/evalbuff/interpreting-task-prompts.md
@@ -0,0 +1,63 @@
# Interpreting Task Prompts (Especially Eval-Generated Ones)

When working with task prompts, especially those auto-generated from commit history for evaluation purposes, the prompt text may not accurately describe the actual work needed.

## The Problem

Evalbuff generates task prompts by analyzing commits. Sometimes the prompt will say "create documentation about X" when the actual ground truth is "fix test scripts in package.json and CI workflow files." This happens when:

1. The commit message is misleading (e.g., "Simplify AGENTS.md" when it actually removes test scripts)
2. The prompt generator focuses on visible file additions rather than the semantic meaning of the change
3. The task is stated in terms of what a developer might ASK for, not what they actually need

## Solution: Always Check Ground Truth First

Before implementing ANY task:

1. **Check if there's a ground truth diff available** - look for references to expected changes, test files, or "what should have been done"
2. **Examine file paths and extensions in the ground truth**:
   - `.json` files (especially `package.json`) → likely config/dependency changes
   - `.yml`/`.yaml` files in `.github/workflows/` → CI/CD configuration changes
   - `.md` files → documentation (but could also be removing or editing existing docs)
   - `.ts`/`.js` files → code changes
3. **Read the actual diff content, not just the prompt** - the diff shows EXACTLY what changed
4. **Distinguish between creation vs. modification**:
- Does the ground truth show `new file mode` or additions to existing files?
- Is this refactoring, removal, or net-new functionality?
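
A minimal sketch of these heuristics as code (the function names and categories are illustrative, not part of evalbuff):

```typescript
// Hypothetical helpers mirroring the heuristics above; not evalbuff API.
type ChangeKind = "config" | "ci" | "docs" | "code" | "other";

function classifyPath(path: string): ChangeKind {
  // Order matters: workflow YAML counts as CI even though it is also config.
  if (/\.ya?ml$/.test(path) && path.includes(".github/workflows")) return "ci";
  if (path.endsWith(".json")) return "config";
  if (path.endsWith(".md")) return "docs";
  if (/\.(ts|js)$/.test(path)) return "code";
  return "other";
}

function looksLikeDocsTask(groundTruthPaths: string[]): boolean {
  // If no .md file appears in the ground truth, a "create documentation"
  // prompt is probably misleading.
  return groundTruthPaths.some((p) => classifyPath(p) === "docs");
}
```

In the AGENTS.md example below, both ground truth paths classify as `config`/`ci`, so the documentation framing is a red flag.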

## Example: The AGENTS.md Confusion

Prompt said:
> "Can you create an AGENTS.md file at the root that provides an overview..."

Ground truth showed:
```diff
--- a/.agents/package.json
+++ b/.agents/package.json
- "test:e2e": "bun test e2e"
--- a/.github/workflows/nightly-e2e.yml
+++ b/.github/workflows/nightly-e2e.yml
- run: cd .agents && bun run test:e2e
+ run: cd agents && bun run test:e2e
```

The actual task was about:
- Removing a test script from package.json
- Fixing directory references in a CI workflow
- NOT about creating documentation

The agent should have recognized that the ground truth shows `.json` and `.yml` config files, not `.md` documentation files.

## When In Doubt

If the prompt seems to conflict with file paths/types in the ground truth:
1. Trust the ground truth diff over the prompt text
2. Read the actual file contents being changed
3. Understand the PURPOSE of the change (fixing tests, updating config, refactoring) before implementing
4. Ask clarifying questions if the task is genuinely ambiguous

## Red Flags

- Prompt says "create docs" but ground truth shows only config file changes → likely NOT a docs task
- Prompt says "add feature X" but ground truth removes code → likely a cleanup/refactor task
- Prompt uses vague language ("simplify", "improve") → read the diff to understand the specific technical change
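
The first red flag can be sketched as a quick mismatch check (a hypothetical helper; the regex and names are illustrative):

```typescript
// Hypothetical check: the prompt talks about docs, but the ground truth
// touches no .md files. Flags a likely mismatch, nothing more.
function docsPromptButNoDocsChanged(
  prompt: string,
  groundTruthPaths: string[],
): boolean {
  const mentionsDocs = /\b(docs?|documentation|README|\.md)\b/i.test(prompt);
  const touchesDocs = groundTruthPaths.some((p) => p.endsWith(".md"));
  return mentionsDocs && !touchesDocs;
}
```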
249 changes: 93 additions & 156 deletions evalbuff/README.md
@@ -1,214 +1,151 @@
# Evalbuff

Evalbuff improves a coding agent's performance by iteratively optimizing project documentation. It watches an agent fail, writes docs to fix the pattern, and keeps only the changes that measurably help.

## Two Modes

### 1. Commit Learning Mode (default)

Walks through your repo's git history commit-by-commit, using each commit as a learning opportunity:

1. Start at HEAD~500 (configurable) and process commits one at a time, oldest first
2. For each commit, craft a human-like prompt that vaguely describes the change (via LLM)
3. Run N agents in parallel (default 5) on that prompt against the parent commit
4. Judge all runs — using the actual commit diff as ground truth
5. Always analyze failures and propose doc changes (ensuring they're generic enough to help future tasks, not just this one)
6. Re-run N agents with the proposed docs
7. If scores improve, keep the docs and try to propose more improvements
8. If scores don't improve, reject the docs and move to the next commit
9. State is saved after each commit — resume at any time
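
The oldest-first ordering in step 1 might look like this (a sketch, assuming the SHA list comes from `git rev-list --max-count=N HEAD`, which emits commits newest first; not evalbuff's actual code):

```typescript
// Reverse newest-first `git rev-list` output so commits are processed
// oldest first, letting later tasks benefit from docs learned earlier.
function orderOldestFirst(revListOutput: string): string[] {
  return revListOutput.trim().split("\n").reverse();
}
```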

The result: a `docs/` directory that encodes patterns the agent needs to know, learned from real historical changes.

### 2. Prompt Mode

Run a specific coding prompt and improve docs for it — no git history needed:

1. Given a prompt describing a coding task
2. Run N agents in parallel on the prompt against the current HEAD
3. Judge all runs — no ground truth; relies entirely on e2e testing by the judge
4. Analyze and propose doc changes
5. Re-run and keep/reject as in commit learning mode

Useful for targeted doc improvement around known pain points.


## How It Works

```
for each task (commit or prompt):
┌────────────────────────────────────────────────────┐
│ 1. Run N agents in parallel (baseline)             │
│ 2. Judge all N runs → average score                │
│ 3. Analyze worst run → propose generic doc         │
│ 4. Apply doc to repo                               │
│ 5. Re-run N agents with new doc                    │
│ 6. Score improved? Keep doc, try more improvements │
│    Score same/worse? Reject doc, next task         │
└────────────────────────────────────────────────────┘
```

Key design decisions:
- **Low-cost agent** (`codebuff --agent base2-free` by default) — runs many times cheaply
- **N parallel runs** for statistical significance — one run is noisy; five gives a decent signal
- **Always analyze** — no score threshold; every task is a learning opportunity
- **Generic docs only** — the doc writer is instructed to skip task-specific advice and focus on patterns
- **Iterative improvement** — keeps proposing docs until one is rejected, then moves on
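
The keep/reject decision can be sketched as a comparison of mean scores across the N runs (an assumption: reject-on-tie, matching "Score same/worse? Reject"; `shouldKeepDoc` is a hypothetical name):

```typescript
// Average the judge scores from N parallel runs and require a strict
// improvement; a tie counts as "same/worse" and rejects the doc.
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function shouldKeepDoc(baselineScores: number[], withDocScores: number[]): boolean {
  return mean(withDocScores) > mean(baselineScores);
}
```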

## Usage

### Commit Learning Mode

```bash
bun run evalbuff/src/run-evalbuff.ts \
--repo /path/to/target-repo \
--agent "codebuff --agent base2-free" \
--commits 500 \
--parallelism 5 \
--max-cost 100
```

### Prompt Mode

```bash
bun run evalbuff/src/run-evalbuff.ts \
--repo /path/to/target-repo \
--agent "codebuff --agent base2-free" \
--prompt "Add a dark mode toggle to the settings page" \
--parallelism 5
```

### Arguments

| Argument | Default | Description |
|----------|---------|-------------|
| `--repo` | required | Path to the target repo where docs/ will be written |
| `--agent` | `codebuff --agent base2-free` | Agent CLI command (prompt appended as last arg) |
| `--prompt` | — | If set, runs in prompt mode instead of learn mode |
| `--commits` | 500 | How many commits back to start from (learn mode) |
| `--parallelism` | 5 | Number of agents to run in parallel per task |
| `--max-cost` | 100 | Stop after spending this many USD (estimated) |
| `--agent-timeout` | 300000 | Per-agent timeout in ms (5 min default) |
| `--init-command` | — | Command to run in each test repo (e.g., `npm install`) |
| `--criteria` | auto | Path to criteria JSON (auto-created if omitted) |
| `--reviewers` | `claude,codex` | Comma-separated reviewer agent types |

### Resuming

State is saved to `evalbuff-state.json` in the target repo after each commit. Re-running with the same `--repo` automatically resumes — evalbuff records the last processed commit and continues from there.
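
A plausible shape for `evalbuff-state.json` (only the last-processed commit is documented; the other fields are illustrative guesses):

```json
{
  "lastProcessedSha": "abc1234",
  "tasksCompleted": 137,
  "estimatedCostUsd": 41.25,
  "keptDocs": ["docs/patterns/error-handling.md"]
}
```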

### Overnight Run

```bash
nohup bun run evalbuff/src/run-evalbuff.ts \
--repo /path/to/repo \
--commits 500 \
--parallelism 5 \
--max-cost 200 \
> evalbuff-overnight.log 2>&1 &
```

Check results in the morning:
- `<repo>/evalbuff-report-YYYY-MM-DD.md` — morning report
- `<repo>/evalbuff-log.jsonl` — detailed per-task log
- `<repo>/docs/` — the docs that were kept
- `<repo>/AGENTS.md` — table of contents


## What Gets Produced

After a run, the target repo will contain:

```
target-repo/
├── docs/                          # Generated documentation
│   ├── patterns/
│   │   └── error-handling.md
│   ├── conventions/
│   │   └── naming.md
│   └── architecture/
│       └── data-flow.md
├── AGENTS.md                      # Table of contents
├── evalbuff-state.json            # Resumable state (last commit SHA)
├── evalbuff-log.jsonl             # Per-task log
├── evalbuff-criteria.json         # Current criteria level
└── evalbuff-report-2026-03-26.md  # Report
```

### Morning Report

The morning report includes:
- Summary table (iterations, cost, duration, score deltas)
- Doc changes table (which docs were tried, score impact, kept/reverted)
- Error log
- Score trajectory visualization

## Living Quality Criteria

Judges use a leveling system to avoid over-optimizing prematurely:

| Level | Criteria Added | Promotion |
|-------|---------------|-----------|
| L1 | Builds, tests pass, basic completeness | Start |
| L2 | + Feature works E2E, logs clean | After L1 avg >= 8.0 over 10 tasks |
| L3 | + Edge cases, UI verification | After L2 avg >= 8.0 |
| L4 | + Cross-component integration, performance | After L3 avg >= 8.0 |
| L5 | + Production readiness | After L4 avg >= 8.0 |
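
The promotion rule can be sketched as follows (an assumption from the table: advance one level once the trailing average over the last 10 tasks reaches 8.0; `nextLevel` is a hypothetical name):

```typescript
// Promote at most one level at a time, based on the trailing 10-task average.
function nextLevel(current: number, taskScores: number[]): number {
  if (current >= 5 || taskScores.length < 10) return current;
  const last10 = taskScores.slice(-10);
  const avg = last10.reduce((a, b) => a + b, 0) / last10.length;
  return avg >= 8.0 ? current + 1 : current;
}
```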

## Architecture

| File | Role |
|------|------|
| `run-evalbuff.ts` | Main orchestrator — learn mode + prompt mode |
| `commit-task-generator.ts` | Extract tasks from git history, generate prompts from commits |
| `cli-runner.ts` | Agent-agnostic CLI runner — spawns any agent, captures diff |
| `judge.ts` | AI judging with/without ground truth, multi-reviewer aggregation |
| `docs-optimizer.ts` | Failure analysis, generic doc writing, doc application/revert |
| `criteria.ts` | Living quality criteria with L1-L5 promotion |
| `morning-report.ts` | Report generation from JSONL log |
| `test-repo-utils.ts` | Isolated git repo lifecycle management |