Rework evalbuff: commit learning, parallel agents, trace compression by jahooma · Pull Request #481 · CodebuffAI/codebuff

jahooma · 2026-03-27T01:28:50Z

Summary

Two-mode architecture: Learn mode walks git history commit-by-commit generating tasks via LLM; prompt mode runs a specific task through the same improvement loop
Parallel agent execution: Runs N agents per task (configurable, default 5) and averages scores for statistical robustness
Iterative doc improvement: Proposes generic docs from agent failures, re-runs agents to verify improvement, keeps or rejects based on score delta
Trace compression: Hybrid approach stores large tool results in files with inline pointers so the doc writer can inspect agent reasoning and find root causes
Process group killing: Fixes timeout handling by using detached spawn + process.kill(-pid) to kill entire agent process trees
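The process-group fix in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the actual cli-runner code: `spawnDetached`, `killProcessGroup`, and `runWithTimeout` are hypothetical names, and the sketch assumes a POSIX system, where a negative pid passed to `process.kill` addresses a whole process group.

```typescript
import { spawn, ChildProcess } from "node:child_process";

// Hypothetical helper: start a command as the leader of its own process group.
// With detached: true on POSIX, the child's pid doubles as the group id.
// A real runner would likely pipe stdio instead of ignoring it.
function spawnDetached(command: string, args: string[]): ChildProcess {
  return spawn(command, args, { detached: true, stdio: "ignore" });
}

// Hypothetical helper: signal every process in the group led by `pid`.
// Returns false if the group no longer exists (ESRCH).
function killProcessGroup(pid: number, signal: NodeJS.Signals = "SIGKILL"): boolean {
  try {
    process.kill(-pid, signal); // negative pid targets the whole group
    return true;
  } catch (err) {
    if ((err as NodeJS.ErrnoException).code === "ESRCH") return false;
    throw err;
  }
}

// Hypothetical wrapper: run a command and kill its entire tree on timeout.
function runWithTimeout(
  command: string,
  args: string[],
  timeoutMs: number,
): Promise<{ code: number | null; timedOut: boolean }> {
  return new Promise((resolve, reject) => {
    const child = spawnDetached(command, args);
    let timedOut = false;
    const timer = setTimeout(() => {
      timedOut = true;
      if (child.pid !== undefined) killProcessGroup(child.pid);
    }, timeoutMs);
    child.once("error", (err) => {
      clearTimeout(timer);
      reject(err);
    });
    child.once("exit", (code) => {
      clearTimeout(timer);
      resolve({ code, timedOut });
    });
  });
}
```

Killing only `child.pid` would leave any grandchildren the agent spawned still running, which is why the signal goes to the group rather than to the single process.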

Test plan

Unit tests for trace compressor (8 tests)
Integration tests for learn mode and prompt mode (5 tests)
Real e2e test against this repo's git history: processed commits, generated docs, and the iterative loop correctly kept/rejected changes
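For orientation, the score averaging and keep/reject decision that these tests exercise might look something like the sketch below. All names (`RunTask`, `averageScore`, `shouldKeep`, `tryDoc`) and the `minDelta` threshold are assumptions made for illustration; the PR text only states that N runs are averaged and that a doc is kept or rejected based on the score delta.

```typescript
// Hypothetical type: one agent run on a task, returning the judge's score.
type RunTask = (docs: string[]) => Promise<number>;

function mean(scores: number[]): number {
  if (scores.length === 0) throw new Error("no scores to average");
  return scores.reduce((a, b) => a + b, 0) / scores.length;
}

// Run n agents in parallel on the same task and average their scores.
async function averageScore(run: RunTask, docs: string[], n = 5): Promise<number> {
  const scores = await Promise.all(Array.from({ length: n }, () => run(docs)));
  return mean(scores);
}

// Keep a candidate doc only if the averaged score improves by more than minDelta.
// The threshold value is an assumption; the PR does not state one.
function shouldKeep(baseline: number, candidate: number, minDelta = 0): boolean {
  return candidate - baseline > minDelta;
}

// One iteration of the loop: score without the doc, score with it, then decide.
async function tryDoc(run: RunTask, docs: string[], candidateDoc: string, n = 5) {
  const baseline = await averageScore(run, docs, n);
  const withDoc = [...docs, candidateDoc];
  const candidate = await averageScore(run, withDoc, n);
  const kept = shouldKeep(baseline, candidate);
  return { kept, docs: kept ? withDoc : docs, baseline, candidate };
}
```

Averaging over several runs reduces the chance that a single lucky or unlucky agent run decides whether a doc is kept.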

🤖 Generated with Claude Code

…sion

Two-mode architecture: learn mode walks git history commit-by-commit, prompt mode runs a specific task. Both use iterative doc improvement with parallel agent execution, judging, and keep/reject loop.

Key changes:
- Add commit-task-generator: extracts tasks from git history via LLM
- Add trace-compressor: hybrid compression stores large tool results in files with inline pointers so doc writer can see agent reasoning
- Rewrite run-evalbuff with runLearnMode/runPromptMode, parallel agent execution (N runs per task), and iterative doc improvement
- Fix cli-runner timeout: kill entire process group via detached spawn
- Update judge with judgeTaskResult for prompt mode (no ground truth)
- Update docs-optimizer: always analyze, agent trace support, revert logic that preserves previously-accepted doc edits
- Rewrite tests for new architecture

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
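The hybrid trace compression described in this commit could be approximated as below. This is only a sketch under assumptions: the `TraceStep` shape, the `compressTrace` name, the file naming scheme, and the 2000-character inline threshold are all hypothetical and not taken from the trace-compressor module.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical trace shape; the real trace format is not shown in the PR.
interface TraceStep {
  kind: "reasoning" | "tool_result";
  tool?: string;
  text: string;
}

// Keep reasoning and small tool results inline; move large tool results
// to files and leave a one-line pointer in their place.
function compressTrace(steps: TraceStep[], outDir: string, maxInlineChars = 2000): string {
  fs.mkdirSync(outDir, { recursive: true });
  const lines = steps.map((step, i) => {
    if (step.kind !== "tool_result" || step.text.length <= maxInlineChars) {
      return `[${step.kind}] ${step.text}`;
    }
    const file = path.join(outDir, `tool-result-${i}.txt`);
    fs.writeFileSync(file, step.text, "utf8");
    return `[tool_result] ${step.tool ?? "unknown"}: ${step.text.length} chars stored in ${file}`;
  });
  return lines.join("\n");
}
```

The doc writer then reads a short trace dominated by the agent's reasoning and opens a pointed-to file only when a specific tool result matters for finding the root cause.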

…e prompts

- Read full file contents at parent commit (up to 500K) to give the prompt generator rich context about the codebase, matching buffbench's approach
- Include the complete diff (up to 200K chars) instead of truncating at 8K
- Rewrite system prompt to produce human-like prompts: high-level functional requirements, natural language, no file paths unless a human would mention them
- Skip commits with diffs >200K instead of >50K

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
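The size limits in this commit might be applied along these lines. The constants come from the commit message, but everything else is an assumption: the names are hypothetical, the 500K budget is treated as characters, and including whole files until the budget is exhausted is one possible policy, not necessarily the one the generator uses.

```typescript
// Limits from the commit message; the unit of the 500K budget is assumed to be characters.
const MAX_DIFF_CHARS = 200_000;
const MAX_CONTEXT_CHARS = 500_000;

// Hypothetical shape for a file read at the parent commit.
interface FileAtParent {
  path: string;
  content: string;
}

// Commits whose diff exceeds the limit are skipped rather than truncated.
function shouldSkipCommit(diff: string): boolean {
  return diff.length > MAX_DIFF_CHARS;
}

// Assumed policy: include whole files in order, skipping any file that
// would push the total past the budget, so no file is cut mid-content.
function collectContext(files: FileAtParent[], budget = MAX_CONTEXT_CHARS): FileAtParent[] {
  const kept: FileAtParent[] = [];
  let used = 0;
  for (const file of files) {
    if (used + file.content.length > budget) continue;
    kept.push(file);
    used += file.content.length;
  }
  return kept;
}
```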

jahooma and others added 2 commits March 26, 2026 17:49

evalbuff: add evalbuff/interpreting-task-prompts.md (6259c17)

f8ee6e8

jahooma requested review from brandonkachen and charleslien as code owners March 27, 2026 01:28

jahooma merged commit f0636fc into main Mar 27, 2026
34 checks passed

jahooma deleted the jahooma/evalbuff-rework branch March 27, 2026 07:31

jahooma merged 3 commits into main from jahooma/evalbuff-rework