Rework evalbuff: commit learning, parallel agents, trace compression #481

Merged
jahooma merged 3 commits into main from jahooma/evalbuff-rework
Mar 27, 2026

@jahooma commented Mar 27, 2026

Summary

  • Two-mode architecture: Learn mode walks git history commit by commit, generating tasks via an LLM; prompt mode runs a specific task through the same improvement loop
  • Parallel agent execution: Runs N agents per task (configurable, default 5) and averages scores for statistical robustness
  • Iterative doc improvement: Proposes generic docs from agent failures, re-runs agents to verify improvement, keeps or rejects based on score delta
  • Trace compression: Hybrid approach stores large tool results in files with inline pointers so the doc writer can inspect agent reasoning and find root causes
  • Process group killing: Fixes timeout handling by using detached spawn + process.kill(-pid) to kill entire agent process trees
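The process-group fix in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the PR's cli-runner code: `runWithTimeout` and `RunResult` are hypothetical names, and it assumes a POSIX system, where a child spawned with `detached: true` becomes the leader of its own process group.

```typescript
import { spawn } from "node:child_process";

// Hypothetical result shape for this sketch.
interface RunResult {
  exitCode: number | null; // null when the child was terminated by a signal
  timedOut: boolean;
}

// Hypothetical helper: run a command and kill its whole process tree on timeout.
function runWithTimeout(
  command: string,
  args: string[],
  timeoutMs: number,
): Promise<RunResult> {
  return new Promise((resolve, reject) => {
    // detached: true puts the child in a new process group (POSIX only).
    const child = spawn(command, args, { detached: true, stdio: "ignore" });
    let timedOut = false;

    const timer = setTimeout(() => {
      timedOut = true;
      if (child.pid !== undefined) {
        try {
          // A negative pid addresses the whole process group, so any
          // grandchildren started by the agent are killed as well.
          process.kill(-child.pid, "SIGKILL");
        } catch {
          // The group may already have exited; nothing left to kill.
        }
      }
    }, timeoutMs);

    child.on("error", (err) => {
      clearTimeout(timer);
      reject(err);
    });
    child.on("close", (code) => {
      clearTimeout(timer);
      resolve({ exitCode: code, timedOut });
    });
  });
}
```

Killing only `child.pid` would leave any subprocesses the agent started running after the timeout, which is the failure mode the bullet describes.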

Test plan

  • Unit tests for trace compressor (8 tests)
  • Integration tests for learn mode and prompt mode (5 tests)
  • Real e2e test against this repo's git history — processed commits, generated docs, iterative loop correctly kept/rejected changes

🤖 Generated with Claude Code

jahooma and others added 2 commits March 26, 2026 17:49
…sion

Two-mode architecture: learn mode walks git history commit-by-commit,
prompt mode runs a specific task. Both use iterative doc improvement
with parallel agent execution, judging, and keep/reject loop.
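
A rough sketch of the scoring side of that loop, under stated assumptions: `AgentRun`, `averageScore`, and `keepDocEdit` are hypothetical names, the agent run is reduced to a function returning a numeric score, and the `minDelta` threshold is an assumption (the description only says edits are kept or rejected based on score delta).

```typescript
// Hypothetical: one agent run on a task, resolving to its judged score.
type AgentRun = (task: string, runIndex: number) => Promise<number>;

function mean(values: number[]): number {
  if (values.length === 0) throw new Error("cannot average an empty list");
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Run n agents on the same task concurrently and average their scores.
// The default of 5 matches the default stated in the summary.
async function averageScore(
  task: string,
  runAgent: AgentRun,
  n = 5,
): Promise<number> {
  const scores = await Promise.all(
    Array.from({ length: n }, (_, i) => runAgent(task, i)),
  );
  return mean(scores);
}

// Keep a proposed doc edit only if the re-run average beats the baseline
// by more than minDelta (assumed threshold; 0 means any improvement).
function keepDocEdit(
  baselineScore: number,
  candidateScore: number,
  minDelta = 0,
): boolean {
  return candidateScore - baselineScore > minDelta;
}
```

Averaging over several runs reduces the chance that one lucky or unlucky agent run decides whether a doc edit is kept.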

Key changes:
- Add commit-task-generator: extracts tasks from git history via LLM
- Add trace-compressor: hybrid compression stores large tool results
  in files with inline pointers so doc writer can see agent reasoning
- Rewrite run-evalbuff with runLearnMode/runPromptMode, parallel
  agent execution (N runs per task), and iterative doc improvement
- Fix cli-runner timeout: kill entire process group via detached spawn
- Update judge with judgeTaskResult for prompt mode (no ground truth)
- Update docs-optimizer: always analyze, agent trace support, revert
  logic that preserves previously-accepted doc edits
- Rewrite tests for new architecture
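
The hybrid trace compression could look something like the sketch below. It is illustrative only: `TraceStep`, `SaveResult`, `compressTrace`, the pointer text format, and the 2000-character inline limit are all assumptions, not the actual trace-compressor module. Storage is injected as a callback so the sketch stays independent of the file system.

```typescript
// Hypothetical shape of one tool call in an agent trace.
interface TraceStep {
  tool: string;
  result: string;
}

// Hypothetical storage callback: persists content and returns its path.
type SaveResult = (fileName: string, content: string) => string;

// Keep small tool results inline; move large ones to files and leave a
// pointer in the trace so a reader can still locate the full output.
function compressTrace(
  steps: TraceStep[],
  save: SaveResult,
  maxInlineChars = 2000, // assumed threshold
): TraceStep[] {
  return steps.map((step, index) => {
    if (step.result.length <= maxInlineChars) return step;
    const path = save(`step-${index}-${step.tool}.txt`, step.result);
    return {
      tool: step.tool,
      result: `[${step.result.length} chars stored in ${path}]`,
    };
  });
}
```

The agent's reasoning and small results stay readable in one place, while bulky tool output no longer dominates the context given to the doc writer.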

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e prompts

- Read full file contents at parent commit (up to 500K) to give the prompt
  generator rich context about the codebase, matching buffbench's approach
- Include the complete diff (up to 200K chars) instead of truncating at 8K
- Rewrite system prompt to produce human-like prompts: high-level functional
  requirements, natural language, no file paths unless a human would mention them
- Skip commits with diffs >200K instead of >50K
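
The size limits above might be applied along these lines. This is a sketch under assumptions: `CommitInput`, `Limits`, and `buildTaskContext` are hypothetical names, the limits are treated as character counts, and whole files are included until the context budget runs out (the commit message does not say how the 500K cap is enforced).

```typescript
// Hypothetical input: a commit's diff plus file contents at the parent commit.
interface CommitInput {
  diff: string;
  files: { path: string; content: string }[];
}

interface Limits {
  maxDiffChars: number;
  maxContextChars: number;
}

// Defaults taken from the numbers in the commit message.
const DEFAULT_LIMITS: Limits = {
  maxDiffChars: 200_000,
  maxContextChars: 500_000,
};

// Returns null when the commit should be skipped because its diff is too large.
function buildTaskContext(
  commit: CommitInput,
  limits: Limits = DEFAULT_LIMITS,
): CommitInput | null {
  if (commit.diff.length > limits.maxDiffChars) return null;

  const files: CommitInput["files"] = [];
  let used = 0;
  for (const file of commit.files) {
    // Assumed policy: stop at the first file that would exceed the budget.
    if (used + file.content.length > limits.maxContextChars) break;
    files.push(file);
    used += file.content.length;
  }
  // The diff is passed through whole; it is never truncated.
  return { diff: commit.diff, files };
}
```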

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jahooma merged commit f0636fc into main Mar 27, 2026
34 checks passed
@jahooma deleted the jahooma/evalbuff-rework branch March 27, 2026 07:31