
Coding Agent Benchmark

A benchmark harness for evaluating and comparing coding agents (LLM-powered CLI tools) across a suite of synthetic software engineering tasks. Measures correctness and speed.

Overview

The harness runs each configured agent against a set of isolated coding tasks, then executes hidden test suites to score the results. Agents work in temporary workspaces and never see the tests, ensuring a fair evaluation.

Default agents: Claude Code (claude-opus-4-6) and Codex CLI (gpt-5.4)

Metrics collected per task:

  • Test pass rate (correctness)
  • Wall-clock time

Tasks

11 tasks spanning multiple languages and categories:

#   Task                              Language         Category
00  smoke-test                        Python           bugfix
01  python-bugfix-csv                 Python           bugfix
02  c-bugfix-linkedlist               C                bugfix
03  typescript-feature-table-filter   TypeScript       feature
04  python-feature-pagination         Python           feature
05  typescript-scratch-task-queue     TypeScript       scratch
06  cpp-scratch-lru-cache             C++              scratch
07  python-refactor-monolith          Python           refactor
08  typescript-refactor-callbacks     TypeScript       refactor
09  c-multifile-segfault              C/C++            multifile
10  fullstack-angular-python          Python/Angular   multifile

Each task contains:

tasks/<name>/
  metadata.json   # language, category, timeout
  prompt.md       # instruction given to the agent
  repo/           # source code (broken or incomplete)
  tests/          # hidden test suite

Quick Start

Prerequisites

  • Python 3.11+
  • Agent CLIs installed and on PATH (e.g. claude, codex)
  • Language toolchains for the tasks you want to run (Python, Node.js, GCC/G++, Make)

Install

pip install -r requirements.txt

Run

# Full benchmark (all agents, all tasks)
python3 -m harness.run

# Dry run -- test the repos as-is without invoking agents
python3 -m harness.run --dry-run

# Custom config or tasks directory
python3 -m harness.run --config path/to/config.yaml --tasks-dir path/to/tasks

CLI Options

Flag              Description
--dry-run         Skip agent invocation; run tests on unmodified repos
--config PATH     Path to config YAML (default: harness/config.yaml)
--tasks-dir PATH  Path to tasks directory (default: tasks/)

Configuration

Agents, test runners, and output directory are defined in harness/config.yaml.

Adding an Agent

agents:
  my-agent:
    command: "my-cli"
    args: ["--prompt", "{prompt}", "--model", "{model}"]
    env:
      MY_TOOL_HOME: "{root}/tmp/tool-home"
    model: "my-model-name"
    timeout_seconds: 300

{prompt}, {model}, {workspace}, and {root} are replaced at runtime.
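As an illustration, the substitution can be thought of as a simple string-replacement pass over the configured arguments. This is a minimal sketch, not the harness's actual implementation (which lives in harness/runner.py and may differ):

```python
# Hypothetical sketch of placeholder expansion for agent commands.
# The real logic in harness/runner.py may differ.

def expand_placeholders(args: list[str], values: dict[str, str]) -> list[str]:
    """Replace {prompt}, {model}, {workspace}, {root} in each argument."""
    expanded = []
    for arg in args:
        for key, value in values.items():
            arg = arg.replace("{" + key + "}", value)
        expanded.append(arg)
    return expanded

args = ["--prompt", "{prompt}", "--model", "{model}"]
print(expand_placeholders(args, {
    "prompt": "Fix the failing CSV parser",
    "model": "my-model-name",
    "workspace": "/tmp/ws",
    "root": "/repo",
}))
```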

Codex benchmark home

The default Codex benchmark agent uses a repo-local CODEX_HOME at harness/codex-home/. This keeps benchmark behavior isolated from your personal ~/.codex/config.toml, so enabled MCP servers and Apps in your normal Codex setup do not affect the benchmark.

The tracked benchmark config lives at harness/codex-home/config.toml; runtime files in that directory are gitignored. This benchmark-local config also controls Codex's benchmark behavior, including its default reasoning effort and extra steering instructions to keep it fast and direct. The harness still passes service_tier="fast" on the Codex command line.

Claude is launched in non-interactive print mode with --permission-mode bypassPermissions, so it can use its built-in tools without per-action approval prompts during benchmark runs. It is pinned to --effort high and receives a benchmark-specific appended system prompt matching the Codex benchmark goal: be fast and direct, and avoid extra work unless needed. The harness also restricts Claude to --setting-sources local, --strict-mcp-config, --no-chrome, and --disable-slash-commands so personal MCP/config integrations are less likely to leak into benchmark runs.

To reuse your existing file-backed login without reusing your normal Codex config:

cp ~/.codex/auth.json harness/codex-home/auth.json

If you prefer, you can instead log in once directly against the benchmark home:

CODEX_HOME="$PWD/harness/codex-home" codex login

After either setup, benchmark runs will continue using only the benchmark-local CODEX_HOME.

Test Runners

Each language needs a test runner entry:

test_runners:
  python:
    command: "python3 -m pytest {test_dir} -v --tb=short"
    pattern: "tests/"
  typescript:
    command: "cd {workspace} && npm install --quiet && npx jest {test_dir} --verbose"
    pattern: "tests/"
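Scoring depends on parsing the runner's output into pass/fail counts. As an illustrative sketch only (the harness's actual parser in harness/test_executor.py may differ), a pytest terminal summary line can be matched with a regex:

```python
import re

# Hypothetical sketch: extract pass/fail counts from a pytest summary
# line such as "==== 7 passed, 1 failed in 2.31s ====".
# The real parser in harness/test_executor.py may differ.

def parse_pytest_summary(output: str) -> tuple[int, int]:
    passed = failed = 0
    m = re.search(r"(\d+) passed", output)
    if m:
        passed = int(m.group(1))
    m = re.search(r"(\d+) failed", output)
    if m:
        failed = int(m.group(1))
    return passed, failed

print(parse_pytest_summary("==== 7 passed, 1 failed in 2.31s ===="))  # (7, 1)
```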

Output

Results are saved to results/<run-id>/:

results/2026-03-07-152520/
  summary.json            # per-agent aggregate metrics
  claude-code/
    00-smoke-test.json    # per-task result
    ...
  codex/
    00-smoke-test.json
    ...

The terminal report shows a comparison table:

==============================================================
  BENCHMARK RESULTS
==============================================================

TASK RESULTS
--------------------------------------------------------------
Task                           claude-code     codex
--------------------------------------------------------------
00-smoke-test                  PASS 2/2 15s    PASS 2/2 12s
01-python-bugfix-csv           PASS 8/8 24s    PASS 8/8 37s
02-c-bugfix-linkedlist         PASS 8/8 18s    PASS 8/8 15s
03-typescript-feature-table-filter PASS 13/13 18s  PASS 13/13 17s
04-python-feature-pagination   FAIL 9/10 20s   PASS 10/10 21s
05-typescript-scratch-task-queue PASS 8/8 37s    PASS 8/8 49s
06-cpp-scratch-lru-cache       PASS 9/9 58s    PASS 9/9 35s
07-python-refactor-monolith    PASS 12/12 22s  PASS 12/12 22s
08-typescript-refactor-callbacks PASS 10/10 25s  PASS 10/10 26s
09-c-multifile-segfault        PASS 8/8 50s    PASS 8/8 46s
10-fullstack-angular-python    ERROR 10/10 20s ERROR 10/10 21s

FAILURE DETAILS
--------------------------------------------------------------
04-python-feature-pagination [claude-code] FAIL: 1/10 tests failed
10-fullstack-angular-python [claude-code] ERROR: typescript: runner exited non-zero without a parsed test summary
10-fullstack-angular-python [codex] ERROR: typescript: runner exited non-zero without a parsed test summary

SUMMARY
---------------------------------------------------------
Metric                    claude-code     codex
---------------------------------------------------------
Tasks fully passed        10/11           11/11
Avg correctness           0.99            1.0
Avg speed (s)             27.84           27.46
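The SUMMARY metrics above follow directly from the per-task results: tasks fully passed, mean pass fraction, and mean wall-clock time. A hedged sketch of that aggregation (the real computation lives in harness/scoring.py and may differ):

```python
# Sketch of how the SUMMARY metrics could be derived from per-task
# results; harness/scoring.py may compute them differently.

def summarize(results: list[dict]) -> dict:
    """Each result: {"passed": int, "total": int, "seconds": float}."""
    fully_passed = sum(1 for r in results if r["passed"] == r["total"])
    avg_correctness = sum(r["passed"] / r["total"] for r in results) / len(results)
    avg_speed = sum(r["seconds"] for r in results) / len(results)
    return {
        "tasks_fully_passed": f"{fully_passed}/{len(results)}",
        "avg_correctness": round(avg_correctness, 2),
        "avg_speed_s": round(avg_speed, 2),
    }

print(summarize([
    {"passed": 2, "total": 2, "seconds": 15},
    {"passed": 9, "total": 10, "seconds": 20},
]))
```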

Project Structure

harness/
  config.py          # Configuration loading (AgentConfig, TestRunnerConfig)
  config.yaml        # Default agent and test runner definitions
  run.py             # Main benchmark orchestrator and CLI entry point
  runner.py          # Agent execution and workspace isolation
  task_loader.py     # Task discovery and metadata parsing
  test_executor.py   # Test execution and output parsing (pytest, jest, make)
  scoring.py         # Score aggregation and summary computation
  report.py          # Terminal report formatting
tasks/               # Benchmark task definitions
tests/               # Unit tests for the harness
docs/plans/          # Design and implementation documentation
results/             # Benchmark run output (gitignored)

Running Tests

python3 -m pytest tests/ -v

Adding a Task

  1. Create a directory under tasks/ (prefix with a number for ordering)
  2. Add metadata.json:
    {
      "language": "python",
      "category": "bugfix",
      "timeout_seconds": 120
    }
    For multi-language tasks, add "test_languages": ["python", "typescript"].
  3. Add prompt.md with the task description
  4. Add repo/ with the source code
  5. Add tests/ with the test suite
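A new task directory is picked up automatically by task discovery. As a minimal sketch of what that discovery might look like (the actual implementation in harness/task_loader.py may differ), scanning tasks/ for metadata.json files is enough:

```python
import json
from pathlib import Path

# Hypothetical sketch of task discovery: scan tasks/ for directories
# containing metadata.json. harness/task_loader.py may differ.

def discover_tasks(tasks_dir: str) -> list[dict]:
    tasks = []
    for meta_path in sorted(Path(tasks_dir).glob("*/metadata.json")):
        meta = json.loads(meta_path.read_text())
        meta["name"] = meta_path.parent.name  # directory name, e.g. 00-smoke-test
        tasks.append(meta)
    return tasks
```

Sorting by path is why the numeric prefix in step 1 controls task ordering.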

License

MIT
