
Coding Agent Benchmark

A benchmark harness for evaluating and comparing coding agents (LLM-powered CLI tools) across a suite of synthetic software engineering tasks. Measures correctness and speed.

Overview

The harness runs each configured agent against a set of isolated coding tasks, then executes hidden test suites to score the results. Agents work in temporary workspaces and never see the tests, ensuring a fair evaluation.

Default agents: Claude Code (claude-opus-4-6) and Codex CLI (gpt-5.4)

Metrics collected per task:

  • Test pass rate (correctness)
  • Wall-clock time

Tasks

11 tasks spanning multiple languages and categories:

#   Task                              Language         Category
00  smoke-test                        Python           bugfix
01  python-bugfix-csv                 Python           bugfix
02  c-bugfix-linkedlist               C                bugfix
03  typescript-feature-table-filter   TypeScript       feature
04  python-feature-pagination         Python           feature
05  typescript-scratch-task-queue     TypeScript       scratch
06  cpp-scratch-lru-cache             C++              scratch
07  python-refactor-monolith          Python           refactor
08  typescript-refactor-callbacks     TypeScript       refactor
09  c-multifile-segfault              C/C++            multifile
10  fullstack-angular-python          Python/Angular   multifile

Each task contains:

tasks/<name>/
  metadata.json   # language, category, timeout
  prompt.md       # instruction given to the agent
  repo/           # source code (broken or incomplete)
  tests/          # hidden test suite

Quick Start

Prerequisites

  • Python 3.11+
  • Agent CLIs installed and on PATH (e.g. claude, codex)
  • Language toolchains for the tasks you want to run (Python, Node.js, GCC/G++, Make)

Install

pip install -r requirements.txt

Run

# Full benchmark (all agents, all tasks)
python3 -m harness.run

# Dry run -- test the repos as-is without invoking agents
python3 -m harness.run --dry-run

# Custom config or tasks directory
python3 -m harness.run --config path/to/config.yaml --tasks-dir path/to/tasks

CLI Options

Flag              Description
--dry-run         Skip agent invocation; run tests on unmodified repos
--config PATH     Path to config YAML (default: harness/config.yaml)
--tasks-dir PATH  Path to tasks directory (default: tasks/)

Configuration

Agents, test runners, and output directory are defined in harness/config.yaml.

Adding an Agent

agents:
  my-agent:
    command: "my-cli"
    args: ["--prompt", "{prompt}", "--model", "{model}"]
    env:
      MY_TOOL_HOME: "{root}/tmp/tool-home"
    model: "my-model-name"
    timeout_seconds: 300

{prompt}, {model}, {workspace}, and {root} are replaced at runtime.
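As an illustration, the substitution can be thought of as a simple string-replacement pass over the configured arguments. This is a minimal sketch, not the harness's actual implementation (which lives in harness/runner.py and may differ):

```python
# Hypothetical sketch of placeholder expansion for agent commands.
# The real logic in harness/runner.py may differ.

def expand_placeholders(args: list[str], values: dict[str, str]) -> list[str]:
    """Replace {prompt}, {model}, {workspace}, {root} in each argument."""
    expanded = []
    for arg in args:
        for key, value in values.items():
            arg = arg.replace("{" + key + "}", value)
        expanded.append(arg)
    return expanded

args = ["--prompt", "{prompt}", "--model", "{model}"]
print(expand_placeholders(args, {
    "prompt": "Fix the failing CSV parser",
    "model": "my-model-name",
    "workspace": "/tmp/ws",
    "root": "/repo",
}))
```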

Codex benchmark home

The default Codex benchmark agent uses a repo-local CODEX_HOME at harness/codex-home/. This keeps benchmark behavior isolated from your personal ~/.codex/config.toml, so enabled MCP servers and Apps in your normal Codex setup do not affect the benchmark.

The tracked benchmark config lives at harness/codex-home/config.toml; runtime files in that directory are gitignored. This benchmark-local config also controls Codex's benchmark behavior, including its default reasoning effort and extra steering instructions to keep it fast and direct. The harness still passes service_tier="fast" on the Codex command line.

Claude is launched in non-interactive print mode with --permission-mode bypassPermissions, so it can use its built-in tools without per-action approval prompts during benchmark runs. It is pinned to --effort high and receives a benchmark-specific appended system prompt matching the Codex benchmark goal: be fast and direct, and avoid extra work unless needed. The harness also restricts Claude to --setting-sources local, --strict-mcp-config, --no-chrome, and --disable-slash-commands so personal MCP/config integrations are less likely to leak into benchmark runs.

To reuse your existing file-backed login without reusing your normal Codex config:

cp ~/.codex/auth.json harness/codex-home/auth.json

If you prefer, you can instead log in once directly against the benchmark home:

CODEX_HOME="$PWD/harness/codex-home" codex login

After either setup, benchmark runs will continue using only the benchmark-local CODEX_HOME.

Test Runners

Each language needs a test runner entry:

test_runners:
  python:
    command: "python3 -m pytest {test_dir} -v --tb=short"
    pattern: "tests/"
  typescript:
    command: "cd {workspace} && npm install --quiet && npx jest {test_dir} --verbose"
    pattern: "tests/"
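Scoring depends on parsing the runner's output into pass/fail counts. As an illustrative sketch only (the harness's actual parser in harness/test_executor.py may differ), a pytest terminal summary line can be matched with a regex:

```python
import re

# Hypothetical sketch: extract pass/fail counts from a pytest summary
# line such as "==== 7 passed, 1 failed in 2.31s ====".
# The real parser in harness/test_executor.py may differ.

def parse_pytest_summary(output: str) -> tuple[int, int]:
    passed = failed = 0
    m = re.search(r"(\d+) passed", output)
    if m:
        passed = int(m.group(1))
    m = re.search(r"(\d+) failed", output)
    if m:
        failed = int(m.group(1))
    return passed, failed

print(parse_pytest_summary("==== 7 passed, 1 failed in 2.31s ===="))  # (7, 1)
```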

Output

Results are saved to results/<run-id>/:

results/2026-03-07-152520/
  summary.json            # per-agent aggregate metrics
  claude-code/
    00-smoke-test.json    # per-task result
    ...
  codex/
    00-smoke-test.json
    ...

The terminal report shows a comparison table:

==============================================================
  BENCHMARK RESULTS
==============================================================

TASK RESULTS
--------------------------------------------------------------
Task                           claude-code     codex
--------------------------------------------------------------
00-smoke-test                  PASS 2/2 15s    PASS 2/2 12s
01-python-bugfix-csv           PASS 8/8 24s    PASS 8/8 37s
02-c-bugfix-linkedlist         PASS 8/8 18s    PASS 8/8 15s
03-typescript-feature-table-filter PASS 13/13 18s  PASS 13/13 17s
04-python-feature-pagination   FAIL 9/10 20s   PASS 10/10 21s
05-typescript-scratch-task-queue PASS 8/8 37s    PASS 8/8 49s
06-cpp-scratch-lru-cache       PASS 9/9 58s    PASS 9/9 35s
07-python-refactor-monolith    PASS 12/12 22s  PASS 12/12 22s
08-typescript-refactor-callbacks PASS 10/10 25s  PASS 10/10 26s
09-c-multifile-segfault        PASS 8/8 50s    PASS 8/8 46s
10-fullstack-angular-python    ERROR 10/10 20s ERROR 10/10 21s

FAILURE DETAILS
--------------------------------------------------------------
04-python-feature-pagination [claude-code] FAIL: 1/10 tests failed
10-fullstack-angular-python [claude-code] ERROR: typescript: runner exited non-zero without a parsed test summary
10-fullstack-angular-python [codex] ERROR: typescript: runner exited non-zero without a parsed test summary

SUMMARY
---------------------------------------------------------
Metric                    claude-code     codex
---------------------------------------------------------
Tasks fully passed        10/11           11/11
Avg correctness           0.99            1.0
Avg speed (s)             27.84           27.46
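The SUMMARY metrics above follow directly from the per-task results: tasks fully passed, mean pass fraction, and mean wall-clock time. A hedged sketch of that aggregation (the real computation lives in harness/scoring.py and may differ):

```python
# Sketch of how the SUMMARY metrics could be derived from per-task
# results; harness/scoring.py may compute them differently.

def summarize(results: list[dict]) -> dict:
    """Each result: {"passed": int, "total": int, "seconds": float}."""
    fully_passed = sum(1 for r in results if r["passed"] == r["total"])
    avg_correctness = sum(r["passed"] / r["total"] for r in results) / len(results)
    avg_speed = sum(r["seconds"] for r in results) / len(results)
    return {
        "tasks_fully_passed": f"{fully_passed}/{len(results)}",
        "avg_correctness": round(avg_correctness, 2),
        "avg_speed_s": round(avg_speed, 2),
    }

print(summarize([
    {"passed": 2, "total": 2, "seconds": 15},
    {"passed": 9, "total": 10, "seconds": 20},
]))
```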

Project Structure

harness/
  config.py          # Configuration loading (AgentConfig, TestRunnerConfig)
  config.yaml        # Default agent and test runner definitions
  run.py             # Main benchmark orchestrator and CLI entry point
  runner.py          # Agent execution and workspace isolation
  task_loader.py     # Task discovery and metadata parsing
  test_executor.py   # Test execution and output parsing (pytest, jest, make)
  scoring.py         # Score aggregation and summary computation
  report.py          # Terminal report formatting
tasks/               # Benchmark task definitions
tests/               # Unit tests for the harness
docs/plans/          # Design and implementation documentation
results/             # Benchmark run output (gitignored)

Running Tests

python3 -m pytest tests/ -v

Adding a Task

  1. Create a directory under tasks/ (prefix with a number for ordering)
  2. Add metadata.json:
    {
      "language": "python",
      "category": "bugfix",
      "timeout_seconds": 120
    }
    For multi-language tasks, add "test_languages": ["python", "typescript"].
  3. Add prompt.md with the task description
  4. Add repo/ with the source code
  5. Add tests/ with the test suite
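A new task directory is picked up automatically by task discovery. As a minimal sketch of what that discovery might look like (the actual implementation in harness/task_loader.py may differ), scanning tasks/ for metadata.json files is enough:

```python
import json
from pathlib import Path

# Hypothetical sketch of task discovery: scan tasks/ for directories
# containing metadata.json. harness/task_loader.py may differ.

def discover_tasks(tasks_dir: str) -> list[dict]:
    tasks = []
    for meta_path in sorted(Path(tasks_dir).glob("*/metadata.json")):
        meta = json.loads(meta_path.read_text())
        meta["name"] = meta_path.parent.name  # directory name, e.g. 00-smoke-test
        tasks.append(meta)
    return tasks
```

Sorting by path is why the numeric prefix in step 1 controls task ordering.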

License

MIT
