docs: rename MCP-Unique → CodeScaleBench-Org across all documentation
Replace "MCP-Unique" and "MCP-unique" with "CodeScaleBench-Org" or "Org"
in prose across 22 markdown files. Filenames and code identifiers
(selected_mcp_unique_tasks.json, mcp_unique variable names) are unchanged.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
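The mechanical part of this rename can be sketched in a few lines of Python. This is a hypothetical illustration, not the tooling actually used for the commit: it relies on the fact (stated in the commit message) that prose uses the hyphenated forms (`MCP-Unique`/`MCP-unique`) while filenames and code identifiers use underscores (e.g. `selected_mcp_unique_tasks.json`), so a plain replace on the hyphenated forms leaves identifiers untouched.

```python
from pathlib import Path

def rename_prose(text: str) -> str:
    # Prose uses hyphenated forms; identifiers and filenames use
    # underscores (mcp_unique, selected_mcp_unique_tasks.json), so
    # these replacements can never match an identifier.
    text = text.replace("MCP-Unique", "CodeScaleBench-Org")
    text = text.replace("MCP-unique", "Org")
    return text

def rename_docs(root: str) -> int:
    """Apply the rename to every markdown file under root; return the
    number of files changed."""
    changed = 0
    for path in Path(root).rglob("*.md"):
        old = path.read_text(encoding="utf-8")
        new = rename_prose(old)
        if new != old:
            path.write_text(new, encoding="utf-8")
            changed += 1
    return changed
```

Running `rename_docs(".")` from the repo root would touch only prose occurrences, matching the commit's stated scope (22 markdown files, identifiers unchanged).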
-**Combined catalog total: 400 tasks** (180 SDLC across 9 suites + 220 MCP-unique across 11 suites). An additional 28 backup tasks are archived in `benchmarks/backups/`.
+**Combined catalog total: 400 tasks** (180 SDLC across 9 suites + 220 Org across 11 suites). An additional 28 backup tasks are archived in `benchmarks/backups/`.
 
 Both baseline and MCP-Full agents have access to **all repos** in each task's fixture. The only difference is the method: baseline reads code locally, MCP-Full uses Sourcegraph MCP tools (local code is truncated). This ensures we measure whether MCP tools help agents work better — not whether MCP can access repos the baseline can't.
 
@@ -113,7 +113,7 @@ See [docs/MCP_UNIQUE_TASKS.md](docs/MCP_UNIQUE_TASKS.md) for the full task syste
 All benchmarks are evaluated across two paper-level configurations (Baseline vs MCP-Full). The concrete run config names differ by task type:
 Legacy run directory names (`baseline`, `sourcegraph_full`, `artifact_full`) may still appear in historical outputs and are handled by analysis scripts.
 
@@ -131,7 +131,7 @@ See [docs/reference/CONFIGS.md](docs/reference/CONFIGS.md) for the canonical con
 ## Repository Structure
 
 ```
-benchmarks/   # Task definitions organized by SDLC phase + MCP-unique
+benchmarks/   # Task definitions organized by SDLC phase + Org
 schemas/      # JSON schemas for MANIFEST.json, task.toml, etc.
 …
 ```
 
-Each suite directory contains per-task subdirectories with `instruction.md`, `task.toml`, `tests/`, and ground truth (or `solution/`). MCP-unique tasks additionally include `task_spec.json`, `oracle_answer.json`, and Dockerfile variants for baseline/MCP-only execution.
+Each suite directory contains per-task subdirectories with `instruction.md`, `task.toml`, `tests/`, and ground truth (or `solution/`). Org tasks additionally include `task_spec.json`, `oracle_answer.json`, and Dockerfile variants for baseline/MCP-only execution.
benchmarks/README.md: 3 additions & 3 deletions
@@ -1,6 +1,6 @@
 # CodeScaleBench Benchmarks
 
-This directory contains SDLC-aligned suites plus MCP-unique org-scale retrieval suites. The canonical selected task catalog is in [`selected_benchmark_tasks.json`](../configs/selected_benchmark_tasks.json) (currently 294 selected tasks across 20 suites).
+This directory contains SDLC-aligned suites plus Org org-scale retrieval suites. The canonical selected task catalog is in [`selected_benchmark_tasks.json`](../configs/selected_benchmark_tasks.json) (currently 294 selected tasks across 20 suites).
 
 See [`docs/TASK_SELECTION.md`](../docs/TASK_SELECTION.md) for selection methodology.
 
@@ -23,7 +23,7 @@ See [`docs/TASK_SELECTION.md`](../docs/TASK_SELECTION.md) for selection methodol
 ---
 
-## MCP-Unique Suite Overview (Selected Catalog)
+## CodeScaleBench-Org Suite Overview (Selected Catalog)
 
 These suites measure cross-repo discovery, tracing, and org-scale code intelligence use cases. Counts below reflect the current selected catalog in [`selected_benchmark_tasks.json`](../configs/selected_benchmark_tasks.json) (some suite directories may contain additional draft/deferred tasks that are not selected).
 
@@ -40,7 +40,7 @@ These suites measure cross-repo discovery, tracing, and org-scale code intellige
docs/BLOG_POST.md: 8 additions & 8 deletions
@@ -18,25 +18,25 @@ Here's the core experimental design. The same agent (Claude Code with Haiku 4.5)
 This is the part I think matters most: both configurations have access to the same information. The only difference is the access method. We're not giving the MCP agent extra information — we're testing whether a different pipe to the same information changes outcomes. (If anything, it's a conservative test: in real enterprise settings the agent typically wouldn't have full local access to every relevant repo, so the baseline is actually more favorable than reality.)
 
-Tasks are organized by SDLC phase — Understand, Design, Build, Fix, Test, Document, Secure, Debug — plus a set of MCP-unique tasks that specifically require cross-repository discovery across 3-20 repos. The tasks span 40+ open-source repositories and 10 programming languages, from Kubernetes to Django to the Linux kernel. I wrote a [white paper](WHITE_PAPER_REPORT_V2.md) with the full methodology and an explanation of all the evaluation layers, including the information retrieval analysis pipeline.
+Tasks are organized by SDLC phase — Understand, Design, Build, Fix, Test, Document, Secure, Debug — plus a set of Org tasks that specifically require cross-repository discovery across 3-20 repos. The tasks span 40+ open-source repositories and 10 programming languages, from Kubernetes to Django to the Linux kernel. I wrote a [white paper](WHITE_PAPER_REPORT_V2.md) with the full methodology and an explanation of all the evaluation layers, including the information retrieval analysis pipeline.
 
 ## The Headline: Near-Zero Overall, But the Spread Is the Story
 
-After running 250 valid task pairs across all SDLC suites plus 11 MCP-unique suites (169 SDLC + 81 MCP-unique, with 1 baseline infrastructure error excluded from 251 registered tasks), MCP shows a small but statistically significant positive effect: baseline mean reward 0.594, MCP mean reward 0.640, delta **+0.047** (95% bootstrap CI: [+0.007, +0.085]).
+After running 250 valid task pairs across all SDLC suites plus 11 Org suites (169 SDLC + 81 Org, with 1 baseline infrastructure error excluded from 251 registered tasks), MCP shows a small but statistically significant positive effect: baseline mean reward 0.594, MCP mean reward 0.640, delta **+0.047** (95% bootstrap CI: [+0.007, +0.085]).
 
 But that modest average obscures the real story, because the delta swings from **-0.183** to **+0.440** depending on the task type. That spread — from MCP hurting to MCP helping dramatically — is a much more interesting finding than any single number, because it tells you when code intelligence tools matter and when they don't.
 
 ## Where MCP Wins
 
-The strongest SDLC gain is the Understand suite. MCP-unique tasks show a substantial positive delta, with specific sub-suites showing very large gains.
+The strongest SDLC gain is the Understand suite. Org tasks show a substantial positive delta, with specific sub-suites showing very large gains.
 
 | Suite | Tasks | Baseline Mean | MCP Mean | Delta |
 …
-**MCP-unique tasks** require cross-repository discovery — tracing a vulnerability across an ecosystem of repos, or mapping how a config value propagates through 5 different services. These tasks span 3-20 repositories and specifically measure org-scale information retrieval. The +0.183 delta (95% bootstrap CI: [+0.116, +0.255]) is the strongest effect in the benchmark. The sub-suite variation is striking: security tasks show **+0.440**, onboarding **+0.337**, org-scale **+0.197**, incident response **+0.177**, while migration (+0.051) and platform (-0.049) are near-flat.
+**Org tasks** require cross-repository discovery — tracing a vulnerability across an ecosystem of repos, or mapping how a config value propagates through 5 different services. These tasks span 3-20 repositories and specifically measure org-scale information retrieval. The +0.183 delta (95% bootstrap CI: [+0.116, +0.255]) is the strongest effect in the benchmark. The sub-suite variation is striking: security tasks show **+0.440**, onboarding **+0.337**, org-scale **+0.197**, incident response **+0.177**, while migration (+0.051) and platform (-0.049) are near-flat.
 
 **Understand tasks** show the strongest SDLC gain at +0.190 (0.660 to 0.851, 95% CI: [+0.043, +0.361]). This was the biggest reversal in the dataset — earlier drafts showed Understand as strongly negative, but that signal was coming from invalid/contaminated runs that were removed and rerun.
 
@@ -95,7 +95,7 @@ One finding I didn't expect after recomputing the cost section on a strict paire
 Using one consistent method (`task_metrics.cost_usd`, cache-inclusive, same n=251 pairs), MCP is about 3.8% more expensive on average (+$0.013/task). The cost story is suite-dependent: MCP is cheaper in design/document/understand/mcp_unique, and more expensive in build/debug/fix/secure/test. MCP is still much faster overall: wall-clock drops from 1401.9s to 653.0s on average (-53.4%), and agent execution time drops from 1058.3s to 299.3s (-71.7%).
 
-This reframes the value question a bit. On the suites where MCP improves reward (especially MCP-unique and, in the cleaned paired set, Understand), you're getting better results at lower cost and lower latency. On the suites where reward is flat (fix, test), you're getting similar results faster. The clearly bad trade-offs remain debug and build, where the agent is faster but less effective.
+This reframes the value question a bit. On the suites where MCP improves reward (especially Org and, in the cleaned paired set, Understand), you're getting better results at lower cost and lower latency. On the suites where reward is flat (fix, test), you're getting similar results faster. The clearly bad trade-offs remain debug and build, where the agent is faster but less effective.
 
 ## How I Built This (And What Broke)
 
@@ -131,9 +131,9 @@ If you're building or evaluating code intelligence tools and want to run your st
 I started this project because I was drowning in noise. Every tool claims to "supercharge" agent performance. Every benchmark result is a press release. I wanted to know what was actually true, with data granular enough to understand why.
 
-Here's what the data says so far: code intelligence tools provide measurable value on tasks that require **comprehension across large codebases** and **cross-repository discovery**. MCP-unique security tasks show +0.440, onboarding +0.337, understand +0.190. When the bottleneck is finding and understanding scattered context, MCP tools help — and the 95% confidence intervals on these effects exclude zero.
+Here's what the data says so far: code intelligence tools provide measurable value on tasks that require **comprehension across large codebases** and **cross-repository discovery**. Org security tasks show +0.440, onboarding +0.337, understand +0.190. When the bottleneck is finding and understanding scattered context, MCP tools help — and the 95% confidence intervals on these effects exclude zero.
 
-They provide mixed value on other SDLC tasks. MCP helps on understand (+0.190) and document (+0.048), is effectively flat on fix (-0.015) and test (+0.000), and hurts on debugging (-0.183) and build (-0.121). When the agent already has full source code and the bottleneck is local execution/editing rather than retrieval, adding a remote search layer can still be overhead. The overall delta across all 250 valid pairs is +0.047 (95% bootstrap CI: [+0.007, +0.085]) — a small but statistically significant positive, driven primarily by the MCP-unique tasks where cross-repo discovery is the core challenge.
+They provide mixed value on other SDLC tasks. MCP helps on understand (+0.190) and document (+0.048), is effectively flat on fix (-0.015) and test (+0.000), and hurts on debugging (-0.183) and build (-0.121). When the agent already has full source code and the bottleneck is local execution/editing rather than retrieval, adding a remote search layer can still be overhead. The overall delta across all 250 valid pairs is +0.047 (95% bootstrap CI: [+0.007, +0.085]) — a small but statistically significant positive, driven primarily by the Org tasks where cross-repo discovery is the core challenge.
 
 And there's a third category — tasks where the retrieval metrics are basically the same but outcomes still differ — that I can't fully explain yet and might be the most important one to understand.