docs: update white paper and blog post to 251-pair results
Recomputed all headline numbers against the full 251-pair corpus
(170 SDLC + 81 MCP-unique, all haiku, all official). Key changes:
- Overall delta: +0.049 (95% CI [+0.010, +0.088]), statistically significant
- SDLC delta: -0.015 (CI spans zero, not significant)
- MCP-unique delta: +0.183 (CI [+0.115, +0.252]), strongly significant
- Added 95% confidence intervals to all suite-level results
- Added per-MCP-unique-sub-suite breakdown (security +0.440, onboarding +0.337)
- Removed stale 212-pair / +0.053 / +0.331 numbers
- Updated abstract with quantitative findings summary
- Added placeholder docs/official_results/README.md to fix broken ref
Previous versions understated MCP-unique baseline coverage (42/81 then 60/81).
All 81 MCP-unique baselines are now complete, bringing the full picture into focus:
MCP provides substantial value for cross-repo discovery but marginal/negative
value when the agent already has full local source code.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
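For context on the headline intervals: a 95% CI on a mean of per-pair deltas (MCP reward minus baseline reward) can be obtained with a percentile bootstrap. A minimal sketch — `bootstrap_ci` is an illustrative helper, not the project's actual analysis code, and the deltas passed in are synthetic:

```python
import random

def bootstrap_ci(deltas, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-pair deltas.

    deltas: list of per-pair (MCP - baseline) reward differences.
    Returns (observed_mean, (ci_low, ci_high)).
    """
    rng = random.Random(seed)
    n = len(deltas)
    # Resample the pairs with replacement and record each resample's mean.
    means = sorted(
        sum(rng.choice(deltas) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot)]
    return sum(deltas) / n, (lo, hi)
```

An effect is then reported as significant when the interval excludes zero, as with the overall +0.049 delta and its CI of [+0.010, +0.088].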
docs/BLOG_POST.md (15 additions, 15 deletions)
@@ -1,10 +1,10 @@
-# Does Better Code Context Actually Help Coding Agents? I Built 248 Benchmarks to Find Out.
+# Does Better Code Context Actually Help Coding Agents? I Built 251 Benchmarks to Find Out.
In January, I wrote about rethinking coding agent benchmarks — the evaluation gaps I saw, the enterprise-vs-open-source disconnect, and this question I couldn't stop thinking about: does giving agents better code context actually make them better at their jobs? I said I was going to go find out.
I went and found out. Kind of. The answer, like most honest answers in this space, is "it depends — but here's exactly what it depends on, and I have data."
-Since that post, I built CodeContextBench (CCB): 248 software engineering tasks spanning the full software development lifecycle, designed to measure whether external code intelligence tools — specifically Sourcegraph's MCP tools — improve AI coding agent performance. I built the benchmark framework, the evaluation pipeline, the ground truth system, and the statistical analysis layer primarily using Claude Code, across 580+ conversation sessions over about 26 days. An AI coding agent building a benchmark to evaluate AI coding agents' use of code intelligence tools. It's as meta-recursive as it sounds, and I'll come back to that.
+Since that post, I built CodeContextBench (CCB): 251 software engineering tasks spanning the full software development lifecycle, designed to measure whether external code intelligence tools — specifically Sourcegraph's MCP tools — improve AI coding agent performance. I built the benchmark framework, the evaluation pipeline, the ground truth system, and the statistical analysis layer primarily using Claude Code, across 580+ conversation sessions over about 26 days. An AI coding agent building a benchmark to evaluate AI coding agents' use of code intelligence tools. It's as meta-recursive as it sounds, and I'll come back to that.
But first: the results.
@@ -22,25 +22,25 @@ Tasks are organized by SDLC phase — Understand, Design, Build, Fix, Test, Docu
## The Headline: Near-Zero Overall, But the Spread Is the Story
-After running 230 valid task pairs across all SDLC suites plus 10 MCP-unique suites (170 SDLC + 60 MCP-unique), the overall numbers are essentially flat: baseline mean reward 0.631, MCP mean reward 0.633, delta **+0.002**. On average, MCP-augmented agents score about the same as baseline.
+After running 251 valid task pairs across all SDLC suites plus 10 MCP-unique suites (170 SDLC + 81 MCP-unique), MCP shows a small but statistically significant positive effect: baseline mean reward 0.591, MCP mean reward 0.640, delta **+0.049** (95% CI: [+0.010, +0.088]).
-But that near-zero average obscures the real story, because the delta swings from **-0.183** to **+0.216** depending on the task type. That spread — from MCP hurting to MCP helping materially — is a much more interesting finding than any single number, because it tells you when code intelligence tools matter and when they don't.
+But that modest average obscures the real story, because the delta swings from **-0.183** to **+0.440** depending on the task type. That spread — from MCP hurting to MCP helping dramatically — is a much more interesting finding than any single number, because it tells you when code intelligence tools matter and when they don't.
## Where MCP Wins
-The strongest SDLC gain is the Understand suite. MCP-unique tasks show a modest positive delta overall, with specific sub-suites showing larger gains.
+The strongest SDLC gain is the Understand suite. MCP-unique tasks show a substantial positive delta, with specific sub-suites showing very large gains.
| Suite | Tasks | Baseline Mean | MCP Mean | Delta |
-**Understand tasks** show the strongest gain at +0.191 (0.660 to 0.851). This was the biggest reversal in the dataset — earlier drafts showed Understand as strongly negative, but that signal was coming from invalid/contaminated runs that were removed and rerun.
+**MCP-unique tasks** require cross-repository discovery — tracing a vulnerability across an ecosystem of repos, or mapping how a config value propagates through 5 different services. These tasks span 3-20 repositories and specifically measure org-scale information retrieval. The +0.183 delta (95% CI: [+0.115, +0.252]) is the strongest effect in the benchmark. The sub-suite variation is striking: security tasks show **+0.440**, onboarding **+0.337**, org-scale **+0.197**, incident response **+0.177**, while migration (+0.051) and platform (-0.049) are near-flat.
-**MCP-unique tasks** require cross-repository discovery — tracing a vulnerability across an ecosystem of repos, or mapping how a config value propagates through 5 different services. These tasks span 3-20 repositories and specifically measure org-scale information retrieval. The overall +0.050 delta masks significant variation: security tasks show +0.216, cross-org tasks +0.184, org-scale tasks +0.192, while migration tasks are -0.065 and platform tasks -0.049. The sub-suite variation tells a richer story than the aggregate.
+**Understand tasks** show the strongest SDLC gain at +0.190 (0.661 to 0.851, 95% CI: [+0.024, +0.357]). This was the biggest reversal in the dataset — earlier drafts showed Understand as strongly negative, but that signal was coming from invalid/contaminated runs that were removed and rerun.
-**Documentation tasks** show a modest positive at +0.048. The agent already does well on documentation with full local code (0.847), and MCP nudges it slightly higher.
+**Documentation tasks** show a modest positive at +0.048 (95% CI: [+0.011, +0.085]). The agent already does well on documentation with full local code (0.847), and MCP nudges it slightly higher.
## Where MCP Doesn't Help (or Hurts)
@@ -51,11 +51,11 @@ MCP hurts on **Debug** (-0.183) and **Build** (-0.121). **Design** (-0.036) and
| Debug | 20 | 0.670 | 0.487 | **-0.183** |
| Build | 25 | 0.494 | 0.372 | -0.121 |
| Design | 20 | 0.753 | 0.718 | -0.036 |
-| Secure | 20 | 0.669 | 0.659 | -0.010 |
+| Secure | 20 | 0.670 | 0.659 | -0.010 |
| Test | 20 | 0.480 | 0.480 | +0.000 |
| Fix | 25 | 0.479 | 0.491 | +0.012 |
-The **Debug** result remains the clearest negative signal: MCP underperforms baseline by -0.183. Build is also materially negative (-0.121). These are local execution-and-modification workflows, and adding a remote retrieval layer does not reliably help the agent get to the actual code change faster or better.
+The **Debug** result is the clearest negative signal: MCP underperforms baseline by -0.183 (95% CI: [-0.304, -0.062], excludes zero). Build is also materially negative (-0.121). These are local execution-and-modification workflows, and adding a remote retrieval layer does not reliably help the agent get to the actual code change faster or better.
Fix tasks have the lowest MCP tool ratio of any suite (35% of tool calls use MCP tools) and the highest local tool call count (39.8 per task). Bug-fixing is editing work. The agent needs to read a stack trace, find the offending code, change it, and run the tests. The relevant context is usually local. Adding a remote search layer to that workflow doesn't help — it just adds latency and another thing to do before getting to the actual fix.
@@ -115,7 +115,7 @@ The February 6th QA audit found 28 issues (9 critical) in the benchmark infrastr
## What I Don't Know Yet
-I want to be clear about the limitations. These are results from a single agent (Claude Code), a single MCP provider (Sourcegraph), running Claude Haiku 4.5. The sample sizes are meaningful (230 valid pairs) but not large enough for high-confidence conclusions on every individual suite. Multi-trial evaluation with bootstrap confidence intervals is planned but not yet complete.
+I want to be clear about the limitations. These are results from a single agent (Claude Code), a single MCP provider (Sourcegraph), running Claude Haiku 4.5. The sample sizes are meaningful (251 valid pairs) and the overall effect is statistically significant (95% CI excludes zero), but individual sub-suite confidence intervals are wide enough that some suite-level conclusions could shift with more data. Multi-trial evaluation with bootstrap confidence intervals is planned but not yet complete.
The moderate correlation between retrieval quality and task outcomes (Spearman r=0.395, p=0.041) confirms that finding the right files helps — but it's not the whole story. What else matters? Is it the structure of the tool output? The way search-first workflows shape the agent's reasoning? Some interaction between retrieval strategy and the agent's existing capabilities? I don't know, and I think the answer matters a lot — both for how we build code intelligence tools and for how we design agent workflows.
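The Spearman r=0.395 quoted there is just the Pearson correlation computed on ranks. A minimal pure-Python sketch, assuming mid-ranks for ties — `spearman_r` is an illustrative helper, not the benchmark's analysis code:

```python
def spearman_r(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks.

    Tied values receive the average (mid) rank of their tie group.
    """
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        i = 0
        while i < len(v):
            # Extend j to cover the whole run of equal values.
            j = i
            while j + 1 < len(v) and v[order[j + 1]] == v[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1  # average 1-based rank of the tie group
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

In practice one would pass retrieval-quality scores as `x` and task rewards as `y`; a rank-based statistic is a reasonable choice here because reward scales differ across suites.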
@@ -131,9 +131,9 @@ If you're building or evaluating code intelligence tools and want to run your st
I started this project because I was drowning in noise. Every tool claims to "supercharge" agent performance. Every benchmark result is a press release. I wanted to know what was actually true, with data granular enough to understand why.
-Here's what the data says so far: code intelligence tools provide measurable value on tasks that require **comprehension across large codebases** and **cross-repository discovery**. Understand tasks show +0.191, MCP-unique security and cross-org tasks show +0.216 and +0.184 respectively. When the bottleneck is finding and understanding scattered context, MCP tools help.
+Here's what the data says so far: code intelligence tools provide measurable value on tasks that require **comprehension across large codebases** and **cross-repository discovery**. MCP-unique security tasks show +0.440, onboarding +0.337, understand +0.190. When the bottleneck is finding and understanding scattered context, MCP tools help — and the 95% confidence intervals on these effects exclude zero.
-They provide mixed value on other SDLC tasks. MCP helps on understand (+0.191) and document (+0.048), is effectively flat on fix (+0.012) and test (+0.000), and hurts on debugging (-0.183) and build (-0.121). When the agent already has full source code and the bottleneck is local execution/editing rather than retrieval, adding a remote search layer can still be overhead. The overall delta across all 230 pairs is +0.002 — essentially zero — because the gains and losses roughly cancel out.
+They provide mixed value on other SDLC tasks. MCP helps on understand (+0.190) and document (+0.048), is effectively flat on fix (+0.012) and test (+0.000), and hurts on debugging (-0.183) and build (-0.121). When the agent already has full source code and the bottleneck is local execution/editing rather than retrieval, adding a remote search layer can still be overhead. The overall delta across all 251 pairs is +0.049 (95% CI: [+0.010, +0.088]) — a small but statistically significant positive, driven primarily by the MCP-unique tasks where cross-repo discovery is the core challenge.
And there's a third category — tasks where the retrieval metrics are basically the same but outcomes still differ — that I can't fully explain yet and might be the most important one to understand.