# CodeScaleBench: A Systematic Evaluation Framework for Assessing the Impact of Enhanced Code Intelligence on AI Coding Agent Performance
**White Paper Technical Report**
**Last Modified:** March 5, 2026
---
## Abstract
CodeScaleBench (CSB) is a benchmark suite of **370 software engineering tasks** spanning the full Software Development Lifecycle (SDLC), designed to evaluate AI coding agents on large-codebase and cross-repository software engineering work. This report evaluates one context-retrieval condition using Model Context Protocol (MCP) tools against a baseline condition with local source access and no external retrieval tools. In the analysis set used here (generated March 3, 2026 from `runs/analysis`), there are **1,281 valid scored rows**, **1,822 total rows**, and **370 paired baseline/MCP tasks** after averaging multiple runs per task/config. The overall paired reward delta is **+0.0349** (MCP minus baseline), with **+0.0363** on SDLC and **+0.0339** on Org, and the report provides suite-level reward breakdowns across all 20 suites. Retrieval evaluation on the same analysis set yields **799** event files, **311** computable tasks, and aggregate file-level metrics of **0.4598 file recall** and **0.3644 MRR**. The report also includes pair-normalized cost and time calculations from matched baseline/MCP tasks, including average cost-per-task and wall-clock deltas. This report documents the benchmark design, construction, retrieval evaluation pipeline, verifier architecture, and current findings.
---
11. [Results](#11-results)
12. [Threats to Validity](#12-threats-to-validity)
13. [Future Work](#13-future-work)
14. [Appendices](#14-appendices)
---
|**Baseline**|`baseline-local-direct`| Full local code | None |`Dockerfile`| Evaluation context only |
Both SDLC and Org tasks use the same config pair (`baseline-local-direct` + `mcp-remote-direct`).
---
The preamble also performs **repository scoping** — a `{repo_scope}` placeholder is substituted at launch time with the specific `repo:^github.com/ORG/REPO$` filter patterns for the task's target repositories, so the agent searches the correct repos from its first query.
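As an illustration, the launch-time substitution can be sketched in a few lines of Python; `PREAMBLE_TEMPLATE` and `render_preamble` are hypothetical names for this sketch, not the framework's actual code.

```python
# Hypothetical sketch of launch-time repository scoping: the {repo_scope}
# placeholder is replaced with anchored repo: filter patterns.
PREAMBLE_TEMPLATE = (
    "Source files are not present locally. Use the MCP search tools, "
    "scoping every query with: {repo_scope}"
)

def render_preamble(template: str, target_repos: list[str]) -> str:
    """Substitute {repo_scope} with one repo: pattern per target repository."""
    patterns = " ".join(
        f"repo:^github.com/{repo}$" for repo in target_repos
    )
    return template.format(repo_scope=patterns)
```

For a task targeting `acme/widgets`, this would yield a preamble containing `repo:^github.com/acme/widgets$`, so the agent's first query is already scoped to the correct repository.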
The preamble communicates MCP environment constraints and tool usage guidance while preserving agent flexibility.
- `baseline-local-direct`: full local source access, no external retrieval tools
- `mcp-remote-direct`: truncated local source, MCP-based remote retrieval
This isolation ensures that measured deltas reflect the access method rather than differences in task content or verifier logic.
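As a hedged sketch, the pairing could be represented like this; the field names and the `launch_pair` helper are illustrative, not CodeScaleBench's actual schema:

```python
# Illustrative representation of the paired launch configurations.
# Only the access method differs; task prompt and verifier are shared.
CONFIGS = {
    "baseline-local-direct": {"local_source": "full", "mcp_tools": False},
    "mcp-remote-direct": {"local_source": "truncated", "mcp_tools": True},
}

def launch_pair(task_id: str) -> list[dict]:
    """Build both launch specs for one task; everything else is identical."""
    return [
        {"task": task_id, "config": name, **settings}
        for name, settings in CONFIGS.items()
    ]
```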
### 10.2 Verifier Determinism
Task scoring is deterministic and runs entirely inside the task container:
- SDLC tasks use executable verifier scripts (`test.sh`)
- Org tasks use oracle-driven scoring (`eval.sh` + oracle checks)
- All tasks write a scalar reward to `/logs/verifier/reward.txt`
Optional analysis layers (LLM judge, IR metrics, bootstrap CIs) are additive and do not override the primary reward.
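The scalar-reward contract can be sketched from the harness side; `read_reward` is a hypothetical helper written for this sketch, and the real framework may validate the file differently:

```python
from pathlib import Path

def read_reward(log_dir: str = "/logs/verifier") -> float:
    """Read and validate the scalar reward a verifier wrote to reward.txt.

    Hypothetical harness-side helper, assuming rewards are floats in [0, 1].
    """
    raw = Path(log_dir, "reward.txt").read_text().strip()
    reward = float(raw)  # raises ValueError on non-numeric output
    if not 0.0 <= reward <= 1.0:
        raise ValueError(f"reward out of range: {reward}")
    return reward
```

Because the verifier's only contract is this single scalar file, any additional analysis can be layered on without touching the scoring path.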
### 10.3 Data Inclusion Rules
Reported metrics in Section 11 are computed from valid scored rows and matched baseline/MCP task pairs. Pair-normalized analysis is performed at the canonical task level by averaging available valid runs per task/config before computing deltas.
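A minimal stdlib-only sketch of this pair normalization, assuming each row carries `task`, `config`, `reward`, and `valid` fields (the actual analysis scripts may differ):

```python
from collections import defaultdict
from statistics import mean

def paired_deltas(rows: list[dict]) -> dict[str, float]:
    """Average valid runs per (task, config), then compute per-task
    MCP-minus-baseline reward deltas for tasks with both sides present."""
    runs = defaultdict(list)
    for row in rows:
        if row["valid"]:
            runs[(row["task"], row["config"])].append(row["reward"])
    deltas = {}
    for task in {task for task, _ in runs}:
        base = runs.get((task, "baseline-local-direct"))
        mcp = runs.get((task, "mcp-remote-direct"))
        if base and mcp:  # skip unpaired tasks
            deltas[task] = mean(mcp) - mean(base)
    return deltas
```

Averaging runs before differencing means a task with three baseline runs and one MCP run still contributes exactly one delta, keeping unevenly repeated tasks from dominating the aggregate.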
---
This section reflects the analysis export generated on **March 3, 2026** from `runs/analysis`:
- Valid scored rows in export: **1,281**
- Total rows in `all_tasks`: **1,822**
- Paired baseline/MCP tasks with both sides present: **370**
- Suites represented: **20** (9 SDLC + 11 Org)
---