
Commit 9058565

Refine technical report to current-state white paper framing
1 parent 9b9ef1c commit 9058565

File tree

1 file changed (+19, -101 lines)


docs/technical_reports/TECHNICAL_REPORT.md

Lines changed: 19 additions & 101 deletions
@@ -1,14 +1,13 @@
 # CodeScaleBench: A Systematic Evaluation Framework for Assessing the Impact of Enhanced Code Intelligence on AI Coding Agent Performance
 
 **White Paper Technical Report**
-**Date:** March 3, 2026
 **Last Modified:** March 5, 2026
 
 ---
 
 ## Abstract
 
-CodeScaleBench (CSB) is a benchmark suite of **370 software engineering tasks** spanning the full Software Development Lifecycle (SDLC), designed to measure whether external code intelligence tools -- specifically Sourcegraph's Model Context Protocol (MCP) tools -- improve AI coding agent performance. The benchmark evaluates agents under two controlled conditions: a baseline with full local source code and no external tools, and an MCP-augmented configuration where source code is unavailable locally and the agent must use remote code intelligence tools (semantic search, symbol resolution, dependency tracing, etc.) to navigate codebases. In the analysis set used in this report update (generated March 3, 2026 from `runs/analysis`), there are **1,281 valid scored rows**, **1,822 total historical rows**, and **370 paired baseline/MCP tasks** after averaging multiple runs per task/config. The overall paired reward delta is **+0.0349** (MCP minus baseline), with **+0.0363** on SDLC and **+0.0339** on Org. Retrieval evaluation on the same analysis set yields **799** event files, **311** computable tasks, and aggregate file-level metrics of **0.4598 file recall** and **0.3644 MRR**. This report documents the benchmark design, construction, retrieval evaluation pipeline, verifier architecture, and current findings.
+CodeScaleBench (CSB) is a benchmark suite of **370 software engineering tasks** spanning the full Software Development Lifecycle (SDLC), designed to evaluate AI coding agents on large-codebase and cross-repository software engineering work. This report evaluates one context-retrieval condition using Model Context Protocol (MCP) tools against a baseline condition with local source access and no external retrieval tools. In the analysis set used here (generated March 3, 2026 from `runs/analysis`), there are **1,281 valid scored rows**, **1,822 total rows**, and **370 paired baseline/MCP tasks** after averaging multiple runs per task/config. The overall paired reward delta is **+0.0349** (MCP minus baseline), with **+0.0363** on SDLC and **+0.0339** on Org, and the report provides suite-level reward breakdowns across all 20 suites. Retrieval evaluation on the same analysis set yields **799** event files, **311** computable tasks, and aggregate file-level metrics of **0.4598 file recall** and **0.3644 MRR**. The report also includes pair-normalized cost and time calculations from matched baseline/MCP tasks, including average cost-per-task and wall-clock deltas. This report documents the benchmark design, construction, retrieval evaluation pipeline, verifier architecture, and current findings.
 
 ---
 
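The file-level retrieval metrics cited in the abstract (file recall and MRR) follow standard IR definitions. As a minimal sketch of how such per-task metrics are typically computed and averaged (function names are illustrative, not taken from the CSB codebase):

```python
def file_recall(retrieved, gold):
    """Fraction of gold files that appear anywhere in the retrieved list."""
    if not gold:
        return 0.0
    return sum(1 for f in gold if f in retrieved) / len(gold)

def mrr(retrieved, gold):
    """Reciprocal rank of the first retrieved file that is a gold file."""
    for rank, f in enumerate(retrieved, start=1):
        if f in gold:
            return 1.0 / rank
    return 0.0

def aggregate(tasks):
    """Average both metrics over computable tasks: (retrieved, gold) pairs."""
    recalls = [file_recall(r, g) for r, g in tasks]
    mrrs = [mrr(r, g) for r, g in tasks]
    return sum(recalls) / len(tasks), sum(mrrs) / len(tasks)
```

Whether CSB weights tasks equally or by suite when aggregating is not stated here; the sketch assumes a simple unweighted mean over computable tasks.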

@@ -27,8 +26,7 @@ CodeScaleBench (CSB) is a benchmark suite of **370 software engineering tasks**
 11. [Results](#11-results)
 12. [Threats to Validity](#12-threats-to-validity)
 13. [Future Work](#13-future-work)
-14. [Development Process: Building a Benchmark with Claude Code](#14-development-process-building-a-benchmark-with-claude-code)
-15. [Appendices](#15-appendices)
+14. [Appendices](#14-appendices)
 
 ---
 
@@ -137,7 +135,7 @@ benchmarks/<suite>/<task-id>/
 | **Baseline** | `baseline-local-direct` | Full local code | None | `Dockerfile` | Evaluation context only |
 | **MCP** | `mcp-remote-direct` | Truncated/empty | 13 Sourcegraph tools | `Dockerfile.sg_only` | V5 MCP-first preamble |
 
-Both SDLC and Org tasks use the same config pair (`baseline-local-direct` + `mcp-remote-direct`). Some legacy Org runs used `baseline-local-artifact` + `mcp-remote-artifact` configs; these are handled by analysis scripts but are no longer the default.
+Both SDLC and Org tasks use the same config pair (`baseline-local-direct` + `mcp-remote-direct`).
 
 ---
 
@@ -691,7 +689,7 @@ The MCP preamble is a block of instructions prepended to the task prompt for MCP
 
 The preamble also performs **repository scoping** — a `{repo_scope}` placeholder is substituted at launch time with the specific `repo:^github.com/ORG/REPO$` filter patterns for the task's target repositories, so the agent searches the correct repos from its first query.
 
-The current preamble (V5) went through 5 design iterations to balance forcing MCP adoption against over-constraining agent behavior (see Section 10.1 for the full iteration history).
+The preamble communicates MCP environment constraints and tool usage guidance while preserving agent flexibility.
 
 ### 9.3 MCP Tool Suite
 
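The repository-scoping substitution described in this hunk can be sketched as follows. The template text and function names here are hypothetical illustrations of the `{repo_scope}` mechanism, not CSB source code:

```python
# Hypothetical preamble fragment; only the {repo_scope} placeholder and the
# repo:^github.com/ORG/REPO$ filter syntax are taken from the report.
PREAMBLE_TEMPLATE = (
    "Source files are not present locally. Use the MCP search tools, "
    "scoping every query with: {repo_scope}"
)

def build_repo_scope(repos):
    """Render one repo:^github.com/ORG/REPO$ filter per target repository."""
    return " ".join(f"repo:^github.com/{r}$" for r in repos)

def render_preamble(repos):
    """Substitute the {repo_scope} placeholder at launch time."""
    return PREAMBLE_TEMPLATE.format(repo_scope=build_repo_scope(repos))
```

Performing the substitution at launch time, rather than baking filters into each task, keeps the preamble a single shared template across all MCP tasks.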
@@ -773,53 +771,28 @@ git push --force origin orphan-main:main
 
 ## 10. Key Decisions
 
-### 10.1 Preamble Design Iterations
+### 10.1 Configuration Control
 
-The agent preamble -- instructions prepended to each task -- underwent 5 major iterations:
+The benchmark uses two controlled configurations for matched comparison:
 
-| Version   | Date      | Strategy                         | Outcome                                                                                   |
-| --------- | --------- | -------------------------------- | ----------------------------------------------------------------------------------------- |
-| **V1/V2** | Early Feb | Minimal MCP mentions             | 0 SG tool calls even with tools available                                                 |
-| **V3**    | Feb 7     | "MANDATORY" triple reinforcement | 90%+ adoption but overly prescriptive; caused "MCP death spiral" on broken mirrors        |
-| **V4**    | Feb 12    | "Soft guidance" header           | 60% zero-MCP adoption; too permissive                                                     |
-| **V5**    | Feb 20    | "Truncation constraint" lead     | Effective: leads with "files not present", forces MCP without mandating specific workflow |
+- `baseline-local-direct`: full local source access, no external retrieval tools
+- `mcp-remote-direct`: truncated local source, MCP-based remote retrieval
 
-**The "MCP Death Spiral" discovery** (V3 era): When the aggressive V3 preamble mandated MCP usage, agents on tasks with broken mirrors or wrong repo names would waste their entire context window on failing SG queries, scoring 0.0 where baseline scored 1.0. This directly motivated the V5 design.
+This isolation ensures that measured deltas reflect the access method rather than differences in task content or verifier logic.
 
-**The git history bypass bug** (V5 motivation): Five of 9 test tasks used `git show HEAD:filename` to recover full source from git history, completely defeating sg_only truncation. V5 fix: recommit truncated state so `git show HEAD:` returns empty files.
+### 10.2 Verifier Determinism
 
-### 10.2 SG_base Dropping Decision (Feb 12)
+Task scoring is deterministic and in-container:
 
-**Data that informed the decision**:
+- SDLC tasks use executable verifier scripts (`test.sh`)
+- Org tasks use oracle-driven scoring (`eval.sh` + oracle checks)
+- All tasks write a scalar reward to `/logs/verifier/reward.txt`
 
-| Config   | n_tasks | Mean Reward | Key Finding                    |
-| -------- | ------- | ----------- | ------------------------------ |
-| Baseline | 161     | 0.521       | Reference performance          |
-| SG_base  | 161     | 0.478       | Slightly _worse_ than baseline |
-| SG_full  | 156     | 0.631       | +0.111 vs baseline             |
+Optional analysis layers (LLM judge, IR metrics, bootstrap CIs) are additive and do not override primary reward.
 
-**Rationale**: SG_base (keyword+NLS search only, no Deep Search) showed no meaningful improvement over baseline. The value came from the comprehensive MCP configuration. Maintaining 3 configs tripled compute cost without providing discriminative data.
+### 10.3 Data Inclusion Rules
 
-### 10.3 DependEval Benchmark Removal
-
-DependEval (9 tasks for dependency resolution) was removed because:
-
-- Missing `code_content.txt` files
-- Empty problem statements
-- Wrong ground_truth format
-- "Code is inline, not in repos" -- fundamentally incompatible with MCP comparison since there was no external repository to index
-
-### 10.4 Verifier Bug Discoveries and Fixes
-
-Major verifier bugs discovered through QA audit (Feb 6):
-
-| Bug                                 | Impact                                                                  | Fix                            |
-| ----------------------------------- | ----------------------------------------------------------------------- | ------------------------------ |
-| TAC score extraction silent failure | 3 tasks reported 0 instead of real scores (1.0, 0.667, 0.2)             | Fixed `\|\| echo "0"` fallback |
-| CrossRepo wrong path                | All 8 runs crashed: `/task/tests/` instead of `/tests/`                 | Updated paths                  |
-| PyTorch `make test` no-op           | 10/12 tasks had broken verifiers: `test/` dir caused GNU make collision | Renamed target                 |
-| CodeReview brittle matching         | `"For is null"` vs `"For == null"` penalized correct code               | Relaxed matching               |
-| Baseline instruction contamination  | 30/156 instructions had SG refs leaking into baseline                   | Cleaned                        |
+Reported metrics in Section 11 are computed from valid scored rows and matched baseline/MCP task pairs. Pair-normalized analysis is performed at the canonical task level by averaging available valid runs per task/config before computing deltas.
 
 ---
 
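The new Section 10.2 states that every task writes a scalar reward to `/logs/verifier/reward.txt`. A minimal sketch of reading that artifact, assuming (this is not stated in the report) that a missing or malformed file should be treated as invalid rather than as a 0.0 score:

```python
from pathlib import Path

def read_reward(logdir):
    """Parse the scalar reward a verifier wrote to <logdir>/verifier/reward.txt.

    Returns None for a missing or unparseable file, distinguishing invalid
    runs from genuine zero-reward runs (an assumption of this sketch; CSB's
    actual invalid-row handling may differ).
    """
    path = Path(logdir) / "verifier" / "reward.txt"
    try:
        return float(path.read_text().strip())
    except (FileNotFoundError, ValueError):
        return None
```

Keeping the reward a single float in a fixed path is what makes the scoring deterministic and harness-independent: any analysis layer can recover it without parsing agent logs.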
@@ -829,7 +802,7 @@ Major verifier bugs discovered through QA audit (Feb 6):
 
 This section reflects the analysis export generated on **March 3, 2026** from `runs/analysis`:
 - Valid scored rows in export: **1,281**
-- Historical rows in `all_tasks`: **1,822**
+- Total rows in `all_tasks`: **1,822**
 - Paired baseline/MCP tasks with both sides present: **370**
 - Suites represented: **20** (9 SDLC + 11 Org)
 
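The pairing rule used throughout the results (average available valid runs per task/config, then compute MCP-minus-baseline deltas only for tasks with both sides present) can be sketched as follows; the row shape and function name are illustrative, not from the CSB analysis scripts:

```python
from collections import defaultdict
from statistics import mean

def paired_deltas(rows):
    """rows: (task_id, config, reward) tuples, config in {"baseline", "mcp"}.

    Averages multiple valid runs per task/config first, then returns
    per-task MCP-minus-baseline deltas for tasks where both configs exist.
    """
    by_task = defaultdict(lambda: defaultdict(list))
    for task_id, config, reward in rows:
        by_task[task_id][config].append(reward)
    deltas = {}
    for task_id, cfgs in by_task.items():
        if "baseline" in cfgs and "mcp" in cfgs:
            deltas[task_id] = mean(cfgs["mcp"]) - mean(cfgs["baseline"])
    return deltas
```

Averaging runs before differencing keeps a task with three MCP runs from counting three times against a task with one, which is what "370 paired baseline/MCP tasks" refers to.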
@@ -1155,62 +1128,7 @@ All current results use Claude Code as the sole agent harness. The framework alr
 
 ---
 
-## 14. Development Process: Building a Benchmark with Claude Code
-
-CodeScaleBench was primarily developed with Claude Code. This section documents the development process for methodological transparency.
-
-### 14.1 Development Timeline
-
-```
-Jan 30 ─── Feb 3 ─── Feb 6 ─── Feb 10 ─── Feb 15 ─── Feb 20 ─── Feb 25 ─── Mar 2 ── Mar 3
-  │         │         │         │          │          │          │         │        │
-  ▼         ▼         ▼         ▼          ▼          ▼          ▼         ▼        ▼
-Paper     Task      QA Audit  Trace      Enterprise  V5 Preamble  Oracle    Rename   V2 Report
-draft +   selection (28       audit,     expansion,  sg_only      curation  CCB→CSB  370/370
-initial   pipeline, issues),  SG_base    governance  Dockerfile   complete  DOE      coverage,
-PRD       SDLC      verifier  dropped,   simulation  redesign for 73        rebalance multi-run
-          taxonomy, bugs,     Opus 4.6   MCP         370 tasks    tasks              analysis
-          5→3 cfg   mirrors   reruns     Daytona
-```
-
-### 14.2 Scale of Claude Code Usage
-
-The development produced **~1,000 conversation sessions** (JSONL transcripts) spanning January 30 -- March 3, 2026. Claude Code was used to:
-
-- **Design and implement** the task selection and benchmarking logic
-- **Generate** all 255 `Dockerfile.sg_only` variants and 85 `Dockerfile.artifact_only` files
-- **Build** the IR evaluation pipeline (5 stages, ~3,500 lines of Python)
-- **Create** the oracle evaluation system (7 check functions, 3-pass repo normalization)
-- **Develop** the agent preamble through 5 iterations (V1→V5)
-- **Implement** the clone-at-verify pattern for fair MCP evaluation
-- **Author** infrastructure scripts (mirror creation, token management, parallel execution)
-- **Debug** critical issues (verifier bugs, git history bypass, MCP death spiral)
-- **Produce** analysis reports and statistical evaluation code
-
-### 14.3 Key Workflow Pattern
-
-The development followed a consistent pattern:
-
-1. **User provides high-level intent** → "I want SDLC-aligned task taxonomy"
-2. **Claude Code explores codebase** → reads existing tasks, benchmarks, documentation
-3. **Claude Code generates PRD** → structured user stories with acceptance criteria
-4. **Implementation via autonomous sessions** → Ralph agent system executes PRDs
-5. **User reviews and iterates** → identifies gaps, requests changes
-6. **QA and validation** → automated checks + manual audit
-
-### 14.4 Lessons Learned
-
-1. **AI is effective at benchmark infrastructure**: Generating Dockerfiles, writing evaluation scripts, and building metrics pipelines are well-suited to AI-assisted development.
-
-2. **Domain expertise remains critical**: Methodology and validity threat analysis required human judgment that couldn't be fully automated.
-
-3. **Iterative QA is essential**: The Feb 6 QA audit found 28 issues (9 critical) in infrastructure that was largely AI-generated. Systematic validation caught bugs that individual reviews missed.
-
-4. **Preamble engineering is non-trivial**: Five iterations were needed to find the right balance between forcing MCP usage and avoiding prescriptive constraints.
-
----
-
-## 15. Appendices
+## 14. Appendices
 
 ### Appendix A: Statistical Methods
 
