Commit cbcd03d

sjarmak and claude committed
docs: rename MCP-Unique → CodeScaleBench-Org across all documentation
Replace "MCP-Unique" and "MCP-unique" with "CodeScaleBench-Org" or "Org" in prose across 22 markdown files. Filenames and code identifiers (selected_mcp_unique_tasks.json, mcp_unique variable names) are unchanged. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 7a3d321 · commit cbcd03d

22 files changed (+153 −153 lines)

README.md

Lines changed: 21 additions & 21 deletions

````diff
@@ -100,7 +100,7 @@ Eleven additional suites measure cross-repo discovery, symbol resolution, depend
 | `csb_org_crossrepo` | K: Cross-Repo Discovery | 20 | Cross-repo search, dependency discovery, impact analysis |
 | **Total** | | **220** | |
 
-**Combined catalog total: 400 tasks** (180 SDLC across 9 suites + 220 MCP-unique across 11 suites). An additional 28 backup tasks are archived in `benchmarks/backups/`.
+**Combined catalog total: 400 tasks** (180 SDLC across 9 suites + 220 Org across 11 suites). An additional 28 backup tasks are archived in `benchmarks/backups/`.
 
 Both baseline and MCP-Full agents have access to **all repos** in each task's fixture. The only difference is the method: baseline reads code locally, MCP-Full uses Sourcegraph MCP tools (local code is truncated). This ensures we measure whether MCP tools help agents work better — not whether MCP can access repos the baseline can't.
 
@@ -113,7 +113,7 @@ See [docs/MCP_UNIQUE_TASKS.md](docs/MCP_UNIQUE_TASKS.md) for the full task syste
 All benchmarks are evaluated across two paper-level configurations (Baseline vs MCP-Full). The concrete run config names differ by task type:
 
 - **SDLC suites** (`csb_sdlc_feature`, `csb_sdlc_refactor`, `csb_sdlc_fix`, etc.): `baseline-local-direct` + `mcp-remote-direct`
-- **MCP-unique suites** (`csb_org_*`): `baseline-local-artifact` + `mcp-remote-artifact`
+- **Org suites** (`csb_org_*`): `baseline-local-artifact` + `mcp-remote-artifact`
 
 Legacy run directory names (`baseline`, `sourcegraph_full`, `artifact_full`) may still appear in historical outputs and are handled by analysis scripts.
 
@@ -131,7 +131,7 @@ See [docs/reference/CONFIGS.md](docs/reference/CONFIGS.md) for the canonical con
 ## Repository Structure
 
 ```
-benchmarks/ # Task definitions organized by SDLC phase + MCP-unique
+benchmarks/ # Task definitions organized by SDLC phase + Org
 csb_sdlc_feature/ # Feature Implementation (20 tasks)
 csb_sdlc_refactor/ # Cross-File Refactoring (20 tasks)
 csb_sdlc_debug/ # Debugging & Investigation (20 tasks)
@@ -142,17 +142,17 @@ benchmarks/ # Task definitions organized by SDLC phase + MCP-unique
 csb_sdlc_test/ # Testing & QA (20 tasks)
 csb_sdlc_understand/ # Requirements & Discovery (20 tasks)
 backups/ # Archived backup tasks (28 total)
-csb_org_compliance/ # MCP-unique: compliance & audit (20 tasks)
-csb_org_crossorg/ # MCP-unique: cross-org discovery (20 tasks)
-csb_org_crossrepo/ # MCP-unique: cross-repo discovery (20 tasks)
-csb_org_crossrepo_tracing/ # MCP-unique: dependency tracing (20 tasks)
-csb_org_domain/ # MCP-unique: domain lineage (20 tasks)
-csb_org_incident/ # MCP-unique: incident debugging (20 tasks)
-csb_org_migration/ # MCP-unique: framework migration (20 tasks)
-csb_org_onboarding/ # MCP-unique: onboarding (20 tasks)
-csb_org_org/ # MCP-unique: org context (20 tasks)
-csb_org_platform/ # MCP-unique: platform knowledge (20 tasks)
-csb_org_security/ # MCP-unique: vulnerability remediation (20 tasks)
+csb_org_compliance/ # Org: compliance & audit (20 tasks)
+csb_org_crossorg/ # Org: cross-org discovery (20 tasks)
+csb_org_crossrepo/ # Org: cross-repo discovery (20 tasks)
+csb_org_crossrepo_tracing/ # Org: dependency tracing (20 tasks)
+csb_org_domain/ # Org: domain lineage (20 tasks)
+csb_org_incident/ # Org: incident debugging (20 tasks)
+csb_org_migration/ # Org: framework migration (20 tasks)
+csb_org_onboarding/ # Org: onboarding (20 tasks)
+csb_org_org/ # Org: org context (20 tasks)
+csb_org_platform/ # Org: platform knowledge (20 tasks)
+csb_org_security/ # Org: vulnerability remediation (20 tasks)
 configs/ # Run configs and task selection
 _common.sh # Shared infra: token refresh, parallel execution, multi-account
 sdlc_suite_2config.sh # Generic SDLC runner (used by phase wrappers below)
@@ -167,8 +167,8 @@ configs/ # Run configs and task selection
 run_selected_tasks.sh # Unified runner for all tasks
 validate_one_per_benchmark.sh # Pre-flight smoke (1 task per suite)
 selected_benchmark_tasks.json # Canonical SDLC task selection with metadata
-selected_mcp_unique_tasks.json # MCP-unique task selection with metadata
-use_case_registry.json # 100 GTM use cases (MCP-unique task source)
+selected_mcp_unique_tasks.json # Org task selection with metadata
+use_case_registry.json # 100 GTM use cases (Org task source)
 archive/ # Pre-SDLC migration scripts (preserved for history)
 scripts/ # Metrics extraction, evaluation, and operational tooling
 csb_metrics/ # Python package: models, extractors, discovery, judge context
@@ -201,7 +201,7 @@ docs/ # Operational documentation
 TASK_CATALOG.md # Detailed per-task reference
 TASK_SELECTION.md # Selection criteria, difficulty calibration, MCP scoring
 SCORING_SEMANTICS.md # Reward and pass interpretation per benchmark
-MCP_UNIQUE_TASKS.md # MCP-unique task system, authoring, oracle evaluation
+MCP_UNIQUE_TASKS.md # Org task system, authoring, oracle evaluation
 MCP_UNIQUE_CALIBRATION.md # Oracle coverage analysis and threshold calibration
 WORKFLOW_METRICS.md # Timing/cost metric definitions
 AGENT_INTERFACE.md # Runtime I/O contract for agents
@@ -214,7 +214,7 @@ skills/ # AI agent skill definitions (operational runbooks)
 schemas/ # JSON schemas for MANIFEST.json, task.toml, etc.
 ```
 
-Each suite directory contains per-task subdirectories with `instruction.md`, `task.toml`, `tests/`, and ground truth (or `solution/`). MCP-unique tasks additionally include `task_spec.json`, `oracle_answer.json`, and Dockerfile variants for baseline/MCP-only execution.
+Each suite directory contains per-task subdirectories with `instruction.md`, `task.toml`, `tests/`, and ground truth (or `solution/`). Org tasks additionally include `task_spec.json`, `oracle_answer.json`, and Dockerfile variants for baseline/MCP-only execution.
 
 ---
 
@@ -315,12 +315,12 @@ bash configs/test_2config.sh # 20 Testing & QA tasks
 bash configs/document_2config.sh # 20 Documentation tasks
 ```
 
-### MCP-Unique Tasks
+### CodeScaleBench-Org Tasks
 
-MCP-unique tasks use a separate selection file:
+Org tasks use a separate selection file:
 
 ```bash
-# Run all MCP-unique tasks across 2 configs
+# Run all Org tasks across 2 configs
 bash configs/run_selected_tasks.sh --selection-file configs/selected_mcp_unique_tasks.json
 
 # Filter by use-case category
````

benchmarks/README.md

Lines changed: 3 additions & 3 deletions

````diff
@@ -1,6 +1,6 @@
 # CodeScaleBench Benchmarks
 
-This directory contains SDLC-aligned suites plus MCP-unique org-scale retrieval suites. The canonical selected task catalog is in [`selected_benchmark_tasks.json`](../configs/selected_benchmark_tasks.json) (currently 294 selected tasks across 20 suites).
+This directory contains SDLC-aligned suites plus Org org-scale retrieval suites. The canonical selected task catalog is in [`selected_benchmark_tasks.json`](../configs/selected_benchmark_tasks.json) (currently 294 selected tasks across 20 suites).
 
 See [`docs/TASK_SELECTION.md`](../docs/TASK_SELECTION.md) for selection methodology.
 
@@ -23,7 +23,7 @@ See [`docs/TASK_SELECTION.md`](../docs/TASK_SELECTION.md) for selection methodol
 
 ---
 
-## MCP-Unique Suite Overview (Selected Catalog)
+## CodeScaleBench-Org Suite Overview (Selected Catalog)
 
 These suites measure cross-repo discovery, tracing, and org-scale code intelligence use cases. Counts below reflect the current selected catalog in [`selected_benchmark_tasks.json`](../configs/selected_benchmark_tasks.json) (some suite directories may contain additional draft/deferred tasks that are not selected).
 
@@ -40,7 +40,7 @@ These suites measure cross-repo discovery, tracing, and org-scale code intellige
 | `csb_org_org` | 5 | Org-wide coding correctness tasks requiring broad context |
 | `csb_org_platform` | 5 | Platform/devtools and tribal-knowledge discovery |
 | `csb_org_security` | 10 | Vulnerability remediation and security analysis at org scale |
-| **Total MCP-Unique (selected)** | **81** | |
+| **Total CodeScaleBench-Org (selected)** | **81** | |
 
 For suite taxonomy, authoring, and oracle evaluation details, see [`docs/MCP_UNIQUE_TASKS.md`](../docs/MCP_UNIQUE_TASKS.md).
````

docs/BLOG_POST.md

Lines changed: 8 additions & 8 deletions

````diff
@@ -18,25 +18,25 @@ Here's the core experimental design. The same agent (Claude Code with Haiku 4.5)
 
 This is the part I think matters most: both configurations have access to the same information. The only difference is the access method. We're not giving the MCP agent extra information — we're testing whether a different pipe to the same information changes outcomes. (If anything, it's a conservative test: in real enterprise settings the agent typically wouldn't have full local access to every relevant repo, so the baseline is actually more favorable than reality.)
 
-Tasks are organized by SDLC phase — Understand, Design, Build, Fix, Test, Document, Secure, Debug — plus a set of MCP-unique tasks that specifically require cross-repository discovery across 3-20 repos. The tasks span 40+ open-source repositories and 10 programming languages, from Kubernetes to Django to the Linux kernel. I wrote a [white paper](WHITE_PAPER_REPORT_V2.md) with the full methodology and an explanation of all the evaluation layers, including the information retrieval analysis pipeline.
+Tasks are organized by SDLC phase — Understand, Design, Build, Fix, Test, Document, Secure, Debug — plus a set of Org tasks that specifically require cross-repository discovery across 3-20 repos. The tasks span 40+ open-source repositories and 10 programming languages, from Kubernetes to Django to the Linux kernel. I wrote a [white paper](WHITE_PAPER_REPORT_V2.md) with the full methodology and an explanation of all the evaluation layers, including the information retrieval analysis pipeline.
 
 ## The Headline: Near-Zero Overall, But the Spread Is the Story
 
-After running 250 valid task pairs across all SDLC suites plus 11 MCP-unique suites (169 SDLC + 81 MCP-unique, with 1 baseline infrastructure error excluded from 251 registered tasks), MCP shows a small but statistically significant positive effect: baseline mean reward 0.594, MCP mean reward 0.640, delta **+0.047** (95% bootstrap CI: [+0.007, +0.085]).
+After running 250 valid task pairs across all SDLC suites plus 11 Org suites (169 SDLC + 81 Org, with 1 baseline infrastructure error excluded from 251 registered tasks), MCP shows a small but statistically significant positive effect: baseline mean reward 0.594, MCP mean reward 0.640, delta **+0.047** (95% bootstrap CI: [+0.007, +0.085]).
 
 But that modest average obscures the real story, because the delta swings from **-0.183** to **+0.440** depending on the task type. That spread — from MCP hurting to MCP helping dramatically — is a much more interesting finding than any single number, because it tells you when code intelligence tools matter and when they don't.
 
 ## Where MCP Wins
 
-The strongest SDLC gain is the Understand suite. MCP-unique tasks show a substantial positive delta, with specific sub-suites showing very large gains.
+The strongest SDLC gain is the Understand suite. Org tasks show a substantial positive delta, with specific sub-suites showing very large gains.
 
 | Suite | Tasks | Baseline Mean | MCP Mean | Delta |
 |-------|-------|--------------|----------|-------|
-| MCP-Unique (all) | 81 | 0.525 | 0.708 | **+0.183** |
+| CodeScaleBench-Org (all) | 81 | 0.525 | 0.708 | **+0.183** |
 | Understand | 20 | 0.660 | 0.851 | **+0.190** |
 | Document | 20 | 0.847 | 0.895 | +0.048 |
 
-**MCP-unique tasks** require cross-repository discovery — tracing a vulnerability across an ecosystem of repos, or mapping how a config value propagates through 5 different services. These tasks span 3-20 repositories and specifically measure org-scale information retrieval. The +0.183 delta (95% bootstrap CI: [+0.116, +0.255]) is the strongest effect in the benchmark. The sub-suite variation is striking: security tasks show **+0.440**, onboarding **+0.337**, org-scale **+0.197**, incident response **+0.177**, while migration (+0.051) and platform (-0.049) are near-flat.
+**Org tasks** require cross-repository discovery — tracing a vulnerability across an ecosystem of repos, or mapping how a config value propagates through 5 different services. These tasks span 3-20 repositories and specifically measure org-scale information retrieval. The +0.183 delta (95% bootstrap CI: [+0.116, +0.255]) is the strongest effect in the benchmark. The sub-suite variation is striking: security tasks show **+0.440**, onboarding **+0.337**, org-scale **+0.197**, incident response **+0.177**, while migration (+0.051) and platform (-0.049) are near-flat.
 
 **Understand tasks** show the strongest SDLC gain at +0.190 (0.660 to 0.851, 95% CI: [+0.043, +0.361]). This was the biggest reversal in the dataset — earlier drafts showed Understand as strongly negative, but that signal was coming from invalid/contaminated runs that were removed and rerun.
 
@@ -95,7 +95,7 @@ One finding I didn't expect after recomputing the cost section on a strict paire
 
 Using one consistent method (`task_metrics.cost_usd`, cache-inclusive, same n=251 pairs), MCP is about 3.8% more expensive on average (+$0.013/task). The cost story is suite-dependent: MCP is cheaper in design/document/understand/mcp_unique, and more expensive in build/debug/fix/secure/test. MCP is still much faster overall: wall-clock drops from 1401.9s to 653.0s on average (-53.4%), and agent execution time drops from 1058.3s to 299.3s (-71.7%).
 
-This reframes the value question a bit. On the suites where MCP improves reward (especially MCP-unique and, in the cleaned paired set, Understand), you're getting better results at lower cost and lower latency. On the suites where reward is flat (fix, test), you're getting similar results faster. The clearly bad trade-offs remain debug and build, where the agent is faster but less effective.
+This reframes the value question a bit. On the suites where MCP improves reward (especially Org and, in the cleaned paired set, Understand), you're getting better results at lower cost and lower latency. On the suites where reward is flat (fix, test), you're getting similar results faster. The clearly bad trade-offs remain debug and build, where the agent is faster but less effective.
 
 ## How I Built This (And What Broke)
 
@@ -131,9 +131,9 @@ If you're building or evaluating code intelligence tools and want to run your st
 
 I started this project because I was drowning in noise. Every tool claims to "supercharge" agent performance. Every benchmark result is a press release. I wanted to know what was actually true, with data granular enough to understand why.
 
-Here's what the data says so far: code intelligence tools provide measurable value on tasks that require **comprehension across large codebases** and **cross-repository discovery**. MCP-unique security tasks show +0.440, onboarding +0.337, understand +0.190. When the bottleneck is finding and understanding scattered context, MCP tools help — and the 95% confidence intervals on these effects exclude zero.
+Here's what the data says so far: code intelligence tools provide measurable value on tasks that require **comprehension across large codebases** and **cross-repository discovery**. Org security tasks show +0.440, onboarding +0.337, understand +0.190. When the bottleneck is finding and understanding scattered context, MCP tools help — and the 95% confidence intervals on these effects exclude zero.
 
-They provide mixed value on other SDLC tasks. MCP helps on understand (+0.190) and document (+0.048), is effectively flat on fix (-0.015) and test (+0.000), and hurts on debugging (-0.183) and build (-0.121). When the agent already has full source code and the bottleneck is local execution/editing rather than retrieval, adding a remote search layer can still be overhead. The overall delta across all 250 valid pairs is +0.047 (95% bootstrap CI: [+0.007, +0.085]) — a small but statistically significant positive, driven primarily by the MCP-unique tasks where cross-repo discovery is the core challenge.
+They provide mixed value on other SDLC tasks. MCP helps on understand (+0.190) and document (+0.048), is effectively flat on fix (-0.015) and test (+0.000), and hurts on debugging (-0.183) and build (-0.121). When the agent already has full source code and the bottleneck is local execution/editing rather than retrieval, adding a remote search layer can still be overhead. The overall delta across all 250 valid pairs is +0.047 (95% bootstrap CI: [+0.007, +0.085]) — a small but statistically significant positive, driven primarily by the Org tasks where cross-repo discovery is the core challenge.
 
 And there's a third category — tasks where the retrieval metrics are basically the same but outcomes still differ — that I can't fully explain yet and might be the most important one to understand.
````

docs/CONTEXT_RETRIEVAL_AGENT.md

Lines changed: 1 addition & 1 deletion

````diff
@@ -283,7 +283,7 @@ Matches `schemas/retrieval_events_schema.json` ground_truth block. Files as plai
 - `chunks` — line-range annotations per file, format: `{file, line_start, line_end, annotation}` (optional but recommended)
 - `dependency_chain` — ordered file list tracing call/import chain (optional)
 
-### 2. `oracle_answer.json` (MCP-unique tasks only — artifact verifier format)
+### 2. `oracle_answer.json` (Org tasks only — artifact verifier format)
 
 Same format as before, consumed by `oracle_checks.py`:
 
````
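For illustration, a ground-truth block using the chunk format named in the hunk above might look like this. Only the `chunks` and `dependency_chain` field names and the `{file, line_start, line_end, annotation}` chunk shape come from the diff; the file paths and annotation text are invented placeholders, and the authoritative shape lives in `schemas/retrieval_events_schema.json`:

```python
import json

# Hypothetical ground-truth fragment; all paths/annotations are placeholders.
ground_truth = {
    "chunks": [
        {
            "file": "services/auth/token.py",
            "line_start": 42,
            "line_end": 67,
            "annotation": "token validation entry point",
        },
    ],
    "dependency_chain": [
        "services/auth/token.py",  # ordered: caller first
        "libs/crypto/sign.py",     # then the imported dependency
    ],
}

print(json.dumps(ground_truth, indent=2))
```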

docs/EVALUATION_PIPELINE.md

Lines changed: 4 additions & 4 deletions

````diff
@@ -5,7 +5,7 @@ run first (every task), then an optional LLM judge adds qualitative scoring,
 and statistical modules provide confidence intervals and correlation analysis.
 
 This document covers the pipeline architecture. For per-benchmark scoring
-details, see [SCORING_SEMANTICS.md](SCORING_SEMANTICS.md). For MCP-unique
+details, see [SCORING_SEMANTICS.md](SCORING_SEMANTICS.md). For Org
 oracle checks, see [MCP_UNIQUE_TASKS.md](MCP_UNIQUE_TASKS.md). For the full
 retrieval/IR evaluation pipeline (normalized retrieval events, file/chunk IR
 metrics, utilization probes, taxonomy, and emitted artifacts), see
````
````diff
@@ -48,7 +48,7 @@ Harbor run output (result.json, transcript)
 
 ## Layer 1: Deterministic Verifiers
 
-Every task ships a `tests/test.sh` (SDLC tasks) or `tests/eval.sh` (MCP-unique
+Every task ships a `tests/test.sh` (SDLC tasks) or `tests/eval.sh` (Org
 tasks) that runs inside the Docker container after the agent finishes. The
 verifier writes a reward (0.0–1.0) to `/logs/verifier/reward.txt` and exits 0
 on success, non-zero on failure.
````
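The verifier contract in this hunk (a reward file plus a process exit code) can be sketched as a small helper. This is an illustrative reconstruction, not the repo's actual `test.sh`/`eval.sh`: the `/logs/verifier/reward.txt` path and the 0.0–1.0 range come from the text, while the overridable output directory and the pass threshold are assumptions (what counts as "success" is task-specific):

```python
import os

def emit_reward(reward: float, out_dir: str = "/logs/verifier",
                pass_threshold: float = 1.0) -> int:
    """Write the reward file and return the verifier's exit code.

    pass_threshold is an assumption, not part of the documented contract.
    """
    # Clamp into the documented 0.0-1.0 range.
    reward = max(0.0, min(1.0, reward))
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, "reward.txt"), "w") as f:
        f.write(f"{reward}\n")
    # Exit 0 on success, non-zero on failure, per the docs.
    return 0 if reward >= pass_threshold else 1
```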
````diff
@@ -153,9 +153,9 @@ python3 scripts/run_judge.py --run runs/official/my_run/ --force
 
 Output: `judge_result.json` written alongside each task's `result.json`.
 
-### Hybrid Scoring (MCP-Unique Tasks)
+### Hybrid Scoring (CodeScaleBench-Org Tasks)
 
-MCP-unique tasks with `tests/criteria.json` support hybrid evaluation:
+Org tasks with `tests/criteria.json` support hybrid evaluation:
 `composite = 0.6 * verifier_reward + 0.4 * rubric_score`. Enable with
 `--hybrid` flag on `run_judge.py`.
 
````
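The composite formula in this hunk is concrete enough to pin down in a few lines. A minimal sketch: the 0.6/0.4 weights and the input names come directly from the diff, while the defensive clamping is an assumption, not stated in the docs:

```python
def hybrid_score(verifier_reward: float, rubric_score: float) -> float:
    """Composite score for tasks with tests/criteria.json:
    composite = 0.6 * verifier_reward + 0.4 * rubric_score.
    """
    # Inputs are expected in [0.0, 1.0]; clamping is defensive only.
    v = max(0.0, min(1.0, verifier_reward))
    r = max(0.0, min(1.0, rubric_score))
    return 0.6 * v + 0.4 * r
```

For example, a task that fully passes the deterministic verifier (1.0) but draws a 0.5 rubric score from the judge composites to 0.8.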

docs/EXTENSIBILITY.md

Lines changed: 2 additions & 2 deletions

````diff
@@ -81,9 +81,9 @@ When adding benchmark environment variants, keep canonical task definitions inta
 (for example under `benchmarks/csb_sdlc_document/`).
 4. Treat variant runs as separate studies in reporting and curation.
 
-## 7) Add MCP-Unique Tasks (csb_org_* suites)
+## 7) Add CodeScaleBench-Org Tasks (csb_org_* suites)
 
-MCP-unique tasks measure org-scale cross-repo discovery — what local-only agents
+Org tasks measure org-scale cross-repo discovery — what local-only agents
 cannot do. See `docs/MCP_UNIQUE_TASKS.md` for the full authoring guide.
 
 **Quick start:**
````

0 commit comments