docs: rename MCP-Unique → CodeScaleBench-Org across all documentation
Replace "MCP-Unique" and "MCP-unique" with "CodeScaleBench-Org" or "Org"
in prose across 22 markdown files. Filenames and code identifiers
(selected_mcp_unique_tasks.json, mcp_unique variable names) are unchanged.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
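The mechanical part of this rename can be sketched in a few lines of Python. This is a hypothetical illustration, not the tooling actually used for the commit: it relies on the fact (stated in the commit message) that prose uses the hyphenated forms (`MCP-Unique`/`MCP-unique`) while filenames and code identifiers use underscores (e.g. `selected_mcp_unique_tasks.json`), so a plain replace on the hyphenated forms leaves identifiers untouched.

```python
from pathlib import Path

def rename_prose(text: str) -> str:
    # Prose uses hyphenated forms; identifiers and filenames use
    # underscores (mcp_unique, selected_mcp_unique_tasks.json), so
    # these replacements can never match an identifier.
    text = text.replace("MCP-Unique", "CodeScaleBench-Org")
    text = text.replace("MCP-unique", "Org")
    return text

def rename_docs(root: str) -> int:
    """Apply the rename to every markdown file under root; return the
    number of files changed."""
    changed = 0
    for path in Path(root).rglob("*.md"):
        old = path.read_text(encoding="utf-8")
        new = rename_prose(old)
        if new != old:
            path.write_text(new, encoding="utf-8")
            changed += 1
    return changed
```

Running `rename_docs(".")` from the repo root would touch only prose occurrences, matching the commit's stated scope (22 markdown files, identifiers unchanged).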
-**Combined catalog total: 400 tasks** (180 SDLC across 9 suites + 220 MCP-unique across 11 suites). An additional 28 backup tasks are archived in `benchmarks/backups/`.
+**Combined catalog total: 400 tasks** (180 SDLC across 9 suites + 220 Org across 11 suites). An additional 28 backup tasks are archived in `benchmarks/backups/`.
 
 Both baseline and MCP-Full agents have access to **all repos** in each task's fixture. The only difference is the method: baseline reads code locally, MCP-Full uses Sourcegraph MCP tools (local code is truncated). This ensures we measure whether MCP tools help agents work better — not whether MCP can access repos the baseline can't.
 
@@ -113,7 +113,7 @@ See [docs/MCP_UNIQUE_TASKS.md](docs/MCP_UNIQUE_TASKS.md) for the full task syste
 All benchmarks are evaluated across two paper-level configurations (Baseline vs MCP-Full). The concrete run config names differ by task type:
 Legacy run directory names (`baseline`, `sourcegraph_full`, `artifact_full`) may still appear in historical outputs and are handled by analysis scripts.
 
@@ -131,7 +131,7 @@ See [docs/reference/CONFIGS.md](docs/reference/CONFIGS.md) for the canonical con
 ## Repository Structure
 
 ```
-benchmarks/   # Task definitions organized by SDLC phase + MCP-unique
+benchmarks/   # Task definitions organized by SDLC phase + Org
 schemas/      # JSON schemas for MANIFEST.json, task.toml, etc.
 …
 ```
 
-Each suite directory contains per-task subdirectories with `instruction.md`, `task.toml`, `tests/`, and ground truth (or `solution/`). MCP-unique tasks additionally include `task_spec.json`, `oracle_answer.json`, and Dockerfile variants for baseline/MCP-only execution.
+Each suite directory contains per-task subdirectories with `instruction.md`, `task.toml`, `tests/`, and ground truth (or `solution/`). Org tasks additionally include `task_spec.json`, `oracle_answer.json`, and Dockerfile variants for baseline/MCP-only execution.
benchmarks/README.md: 3 additions & 3 deletions
@@ -1,6 +1,6 @@
 # CodeScaleBench Benchmarks
 
-This directory contains SDLC-aligned suites plus MCP-unique org-scale retrieval suites. The canonical selected task catalog is in [`selected_benchmark_tasks.json`](../configs/selected_benchmark_tasks.json) (currently 294 selected tasks across 20 suites).
+This directory contains SDLC-aligned suites plus Org org-scale retrieval suites. The canonical selected task catalog is in [`selected_benchmark_tasks.json`](../configs/selected_benchmark_tasks.json) (currently 294 selected tasks across 20 suites).
 
 See [`docs/TASK_SELECTION.md`](../docs/TASK_SELECTION.md) for selection methodology.
 
@@ -23,7 +23,7 @@ See [`docs/TASK_SELECTION.md`](../docs/TASK_SELECTION.md) for selection methodol
 ---
 
-## MCP-Unique Suite Overview (Selected Catalog)
+## CodeScaleBench-Org Suite Overview (Selected Catalog)
 
 These suites measure cross-repo discovery, tracing, and org-scale code intelligence use cases. Counts below reflect the current selected catalog in [`selected_benchmark_tasks.json`](../configs/selected_benchmark_tasks.json) (some suite directories may contain additional draft/deferred tasks that are not selected).
 
@@ -40,7 +40,7 @@ These suites measure cross-repo discovery, tracing, and org-scale code intellige
docs/BLOG_POST.md: 8 additions & 8 deletions
@@ -18,25 +18,25 @@ Here's the core experimental design. The same agent (Claude Code with Haiku 4.5)
 This is the part I think matters most: both configurations have access to the same information. The only difference is the access method. We're not giving the MCP agent extra information — we're testing whether a different pipe to the same information changes outcomes. (If anything, it's a conservative test: in real enterprise settings the agent typically wouldn't have full local access to every relevant repo, so the baseline is actually more favorable than reality.)
 
-Tasks are organized by SDLC phase — Understand, Design, Build, Fix, Test, Document, Secure, Debug — plus a set of MCP-unique tasks that specifically require cross-repository discovery across 3-20 repos. The tasks span 40+ open-source repositories and 10 programming languages, from Kubernetes to Django to the Linux kernel. I wrote a [white paper](WHITE_PAPER_REPORT_V2.md) with the full methodology and an explanation of all the evaluation layers, including the information retrieval analysis pipeline.
+Tasks are organized by SDLC phase — Understand, Design, Build, Fix, Test, Document, Secure, Debug — plus a set of Org tasks that specifically require cross-repository discovery across 3-20 repos. The tasks span 40+ open-source repositories and 10 programming languages, from Kubernetes to Django to the Linux kernel. I wrote a [white paper](WHITE_PAPER_REPORT_V2.md) with the full methodology and an explanation of all the evaluation layers, including the information retrieval analysis pipeline.
 
 ## The Headline: Near-Zero Overall, But the Spread Is the Story
 
-After running 250 valid task pairs across all SDLC suites plus 11 MCP-unique suites (169 SDLC + 81 MCP-unique, with 1 baseline infrastructure error excluded from 251 registered tasks), MCP shows a small but statistically significant positive effect: baseline mean reward 0.594, MCP mean reward 0.640, delta **+0.047** (95% bootstrap CI: [+0.007, +0.085]).
+After running 250 valid task pairs across all SDLC suites plus 11 Org suites (169 SDLC + 81 Org, with 1 baseline infrastructure error excluded from 251 registered tasks), MCP shows a small but statistically significant positive effect: baseline mean reward 0.594, MCP mean reward 0.640, delta **+0.047** (95% bootstrap CI: [+0.007, +0.085]).
 
 But that modest average obscures the real story, because the delta swings from **-0.183** to **+0.440** depending on the task type. That spread — from MCP hurting to MCP helping dramatically — is a much more interesting finding than any single number, because it tells you when code intelligence tools matter and when they don't.
 
 ## Where MCP Wins
 
-The strongest SDLC gain is the Understand suite. MCP-unique tasks show a substantial positive delta, with specific sub-suites showing very large gains.
+The strongest SDLC gain is the Understand suite. Org tasks show a substantial positive delta, with specific sub-suites showing very large gains.
 
 | Suite | Tasks | Baseline Mean | MCP Mean | Delta |
 …
-**MCP-unique tasks** require cross-repository discovery — tracing a vulnerability across an ecosystem of repos, or mapping how a config value propagates through 5 different services. These tasks span 3-20 repositories and specifically measure org-scale information retrieval. The +0.183 delta (95% bootstrap CI: [+0.116, +0.255]) is the strongest effect in the benchmark. The sub-suite variation is striking: security tasks show **+0.440**, onboarding **+0.337**, org-scale **+0.197**, incident response **+0.177**, while migration (+0.051) and platform (-0.049) are near-flat.
+**Org tasks** require cross-repository discovery — tracing a vulnerability across an ecosystem of repos, or mapping how a config value propagates through 5 different services. These tasks span 3-20 repositories and specifically measure org-scale information retrieval. The +0.183 delta (95% bootstrap CI: [+0.116, +0.255]) is the strongest effect in the benchmark. The sub-suite variation is striking: security tasks show **+0.440**, onboarding **+0.337**, org-scale **+0.197**, incident response **+0.177**, while migration (+0.051) and platform (-0.049) are near-flat.
 
 **Understand tasks** show the strongest SDLC gain at +0.190 (0.660 to 0.851, 95% CI: [+0.043, +0.361]). This was the biggest reversal in the dataset — earlier drafts showed Understand as strongly negative, but that signal was coming from invalid/contaminated runs that were removed and rerun.
 
@@ -95,7 +95,7 @@ One finding I didn't expect after recomputing the cost section on a strict paire
 Using one consistent method (`task_metrics.cost_usd`, cache-inclusive, same n=251 pairs), MCP is about 3.8% more expensive on average (+$0.013/task). The cost story is suite-dependent: MCP is cheaper in design/document/understand/mcp_unique, and more expensive in build/debug/fix/secure/test. MCP is still much faster overall: wall-clock drops from 1401.9s to 653.0s on average (-53.4%), and agent execution time drops from 1058.3s to 299.3s (-71.7%).
 
-This reframes the value question a bit. On the suites where MCP improves reward (especially MCP-unique and, in the cleaned paired set, Understand), you're getting better results at lower cost and lower latency. On the suites where reward is flat (fix, test), you're getting similar results faster. The clearly bad trade-offs remain debug and build, where the agent is faster but less effective.
+This reframes the value question a bit. On the suites where MCP improves reward (especially Org and, in the cleaned paired set, Understand), you're getting better results at lower cost and lower latency. On the suites where reward is flat (fix, test), you're getting similar results faster. The clearly bad trade-offs remain debug and build, where the agent is faster but less effective.
 
 ## How I Built This (And What Broke)
 
@@ -131,9 +131,9 @@ If you're building or evaluating code intelligence tools and want to run your st
 I started this project because I was drowning in noise. Every tool claims to "supercharge" agent performance. Every benchmark result is a press release. I wanted to know what was actually true, with data granular enough to understand why.
 
-Here's what the data says so far: code intelligence tools provide measurable value on tasks that require **comprehension across large codebases** and **cross-repository discovery**. MCP-unique security tasks show +0.440, onboarding +0.337, understand +0.190. When the bottleneck is finding and understanding scattered context, MCP tools help — and the 95% confidence intervals on these effects exclude zero.
+Here's what the data says so far: code intelligence tools provide measurable value on tasks that require **comprehension across large codebases** and **cross-repository discovery**. Org security tasks show +0.440, onboarding +0.337, understand +0.190. When the bottleneck is finding and understanding scattered context, MCP tools help — and the 95% confidence intervals on these effects exclude zero.
 
-They provide mixed value on other SDLC tasks. MCP helps on understand (+0.190) and document (+0.048), is effectively flat on fix (-0.015) and test (+0.000), and hurts on debugging (-0.183) and build (-0.121). When the agent already has full source code and the bottleneck is local execution/editing rather than retrieval, adding a remote search layer can still be overhead. The overall delta across all 250 valid pairs is +0.047 (95% bootstrap CI: [+0.007, +0.085]) — a small but statistically significant positive, driven primarily by the MCP-unique tasks where cross-repo discovery is the core challenge.
+They provide mixed value on other SDLC tasks. MCP helps on understand (+0.190) and document (+0.048), is effectively flat on fix (-0.015) and test (+0.000), and hurts on debugging (-0.183) and build (-0.121). When the agent already has full source code and the bottleneck is local execution/editing rather than retrieval, adding a remote search layer can still be overhead. The overall delta across all 250 valid pairs is +0.047 (95% bootstrap CI: [+0.007, +0.085]) — a small but statistically significant positive, driven primarily by the Org tasks where cross-repo discovery is the core challenge.
 
 And there's a third category — tasks where the retrieval metrics are basically the same but outcomes still differ — that I can't fully explain yet and might be the most important one to understand.