Commit 3106f0d

Simplify cost reporting to haiku-only and remove LOC cost breakdowns
1 parent a7be27e commit 3106f0d

2 files changed: +5 -33 lines changed

docs/BLOG_POST.md

Lines changed: 2 additions & 15 deletions
@@ -170,30 +170,17 @@ Could also just be plain ol' agent non-determinism. Retrieval quality alone does

 Let's take a break from whatever voodoo variables control reward outcomes and talk about costs and timing.

-For the headline cost comparison, I switched to one canonical paired method on `runs/official/_raw`:
+For the headline cost comparison, I use one canonical paired method on `runs/official/_raw`:

 1. Normalize task IDs (`mcp_` / `sgonly_` prefixes and random suffixes removed).
 2. For each `(model, task)`, keep the latest valid baseline run and latest valid MCP run.
 3. Valid means `output_tokens > 0` and `agent_execution_seconds >= 10`.
-4. Compare one MCP run to one baseline run per task, then average per model.
+4. Compare one MCP run to one baseline run per task.

 For **haiku valid pairs** (`n=392`), baseline is `$0.733/task` and MCP is `$0.512/task` (**-30.16%**).

 ![Haiku valid pairs baseline vs MCP cost](assets/blog/codescalebench_mcp/figure_8_haiku_cost_baseline_vs_mcp.png)

-If you split the same haiku pairs by estimated codebase LOC (from GitHub repo size), MCP looks more expensive in several bins:
-
-| Estimated LOC Band | n | BL $/task | MCP $/task | MCP vs BL |
-|--------------------|---|-----------|------------|-----------|
-| <400K | 9 | 0.3721 | 0.7599 | **+104.20%** |
-| 400K-2M | 14 | 0.3680 | 0.5237 | **+42.29%** |
-| 2M-8M | 44 | 0.4057 | 0.4139 | **+2.02%** |
-| 8M-40M | 126 | 0.3124 | 0.3569 | **+14.26%** |
-| >40M | 97 | 1.8362 | 0.6554 | **-64.31%** |
-| unknown | 102 | 0.4277 | 0.5864 | **+37.11%** |
-
-That is a weighting effect, not a contradiction: the `>40M` band has large absolute savings and enough mass to pull the overall weighted average down even when several smaller bands are MCP-expensive.
-
 Speed tells an even cleaner story:

 | Metric | Baseline Mean | MCP Mean | Delta |
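
Reviewer note: the four pairing steps in the blog post hunk above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual script. The record fields (`task_id`, `arm`, `finished_at`, `cost_usd`) and the assumed shape of the random suffix (two underscores plus seven alphanumerics) are hypothetical; only `output_tokens`, `agent_execution_seconds`, and the `mcp_` / `sgonly_` prefixes come from the text.

```python
import re
from collections import defaultdict

# Prefixes named in the post; the suffix pattern is an assumption for illustration.
PREFIX_RE = re.compile(r"^(mcp_|sgonly_)")
SUFFIX_RE = re.compile(r"__[A-Za-z0-9]{7}$")


def normalize_task_id(task_id):
    """Step 1: strip the arm prefix and the (assumed) random suffix."""
    return SUFFIX_RE.sub("", PREFIX_RE.sub("", task_id))


def is_valid(run):
    """Step 3: validity rule quoted in the post."""
    return run["output_tokens"] > 0 and run["agent_execution_seconds"] >= 10


def pair_runs(runs):
    """Steps 2 and 4: keep the latest valid run per arm, then pair them per (model, task)."""
    latest = defaultdict(dict)
    for run in runs:
        if not is_valid(run):
            continue
        key = (run["model"], normalize_task_id(run["task_id"]))
        arm = run["arm"]  # hypothetical field: "baseline" or "mcp"
        prev = latest[key].get(arm)
        if prev is None or run["finished_at"] > prev["finished_at"]:
            latest[key][arm] = run
    # Only tasks with both arms present form a pair.
    return {
        key: (arms["baseline"], arms["mcp"])
        for key, arms in latest.items()
        if "baseline" in arms and "mcp" in arms
    }


def mean_costs(pairs):
    """Average cost per task for each arm over the paired tasks."""
    n = len(pairs)
    baseline = sum(b["cost_usd"] for b, _ in pairs.values()) / n
    mcp = sum(m["cost_usd"] for _, m in pairs.values()) / n
    return baseline, mcp
```

Because each task contributes exactly one pair, the resulting mean is task-weighted, which matches the wording this commit settles on.
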

docs/technical_reports/TECHNICAL_REPORT_V2.md

Lines changed: 3 additions & 18 deletions
@@ -1093,35 +1093,20 @@ The dominant pattern is unchanged: keyword search + read-file dominate MCP usage

 ### 11.11 Cost Analysis

-Updated cost results are now reported from `runs/official/_raw` with a model-stratified, task-weighted canonical pairing:
+Updated cost results are now reported from `runs/official/_raw` with a task-weighted canonical pairing:

 - Normalize task IDs (`mcp_`/`sgonly_`/random suffix stripped).
 - For each `(model, task)`, keep the latest valid baseline and latest valid MCP run.
 - Compare one pair per task (`output_tokens > 0` and `agent_execution_seconds >= 10`).

 Source artifact: `docs/analysis/mcp_cost_pairs_official_raw_20260304.json`.
-Figure: `docs/assets/blog/codescalebench_mcp/figure_7_cost_pairing_by_model_and_size.{png,svg}` (haiku only, size-binned by estimated LOC from GitHub repo size).
+Figure: `docs/assets/blog/codescalebench_mcp/figure_8_haiku_cost_baseline_vs_mcp.{png,svg}`.

 | Model | n paired tasks | BL $/task | MCP $/task | Δ $/task | MCP vs BL |
 |-------|-----------------|-----------|------------|----------|-----------|
 | haiku | 392 | 0.7333 | 0.5121 | -0.2212 | **-30.16%** |
-| sonnet | 9 | 1.4830 | 1.3951 | -0.0880 | **-5.93%** |
-| opus | 96 | 58.8995 | 94.8916 | +35.9921 | **+61.11%** |

-This replaces the prior single pooled cost headline and is the canonical cost estimate in this report. For the most stable comparison set (haiku, `n=392` valid pairs), MCP reduces average cost per task from **$0.7333** to **$0.5121** (**-30.16%**). Model effects remain heterogeneous: sonnet is slightly cheaper with MCP, while opus is more expensive.
-
-**Haiku cost by estimated codebase LOC (same canonical pairing):**
-
-| Estimated LOC Band | n | BL $/task | MCP $/task | Δ $/task | MCP vs BL |
-|--------------------|---|-----------|------------|----------|-----------|
-| <400K | 9 | 0.3721 | 0.7599 | +0.3878 | **+104.20%** |
-| 400K-2M | 14 | 0.3680 | 0.5237 | +0.1556 | **+42.29%** |
-| 2M-8M | 44 | 0.4057 | 0.4139 | +0.0082 | **+2.02%** |
-| 8M-40M | 126 | 0.3124 | 0.3569 | +0.0445 | **+14.26%** |
-| >40M | 97 | 1.8362 | 0.6554 | -1.1808 | **-64.31%** |
-| unknown | 102 | 0.4277 | 0.5864 | +0.1587 | **+37.11%** |
-
-Method note: size bins are derived from GitHub repo size in KB and mapped to LOC bands; `unknown` indicates missing or unresolved repository metadata.
+This is the canonical cost estimate in this report: for haiku (`n=392` valid pairs), MCP reduces average cost per task from **$0.7333** to **$0.5121** (**-30.16%**).

 ### 11.12 Timing Analysis
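
Reviewer note: the LOC-band rows removed by this commit are consistent with the haiku headline that remains. A small sanity check, using only the `n`, `BL $/task`, and `MCP $/task` values from the deleted table, shows the n-weighted band means reproducing the headline figures to within the rounding of the four-decimal inputs. The variable names are illustrative.

```python
# (n, baseline $/task, MCP $/task) per estimated LOC band, copied from the removed table.
bands = {
    "<400K":   (9,   0.3721, 0.7599),
    "400K-2M": (14,  0.3680, 0.5237),
    "2M-8M":   (44,  0.4057, 0.4139),
    "8M-40M":  (126, 0.3124, 0.3569),
    ">40M":    (97,  1.8362, 0.6554),
    "unknown": (102, 0.4277, 0.5864),
}

total_n = sum(n for n, _, _ in bands.values())

# Weight each band's mean cost by its number of paired tasks.
bl_mean = sum(n * bl for n, bl, _ in bands.values()) / total_n
mcp_mean = sum(n * mcp for n, _, mcp in bands.values()) / total_n
delta_pct = (mcp_mean - bl_mean) / bl_mean * 100

print(f"n={total_n} BL={bl_mean:.4f} MCP={mcp_mean:.4f} delta={delta_pct:.2f}%")
```

The `>40M` band holds 97 of the 392 pairs and has by far the largest baseline cost, so it dominates the weighted mean even though five of the six bands show MCP as more expensive.
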
