Commit c9bb737

Switch cost size analysis to haiku-only GitHub-size LOC bins
1 parent c916168 commit c9bb737

6 files changed: +1957 -1944 lines changed

docs/BLOG_POST.md

Lines changed: 20 additions & 23 deletions
@@ -118,23 +118,17 @@ Context retrieval isn't the bottleneck for every software development situation.

 ## MCP Value Scales With Codebase Size

-For the fully refreshed pass, I used task-level size proxies that are present for this dataset (`context_length` and `files_count`) with multi-run averages per task/config:
+For the refreshed pass, I binned tasks by estimated LOC from GitHub repo size (same LOC mapping used elsewhere: `<400K`, `400K-2M`, `2M-8M`, `8M-40M`, `>40M`):

-| Context Size Proxy | n | BL Mean | MCP Mean | Δ Reward | Var(Δ Reward) |
-|--------------------|---|---------|----------|----------|---------------|
-| <100K tokens | 222 | 0.400 | 0.433 | +0.034 | 0.026862 |
-| 100K–1M tokens | 98 | 0.639 | 0.670 | +0.031 | 0.093518 |
-| unknown | 50 | 0.523 | 0.571 | +0.048 | 0.059717 |
+| Estimated LOC Band | n | BL Mean | MCP Mean | Δ Reward |
+|--------------------|---|---------|----------|----------|
+| <400K | 3 | 0.850 | 0.770 | -0.080 |
+| 400K-2M | 8 | 0.140 | 0.399 | +0.259 |
+| 2M-8M | 48 | 0.521 | 0.483 | -0.037 |
+| 8M-40M | 166 | 0.489 | 0.544 | +0.055 |
+| >40M | 145 | 0.462 | 0.492 | +0.030 |

-And by file-count bins:
-
-| Files Count Bin | n | BL Mean | MCP Mean | Δ Reward | Var(Δ Reward) |
-|----------------|---|---------|----------|----------|---------------|
-| <10 | 168 | 0.327 | 0.375 | +0.048 | 0.032454 |
-| 10–100 | 91 | 0.676 | 0.699 | +0.023 | 0.097068 |
-| unknown | 111 | 0.550 | 0.575 | +0.025 | 0.034117 |
-
-So in this refreshed slice, MCP reward delta is positive across all available size-proxy bins.
+So in this refreshed slice, MCP reward gains are strongest in the 400K-2M and 8M-40M bands, mixed in smaller/mid bands, and still positive in the largest band.

 Breaking it down by difficulty (with variance): hard tasks remain positive (+0.038, var 0.046768), medium tasks are most positive (+0.115, var 0.053039), and expert tasks remain negative (−0.057, var 0.070557).
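Reviewer sketch (not part of the commit): the added text in this hunk bins tasks into estimated-LOC bands and reports Δ Reward per band. A minimal Python illustration of that bookkeeping might look like the following. The names `BAND_EDGES`, `BAND_LABELS`, `loc_band`, and `reward_delta` are hypothetical. The treatment of values that sit exactly on a band boundary is an assumption, and the conversion from GitHub repo size to LOC is not shown in the diff, so the sketch starts from an already-estimated LOC value.

```python
from bisect import bisect_right

# Band edges taken from the labels in the diff. Which side an exact
# boundary value falls on is an ASSUMPTION (here 400_000 -> "400K-2M").
BAND_EDGES = [400_000, 2_000_000, 8_000_000, 40_000_000]
BAND_LABELS = ["<400K", "400K-2M", "2M-8M", "8M-40M", ">40M"]


def loc_band(estimated_loc):
    """Return the band label for an estimated LOC count, or "unknown" if missing."""
    if estimated_loc is None:
        return "unknown"
    return BAND_LABELS[bisect_right(BAND_EDGES, estimated_loc)]


def reward_delta(bl_mean, mcp_mean):
    """Delta reward as tabulated: MCP mean minus baseline mean, to 3 decimals."""
    return round(mcp_mean - bl_mean, 3)


if __name__ == "__main__":
    print(loc_band(1_500_000))         # falls in the 400K-2M band
    print(reward_delta(0.140, 0.399))  # 400K-2M row of the added table
```

Recomputing from the rounded means reproduces the tabulated deltas except for the 2M-8M row, where 0.483 - 0.521 gives -0.038 rather than -0.037, presumably because the table's delta was computed from unrounded means.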

@@ -188,17 +182,20 @@ Headline cost results from that method:

 So the cost story is model-dependent: MCP is cheaper on haiku/sonnet in this slice, but substantially more expensive on opus.

-![Paired MCP vs baseline cost by model and size](assets/blog/codescalebench_mcp/figure_7_cost_pairing_by_model_and_size.png)
+![Paired MCP vs baseline cost for haiku by estimated codebase LOC](assets/blog/codescalebench_mcp/figure_7_cost_pairing_by_model_and_size.png)

-And for haiku specifically (same canonical pairing), cost as a function of size proxies:
+For haiku specifically (same canonical pairing), cost as a function of estimated codebase LOC from GitHub repository size:

-| Context Length Bin | n | BL $/task | MCP $/task | MCP vs BL |
+| Estimated LOC Band | n | BL $/task | MCP $/task | MCP vs BL |
 |--------------------|---|-----------|------------|-----------|
-| <100K | 222 | 0.2349 | 0.2146 | **-8.65%** |
-| 100K-1M | 98 | 1.4832 | 0.5094 | **-65.66%** |
-| unknown | 72 | 1.2491 | 1.4331 | **+14.73%** |
-
-The large `unknown` bucket is important context here: size metadata coverage is incomplete in this slice, so the known-size bins are cleaner than aggregate unknown.
+| <400K | 9 | 0.3721 | 0.7599 | **+104.20%** |
+| 400K-2M | 14 | 0.3680 | 0.5237 | **+42.29%** |
+| 2M-8M | 44 | 0.4057 | 0.4139 | **+2.02%** |
+| 8M-40M | 126 | 0.3124 | 0.3569 | **+14.26%** |
+| >40M | 97 | 1.8362 | 0.6554 | **-64.31%** |
+| unknown | 102 | 0.4277 | 0.5864 | **+37.11%** |
+
+Method note: this figure intentionally excludes opus and uses only haiku paired tasks. Size bins are derived from GitHub repo size (`/repos/{owner}/{repo}.size` in KB) mapped to LOC bands (`<400K`, `400K-2M`, `2M-8M`, `8M-40M`, `>40M`); `unknown` means missing/unresolved repo metadata.

 Speed tells an even cleaner story:
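Reviewer sketch (not part of the commit): the "MCP vs BL" column in the added cost table reads as the relative change of MCP cost against baseline cost. A small Python check under that assumption is below. `HAIKU_COST_ROWS` and `pct_change` are hypothetical names, and the rows are copied from the rounded `$/task` values in the diff rather than from the underlying data.

```python
# (band, BL $/task, MCP $/task, tabulated MCP vs BL in percent),
# copied from the added rows of the diff.
HAIKU_COST_ROWS = [
    ("<400K", 0.3721, 0.7599, 104.20),
    ("400K-2M", 0.3680, 0.5237, 42.29),
    ("2M-8M", 0.4057, 0.4139, 2.02),
    ("8M-40M", 0.3124, 0.3569, 14.26),
    (">40M", 1.8362, 0.6554, -64.31),
    ("unknown", 0.4277, 0.5864, 37.11),
]


def pct_change(bl_cost, mcp_cost):
    """Relative cost change of MCP versus baseline, in percent."""
    return (mcp_cost - bl_cost) / bl_cost * 100.0


if __name__ == "__main__":
    for band, bl, mcp, tabulated in HAIKU_COST_ROWS:
        recomputed = pct_change(bl, mcp)
        # Small gaps are expected because the inputs are already rounded.
        print(f"{band:>8}: recomputed {recomputed:+.2f}%, tabulated {tabulated:+.2f}%")
```

Recomputed values agree with the tabulated column to within a few hundredths of a percentage point, which is consistent with the table having been derived from unrounded per-task costs.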
