docs/BLOG_POST.md
20 additions & 23 deletions
@@ -118,23 +118,17 @@ Context retrieval isn't the bottleneck for every software development situation.
## MCP Value Scales With Codebase Size

-For the fully refreshed pass, I used task-level size proxies that are present for this dataset (`context_length` and `files_count`) with multi-run averages per task/config:
+For the refreshed pass, I binned tasks by estimated LOC from GitHub repo size (same LOC mapping used elsewhere: `<400K`, `400K-2M`, `2M-8M`, `8M-40M`, `>40M`):

-| Context Size Proxy | n | BL Mean | MCP Mean | Δ Reward | Var(Δ Reward) |
So in this refreshed slice, MCP reward delta is positive across all available size-proxy bins.
+So in this refreshed slice, MCP reward gains are strongest in the 400K-2M and 8M-40M bands, mixed in smaller/mid bands, and still positive in the largest band.
Breaking it down by difficulty (with variance): hard tasks remain positive (+0.038, var 0.046768), medium tasks are most positive (+0.115, var 0.053039), and expert tasks remain negative (−0.057, var 0.070557).
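
The per-difficulty figures above are means and variances of the paired reward delta (MCP minus baseline). A minimal sketch of that aggregation follows. The row layout, the `delta_stats` name, and the choice of sample variance are assumptions for illustration; the post does not say which variance estimator was used.

```python
from collections import defaultdict
from statistics import mean, variance


def delta_stats(rows):
    """Group paired rewards and summarize the MCP-minus-baseline delta.

    rows: iterable of (group, baseline_reward, mcp_reward), one per task.
    Returns {group: (mean_delta, variance_of_delta)}.
    ASSUMPTION: sample variance (n - 1 denominator); a group with a single
    task is reported with variance 0.0.
    """
    deltas = defaultdict(list)
    for group, baseline, mcp in rows:
        deltas[group].append(mcp - baseline)
    return {
        group: (mean(d), variance(d) if len(d) > 1 else 0.0)
        for group, d in deltas.items()
    }


# Made-up example rows, not data from the post.
example = [("hard", 0.5, 0.6), ("hard", 0.4, 0.6), ("expert", 0.7, 0.6)]
stats = delta_stats(example)
```

Reporting the variance next to the mean, as the post does, makes it easier to see when a small mean delta sits inside a wide spread of per-task outcomes.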
@@ -188,17 +182,20 @@ Headline cost results from that method:
So the cost story is model-dependent: MCP is cheaper on haiku/sonnet in this slice, but substantially more expensive on opus.

-
+

-And for haiku specifically (same canonical pairing), cost as a function of size proxies:
+For haiku specifically (same canonical pairing), cost as a function of estimated codebase LOC from GitHub repository size:

-| Context Length Bin | n | BL $/task | MCP $/task | MCP vs BL |
+| Estimated LOC Band | n | BL $/task | MCP $/task | MCP vs BL |
The large `unknown` bucket is important context here: size metadata coverage is incomplete in this slice, so the known-size bins are cleaner than aggregate unknown.
+| <400K | 9 | 0.3721 | 0.7599 | **+104.20%** |
+| 400K-2M | 14 | 0.3680 | 0.5237 | **+42.29%** |
+| 2M-8M | 44 | 0.4057 | 0.4139 | **+2.02%** |
+| 8M-40M | 126 | 0.3124 | 0.3569 | **+14.26%** |
+| >40M | 97 | 1.8362 | 0.6554 | **-64.31%** |
+| unknown | 102 | 0.4277 | 0.5864 | **+37.11%** |
+
+Method note: this figure intentionally excludes opus and uses only haiku paired tasks. Size bins are derived from GitHub repo size (`/repos/{owner}/{repo}.size` in KB) mapped to LOC bands (`<400K`, `400K-2M`, `2M-8M`, `8M-40M`, `>40M`); `unknown` means missing/unresolved repo metadata.
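
The binning in the method note can be sketched as follows. The band edges and labels come from the post. The KB-to-LOC conversion factor (`ASSUMED_BYTES_PER_LOC`), the function name `loc_band`, and the treatment of values that fall exactly on a band edge are hypothetical, because the post does not state them.

```python
from bisect import bisect_right
from typing import Optional

# Band edges in estimated lines of code, as listed in the post.
BAND_EDGES = [400_000, 2_000_000, 8_000_000, 40_000_000]
BAND_LABELS = ["<400K", "400K-2M", "2M-8M", "8M-40M", ">40M"]

# ASSUMPTION: illustrative conversion factor only; the post does not give one.
ASSUMED_BYTES_PER_LOC = 50


def loc_band(repo_size_kb: Optional[int],
             bytes_per_loc: int = ASSUMED_BYTES_PER_LOC) -> str:
    """Map a GitHub repo `size` value (in KB) to an estimated LOC band.

    Returns "unknown" when the repo metadata is missing or unresolved.
    ASSUMPTION: a value exactly on an edge goes to the higher band.
    """
    if repo_size_kb is None:
        return "unknown"
    estimated_loc = repo_size_kb * 1024 / bytes_per_loc
    return BAND_LABELS[bisect_right(BAND_EDGES, estimated_loc)]


# Example calls with made-up sizes.
small = loc_band(1_000)        # about 20K estimated LOC
mid = loc_band(100_000)        # about 2.0M estimated LOC
huge = loc_band(10_000_000)    # about 205M estimated LOC
missing = loc_band(None)
```

Keeping the conversion factor as a parameter makes it easy to check how sensitive the band assignments are to that one assumption.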