Commit 3739051

Refine haiku cost bar chart spacing and rewrite blog cost blurb
1 parent c9bb737 commit 3739051

3 files changed, +1291 -12 lines changed

docs/BLOG_POST.md

Lines changed: 9 additions & 12 deletions
@@ -170,21 +170,18 @@ Could also just be plain ol' agent non-determinism. Retrieval quality alone does
 
 Let's take a break from whatever voodoo variables control reward outcomes and talk about costs and timing.
 
-I updated cost reporting to a cleaner canonical pairing method on `runs/official/_raw`: for each `(model, task)` pair, keep the latest valid baseline run and latest valid MCP run, then compare those one-to-one (task-weighted, stratified by model).
+For the headline cost comparison, I switched to one canonical paired method on `runs/official/_raw`:
 
-Headline cost results from that method:
+1. Normalize task IDs (`mcp_` / `sgonly_` prefixes and random suffixes removed).
+2. For each `(model, task)`, keep the latest valid baseline run and latest valid MCP run.
+3. Valid means `output_tokens > 0` and `agent_execution_seconds >= 10`.
+4. Compare one MCP run to one baseline run per task, then average per model.
 
-| Model | n paired tasks | BL $/task | MCP $/task | MCP vs BL |
-|-------|-----------------|-----------|------------|-----------|
-| haiku | 392 | 0.7333 | 0.5121 | **-30.16%** |
-| sonnet | 9 | 1.4830 | 1.3951 | **-5.93%** |
-| opus | 96 | 58.8995 | 94.8916 | **+61.11%** |
+For **haiku valid pairs** (`n=392`), baseline is `$0.733/task` and MCP is `$0.512/task` (**-30.16%**).
 
-So the cost story is model-dependent: MCP is cheaper on haiku/sonnet in this slice, but substantially more expensive on opus.
+![Haiku valid pairs baseline vs MCP cost](assets/blog/codescalebench_mcp/figure_8_haiku_cost_baseline_vs_mcp.png)
 
-![Paired MCP vs baseline cost for haiku by estimated codebase LOC](assets/blog/codescalebench_mcp/figure_7_cost_pairing_by_model_and_size.png)
-
-For haiku specifically (same canonical pairing), cost as a function of estimated codebase LOC from GitHub repository size:
+If you split the same haiku pairs by estimated codebase LOC (from GitHub repo size), MCP looks more expensive in several bins:
 
 | Estimated LOC Band | n | BL $/task | MCP $/task | MCP vs BL |
 |--------------------|---|-----------|------------|-----------|
@@ -195,7 +192,7 @@ For haiku specifically (same canonical pairing), cost as a function of estimated
 | >40M | 97 | 1.8362 | 0.6554 | **-64.31%** |
 | unknown | 102 | 0.4277 | 0.5864 | **+37.11%** |
 
-Method note: this figure intentionally excludes opus and uses only haiku paired tasks. Size bins are derived from GitHub repo size (`/repos/{owner}/{repo}.size` in KB) mapped to LOC bands (`<400K`, `400K-2M`, `2M-8M`, `8M-40M`, `>40M`); `unknown` means missing/unresolved repo metadata.
+That is a weighting effect, not a contradiction: the `>40M` band has large absolute savings and enough mass to pull the overall weighted average down even when several smaller bands are MCP-expensive.
 
 Speed tells an even cleaner story:
 
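The four-step pairing method introduced in the added lines of `docs/BLOG_POST.md` could be sketched roughly as below. This is a minimal illustration under stated assumptions, not the repository's actual script. The run record fields (`model`, `task_id`, `config`, `finished_at`, `cost_usd`) and the hex-like shape assumed for the "random suffix" are hypothetical. Only the `mcp_` / `sgonly_` prefixes and the `output_tokens > 0` and `agent_execution_seconds >= 10` thresholds come from the post. The percent change is computed from the two per-model means, which is also an assumption.

```python
import re
from collections import defaultdict

PREFIX_RE = re.compile(r"^(mcp_|sgonly_)")
# Assumption: the "random suffix" is a separator plus 6+ hex characters.
SUFFIX_RE = re.compile(r"[_-][0-9a-f]{6,}$")


def normalize_task_id(task_id):
    """Step 1: strip the mcp_/sgonly_ prefix and the (assumed) random suffix."""
    return SUFFIX_RE.sub("", PREFIX_RE.sub("", task_id))


def is_valid(run):
    """Step 3: validity thresholds quoted in the post."""
    return run["output_tokens"] > 0 and run["agent_execution_seconds"] >= 10


def pair_runs(runs):
    """Step 2: keep the latest valid baseline and MCP run per (model, task)."""
    latest = defaultdict(dict)
    for run in runs:
        if not is_valid(run):
            continue
        key = (run["model"], normalize_task_id(run["task_id"]))
        slot = latest[key]
        # `config` is a hypothetical field holding "baseline" or "mcp".
        prev = slot.get(run["config"])
        if prev is None or run["finished_at"] > prev["finished_at"]:
            slot[run["config"]] = run
    # Only tasks with both sides present count as a pair.
    return {k: v for k, v in latest.items() if "baseline" in v and "mcp" in v}


def mean_cost_by_model(pairs):
    """Step 4: one MCP run vs one baseline run per task, averaged per model."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for (model, _task), pair in pairs.items():
        acc = sums[model]
        acc[0] += pair["baseline"]["cost_usd"]
        acc[1] += pair["mcp"]["cost_usd"]
        acc[2] += 1
    out = {}
    for model, (bl_sum, mcp_sum, n) in sums.items():
        bl, mcp = bl_sum / n, mcp_sum / n
        out[model] = {"n": n, "baseline": bl, "mcp": mcp,
                      "pct": 100.0 * (mcp - bl) / bl}
    return out
```

Because each task contributes exactly one baseline run and one MCP run, a task that was rerun many times carries no more weight than one that ran once.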

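The weighting effect described in the last added paragraph of the diff can be checked with simple arithmetic. The sketch below uses only the two LOC-band rows visible in this diff (`>40M` and `unknown`); the remaining bands are elided by the hunk boundary and are deliberately not filled in, so the pooled figure here is illustrative and is not the overall haiku result. The helper names are hypothetical.

```python
# (band, n, baseline $/task, MCP $/task), copied from the two visible table rows.
bands = [
    (">40M", 97, 1.8362, 0.6554),
    ("unknown", 102, 0.4277, 0.5864),
]


def pct_change(baseline, mcp):
    """Relative cost change of MCP vs baseline, in percent."""
    return 100.0 * (mcp - baseline) / baseline


def pooled(rows):
    """Task-weighted mean cost across bands (weights are the per-band n)."""
    n_total = sum(n for _, n, _, _ in rows)
    bl = sum(n * b for _, n, b, _ in rows) / n_total
    mcp = sum(n * m for _, n, _, m in rows) / n_total
    return n_total, bl, mcp


for band, n, bl, mcp in bands:
    print(f"{band:>8}  n={n:<4} {pct_change(bl, mcp):+.2f}%")

n_total, bl, mcp = pooled(bands)
print(f"  pooled  n={n_total:<4} {pct_change(bl, mcp):+.2f}%")
```

Pooling just these two rows gives a mean that is roughly 44% cheaper under MCP, even though the `unknown` band on its own is about 37% more expensive. The `>40M` band has a much higher baseline cost per task, so its absolute savings dominate the task-weighted average.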