docs/BLOG_POST.md: 9 additions & 12 deletions (+9 -12)
@@ -170,21 +170,18 @@ Could also just be plain ol' agent non-determinism. Retrieval quality alone does
Let's take a break from whatever voodoo variables control reward outcomes and talk about costs and timing.

-I updated cost reporting to a cleaner canonical pairing method on `runs/official/_raw`: for each `(model, task)` pair, keep the latest valid baseline run and latest valid MCP run, then compare those one-to-one (task-weighted, stratified by model).
+For the headline cost comparison, I switched to one canonical paired method on `runs/official/_raw`:

-Headline cost results from that method:
+1. Normalize task IDs (`mcp_` / `sgonly_` prefixes and random suffixes removed).
+2. For each `(model, task)`, keep the latest valid baseline run and latest valid MCP run.
+3. Valid means `output_tokens > 0` and `agent_execution_seconds >= 10`.
+4. Compare one MCP run to one baseline run per task, then average per model.

-| Model | n paired tasks | BL $/task | MCP $/task | MCP vs BL |
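
The four steps in the added lines above can be sketched roughly as follows. This is a minimal illustration, not the script behind the post: the run-record fields `model`, `task_id`, `arm`, `finished_at`, and `cost_usd` are hypothetical names, and the suffix regex is only a guess at what a "random suffix" looks like. Only the prefixes and the two validity thresholds come from the text. The percentage is computed as a ratio of mean costs, which is an assumption about how "MCP vs BL" is derived.

```python
import re
from collections import defaultdict
from statistics import mean

# Prefixes named in step 1. The suffix pattern is a guess at what a
# "random suffix" looks like (double underscore + alphanumerics).
PREFIX_RE = re.compile(r"^(?:mcp_|sgonly_)")
SUFFIX_RE = re.compile(r"__[A-Za-z0-9]+$")  # hypothetical shape


def normalize_task_id(task_id):
    """Step 1: strip the run-type prefix and the random suffix."""
    return SUFFIX_RE.sub("", PREFIX_RE.sub("", task_id))


def is_valid(run):
    """Step 3: validity thresholds as stated in the post."""
    return run["output_tokens"] > 0 and run["agent_execution_seconds"] >= 10


def paired_costs(runs):
    """Steps 2 and 4: latest valid run per (model, task, arm), then pair."""
    latest = {}
    for run in runs:
        if not is_valid(run):
            continue
        key = (run["model"], normalize_task_id(run["task_id"]), run["arm"])
        # "finished_at" is a hypothetical sortable timestamp field.
        if key not in latest or run["finished_at"] > latest[key]["finished_at"]:
            latest[key] = run

    pairs = defaultdict(list)  # model -> [(baseline cost, MCP cost), ...]
    for (model, task, arm), bl in latest.items():
        if arm != "baseline":
            continue
        mcp = latest.get((model, task, "mcp"))
        if mcp is not None:  # keep only tasks that have both arms
            pairs[model].append((bl["cost_usd"], mcp["cost_usd"]))

    summary = {}
    for model, costs in pairs.items():
        bl_mean = mean(b for b, _ in costs)
        mcp_mean = mean(m for _, m in costs)
        summary[model] = {
            "n_paired": len(costs),
            "bl_per_task": bl_mean,
            "mcp_per_task": mcp_mean,
            "mcp_vs_bl_pct": (mcp_mean / bl_mean - 1) * 100,
        }
    return summary
```

Because every task contributes exactly one baseline run and one MCP run, each task carries equal weight within a model, which matches the "task-weighted, stratified by model" wording in the removed line.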
@@ -195,7 +192,7 @@ For haiku specifically (same canonical pairing), cost as a function of estimated
| >40M | 97 | 1.8362 | 0.6554 |**-64.31%**|
| unknown | 102 | 0.4277 | 0.5864 |**+37.11%**|

-Method note: this figure intentionally excludes opus and uses only haiku paired tasks. Size bins are derived from GitHub repo size (`/repos/{owner}/{repo}.size` in KB) mapped to LOC bands (`<400K`, `400K-2M`, `2M-8M`, `8M-40M`, `>40M`); `unknown` means missing/unresolved repo metadata.
+That is a weighting effect, not a contradiction: the `>40M` band has large absolute savings and enough mass to pull the overall weighted average down even when several smaller bands are MCP-expensive.
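
A small arithmetic sketch of that weighting effect, using only the two table rows visible in this hunk. The column order (band, n paired tasks, BL $/task, MCP $/task) is inferred from the percentages, and the remaining bands are outside this excerpt, so the pooled figure here illustrates the mechanism and is not the post's overall number.

```python
# (band, n paired tasks, BL $/task, MCP $/task), copied from the two rows above.
rows = [
    (">40M", 97, 1.8362, 0.6554),
    ("unknown", 102, 0.4277, 0.5864),
]


def pct_change(bl, mcp):
    """Relative cost change of MCP versus baseline, in percent."""
    return (mcp / bl - 1) * 100


def pooled_pct_change(rows):
    """Task-weighted change: compare total dollars across all paired tasks."""
    bl_total = sum(n * bl for _, n, bl, _ in rows)
    mcp_total = sum(n * mcp for _, n, _, mcp in rows)
    return pct_change(bl_total, mcp_total)


per_band = {band: pct_change(bl, mcp) for band, _, bl, mcp in rows}
pooled = pooled_pct_change(rows)

for band, pct in per_band.items():
    print(f"{band:>8}: {pct:+.2f}%")
print(f"  pooled: {pooled:+.2f}%")
```

The two bands have similar task counts and opposite signs, yet the pooled change comes out at roughly -44%, far below the midpoint of the two per-band percentages. The `>40M` baseline cost per task is more than four times the `unknown` one, so its dollar savings dominate the total.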