Commit b158a9c
feat(eval): resume-builder gpt-5.4@low — explicit-low is WORSE than default routing
Follow-up hypothesis: openai-via-or in Slice 1H ran gpt-5.4 at the
model's default reasoning_effort (no kwarg passed). If default
routing through OpenRouter applies some implicit reasoning,
explicitly setting `low` might reduce it, giving a faster, cheaper
run at the same quality.
Wired `gpt-5.4@low` into the candidate slate (4th OpenAI variant
alongside openai-via-or default, mini@med, and mini@low). Ran the
same 16 OpenRouter scenarios.
Hypothesis DISPROVEN: explicit-low is ~5x more expensive and ~2x
slower than the default-routing baseline:

  candidate                 eff    lat/scn  cost (16 scn)
  openai-via-or (default)   16/16  8.3s     ~$0.12
  gpt-5.4@low (new)         16/16  18.4s    $0.647
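
The "5x" and "2x" figures can be rechecked from the table. This is a minimal sketch, assuming the cost column is the total across all 16 scenarios (consistent with the $/scn table further down); the variable names are illustrative and not taken from the eval harness.

```python
# Figures copied from the comparison table; names are illustrative only.
SCENARIOS = 16
baseline = {"lat_s": 8.3, "cost_usd": 0.12}        # openai-via-or (default), cost approximate
explicit_low = {"lat_s": 18.4, "cost_usd": 0.647}  # gpt-5.4@low

# Ratios of explicit-low to the default-routing baseline.
cost_ratio = explicit_low["cost_usd"] / baseline["cost_usd"]
latency_ratio = explicit_low["lat_s"] / baseline["lat_s"]

# Per-scenario cost, assuming the totals cover all 16 scenarios.
per_scn_base = baseline["cost_usd"] / SCENARIOS
per_scn_low = explicit_low["cost_usd"] / SCENARIOS

print(f"cost ratio:    {cost_ratio:.1f}x")
print(f"latency ratio: {latency_ratio:.1f}x")
print(f"$/scn: baseline {per_scn_base:.4f}, explicit-low {per_scn_low:.4f}")
```

With the numbers above this gives roughly 5.4x on cost and 2.2x on latency.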
The 1 raw fail is the same curly-apostrophe matcher bug as before
(the reply spells "can't" with U+2019 rather than the ASCII
apostrophe); the real behavior PASSES, so it counts as an
effective pass.
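
One way a matcher could avoid this class of false failure is to normalize curly quotes before comparing. This is a sketch only: `normalize_quotes` and `contains_phrase` are hypothetical helpers, not the eval harness's actual matcher.

```python
# Hypothetical matcher helpers; not the eval harness's actual code.
APOSTROPHE_MAP = str.maketrans({"\u2019": "'", "\u2018": "'"})


def normalize_quotes(text: str) -> str:
    """Map curly single quotes (U+2018, U+2019) to the ASCII apostrophe."""
    return text.translate(APOSTROPHE_MAP)


def contains_phrase(reply: str, phrase: str) -> bool:
    """Case-insensitive substring match after quote normalization."""
    return normalize_quotes(phrase).lower() in normalize_quotes(reply).lower()


reply = "I can\u2019t add that without more detail."
print("can't" in reply)                 # naive match misses the curly apostrophe
print(contains_phrase(reply, "can't"))  # normalized match finds it
```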
Useful design lesson: don't assume "low reasoning_effort" means
"cheaper than default". For gpt-5.4 via OpenRouter, default
routing skips reasoning entirely (or uses near-minimal);
explicit "low" forces some reasoning where default forced none.
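
The difference between "no kwarg" and an explicit `low` can be made visible in how a candidate's request is built. This is a sketch under assumptions: `build_request_kwargs` is a hypothetical helper, the model string is a placeholder, and the effort is assumed to be passed as a `reasoning_effort` kwarg as described above.

```python
from typing import Optional


def build_request_kwargs(model: str, reasoning_effort: Optional[str] = None) -> dict:
    """Build kwargs for a chat call; leave reasoning_effort out entirely when unset."""
    kwargs = {"model": model}
    if reasoning_effort is not None:
        # Explicit effort: the provider is asked to reason at this level.
        kwargs["reasoning_effort"] = reasoning_effort
    # When None, the key is absent, so the provider's default routing applies.
    return kwargs


default_kwargs = build_request_kwargs("gpt-5.4")
low_kwargs = build_request_kwargs("gpt-5.4", reasoning_effort="low")
print(default_kwargs)
print(low_kwargs)
```

Keeping "unset" distinct from "low" in the config avoids silently turning a default-routed candidate into an explicit-effort one.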
Qualitative inspection finds gpt-5.4@low IS smarter on edge
cases:
  * proactive_offer_after_enough_signal: best summary draft of
    any candidate (substantive, ATS-ready prose vs mini's tiny
    one-liner)
  * github_url_fires_tool: partial smart-clarification on the
    OSS-repo trap ("Did you contribute to this repo directly,
    and if so, what part or impact should I mention?"). This sits
    between gpt-5.4-default's "Anything you want to add" and
    Sonnet's hard refusal.
But the gain shows up on only ~10-20% of scenarios. That doesn't
justify a 5x cost on the routine remainder, and the rubric doesn't
reward it.
Final resume-builder verdict UNCHANGED: keep gpt-5.4 at default
routing as the production default. mini@low is the cost-equivalent
backup. gpt-5.4@low is strictly dominated on the rubric metrics
(same pass rate, slower, more expensive).

  candidate       eff    lat/scn  $/scn
  openai-via-or   16/16  8.3s     $0.008  <- prod
  mini@low        16/16  12.5s    $0.008  <- backup
  mini@med        16/16  15.4s    $0.009
  sonnet-4.5      14/16  17.1s    $0.061
  gpt-5.4@low     16/16  18.4s    $0.040  <- dominated
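
The "dominated" label can be checked mechanically from the table. This is a minimal sketch of a dominance test over the three rubric metrics (effective passes, latency, cost per scenario); the `Candidate` class and function names are illustrative and not part of the eval code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    passes: int      # effective passes out of 16
    lat_s: float     # latency per scenario, seconds
    cost_usd: float  # cost per scenario, USD (rounded as in the table)


CANDIDATES = [
    Candidate("openai-via-or", 16, 8.3, 0.008),
    Candidate("mini@low", 16, 12.5, 0.008),
    Candidate("mini@med", 16, 15.4, 0.009),
    Candidate("sonnet-4.5", 14, 17.1, 0.061),
    Candidate("gpt-5.4@low", 16, 18.4, 0.040),
]


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on every metric and better on at least one."""
    no_worse = a.passes >= b.passes and a.lat_s <= b.lat_s and a.cost_usd <= b.cost_usd
    better = a.passes > b.passes or a.lat_s < b.lat_s or a.cost_usd < b.cost_usd
    return no_worse and better


def dominators_of(name: str) -> list:
    """Names of all candidates that dominate the named candidate."""
    target = next(c for c in CANDIDATES if c.name == name)
    return [c.name for c in CANDIDATES if dominates(c, target)]


print(dominators_of("gpt-5.4@low"))
```

On these rounded figures, gpt-5.4@low is dominated by every other OpenAI variant; the check says nothing about the qualitative edge-case gains, which the rubric does not measure.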
DEVLOG Day 61 + report addendum 2 in
`docs/eval-runs/2026-05-21-resume-builder-mini-eval-report.md`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 04f3bc6 commit b158a9c
5 files changed
Lines changed: 717 additions & 0 deletions
Lines changed: 65 additions & 0 deletions