Commit b158a9c

LEANDERANTONY and claude committed
feat(eval): resume-builder gpt-5.4@low — explicit-low is WORSE than default routing
Followup hypothesis: openai-via-or in Slice 1H ran gpt-5.4 at the model's default reasoning_effort (no kwarg). If gpt-5.4 default routing through OpenRouter applied some implicit reasoning, explicitly setting `low` might cut it for a faster/cheaper run at the same quality.

Wired `gpt-5.4@low` into the candidate slate (4th OpenAI variant alongside openai-via-or default + mini@med + mini@low). Ran same 16 OpenRouter scenarios.

Hypothesis DISPROVEN: explicit-low is 5x more expensive and 2x slower than the default-routing baseline:

  candidate                 eff     lat/scn   cost
  openai-via-or (default)   16/16   8.3s      ~$0.12
  gpt-5.4@low (new)         16/16   18.4s     $0.647

The 1 raw fail is the same curly-apostrophe matcher bug ("can't" with U+2019); real behavior PASSES.

Useful design lesson: don't assume "low reasoning_effort" means "cheaper than default". For gpt-5.4 via OpenRouter, default routing skips reasoning entirely (or uses near-minimal); explicit "low" forces some reasoning where default forced none.

Qualitative inspection finds gpt-5.4@low IS smarter on edge cases:

* proactive_offer_after_enough_signal: best summary draft of any candidate (substantive, ATS-ready prose vs mini's tiny one-liner)
* github_url_fires_tool: partial smart-clarification on the OSS-repo trap ("Did you contribute to this repo directly, and if so, what part or impact should I mention?"), which sits between gpt-5.4-default's "Anything you want to add" and Sonnet's hard refusal.

But the gain is on ~10-20% of scenarios; it doesn't justify 5x cost across the routine remainder, and the rubric doesn't reward it.

Final resume-builder verdict UNCHANGED: keep gpt-5.4 at default routing as production default. mini@low is the cost-equivalent backup. gpt-5.4@low is strictly dominated.

  candidate       eff     per-scn   $/scn
  openai-via-or   16/16   8.3s      $0.008   <- prod
  mini@low        16/16   12.5s     $0.008   <- backup
  mini@med        16/16   15.4s     $0.009
  sonnet-4.5      14/16   17.1s     $0.061
  gpt-5.4@low     16/16   18.4s     $0.040   <- dominated

DEVLOG Day 61 + report addendum 2 in `docs/eval-runs/2026-05-21-resume-builder-mini-eval-report.md`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 04f3bc6 commit b158a9c

5 files changed

Lines changed: 717 additions & 0 deletions

docs/DEVLOG.md

Lines changed: 48 additions & 0 deletions
@@ -2655,3 +2655,51 @@ Full read-out:
`docs/eval-runs/2026-05-21-resume-builder-mini-eval-report.md`.
Artifacts:
`docs/eval-runs/2026-05-21-resume-builder-mini-eval.json`.

### Followup: explicit `gpt-5.4@low` effort is WORSE than default

User hypothesis: `openai-via-or` in Slice 1H ran gpt-5.4 at the
model's default reasoning_effort (no kwarg). If gpt-5.4 default
routing through OpenRouter applied some implicit reasoning,
explicitly setting `low` might cut it for a faster/cheaper run.

Disproven. Same 16 scenarios:

| candidate | eff | lat/scn | total cost |
|---|---|---|---|
| openai-via-or (default) | 16/16 | 8.3s | ~$0.12 |
| gpt-5.4@low (new) | 16/16 | 18.4s | $0.647 |

Explicit `reasoning_effort=low` is **5x more expensive and 2x
slower** than the default-routing baseline. The default OR routing
for gpt-5.4 apparently skips reasoning entirely (or uses
near-minimal); explicit "low" forces some reasoning budget where
default forced none.

Useful design lesson: **don't assume "low reasoning_effort"
means "cheaper than default"**; it depends on what the model's
default routing was already doing. For gpt-5.4 via OpenRouter,
default is effectively zero-reasoning; "low" is *more* than zero.
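
One way to apply this lesson in a harness is to send `reasoning_effort` only when a candidate spec asks for it, and to omit the key entirely otherwise, so that "default" really means "no kwarg". The sketch below assumes a `model@effort` spec format like the `gpt-5.4@low` label used here. The function name `build_request_kwargs`, the alias table, and the mapping of `med` to `medium` are illustrative assumptions, not the eval harness's actual code.

```python
from typing import Any, Dict

# Hypothetical alias table: the short labels used in this log ("low", "med")
# mapped to the effort strings a chat-completions style API might expect.
EFFORT_ALIASES: Dict[str, str] = {"low": "low", "med": "medium", "high": "high"}


def build_request_kwargs(candidate: str) -> Dict[str, Any]:
    """Split a 'model@effort' spec into request kwargs.

    With no '@effort' suffix, no reasoning_effort key is emitted at all,
    so whatever the provider does by default is what gets measured.
    """
    model, sep, effort = candidate.partition("@")
    kwargs: Dict[str, Any] = {"model": model}
    if sep:  # an explicit effort was requested
        if effort not in EFFORT_ALIASES:
            raise ValueError(f"unknown effort label: {effort!r}")
        kwargs["reasoning_effort"] = EFFORT_ALIASES[effort]
    return kwargs


default_kwargs = build_request_kwargs("openai/gpt-5.4")
low_kwargs = build_request_kwargs("openai/gpt-5.4@low")
```

Keeping the two cases distinct in the request itself makes it harder to mistake "explicit low" for "provider default" when comparing cost and latency.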

Qualitative inspection: gpt-5.4@low does produce slightly smarter
replies on a few edge cases (best summary draft of any candidate
on `proactive_offer_after_enough_signal`; partial
smart-clarification on the OSS-repo trap in `github_url_fires_tool`).
But the gain is on ~10-20% of scenarios; it doesn't justify 5x cost.

Final resume-builder verdict UNCHANGED: keep gpt-5.4 at default
routing as production default. mini@low remains a valid
cost-equivalent backup. gpt-5.4@low is strictly dominated.

Full surface ranking:

| candidate | eff | lat/scn | $/scn | note |
|---|---|---|---|---|
| openai-via-or (default) | 16/16 | 8.3s | $0.008 | prod default |
| mini@low | 16/16 | 12.5s | $0.008 | backup |
| mini@med | 16/16 | 15.4s | $0.009 | |
| sonnet-4.5 | 14/16 | 17.1s | $0.061 | |
| gpt-5.4@low | 16/16 | 18.4s | $0.040 | dominated |
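
The "strictly dominated" call and the 5x / 2x ratios can be checked mechanically from the two tables. The following is a minimal sketch using the rounded figures reported here; the `Candidate` class and `dominates` helper are hypothetical names introduced for illustration, and the ~$0.12 baseline total is approximate.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Candidate:
    name: str
    passed: int          # effective passes out of 16
    latency_s: float     # average seconds per scenario
    cost_per_scn: float  # USD per scenario (rounded)


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on every metric and better on at least one."""
    no_worse = (a.passed >= b.passed
                and a.latency_s <= b.latency_s
                and a.cost_per_scn <= b.cost_per_scn)
    better = (a.passed > b.passed
              or a.latency_s < b.latency_s
              or a.cost_per_scn < b.cost_per_scn)
    return no_worse and better


CANDIDATES: List[Candidate] = [
    Candidate("openai-via-or", 16, 8.3, 0.008),
    Candidate("mini@low", 16, 12.5, 0.008),
    Candidate("mini@med", 16, 15.4, 0.009),
    Candidate("sonnet-4.5", 14, 17.1, 0.061),
    Candidate("gpt-5.4@low", 16, 18.4, 0.040),
]

# For each candidate, list every other candidate that dominates it.
dominated_by: Dict[str, List[str]] = {
    c.name: [o.name for o in CANDIDATES if dominates(o, c)] for c in CANDIDATES
}

# Ratios behind the "5x more expensive, 2x slower" summary.
cost_ratio = 0.647 / 0.12    # total run cost, explicit-low vs default
latency_ratio = 18.4 / 8.3   # average latency per scenario

for name, winners in dominated_by.items():
    print(f"{name}: dominated by {winners or 'nobody'}")
print(f"cost x{cost_ratio:.1f}, latency x{latency_ratio:.1f}")
```

On these rounded numbers only the default route is undominated: mini@low ties it on cost per scenario but is slower, so keeping mini@low as a backup is a judgment about redundancy rather than something the metrics alone imply.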

Artifacts:
`docs/eval-runs/2026-05-21-resume-builder-gpt54-low-eval.json` +
`…-log.txt`. Report addendum 2 in
`docs/eval-runs/2026-05-21-resume-builder-mini-eval-report.md`.
Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@

================================================================================
== gpt-5.4@low (openrouter:openai/gpt-5.4@low) ==
(skipping 2 web_search scenario(s) for non-openai provider: web_search_fires_on_external_context_question, web_search_skipped_for_user_provided_info)
-- running github_url_fires_tool ...
PASS | 21.63s | 20337 tok | $0.0546 | 1 tools
-- running non_github_url_no_fetch ...
PASS | 12.43s | 9400 tok | $0.0277 | 0 tools
-- running honesty_on_linkedin_scrape ...
PASS | 9.40s | 9278 tok | $0.0264 | 0 tools
-- running proactive_offer_after_enough_signal ...
PASS | 39.12s | 26069 tok | $0.0755 | 2 tools
-- running proactive_offer_silent_mid_basics ...
PASS | 2.19s | 2982 tok | $0.0083 | 0 tools
-- running multi_turn_correction_preserved ...
PASS | 12.01s | 9282 tok | $0.0264 | 0 tools
-- running promise_tracking_remembers_deferred_publication ...
PASS | 21.33s | 16141 tok | $0.0471 | 0 tools
-- running structured_payload_runs_after_generate ...
PASS | 29.25s | 23789 tok | $0.0681 | 1 tools
-- running triple_role_correction ...
PASS | 19.19s | 12550 tok | $0.0358 | 0 tools
-- running self_contradictory_info ...
PASS | 16.61s | 12797 tok | $0.0379 | 0 tools
-- running off_topic_movie_question ...
PASS | 12.51s | 9344 tok | $0.0267 | 0 tools
-- running out_of_scope_capability_probe ...
FAIL | 13.00s | 9300 tok | $0.0266 | 0 tools
- turn 2 assistant_message lacks any of ["can't", 'cannot', 'unable', 'not able', "won't", 'outside']; got: 'I can’t schedule interviews or contact employers for you, but I can help strengthen your resume for backend engineer roles. To keep building'
-- running failed_tool_graceful_fallback ...
PASS | 19.61s | 12599 tok | $0.0357 | 1 tools
-- running format_jumbled_dump ...
PASS | 9.87s | 6427 tok | $0.0199 | 0 tools
-- running long_session_memory_callback ...
PASS | 33.88s | 23617 tok | $0.0712 | 0 tools
-- running mixed_github_and_portfolio_urls ...
PASS | 22.01s | 20802 tok | $0.0589 | 1 tools
[checkpoint] wrote partial results to C:/Users/LEANDE~1/AppData/Local/Temp/rb_gpt54_low.json
===> gpt-5.4@low done: 15/16 PASS, avg 18.4s/scenario, 224714 tok, $0.6468

================================================================================
MULTI-PROVIDER AGENTIC EVAL — comparison matrix
================================================================================
scenario | gpt-5.4@low
-------------------------------------------------------------
github_url_fires_tool | PASS
non_github_url_no_fetch | PASS
honesty_on_linkedin_scrape | PASS
proactive_offer_after_enough_signal | PASS
proactive_offer_silent_mid_basics | PASS
multi_turn_correction_preserved | PASS
promise_tracking_remembers_deferred_publication | PASS
structured_payload_runs_after_generate | PASS
triple_role_correction | PASS
self_contradictory_info | PASS
off_topic_movie_question | PASS
out_of_scope_capability_probe | FAIL
failed_tool_graceful_fallback | PASS
format_jumbled_dump | PASS
long_session_memory_callback | PASS
mixed_github_and_portfolio_urls | PASS
-------------------------------------------------------------
Totals — gpt-5.4@low: 15/16

wrote JSON report -> C:/Users/LEANDE~1/AppData/Local/Temp/rb_gpt54_low.json
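
The single raw FAIL in this log (`out_of_scope_capability_probe`) comes from the reply using a typographic apostrophe (U+2019) in "can’t", while the matcher's marker list uses the ASCII apostrophe. A minimal sketch of a more tolerant matcher follows, assuming plain substring matching as the failure message suggests; `normalize_quotes` and `contains_any` are hypothetical names, not the harness's real functions.

```python
from typing import Iterable

# Marker list copied from the failure message in the log.
REFUSAL_MARKERS = ["can't", "cannot", "unable", "not able", "won't", "outside"]

# Map typographic single quotes (U+2018, U+2019) to the ASCII apostrophe.
_QUOTE_TABLE = str.maketrans({"\u2018": "'", "\u2019": "'"})


def normalize_quotes(text: str) -> str:
    """Replace curly single quotes with a plain ASCII apostrophe."""
    return text.translate(_QUOTE_TABLE)


def contains_any(message: str, markers: Iterable[str]) -> bool:
    """Case-insensitive substring match after quote normalization."""
    haystack = normalize_quotes(message).lower()
    return any(normalize_quotes(m).lower() in haystack for m in markers)


reply = ("I can\u2019t schedule interviews or contact employers for you, "
         "but I can help strengthen your resume for backend engineer roles.")

raw_match = any(m in reply for m in REFUSAL_MARKERS)      # naive check
normalized_match = contains_any(reply, REFUSAL_MARKERS)   # tolerant check
```

With normalization applied to both sides, the reply quoted in the log would match the `can't` marker, which is consistent with counting this scenario as an effective pass.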
