Commit 04f3bc6
feat(eval): resume-builder × gpt-5.4-mini@{med,low} sweep — cost story does NOT transfer
Slice 1K's mini@low win on the workspace-assistant surface prompted
the natural followup: does it hold on the heavier resume-builder
surface (tool loop, multi-turn intake, proactive_offer +
pending_followups channels, 11K-char structuring-pass canary)?
Wiring:
- `tests/quality/resume_builder_agentic_runner.py`:
_AGENTIC_CANDIDATES changed from dict[str, str] to dict[str, dict]
so each entry carries {slug, reasoning_effort}. Added
`gpt-5.4-mini@med` + `gpt-5.4-mini@low` to the slate. _build_arm
now passes default_reasoning_effort through to OpenRouterEvalService
and labels the arm with the effort tier for the comparison matrix.
- `tests/quality/openrouter_eval_service.py`:
Added `default_reasoning_effort` constructor param. run_tool_loop
and run_json_prompt now fall back to the instance default when the
per-call kwarg is unset, so the eval matrix can inject the effort
tier per candidate WITHOUT modifying production
resume_builder_service.run_tool_loop callers (which never pass the
kwarg). 22/22 adapter unit tests still pass.
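The wiring described above can be sketched roughly as follows. This is a minimal illustration, not the code in the commit: `EvalServiceSketch` stands in for `OpenRouterEvalService`, `build_arm` for `_build_arm`, and the slug strings and effort values (`"medium"`, `"low"`) are placeholder assumptions.

```python
from typing import Any, Optional

# Hypothetical slate: each label now maps to a dict carrying the slug and
# the reasoning-effort tier (slug values are placeholders, not real IDs).
_AGENTIC_CANDIDATES: dict[str, dict[str, str]] = {
    "gpt-5.4-mini@med": {"slug": "placeholder/gpt-5.4-mini", "reasoning_effort": "medium"},
    "gpt-5.4-mini@low": {"slug": "placeholder/gpt-5.4-mini", "reasoning_effort": "low"},
}


class EvalServiceSketch:
    """Stand-in for the eval adapter; only the fallback logic is shown."""

    def __init__(self, slug: str, default_reasoning_effort: Optional[str] = None) -> None:
        self.slug = slug
        self.default_reasoning_effort = default_reasoning_effort

    def _resolve_effort(self, reasoning_effort: Optional[str]) -> Optional[str]:
        # A per-call value wins; otherwise fall back to the instance default.
        if reasoning_effort is not None:
            return reasoning_effort
        return self.default_reasoning_effort

    def run_tool_loop(
        self,
        messages: list[dict[str, Any]],
        reasoning_effort: Optional[str] = None,
    ) -> dict[str, Any]:
        # A real adapter would call the model here; this sketch only returns
        # the request parameters so the fallback is observable.
        return {
            "model": self.slug,
            "reasoning_effort": self._resolve_effort(reasoning_effort),
            "messages": messages,
        }


def build_arm(label: str) -> tuple[str, EvalServiceSketch]:
    """Build one comparison arm, labelled with its effort tier."""
    entry = _AGENTIC_CANDIDATES[label]
    service = EvalServiceSketch(
        entry["slug"], default_reasoning_effort=entry["reasoning_effort"]
    )
    return label, service
```

Because the default lives on the instance, callers that never pass `reasoning_effort` still pick up the tier chosen for the arm, which is the property the commit relies on to leave production callers untouched.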
Result on the same 16 OpenRouter scenarios:
  candidate          raw    eff    lat (total)  cost (total)
  gpt-5.4-mini@med   14/16  16/16  247s         $0.144
  gpt-5.4-mini@low   15/16  16/16  200s         $0.127
Re-classifying raw fails: both candidates trip the curly-apostrophe
matcher bug Slice 1H flagged ("can't" / "couldn't" with U+2019).
Real behavior PASSES on every flagged miss — the resume-builder
runner just never got the normalisation patch the assistant runner
has.
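For illustration, a normalisation step of the kind the resume-builder runner is missing might look like the sketch below. The function names are hypothetical and the actual patch in the assistant runner may differ; the idea is only to map typographic apostrophes to ASCII before substring matching.

```python
# Map typographic single quotes to the ASCII apostrophe before matching.
_APOSTROPHE_MAP = str.maketrans({"\u2019": "'", "\u2018": "'"})


def normalise(text: str) -> str:
    """Fold curly apostrophes and case so the matcher compares like with like."""
    return text.translate(_APOSTROPHE_MAP).casefold()


def contains_phrase(reply: str, phrase: str) -> bool:
    """Return True if `phrase` occurs in `reply` after normalisation."""
    return normalise(phrase) in normalise(reply)
```

With this in place, a reply containing "can\u2019t" matches an expectation written as "can't", which is the case the raw scores miss.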
Comparison vs Slice 1H baselines on these 16 scenarios:
  candidate          eff    lat/scn  cost (total)
  openai-via-or      16/16   8.3s    ~$0.12
  gpt-5.4-mini@low   16/16  12.5s    $0.127   <-- new
  gpt-5.4-mini@med   16/16  15.4s    $0.144   <-- new
  sonnet-4.5         14/16  17.1s    $0.977
  deepseek           14/16  57.6s    $0.173
  gemini             12/16  34.4s    $0.919
Mini matches gpt-5.4 quality AND beats Sonnet / Gemini / DeepSeek
on this surface (which all fail structured_payload and/or
proactive_offer). BUT on this surface the mini cost story does NOT
transfer:
* gpt-5.4 via OpenRouter (openai-via-or above): 8.3s/scn, ~$0.12 total
* mini@low: 12.5s/scn, $0.127 total
* mini@med: 15.4s/scn, $0.144 total
Mini's 5x per-token discount is eaten by reasoning-token overhead.
On the short retrieve-and-refuse assistant surface reasoning_effort
barely fires; on the long multi-turn agentic resume-builder surface
the model thinks before AND after each tool call. Net cost ends up
similar to gpt-5.4 — and latency is 50-85% higher.
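The arithmetic behind these figures can be checked with a short script. The inputs are the numbers reported above; the last column is an inference, not a measurement: it assumes the 5x discount applies uniformly to all billed tokens and that the ~$0.12 baseline is exact.

```python
SCENARIOS = 16
BASELINE_LAT_PER_SCN = 8.3   # seconds per scenario, gpt-5.4 via OpenRouter
BASELINE_COST = 0.12         # USD total, approximate in the source
ASSUMED_DISCOUNT = 5.0       # assumption: uniform per-token discount for mini

MINI_RUNS = {
    "mini@low": {"lat_total_s": 200.0, "cost_usd": 0.127},
    "mini@med": {"lat_total_s": 247.0, "cost_usd": 0.144},
}


def compare(run: dict[str, float]) -> dict[str, float]:
    """Derive per-scenario latency, overhead, and cost ratio against the baseline."""
    lat_per_scn = run["lat_total_s"] / SCENARIOS
    cost_ratio = run["cost_usd"] / BASELINE_COST
    return {
        "lat_per_scn": lat_per_scn,
        "latency_overhead": lat_per_scn / BASELINE_LAT_PER_SCN - 1.0,
        "cost_ratio": cost_ratio,
        # If each token is 5x cheaper, matching the baseline bill implies
        # roughly 5x the billed tokens; scale by the observed cost ratio.
        "implied_token_multiple": cost_ratio * ASSUMED_DISCOUNT,
    }


if __name__ == "__main__":
    for label, run in MINI_RUNS.items():
        r = compare(run)
        print(
            f"{label}: {r['lat_per_scn']:.1f}s/scn, "
            f"+{r['latency_overhead']:.0%} latency, "
            f"{r['cost_ratio']:.2f}x cost, "
            f"~{r['implied_token_multiple']:.1f}x tokens"
        )
```

Under those assumptions the latency overhead comes out at roughly 51% and 86%, and the implied billed-token multiple at roughly 5.3x and 6.0x, which is consistent with reasoning tokens absorbing the discount.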
**Recommendation:** keep gpt-5.4 as the resume-builder production
default. Mini doesn't earn the switch on this surface. The finding is
surface-specific:
* Workspace assistant → mini@low (Slice 1K result holds)
* Resume builder → gpt-5.4 (Slice 1H result holds)
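Expressed as configuration, the per-surface split could look like the sketch below. The surface keys, model strings, and `model_for` helper are hypothetical names for illustration; the commit does not show how production defaults are stored.

```python
from typing import Optional

# Hypothetical per-surface defaults reflecting the recommendation above.
SURFACE_DEFAULTS: dict[str, dict[str, Optional[str]]] = {
    "workspace_assistant": {"model": "gpt-5.4-mini", "reasoning_effort": "low"},
    "resume_builder": {"model": "gpt-5.4", "reasoning_effort": None},
}


def model_for(surface: str) -> dict[str, Optional[str]]:
    """Look up the default model settings for a surface; unknown surfaces raise."""
    try:
        return SURFACE_DEFAULTS[surface]
    except KeyError:
        raise KeyError(f"no default model configured for surface {surface!r}") from None
```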
Design lesson worth keeping: reasoning models shine when the
inference is short and structured; they're a wash when the agentic
loop is already providing the "reasoning" externally.
DEVLOG Day 61 updated. Full read-out:
`docs/eval-runs/2026-05-21-resume-builder-mini-eval-report.md`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

1 parent 90234e1 commit 04f3bc6
6 files changed
Lines changed: 1423 additions & 17 deletions
Lines changed: 105 additions & 0 deletions
Lines changed: 106 additions & 0 deletions