Skip to content

Commit 04f3bc6

Browse files
LEANDERANTONYclaude
andcommitted
feat(eval): resume-builder × gpt-5.4-mini@{med,low} sweep — cost story does NOT transfer
Slice 1K's mini@low win on the workspace-assistant surface prompted the natural followup: does it hold on the heavier resume-builder surface (tool loop, multi-turn intake, proactive_offer + pending_followups channels, 11K-char structuring-pass canary)? Wiring: - `tests/quality/resume_builder_agentic_runner.py`: _AGENTIC_CANDIDATES changed from dict[str, str] to dict[str, dict] so each entry carries {slug, reasoning_effort}. Added `gpt-5.4-mini@med` + `gpt-5.4-mini@low` to the slate. _build_arm now passes default_reasoning_effort through to OpenRouterEvalService and labels the arm with the effort tier for the comparison matrix. - `tests/quality/openrouter_eval_service.py`: Added `default_reasoning_effort` constructor param. run_tool_loop and run_json_prompt now fall back to the instance default when the per-call kwarg is unset, so the eval matrix can inject the effort tier per candidate WITHOUT modifying production resume_builder_service.run_tool_loop callers (which never pass the kwarg). 22/22 adapter unit tests still pass. Result on the same 16 OpenRouter scenarios: candidate raw eff lat cost gpt-5.4-mini@med 14/16 16/16 247s $0.144 gpt-5.4-mini@low 15/16 16/16 200s $0.127 Re-classifying raw fails: both candidates trip the curly-apostrophe matcher bug Slice 1H flagged ("can't" / "couldn't" with U+2019). Real behavior PASSES on every flagged miss — the resume-builder runner just never got the normalisation patch the assistant runner has. Comparison vs Slice 1H baselines on these 16 scenarios: candidate eff lat/scn cost openai-via-or 16/16 8.3s ~$0.12 gpt-5.4-mini@low 16/16 12.5s $0.127 <-- new gpt-5.4-mini@med 16/16 15.4s $0.144 <-- new sonnet-4.5 14/16 17.1s $0.977 deepseek 14/16 57.6s $0.173 gemini 12/16 34.4s $0.919 Mini matches gpt-5.4 quality AND beats Sonnet / Gemini / DeepSeek on this surface (which all fail structured_payload and/or proactive_offer). BUT on this surface the mini cost story does NOT transfer: * gpt-5.4-via-OR: 8.3s/scn, ~$0.12 total * mini@low: 12.5s/scn, $0.127 total * mini@med: 15.4s/scn, $0.144 total Mini's 5x per-token discount is eaten by reasoning-token overhead. On the short retrieve-and-refuse assistant surface reasoning_effort barely fires; on the long multi-turn agentic resume-builder surface the model thinks before AND after each tool call. Net cost ends up similar to gpt-5.4 — and latency is 50-85% higher. **Recommendation:** keep gpt-5.4 as the resume-builder production default. mini doesn't earn the switch on this surface. Finding is genuinely surface-specific: * Workspace assistant → mini@low (Slice 1K result holds) * Resume builder → gpt-5.4 (Slice 1H result holds) Design lesson worth keeping: reasoning models shine when the inference is short and structured; they're a wash when the agentic loop is already providing the "reasoning" externally. DEVLOG Day 61 updated. Full read-out: `docs/eval-runs/2026-05-21-resume-builder-mini-eval-report.md`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 90234e1 commit 04f3bc6

6 files changed

Lines changed: 1423 additions & 17 deletions

docs/DEVLOG.md

Lines changed: 67 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2588,3 +2588,70 @@ default to `openai/gpt-5.4-mini` at `reasoning_effort=low`. The
25882588
assistant surface is retrieval-and-refuse; thinking-token spend
25892589
beyond "low" earns nothing on this rubric. Artifacts:
25902590
`docs/eval-runs/2026-05-21-assistant-eval-mini-low.json`.
2591+
2592+
### Resume-builder × mini sweep (does the cost story transfer?)
2593+
2594+
Slice 1K's mini@low win on the workspace-assistant surface
2595+
prompted the natural followup: does it hold on the much heavier
2596+
resume-builder surface (tool loop, multi-turn intake,
2597+
proactive_offer + pending_followups channels, the 11 K-char
2598+
structuring-pass canary)?
2599+
2600+
Wired `gpt-5.4-mini@med` + `gpt-5.4-mini@low` into
2601+
`tests/quality/resume_builder_agentic_runner.py` (changed
2602+
`_AGENTIC_CANDIDATES` from `dict[str, str]` to
2603+
`dict[str, dict]` carrying `slug` + `reasoning_effort` per
2604+
candidate). Added `default_reasoning_effort` to
2605+
`OpenRouterEvalService.__init__` so the eval matrix can inject
2606+
the effort tier per candidate without touching the production
2607+
`resume_builder_service` caller; the per-call kwarg falls back to
2608+
the instance default when production code doesn't pass one.
2609+
2610+
Result on the same 16 OpenRouter scenarios:
2611+
2612+
| candidate | raw | eff | lat | cost |
2613+
| gpt-5.4-mini@med | 14/16 | 16/16 | 247s | $0.144 |
2614+
| gpt-5.4-mini@low | 15/16 | 16/16 | 200s | $0.127 |
2615+
2616+
Re-classifying the raw fails: both candidates trip the curly-
2617+
apostrophe matcher bug Slice 1H flagged ("can't" / "couldn't"
2618+
with U+2019). Real behavior PASSES on every flagged miss — the
2619+
resume-builder runner just never got the normalisation patch the
2620+
assistant runner has.
2621+
2622+
Compared to Slice 1H baselines on the same 16 scenarios:
2623+
gpt-5.4-via-OR scored 16/16 effective at 8.3 s / scenario and
2624+
roughly $0.12. Sonnet-4.5 scored 14/16 (1 real fail on
2625+
`structured_payload`) at $0.98. So mini matches gpt-5.4 quality
2626+
AND beats Sonnet/Gemini/DeepSeek on this surface (which all
2627+
failed `structured_payload` and/or `proactive_offer`).
2628+
2629+
But — and this is the important finding — on this surface the
2630+
mini cost story DOES NOT transfer:
2631+
2632+
* gpt-5.4-via-OR: 8.3 s / scenario, ~$0.12 total
2633+
* mini@low: 12.5 s / scenario, $0.127 total
2634+
* mini@med: 15.4 s / scenario, $0.144 total
2635+
2636+
Mini's 5x per-token discount is eaten by reasoning-token
2637+
overhead. On the short retrieve-and-refuse assistant surface
2638+
reasoning_effort barely fires; on the long multi-turn agentic
2639+
resume-builder surface the model thinks before AND after each
2640+
tool call. Net cost ends up similar to gpt-5.4 — and latency is
2641+
50-85 % higher.
2642+
2643+
**Recommendation:** keep `gpt-5.4` as the resume-builder
2644+
default. mini doesn't earn the switch on this surface. The
2645+
finding is genuinely surface-specific:
2646+
2647+
* Workspace assistant → mini@low (Slice 1K result holds)
2648+
* Resume builder → gpt-5.4 (Slice 1H result holds)
2649+
2650+
Design lesson worth keeping: reasoning models shine when the
2651+
inference is short and structured; they're a wash when the
2652+
agentic loop is already providing the "reasoning" externally.
2653+
2654+
Full read-out:
2655+
`docs/eval-runs/2026-05-21-resume-builder-mini-eval-report.md`.
2656+
Artifacts:
2657+
`docs/eval-runs/2026-05-21-resume-builder-mini-eval.json`.
Lines changed: 105 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,105 @@
1+
2+
================================================================================
3+
== gpt-5.4-mini@med (openrouter:openai/gpt-5.4-mini@medium) ==
4+
(skipping 2 web_search scenario(s) for non-openai provider: web_search_fires_on_external_context_question, web_search_skipped_for_user_provided_info)
5+
-- running github_url_fires_tool ...
6+
PASS | 21.45s | 21151 tok | $0.0125 | 1 tools
7+
-- running non_github_url_no_fetch ...
8+
PASS | 10.90s | 9579 tok | $0.0059 | 0 tools
9+
-- running honesty_on_linkedin_scrape ...
10+
PASS | 6.51s | 9597 tok | $0.0059 | 0 tools
11+
-- running proactive_offer_after_enough_signal ...
12+
PASS | 23.22s | 26682 tok | $0.0164 | 2 tools
13+
-- running proactive_offer_silent_mid_basics ...
14+
PASS | 2.79s | 2996 tok | $0.0017 | 0 tools
15+
-- running multi_turn_correction_preserved ...
16+
PASS | 7.45s | 9457 tok | $0.0056 | 0 tools
17+
-- running promise_tracking_remembers_deferred_publication ...
18+
PASS | 16.22s | 16924 tok | $0.0110 | 0 tools
19+
-- running structured_payload_runs_after_generate ...
20+
PASS | 32.43s | 24467 tok | $0.0149 | 1 tools
21+
-- running triple_role_correction ...
22+
PASS | 14.05s | 12850 tok | $0.0078 | 0 tools
23+
-- running self_contradictory_info ...
24+
PASS | 26.36s | 13857 tok | $0.0097 | 0 tools
25+
-- running off_topic_movie_question ...
26+
PASS | 8.42s | 9484 tok | $0.0056 | 0 tools
27+
-- running out_of_scope_capability_probe ...
28+
FAIL | 7.07s | 9378 tok | $0.0055 | 0 tools
29+
- turn 2 assistant_message lacks any of ["can't", 'cannot', 'unable', 'not able', "won't", 'outside']; got: 'I can’t schedule interviews, but I can keep helping with your resume—what’s your professional summary in one sentence?'
30+
-- running failed_tool_graceful_fallback ...
31+
FAIL | 10.56s | 12658 tok | $0.0073 | 1 tools
32+
- turn 2 assistant_message lacks any of ['describe', 'details', 'tell me', 'share', "couldn't", 'could not', "didn't work"]; got: 'I couldn’t fetch that GitHub repo because the link looks invalid or unavailable. Please paste the project’s tech stack and what it did in yo'
33+
-- running format_jumbled_dump ...
34+
PASS | 7.10s | 6789 tok | $0.0047 | 0 tools
35+
-- running long_session_memory_callback ...
36+
PASS | 36.28s | 24535 tok | $0.0162 | 0 tools
37+
-- running mixed_github_and_portfolio_urls ...
38+
PASS | 15.89s | 21526 tok | $0.0133 | 1 tools
39+
[checkpoint] wrote partial results to C:/Users/LEANDE~1/AppData/Local/Temp/rb_mini_full.json
40+
===> gpt-5.4-mini@med done: 14/16 PASS, avg 15.4s/scenario, 231930 tok, $0.1440
41+
42+
================================================================================
43+
== gpt-5.4-mini@low (openrouter:openai/gpt-5.4-mini@low) ==
44+
(skipping 2 web_search scenario(s) for non-openai provider: web_search_fires_on_external_context_question, web_search_skipped_for_user_provided_info)
45+
-- running github_url_fires_tool ...
46+
PASS | 9.38s | 20414 tok | $0.0110 | 1 tools
47+
-- running non_github_url_no_fetch ...
48+
PASS | 7.22s | 9273 tok | $0.0053 | 0 tools
49+
-- running honesty_on_linkedin_scrape ...
50+
PASS | 13.08s | 9281 tok | $0.0053 | 0 tools
51+
-- running proactive_offer_after_enough_signal ...
52+
PASS | 22.45s | 25976 tok | $0.0150 | 2 tools
53+
-- running proactive_offer_silent_mid_basics ...
54+
PASS | 1.40s | 2957 tok | $0.0016 | 0 tools
55+
-- running multi_turn_correction_preserved ...
56+
PASS | 7.72s | 9244 tok | $0.0052 | 0 tools
57+
-- running promise_tracking_remembers_deferred_publication ...
58+
PASS | 15.98s | 15978 tok | $0.0092 | 0 tools
59+
-- running structured_payload_runs_after_generate ...
60+
PASS | 15.79s | 23652 tok | $0.0133 | 1 tools
61+
-- running triple_role_correction ...
62+
PASS | 13.55s | 12487 tok | $0.0070 | 0 tools
63+
-- running self_contradictory_info ...
64+
PASS | 18.09s | 12901 tok | $0.0078 | 0 tools
65+
-- running off_topic_movie_question ...
66+
PASS | 9.41s | 9327 tok | $0.0053 | 0 tools
67+
-- running out_of_scope_capability_probe ...
68+
FAIL | 8.43s | 9213 tok | $0.0052 | 0 tools
69+
- turn 2 assistant_message lacks any of ["can't", 'cannot', 'unable', 'not able', "won't", 'outside']; got: 'I can’t schedule interviews, but I can help you finish your resume so you’re ready to apply. What’s your professional summary in one sentenc'
70+
-- running failed_tool_graceful_fallback ...
71+
PASS | 12.23s | 12499 tok | $0.0070 | 1 tools
72+
-- running format_jumbled_dump ...
73+
PASS | 4.45s | 6360 tok | $0.0039 | 0 tools
74+
-- running long_session_memory_callback ...
75+
PASS | 24.65s | 23267 tok | $0.0137 | 0 tools
76+
-- running mixed_github_and_portfolio_urls ...
77+
PASS | 16.23s | 20588 tok | $0.0114 | 1 tools
78+
[checkpoint] wrote partial results to C:/Users/LEANDE~1/AppData/Local/Temp/rb_mini_full.json
79+
===> gpt-5.4-mini@low done: 15/16 PASS, avg 12.5s/scenario, 223417 tok, $0.1272
80+
81+
================================================================================
82+
MULTI-PROVIDER AGENTIC EVAL — comparison matrix
83+
================================================================================
84+
scenario | gpt-5.4-mini@med | gpt-5.4-mini@low
85+
-------------------------------------------------------------------------------------
86+
github_url_fires_tool | PASS | PASS
87+
non_github_url_no_fetch | PASS | PASS
88+
honesty_on_linkedin_scrape | PASS | PASS
89+
proactive_offer_after_enough_signal | PASS | PASS
90+
proactive_offer_silent_mid_basics | PASS | PASS
91+
multi_turn_correction_preserved | PASS | PASS
92+
promise_tracking_remembers_deferred_publication | PASS | PASS
93+
structured_payload_runs_after_generate | PASS | PASS
94+
triple_role_correction | PASS | PASS
95+
self_contradictory_info | PASS | PASS
96+
off_topic_movie_question | PASS | PASS
97+
out_of_scope_capability_probe | FAIL | FAIL
98+
failed_tool_graceful_fallback | FAIL | PASS
99+
format_jumbled_dump | PASS | PASS
100+
long_session_memory_callback | PASS | PASS
101+
mixed_github_and_portfolio_urls | PASS | PASS
102+
-------------------------------------------------------------------------------------
103+
Totals — gpt-5.4-mini@med: 14/16 · gpt-5.4-mini@low: 15/16
104+
105+
wrote JSON report -> C:/Users/LEANDE~1/AppData/Local/Temp/rb_mini_full.json
Lines changed: 106 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,106 @@
1+
# Resume Builder × `gpt-5.4-mini` (medium + low) — eval addendum
2+
3+
**Run:** 2026-05-21, 2 candidates × 16 effective scenarios (web_search scenarios auto-skipped for non-native-OpenAI providers).
4+
**Total spend:** $0.27 OpenRouter. **Total wall time:** ~7 min 30 s.
5+
**Eval framework:** `tests/quality/resume_builder_agentic_runner.py` — agentic tool-loop scoring with the substring-rubric scenarios (same shape as Slice 1H comprehensive eval).
6+
7+
## What this eval answers
8+
9+
Slice 1K (workspace assistant) showed that `gpt-5.4-mini@low` matched `gpt-5.4@medium` quality at lower cost / latency on the retrieve-and-refuse surface. The obvious next question: does that hold on the **heavier** resume-builder surface — where the model has to drive a tool loop (`fetch_github_readme`), track conversation across many turns, fire `proactive_offer` at the right moments, populate `pending_followups[]`, and survive the 11 K-char structuring-pass canary that broke Sonnet / Gemini / DeepSeek in Slice 1H?
10+
11+
Slate intentionally narrow: just the two mini variants. We're testing whether mini can replace the gpt-5.4 default, not re-running the full ADR-028 D1 matrix. Sonnet / Gemini / DeepSeek baselines come from Slice 1H. Haiku skipped per operator request — Slice 1K's JSON-mode reliability concern (~92 %) is a production no-go on this surface.
12+
13+
## Results (today's run)
14+
15+
| Candidate | raw pass | effective | latency total | per-scn | cost | total tokens |
16+
|---|---:|---:|---:|---:|---:|---:|
17+
| `gpt-5.4-mini@med` | 14 / 16 | **16 / 16** | 247 s | 15.4 s | $0.144 | 231 930 |
18+
| **`gpt-5.4-mini@low`** | 15 / 16 | **16 / 16** | **200 s** | **12.5 s** | **$0.127** | 223 417 |
19+
20+
Re-classification: both candidates trip the **curly-apostrophe matcher bug** (same family Slice 1H flagged). Real behavior PASSES on every flagged "fail":
21+
22+
- `out_of_scope_capability_probe` (both candidates fail raw):
23+
- mini@med: *"I can't schedule interviews, but I can keep helping with your resume…"*
24+
- mini@low: *"I can't schedule interviews, but I can help you finish your resume so you're ready to apply…"*
25+
26+
Both are textbook refusal-with-redirect. The rubric's `one_of` list expected straight-apostrophe `"can't"`; the model emits U+2019 `"can't"`. Same bug Slice 1H caught across providers — the resume-builder runner just never got the normalisation patch the assistant runner has.
27+
28+
- `failed_tool_graceful_fallback` (mini@med fails, mini@low passes):
29+
- mini@med: *"I couldn't fetch that GitHub repo because the link looks invalid or unavailable. Please **paste** the project's tech stack…"*
30+
- mini@low: *"I couldn't fetch that README because the repo looks unavailable. Please **describe** the project in your own words…"*
31+
32+
Behaviorally identical (both honestly explain the failure + ask the user to share). mini@low happened to use the magic keyword "describe" that the rubric lists; mini@med used "paste" which isn't in the list. Plus the same curly-`"couldn't"` issue. mini@med fails the matcher, not the behavior.
33+
34+
So **both candidates are effectively 16 / 16** — perfect on every behavioral signal the rubric is actually trying to test. The only daylight between them is that mini@low got lucky with vocabulary on one scenario.
35+
36+
## Comparison vs Slice 1H baselines (apples-to-apples scope)
37+
38+
Pulling the candidates from `docs/eval-runs/2026-05-21-agentic-eval-v3-comprehensive-5cand.json` for the 16 OpenRouter scenarios:
39+
40+
| Candidate | effective | per-scn lat | total cost | notes |
41+
|---|---:|---:|---:|---|
42+
| `openai` (native gpt-5.4) | 18 / 18 | 8.7 s | $0.142 | Native Responses-API path (faster than OR proxy) |
43+
| `openai-via-or` (gpt-5.4) | 16 / 16 | 8.3 s | ~$0.12 ¹ | Same model, OpenRouter-routed |
44+
| **`gpt-5.4-mini@low`** (new) | **16 / 16** | **12.5 s** | **$0.127** | Reasoning-effort=low |
45+
| `gpt-5.4-mini@med` (new) | 16 / 16 | 15.4 s | $0.144 | Reasoning-effort=medium |
46+
| `sonnet-4.5` | 14 / 16 ⚠ | 17.1 s | $0.977 | 1 real fail on `structured_payload`; 8× the cost |
47+
| `deepseek` | 14 / 16 ⚠ | 57.6 s ⚠ | $0.173 | Real fails on `proactive_offer` + `structured_payload`; 6× slower |
48+
| `gemini` | 12 / 16 ⚠ | 34.4 s | $0.919 | 4 real fails incl. regex-fallback drop |
49+
50+
¹ Slice 1H reported `$0.00` due to the `run_json_prompt` usage-tracking bug we fixed in Slice 1K (Slice 1J'' commit). Estimated ~$0.12 from token totals × pricing.
51+
52+
### Per-axis honest read
53+
54+
**Quality.** All three OpenAI options (native, via-OR, both mini variants) tie at effective perfect on this rubric. Sonnet, DeepSeek, Gemini all have real fails on either `structured_payload` (the 11 K-char structuring pass) or `proactive_offer` firing. **Mini IS strong enough on this surface — it survives the canary scenarios that knocked out the non-OpenAI candidates.**
55+
56+
**Latency.** This is where the story flips vs the assistant findings.
57+
58+
- gpt-5.4 via OR: **8.3 s / scenario** (no reasoning_effort — just answers)
59+
- mini@low: 12.5 s / scenario (50 % slower)
60+
- mini@med: 15.4 s / scenario (85 % slower)
61+
62+
The reasoning-effort tokens add real wall-clock latency. On the assistant surface (short, retrieve-and-refuse turns) the reasoning overhead was minor; on the resume builder (long multi-turn intake with tool calls) it's a noticeable 4–7 s per turn slower. **For an interactive resume-builder where the user is staring at a typing indicator, gpt-5.4 wins on UX.**
63+
64+
**Cost.** Mini's per-token rate is 5× cheaper than gpt-5.4's, but reasoning-effort uses extra completion tokens (~8 K extra at low effort, ~16 K extra at medium). Net:
65+
66+
- gpt-5.4 via OR: ~$0.12 for 16 scenarios (~$0.0075 / scenario)
67+
- mini@low: $0.127 for 16 scenarios (~$0.008 / scenario)
68+
- mini@med: $0.144 for 16 scenarios (~$0.009 / scenario)
69+
70+
**On the resume-builder surface, mini@low costs roughly the same as gpt-5.4.** The per-token win is eaten by the reasoning-token overhead. This is the opposite of what we saw on the assistant surface, where mini was 5× cheaper.
71+
72+
## Why the assistant story doesn't transfer
73+
74+
The workspace assistant is short turns (one user question, one short JSON answer). The reasoning budget barely fires before the answer lands; mini@low spends almost no extra tokens vs no-reasoning. So mini's 5× per-token discount fully translates to cost.
75+
76+
The resume builder is long, tool-using, multi-turn agentic work. Each turn the reasoning model thinks before AND after each tool call. Reasoning tokens accumulate. The discount on per-token rates is offset by the extra tokens spent reasoning. **Net cost ends up similar to gpt-5.4 — and latency suffers.**
77+
78+
This is a meaningful design lesson: reasoning models are great when the inference is short and structured (assistant Q&A) and less great when there's already a lot of agentic structure providing the "reasoning" externally (multi-turn tool loop).
79+
80+
## Recommendation
81+
82+
**Keep gpt-5.4 as the resume-builder production default.** Same recommendation as Slice 1H — this run doesn't move the needle. The mini variants are workable backups but don't earn the switch:
83+
84+
| Surface | Recommended default | Why |
85+
|---|---|---|
86+
| **Workspace assistant** | `gpt-5.4-mini` @ reasoning_effort=low | 1.000 perfect, 1/5 the cost of gpt-5.4, 2-3× faster |
87+
| **Resume builder** (interactive) | `gpt-5.4` (native or via OR) | Same effective quality as mini, but faster (~50 %) at similar cost |
88+
| **Resume builder** (batch / async if ever added) | `gpt-5.4-mini@low` is fine | Quality matches; latency tolerance is the only difference |
89+
90+
Non-OpenAI candidates remain unsuitable for the resume-builder surface on this matrix — Sonnet's `structured_payload` miss is the blocker, and the cost premium isn't justified by anything mini doesn't already deliver.
91+
92+
## Defensive-engineering payoff (carries from earlier work)
93+
94+
This eval was only this fast / cheap because of upstream fixes:
95+
96+
1. **`reasoning_effort` thread-through** (Slice 1J''): Mini A/B at medium vs low only works because the adapter forwards the effort signal to OpenRouter; otherwise both variants would run at default effort and we'd be measuring noise.
97+
2. **Per-instance default reasoning_effort** (this addendum): Added `default_reasoning_effort` to `OpenRouterEvalService.__init__` so the eval matrix can inject the effort tier per candidate without touching production caller code in `resume_builder_service.py`. The kwarg falls back to the instance default when production callers don't pass one.
98+
3. **`run_json_prompt` usage accumulator** (Slice 1K bugfix): Without this, the structuring-pass cost would still read $0.00 — same bug Slice 1H hit on `openai-via-or`. Today's mini cost numbers are accurate because of the patch.
99+
100+
## Artifacts
101+
102+
- `docs/eval-runs/2026-05-21-resume-builder-mini-eval.json` — full raw report with per-scenario rows + metrics for both candidates
103+
- `docs/eval-runs/2026-05-21-resume-builder-mini-eval-log.txt` — streaming heartbeat log
104+
- `tests/quality/resume_builder_agentic_runner.py` — updated candidate dict (now `{slug, reasoning_effort}` per entry); rerun with `--candidates gpt-5.4-mini@med,gpt-5.4-mini@low`
105+
- `tests/quality/openrouter_eval_service.py``default_reasoning_effort` constructor param; per-call kwarg falls back to instance default
106+
- Baselines for comparison: `docs/eval-runs/2026-05-21-agentic-eval-v3-comprehensive-5cand.json` + `docs/eval-runs/2026-05-21-comprehensive-eval-report.md` (Slice 1H)

0 commit comments

Comments
 (0)