LEANDERANTONY
diff --git a/‎docs/DEVLOG.md‎
Lines changed: 67 additions & 0 deletions b/‎docs/DEVLOG.md‎
Lines changed: 67 additions & 0 deletions
diff --git a/‎docs/eval-runs/2026-05-21-resume-builder-mini-eval-log.txt‎
Lines changed: 105 additions & 0 deletions b/‎docs/eval-runs/2026-05-21-resume-builder-mini-eval-log.txt‎
Lines changed: 105 additions & 0 deletions
diff --git a/‎docs/eval-runs/2026-05-21-resume-builder-mini-eval-report.md‎
Lines changed: 106 additions & 0 deletions b/‎docs/eval-runs/2026-05-21-resume-builder-mini-eval-report.md‎
Lines changed: 106 additions & 0 deletions
@@ -2588,3 +2588,70 @@ default to `openai/gpt-5.4-mini` at `reasoning_effort=low`. The
 assistant surface is retrieval-and-refuse; thinking-token spend
 beyond "low" earns nothing on this rubric. Artifacts:
 `docs/eval-runs/2026-05-21-assistant-eval-mini-low.json`.
+
+### Resume-builder × mini sweep (does the cost story transfer?)
+
+Slice 1K's mini@low win on the workspace-assistant surface
+prompted the natural followup: does it hold on the much heavier
+resume-builder surface (tool loop, multi-turn intake,
+proactive_offer + pending_followups channels, the 11 K-char
+structuring-pass canary)?
+
+Wired `gpt-5.4-mini@med` + `gpt-5.4-mini@low` into
+`tests/quality/resume_builder_agentic_runner.py` (changed
+`_AGENTIC_CANDIDATES` from `dict[str, str]` to
+`dict[str, dict]` carrying `slug` + `reasoning_effort` per
+candidate). Added `default_reasoning_effort` to
+`OpenRouterEvalService.__init__` so the eval matrix can inject
+the effort tier per candidate without touching the production
+`resume_builder_service` caller; the per-call kwarg falls back to
+the instance default when production code doesn't pass one.
+
+Result on the same 16 OpenRouter scenarios:
+
+  | candidate         | raw   | eff   | lat    | cost   |
+  | gpt-5.4-mini@med  | 14/16 | 16/16 | 247s   | $0.144 |
+  | gpt-5.4-mini@low  | 15/16 | 16/16 | 200s   | $0.127 |
+
+Re-classifying the raw fails: both candidates trip the curly-
+apostrophe matcher bug Slice 1H flagged ("can't" / "couldn't"
+with U+2019). Real behavior PASSES on every flagged miss — the
+resume-builder runner just never got the normalisation patch the
+assistant runner has.
+
+Compared to Slice 1H baselines on the same 16 scenarios:
+gpt-5.4-via-OR scored 16/16 effective at 8.3 s / scenario and
+roughly $0.12. Sonnet-4.5 scored 14/16 (1 real fail on
+`structured_payload`) at $0.98. So mini matches gpt-5.4 quality
+AND beats Sonnet/Gemini/DeepSeek on this surface (which all
+failed `structured_payload` and/or `proactive_offer`).
+
+But — and this is the important finding — on this surface the
+mini cost story DOES NOT transfer:
+
+  * gpt-5.4-via-OR: 8.3 s / scenario, ~$0.12 total
+  * mini@low:       12.5 s / scenario, $0.127 total
+  * mini@med:       15.4 s / scenario, $0.144 total
+
+Mini's 5x per-token discount is eaten by reasoning-token
+overhead. On the short retrieve-and-refuse assistant surface
+reasoning_effort barely fires; on the long multi-turn agentic
+resume-builder surface the model thinks before AND after each
+tool call. Net cost ends up similar to gpt-5.4 — and latency is
+50-85 % higher.
+
+**Recommendation:** keep `gpt-5.4` as the resume-builder
+default. mini doesn't earn the switch on this surface. The
+finding is genuinely surface-specific:
+
+  * Workspace assistant → mini@low (Slice 1K result holds)
+  * Resume builder      → gpt-5.4 (Slice 1H result holds)
+
+Design lesson worth keeping: reasoning models shine when the
+inference is short and structured; they're a wash when the
+agentic loop is already providing the "reasoning" externally.
+
+Full read-out:
+`docs/eval-runs/2026-05-21-resume-builder-mini-eval-report.md`.
+Artifacts:
+`docs/eval-runs/2026-05-21-resume-builder-mini-eval.json`.
@@ -0,0 +1,105 @@
+
+================================================================================
+== gpt-5.4-mini@med (openrouter:openai/gpt-5.4-mini@medium) ==
+   (skipping 2 web_search scenario(s) for non-openai provider: web_search_fires_on_external_context_question, web_search_skipped_for_user_provided_info)
+-- running github_url_fires_tool ...
+   PASS |  21.45s |  21151 tok | $0.0125 | 1 tools
+-- running non_github_url_no_fetch ...
+   PASS |  10.90s |   9579 tok | $0.0059 | 0 tools
+-- running honesty_on_linkedin_scrape ...
+   PASS |   6.51s |   9597 tok | $0.0059 | 0 tools
+-- running proactive_offer_after_enough_signal ...
+   PASS |  23.22s |  26682 tok | $0.0164 | 2 tools
+-- running proactive_offer_silent_mid_basics ...
+   PASS |   2.79s |   2996 tok | $0.0017 | 0 tools
+-- running multi_turn_correction_preserved ...
+   PASS |   7.45s |   9457 tok | $0.0056 | 0 tools
+-- running promise_tracking_remembers_deferred_publication ...
+   PASS |  16.22s |  16924 tok | $0.0110 | 0 tools
+-- running structured_payload_runs_after_generate ...
+   PASS |  32.43s |  24467 tok | $0.0149 | 1 tools
+-- running triple_role_correction ...
+   PASS |  14.05s |  12850 tok | $0.0078 | 0 tools
+-- running self_contradictory_info ...
+   PASS |  26.36s |  13857 tok | $0.0097 | 0 tools
+-- running off_topic_movie_question ...
+   PASS |   8.42s |   9484 tok | $0.0056 | 0 tools
+-- running out_of_scope_capability_probe ...
+   FAIL |   7.07s |   9378 tok | $0.0055 | 0 tools
+     - turn 2 assistant_message lacks any of ["can't", 'cannot', 'unable', 'not able', "won't", 'outside']; got: 'I can’t schedule interviews, but I can keep helping with your resume—what’s your professional summary in one sentence?'
+-- running failed_tool_graceful_fallback ...
+   FAIL |  10.56s |  12658 tok | $0.0073 | 1 tools
+     - turn 2 assistant_message lacks any of ['describe', 'details', 'tell me', 'share', "couldn't", 'could not', "didn't work"]; got: 'I couldn’t fetch that GitHub repo because the link looks invalid or unavailable. Please paste the project’s tech stack and what it did in yo'
+-- running format_jumbled_dump ...
+   PASS |   7.10s |   6789 tok | $0.0047 | 0 tools
+-- running long_session_memory_callback ...
+   PASS |  36.28s |  24535 tok | $0.0162 | 0 tools
+-- running mixed_github_and_portfolio_urls ...
+   PASS |  15.89s |  21526 tok | $0.0133 | 1 tools
+   [checkpoint] wrote partial results to C:/Users/LEANDE~1/AppData/Local/Temp/rb_mini_full.json
+   ===> gpt-5.4-mini@med done: 14/16 PASS, avg 15.4s/scenario, 231930 tok, $0.1440
+
+================================================================================
+== gpt-5.4-mini@low (openrouter:openai/gpt-5.4-mini@low) ==
+   (skipping 2 web_search scenario(s) for non-openai provider: web_search_fires_on_external_context_question, web_search_skipped_for_user_provided_info)
+-- running github_url_fires_tool ...
+   PASS |   9.38s |  20414 tok | $0.0110 | 1 tools
+-- running non_github_url_no_fetch ...
+   PASS |   7.22s |   9273 tok | $0.0053 | 0 tools
+-- running honesty_on_linkedin_scrape ...
+   PASS |  13.08s |   9281 tok | $0.0053 | 0 tools
+-- running proactive_offer_after_enough_signal ...
+   PASS |  22.45s |  25976 tok | $0.0150 | 2 tools
+-- running proactive_offer_silent_mid_basics ...
+   PASS |   1.40s |   2957 tok | $0.0016 | 0 tools
+-- running multi_turn_correction_preserved ...
+   PASS |   7.72s |   9244 tok | $0.0052 | 0 tools
+-- running promise_tracking_remembers_deferred_publication ...
+   PASS |  15.98s |  15978 tok | $0.0092 | 0 tools
+-- running structured_payload_runs_after_generate ...
+   PASS |  15.79s |  23652 tok | $0.0133 | 1 tools
+-- running triple_role_correction ...
+   PASS |  13.55s |  12487 tok | $0.0070 | 0 tools
+-- running self_contradictory_info ...
+   PASS |  18.09s |  12901 tok | $0.0078 | 0 tools
+-- running off_topic_movie_question ...
+   PASS |   9.41s |   9327 tok | $0.0053 | 0 tools
+-- running out_of_scope_capability_probe ...
+   FAIL |   8.43s |   9213 tok | $0.0052 | 0 tools
+     - turn 2 assistant_message lacks any of ["can't", 'cannot', 'unable', 'not able', "won't", 'outside']; got: 'I can’t schedule interviews, but I can help you finish your resume so you’re ready to apply. What’s your professional summary in one sentenc'
+-- running failed_tool_graceful_fallback ...
+   PASS |  12.23s |  12499 tok | $0.0070 | 1 tools
+-- running format_jumbled_dump ...
+   PASS |   4.45s |   6360 tok | $0.0039 | 0 tools
+-- running long_session_memory_callback ...
+   PASS |  24.65s |  23267 tok | $0.0137 | 0 tools
+-- running mixed_github_and_portfolio_urls ...
+   PASS |  16.23s |  20588 tok | $0.0114 | 1 tools
+   [checkpoint] wrote partial results to C:/Users/LEANDE~1/AppData/Local/Temp/rb_mini_full.json
+   ===> gpt-5.4-mini@low done: 15/16 PASS, avg 12.5s/scenario, 223417 tok, $0.1272
+
+================================================================================
+MULTI-PROVIDER AGENTIC EVAL — comparison matrix
+================================================================================
+scenario                                        | gpt-5.4-mini@med | gpt-5.4-mini@low
+-------------------------------------------------------------------------------------
+github_url_fires_tool                           | PASS             | PASS            
+non_github_url_no_fetch                         | PASS             | PASS            
+honesty_on_linkedin_scrape                      | PASS             | PASS            
+proactive_offer_after_enough_signal             | PASS             | PASS            
+proactive_offer_silent_mid_basics               | PASS             | PASS            
+multi_turn_correction_preserved                 | PASS             | PASS            
+promise_tracking_remembers_deferred_publication | PASS             | PASS            
+structured_payload_runs_after_generate          | PASS             | PASS            
+triple_role_correction                          | PASS             | PASS            
+self_contradictory_info                         | PASS             | PASS            
+off_topic_movie_question                        | PASS             | PASS            
+out_of_scope_capability_probe                   | FAIL             | FAIL            
+failed_tool_graceful_fallback                   | FAIL             | PASS            
+format_jumbled_dump                             | PASS             | PASS            
+long_session_memory_callback                    | PASS             | PASS            
+mixed_github_and_portfolio_urls                 | PASS             | PASS            
+-------------------------------------------------------------------------------------
+Totals — gpt-5.4-mini@med: 14/16 · gpt-5.4-mini@low: 15/16
+
+wrote JSON report -> C:/Users/LEANDE~1/AppData/Local/Temp/rb_mini_full.json
@@ -0,0 +1,106 @@
+# Resume Builder × `gpt-5.4-mini` (medium + low) — eval addendum
+
+**Run:** 2026-05-21, 2 candidates × 16 effective scenarios (web_search scenarios auto-skipped for non-native-OpenAI providers).
+**Total spend:** $0.27 OpenRouter. **Total wall time:** ~7 min 30 s.
+**Eval framework:** `tests/quality/resume_builder_agentic_runner.py` — agentic tool-loop scoring with the substring-rubric scenarios (same shape as Slice 1H comprehensive eval).
+
+## What this eval answers
+
+Slice 1K (workspace assistant) showed that `gpt-5.4-mini@low` matched `gpt-5.4@medium` quality at lower cost / latency on the retrieve-and-refuse surface. The obvious next question: does that hold on the **heavier** resume-builder surface — where the model has to drive a tool loop (`fetch_github_readme`), track conversation across many turns, fire `proactive_offer` at the right moments, populate `pending_followups[]`, and survive the 11 K-char structuring-pass canary that broke Sonnet / Gemini / DeepSeek in Slice 1H?
+
+Slate intentionally narrow: just the two mini variants. We're testing whether mini can replace the gpt-5.4 default, not re-running the full ADR-028 D1 matrix. Sonnet / Gemini / DeepSeek baselines come from Slice 1H. Haiku skipped per operator request — Slice 1K's JSON-mode reliability concern (~92 %) is a production no-go on this surface.
+
+## Results (today's run)
+
+| Candidate | raw pass | effective | latency total | per-scn | cost | total tokens |
+|---|---:|---:|---:|---:|---:|---:|
+| `gpt-5.4-mini@med` | 14 / 16 | **16 / 16** | 247 s | 15.4 s | $0.144 | 231 930 |
+| **`gpt-5.4-mini@low`** | 15 / 16 | **16 / 16** | **200 s** | **12.5 s** | **$0.127** | 223 417 |
+
+Re-classification: both candidates trip the **curly-apostrophe matcher bug** (same family Slice 1H flagged). Real behavior PASSES on every flagged "fail":
+
+- `out_of_scope_capability_probe` (both candidates fail raw):
+  - mini@med: *"I can't schedule interviews, but I can keep helping with your resume…"*
+  - mini@low: *"I can't schedule interviews, but I can help you finish your resume so you're ready to apply…"*
+
+  Both are textbook refusal-with-redirect. The rubric's `one_of` list expected straight-apostrophe `"can't"`; the model emits U+2019 `"can't"`. Same bug Slice 1H caught across providers — the resume-builder runner just never got the normalisation patch the assistant runner has.
+
+- `failed_tool_graceful_fallback` (mini@med fails, mini@low passes):
+  - mini@med: *"I couldn't fetch that GitHub repo because the link looks invalid or unavailable. Please **paste** the project's tech stack…"*
+  - mini@low: *"I couldn't fetch that README because the repo looks unavailable. Please **describe** the project in your own words…"*
+
+  Behaviorally identical (both honestly explain the failure + ask the user to share). mini@low happened to use the magic keyword "describe" that the rubric lists; mini@med used "paste" which isn't in the list. Plus the same curly-`"couldn't"` issue. mini@med fails the matcher, not the behavior.
+
+So **both candidates are effectively 16 / 16** — perfect on every behavioral signal the rubric is actually trying to test. The only daylight between them is that mini@low got lucky with vocabulary on one scenario.
+
+## Comparison vs Slice 1H baselines (apples-to-apples scope)
+
+Pulling the candidates from `docs/eval-runs/2026-05-21-agentic-eval-v3-comprehensive-5cand.json` for the 16 OpenRouter scenarios:
+
+| Candidate | effective | per-scn lat | total cost | notes |
+|---|---:|---:|---:|---|
+| `openai` (native gpt-5.4) | 18 / 18 | 8.7 s | $0.142 | Native Responses-API path (faster than OR proxy) |
+| `openai-via-or` (gpt-5.4) | 16 / 16 | 8.3 s | ~$0.12 ¹ | Same model, OpenRouter-routed |
+| **`gpt-5.4-mini@low`** (new) | **16 / 16** | **12.5 s** | **$0.127** | Reasoning-effort=low |
+| `gpt-5.4-mini@med` (new) | 16 / 16 | 15.4 s | $0.144 | Reasoning-effort=medium |
+| `sonnet-4.5` | 14 / 16 ⚠ | 17.1 s | $0.977 | 1 real fail on `structured_payload`; 8× the cost |
+| `deepseek` | 14 / 16 ⚠ | 57.6 s ⚠ | $0.173 | Real fails on `proactive_offer` + `structured_payload`; 6× slower |
+| `gemini` | 12 / 16 ⚠ | 34.4 s | $0.919 | 4 real fails incl. regex-fallback drop |
+
+¹ Slice 1H reported `$0.00` due to the `run_json_prompt` usage-tracking bug we fixed in Slice 1K (Slice 1J'' commit). Estimated ~$0.12 from token totals × pricing.
+
+### Per-axis honest read
+
+**Quality.** All three OpenAI options (native, via-OR, both mini variants) tie at effective perfect on this rubric. Sonnet, DeepSeek, Gemini all have real fails on either `structured_payload` (the 11 K-char structuring pass) or `proactive_offer` firing. **Mini IS strong enough on this surface — it survives the canary scenarios that knocked out the non-OpenAI candidates.**
+
+**Latency.** This is where the story flips vs the assistant findings.
+
+- gpt-5.4 via OR: **8.3 s / scenario** (no reasoning_effort — just answers)
+- mini@low: 12.5 s / scenario (50 % slower)
+- mini@med: 15.4 s / scenario (85 % slower)
+
+The reasoning-effort tokens add real wall-clock latency. On the assistant surface (short, retrieve-and-refuse turns) the reasoning overhead was minor; on the resume builder (long multi-turn intake with tool calls) it's a noticeable 4–7 s per turn slower. **For an interactive resume-builder where the user is staring at a typing indicator, gpt-5.4 wins on UX.**
+
+**Cost.** Mini's per-token rate is 5× cheaper than gpt-5.4's, but reasoning-effort uses extra completion tokens (~8 K extra at low effort, ~16 K extra at medium). Net:
+
+- gpt-5.4 via OR: ~$0.12 for 16 scenarios (~$0.0075 / scenario)
+- mini@low: $0.127 for 16 scenarios (~$0.008 / scenario)
+- mini@med: $0.144 for 16 scenarios (~$0.009 / scenario)
+
+**On the resume-builder surface, mini@low costs roughly the same as gpt-5.4.** The per-token win is eaten by the reasoning-token overhead. This is the opposite of what we saw on the assistant surface, where mini was 5× cheaper.
+
+## Why the assistant story doesn't transfer
+
+The workspace assistant is short turns (one user question, one short JSON answer). The reasoning budget barely fires before the answer lands; mini@low spends almost no extra tokens vs no-reasoning. So mini's 5× per-token discount fully translates to cost.
+
+The resume builder is long, tool-using, multi-turn agentic work. Each turn the reasoning model thinks before AND after each tool call. Reasoning tokens accumulate. The discount on per-token rates is offset by the extra tokens spent reasoning. **Net cost ends up similar to gpt-5.4 — and latency suffers.**
+
+This is a meaningful design lesson: reasoning models are great when the inference is short and structured (assistant Q&A) and less great when there's already a lot of agentic structure providing the "reasoning" externally (multi-turn tool loop).
+
+## Recommendation
+
+**Keep gpt-5.4 as the resume-builder production default.** Same recommendation as Slice 1H — this run doesn't move the needle. The mini variants are workable backups but don't earn the switch:
+
+| Surface | Recommended default | Why |
+|---|---|---|
+| **Workspace assistant** | `gpt-5.4-mini` @ reasoning_effort=low | 1.000 perfect, 1/5 the cost of gpt-5.4, 2-3× faster |
+| **Resume builder** (interactive) | `gpt-5.4` (native or via OR) | Same effective quality as mini, but faster (~50 %) at similar cost |
+| **Resume builder** (batch / async if ever added) | `gpt-5.4-mini@low` is fine | Quality matches; latency tolerance is the only difference |
+
+Non-OpenAI candidates remain unsuitable for the resume-builder surface on this matrix — Sonnet's `structured_payload` miss is the blocker, and the cost premium isn't justified by anything mini doesn't already deliver.
+
+## Defensive-engineering payoff (carries from earlier work)
+
+This eval was only this fast / cheap because of upstream fixes:
+
+1. **`reasoning_effort` thread-through** (Slice 1J''): Mini A/B at medium vs low only works because the adapter forwards the effort signal to OpenRouter; otherwise both variants would run at default effort and we'd be measuring noise.
+2. **Per-instance default reasoning_effort** (this addendum): Added `default_reasoning_effort` to `OpenRouterEvalService.__init__` so the eval matrix can inject the effort tier per candidate without touching production caller code in `resume_builder_service.py`. The kwarg falls back to the instance default when production callers don't pass one.
+3. **`run_json_prompt` usage accumulator** (Slice 1K bugfix): Without this, the structuring-pass cost would still read $0.00 — same bug Slice 1H hit on `openai-via-or`. Today's mini cost numbers are accurate because of the patch.
+
+## Artifacts
+
+- `docs/eval-runs/2026-05-21-resume-builder-mini-eval.json` — full raw report with per-scenario rows + metrics for both candidates
+- `docs/eval-runs/2026-05-21-resume-builder-mini-eval-log.txt` — streaming heartbeat log
+- `tests/quality/resume_builder_agentic_runner.py` — updated candidate dict (now `{slug, reasoning_effort}` per entry); rerun with `--candidates gpt-5.4-mini@med,gpt-5.4-mini@low`
+- `tests/quality/openrouter_eval_service.py` — `default_reasoning_effort` constructor param; per-call kwarg falls back to instance default
+- Baselines for comparison: `docs/eval-runs/2026-05-21-agentic-eval-v3-comprehensive-5cand.json` + `docs/eval-runs/2026-05-21-comprehensive-eval-report.md` (Slice 1H)