feat(eval): Slice 1K addendum: gpt-5.4-mini@low matches mini@med, costs less

LEANDERANTONY · claude · LEANDERANTONY · commit 90234e13ec6b · 2026-05-21T04:37:57.000+05:30
Follow-up to the Slice 1K verdict: since gpt-5.4-mini@med scored a
perfect 1.000 across all 12 scenarios, the obvious next question was
whether reasoning_effort=medium was earning its keep on this surface
or whether low effort would deliver the same quality at lower cost.
Added `gpt-5.4-mini@low` to the candidate slate and ran it against
the same 12 scenarios.

Result: also a perfect 1.000, but 32% faster and 15% cheaper over the full run.

    variant            avg  pass  wall    per-scn  cost      tokens
    gpt-5.4-mini@med   1.0  12/12 40.52s  3.4s     $0.0183   27 197
    gpt-5.4-mini@low   1.0  12/12 27.56s  2.3s     $0.0155   25 723
    delta              0    0     -32%    -32%     -15%      -5%
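
The delta row can be re-derived from the two totals rows above. A minimal
sketch of that arithmetic, assuming a plain relative change rounded to a
whole percent (the helper name `pct_delta` is illustrative and not part of
the runner):

```python
# Totals copied from the table above (med is the baseline, low the candidate).
MED = {"wall_s": 40.52, "cost_usd": 0.0183, "tokens": 27197}
LOW = {"wall_s": 27.56, "cost_usd": 0.0155, "tokens": 25723}


def pct_delta(baseline: float, candidate: float) -> int:
    """Relative change from baseline to candidate, rounded to a whole percent."""
    return round((candidate - baseline) / baseline * 100)


# One entry per metric: negative means the low-effort run used less.
deltas = {key: pct_delta(MED[key], LOW[key]) for key in MED}
print(deltas)  # {'wall_s': -32, 'cost_usd': -15, 'tokens': -5}
```

The per-scn column is the wall time divided by 12, so its delta matches the
wall-time delta by construction.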

Verified answer quality directly on the demanding scenarios — neither
long-session callback nor multi-turn correction degraded:

  long_session_memory_callback:
    "You told me it was an 18% reduction in chargeback fraud."
    (correctly recalled from turn 2 of a 7-turn session)

  multi_turn_correction:
    "You're targeting data-science roles right now, and I should
    keep Python in the profile."
    (correctly tracked the user's mid-session pivot AND the
    explicit "keep Python" follow-up — same retention as mini@med)

  theme_list_question: complete six-theme enumeration with ATS /
    non-ATS distinction (same shape as mini@med).

**Refined recommendation:** route the workspace-assistant default to
`openai/gpt-5.4-mini` at `reasoning_effort=low`. The assistant
surface is retrieval-and-refuse — thinking-token spend beyond "low"
earns nothing on this rubric.

Cost-per-scenario, final ranking:

    candidate            $/scn
    gpt-5.4-mini@low     $0.0013   <- new best
    gpt-5.4-mini@med     $0.0015
    haiku-4.5            $0.0035
    o4-mini@high         $0.0067
    gpt-5.4@med          $0.0078
    sonnet-4.5           $0.0096
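
The $/scn figures for the two mini variants follow from the run totals
divided by the 12 scenarios. A small sketch of that check, assuming rounding
to four decimal places; only the mini rows are reproduced here because the
other candidates' totals are not part of this commit:

```python
SCENARIOS = 12

# Total run cost in USD for each mini variant, from the sweep table.
TOTAL_COST_USD = {
    "gpt-5.4-mini@low": 0.0155,
    "gpt-5.4-mini@med": 0.0183,
}

# Average cost per scenario, rounded the same way the ranking reports it.
per_scenario = {
    name: round(total / SCENARIOS, 4) for name, total in TOTAL_COST_USD.items()
}

# Cheapest first, mirroring the ranking order.
for name, cost in sorted(per_scenario.items(), key=lambda item: item[1]):
    print(f"{name:<20} ${cost:.4f}")
```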

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/docs/DEVLOG.md b/docs/DEVLOG.md
@@ -2571,3 +2571,20 @@ assistant API spend. Full read-out in
 Slice 1J's history fix paid off concretely: all 5 candidates
 correctly recalled the "18 %" fact from turn 2 in a 7-turn
 session — unscorable before the fix.
+
+### Slice 1K addendum: `gpt-5.4-mini@low` sweep
+
+Since mini@med scored 1.000 the obvious next question was
+whether `reasoning_effort=medium` was earning its keep on the
+assistant surface. Added `gpt-5.4-mini@low` to the candidate
+slate and re-ran the same 12 scenarios. Result: also perfect
+1.000, but 32 % faster (27.6 s vs 40.5 s) and 15 % cheaper
+($0.0155 vs $0.0183). Verified answer quality directly on the
+hardest scenarios (long-session callback, multi-turn correction,
+pricing-tier numeric recall) — no degradation.
+
+**Refined recommendation:** route the workspace-assistant
+default to `openai/gpt-5.4-mini` at `reasoning_effort=low`. The
+assistant surface is retrieval-and-refuse; thinking-token spend
+beyond "low" earns nothing on this rubric. Artifacts:
+`docs/eval-runs/2026-05-21-assistant-eval-mini-low.json`.
diff --git a/docs/eval-runs/2026-05-21-assistant-eval-mini-low-log.txt b/docs/eval-runs/2026-05-21-assistant-eval-mini-low-log.txt
@@ -0,0 +1,30 @@
+== Slice 1K assistant eval: 1 candidate(s) × 12 scenario(s) ==
+   candidates: ['gpt-5.4-mini@low']
+   scenarios:  ['pricing_tiers_question', 'theme_list_question', 'theme_unlock_question', 'quota_assistant_turns', 'quota_resume_builder_lifetime', 'refuse_schedule_interview', 'refuse_login_external', 'pre_resume_grounding', 'post_analysis_grounding', 'off_topic_movie', 'long_session_memory_callback', 'multi_turn_correction']
+
+--- gpt-5.4-mini@low (openai/gpt-5.4-mini, effort=low) ---
+  [pricing_tiers_question          ] score=1.00 |   4.09s |   2253 tok | $0.0016
+  [theme_list_question             ] score=1.00 |   1.81s |   2225 tok | $0.0013
+  [theme_unlock_question           ] score=1.00 |   3.58s |   2134 tok | $0.0014
+  [quota_assistant_turns           ] score=1.00 |   1.82s |   2041 tok | $0.0012
+  [quota_resume_builder_lifetime   ] score=1.00 |   1.67s |   2025 tok | $0.0012
+  [refuse_schedule_interview       ] score=1.00 |   1.67s |   2189 tok | $0.0013
+  [refuse_login_external           ] score=1.00 |   1.69s |   2068 tok | $0.0012
+  [pre_resume_grounding            ] score=1.00 |   1.46s |   2005 tok | $0.0012
+  [post_analysis_grounding         ] score=1.00 |   1.59s |   2201 tok | $0.0013
+  [off_topic_movie                 ] score=1.00 |   1.46s |   2035 tok | $0.0012
+  [long_session_memory_callback    ] score=1.00 |   3.20s |   2414 tok | $0.0014
+  [multi_turn_correction           ] score=1.00 |   3.50s |   2133 tok | $0.0012
+  ===> gpt-5.4-mini@low done — avg=1.0, pass_rate=1.0, 27.56s, 25723 tok, $0.0155
+  [checkpoint] wrote partial results to C:/Users/LEANDE~1/AppData/Local/Temp/slice_1k_mini_low.json
+
+============================================================================================
+=============== SLICE 1K SUMMARY — avg_score · pass_rate · cost by candidate ===============
+============================================================================================
+candidate             avg     pass    lat       cost
+----------------------------------------------------
+gpt-5.4-mini@low      1.0     1.0     27.56     $0.0155
+
+wrote JSON report -> C:/Users/LEANDE~1/AppData/Local/Temp/slice_1k_mini_low.json
+
+Total wall time: 27.6s
diff --git a/docs/eval-runs/2026-05-21-assistant-eval-mini-low.json b/docs/eval-runs/2026-05-21-assistant-eval-mini-low.json
@@ -0,0 +1,261 @@
+{
+  "candidates": {
+    "gpt-5.4-mini@low": {
+      "slug": "openai/gpt-5.4-mini",
+      "reasoning_effort": "low"
+    }
+  },
+  "scenarios": [
+    "pricing_tiers_question",
+    "theme_list_question",
+    "theme_unlock_question",
+    "quota_assistant_turns",
+    "quota_resume_builder_lifetime",
+    "refuse_schedule_interview",
+    "refuse_login_external",
+    "pre_resume_grounding",
+    "post_analysis_grounding",
+    "off_topic_movie",
+    "long_session_memory_callback",
+    "multi_turn_correction"
+  ],
+  "results": {
+    "gpt-5.4-mini@low": {
+      "slug": "openai/gpt-5.4-mini",
+      "reasoning_effort": "low",
+      "rows": [
+        {
+          "fixture": "pricing_tiers_question",
+          "score": 1.0,
+          "reason": "+'free'=True +'pro'=True +'business'=True one_of(['3', '20', '80', '150', '500', '1000', 'unlimited'])=True",
+          "answer": "We offer three tiers: Free, Pro, and Business. Free includes 3 tailored applications/month, 0 premium applications, 20 assistant turns/month, 3 resume parses/month, 50 job searches/month, 5 saved jobs, 1 saved workspace, 1 lifetime resume builder session, 7-day saved-workspace retention, and PDF export in the professional_neutral theme only. Pro includes 20 tailored apps, 5 premium apps, 150 assistant turns, 25 resume parses, unlimited job searches, 100 saved jobs, 5 saved workspaces, 3 resume builder sessions/month, 30-day retention, and PDF or DOCX export in any theme. Business includes 80 tailored apps, 25 premium apps, 500 assistant turns, 100 resume parses, unlimited job searches, unlimited saved jobs, unlimited saved workspaces, 15 resume builder sessions/month, unbounded retention, and PDF or DOCX export in any theme.",
+          "sources": [
+            "Product knowledge: pricing tiers and monthly/persistent caps",
+            "Product knowledge: resume builder sessions and saved-workspace retention",
+            "Product knowledge: export entitlement and themes"
+          ],
+          "error": null,
+          "metrics": {
+            "elapsed_seconds": 4.09,
+            "prompt_tokens": 1935,
+            "completion_tokens": 318,
+            "total_tokens": 2253,
+            "cost_usd": 0.0016
+          }
+        },
+        {
+          "fixture": "theme_list_question",
+          "score": 1.0,
+          "reason": "one_of(['professional_neutral', 'professional neutral'])=True one_of(['modern_blue', 'modern blue'])=True one_of(['creative_warm', 'creative warm', 'architect_mono', 'architect mono', 'classic_ats', 'classic ats', 'presentation_twocol', 'presentation', 'two-column', 'two column'])=True",
+          "answer": "You can choose from six resume themes when you export: classic_ats, professional_neutral, modern_blue, creative_warm, architect_mono, and presentation_twocol. The first five are single-column and ATS-safe; presentation_twocol is a two-column designer layout and is flagged non-ATS.",
+          "sources": [
+            "Product knowledge: resume themes and export entitlement",
+            "Workspace state: Analysis step with resume and JD present"
+          ],
+          "error": null,
+          "metrics": {
+            "elapsed_seconds": 1.81,
+            "prompt_tokens": 2079,
+            "completion_tokens": 146,
+            "total_tokens": 2225,
+            "cost_usd": 0.0013
+          }
+        },
+        {
+          "fixture": "theme_unlock_question",
+          "score": 1.0,
+          "reason": "-'unlock code'=True -'promo code'=True one_of(['gated', 'non-ats', 'not ats', 'pro', 'business', 'upgrade'])=True one_of(['two-column', 'two column', 'presentation'])=True",
+          "answer": "The two-column theme is the `presentation_twocol` export theme, and it\u2019s only available on Pro or Business. Free users can export PDF only in `professional_neutral`.",
+          "sources": [
+            "product knowledge: resume themes",
+            "product knowledge: export entitlement"
+          ],
+          "error": null,
+          "metrics": {
+            "elapsed_seconds": 3.58,
+            "prompt_tokens": 1930,
+            "completion_tokens": 204,
+            "total_tokens": 2134,
+            "cost_usd": 0.0014
+          }
+        },
+        {
+          "fixture": "quota_assistant_turns",
+          "score": 1.0,
+          "reason": "+'20'=True +'150'=True +'500'=True",
+          "answer": "Monthly assistant chat limits are 20 turns on Free, 150 on Pro, and 500 on Business.",
+          "sources": [
+            "Product knowledge: pricing tiers and monthly caps",
+            "workspace_state: current_step=jobs"
+          ],
+          "error": null,
+          "metrics": {
+            "elapsed_seconds": 1.82,
+            "prompt_tokens": 1934,
+            "completion_tokens": 107,
+            "total_tokens": 2041,
+            "cost_usd": 0.0012
+          }
+        },
+        {
+          "fixture": "quota_resume_builder_lifetime",
+          "score": 1.0,
+          "reason": "-'resets monthly'=True -'every month'=True -'each month'=True one_of(['lifetime', 'never resets', 'one-time', '1 total', \"doesn't reset\"])=True",
+          "answer": "On Free, resume builder sessions do not reset \u2014 you get 1 lifetime session total.",
+          "sources": [
+            "Product knowledge: Resume builder sessions are LIFETIME on Free (1 total, never resets)"
+          ],
+          "error": null,
+          "metrics": {
+            "elapsed_seconds": 1.67,
+            "prompt_tokens": 1887,
+            "completion_tokens": 138,
+            "total_tokens": 2025,
+            "cost_usd": 0.0012
+          }
+        },
+        {
+          "fixture": "refuse_schedule_interview",
+          "score": 1.0,
+          "reason": "-'scheduled'=True -'confirmed for 3'=True -'confirmed at 3'=True -\"i've booked\"=True one_of([\"can't\", 'cannot', \"i'm not able\", 'unable', \"won't be able\", 'do not schedule', \"don't schedule\"])=True",
+          "answer": "I can\u2019t schedule interviews or contact the hiring manager, but I can help you prepare a message or summary for the application.",
+          "sources": [
+            "product_limits",
+            "workflow_context.fit_analysis"
+          ],
+          "error": null,
+          "metrics": {
+            "elapsed_seconds": 1.67,
+            "prompt_tokens": 2083,
+            "completion_tokens": 106,
+            "total_tokens": 2189,
+            "cost_usd": 0.0013
+          }
+        },
+        {
+          "fixture": "refuse_login_external",
+          "score": 1.0,
+          "reason": "-'logged in'=True -'submitted the application'=True -'applied on linkedin'=True one_of([\"can't\", 'cannot', \"won't\", 'unable', 'not able', \"don't have access\"])=True",
+          "answer": "I can\u2019t log into LinkedIn or submit an application on your behalf, but I can help you tailor your resume or prepare the job description here so you can apply yourself.",
+          "sources": [
+            "workspace_state.current_step=jobs",
+            "workspace_state.has_resume=true",
+            "product limitations"
+          ],
+          "error": null,
+          "metrics": {
+            "elapsed_seconds": 1.69,
+            "prompt_tokens": 1934,
+            "completion_tokens": 134,
+            "total_tokens": 2068,
+            "cost_usd": 0.0012
+          }
+        },
+        {
+          "fixture": "pre_resume_grounding",
+          "score": 1.0,
+          "reason": "-'python,'=True -'javascript,'=True -'your skills include'=True one_of([\"haven't uploaded\", 'no resume', 'not uploaded', 'no profile', 'upload', 'resume builder'])=True",
+          "answer": "Your resume hasn\u2019t been uploaded or parsed yet, so I can\u2019t see any skills right now. Upload a resume or use the resume builder, and I\u2019ll summarize the skills from it.",
+          "sources": [
+            "workspace_state.has_resume=false",
+            "current_page=Resume"
+          ],
+          "error": null,
+          "metrics": {
+            "elapsed_seconds": 1.46,
+            "prompt_tokens": 1881,
+            "completion_tokens": 124,
+            "total_tokens": 2005,
+            "cost_usd": 0.0012
+          }
+        },
+        {
+          "fixture": "post_analysis_grounding",
+          "score": 1.0,
+          "reason": "+'72'=True one_of(['aws', 'kubernetes', 'missing'])=True",
+          "answer": "Your fit score is 72, which the analysis labels as Strong. What\u2019s holding it back most is a couple of missing hard skills \u2014 AWS and Kubernetes \u2014 while your experience looks roughly aligned with the role.",
+          "sources": [
+            "workflow_context.fit_analysis",
+            "workspace_state.has_analysis",
+            "jd_summary",
+            "resume_summary"
+          ],
+          "error": null,
+          "metrics": {
+            "elapsed_seconds": 1.59,
+            "prompt_tokens": 2079,
+            "completion_tokens": 122,
+            "total_tokens": 2201,
+            "cost_usd": 0.0013
+          }
+        },
+        {
+          "fixture": "off_topic_movie",
+          "score": 1.0,
+          "reason": "-'interstellar'=True -'inception'=True -'blade runner'=True -'the matrix'=True -'dune'=True one_of([\"can't\", 'cannot', \"won't\", 'stay focused', 'focus on', 'outside', 'scope'])=True",
+          "answer": "I can\u2019t help with that here, but I can help you find or tailor job applications instead.",
+          "sources": [
+            "workspace_state.current_step=jobs",
+            "product scope rules"
+          ],
+          "error": null,
+          "metrics": {
+            "elapsed_seconds": 1.46,
+            "prompt_tokens": 1931,
+            "completion_tokens": 104,
+            "total_tokens": 2035,
+            "cost_usd": 0.0012
+          }
+        },
+        {
+          "fixture": "long_session_memory_callback",
+          "score": 1.0,
+          "reason": "+'18'=True one_of(['chargeback', 'fraud reduction', 'fraud cut'])=True",
+          "answer": "You told me it was an 18% reduction in chargeback fraud.",
+          "sources": [
+            "Recent conversation history",
+            "Workspace state: resume context (Priya Sharma)"
+          ],
+          "error": null,
+          "metrics": {
+            "elapsed_seconds": 3.2,
+            "prompt_tokens": 2287,
+            "completion_tokens": 127,
+            "total_tokens": 2414,
+            "cost_usd": 0.0014
+          }
+        },
+        {
+          "fixture": "multi_turn_correction",
+          "score": 1.0,
+          "reason": "-'backend engineer'=True -'backend role'=True one_of(['data science', 'data scientist', 'data-science'])=True",
+          "answer": "You're targeting data-science roles right now, and I should keep Python in the profile.",
+          "sources": [
+            "Recent conversation history",
+            "Workspace state: current_step=resume"
+          ],
+          "error": null,
+          "metrics": {
+            "elapsed_seconds": 3.5,
+            "prompt_tokens": 2026,
+            "completion_tokens": 107,
+            "total_tokens": 2133,
+            "cost_usd": 0.0012
+          }
+        }
+      ],
+      "totals": {
+        "scenarios": 12,
+        "scored": 12,
+        "avg_score": 1.0,
+        "pass_rate": 1.0,
+        "total_seconds": 27.56,
+        "total_tokens": 25723,
+        "total_cost_usd": 0.0155
+      }
+    }
+  },
+  "started_at": "2026-05-20T23:06:15.450451+00:00",
+  "elapsed_seconds": 27.6,
+  "completed_at": "2026-05-20T23:06:43.002415+00:00"
+}
diff --git a/docs/eval-runs/2026-05-21-assistant-eval-report.md b/docs/eval-runs/2026-05-21-assistant-eval-report.md
@@ -139,6 +139,41 @@ The patches that landed *before* this run:
 
 Without any one of these, the eval would have either failed to score (1, 4), produced garbage answers (2), or used the wrong reasoning tier (3) — and the report would have been misleading.
 
+## Addendum — `gpt-5.4-mini@low` sweep (same 12 scenarios)
+
+Since `gpt-5.4-mini@med` scored perfect 1.000, the obvious next question: is the `reasoning_effort=medium` setting earning its keep on this surface, or could we drop to `low` and save even more? Re-ran the same 12 scenarios with the candidate added to the runner.
+
+| Variant | avg | pass | wall | per-scn | cost | tokens |
+|---|---:|---:|---:|---:|---:|---:|
+| `gpt-5.4-mini@med` | 1.000 | 12/12 | 40.5s | 3.4s | $0.0183 | 27 197 |
+| **`gpt-5.4-mini@low`** | **1.000** | **12/12** | **27.6s** | **2.3s** | **$0.0155** | 25 723 |
+| _delta_ | _0_ | _0_ | _-32 %_ | _-32 %_ | _-15 %_ | _-5 %_ |
+
+Same perfect score across the board. Answer quality verified on the demanding scenarios — neither dropped a beat:
+
+> **mini@low on `long_session_memory_callback`:** *"You told me it was an 18% reduction in chargeback fraud."* (correctly recalled from turn 2 of a 7-turn session)
+>
+> **mini@low on `multi_turn_correction`:** *"You're targeting data-science roles right now, and I should keep Python in the profile."* (correctly tracked the user's mid-session pivot AND remembered the explicit "keep Python" follow-up)
+>
+> **mini@low on `theme_list_question`:** *"You can choose from six resume themes when you export: classic_ats, professional_neutral, modern_blue, creative_warm, architect_mono, and presentation_twocol. The first five are single-column and ATS-safe; presentation_twocol is a two-column designer layout and is flagged non-ATS."*
+>
+> **mini@low on `pricing_tiers_question`:** complete enumeration of all three tiers with all numbers (cap matrix verbatim).
+
+**Refined recommendation:** route the workspace-assistant default to `openai/gpt-5.4-mini` at `reasoning_effort=low`. The medium-effort variant was overkill on this surface — retrieval-and-refuse work doesn't reward thinking-token spend.
+
+Cost-per-scenario ranking after the addendum:
+
+| Candidate | $/scn | rank |
+|---|---:|---:|
+| `gpt-5.4-mini@low` | $0.0013 | 🥇 new best |
+| `gpt-5.4-mini@med` | $0.0015 | |
+| `haiku-4.5` | $0.0035 | |
+| `o4-mini@high` | $0.0067 | |
+| `gpt-5.4@med` | $0.0078 | |
+| `sonnet-4.5` | $0.0096 | |
+
+Artifacts: `docs/eval-runs/2026-05-21-assistant-eval-mini-low.json`, `docs/eval-runs/2026-05-21-assistant-eval-mini-low-log.txt`.
+
 ## Artifacts
 
 - `docs/eval-runs/2026-05-21-assistant-eval-full.json` — full raw report with per-scenario rows + metrics
diff --git a/tests/quality/assistant_agentic_runner.py b/tests/quality/assistant_agentic_runner.py
@@ -103,6 +103,15 @@
         "slug": "openai/gpt-5.4-mini",
         "reasoning_effort": "medium",
     },
+    # Slice 1K addendum: mini@med scored a perfect 1.000 across all 12
+    # scenarios — the assistant surface is retrieval-and-refuse, which
+    # doesn't reward heavy reasoning. Adding the low-effort variant to
+    # see whether we can cut reasoning-token cost further without
+    # losing the long-session callback or grounding correctness.
+    "gpt-5.4-mini@low": {
+        "slug": "openai/gpt-5.4-mini",
+        "reasoning_effort": "low",
+    },
     "o4-mini@high": {
         "slug": "openai/o4-mini",
         "reasoning_effort": "high",