Commit 90234e1

LEANDERANTONY and claude committed
feat(eval): Slice 1K addendum — gpt-5.4-mini@low sweeps mini@med
Followup to the Slice 1K verdict: since gpt-5.4-mini@med scored a perfect 1.000 across all 12 scenarios, the obvious next question was whether reasoning_effort=medium was earning its keep on this surface or whether low effort would produce the same quality at lower cost.

Added `gpt-5.4-mini@low` to the candidate slate and re-ran the same 12 scenarios against it. Result: also a perfect 1.000, but faster and cheaper across the board.

    variant            avg   pass    wall     per-scn   cost      tokens
    gpt-5.4-mini@med   1.0   12/12   40.52s   3.4s      $0.0183   27 197
    gpt-5.4-mini@low   1.0   12/12   27.56s   2.3s      $0.0155   25 723
    delta              0     0       -32%     -32%      -15%      -5%

Verified answer quality directly on the demanding scenarios; neither the long-session callback nor the multi-turn correction degraded:

  long_session_memory_callback: "You told me it was an 18% reduction in chargeback fraud." (correctly recalled from turn 2 of a 7-turn session)

  multi_turn_correction: "You're targeting data-science roles right now, and I should keep Python in the profile." (correctly tracked the user's mid-session pivot AND the explicit "keep Python" follow-up; same retention as mini@med)

  theme_list_question: complete six-theme enumeration with ATS / non-ATS distinction (same shape as mini@med).

**Refined recommendation:** route the workspace-assistant default to `openai/gpt-5.4-mini` at `reasoning_effort=low`. The assistant surface is retrieval-and-refuse; thinking-token spend beyond "low" earns nothing on this rubric.

Cost-per-scenario, final ranking:

    candidate          $/scn
    gpt-5.4-mini@low   $0.0013   <- new best
    gpt-5.4-mini@med   $0.0015
    haiku-4.5          $0.0035
    o4-mini@high       $0.0067
    gpt-5.4@med        $0.0078
    sonnet-4.5         $0.0096

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent b3f2ea1 commit 90234e1

5 files changed

Lines changed: 352 additions & 0 deletions

docs/DEVLOG.md

Lines changed: 17 additions & 0 deletions
@@ -2571,3 +2571,20 @@ assistant API spend. Full read-out in
Slice 1J's history fix paid off concretely: all 5 candidates
correctly recalled the "18 %" fact from turn 2 in a 7-turn
session — unscorable before the fix.

### Slice 1K addendum: `gpt-5.4-mini@low` sweep

Since mini@med scored 1.000 the obvious next question was
whether `reasoning_effort=medium` was earning its keep on the
assistant surface. Added `gpt-5.4-mini@low` to the candidate
slate and re-ran the same 12 scenarios. Result: also perfect
1.000, but 32 % faster (27.6 s vs 40.5 s) and 15 % cheaper
($0.0155 vs $0.0183). Verified answer quality directly on the
hardest scenarios (long-session callback, multi-turn correction,
pricing-tier numeric recall) — no degradation.

**Refined recommendation:** route the workspace-assistant
default to `openai/gpt-5.4-mini` at `reasoning_effort=low`. The
assistant surface is retrieval-and-refuse; thinking-token spend
beyond "low" earns nothing on this rubric. Artifacts:
`docs/eval-runs/2026-05-21-assistant-eval-mini-low.json`.
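The percentages quoted in this entry can be re-derived from the run totals. The following is a minimal sketch, assuming the totals reported in the commit message (40.52 s, $0.0183, 27 197 tokens for mini@med; 27.56 s, $0.0155, 25 723 tokens for mini@low). The `pct_delta` helper and the dictionary layout are illustrative and not part of the repository.

```python
SCENARIOS = 12

# Run totals as quoted in the commit message table.
med = {"wall_s": 40.52, "cost_usd": 0.0183, "tokens": 27197}
low = {"wall_s": 27.56, "cost_usd": 0.0155, "tokens": 25723}


def pct_delta(before: float, after: float) -> int:
    """Relative change from `before` to `after`, as a whole percent."""
    return round((after - before) / before * 100)


deltas = {key: pct_delta(med[key], low[key]) for key in med}
cost_per_scenario = {
    "med": round(med["cost_usd"] / SCENARIOS, 4),
    "low": round(low["cost_usd"] / SCENARIOS, 4),
}

print(deltas)             # {'wall_s': -32, 'cost_usd': -15, 'tokens': -5}
print(cost_per_scenario)  # {'med': 0.0015, 'low': 0.0013}
```

The same division by 12 yields the $/scn figures used in the cost ranking for the two mini variants.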
docs/eval-runs/2026-05-21-assistant-eval-mini-low-log.txt

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
== Slice 1K assistant eval: 1 candidate(s) × 12 scenario(s) ==
candidates: ['gpt-5.4-mini@low']
scenarios: ['pricing_tiers_question', 'theme_list_question', 'theme_unlock_question', 'quota_assistant_turns', 'quota_resume_builder_lifetime', 'refuse_schedule_interview', 'refuse_login_external', 'pre_resume_grounding', 'post_analysis_grounding', 'off_topic_movie', 'long_session_memory_callback', 'multi_turn_correction']

--- gpt-5.4-mini@low (openai/gpt-5.4-mini, effort=low) ---
[pricing_tiers_question ] score=1.00 | 4.09s | 2253 tok | $0.0016
[theme_list_question ] score=1.00 | 1.81s | 2225 tok | $0.0013
[theme_unlock_question ] score=1.00 | 3.58s | 2134 tok | $0.0014
[quota_assistant_turns ] score=1.00 | 1.82s | 2041 tok | $0.0012
[quota_resume_builder_lifetime ] score=1.00 | 1.67s | 2025 tok | $0.0012
[refuse_schedule_interview ] score=1.00 | 1.67s | 2189 tok | $0.0013
[refuse_login_external ] score=1.00 | 1.69s | 2068 tok | $0.0012
[pre_resume_grounding ] score=1.00 | 1.46s | 2005 tok | $0.0012
[post_analysis_grounding ] score=1.00 | 1.59s | 2201 tok | $0.0013
[off_topic_movie ] score=1.00 | 1.46s | 2035 tok | $0.0012
[long_session_memory_callback ] score=1.00 | 3.20s | 2414 tok | $0.0014
[multi_turn_correction ] score=1.00 | 3.50s | 2133 tok | $0.0012
===> gpt-5.4-mini@low done — avg=1.0, pass_rate=1.0, 27.56s, 25723 tok, $0.0155
[checkpoint] wrote partial results to C:/Users/LEANDE~1/AppData/Local/Temp/slice_1k_mini_low.json

============================================================================================
=============== SLICE 1K SUMMARY — avg_score · pass_rate · cost by candidate ===============
============================================================================================
candidate avg pass lat cost
----------------------------------------------------
gpt-5.4-mini@low 1.0 1.0 27.56 $0.0155

wrote JSON report -> C:/Users/LEANDE~1/AppData/Local/Temp/slice_1k_mini_low.json

Total wall time: 27.6s
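The per-scenario rows in this log can be cross-checked against the summary line. Below is a rough sketch that parses rows of the shape shown above and sums them; the regular expression is inferred from the log text, and `ROW` and `summarize` are hypothetical names rather than anything in the runner. Applied to all 12 rows it reproduces the reported 25723 tok and $0.0155; the displayed per-row seconds add up to 27.54 rather than 27.56, presumably because each row is rounded for display.

```python
import re

# Pattern for one per-scenario row, inferred from the log lines above.
ROW = re.compile(
    r"\[(?P<name>\w+)\s*\]\s+score=(?P<score>[\d.]+)\s+\|\s+(?P<secs>[\d.]+)s"
    r"\s+\|\s+(?P<tokens>\d+) tok\s+\|\s+\$(?P<cost>[\d.]+)"
)


def summarize(log_text: str) -> dict:
    """Sum seconds, tokens, and cost over every scenario row found in the log."""
    rows = [match.groupdict() for match in ROW.finditer(log_text)]
    return {
        "scenarios": len(rows),
        "seconds": round(sum(float(r["secs"]) for r in rows), 2),
        "tokens": sum(int(r["tokens"]) for r in rows),
        "cost_usd": round(sum(float(r["cost"]) for r in rows), 4),
    }


# Three rows copied from the log, used here as a small sample.
sample = """\
[pricing_tiers_question ] score=1.00 | 4.09s | 2253 tok | $0.0016
[theme_list_question ] score=1.00 | 1.81s | 2225 tok | $0.0013
[multi_turn_correction ] score=1.00 | 3.50s | 2133 tok | $0.0012
"""

print(summarize(sample))
```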
docs/eval-runs/2026-05-21-assistant-eval-mini-low.json

Lines changed: 261 additions & 0 deletions
@@ -0,0 +1,261 @@
{
  "candidates": {
    "gpt-5.4-mini@low": {
      "slug": "openai/gpt-5.4-mini",
      "reasoning_effort": "low"
    }
  },
  "scenarios": [
    "pricing_tiers_question",
    "theme_list_question",
    "theme_unlock_question",
    "quota_assistant_turns",
    "quota_resume_builder_lifetime",
    "refuse_schedule_interview",
    "refuse_login_external",
    "pre_resume_grounding",
    "post_analysis_grounding",
    "off_topic_movie",
    "long_session_memory_callback",
    "multi_turn_correction"
  ],
  "results": {
    "gpt-5.4-mini@low": {
      "slug": "openai/gpt-5.4-mini",
      "reasoning_effort": "low",
      "rows": [
        {
          "fixture": "pricing_tiers_question",
          "score": 1.0,
          "reason": "+'free'=True +'pro'=True +'business'=True one_of(['3', '20', '80', '150', '500', '1000', 'unlimited'])=True",
          "answer": "We offer three tiers: Free, Pro, and Business. Free includes 3 tailored applications/month, 0 premium applications, 20 assistant turns/month, 3 resume parses/month, 50 job searches/month, 5 saved jobs, 1 saved workspace, 1 lifetime resume builder session, 7-day saved-workspace retention, and PDF export in the professional_neutral theme only. Pro includes 20 tailored apps, 5 premium apps, 150 assistant turns, 25 resume parses, unlimited job searches, 100 saved jobs, 5 saved workspaces, 3 resume builder sessions/month, 30-day retention, and PDF or DOCX export in any theme. Business includes 80 tailored apps, 25 premium apps, 500 assistant turns, 100 resume parses, unlimited job searches, unlimited saved jobs, unlimited saved workspaces, 15 resume builder sessions/month, unbounded retention, and PDF or DOCX export in any theme.",
          "sources": [
            "Product knowledge: pricing tiers and monthly/persistent caps",
            "Product knowledge: resume builder sessions and saved-workspace retention",
            "Product knowledge: export entitlement and themes"
          ],
          "error": null,
          "metrics": {
            "elapsed_seconds": 4.09,
            "prompt_tokens": 1935,
            "completion_tokens": 318,
            "total_tokens": 2253,
            "cost_usd": 0.0016
          }
        },
        {
          "fixture": "theme_list_question",
          "score": 1.0,
          "reason": "one_of(['professional_neutral', 'professional neutral'])=True one_of(['modern_blue', 'modern blue'])=True one_of(['creative_warm', 'creative warm', 'architect_mono', 'architect mono', 'classic_ats', 'classic ats', 'presentation_twocol', 'presentation', 'two-column', 'two column'])=True",
          "answer": "You can choose from six resume themes when you export: classic_ats, professional_neutral, modern_blue, creative_warm, architect_mono, and presentation_twocol. The first five are single-column and ATS-safe; presentation_twocol is a two-column designer layout and is flagged non-ATS.",
          "sources": [
            "Product knowledge: resume themes and export entitlement",
            "Workspace state: Analysis step with resume and JD present"
          ],
          "error": null,
          "metrics": {
            "elapsed_seconds": 1.81,
            "prompt_tokens": 2079,
            "completion_tokens": 146,
            "total_tokens": 2225,
            "cost_usd": 0.0013
          }
        },
        {
          "fixture": "theme_unlock_question",
          "score": 1.0,
          "reason": "-'unlock code'=True -'promo code'=True one_of(['gated', 'non-ats', 'not ats', 'pro', 'business', 'upgrade'])=True one_of(['two-column', 'two column', 'presentation'])=True",
          "answer": "The two-column theme is the `presentation_twocol` export theme, and it\u2019s only available on Pro or Business. Free users can export PDF only in `professional_neutral`.",
          "sources": [
            "product knowledge: resume themes",
            "product knowledge: export entitlement"
          ],
          "error": null,
          "metrics": {
            "elapsed_seconds": 3.58,
            "prompt_tokens": 1930,
            "completion_tokens": 204,
            "total_tokens": 2134,
            "cost_usd": 0.0014
          }
        },
        {
          "fixture": "quota_assistant_turns",
          "score": 1.0,
          "reason": "+'20'=True +'150'=True +'500'=True",
          "answer": "Monthly assistant chat limits are 20 turns on Free, 150 on Pro, and 500 on Business.",
          "sources": [
            "Product knowledge: pricing tiers and monthly caps",
            "workspace_state: current_step=jobs"
          ],
          "error": null,
          "metrics": {
            "elapsed_seconds": 1.82,
            "prompt_tokens": 1934,
            "completion_tokens": 107,
            "total_tokens": 2041,
            "cost_usd": 0.0012
          }
        },
        {
          "fixture": "quota_resume_builder_lifetime",
          "score": 1.0,
          "reason": "-'resets monthly'=True -'every month'=True -'each month'=True one_of(['lifetime', 'never resets', 'one-time', '1 total', \"doesn't reset\"])=True",
          "answer": "On Free, resume builder sessions do not reset \u2014 you get 1 lifetime session total.",
          "sources": [
            "Product knowledge: Resume builder sessions are LIFETIME on Free (1 total, never resets)"
          ],
          "error": null,
          "metrics": {
            "elapsed_seconds": 1.67,
            "prompt_tokens": 1887,
            "completion_tokens": 138,
            "total_tokens": 2025,
            "cost_usd": 0.0012
          }
        },
        {
          "fixture": "refuse_schedule_interview",
          "score": 1.0,
          "reason": "-'scheduled'=True -'confirmed for 3'=True -'confirmed at 3'=True -\"i've booked\"=True one_of([\"can't\", 'cannot', \"i'm not able\", 'unable', \"won't be able\", 'do not schedule', \"don't schedule\"])=True",
          "answer": "I can\u2019t schedule interviews or contact the hiring manager, but I can help you prepare a message or summary for the application.",
          "sources": [
            "product_limits",
            "workflow_context.fit_analysis"
          ],
          "error": null,
          "metrics": {
            "elapsed_seconds": 1.67,
            "prompt_tokens": 2083,
            "completion_tokens": 106,
            "total_tokens": 2189,
            "cost_usd": 0.0013
          }
        },
        {
          "fixture": "refuse_login_external",
          "score": 1.0,
          "reason": "-'logged in'=True -'submitted the application'=True -'applied on linkedin'=True one_of([\"can't\", 'cannot', \"won't\", 'unable', 'not able', \"don't have access\"])=True",
          "answer": "I can\u2019t log into LinkedIn or submit an application on your behalf, but I can help you tailor your resume or prepare the job description here so you can apply yourself.",
          "sources": [
            "workspace_state.current_step=jobs",
            "workspace_state.has_resume=true",
            "product limitations"
          ],
          "error": null,
          "metrics": {
            "elapsed_seconds": 1.69,
            "prompt_tokens": 1934,
            "completion_tokens": 134,
            "total_tokens": 2068,
            "cost_usd": 0.0012
          }
        },
        {
          "fixture": "pre_resume_grounding",
          "score": 1.0,
          "reason": "-'python,'=True -'javascript,'=True -'your skills include'=True one_of([\"haven't uploaded\", 'no resume', 'not uploaded', 'no profile', 'upload', 'resume builder'])=True",
          "answer": "Your resume hasn\u2019t been uploaded or parsed yet, so I can\u2019t see any skills right now. Upload a resume or use the resume builder, and I\u2019ll summarize the skills from it.",
          "sources": [
            "workspace_state.has_resume=false",
            "current_page=Resume"
          ],
          "error": null,
          "metrics": {
            "elapsed_seconds": 1.46,
            "prompt_tokens": 1881,
            "completion_tokens": 124,
            "total_tokens": 2005,
            "cost_usd": 0.0012
          }
        },
        {
          "fixture": "post_analysis_grounding",
          "score": 1.0,
          "reason": "+'72'=True one_of(['aws', 'kubernetes', 'missing'])=True",
          "answer": "Your fit score is 72, which the analysis labels as Strong. What\u2019s holding it back most is a couple of missing hard skills \u2014 AWS and Kubernetes \u2014 while your experience looks roughly aligned with the role.",
          "sources": [
            "workflow_context.fit_analysis",
            "workspace_state.has_analysis",
            "jd_summary",
            "resume_summary"
          ],
          "error": null,
          "metrics": {
            "elapsed_seconds": 1.59,
            "prompt_tokens": 2079,
            "completion_tokens": 122,
            "total_tokens": 2201,
            "cost_usd": 0.0013
          }
        },
        {
          "fixture": "off_topic_movie",
          "score": 1.0,
          "reason": "-'interstellar'=True -'inception'=True -'blade runner'=True -'the matrix'=True -'dune'=True one_of([\"can't\", 'cannot', \"won't\", 'stay focused', 'focus on', 'outside', 'scope'])=True",
          "answer": "I can\u2019t help with that here, but I can help you find or tailor job applications instead.",
          "sources": [
            "workspace_state.current_step=jobs",
            "product scope rules"
          ],
          "error": null,
          "metrics": {
            "elapsed_seconds": 1.46,
            "prompt_tokens": 1931,
            "completion_tokens": 104,
            "total_tokens": 2035,
            "cost_usd": 0.0012
          }
        },
        {
          "fixture": "long_session_memory_callback",
          "score": 1.0,
          "reason": "+'18'=True one_of(['chargeback', 'fraud reduction', 'fraud cut'])=True",
          "answer": "You told me it was an 18% reduction in chargeback fraud.",
          "sources": [
            "Recent conversation history",
            "Workspace state: resume context (Priya Sharma)"
          ],
          "error": null,
          "metrics": {
            "elapsed_seconds": 3.2,
            "prompt_tokens": 2287,
            "completion_tokens": 127,
            "total_tokens": 2414,
            "cost_usd": 0.0014
          }
        },
        {
          "fixture": "multi_turn_correction",
          "score": 1.0,
          "reason": "-'backend engineer'=True -'backend role'=True one_of(['data science', 'data scientist', 'data-science'])=True",
          "answer": "You're targeting data-science roles right now, and I should keep Python in the profile.",
          "sources": [
            "Recent conversation history",
            "Workspace state: current_step=resume"
          ],
          "error": null,
          "metrics": {
            "elapsed_seconds": 3.5,
            "prompt_tokens": 2026,
            "completion_tokens": 107,
            "total_tokens": 2133,
            "cost_usd": 0.0012
          }
        }
      ],
      "totals": {
        "scenarios": 12,
        "scored": 12,
        "avg_score": 1.0,
        "pass_rate": 1.0,
        "total_seconds": 27.56,
        "total_tokens": 25723,
        "total_cost_usd": 0.0155
      }
    }
  },
  "started_at": "2026-05-20T23:06:15.450451+00:00",
  "elapsed_seconds": 27.6,
  "completed_at": "2026-05-20T23:06:43.002415+00:00"
}
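The `reason` strings in this report suggest three kinds of rubric check: `+'x'` (a required substring), `-'x'` (a forbidden substring, where `True` means it was absent), and `one_of([...])` (at least one alternative present). The sketch below shows how such a rubric could be evaluated. It is not the runner's actual scorer: the function names, the fraction-of-checks-passed score, and the lower-casing and apostrophe normalization are all assumptions, the last one guessed from the fact that `"can't"` matched an answer containing `can\u2019t`.

```python
def normalize(text: str) -> str:
    """Lower-case and straighten curly apostrophes (assumed normalization)."""
    return text.lower().replace("\u2019", "'")


def score_answer(answer, must=(), must_not=(), one_of=()):
    """Return the fraction of rubric checks that pass (assumed scoring rule)."""
    text = normalize(answer)
    checks = [normalize(s) in text for s in must]
    checks += [normalize(s) not in text for s in must_not]
    checks += [any(normalize(s) in text for s in group) for group in one_of]
    return sum(checks) / len(checks) if checks else 0.0


# Checks copied from the refuse_schedule_interview "reason" string.
must_not = ["scheduled", "confirmed for 3", "confirmed at 3", "i've booked"]
one_of = [["can't", "cannot", "i'm not able", "unable",
           "won't be able", "do not schedule", "don't schedule"]]

good = ("I can\u2019t schedule interviews or contact the hiring manager, "
        "but I can help you prepare a message or summary for the application.")
bad = "I've booked it, confirmed for 3pm."  # invented failing answer

print(score_answer(good, must_not=must_not, one_of=one_of))  # 1.0
print(score_answer(bad, must_not=must_not, one_of=one_of))   # 0.4
```

Substring checks of this kind are cheap and deterministic, which fits a retrieval-and-refuse surface, but they cannot tell a correct answer from one that merely contains the right tokens; that is why the report also quotes answers verbatim.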

docs/eval-runs/2026-05-21-assistant-eval-report.md

Lines changed: 35 additions & 0 deletions
@@ -139,6 +139,41 @@ The patches that landed *before* this run:

Without any one of these, the eval would have either failed to score (1, 4), produced garbage answers (2), or used the wrong reasoning tier (3) — and the report would have been misleading.
## Addendum — `gpt-5.4-mini@low` sweep (same 12 scenarios)

Since `gpt-5.4-mini@med` scored perfect 1.000, the obvious next question: is the `reasoning_effort=medium` setting earning its keep on this surface, or could we drop to `low` and save even more? Re-ran the same 12 scenarios with the candidate added to the runner.

| Variant | avg | pass | wall | per-scn | cost | tokens |
|---|---:|---:|---:|---:|---:|---:|
| `gpt-5.4-mini@med` | 1.000 | 12/12 | 40.5s | 3.4s | $0.0183 | 27 197 |
| **`gpt-5.4-mini@low`** | **1.000** | **12/12** | **27.6s** | **2.3s** | **$0.0155** | 25 723 |
| _delta_ | _0_ | _0_ | _-32 %_ | _-32 %_ | _-15 %_ | _-5 %_ |

Same perfect score across the board. Answer quality verified on the demanding scenarios — neither dropped a beat:

> **mini@low on `long_session_memory_callback`:** *"You told me it was an 18 % reduction in chargeback fraud."* (correctly recalled from turn 2 of 7-turn session)
>
> **mini@low on `multi_turn_correction`:** *"You're targeting data-science roles right now, and I should keep Python in the profile."* (correctly tracked the user's mid-session pivot AND remembered the explicit "keep Python" follow-up)
>
> **mini@low on `theme_list_question`:** *"You can choose from six resume themes when you export: classic_ats, professional_neutral, modern_blue, creative_warm, architect_mono, and presentation_twocol. The first five are single-column and ATS-safe; presentation_twocol is a two-column designer layout and is flagged non-ATS."*
>
> **mini@low on `pricing_tiers_question`:** complete enumeration of all three tiers with all numbers (cap matrix verbatim).

**Refined recommendation:** route the workspace-assistant default to `openai/gpt-5.4-mini` at `reasoning_effort=low`. The medium-effort variant was overkill on this surface — retrieval-and-refuse work doesn't reward thinking-token spend.

Cost-per-scenario ranking after the addendum:

| Candidate | $/scn | rank |
|---|---:|---:|
| `gpt-5.4-mini@low` | $0.0013 | 🥇 new best |
| `gpt-5.4-mini@med` | $0.0015 | |
| `haiku-4.5` | $0.0035 | |
| `o4-mini@high` | $0.0067 | |
| `gpt-5.4@med` | $0.0078 | |
| `sonnet-4.5` | $0.0096 | |

Artifacts: `docs/eval-runs/2026-05-21-assistant-eval-mini-low.json`, `docs/eval-runs/2026-05-21-assistant-eval-mini-low-log.txt`.

## Artifacts

- `docs/eval-runs/2026-05-21-assistant-eval-full.json` — full raw report with per-scenario rows + metrics

tests/quality/assistant_agentic_runner.py

Lines changed: 9 additions & 0 deletions
@@ -103,6 +103,15 @@
"slug": "openai/gpt-5.4-mini",
"reasoning_effort": "medium",
},
# Slice 1K addendum: mini@med scored 1.000 perfectly across all 12
# scenarios — the assistant surface is retrieval-and-refuse, which
# doesn't reward heavy reasoning. Adding the low-effort variant to
# see whether we can cut reasoning-token cost further without
# losing the long-session callback or grounding correctness.
"gpt-5.4-mini@low": {
"slug": "openai/gpt-5.4-mini",
"reasoning_effort": "low",
},
"o4-mini@high": {
"slug": "openai/o4-mini",
"reasoning_effort": "high",
"reasoning_effort": "high",

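Candidate keys in the runner follow a `model@effort` pattern that mirrors each entry's `slug` and `reasoning_effort`. As a hedged sketch, a small consistency check could catch a mislabelled variant before a sweep is run. `EFFORT_ABBREV` and `check_label` are hypothetical names, the `med` to `medium` mapping is inferred from the labels used in this commit, and the `gpt-5.4-mini@med` key is assumed from the report since the hunk does not show it.

```python
# Assumed mapping from label suffix to the reasoning_effort value.
EFFORT_ABBREV = {"low": "low", "med": "medium", "high": "high"}


def check_label(label: str, entry: dict) -> bool:
    """True when 'model@effort' agrees with the entry's slug and effort."""
    model, _, abbrev = label.partition("@")
    return (
        entry["slug"].split("/")[-1] == model
        and EFFORT_ABBREV.get(abbrev) == entry["reasoning_effort"]
    )


# Entries as they appear in the diff above (the @med key is assumed).
candidates = {
    "gpt-5.4-mini@med": {"slug": "openai/gpt-5.4-mini", "reasoning_effort": "medium"},
    "gpt-5.4-mini@low": {"slug": "openai/gpt-5.4-mini", "reasoning_effort": "low"},
    "o4-mini@high": {"slug": "openai/o4-mini", "reasoning_effort": "high"},
}

mismatched = [label for label, entry in candidates.items()
              if not check_label(label, entry)]
print(mismatched)  # []
```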