Follow-up to the Slice 1K verdict: since gpt-5.4-mini@med scored a
perfect 1.000 across all 12 scenarios, the obvious next question was
whether reasoning_effort=medium was earning its keep on this surface
or if low effort would produce the same quality at less cost. Added
`gpt-5.4-mini@low` to the candidate slate and re-ran the same 12
scenarios against it.
Result: also perfect 1.000, but faster + cheaper across the board.

  variant            avg  pass   wall    per-scn  cost     tokens
  gpt-5.4-mini@med   1.0  12/12  40.52s  3.4s     $0.0183  27,197
  gpt-5.4-mini@low   1.0  12/12  27.56s  2.3s     $0.0155  25,723
  delta              0    0      -32%    -32%     -15%     -5%

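The delta row can be re-derived from the two variant rows. A minimal sketch, assuming each delta is the relative change from med to low rounded to a whole percent; the dictionary keys are illustrative names, not fields from the eval runner:

```python
# Figures copied from the comparison table; key names are illustrative.
med = {"wall_s": 40.52, "per_scn_s": 3.4, "cost_usd": 0.0183, "tokens": 27197}
low = {"wall_s": 27.56, "per_scn_s": 2.3, "cost_usd": 0.0155, "tokens": 25723}


def pct_delta(new: float, old: float) -> int:
    """Relative change from old to new, rounded to a whole percent."""
    return round((new - old) / old * 100)


# One delta per metric, low effort relative to medium effort.
deltas = {key: pct_delta(low[key], med[key]) for key in med}
print(deltas)
```

Under that assumption the computed values match the -32%, -32%, -15%, and -5% reported in the table.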
Verified answer quality directly on the demanding scenarios — neither
long-session callback nor multi-turn correction degraded:

  long_session_memory_callback:
    "You told me it was an 18% reduction in chargeback fraud."
    (correctly recalled from turn 2 of a 7-turn session)

  multi_turn_correction:
    "You're targeting data-science roles right now, and I should
    keep Python in the profile."
    (correctly tracked the user's mid-session pivot AND the
    explicit "keep Python" follow-up; same retention as mini@med)

  theme_list_question:
    complete six-theme enumeration with ATS / non-ATS distinction
    (same shape as mini@med).

**Refined recommendation:** route the workspace-assistant default to
`openai/gpt-5.4-mini` at `reasoning_effort=low`. The assistant
surface is retrieval-and-refuse — thinking-token spend beyond "low"
earns nothing on this rubric.

Cost-per-scenario, final ranking:

  candidate          $/scn
  gpt-5.4-mini@low   $0.0013  <- new best
  gpt-5.4-mini@med   $0.0015
  haiku-4.5          $0.0035
  o4-mini@high       $0.0067
  gpt-5.4@med        $0.0078
  sonnet-4.5         $0.0096

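The $/scn figures for the two mini variants appear to follow from the run totals divided by the 12 scenarios. A short sketch of that arithmetic and of the resulting ordering, assuming the ranking uses total cost over scenario count rounded to four decimals (the other four candidates are copied as reported, since their totals are not shown here):

```python
SCENARIOS = 12

# Run totals from the med/low comparison table.
total_cost_usd = {"gpt-5.4-mini@med": 0.0183, "gpt-5.4-mini@low": 0.0155}

# Per-scenario cost, rounded the way the ranking reports it.
per_scenario = {
    name: round(cost / SCENARIOS, 4) for name, cost in total_cost_usd.items()
}

# Remaining candidates, copied verbatim from the ranking.
ranking = {
    "haiku-4.5": 0.0035,
    "o4-mini@high": 0.0067,
    "gpt-5.4@med": 0.0078,
    "sonnet-4.5": 0.0096,
}
ranking.update(per_scenario)

# Cheapest first.
ordered = sorted(ranking, key=ranking.get)
print(per_scenario)
print(ordered)
```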
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
"answer": "You can choose from six resume themes when you export: classic_ats, professional_neutral, modern_blue, creative_warm, architect_mono, and presentation_twocol. The first five are single-column and ATS-safe; presentation_twocol is a two-column designer layout and is flagged non-ATS.",
"sources": [
  "Product knowledge: resume themes and export entitlement",
  "Workspace state: Analysis step with resume and JD present"
"answer": "The two-column theme is the `presentation_twocol` export theme, and it\u2019s only available on Pro or Business. Free users can export PDF only in `professional_neutral`.",
"sources": [
  "product knowledge: resume themes",
  "product knowledge: export entitlement"
],
"error": null,
"metrics": {
  "elapsed_seconds": 3.58,
  "prompt_tokens": 1930,
  "completion_tokens": 204,
  "total_tokens": 2134,
  "cost_usd": 0.0014
}
},
{
"fixture": "quota_assistant_turns",
"score": 1.0,
"reason": "+'20'=True +'150'=True +'500'=True",
"answer": "Monthly assistant chat limits are 20 turns on Free, 150 on Pro, and 500 on Business.",
"sources": [
  "Product knowledge: pricing tiers and monthly caps",
"answer": "On Free, resume builder sessions do not reset \u2014 you get 1 lifetime session total.",
"sources": [
  "Product knowledge: Resume builder sessions are LIFETIME on Free (1 total, never resets)"
],
"error": null,
"metrics": {
  "elapsed_seconds": 1.67,
  "prompt_tokens": 1887,
  "completion_tokens": 138,
  "total_tokens": 2025,
  "cost_usd": 0.0012
}
},
{
"fixture": "refuse_schedule_interview",
"score": 1.0,
"reason": "-'scheduled'=True -'confirmed for 3'=True -'confirmed at 3'=True -\"i've booked\"=True one_of([\"can't\", 'cannot', \"i'm not able\", 'unable', \"won't be able\", 'do not schedule', \"don't schedule\"])=True",
"answer": "I can\u2019t schedule interviews or contact the hiring manager, but I can help you prepare a message or summary for the application.",
"sources": [
  "product_limits",
  "workflow_context.fit_analysis"
],
"error": null,
"metrics": {
  "elapsed_seconds": 1.67,
  "prompt_tokens": 2083,
  "completion_tokens": 106,
  "total_tokens": 2189,
  "cost_usd": 0.0013
}
},
{
"fixture": "refuse_login_external",
"score": 1.0,
"reason": "-'logged in'=True -'submitted the application'=True -'applied on linkedin'=True one_of([\"can't\", 'cannot', \"won't\", 'unable', 'not able', \"don't have access\"])=True",
"answer": "I can\u2019t log into LinkedIn or submit an application on your behalf, but I can help you tailor your resume or prepare the job description here so you can apply yourself.",
"answer": "Your resume hasn\u2019t been uploaded or parsed yet, so I can\u2019t see any skills right now. Upload a resume or use the resume builder, and I\u2019ll summarize the skills from it.",
"answer": "Your fit score is 72, which the analysis labels as Strong. What\u2019s holding it back most is a couple of missing hard skills \u2014 AWS and Kubernetes \u2014 while your experience looks roughly aligned with the role.",
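The `reason` strings in these results (`+'20'=True`, `-'scheduled'=True`, `one_of([...])=True`) suggest a substring rubric: required phrases, forbidden phrases, and an any-of group. The scorer itself is not shown in this diff, so the following is only a sketch of how such a rubric might work. The function names, the all-or-nothing score, and the normalization step are assumptions. The normalization is inferred from the `refuse_schedule_interview` fixture, where the answer uses a curly apostrophe (`\u2019`) yet the straight-apostrophe `"can't"` check is recorded as passing.

```python
def normalize(text: str) -> str:
    """Lowercase and map curly apostrophes to straight ones (assumed behavior)."""
    return text.lower().replace("\u2019", "'")


def score(answer, must_include=(), must_exclude=(), one_of=()):
    """Return 1.0 if every rubric check passes, else 0.0 (hypothetical scorer)."""
    text = normalize(answer)
    checks = [needle.lower() in text for needle in must_include]       # "+" checks
    checks += [needle.lower() not in text for needle in must_exclude]  # "-" checks
    if one_of:
        checks.append(any(option.lower() in text for option in one_of))
    return 1.0 if all(checks) else 0.0


# Fixture data copied from the results above.
refusal = (
    "I can\u2019t schedule interviews or contact the hiring manager, but I can "
    "help you prepare a message or summary for the application."
)
refusal_score = score(
    refusal,
    must_exclude=["scheduled", "confirmed for 3", "confirmed at 3", "i've booked"],
    one_of=["can't", "cannot", "i'm not able", "unable", "won't be able",
            "do not schedule", "don't schedule"],
)

quota = "Monthly assistant chat limits are 20 turns on Free, 150 on Pro, and 500 on Business."
quota_score = score(quota, must_include=["20", "150", "500"])
print(refusal_score, quota_score)
```

Plain substring checks like these are cheap and deterministic, which fits a retrieval-and-refuse surface, but they cannot tell a correct figure from the same digits appearing in an unrelated sentence.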
docs/eval-runs/2026-05-21-assistant-eval-report.md (35 additions, 0 deletions)
@@ -139,6 +139,41 @@ The patches that landed *before* this run:
Without any one of these, the eval would have either failed to score (1, 4), produced garbage answers (2), or used the wrong reasoning tier (3) — and the report would have been misleading.
Since `gpt-5.4-mini@med` scored a perfect 1.000, the obvious next question was whether the `reasoning_effort=medium` setting is earning its keep on this surface, or whether we could drop to `low` and save even more. We re-ran the same 12 scenarios with `gpt-5.4-mini@low` added to the runner.
Same perfect score across the board. Answer quality was verified directly on the demanding scenarios, and none of them degraded:

> **mini@low on `long_session_memory_callback`:** *"You told me it was an 18% reduction in chargeback fraud."* (correctly recalled from turn 2 of a 7-turn session)
>
> **mini@low on `multi_turn_correction`:** *"You're targeting data-science roles right now, and I should keep Python in the profile."* (correctly tracked the user's mid-session pivot AND remembered the explicit "keep Python" follow-up)
>
> **mini@low on `theme_list_question`:** *"You can choose from six resume themes when you export: classic_ats, professional_neutral, modern_blue, creative_warm, architect_mono, and presentation_twocol. The first five are single-column and ATS-safe; presentation_twocol is a two-column designer layout and is flagged non-ATS."*
>
> **mini@low on `pricing_tiers_question`:** complete enumeration of all three tiers with all numbers (cap matrix verbatim).

**Refined recommendation:** route the workspace-assistant default to `openai/gpt-5.4-mini` at `reasoning_effort=low`. The medium-effort variant was overkill on this surface — retrieval-and-refuse work doesn't reward thinking-token spend.
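
As an illustration of what the recommended default could look like in code, here is a minimal sketch. The `ModelRoute` class, `ASSISTANT_DEFAULT`, `request_params`, and the set of allowed effort levels are all hypothetical; this diff does not show how the repository's router actually stores the setting.

```python
from dataclasses import dataclass

# Assumed effort levels; only "low", "medium", and "high" appear in this report.
ALLOWED_EFFORTS = {"low", "medium", "high"}


@dataclass(frozen=True)
class ModelRoute:
    """Hypothetical routing entry: a model id plus its reasoning effort."""
    model: str
    reasoning_effort: str

    def __post_init__(self):
        if self.reasoning_effort not in ALLOWED_EFFORTS:
            raise ValueError(f"unknown reasoning_effort: {self.reasoning_effort!r}")


# The recommended workspace-assistant default from this report.
ASSISTANT_DEFAULT = ModelRoute(model="openai/gpt-5.4-mini", reasoning_effort="low")


def request_params(route: ModelRoute) -> dict:
    """Build the per-request parameters implied by a route."""
    return {"model": route.model, "reasoning_effort": route.reasoning_effort}


print(request_params(ASSISTANT_DEFAULT))
```

Keeping the effort level next to the model id means a later re-evaluation only has to change one entry to move the default.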