Skip to content

Commit b3f2ea1

Browse files
LEANDERANTONYclaude
andcommitted
feat(eval): Slice 1K — workspace-assistant 5-candidate eval + report
New runner `tests/quality/assistant_agentic_runner.py` mirrors the Phase B pattern (incremental-checkpoint + tail-f-friendly heartbeat) but targets the assistant prompt surface directly via `build_assistant_prompt` + `run_json_prompt`. Twelve scenarios across four failure modes: * Product-knowledge fluency (5): pricing tiers, theme inventory, two-column gating, monthly assistant_turns, lifetime-vs-monthly resume-builder quota * Honest refusals (2): schedule interview, LinkedIn login * Grounding discipline (3): off-topic movie, pre-resume "what skills?", post-analysis "what's my fit score?" * Multi-turn memory (2): 7-turn callback to a fact stated on turn 2, mid-session correction (latest stated truth wins) Substring-matcher rubric with the same normalisation pattern as Slice 1H (smart-quote / em-dash collapse) so the matcher bugs that plagued the comprehensive eval don't re-appear. Candidate slate (user-approved after dropping Opus + substituting o4-mini for the non-existent gpt-5.1-mini): gpt-5.4@medium, gpt-5.4-mini@medium, o4-mini@high, sonnet-4.5, haiku-4.5. All five routed through OpenRouter for transport-fair comparison (Slice 1H proved the proxy overhead is ~0s). Headline result (full data in `docs/eval-runs/2026-05-21-assistant-eval-full.json`): candidate | avg | pass | wall | cost gpt-5.4@med | 0.986 | 1.000 | 74.7s | $0.094 gpt-5.4-mini@med | 1.000 | 1.000 | 40.5s | $0.018 o4-mini@high | 1.000 | 1.000 | 117.3s | $0.081 sonnet-4.5 | 1.000 | 1.000 | 161.3s | $0.116 haiku-4.5 | 0.917 | 0.917 | 37.6s | $0.038 **`gpt-5.4-mini@med` wins on all three axes** — perfect quality, fastest, cheapest. The assistant surface is mostly retrieval-and- refuse (pulling facts from the new product-knowledge block, declining off-topic asks, recalling earlier turns), so heavy reasoning is wasted; smart-but-cheap wins. The two sub-1.0 scores re-classify cleanly: gpt-5.4@med's 0.833 on off_topic_movie is a matcher-bug (the model said "I can only help with your job application workflow here" — a perfect refusal that wasn't in the rubric's `one_of` list); haiku-4.5's 0.000 on quota_resume_builder_lifetime is a real JSON-mode fidelity miss (invalid JSON returned; same ~92% drift Phase B caught on Anthropic via OpenRouter for parser/JD). **Recommendation:** route the workspace-assistant default to `openai/gpt-5.4-mini` at `reasoning_effort=medium`. Real departure from the resume-builder default (gpt-5.4) and the Phase B verdict (gpt-5.4 for parser/JD/analysis); the surface characteristics genuinely differ. Expected ~80% savings on assistant API spend. Full read-out in `docs/eval-runs/2026-05-21-assistant-eval-report.md`. DEVLOG Day 61 added covering Slice 1J + 1J' + 1J'' + 1K end-to-end. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 02e9853 commit b3f2ea1

5 files changed

Lines changed: 2303 additions & 0 deletions

File tree

docs/DEVLOG.md

Lines changed: 152 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2419,3 +2419,155 @@ smart clarifying question" (FAIL but BETTER) from "hallucinated
24192419
capability" (FAIL and WORSE). A v2 rubric with LLM-as-judge
24202420
1-5 quality scoring per scenario would catch this honestly.
24212421
Parked.
2422+
2423+
2424+
## Day 61: Workspace assistant — history fix, product-knowledge block, adapter reasoning_effort, and Slice 1K eval
2425+
2426+
Today extended the agentic-upgrade work to the OTHER chat surface
2427+
in the app: the workspace assistant (`src/assistant_service.py`,
2428+
prompts at `prompts/assistant/v1.json` and
2429+
`prompts/assistant_text/v1.json`). The audit found the SAME
2430+
history-truncation bug the resume builder had at the start of
2431+
Slice 1B — except worse: assistant code was at `history[-4:]`,
2432+
whereas the resume builder had been at `history[-12:]`. Four
2433+
turns is not enough for any real conversation.
2434+
2435+
### Slice 1J: drop the `history[-4:]` slice on the assistant prompts
2436+
2437+
New constant `ASSISTANT_HISTORY_CHAR_BUDGET = 18000` (smaller than
2438+
the resume builder's 30 k because the assistant carries a heavier
2439+
`assistant_context` payload alongside — workspace_state +
2440+
workflow_context + the WORKSPACE STATE guidance rules). Both
2441+
`build_assistant_prompt` and `build_assistant_text_prompt` now
2442+
call `_slice_history_for_budget(history, max_chars=...)` instead
2443+
of the hard suffix slice. Same drop-oldest-first semantics as the
2444+
resume builder, with the most-recent turn guaranteed retained.
2445+
2446+
Why this matters concretely: the Slice 1K `long_session_memory_callback`
2447+
scenario is a 7-turn conversation where the user states "we cut
2448+
chargeback fraud by 18% using XGBoost" on turn 2, then on turn 6
2449+
asks "what number did I tell you?". With `history[-4:]` the model
2450+
sees turns 3-6 ONLY — the 18% fact has scrolled off and the
2451+
question is literally unanswerable. After the fix, all 5 candidates
2452+
in the Slice 1K eval correctly recalled "18%".
2453+
2454+
### Slice 1J': `_PRODUCT_KNOWLEDGE_BLOCK` — stop sounding ignorant
2455+
2456+
The WORKSPACE STATE block teaches the assistant how to READ live
2457+
runtime state but said nothing about pricing, themes, the agentic
2458+
pipeline, or the assistant's own limits. When a user asked "what
2459+
tiers do you have?" / "what themes can I use?" / "can you book
2460+
me an interview?", the answers ranged from "I don't have that
2461+
info" to outright fabrications.
2462+
2463+
Added a new module-level constant in `src/prompts.py` that's also
2464+
pre-baked into both registry JSONs (Pattern A per the prompt-
2465+
registry migration notes). It backstops all of these questions
2466+
with authoritative numbers pulled from `backend/tiers.py`
2467+
(TIER_CAPS), `src/resume_builder.py` (RESUME_THEMES),
2468+
`backend/tiers.py` (FREE_EXPORT_FORMAT/THEME), and `src/agents/*`
2469+
(the orchestrator chain):
2470+
2471+
* **Tier caps**: Free / Pro / Business — tailored applications
2472+
(3 / 20 / 80), assistant turns (20 / 150 / 500), resume parses
2473+
(3 / 25 / 100), saved jobs (5 / 1000 / unlimited), saved
2474+
workspaces (1 / 5 / unlimited), retention (7 days / 30 days /
2475+
unbounded).
2476+
* **Lifetime gotcha**: resume_builder_sessions on Free is
2477+
LIFETIME (never resets), monthly on Pro (3) / Business (15).
2478+
The block calls this out verbatim so the model stops
2479+
answering "resets monthly" by reflex.
2480+
* **Theme inventory**: six themes (classic_ats,
2481+
professional_neutral, modern_blue, creative_warm,
2482+
architect_mono, presentation_twocol); first five are
2483+
single-column ATS-safe; presentation_twocol is gated +
2484+
non-ATS.
2485+
* **Export entitlement**: Free = PDF + professional_neutral
2486+
only; Pro/Business = PDF or DOCX + any theme.
2487+
* **Agentic chain**: tailoring → review → resume gen → cover
2488+
letter, with conservative-correction posture in the review
2489+
pass.
2490+
* **Honest cannot-do list**: schedule interviews, send emails,
2491+
log in to LinkedIn / Indeed, scrape arbitrary URLs, edit the
2492+
resume file directly, change subscription tier, remember
2493+
across sessions when signed out.
2494+
2495+
If any of these source-of-truth numbers ever drift, the
2496+
byte-mirror tests
2497+
(`test_assistant_prompt_matches_pre_migration_system_byte_for_byte`)
2498+
fail on the next CI run.
2499+
2500+
### Slice 1J'': thread `reasoning_effort` through both eval adapters
2501+
2502+
`OpenRouterEvalService.run_tool_loop`, `run_json_prompt`, and
2503+
`run_structured_prompt` — plus the symmetric `KimiEvalService._chat`
2504+
+ entry points — now forward the `reasoning_effort` kwarg to
2505+
`chat.completions.create`. Conditional (only when truthy) because
2506+
non-reasoning slugs (Sonnet, Haiku, DeepSeek v4) 400 if it's set.
2507+
2508+
Added pricing entries for the new Slice 1K candidates:
2509+
`openai/o4-mini` ($1.10 / $4.40 per Mtok — substituted for the
2510+
non-existent `openai/gpt-5.1-mini`) and `anthropic/claude-haiku-4.5`
2511+
($1.00 / $5.00 per Mtok).
2512+
2513+
Bonus bugfix from the Slice 1K smoke run: the
2514+
`OpenRouterEvalService.run_json_prompt` path was never
2515+
accumulating `response.usage` into the snapshot — only
2516+
`run_tool_loop` was. The smoke at first reported $0.0000 for
2517+
every call. Mirrored the accumulator into the single-shot path so
2518+
the assistant / parser / structuring suites all surface accurate
2519+
per-call cost.
2520+
2521+
### Slice 1K: 5-candidate assistant eval (12 scenarios)
2522+
2523+
New runner `tests/quality/assistant_agentic_runner.py` — mirrors
2524+
the Phase B incremental-checkpoint + heartbeat pattern but
2525+
targets the assistant prompt surface directly via
2526+
`build_assistant_prompt` + `run_json_prompt`. Twelve scenarios
2527+
across product-knowledge fluency, honest refusals, grounding
2528+
discipline, and multi-turn memory. Substring-matcher rubric with
2529+
the same normalisation (smart quotes / em-dashes) as Slice 1H.
2530+
2531+
Candidate slate (user-approved after dropping Opus + substituting
2532+
o4-mini for the non-existent gpt-5.1-mini): gpt-5.4@med,
2533+
gpt-5.4-mini@med, o4-mini@high, sonnet-4.5, haiku-4.5.
2534+
2535+
Headline result (full per-candidate × per-scenario data in
2536+
`docs/eval-runs/2026-05-21-assistant-eval-full.json`):
2537+
2538+
| candidate | avg | pass | wall | cost |
2539+
| gpt-5.4@med | 0.986 | 1.000 | 74.7s | $0.094 |
2540+
| gpt-5.4-mini@med | 1.000 | 1.000 | 40.5s | $0.018 |
2541+
| o4-mini@high | 1.000 | 1.000 | 117.3s | $0.081 |
2542+
| sonnet-4.5 | 1.000 | 1.000 | 161.3s | $0.116 |
2543+
| haiku-4.5 | 0.917 | 0.917 | 37.6s | $0.038 |
2544+
2545+
**Surprise:** `gpt-5.4-mini@med` is the winner on all three axes
2546+
— quality 1.000, fastest at 40 s, cheapest at $0.018 (1/5 the
2547+
cost of gpt-5.4@med which scored 0.986). The assistant surface
2548+
is mostly retrieval-and-refuse: pulling facts from the new
2549+
product-knowledge block, declining off-topic asks, recalling
2550+
earlier turns. Heavy reasoning is wasted; smart-but-cheap wins.
2551+
2552+
The two sub-1.0 scores re-classify cleanly:
2553+
* `gpt-5.4@med` :: `off_topic_movie` (0.833) — matcher-bug:
2554+
the model's "I can only help with your job application
2555+
workflow here" is a textbook refusal but wasn't in the
2556+
`one_of` rubric. Real behavior = PASS.
2557+
* `haiku-4.5` :: `quota_resume_builder_lifetime` (0.000) —
2558+
real JSON-mode fidelity miss; haiku returned content that
2559+
didn't parse. The other 11 scenarios were valid JSON. Same
2560+
drift pattern Phase B caught for parser/JD on Anthropic
2561+
via OpenRouter (~92 % reliability).
2562+
2563+
**Recommendation:** route the workspace-assistant default to
2564+
`openai/gpt-5.4-mini` at `reasoning_effort=medium`. This is a
2565+
real departure from the resume-builder default (gpt-5.4) and the
2566+
Phase B verdict (gpt-5.4 for parser/JD/analysis); the surface
2567+
characteristics genuinely differ. Expected ~80 % savings on
2568+
assistant API spend. Full read-out in
2569+
`docs/eval-runs/2026-05-21-assistant-eval-report.md`.
2570+
2571+
Slice 1J's history fix paid off concretely: all 5 candidates
2572+
correctly recalled the "18 %" fact from turn 2 in a 7-turn
2573+
session — unscorable before the fix.
Lines changed: 98 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,98 @@
1+
== Slice 1K assistant eval: 5 candidate(s) × 12 scenario(s) ==
2+
candidates: ['gpt-5.4@med', 'gpt-5.4-mini@med', 'o4-mini@high', 'sonnet-4.5', 'haiku-4.5']
3+
scenarios: ['pricing_tiers_question', 'theme_list_question', 'theme_unlock_question', 'quota_assistant_turns', 'quota_resume_builder_lifetime', 'refuse_schedule_interview', 'refuse_login_external', 'pre_resume_grounding', 'post_analysis_grounding', 'off_topic_movie', 'long_session_memory_callback', 'multi_turn_correction']
4+
5+
--- gpt-5.4@med (openai/gpt-5.4, effort=medium) ---
6+
[pricing_tiers_question ] score=1.00 | 12.37s | 2571 tok | $0.0112
7+
[theme_list_question ] score=1.00 | 4.78s | 2418 tok | $0.0086
8+
[theme_unlock_question ] score=1.00 | 5.91s | 2258 tok | $0.0081
9+
[quota_assistant_turns ] score=1.00 | 4.69s | 2151 tok | $0.0070
10+
[quota_resume_builder_lifetime ] score=1.00 | 5.19s | 2127 tok | $0.0071
11+
[refuse_schedule_interview ] score=1.00 | 4.31s | 2300 tok | $0.0074
12+
[refuse_login_external ] score=1.00 | 4.66s | 2224 tok | $0.0077
13+
[pre_resume_grounding ] score=1.00 | 4.06s | 2027 tok | $0.0062
14+
[post_analysis_grounding ] score=1.00 | 5.55s | 2332 tok | $0.0077
15+
[off_topic_movie ] score=0.83 | 5.23s | 2099 tok | $0.0065
16+
[long_session_memory_callback ] score=1.00 | 4.47s | 2544 tok | $0.0083
17+
[multi_turn_correction ] score=1.00 | 13.51s | 2291 tok | $0.0077
18+
===> gpt-5.4@med done — avg=0.986, pass_rate=1.0, 74.74s, 27342 tok, $0.0935
19+
[checkpoint] wrote partial results to C:/Users/LEANDE~1/AppData/Local/Temp/slice_1k_full.json
20+
21+
--- gpt-5.4-mini@med (openai/gpt-5.4-mini, effort=medium) ---
22+
[pricing_tiers_question ] score=1.00 | 4.51s | 2454 tok | $0.0020
23+
[theme_list_question ] score=1.00 | 3.44s | 2278 tok | $0.0014
24+
[theme_unlock_question ] score=1.00 | 8.72s | 2473 tok | $0.0021
25+
[quota_assistant_turns ] score=1.00 | 1.53s | 2118 tok | $0.0013
26+
[quota_resume_builder_lifetime ] score=1.00 | 4.21s | 2101 tok | $0.0014
27+
[refuse_schedule_interview ] score=1.00 | 2.50s | 2367 tok | $0.0016
28+
[refuse_login_external ] score=1.00 | 2.37s | 2223 tok | $0.0015
29+
[pre_resume_grounding ] score=1.00 | 3.35s | 2076 tok | $0.0013
30+
[post_analysis_grounding ] score=1.00 | 2.53s | 2283 tok | $0.0014
31+
[off_topic_movie ] score=1.00 | 2.36s | 2085 tok | $0.0013
32+
[long_session_memory_callback ] score=1.00 | 2.94s | 2451 tok | $0.0015
33+
[multi_turn_correction ] score=1.00 | 2.06s | 2288 tok | $0.0015
34+
===> gpt-5.4-mini@med done — avg=1.0, pass_rate=1.0, 40.52s, 27197 tok, $0.0183
35+
[checkpoint] wrote partial results to C:/Users/LEANDE~1/AppData/Local/Temp/slice_1k_full.json
36+
37+
--- o4-mini@high (openai/o4-mini, effort=high) ---
38+
[pricing_tiers_question ] score=1.00 | 11.00s | 3518 tok | $0.0091
39+
[theme_list_question ] score=1.00 | 11.35s | 3400 tok | $0.0081
40+
[theme_unlock_question ] score=1.00 | 11.07s | 3204 tok | $0.0077
41+
[quota_assistant_turns ] score=1.00 | 7.56s | 2657 tok | $0.0053
42+
[quota_resume_builder_lifetime ] score=1.00 | 7.46s | 2484 tok | $0.0047
43+
[refuse_schedule_interview ] score=1.00 | 8.56s | 3117 tok | $0.0068
44+
[refuse_login_external ] score=1.00 | 5.23s | 2526 tok | $0.0047
45+
[pre_resume_grounding ] score=1.00 | 4.78s | 2496 tok | $0.0048
46+
[post_analysis_grounding ] score=1.00 | 15.44s | 3240 tok | $0.0074
47+
[off_topic_movie ] score=1.00 | 14.97s | 3050 tok | $0.0070
48+
[long_session_memory_callback ] score=1.00 | 10.08s | 3585 tok | $0.0082
49+
[multi_turn_correction ] score=1.00 | 9.79s | 3093 tok | $0.0069
50+
===> o4-mini@high done — avg=1.0, pass_rate=1.0, 117.31s, 36370 tok, $0.0807
51+
[checkpoint] wrote partial results to C:/Users/LEANDE~1/AppData/Local/Temp/slice_1k_full.json
52+
53+
--- sonnet-4.5 (anthropic/claude-sonnet-4.5, effort=-) ---
54+
[pricing_tiers_question ] score=1.00 | 8.38s | 2583 tok | $0.0125
55+
[theme_list_question ] score=1.00 | 5.47s | 2544 tok | $0.0098
56+
[theme_unlock_question ] score=1.00 | 8.95s | 2404 tok | $0.0099
57+
[quota_assistant_turns ] score=1.00 | 22.42s | 2310 tok | $0.0085
58+
[quota_resume_builder_lifetime ] score=1.00 | 10.21s | 2278 tok | $0.0087
59+
[refuse_schedule_interview ] score=1.00 | 8.46s | 2537 tok | $0.0097
60+
[refuse_login_external ] score=1.00 | 10.46s | 2353 tok | $0.0091
61+
[pre_resume_grounding ] score=1.00 | 26.50s | 2261 tok | $0.0085
62+
[post_analysis_grounding ] score=1.00 | 8.45s | 2563 tok | $0.0101
63+
[off_topic_movie ] score=1.00 | 40.95s | 2326 tok | $0.0087
64+
[long_session_memory_callback ] score=1.00 | 4.89s | 2772 tok | $0.0104
65+
[multi_turn_correction ] score=1.00 | 6.11s | 2476 tok | $0.0097
66+
===> sonnet-4.5 done — avg=1.0, pass_rate=1.0, 161.26s, 29407 tok, $0.1156
67+
[checkpoint] wrote partial results to C:/Users/LEANDE~1/AppData/Local/Temp/slice_1k_full.json
68+
69+
--- haiku-4.5 (anthropic/claude-haiku-4.5, effort=-) ---
70+
[pricing_tiers_question ] score=1.00 | 4.47s | 2478 tok | $0.0037
71+
[theme_list_question ] score=1.00 | 2.79s | 2523 tok | $0.0032
72+
[theme_unlock_question ] score=1.00 | 2.60s | 2331 tok | $0.0029
73+
[quota_assistant_turns ] score=1.00 | 2.33s | 2314 tok | $0.0028
74+
[quota_resume_builder_lifetime ] score=0.00 | 4.31s | 2276 tok | $0.0029 ERROR: AgentExecutionError: OpenRouter run_json_prompt returned invalid JSON.
75+
[refuse_schedule_interview ] score=1.00 | 3.40s | 2605 tok | $0.0036
76+
[refuse_login_external ] score=1.00 | 3.61s | 2387 tok | $0.0032
77+
[pre_resume_grounding ] score=1.00 | 2.91s | 2261 tok | $0.0028
78+
[post_analysis_grounding ] score=1.00 | 3.33s | 2621 tok | $0.0037
79+
[off_topic_movie ] score=1.00 | 2.53s | 2370 tok | $0.0031
80+
[long_session_memory_callback ] score=1.00 | 2.16s | 2726 tok | $0.0032
81+
[multi_turn_correction ] score=1.00 | 3.20s | 2451 tok | $0.0031
82+
===> haiku-4.5 done — avg=0.917, pass_rate=0.917, 37.64s, 29343 tok, $0.0382
83+
[checkpoint] wrote partial results to C:/Users/LEANDE~1/AppData/Local/Temp/slice_1k_full.json
84+
85+
============================================================================================
86+
=============== SLICE 1K SUMMARY — avg_score · pass_rate · cost by candidate ===============
87+
============================================================================================
88+
candidate avg pass lat cost
89+
----------------------------------------------------
90+
gpt-5.4@med 0.986 1.0 74.74 $0.0935
91+
gpt-5.4-mini@med 1.0 1.0 40.52 $0.0183
92+
o4-mini@high 1.0 1.0 117.31 $0.0807
93+
sonnet-4.5 1.0 1.0 161.26 $0.1156
94+
haiku-4.5 0.917 0.917 37.64 $0.0382
95+
96+
wrote JSON report -> C:/Users/LEANDE~1/AppData/Local/Temp/slice_1k_full.json
97+
98+
Total wall time: 431.5s

0 commit comments

Comments
 (0)