Commit b3f2ea1
feat(eval): Slice 1K — workspace-assistant 5-candidate eval + report
New runner `tests/quality/assistant_agentic_runner.py` mirrors the
Phase B pattern (incremental checkpoints + `tail -f`-friendly heartbeat)
but targets the assistant prompt surface directly via
`build_assistant_prompt` + `run_json_prompt`. Twelve scenarios across
four failure modes:
* Product-knowledge fluency (5): pricing tiers, theme inventory,
two-column gating, monthly assistant_turns, lifetime-vs-monthly
resume-builder quota
* Honest refusals (2): schedule interview, LinkedIn login
* Grounding discipline (3): off-topic movie, pre-resume "what
skills?", post-analysis "what's my fit score?"
* Multi-turn memory (2): 7-turn callback to a fact stated on
turn 2, mid-session correction (latest stated truth wins)
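A minimal sketch of the incremental-checkpoint + heartbeat pattern named above, not the actual runner code. `save_checkpoint`, `run_with_checkpoints`, and the `run_one` callback are hypothetical names; the real runner's structure and file layout may differ.

```python
import json
import os
import sys
import time
from pathlib import Path


def save_checkpoint(path: Path, results: dict) -> None:
    """Write results to a temp file, then atomically swap it into place."""
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(results, indent=2))
    os.replace(tmp, path)  # a crash mid-write never leaves a half-written file


def run_with_checkpoints(scenarios, run_one, path: Path, log=sys.stderr) -> dict:
    """Run each scenario once, skipping ids already present in the checkpoint."""
    results = json.loads(path.read_text()) if path.exists() else {}
    for i, scenario_id in enumerate(scenarios, start=1):
        if scenario_id in results:
            continue  # finished in an earlier (interrupted) run
        started = time.monotonic()
        results[scenario_id] = run_one(scenario_id)
        save_checkpoint(path, results)
        elapsed = time.monotonic() - started
        # One flushed line per scenario so `tail -f` shows progress immediately.
        print(f"[{i}/{len(scenarios)}] {scenario_id} done in {elapsed:.1f}s",
              file=log, flush=True)
    return results
```

Checkpointing after every scenario means an interrupted run can resume without repeating paid API calls.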
Substring-matcher rubric with the same normalisation pattern as
Slice 1H (smart-quote / em-dash collapse) so the matcher bugs that
plagued the comprehensive eval don't re-appear.
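The normalisation described above can be sketched roughly as follows. This is an illustration under assumptions, not the Slice 1H code: `normalise`, `matches_one_of`, and the exact character map are hypothetical.

```python
# Hypothetical character map: curly quotes become straight quotes,
# en/em dashes collapse to a plain hyphen.
_PUNCT_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
    "\u2013": "-", "\u2014": "-",
})


def normalise(text: str) -> str:
    """Lower-case, straighten quotes, collapse dashes and runs of whitespace."""
    return " ".join(text.translate(_PUNCT_MAP).lower().split())


def matches_one_of(reply: str, one_of: list[str]) -> bool:
    """True if any accepted phrase is a substring of the normalised reply."""
    haystack = normalise(reply)
    return any(normalise(needle) in haystack for needle in one_of)
```

Normalising both the reply and the rubric phrases keeps a model's typographic choices (for example a curly apostrophe in "can't") from registering as a miss.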
Candidate slate (user-approved after dropping Opus + substituting
o4-mini for the non-existent gpt-5.1-mini): gpt-5.4@medium,
gpt-5.4-mini@medium, o4-mini@high, sonnet-4.5, haiku-4.5. All five
routed through OpenRouter for transport-fair comparison (Slice 1H
proved the proxy overhead is ~0s).
Headline result (full data in
`docs/eval-runs/2026-05-21-assistant-eval-full.json`):
candidate        | avg   | pass  | wall   | cost
-----------------|-------|-------|--------|-------
gpt-5.4@med      | 0.986 | 1.000 | 74.7s  | $0.094
gpt-5.4-mini@med | 1.000 | 1.000 | 40.5s  | $0.018
o4-mini@high     | 1.000 | 1.000 | 117.3s | $0.081
sonnet-4.5       | 1.000 | 1.000 | 161.3s | $0.116
haiku-4.5        | 0.917 | 0.917 | 37.6s  | $0.038
**`gpt-5.4-mini@med` is the best overall pick:** perfect quality,
cheapest, and fastest of the four candidates that passed every
scenario (haiku-4.5 finished 2.9s sooner but failed one scenario).
The assistant surface is mostly retrieval-and-refuse (pulling facts
from the new product-knowledge block, declining off-topic asks,
recalling earlier turns), so heavy reasoning is wasted;
smart-but-cheap wins.
The two sub-1.0 scores re-classify cleanly: gpt-5.4@med's 0.833 on
off_topic_movie is a matcher bug (the model said "I can only help
with your job application workflow here", a perfect refusal that
wasn't in the rubric's `one_of` list); haiku-4.5's 0.000 on
quota_resume_builder_lifetime is a real JSON-mode fidelity miss
(invalid JSON returned; same ~92% drift Phase B caught on Anthropic
via OpenRouter for parser/JD).
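As a sanity check, the `avg` and `pass` columns can be reproduced from the per-scenario scores quoted above (eleven scenarios at 1.0 plus the one sub-1.0 score). The pass threshold of 0.8 is an assumption chosen to be consistent with 0.833 passing and 0.000 failing; the runner's real threshold is not stated here.

```python
PASS_THRESHOLD = 0.8  # assumption: not taken from the runner


def summarise(scores: list[float]) -> tuple[float, float]:
    """Return (mean score, fraction of scenarios at or above the threshold)."""
    avg = sum(scores) / len(scores)
    pass_rate = sum(s >= PASS_THRESHOLD for s in scores) / len(scores)
    return round(avg, 3), round(pass_rate, 3)


gpt_54 = [1.0] * 11 + [0.833]  # off_topic_movie scored 0.833
haiku = [1.0] * 11 + [0.0]     # quota_resume_builder_lifetime scored 0.000

print(summarise(gpt_54))  # (0.986, 1.0)
print(summarise(haiku))   # (0.917, 0.917)
```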
**Recommendation:** route the workspace-assistant default to
`openai/gpt-5.4-mini` at `reasoning_effort=medium`. This is a real
departure from the resume-builder default (gpt-5.4) and the Phase B
verdict (gpt-5.4 for parser/JD/analysis), because the surface
characteristics genuinely differ. Expected ~80% savings on assistant
API spend. Full read-out in
`docs/eval-runs/2026-05-21-assistant-eval-report.md`.
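The ~80% figure is consistent with the eval's own cost column if the baseline is gpt-5.4@med; that baseline is an assumption here, and production traffic may not match the 12-scenario mix. `savings_fraction` is a hypothetical helper.

```python
def savings_fraction(baseline_cost: float, candidate_cost: float) -> float:
    """Fraction of spend saved by switching from baseline to candidate."""
    return 1.0 - candidate_cost / baseline_cost


# Eval-run totals from the results table: gpt-5.4@med vs gpt-5.4-mini@med.
print(f"{savings_fraction(0.094, 0.018):.1%}")  # 80.9%
```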
DEVLOG Day 61 added covering Slice 1J + 1J' + 1J'' + 1K end-to-end.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 02e9853 commit b3f2ea1
5 files changed
Lines changed: 2303 additions & 0 deletions