LEANDERANTONY
diff --git a/‎docs/DEVLOG.md‎
Lines changed: 152 additions & 0 deletions b/‎docs/DEVLOG.md‎
Lines changed: 152 additions & 0 deletions
diff --git a/‎docs/eval-runs/2026-05-21-assistant-eval-full-log.txt‎
Lines changed: 98 additions & 0 deletions b/‎docs/eval-runs/2026-05-21-assistant-eval-full-log.txt‎
Lines changed: 98 additions & 0 deletions
@@ -2419,3 +2419,155 @@ smart clarifying question" (FAIL but BETTER) from "hallucinated
 capability" (FAIL and WORSE). A v2 rubric with LLM-as-judge
 1-5 quality scoring per scenario would catch this honestly.
 Parked.
+
+
+## Day 61: Workspace assistant — history fix, product-knowledge block, adapter reasoning_effort, and Slice 1K eval
+
+Today extended the agentic-upgrade work to the OTHER chat surface
+in the app: the workspace assistant (`src/assistant_service.py`,
+prompts at `prompts/assistant/v1.json` and
+`prompts/assistant_text/v1.json`). The audit found the SAME
+history-truncation bug the resume builder had at the start of
+Slice 1B — except worse: assistant code was at `history[-4:]`,
+whereas the resume builder had been at `history[-12:]`. Four
+turns is not enough for any real conversation.
+
+### Slice 1J: drop the `history[-4:]` slice on the assistant prompts
+
+New constant `ASSISTANT_HISTORY_CHAR_BUDGET = 18000` (smaller than
+the resume builder's 30 k because the assistant carries a heavier
+`assistant_context` payload alongside — workspace_state +
+workflow_context + the WORKSPACE STATE guidance rules). Both
+`build_assistant_prompt` and `build_assistant_text_prompt` now
+call `_slice_history_for_budget(history, max_chars=...)` instead
+of the hard suffix slice. Same drop-oldest-first semantics as the
+resume builder, with the most-recent turn guaranteed retained.
+
+Why this matters concretely: the Slice 1K `long_session_memory_callback`
+scenario is a 7-turn conversation where the user states "we cut
+chargeback fraud by 18% using XGBoost" on turn 2, then on turn 6
+asks "what number did I tell you?". With `history[-4:]` the model
+sees turns 3-6 ONLY — the 18% fact has scrolled off and the
+question is literally unanswerable. After the fix, all 5 candidates
+in the Slice 1K eval correctly recalled "18%".
+
+### Slice 1J': `_PRODUCT_KNOWLEDGE_BLOCK` — stop sounding ignorant
+
+The WORKSPACE STATE block teaches the assistant how to READ live
+runtime state but said nothing about pricing, themes, the agentic
+pipeline, or the assistant's own limits. When a user asked "what
+tiers do you have?" / "what themes can I use?" / "can you book
+me an interview?", the answers ranged from "I don't have that
+info" to outright fabrications.
+
+Added a new module-level constant in `src/prompts.py` that's also
+pre-baked into both registry JSONs (Pattern A per the prompt-
+registry migration notes). It backstops all of these questions
+with authoritative numbers pulled from `backend/tiers.py`
+(TIER_CAPS), `src/resume_builder.py` (RESUME_THEMES),
+`backend/tiers.py` (FREE_EXPORT_FORMAT/THEME), and `src/agents/*`
+(the orchestrator chain):
+
+  * **Tier caps**: Free / Pro / Business — tailored applications
+    (3 / 20 / 80), assistant turns (20 / 150 / 500), resume parses
+    (3 / 25 / 100), saved jobs (5 / 1000 / unlimited), saved
+    workspaces (1 / 5 / unlimited), retention (7 days / 30 days /
+    unbounded).
+  * **Lifetime gotcha**: resume_builder_sessions on Free is
+    LIFETIME (never resets), monthly on Pro (3) / Business (15).
+    The block calls this out verbatim so the model stops
+    answering "resets monthly" by reflex.
+  * **Theme inventory**: six themes (classic_ats,
+    professional_neutral, modern_blue, creative_warm,
+    architect_mono, presentation_twocol); first five are
+    single-column ATS-safe; presentation_twocol is gated +
+    non-ATS.
+  * **Export entitlement**: Free = PDF + professional_neutral
+    only; Pro/Business = PDF or DOCX + any theme.
+  * **Agentic chain**: tailoring → review → resume gen → cover
+    letter, with conservative-correction posture in the review
+    pass.
+  * **Honest cannot-do list**: schedule interviews, send emails,
+    log in to LinkedIn / Indeed, scrape arbitrary URLs, edit the
+    resume file directly, change subscription tier, remember
+    across sessions when signed out.
+
+If any of these source-of-truth numbers ever drift, the
+byte-mirror tests
+(`test_assistant_prompt_matches_pre_migration_system_byte_for_byte`)
+fail on the next CI run.
+
+### Slice 1J'': thread `reasoning_effort` through both eval adapters
+
+`OpenRouterEvalService.run_tool_loop`, `run_json_prompt`, and
+`run_structured_prompt` — plus the symmetric `KimiEvalService._chat`
++ entry points — now forward the `reasoning_effort` kwarg to
+`chat.completions.create`. Conditional (only when truthy) because
+non-reasoning slugs (Sonnet, Haiku, DeepSeek v4) 400 if it's set.
+
+Added pricing entries for the new Slice 1K candidates:
+`openai/o4-mini` ($1.10 / $4.40 per Mtok — substituted for the
+non-existent `openai/gpt-5.1-mini`) and `anthropic/claude-haiku-4.5`
+($1.00 / $5.00 per Mtok).
+
+Bonus bugfix from the Slice 1K smoke run: the
+`OpenRouterEvalService.run_json_prompt` path was never
+accumulating `response.usage` into the snapshot — only
+`run_tool_loop` was. The smoke at first reported $0.0000 for
+every call. Mirrored the accumulator into the single-shot path so
+the assistant / parser / structuring suites all surface accurate
+per-call cost.
+
+### Slice 1K: 5-candidate assistant eval (12 scenarios)
+
+New runner `tests/quality/assistant_agentic_runner.py` — mirrors
+the Phase B incremental-checkpoint + heartbeat pattern but
+targets the assistant prompt surface directly via
+`build_assistant_prompt` + `run_json_prompt`. Twelve scenarios
+across product-knowledge fluency, honest refusals, grounding
+discipline, and multi-turn memory. Substring-matcher rubric with
+the same normalisation (smart quotes / em-dashes) as Slice 1H.
+
+Candidate slate (user-approved after dropping Opus + substituting
+o4-mini for the non-existent gpt-5.1-mini): gpt-5.4@med,
+gpt-5.4-mini@med, o4-mini@high, sonnet-4.5, haiku-4.5.
+
+Headline result (full per-candidate × per-scenario data in
+`docs/eval-runs/2026-05-21-assistant-eval-full.json`):
+
+  | candidate         | avg   | pass  | wall    | cost   |
+  | gpt-5.4@med       | 0.986 | 1.000 | 74.7s   | $0.094 |
+  | gpt-5.4-mini@med  | 1.000 | 1.000 | 40.5s   | $0.018 |
+  | o4-mini@high      | 1.000 | 1.000 | 117.3s  | $0.081 |
+  | sonnet-4.5        | 1.000 | 1.000 | 161.3s  | $0.116 |
+  | haiku-4.5         | 0.917 | 0.917 | 37.6s   | $0.038 |
+
+**Surprise:** `gpt-5.4-mini@med` is the winner on all three axes
+— quality 1.000, fastest at 40 s, cheapest at $0.018 (1/5 the
+cost of gpt-5.4@med which scored 0.986). The assistant surface
+is mostly retrieval-and-refuse: pulling facts from the new
+product-knowledge block, declining off-topic asks, recalling
+earlier turns. Heavy reasoning is wasted; smart-but-cheap wins.
+
+The two sub-1.0 scores re-classify cleanly:
+  * `gpt-5.4@med` :: `off_topic_movie` (0.833) — matcher-bug:
+    the model's "I can only help with your job application
+    workflow here" is a textbook refusal but wasn't in the
+    `one_of` rubric. Real behavior = PASS.
+  * `haiku-4.5` :: `quota_resume_builder_lifetime` (0.000) —
+    real JSON-mode fidelity miss; haiku returned content that
+    didn't parse. The other 11 scenarios were valid JSON. Same
+    drift pattern Phase B caught for parser/JD on Anthropic
+    via OpenRouter (~92 % reliability).
+
+**Recommendation:** route the workspace-assistant default to
+`openai/gpt-5.4-mini` at `reasoning_effort=medium`. This is a
+real departure from the resume-builder default (gpt-5.4) and the
+Phase B verdict (gpt-5.4 for parser/JD/analysis); the surface
+characteristics genuinely differ. Expected ~80 % savings on
+assistant API spend. Full read-out in
+`docs/eval-runs/2026-05-21-assistant-eval-report.md`.
+
+Slice 1J's history fix paid off concretely: all 5 candidates
+correctly recalled the "18 %" fact from turn 2 in a 7-turn
+session — unscorable before the fix.
@@ -0,0 +1,98 @@
+== Slice 1K assistant eval: 5 candidate(s) × 12 scenario(s) ==
+   candidates: ['gpt-5.4@med', 'gpt-5.4-mini@med', 'o4-mini@high', 'sonnet-4.5', 'haiku-4.5']
+   scenarios:  ['pricing_tiers_question', 'theme_list_question', 'theme_unlock_question', 'quota_assistant_turns', 'quota_resume_builder_lifetime', 'refuse_schedule_interview', 'refuse_login_external', 'pre_resume_grounding', 'post_analysis_grounding', 'off_topic_movie', 'long_session_memory_callback', 'multi_turn_correction']
+
+--- gpt-5.4@med (openai/gpt-5.4, effort=medium) ---
+  [pricing_tiers_question          ] score=1.00 |  12.37s |   2571 tok | $0.0112
+  [theme_list_question             ] score=1.00 |   4.78s |   2418 tok | $0.0086
+  [theme_unlock_question           ] score=1.00 |   5.91s |   2258 tok | $0.0081
+  [quota_assistant_turns           ] score=1.00 |   4.69s |   2151 tok | $0.0070
+  [quota_resume_builder_lifetime   ] score=1.00 |   5.19s |   2127 tok | $0.0071
+  [refuse_schedule_interview       ] score=1.00 |   4.31s |   2300 tok | $0.0074
+  [refuse_login_external           ] score=1.00 |   4.66s |   2224 tok | $0.0077
+  [pre_resume_grounding            ] score=1.00 |   4.06s |   2027 tok | $0.0062
+  [post_analysis_grounding         ] score=1.00 |   5.55s |   2332 tok | $0.0077
+  [off_topic_movie                 ] score=0.83 |   5.23s |   2099 tok | $0.0065
+  [long_session_memory_callback    ] score=1.00 |   4.47s |   2544 tok | $0.0083
+  [multi_turn_correction           ] score=1.00 |  13.51s |   2291 tok | $0.0077
+  ===> gpt-5.4@med done — avg=0.986, pass_rate=1.0, 74.74s, 27342 tok, $0.0935
+  [checkpoint] wrote partial results to C:/Users/LEANDE~1/AppData/Local/Temp/slice_1k_full.json
+
+--- gpt-5.4-mini@med (openai/gpt-5.4-mini, effort=medium) ---
+  [pricing_tiers_question          ] score=1.00 |   4.51s |   2454 tok | $0.0020
+  [theme_list_question             ] score=1.00 |   3.44s |   2278 tok | $0.0014
+  [theme_unlock_question           ] score=1.00 |   8.72s |   2473 tok | $0.0021
+  [quota_assistant_turns           ] score=1.00 |   1.53s |   2118 tok | $0.0013
+  [quota_resume_builder_lifetime   ] score=1.00 |   4.21s |   2101 tok | $0.0014
+  [refuse_schedule_interview       ] score=1.00 |   2.50s |   2367 tok | $0.0016
+  [refuse_login_external           ] score=1.00 |   2.37s |   2223 tok | $0.0015
+  [pre_resume_grounding            ] score=1.00 |   3.35s |   2076 tok | $0.0013
+  [post_analysis_grounding         ] score=1.00 |   2.53s |   2283 tok | $0.0014
+  [off_topic_movie                 ] score=1.00 |   2.36s |   2085 tok | $0.0013
+  [long_session_memory_callback    ] score=1.00 |   2.94s |   2451 tok | $0.0015
+  [multi_turn_correction           ] score=1.00 |   2.06s |   2288 tok | $0.0015
+  ===> gpt-5.4-mini@med done — avg=1.0, pass_rate=1.0, 40.52s, 27197 tok, $0.0183
+  [checkpoint] wrote partial results to C:/Users/LEANDE~1/AppData/Local/Temp/slice_1k_full.json
+
+--- o4-mini@high (openai/o4-mini, effort=high) ---
+  [pricing_tiers_question          ] score=1.00 |  11.00s |   3518 tok | $0.0091
+  [theme_list_question             ] score=1.00 |  11.35s |   3400 tok | $0.0081
+  [theme_unlock_question           ] score=1.00 |  11.07s |   3204 tok | $0.0077
+  [quota_assistant_turns           ] score=1.00 |   7.56s |   2657 tok | $0.0053
+  [quota_resume_builder_lifetime   ] score=1.00 |   7.46s |   2484 tok | $0.0047
+  [refuse_schedule_interview       ] score=1.00 |   8.56s |   3117 tok | $0.0068
+  [refuse_login_external           ] score=1.00 |   5.23s |   2526 tok | $0.0047
+  [pre_resume_grounding            ] score=1.00 |   4.78s |   2496 tok | $0.0048
+  [post_analysis_grounding         ] score=1.00 |  15.44s |   3240 tok | $0.0074
+  [off_topic_movie                 ] score=1.00 |  14.97s |   3050 tok | $0.0070
+  [long_session_memory_callback    ] score=1.00 |  10.08s |   3585 tok | $0.0082
+  [multi_turn_correction           ] score=1.00 |   9.79s |   3093 tok | $0.0069
+  ===> o4-mini@high done — avg=1.0, pass_rate=1.0, 117.31s, 36370 tok, $0.0807
+  [checkpoint] wrote partial results to C:/Users/LEANDE~1/AppData/Local/Temp/slice_1k_full.json
+
+--- sonnet-4.5 (anthropic/claude-sonnet-4.5, effort=-) ---
+  [pricing_tiers_question          ] score=1.00 |   8.38s |   2583 tok | $0.0125
+  [theme_list_question             ] score=1.00 |   5.47s |   2544 tok | $0.0098
+  [theme_unlock_question           ] score=1.00 |   8.95s |   2404 tok | $0.0099
+  [quota_assistant_turns           ] score=1.00 |  22.42s |   2310 tok | $0.0085
+  [quota_resume_builder_lifetime   ] score=1.00 |  10.21s |   2278 tok | $0.0087
+  [refuse_schedule_interview       ] score=1.00 |   8.46s |   2537 tok | $0.0097
+  [refuse_login_external           ] score=1.00 |  10.46s |   2353 tok | $0.0091
+  [pre_resume_grounding            ] score=1.00 |  26.50s |   2261 tok | $0.0085
+  [post_analysis_grounding         ] score=1.00 |   8.45s |   2563 tok | $0.0101
+  [off_topic_movie                 ] score=1.00 |  40.95s |   2326 tok | $0.0087
+  [long_session_memory_callback    ] score=1.00 |   4.89s |   2772 tok | $0.0104
+  [multi_turn_correction           ] score=1.00 |   6.11s |   2476 tok | $0.0097
+  ===> sonnet-4.5 done — avg=1.0, pass_rate=1.0, 161.26s, 29407 tok, $0.1156
+  [checkpoint] wrote partial results to C:/Users/LEANDE~1/AppData/Local/Temp/slice_1k_full.json
+
+--- haiku-4.5 (anthropic/claude-haiku-4.5, effort=-) ---
+  [pricing_tiers_question          ] score=1.00 |   4.47s |   2478 tok | $0.0037
+  [theme_list_question             ] score=1.00 |   2.79s |   2523 tok | $0.0032
+  [theme_unlock_question           ] score=1.00 |   2.60s |   2331 tok | $0.0029
+  [quota_assistant_turns           ] score=1.00 |   2.33s |   2314 tok | $0.0028
+  [quota_resume_builder_lifetime   ] score=0.00 |   4.31s |   2276 tok | $0.0029  ERROR: AgentExecutionError: OpenRouter run_json_prompt returned invalid JSON.
+  [refuse_schedule_interview       ] score=1.00 |   3.40s |   2605 tok | $0.0036
+  [refuse_login_external           ] score=1.00 |   3.61s |   2387 tok | $0.0032
+  [pre_resume_grounding            ] score=1.00 |   2.91s |   2261 tok | $0.0028
+  [post_analysis_grounding         ] score=1.00 |   3.33s |   2621 tok | $0.0037
+  [off_topic_movie                 ] score=1.00 |   2.53s |   2370 tok | $0.0031
+  [long_session_memory_callback    ] score=1.00 |   2.16s |   2726 tok | $0.0032
+  [multi_turn_correction           ] score=1.00 |   3.20s |   2451 tok | $0.0031
+  ===> haiku-4.5 done — avg=0.917, pass_rate=0.917, 37.64s, 29343 tok, $0.0382
+  [checkpoint] wrote partial results to C:/Users/LEANDE~1/AppData/Local/Temp/slice_1k_full.json
+
+============================================================================================
+=============== SLICE 1K SUMMARY — avg_score · pass_rate · cost by candidate ===============
+============================================================================================
+candidate             avg     pass    lat       cost
+----------------------------------------------------
+gpt-5.4@med           0.986   1.0     74.74     $0.0935
+gpt-5.4-mini@med      1.0     1.0     40.52     $0.0183
+o4-mini@high          1.0     1.0     117.31    $0.0807
+sonnet-4.5            1.0     1.0     161.26    $0.1156
+haiku-4.5             0.917   0.917   37.64     $0.0382
+
+wrote JSON report -> C:/Users/LEANDE~1/AppData/Local/Temp/slice_1k_full.json
+
+Total wall time: 431.5s