
Commit aa4f12f

LEANDERANTONY and claude committed
feat(eval): Slice 1G — multi-provider agentic eval + silent-fallback bug #3
Pulls forward the Phase 2 parked "multi-provider agentic eval when ADR-028 D1 lands": the operator asked whether gpt-5.4@medium is actually optimal for the resume-builder workflow, or whether Sonnet 4.5 / another strong model would do better. The answer needs measurement, not vibes.

WHAT LANDED

- tests/quality/openrouter_eval_service.py
  OpenAIService duck-typed adapter that routes through OpenRouter's Chat-Completions endpoint. Implements run_tool_loop with the translation glue between Responses-API tool specs / function_call items and Chat-Completions tool spec / message.tool_calls / role:tool result message shapes. Plus run_json_prompt + run_structured_prompt for completeness.

- tests/backend/test_openrouter_eval_service.py: 22 hermetic tests
  Translation shapes, parallel tool calls in one iteration, iteration-cap exhaustion, executor exception capture (never raised across tool boundary), and the new markdown-fence parser (see bug below).

- tests/quality/resume_builder_agentic_runner.py: refactored to multi-candidate mode
  --candidates {all, openai, sonnet-4.5, gemini, kimi, glm, grok, deepseek, qwen} | comma-list. Auto-skips web_search scenarios for non-openai providers (the function-wrap uses an inner OpenAI Responses-API call; no Chat-Completions equivalent). Prints candidate x scenario PASS/FAIL matrix + JSON report carries per-candidate result lists. Candidate slate = report.md §4 (ADR-028 D1 blueprint, slug-corrected against the live OpenRouter catalogue) + anthropic/claude-sonnet-4.5 as the operator's explicit add.

- scripts/compare_multi_provider_eval.py: comparison helper
  Classifies each failure as regex_fallback / partial_fallback / model_behavior. Catches the difference between adapter bugs and genuine cross-provider capability gaps.

THE BUG (silent-fallback #3 of the session)

First-run results were suspicious: openai 10/10, but Sonnet 4.5 at the BOTTOM with 2/8, worse than even Kimi. The comparison script's classifier showed every single Sonnet failure was regex_fallback: the adapter raised AgentExecutionError on every turn, the resume-builder service caught it, the deterministic step-machine ran the turn instead. Every "Sonnet response" in v1 was actually a canned step-machine reply.

Root cause: Anthropic models through OpenRouter IGNORE response_format={"type":"json_object"} and wrap their JSON in markdown code fences:

    ```json
    { ... }
    ```

My adapter's bare json.loads() rejected the fences -> silent fallback. Other providers wrap intermittently (kimi, gemini, deepseek showed partial fallbacks); Sonnet wraps consistently.

This is the THIRD silent-fallback bug this session, same shape as the schema-400 (Slice 1C) and theme-registry drift (post-1C): two contracts that should match drift apart, downstream silently substitutes a "safe default", bug only surfaces when measurement forces the question.

THE FIX

_parse_provider_json() in the adapter:

1. Fast path: json.loads(content) for compliant providers
2. Strip markdown fences (```json...``` or ```...```) and retry
3. Last-ditch: extract first balanced {...} substring (string-literal aware so braces inside JSON strings don't throw the count off) and retry

Wired into both run_tool_loop and run_json_prompt. 9 new hermetic tests pin the parser behavior.

V2 RESULTS (post-fix)

    candidate     v1      v2     change
    openai       10/10   10/10   -
    sonnet-4.5    2/ 8    6/ 8   +4 (all 4 were regex_fallback)
    gemini        5/ 8    6/ 8   +1
    kimi          3/ 8    5/ 8   +2
    glm           6/ 8    6/ 8   -
    grok          6/ 8    6/ 8   -
    deepseek      5/ 8    6/ 8   +1
    qwen          5/ 8    5/ 8   -

Classifier on remaining v2 failures: 0 regex_fallback (eliminated), 5 partial_fallback, 11 model_behavior. The model_behavior failures are the real cross-provider signal.

Differentiating scenarios:

- github_url_fires_tool: openai/glm/grok call the tool reliably; others ask user a clarifying question first
- promise_tracking_remembers_deferred_publication: half the providers add to pending_followups[] reliably, half miss
- structured_payload_runs_after_generate: ONLY openai + grok pass. The 11K-char structuring prompt with worked examples stretches OpenRouter providers; OpenAI's native strict-schema mode handles it cleanly.

HEADLINE CONCLUSION

Sonnet 4.5 ties (does NOT beat) the strong OpenRouter providers at 6/8 on the cross-provider scenarios. No clean "switch to Sonnet" signal. The 25% gap to gpt-5.4 baseline concentrates in two specific behaviors. Recommendation: keep gpt-5.4@medium as the default; sonnet/gemini/glm/grok/deepseek all viable failover targets for non-PII workloads under ADR-028 D1's criteria.

ARTIFACTS PRESERVED

docs/eval-runs/2026-05-21-agentic-eval-v1-pre-fence-fix.json
docs/eval-runs/2026-05-21-agentic-eval-v2-post-fence-fix.json

Full per-candidate results with assistant_replies, tool_events, and findings preserved for future re-analysis or prompt-iteration delta measurement.

VERIFICATION: 224 hermetic tests across affected suites green. The v2 live eval against 8 providers (7 OpenRouter + 1 OpenAI baseline) completed cleanly. Three pact-tests now defend against the silent-fallback antipattern (schema strictness, theme registry, parse fence tolerance).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent a3e0cde commit aa4f12f

7 files changed

Lines changed: 5000 additions & 25 deletions

docs/DEVLOG.md

Lines changed: 216 additions & 0 deletions
@@ -2126,3 +2126,219 @@ What's parked for Phase 3: eval expansion to 15-20 fixtures, an
optional `pending_followups` UI panel, external web-search provider
integration (only if a quality gap surfaces), and a multi-provider
agentic eval when ADR-028 D1 lands.

## Day 60: Slice 1G - multi-provider agentic eval (and the THIRD silent-fallback bug)

Phase 2's parked "multi-provider agentic eval when ADR-028 D1 lands"
got pulled forward by the operator's question: *is gpt-5.4@medium
actually the right model for this surface, or would Sonnet 4.5 / one
of the other strong models do better?* Built the eval, ran it across
the planned candidate slate, found the answer.

### What landed

**`tests/quality/openrouter_eval_service.py`**: new adapter.
``OpenRouterEvalService`` duck-types ``OpenAIService`` (in particular
``run_tool_loop``) and routes through OpenRouter's
Chat-Completions endpoint. It sits next to the existing
``KimiEvalService`` (which only does plain JSON-prompt suites; the
agentic eval needs the tool-loop translation glue). It handles:

- Responses-API tool spec ``{"type":"function","name":...}`` to
  Chat-Completions ``{"type":"function","function":{...}}``
- OpenAI's ``function_call`` items in ``response.output`` to
  Chat-Completions ``message.tool_calls`` + role:"tool" results
- JSON parsing (see the bug section below).
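The shape translations in the list above can be sketched as follows. This is a minimal illustration of the two wire formats, not the adapter's actual code: the helper names (`to_chat_completions_tool`, `to_function_call_items`, `tool_result_message`) are hypothetical, and the sketch assumes plain dicts rather than SDK objects.

```python
from typing import Any


def to_chat_completions_tool(spec: dict[str, Any]) -> dict[str, Any]:
    """Nest a flat Responses-API function tool spec under a "function" key."""
    if spec.get("type") != "function":
        raise ValueError(f"unsupported tool type: {spec.get('type')!r}")
    # Everything except the top-level "type" moves under "function" unchanged.
    function = {key: value for key, value in spec.items() if key != "type"}
    return {"type": "function", "function": function}


def to_function_call_items(message: dict[str, Any]) -> list[dict[str, Any]]:
    """Turn Chat-Completions message.tool_calls into Responses-style items."""
    items = []
    for call in message.get("tool_calls") or []:
        items.append(
            {
                "type": "function_call",
                "call_id": call["id"],
                "name": call["function"]["name"],
                # Arguments stay a JSON-encoded string in both formats.
                "arguments": call["function"]["arguments"],
            }
        )
    return items


def tool_result_message(call_id: str, output: str) -> dict[str, Any]:
    """Build the role:"tool" message that carries one tool result back."""
    return {"role": "tool", "tool_call_id": call_id, "content": output}
```

A message with no `tool_calls` yields an empty list, which a tool loop can read as "the model produced its final answer".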

Tied off with 22 hermetic tests
(``tests/backend/test_openrouter_eval_service.py``): translation
shapes, parallel tool calls in one iteration, iteration-cap
exhaustion, executor exceptions captured as tool outputs (not raised
across the boundary), markdown-fence handling.
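For readers who want the "exceptions captured as tool outputs" behavior spelled out, here is a minimal sketch under assumptions: `run_tool_calls` and the `{"error": ...}` payload shape are hypothetical, and the executor is assumed to be a callable taking a tool name and a dict of arguments. The real adapter may structure this differently.

```python
import json
from typing import Any, Callable

# Hypothetical executor signature: (tool_name, arguments) -> JSON-serialisable result.
ToolExecutor = Callable[[str, dict[str, Any]], Any]


def run_tool_calls(
    tool_calls: list[dict[str, Any]], executor: ToolExecutor
) -> list[dict[str, Any]]:
    """Execute every tool call from one iteration; never raise to the caller."""
    messages = []
    for call in tool_calls:
        name = call["function"]["name"]
        try:
            arguments = json.loads(call["function"]["arguments"] or "{}")
            content = json.dumps(executor(name, arguments))
        except Exception as exc:  # deliberately broad: the model should see the failure
            content = json.dumps({"error": f"{type(exc).__name__}: {exc}"})
        # One role:"tool" message per call, matched back by tool_call_id.
        messages.append(
            {"role": "tool", "tool_call_id": call["id"], "content": content}
        )
    return messages
```

Handling all calls in one pass is what makes parallel tool calls in a single iteration work: each result is matched by `tool_call_id`, so one failing tool does not lose the results of the others.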

**`tests/quality/resume_builder_agentic_runner.py`**: refactored
to multi-candidate mode. The ``--candidates`` flag accepts ``all`` (=
openai baseline + every OpenRouter slug in ``_AGENTIC_CANDIDATES``)
or a comma-list. Auto-skips the 2 ``web_search`` scenarios for
non-openai providers (the function-wrap uses an inner OpenAI
Responses-API call to OpenAI's built-in search; no Chat-Completions
equivalent). Prints a candidate × scenario PASS/FAIL matrix +
per-candidate totals. JSON report carries the per-candidate result
list so a comparison script can diff runs.

Candidate slate matches ``report.md`` §4 (ADR-028 D1 blueprint),
slug-corrected against the live OpenRouter catalogue, plus
Anthropic Sonnet 4.5 as the operator's explicit add:

    sonnet-4.5 = anthropic/claude-sonnet-4.5
    gemini     = google/gemini-3.1-pro-preview
    kimi       = moonshotai/kimi-k2.6
    glm        = z-ai/glm-5.1
    grok       = x-ai/grok-4.20
    deepseek   = deepseek/deepseek-v4-pro
    qwen       = qwen/qwen3.6-max-preview
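A minimal sketch of how the ``--candidates`` value and the ``web_search`` skip could be resolved, using the slate above. The mapping contents come from this entry; the function names (`resolve_candidates`, `scenarios_for`) and the `uses_web_search` scenario field are assumptions for illustration, not the runner's real internals.

```python
from typing import Any

OPENAI_BASELINE = "openai"

# Short name -> OpenRouter slug, as listed in the slate above.
_AGENTIC_CANDIDATES = {
    "sonnet-4.5": "anthropic/claude-sonnet-4.5",
    "gemini": "google/gemini-3.1-pro-preview",
    "kimi": "moonshotai/kimi-k2.6",
    "glm": "z-ai/glm-5.1",
    "grok": "x-ai/grok-4.20",
    "deepseek": "deepseek/deepseek-v4-pro",
    "qwen": "qwen/qwen3.6-max-preview",
}


def resolve_candidates(value: str) -> list[str]:
    """Expand "all" or a comma-list into candidate names, failing loudly on typos."""
    if value.strip() == "all":
        return [OPENAI_BASELINE, *_AGENTIC_CANDIDATES]
    names = [part.strip() for part in value.split(",") if part.strip()]
    unknown = [
        name
        for name in names
        if name != OPENAI_BASELINE and name not in _AGENTIC_CANDIDATES
    ]
    if unknown:
        # Raising here avoids silently evaluating a default candidate instead.
        raise ValueError(f"unknown candidates: {', '.join(unknown)}")
    return names


def scenarios_for(candidate: str, scenarios: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Drop web_search scenarios for every candidate except the openai baseline."""
    if candidate == OPENAI_BASELINE:
        return list(scenarios)
    # "uses_web_search" is a hypothetical marker field on each scenario.
    return [s for s in scenarios if not s.get("uses_web_search")]
```

Rejecting unknown names instead of ignoring them is in keeping with the lesson of this entry: a typo in a candidate list should stop the run, not shrink it quietly.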

**`scripts/compare_multi_provider_eval.py`**: comparison helper.
Loads two eval JSON reports + classifies each remaining failure as
``regex_fallback`` (every assistant_reply is a canonical step-machine
message AND no tool_events: the adapter raised on every turn, the
service caught it, the deterministic intake ran), ``partial_fallback``
(some turns succeeded, some fell back: a sporadic parse issue), or
``model_behavior`` (the model ran cleanly and the behavior didn't
match: REAL signal). Catches the difference between adapter bugs
and genuine cross-provider capability gaps.
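The three-way classification described above can be sketched like this. The `assistant_replies` and `tool_events` field names come from the report format mentioned in this entry; the function name, the exact-match test against a set of canned replies, and the handling of edge cases are assumptions, not the script's real logic.

```python
from typing import Any


def classify_failure(result: dict[str, Any], canned_replies: set[str]) -> str:
    """Label one failed scenario result by how much of it was step-machine output."""
    replies = result.get("assistant_replies") or []
    tool_events = result.get("tool_events") or []
    # Assumption: a fallback turn is recognisable by an exact canned message.
    canned = [reply in canned_replies for reply in replies]

    if replies and all(canned) and not tool_events:
        return "regex_fallback"  # the model never actually answered
    if any(canned):
        return "partial_fallback"  # some turns were the model, some were not
    return "model_behavior"  # the model ran every turn and still failed
```

Only results labelled `model_behavior` say anything about the provider; the other two labels point back at the adapter.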

### THE BUG (silent-fallback #3)

The first run (v1, all 8 candidates × 8 cross-provider scenarios;
openai also ran the 2 ``web_search`` scenarios, hence its 10)
came back with:

    openai 10/10 · **sonnet-4.5 2/8** · gemini 5/8 · kimi 3/8 ·
    glm 6/8 · grok 6/8 · deepseek 5/8 · qwen 5/8

Sonnet at the BOTTOM was suspicious. The comparison script's
classifier found why: **every single sonnet failure was
``regex_fallback``**. On every turn, the OpenRouter adapter raised
``AgentExecutionError("returned invalid JSON")``, and the resume-builder
service caught it and ran the deterministic step-machine.

Root cause: Anthropic models through OpenRouter ignore the
``response_format={"type":"json_object"}`` hint and wrap their JSON
output in markdown code fences:

```json
{
  "draft_updates": {...},
  "assistant_message": "...",
  ...
}
```

Anthropic's own API doesn't have a native JSON-mode constraint, so
the OpenRouter shim's prompt-coerced "respond in JSON" reads to
Claude as "format the JSON nicely". My adapter's bare
``json.loads(content)`` rejected the fences → silent fallback.

Other providers wrap intermittently: kimi, gemini, deepseek showed
partial fallbacks; glm/grok/qwen mostly emit bare JSON. Sonnet 4.5
wraps consistently.

Third silent-fallback bug of the session, same pattern as the
schema-400 bug (Slice 1C) and the theme-registry drift bug (between
Slices 1C and 1D). Each lived in production for weeks because the
fallback path was "good enough" to mask the failure.

### THE FIX

``_parse_provider_json(content)`` in the OpenRouter adapter:

1. Fast path: bare JSON
2. Strip markdown fences (`` ```json ... ``` `` or `` ``` ... ``` ``)
   and retry
3. Last-ditch: extract the first balanced ``{...}`` substring (with
   string-literal awareness so braces inside JSON strings don't
   throw the count off) and retry

Wired into both ``run_tool_loop`` and ``run_json_prompt`` so the same
fix covers both code paths. 9 new hermetic tests pin the parser
behavior down: bare JSON, `` ```json `` fence, `` ``` `` fence, `` ```JSON ``
(uppercase tag), JSON wrapped in prose, balanced-brace extraction
with embedded `{`/`}` inside strings, empty input, unparseable
input, end-to-end loop with a fenced response.
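The three steps above can be sketched as follows. This is an illustrative reimplementation, not the adapter's ``_parse_provider_json``: the names `parse_provider_json_sketch` and `_first_balanced_object` are hypothetical, the fence regex is one plausible choice, and it raises a plain `ValueError` where the real adapter raises ``AgentExecutionError``.

```python
import json
import re
from typing import Any, Optional

# Opening fence with an optional language tag, then the lazily matched body.
_FENCE_RE = re.compile(r"```[A-Za-z0-9_-]*\s*(.*?)```", re.DOTALL)


def _first_balanced_object(text: str) -> Optional[str]:
    """Return the first balanced {...} substring, ignoring braces inside strings."""
    start = text.find("{")
    if start == -1:
        return None
    depth = 0
    in_string = False
    escaped = False
    for index in range(start, len(text)):
        char = text[index]
        if in_string:
            if escaped:
                escaped = False  # this character is escaped, skip it
            elif char == "\\":
                escaped = True
            elif char == '"':
                in_string = False
        elif char == '"':
            in_string = True
        elif char == "{":
            depth += 1
        elif char == "}":
            depth -= 1
            if depth == 0:
                return text[start : index + 1]
    return None  # braces never balanced


def parse_provider_json_sketch(content: str) -> dict[str, Any]:
    """Try bare JSON, then fenced JSON, then the first balanced object."""
    candidates = [content.strip()]
    fenced = _FENCE_RE.search(content)
    if fenced:
        candidates.append(fenced.group(1).strip())
    balanced = _first_balanced_object(content)
    if balanced is not None:
        candidates.append(balanced)

    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    # Fail loudly: the caller decides what to do, nothing is substituted here.
    raise ValueError("provider returned invalid JSON")
```

One known limit of the sketch: if prose before the payload contains its own `{...}` pair, step 3 picks that first pair and the parse fails rather than scanning on to the next object.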

### V2 RESULTS (post-fence-fix)

    candidate     v1      v2     change
    openai       10/10   10/10   -
    sonnet-4.5    2/ 8    6/ 8   +4  ← all 4 of those were regex_fallback
    gemini        5/ 8    6/ 8   +1
    kimi          3/ 8    5/ 8   +2
    glm           6/ 8    6/ 8   -
    grok          6/ 8    6/ 8   -
    deepseek      5/ 8    6/ 8   +1
    qwen          5/ 8    5/ 8   -

The classifier on v2 remaining failures:

    regex_fallback   : 0   (the fix eliminated this class entirely)
    partial_fallback : 5   (occasional adapter parse hiccups)
    model_behavior   : 11  (the real cross-provider signal)

### What the model-behavior failures actually tell us

Five scenarios pass for every provider, with one exception noted:
- honesty_on_linkedin_scrape
- proactive_offer_after_enough_signal
- proactive_offer_silent_mid_basics
- multi_turn_correction_preserved
- non_github_url_no_fetch (except grok, which over-eagerly fired
  fetch_github_readme on a non-github URL)

The differentiators are three specific behaviors:

- **`github_url_fires_tool`**: only openai, glm, grok call the
  tool reliably when given a github.com URL. Sonnet / gemini /
  kimi / deepseek / qwen sometimes ask the user a clarifying
  question before committing. **Tool-use discipline
  differs by provider.** Not a "wrong" behavior, just a different style.
- **`promise_tracking_remembers_deferred_publication`**: half
  the providers (openai, sonnet, glm, deepseek) add the deferred
  publication to ``pending_followups[]`` reliably; gemini, kimi,
  grok, qwen miss it. **Multi-turn memory discipline differs.**
- **`structured_payload_runs_after_generate`**: only openai and
  grok pass. The structuring LLM call (a SEPARATE path from the
  agentic loop; it fires only when generate_resume_builder_resume is
  invoked) uses an ~11K-char prompt with worked BEFORE/AFTER
  examples. Most OpenRouter providers drop a field or emit
  malformed JSON. **Heavy structured-output prompts are where the OpenAI
  Responses-API strict-schema mode shows its value most clearly.**

### Headline conclusion (for the operator question)

**Sonnet 4.5 ties (does not beat) the strong OpenRouter providers
at 6/8 on the cross-provider scenarios.** No clean "switch to
Sonnet" signal on this workload. The 25-percentage-point gap to the
gpt-5.4 baseline (6/8 vs 8/8) concentrates in two specific behaviors (github tool
firing + structured-output reliability under heavy prompts).

**Recommendation: keep gpt-5.4@medium as the default agent.**
Sonnet, gemini, glm, grok, deepseek are all viable failover targets
for non-PII workloads under ADR-028 D1's criteria. They cluster
tightly enough that the choice between them on capability is a
wash; pick on cost / EU posture / outage diversification instead.

### Artifacts preserved

`docs/eval-runs/2026-05-21-agentic-eval-v1-pre-fence-fix.json` and
`docs/eval-runs/2026-05-21-agentic-eval-v2-post-fence-fix.json`
are both raw eval reports, indexed by candidate, with full
assistant_replies, tool_events, and findings preserved for future
re-analysis. Re-running the eval after a prompt change is one
command: `uv run python tests/quality/resume_builder_agentic_runner.py
--candidates all --json out.json`.

### Lesson for future readers

This is the **third silent-fallback bug** found this session. All
three followed the same shape: two configs/contracts that should
have matched drifted apart, downstream code silently fell back to a
"safe default" instead of erroring loudly, and the bug only
surfaced when a measurement (eval / replay / md5 comparison) forced
the question. The pattern is so consistent it's worth naming:
**the silent-fallback antipattern**.

Three pact-tests now defend the architecture against this class:

1. `test_llm_schema_strictness`: every Pydantic schema wired to
   `run_structured_prompt` must produce a JSON Schema with no
   `dict[K, V]` patterns and no multi-branch unions (the
   schema-400 trap)
2. `test_resume_themes_registry_matches_supported_themes`: the
   RESUME_THEMES gate must list every theme in SUPPORTED_THEMES
   (the registry-drift trap)
3. `test_parse_provider_json_*`: the OpenRouter adapter parser
   must tolerate markdown-fenced JSON (the provider-quirk trap)
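To make the first pact-test concrete, here is a rough sketch of the kind of check it could run over a generated JSON Schema. Everything in it is an assumption about how such a check might look: the function name `strictness_violations`, the reading of a `dict[K, V]` pattern as a schema-valued `additionalProperties`, and the reading of "multi-branch union" as an `anyOf`/`oneOf` with more than one non-null branch. The real test may apply different rules.

```python
from typing import Any, Iterator


def strictness_violations(schema: Any, path: str = "$") -> Iterator[str]:
    """Yield one message per construct that the strictness rules would reject."""
    if isinstance(schema, dict):
        # A schema-valued additionalProperties is how an open mapping shows up.
        if isinstance(schema.get("additionalProperties"), dict):
            yield f"{path}: open-ended mapping (dict[K, V] pattern)"
        for keyword in ("anyOf", "oneOf"):
            branches = schema.get(keyword)
            if isinstance(branches, list):
                # Assumption: "X or null" (an optional field) is still acceptable.
                non_null = [
                    branch
                    for branch in branches
                    if not (isinstance(branch, dict) and branch.get("type") == "null")
                ]
                if len(non_null) > 1:
                    yield f"{path}: multi-branch union via {keyword}"
        # Recurse into every nested value (properties, items, $defs, ...).
        for key, value in schema.items():
            yield from strictness_violations(value, f"{path}.{key}")
    elif isinstance(schema, list):
        for index, item in enumerate(schema):
            yield from strictness_violations(item, f"{path}[{index}]")
```

A pact-test built on this would assert that `list(strictness_violations(schema))` is empty for every schema passed to `run_structured_prompt`, so a new offending field fails in CI instead of at the provider.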

If a fourth silent-fallback bug surfaces, the right move is to
generalise these into a shared "bug-class regression" pattern in
the test suite. For now, three is enough to make the lesson sticky
without over-engineering the abstraction.
