Commit aa4f12f
feat(eval): Slice 1G — multi-provider agentic eval + silent-fallback bug #3
Pulls forward the parked Phase 2 item "multi-provider agentic eval when
ADR-028 D1 lands". The operator asked whether gpt-5.4@medium is actually
optimal for the resume-builder workflow, or whether Sonnet 4.5 or another
strong model would do better. The answer needs measurement, not vibes.
WHAT LANDED
- tests/quality/openrouter_eval_service.py
OpenAIService duck-typed adapter that routes through OpenRouter's
Chat-Completions endpoint. Implements run_tool_loop with the
translation glue between Responses-API tool specs / function_call
items and Chat-Completions tool spec / message.tool_calls / role:tool
result message shapes. Plus run_json_prompt + run_structured_prompt
for completeness.
- tests/backend/test_openrouter_eval_service.py — 22 hermetic tests
Translation shapes, parallel tool calls in one iteration, iteration-
cap exhaustion, executor exception capture (never raised across
tool boundary), and the new markdown-fence parser (see bug below).
- tests/quality/resume_builder_agentic_runner.py — refactored to
multi-candidate mode. --candidates {all, openai, sonnet-4.5, gemini,
kimi, glm, grok, deepseek, qwen} | comma-list. Auto-skips
web_search scenarios for non-openai providers (the function-wrap
uses an inner OpenAI Responses-API call; no Chat-Completions
equivalent). Prints a candidate x scenario PASS/FAIL matrix; the JSON
report carries per-candidate result lists.
Candidate slate = report.md §4 (ADR-028 D1 blueprint, slug-corrected
against the live OpenRouter catalogue) + anthropic/claude-sonnet-4.5
as the operator's explicit add.
- scripts/compare_multi_provider_eval.py — comparison helper that
classifies each failure as regex_fallback / partial_fallback /
model_behavior. Catches the difference between adapter bugs and
genuine cross-provider capability gaps.
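A minimal sketch of the Responses-API to Chat-Completions translation that the adapter item above describes. The function names are hypothetical and are not taken from openrouter_eval_service.py; the dict shapes follow the public OpenAI tool formats (flat function tools and function_call items on the Responses side, nested "function" objects, message.tool_calls and role:tool messages on the Chat-Completions side).

```python
import json
from typing import Any


def responses_tool_to_chat_tool(tool: dict[str, Any]) -> dict[str, Any]:
    """Flat Responses-API function tool -> nested Chat-Completions tool spec."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool.get("description", ""),
            "parameters": tool.get(
                "parameters", {"type": "object", "properties": {}}
            ),
        },
    }


def chat_tool_calls_to_function_calls(message: dict[str, Any]) -> list[dict[str, Any]]:
    """message.tool_calls -> Responses-style function_call items.

    Several entries in one message correspond to parallel tool calls
    within a single loop iteration.
    """
    items = []
    for call in message.get("tool_calls") or []:
        items.append(
            {
                "type": "function_call",
                "call_id": call["id"],
                "name": call["function"]["name"],
                # Arguments stay a JSON-encoded string in both formats.
                "arguments": call["function"]["arguments"],
            }
        )
    return items


def tool_result_message(call_id: str, result: Any) -> dict[str, Any]:
    """Tool output -> Chat-Completions role:tool result message."""
    content = result if isinstance(result, str) else json.dumps(result)
    return {"role": "tool", "tool_call_id": call_id, "content": content}
```

Keeping each direction in its own small function makes the shapes easy to pin in hermetic tests without any network access.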
THE BUG (silent-fallback #3 of the session)
First-run results were suspicious: openai 10/10, but Sonnet 4.5
at the BOTTOM with 2/8 — worse than even Kimi. The comparison
script's classifier showed every single Sonnet failure was
regex_fallback: the adapter raised AgentExecutionError on every
turn, the resume-builder service caught it, the deterministic
step-machine ran the turn instead. Every "Sonnet response" in v1
was actually a canned step-machine reply.
Root cause: Anthropic models through OpenRouter IGNORE
response_format={"type":"json_object"} and wrap their JSON in
markdown code fences:
```json
{ ... }
```
My adapter's bare json.loads() rejected the fences -> silent
fallback. Other providers wrap intermittently (kimi, gemini,
deepseek showed partial fallbacks); Sonnet wraps consistently.
This is the THIRD silent-fallback bug this session, same shape as
the schema-400 (Slice 1C) and theme-registry drift (post-1C):
two contracts that should match drift apart, downstream silently
substitutes a "safe default", bug only surfaces when measurement
forces the question.
THE FIX
_parse_provider_json() in the adapter:
1. Fast path — json.loads(content) for compliant providers
2. Strip markdown fences (```json...``` or ```...```) and retry
3. Last-ditch: extract first balanced {...} substring (string-
literal aware so braces inside JSON strings don't throw the
count off) and retry
Wired into both run_tool_loop and run_json_prompt. 9 new hermetic
tests pin the parser behavior.
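The three steps above can be sketched roughly as follows. This is an illustration of the approach, not the committed _parse_provider_json(): the function names, the regex, and the choice of ValueError for the no-object case are all assumptions.

```python
import json
import re
from typing import Any, Optional

# Matches a whole-string fenced block with an optional language tag.
_FENCE_RE = re.compile(
    r"^\s*`{3}[A-Za-z0-9_-]*\s*\n(.*?)\n?\s*`{3}\s*$", re.DOTALL
)


def first_balanced_object(text: str) -> Optional[str]:
    """Return the first balanced {...} substring, ignoring braces in strings."""
    start = text.find("{")
    if start == -1:
        return None
    depth = 0
    in_string = False
    escaped = False
    for i in range(start, len(text)):
        ch = text[i]
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return text[start : i + 1]
    return None


def parse_provider_json_sketch(content: str) -> Any:
    # 1. Fast path for compliant providers.
    try:
        return json.loads(content)
    except json.JSONDecodeError:
        pass
    # 2. Strip a surrounding markdown fence and retry.
    match = _FENCE_RE.match(content)
    if match:
        try:
            return json.loads(match.group(1))
        except json.JSONDecodeError:
            pass
    # 3. Last-ditch: first balanced object, string-literal aware.
    candidate = first_balanced_object(content)
    if candidate is None:
        raise ValueError("no JSON object found in provider output")
    return json.loads(candidate)
```

Ordering the steps from strictest to loosest keeps compliant providers on the cheap path and makes the lenient extraction a last resort.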
V2 RESULTS (post-fix)
candidate    v1      v2      change
openai       10/10   10/10   -
sonnet-4.5    2/8     6/8    +4 (all 4 were regex_fallback)
gemini        5/8     6/8    +1
kimi          3/8     5/8    +2
glm           6/8     6/8    -
grok          6/8     6/8    -
deepseek      5/8     6/8    +1
qwen          5/8     5/8    -
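As a quick arithmetic cross-check, the change column and the total number of remaining v2 failures can be recomputed from the table. The numbers are copied from the table; the variable names are illustrative only.

```python
# (passed, total) per candidate, copied from the v1/v2 table.
V1 = {
    "openai": (10, 10), "sonnet-4.5": (2, 8), "gemini": (5, 8),
    "kimi": (3, 8), "glm": (6, 8), "grok": (6, 8),
    "deepseek": (5, 8), "qwen": (5, 8),
}
V2 = {
    "openai": (10, 10), "sonnet-4.5": (6, 8), "gemini": (6, 8),
    "kimi": (5, 8), "glm": (6, 8), "grok": (6, 8),
    "deepseek": (6, 8), "qwen": (5, 8),
}

# Scenarios gained per candidate between v1 and v2.
deltas = {name: V2[name][0] - V1[name][0] for name in V1}

# Scenario failures still present in v2, summed over all candidates.
remaining_failures = sum(total - passed for passed, total in V2.values())

print(deltas)
print(remaining_failures)  # 16
```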
Classifier on remaining v2 failures: 0 regex_fallback (eliminated),
5 partial_fallback, 11 model_behavior. The model_behavior failures
are the real cross-provider signal.
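One way such a classifier could work is sketched below. It assumes each failed scenario records, per turn, whether the step-machine fallback handled the turn; that per-turn flag and the rule "all turns / some turns / no turns" are assumptions made for illustration and may not match scripts/compare_multi_provider_eval.py.

```python
from collections import Counter
from typing import Iterable


def classify_failure(turn_used_fallback: list[bool]) -> str:
    """Classify one failed scenario from hypothetical per-turn fallback flags."""
    if turn_used_fallback and all(turn_used_fallback):
        # The model never actually answered; every turn was canned.
        return "regex_fallback"
    if any(turn_used_fallback):
        # Some turns came from the model, some from the step-machine.
        return "partial_fallback"
    # The model answered every turn and still failed the scenario.
    return "model_behavior"


def tally(failures: Iterable[list[bool]]) -> Counter:
    """Count failures per class across a set of failed scenarios."""
    return Counter(classify_failure(flags) for flags in failures)
```

For example, tally([[True, True], [True, False], [False, False]]) yields one failure in each class.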
Differentiating scenarios:
- github_url_fires_tool: openai/glm/grok call the tool reliably;
others ask user a clarifying question first
- promise_tracking_remembers_deferred_publication: half the
providers add to pending_followups[] reliably, half miss
- structured_payload_runs_after_generate: ONLY openai + grok
pass. The 11K-char structuring prompt with worked examples
stretches OpenRouter providers; OpenAI's native strict-schema
mode handles it cleanly.
HEADLINE CONCLUSION
Sonnet 4.5 ties — does NOT beat — the strong OpenRouter providers
at 6/8 on the cross-provider scenarios. No clean "switch to Sonnet"
signal. The 25-point pass-rate gap to the gpt-5.4 baseline (75% vs.
100%) concentrates in two specific behaviors. Recommendation: keep
gpt-5.4@medium as the default; sonnet/gemini/glm/grok/deepseek are all
viable failover targets for non-PII workloads under ADR-028 D1's
criteria.
ARTIFACTS PRESERVED
docs/eval-runs/2026-05-21-agentic-eval-v1-pre-fence-fix.json
docs/eval-runs/2026-05-21-agentic-eval-v2-post-fence-fix.json
Full per-candidate results with assistant_replies, tool_events,
and findings preserved for future re-analysis or prompt-iteration
delta measurement.
VERIFICATION: 224 hermetic tests across affected suites green. The
v2 live eval against 8 providers (7 OpenRouter + 1 OpenAI baseline)
completed cleanly. Three pact-tests now defend against the silent-
fallback antipattern (schema strictness, theme registry, parse
fence tolerance).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent a3e0cde commit aa4f12f
7 files changed
Lines changed: 5000 additions & 25 deletions