
Commit 15f16ac

LEANDERANTONY and claude committed
feat(eval): Slice 1H — comprehensive 5-candidate eval with latency / cost / qual judging
Phase A of the comprehensive multi-provider eval the operator commissioned.
5 candidates × 18 scenarios with first-class latency, token, cost, and
tool-discipline metrics, plus me-as-judge qualitative re-classification of
every failure.

WHAT LANDED

- Expanded scenario set from 10 to 18: 8 new conflict / adversarial /
  long-session scenarios (triple_role_correction, self_contradictory_info,
  off_topic_movie_question, out_of_scope_capability_probe,
  failed_tool_graceful_fallback, format_jumbled_dump,
  long_session_memory_callback, mixed_github_and_portfolio_urls). These are
  the ones that distinguish the top-4 candidates more sharply than the
  existing scenarios.
- Per-scenario metrics now first-class: wall-clock latency,
  prompt/completion tokens (from response.usage), USD cost (via
  tests/quality/provider_pricing.py), tool-call count. Surfaced on a tabular
  heartbeat line per scenario for live monitoring and in the JSON report for
  post-run analysis.
- Incremental JSON checkpoint after EVERY candidate completes, so partial
  results survive a mid-run failure (kill, network drop, provider hang). The
  killed v1 of this run flushed nothing despite running 5 minutes;
  checkpointing prevents that recurrence.
- Bumped OPENAI_MAX_COMPLETION_TOKENS_RESUME_BUILDER_STRUCTURING from 4000
  to 6000 via env-var override at runner import time. Slice 1G found the
  11K-char structuring prompt occasionally truncates; 6000 gives 50%
  headroom for the worst-case full output without inflating cost on the
  easy cases.
- 5th candidate `openai-via-or` (gpt-5.4 routed through OpenRouter) added so
  the latency comparison is apples-to-apples. The native gpt-5.4 baseline is
  also kept for production-reality reference.
- Two matcher fixes: `off_topic_movie_question` no longer flags polite
  refusals containing "recommend" as a substring;
  `out_of_scope_capability_probe` switched from a forbidden-phrase check to
  a positive-signal check for cleaner cross-provider scoring.
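The per-candidate checkpointing described above can be sketched roughly as follows. This is a minimal illustration, not the runner's actual code: `write_checkpoint`, `run_all`, and the `{"candidates": {...}}` report shape are hypothetical names chosen for the example, and the atomic temp-file-plus-rename write is an assumption about how one might guard against a kill mid-write.

```python
import json
import os
import tempfile
from typing import Callable, Dict, Iterable


def write_checkpoint(path: str, report: Dict) -> None:
    """Write the report to a temp file, then rename it over the target.

    The rename means a reader never sees a half-written JSON file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(report, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)  # replaces any previous checkpoint
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def run_all(
    candidates: Iterable[str],
    run_candidate: Callable[[str], Dict],  # hypothetical per-candidate runner
    checkpoint_path: str,
) -> Dict:
    """Run each candidate and flush the cumulative report after each one."""
    report: Dict = {"candidates": {}}
    for name in candidates:
        report["candidates"][name] = run_candidate(name)
        write_checkpoint(checkpoint_path, report)
    return report
```

If the third of five candidates hangs and the process is killed, the checkpoint file still holds the complete results for the first two.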
KEY FINDINGS

Latency: openai-native (8.7s) ≈ openai-via-or (8.3s), effectively a tie.
OpenRouter proxy overhead is essentially zero. The latency gap to the other
OpenRouter candidates is REAL model-inference time:

  sonnet-4.5:  17.1s (2× slower)
  gemini:      34.4s (4× slower; max 107s)
  deepseek:    57.6s (7× slower; max 128s)

Quality (me-as-judge re-classified, real behavior):

  openai (native):  18/18 effective (2 "fails", both curly-apostrophe
                    matcher bugs)
  openai-via-or:    16/16 effective
  sonnet-4.5:       14/16 effective (smart-clarification on github URL beat
                    the matcher)
  deepseek:         14/16 effective
  gemini:           12/16 effective (4 real failures incl. regex-fallback on
                    mixed_github)

Cost (4-provider OpenRouter spend for the full run):

  deepseek:       $0.17 (cheapest recorded)
  openai-via-or:  ~$0.13 (estimated; the pricing slug was missing this run)
  gemini:         $0.92
  sonnet-4.5:     $0.98

Tool discipline differences emerged on `failed_tool_graceful_fallback`
(Sonnet preempted the failure without calling the tool; others called +
reacted) and on `format_jumbled_dump` (Sonnet over-eagerly called fetch on a
URL the user said wasn't theirs).

RECOMMENDATION

- Production default: keep gpt-5.4 native. Fastest, cheapest in our setup,
  highest effective quality. No reason to change.
- For diversification / OpenRouter failover (ADR-028 D1):
  * DeepSeek for cost-sensitive batch workloads (parser, JD analysis): 6×
    cheaper than sonnet, acceptable quality, but 6-7× slower, so NOT for
    interactive chat.
  * Sonnet 4.5 for premium chat experience if cost is secondary: 2× slower
    than openai, but with the smart-clarification edge that GPT-5.4 doesn't
    have.
- Skip Gemini for this workload: worst quality + second-slowest + same cost
  as Sonnet.

ARTIFACTS PRESERVED

- docs/eval-runs/2026-05-21-agentic-eval-v3-comprehensive-5cand.json
  Full JSON report with per-provider, per-scenario: assistant_replies,
  tool_events, findings, latency, tokens, cost.
- docs/eval-runs/2026-05-21-comprehensive-eval-report.md
  The me-as-judge analysis report with quantitative metrics + qualitative
  judgments + final recommendation.
- docs/eval-runs/2026-05-21-comprehensive-eval-v3-live-log.txt
  Live streaming log (heartbeat lines, checkpoints, roll-ups) for
  audit-trail of the run shape.

WHAT THIS REVEALED ABOUT THE EVAL FRAMEWORK

Curly-apostrophe matcher bugs broke 5 scenarios across 3 providers.
Smart-clarification beat the matchers twice. Tool-discipline differences
emerged that pass/fail couldn't capture. Phase 3 candidate: normalise
apostrophes in matcher bodies (quick fix), and graduate the manual
me-as-judge pass to a structured rubric (MT-Bench-style) for repeatable
future runs.

COST RECONCILIATION

Operator topped up OpenRouter +$30. This run consumed:

  openai-via-or ~$0.13 + sonnet $0.98 + gemini $0.92 + deepseek $0.17
  = ~$2.20 of OpenRouter. Plus $0.14 of OpenAI native.

Total: ~$2.34, well under the $30 budget and the $25 estimate. The
smaller-scenario shorter-conversation default kept the spend modest.

Phase B (full pipeline eval against parser/JD/analysis gold suites) remains
parked. If the operator wants it: ~$20-25 additional spend, 1-day
infrastructure work to extend provider_ab_runner with the same metrics +
checkpointing pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
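As an illustration of the "normalise apostrophes in matcher bodies" quick fix proposed in the commit message, a minimal sketch might look like the following. The matcher code itself is not shown in this commit, so `normalise_quotes` and `contains_phrase` are hypothetical helpers, and the set of characters mapped is an assumption about which typographic variants the providers emit.

```python
# Map typographic quote characters to their ASCII equivalents.
_QUOTE_MAP = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark (the curly apostrophe)
    "\u02bc": "'",  # modifier letter apostrophe
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
})


def normalise_quotes(text: str) -> str:
    """Return text with curly quotes and apostrophes replaced by ASCII ones."""
    return text.translate(_QUOTE_MAP)


def contains_phrase(reply: str, phrase: str) -> bool:
    """Case-insensitive substring check that ignores quote styling."""
    haystack = normalise_quotes(reply).casefold()
    needle = normalise_quotes(phrase).casefold()
    return needle in haystack
```

With this in place, a matcher phrase written as `can't` would also match a reply containing `can’t`, which is the class of false failure the commit message attributes to curly apostrophes.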
1 parent 0c9895f commit 15f16ac

5 files changed

Lines changed: 3640 additions & 2 deletions
