Commit 15f16ac
feat(eval): Slice 1H — comprehensive 5-candidate eval with latency / cost / qual judging
Phase A of the comprehensive multi-provider eval the operator commissioned.
5 candidates × 18 scenarios with first-class latency, token, cost, and
tool-discipline metrics, plus me-as-judge qualitative re-classification
of every failure.
WHAT LANDED
- Expanded scenario set from 10 to 18: 8 new conflict / adversarial /
long-session scenarios (triple_role_correction,
self_contradictory_info, off_topic_movie_question,
out_of_scope_capability_probe, failed_tool_graceful_fallback,
format_jumbled_dump, long_session_memory_callback,
mixed_github_and_portfolio_urls). These are the ones that distinguish
the top-4 candidates more sharply than the existing scenarios.
- Per-scenario metrics now first-class: wall-clock latency,
prompt/completion tokens (from response.usage), USD cost (via
tests/quality/provider_pricing.py), tool-call count. Surfaced on
a tabular heartbeat line per scenario for live monitoring + in
the JSON report for post-run analysis.
- Incremental JSON checkpoint after EVERY candidate completes —
partial results survive a mid-run failure (kill, network drop,
provider hang). The killed v1 of this run flushed nothing
despite running 5 minutes; checkpointing prevents that recurrence.
- Bumped OPENAI_MAX_COMPLETION_TOKENS_RESUME_BUILDER_STRUCTURING
from 4000 to 6000 via env-var override at runner import time.
Slice 1G found that output for the 11K-char structuring prompt is
occasionally truncated; 6000 gives 50% headroom for the worst-case
full output without inflating cost on the easy cases.
- 5th candidate `openai-via-or` (gpt-5.4 routed through OpenRouter)
added so the latency comparison is apples-to-apples. The native
gpt-5.4 baseline is also kept for production-reality reference.
- Two matcher fixes — `off_topic_movie_question` no longer flags
polite refusals containing "recommend" as a substring;
`out_of_scope_capability_probe` switched from forbidden-phrase
to positive-signal check for cleaner cross-provider scoring.
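A minimal sketch of the per-scenario cost calculation and the per-candidate checkpoint write described in the bullets above. Everything named here is an assumption for illustration: the price table, its slug and rates, and the helpers `scenario_cost_usd` and `write_checkpoint` are hypothetical and are not the contents of tests/quality/provider_pricing.py or the runner.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional

# Hypothetical USD prices per million tokens as (prompt, completion).
# Placeholder slug and rates, not real provider pricing.
PRICES_PER_MTOK = {
    "example/model-a": (1.00, 4.00),
}


def scenario_cost_usd(slug: str, prompt_tokens: int, completion_tokens: int) -> Optional[float]:
    """Return the USD cost of one scenario, or None when the slug has no price entry."""
    rates = PRICES_PER_MTOK.get(slug)
    if rates is None:
        return None  # unknown slug: report the cost as missing rather than $0
    prompt_rate, completion_rate = rates
    return (prompt_tokens * prompt_rate + completion_tokens * completion_rate) / 1_000_000


def write_checkpoint(path: Path, report: dict) -> None:
    """Atomically rewrite the JSON report so a killed run keeps its last complete state."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(report, handle, indent=2)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_name, path)  # rename within one directory replaces the old file atomically
```

Calling `write_checkpoint` once per finished candidate means a kill or provider hang loses at most the candidate in flight. Returning `None` for an unpriced slug keeps a missing pricing entry visible in the report instead of silently recording zero cost.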
KEY FINDINGS
Latency: openai-native (8.7s) ≈ openai-via-or (8.3s), effectively a tie.
OpenRouter proxy overhead is essentially zero. The latency gap to
the other OpenRouter candidates is REAL model-inference time:
sonnet-4.5: 17.1s (2× slower)
gemini: 34.4s (4× slower; max 107s)
deepseek: 57.6s (7× slower; max 128s)
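The slowdown factors above can be reproduced from the reported latencies. This is only a sketch of the arithmetic, assuming the openai-native figure is the baseline; the dictionary name is illustrative.

```python
# Reported latency per candidate in seconds, copied from the figures above.
latency_s = {
    "openai-native": 8.7,
    "openai-via-or": 8.3,
    "sonnet-4.5": 17.1,
    "gemini": 34.4,
    "deepseek": 57.6,
}

# Assumption: slowdown is measured against the native OpenAI baseline.
baseline = latency_s["openai-native"]
slowdown = {name: seconds / baseline for name, seconds in latency_s.items()}

for name, factor in slowdown.items():
    print(f"{name:<14} {factor:.1f}x")
```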
Quality (me-as-judge re-classified, real behavior):
openai (native): 18/18 effective (2 "fails" both curly-apostrophe matcher bugs)
openai-via-or: 16/16 effective
sonnet-4.5: 14/16 effective (smart-clarification on github URL beat the matcher)
deepseek: 14/16 effective
gemini: 12/16 effective (4 real failures incl. regex-fallback on mixed_github)
Cost (4-provider OpenRouter spend for the full run):
deepseek: $0.17 (cheapest measured)
openai-via-or: ~$0.13 (estimated; the pricing slug was missing this run)
gemini: $0.92
sonnet-4.5: $0.98
Tool discipline differences emerged on `failed_tool_graceful_fallback`
(Sonnet preempted the failure without calling the tool; others called
+ reacted) and on `format_jumbled_dump` (Sonnet over-eagerly called
fetch on a URL the user said wasn't theirs).
RECOMMENDATION
- Production default: keep gpt-5.4 native. Fastest, cheapest in our
setup, highest effective quality. No reason to change.
- For diversification / OpenRouter failover (ADR-028 D1):
  * DeepSeek for cost-sensitive batch workloads (parser, JD
    analysis): 6× cheaper than sonnet, acceptable quality, but
    6-7× slower so NOT for interactive chat.
  * Sonnet 4.5 for premium chat experience if cost is secondary:
    2× slower than openai, but with the smart-clarification
    edge that GPT-5.4 doesn't have.
- Skip Gemini for this workload: worst quality + second-slowest +
same cost as Sonnet.
ARTIFACTS PRESERVED
- docs/eval-runs/2026-05-21-agentic-eval-v3-comprehensive-5cand.json
Full JSON report with per-provider, per-scenario:
assistant_replies, tool_events, findings, latency, tokens, cost.
- docs/eval-runs/2026-05-21-comprehensive-eval-report.md
The me-as-judge analysis report with quantitative metrics +
qualitative judgments + final recommendation.
- docs/eval-runs/2026-05-21-comprehensive-eval-v3-live-log.txt
Live streaming log (heartbeat lines, checkpoints, roll-ups) for
audit-trail of the run shape.
WHAT THIS REVEALED ABOUT THE EVAL FRAMEWORK
Curly-apostrophe matcher bugs broke 5 scenarios across 3 providers.
Smart-clarification beat the matchers twice. Tool-discipline
differences emerged that pass/fail couldn't capture.
Phase 3 candidate: normalise apostrophes in matcher bodies (quick
fix), and graduate the manual me-as-judge pass to a structured
rubric (MT-Bench-style) for repeatable future runs.
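One possible shape for the apostrophe quick fix, as a sketch only. The helper names `normalise_quotes` and `contains_phrase` are hypothetical and do not reflect the existing matcher code; the idea is to fold curly quotes to ASCII on both the reply and the expected phrase before comparing.

```python
# Curly quote characters mapped to their ASCII equivalents.
_QUOTE_MAP = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark (curly apostrophe)
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
})


def normalise_quotes(text: str) -> str:
    """Replace curly quotes with straight ASCII quotes."""
    return text.translate(_QUOTE_MAP)


def contains_phrase(reply: str, phrase: str) -> bool:
    """Case-insensitive phrase check that ignores curly/straight quote differences."""
    return normalise_quotes(phrase).casefold() in normalise_quotes(reply).casefold()
```

With this in place, a reply containing `can’t` (curly) matches a matcher phrase written as `can't` (straight), which is the class of false failure seen in this run.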
COST RECONCILIATION
Operator topped up OpenRouter +$30. This run consumed: openai-via-or
~$0.13 + sonnet $0.98 + gemini $0.92 + deepseek $0.17 = ~$2.20 of
OpenRouter. Plus $0.14 of OpenAI native. Total: ~$2.34 — well under
the $30 budget and the $25 estimate. The default of smaller scenarios
and shorter conversations kept the spend modest.
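The reconciliation arithmetic can be checked with a short script. Figures are copied from the paragraph above; the openai-via-or value is the estimate, since its pricing slug was missing. `Decimal` is used here only to avoid float rounding noise in the totals.

```python
from decimal import Decimal

# Per-candidate OpenRouter spend in USD, as reported above.
openrouter_spend = {
    "openai-via-or": Decimal("0.13"),  # estimate; pricing slug was missing this run
    "sonnet-4.5": Decimal("0.98"),
    "gemini": Decimal("0.92"),
    "deepseek": Decimal("0.17"),
}
openai_native_spend = Decimal("0.14")

openrouter_total = sum(openrouter_spend.values(), Decimal("0"))
grand_total = openrouter_total + openai_native_spend

print(f"OpenRouter: ${openrouter_total}")  # → OpenRouter: $2.20
print(f"Total: ${grand_total}")            # → Total: $2.34
```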
Phase B (full pipeline eval against parser/JD/analysis gold suites)
remains parked. If the operator wants it: ~$20-25 additional spend,
1-day infrastructure work to extend provider_ab_runner with the same
metrics + checkpointing pattern.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 0c9895f commit 15f16ac
5 files changed
Lines changed: 3640 additions & 2 deletions