Use `compare_results.py` to see what changed between two runs. It is useful for A/B tests ("is the new config better?"), regression tracking ("did the upgrade hurt anything?"), and model selection ("GPT‑4o or GPT‑5 as the baseline?").
What is a "run"? Every time you execute `python scripts/run_eval.py`, the output goes into a directory under `results/`. Each of those directories is a run. Comparing runs means diffing the `results.json` files produced by two of them.
python scripts/compare_results.py results/run-a results/run-b

Output:
## Comparison: run-a vs run-b
| Category | Metric | run-a | run-b | Delta | Change |
|----------|---------------------|---------|---------|---------|--------|
| Cost | model_router ($) | 0.0058 | 0.0052 | -0.0006 | -10.3% |
| Cost | baseline ($) | 0.0056 | 0.0056 | +0.0000 | +0.0% |
| Latency | model_router mean | 320.5 | 295.2 | -25.3 | -7.9% |
| Quality | router_win_rate | 0.4000 | 0.5500 | +0.1500 | +37.5% |
Negative deltas on cost and latency are usually good. Positive deltas on win rate are good.
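The Delta and Change columns in the sample output are consistent with a plain difference (run-b minus run-a) and a percent change relative to run-a. The sketch below reproduces that arithmetic for three rows of the table. The helper name `delta_and_change` is hypothetical and is not taken from `compare_results.py`, which may round values or handle a zero baseline differently.

```python
from typing import Optional, Tuple


def delta_and_change(a: float, b: float) -> Tuple[float, Optional[float]]:
    """Return (b - a, percent change relative to a).

    Hypothetical helper for illustration only.
    The percent change is None when a is 0, where it is undefined.
    """
    delta = b - a
    change = None if a == 0 else delta / a * 100
    return delta, change


# (run-a, run-b) values copied from the sample table above.
rows = {
    "model_router ($)": (0.0058, 0.0052),
    "model_router mean": (320.5, 295.2),
    "router_win_rate": (0.4000, 0.5500),
}

for metric, (a, b) in rows.items():
    delta, change = delta_and_change(a, b)
    print(f"{metric}: delta={delta:+.4f} change={change:+.1f}%")
```

Because the change is relative to the first argument, swapping the two run directories flips the sign of every delta and also changes the percentages.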
To write the same comparison to a CSV file:

python scripts/compare_results.py results/run-a results/run-b --format csv > comparison.csv

# Run 1: router vs gpt-5
python scripts/run_eval.py --output-dir results/vs-gpt5
# Run 2: router vs gpt-4o (edit configs/default.yaml: deployment_name: gpt-4o)
python scripts/run_eval.py --output-dir results/vs-gpt4o
# Compare
python scripts/compare_results.py results/vs-gpt5 results/vs-gpt4o

This tells you which baseline Model Router beats more decisively, and therefore which baseline is the right one for your use case.
python scripts/run_eval.py --dataset data/customer_support.jsonl --output-dir results/support
python scripts/run_eval.py --dataset data/code_review.jsonl --output-dir results/code
python scripts/compare_results.py results/support results/code

Useful for spotting that Model Router is great for one workload (say, customer support) but neutral or worse for another (say, deep code review).
python scripts/compare_results.py results/2026-04-week1 results/2026-04-week2

Good for catching regressions after Azure model updates or pricing changes.
| Category | Metrics |
|---|---|
| Cost | Total cost per endpoint |
| Latency | Mean, p90, p99 per endpoint |
| Quality | Win rate, loss rate, tie rate |
| Requests | Total request count |
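As a rough illustration of what these metrics summarize, the sketch below computes them from raw per-request data for a single endpoint. Everything about the inputs is an assumption: the list of latency values, the `"win"`/`"loss"`/`"tie"` labels, the sample numbers, and the nearest-rank percentile method are all hypothetical, and the evaluation pipeline may store and compute these differently.

```python
import math
from collections import Counter
from statistics import mean


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies, outcomes):
    """Summarize one endpoint's run (hypothetical input shapes)."""
    counts = Counter(outcomes)
    total = len(outcomes)
    return {
        "latency_mean": mean(latencies),
        "latency_p90": percentile(latencies, 90),
        "latency_p99": percentile(latencies, 99),
        "win_rate": counts["win"] / total,
        "loss_rate": counts["loss"] / total,
        "tie_rate": counts["tie"] / total,
        "requests": total,
    }


# Made-up sample data: ten requests, one slow outlier.
latencies = [210, 250, 280, 300, 310, 320, 335, 350, 400, 900]
outcomes = ["win"] * 5 + ["loss"] * 3 + ["tie"] * 2
print(summarize(latencies, outcomes))
```

A single slow request moves p99 much more than the mean, which is one reason to look at the tail latencies as well as the average when comparing runs.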
Both directories must contain a `results.json` file (generated by the evaluation pipeline). If a run was interrupted and never finished, resume it with `--resume` first; only completed runs produce a `results.json`.
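Before comparing, it can help to confirm that both run directories are complete. The sketch below is a minimal pre-flight check that only tests for the presence of `results.json`. The function name `missing_results` is hypothetical and not part of the repository's scripts.

```python
from pathlib import Path


def missing_results(*run_dirs):
    """Return the run directories that do not contain a results.json file."""
    return [d for d in run_dirs if not (Path(d) / "results.json").is_file()]


# Example directory names taken from the comparison command above.
for run_dir in missing_results("results/run-a", "results/run-b"):
    print(f"{run_dir}: no results.json found (incomplete run?)")
```

If a directory is reported here, resume or rerun that evaluation before running the comparison.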