How To: Compare Two Evaluation Runs

Use compare_results.py to see what changed between two runs. It is useful for A/B tests ("is the new config better?"), regression tracking ("did the upgrade hurt anything?"), and model selection ("GPT-4o or GPT-5 as the baseline?").

What is a "run"? Every time you execute python scripts/run_eval.py, the output goes into a directory under results/. Each of those directories is a run. Comparing runs means diffing the results.json files produced by two of them.

Basic usage

python scripts/compare_results.py results/run-a results/run-b

Output:

## Comparison: run-a vs run-b

| Category | Metric              | run-a   | run-b   | Delta   | Change |
|----------|---------------------|---------|---------|---------|--------|
| Cost     | model_router ($)    | 0.0058  | 0.0052  | -0.0006 | -10.3% |
| Cost     | baseline ($)        | 0.0056  | 0.0056  | +0.0000 | +0.0%  |
| Latency  | model_router mean   | 320.5   | 295.2   | -25.3   | -7.9%  |
| Quality  | router_win_rate     | 0.4000  | 0.5500  | +0.1500 | +37.5% |

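Delta is the second run minus the first (run-b minus run-a), and Change is that delta as a percentage of the run-a value. Negative deltas on cost and latency mean the second run was cheaper or faster, which is usually good. Positive deltas on win rate mean the router won more often in the second run, which is good.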
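The Delta and Change columns in the sample output are consistent with a plain difference and a percentage relative to the first run. A minimal sketch of that arithmetic, assuming this is how the columns are derived (the function name `delta_and_change` is hypothetical and not part of compare_results.py):

```python
from typing import Optional, Tuple


def delta_and_change(first: float, second: float) -> Tuple[float, Optional[float]]:
    """Return (second - first, percent change relative to first).

    The percent change is None when the first value is zero, because the
    ratio is undefined in that case.
    """
    delta = second - first
    change = None if first == 0 else delta / first * 100
    return delta, change


if __name__ == "__main__":
    # Values copied from the sample table above.
    for metric, a, b in [
        ("model_router ($)", 0.0058, 0.0052),
        ("model_router mean", 320.5, 295.2),
        ("router_win_rate", 0.4000, 0.5500),
    ]:
        delta, change = delta_and_change(a, b)
        print(f"{metric}: delta={delta:+.4f} change={change:+.1f}%")
```

Note that a percentage change on a rate such as router_win_rate is relative: going from 0.40 to 0.55 is +15 percentage points but +37.5% in the Change column.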

Export as CSV

python scripts/compare_results.py results/run-a results/run-b --format csv > comparison.csv
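Once exported, the CSV can be filtered with the standard library. The sketch below assumes the CSV header mirrors the table columns shown earlier (Category, Metric, run-a, run-b, Delta, Change); the source does not confirm the exact CSV layout, so check the header of your own comparison.csv first. The helper `rows_for_category` and the `SAMPLE` text are hypothetical, for illustration only.

```python
import csv
import io
from typing import Dict, List


def rows_for_category(csv_text: str, category: str) -> List[Dict[str, str]]:
    """Return the rows whose Category column equals `category`.

    Assumes a header row containing a "Category" column (an assumption,
    not confirmed by the tool's documentation).
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row["Category"] == category]


# Hypothetical sample, built from the values in the table above.
SAMPLE = """Category,Metric,run-a,run-b,Delta,Change
Cost,model_router ($),0.0058,0.0052,-0.0006,-10.3%
Quality,router_win_rate,0.4000,0.5500,+0.1500,+37.5%
"""

if __name__ == "__main__":
    for row in rows_for_category(SAMPLE, "Cost"):
        print(row["Metric"], row["Delta"], row["Change"])
```

To use it on a real export, read the file with `open("comparison.csv", newline="").read()` and pass the text in.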

Common scenarios

Compare different baseline models

# Run 1: router vs gpt-5
python scripts/run_eval.py --output-dir results/vs-gpt5

# Run 2: router vs gpt-4o (edit configs/default.yaml: deployment_name: gpt-4o)
python scripts/run_eval.py --output-dir results/vs-gpt4o

# Compare
python scripts/compare_results.py results/vs-gpt5 results/vs-gpt4o

This tells you which baseline Model Router beats more decisively — and therefore which baseline is the right one for your use case.

Compare different datasets

python scripts/run_eval.py --dataset data/customer_support.jsonl --output-dir results/support
python scripts/run_eval.py --dataset data/code_review.jsonl --output-dir results/code

python scripts/compare_results.py results/support results/code

Useful for spotting that Model Router is great for one workload (say, customer support) but neutral or worse for another (say, deep code review).

Track improvement over time

python scripts/compare_results.py results/2026-04-week1 results/2026-04-week2

Good for catching regressions after Azure model updates or pricing changes.
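With more than two runs, it can help to compare each run against the one before it. A minimal sketch, assuming run directory names sort chronologically (as the week1/week2 names above do); the helpers `consecutive_pairs` and `compare_commands` are hypothetical and only build the command lines, using the same invocation shown in this guide:

```python
from typing import List, Sequence, Tuple


def consecutive_pairs(run_dirs: Sequence[str]) -> List[Tuple[str, str]]:
    """Pair each run with the next one after sorting by name."""
    ordered = sorted(run_dirs)
    return list(zip(ordered, ordered[1:]))


def compare_commands(run_dirs: Sequence[str]) -> List[List[str]]:
    """Build one compare_results.py command per consecutive pair."""
    return [
        ["python", "scripts/compare_results.py", older, newer]
        for older, newer in consecutive_pairs(run_dirs)
    ]


if __name__ == "__main__":
    runs = ["results/2026-04-week2", "results/2026-04-week1"]
    for cmd in compare_commands(runs):
        # Pass each command to subprocess.run(cmd, check=True) to execute it.
        print(" ".join(cmd))
```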

What's compared

| Category | Metrics                        |
|----------|--------------------------------|
| Cost     | Total cost per endpoint        |
| Latency  | Mean, p90, p99 per endpoint    |
| Quality  | Win rate, loss rate, tie rate  |
| Requests | Total request count            |

Requirements

Both directories must contain a results.json file (generated by the evaluation pipeline). If a run was interrupted and never finished, resume it with --resume first — only completed runs produce a results.json.
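A quick pre-flight check can confirm that requirement before running the comparison. This is a sketch only; `missing_results` is a hypothetical helper, not part of the repository's scripts:

```python
from pathlib import Path
from typing import List


def missing_results(*run_dirs: str) -> List[str]:
    """Return the run directories that do not contain a results.json file."""
    return [d for d in run_dirs if not (Path(d) / "results.json").is_file()]


if __name__ == "__main__":
    missing = missing_results("results/run-a", "results/run-b")
    if missing:
        print("No results.json in:", ", ".join(missing))
    else:
        print("Both runs are complete and ready to compare.")
```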