How To: Compare Two Evaluation Runs

Use compare_results.py to see what changed between two runs — useful for A/B tests ("is the new config better?"), regression tracking ("did the upgrade hurt anything?"), and model selection ("GPT‑4o or GPT‑5 as the baseline?").

What is a "run"? Every time you execute python scripts/run_eval.py, the output goes into a directory under results/. Each of those directories is a run. Comparing runs means diffing the results.json files produced by two of them.
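To make "diffing the results.json files" concrete, here is a minimal sketch that pairs up the numeric values two runs have in common. It makes no claim about the real results.json schema or about how compare_results.py works internally: the helper names `flatten_numbers` and `diff_runs` are hypothetical, and the only assumption is that each run directory holds a JSON object with numeric leaves.

```python
import json
from pathlib import Path


def flatten_numbers(node, prefix=""):
    """Walk nested dicts and return {dotted.key: number} for numeric leaves.

    Lists, strings and booleans are ignored in this sketch.
    """
    flat = {}
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{prefix}.{key}" if prefix else str(key)
            flat.update(flatten_numbers(value, child))
    elif isinstance(node, (int, float)) and not isinstance(node, bool):
        flat[prefix] = node
    return flat


def diff_runs(run_a_dir, run_b_dir):
    """Return {metric: (value_a, value_b)} for numeric metrics present in both runs."""
    a = flatten_numbers(json.loads((Path(run_a_dir) / "results.json").read_text()))
    b = flatten_numbers(json.loads((Path(run_b_dir) / "results.json").read_text()))
    return {key: (a[key], b[key]) for key in sorted(a.keys() & b.keys())}


# Usage (paths are the ones used in this guide):
# for metric, (old, new) in diff_runs("results/run-a", "results/run-b").items():
#     print(metric, old, new)
```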

Basic usage

python scripts/compare_results.py results/run-a results/run-b

Output:

## Comparison: run-a vs run-b

| Category | Metric              | run-a   | run-b   | Delta   | Change |
|----------|---------------------|---------|---------|---------|--------|
| Cost     | model_router ($)    | 0.0058  | 0.0052  | -0.0006 | -10.3% |
| Cost     | baseline ($)        | 0.0056  | 0.0056  | +0.0000 | +0.0%  |
| Latency  | model_router mean   | 320.5   | 295.2   | -25.3   | -7.9%  |
| Quality  | router_win_rate     | 0.4000  | 0.5500  | +0.1500 | +37.5% |

Delta is the run-b value minus the run-a value. Negative deltas on cost and latency usually mean run-b is better; positive deltas on win rate mean run-b is better.
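The sample rows are consistent with Change being Delta expressed as a percentage of the run-a value. The sketch below reproduces that arithmetic for the first cost row; it is inferred from the sample output, not taken from compare_results.py, and `delta_and_change` is a hypothetical name.

```python
def delta_and_change(value_a, value_b):
    """Return (delta, percent change relative to value_a).

    The percent change is None when value_a is zero, since it is undefined.
    """
    delta = value_b - value_a
    change_pct = delta / value_a * 100 if value_a else None
    return delta, change_pct


# First cost row of the sample table: 0.0058 -> 0.0052
delta, pct = delta_and_change(0.0058, 0.0052)
print(f"{delta:+.4f} {pct:+.1f}%")  # prints: -0.0006 -10.3%
```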

Export as CSV

python scripts/compare_results.py results/run-a results/run-b --format csv > comparison.csv
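Once exported, the CSV can be loaded with Python's standard csv module. The sketch below assumes the CSV header mirrors the columns of the table above (Category, Metric, run-a, run-b, Delta, Change); that is an assumption, so check the first line of your own comparison.csv. The `read_comparison` helper and the inline sample are illustrative only.

```python
import csv
import io


def read_comparison(csv_text):
    """Parse exported CSV text into a list of row dicts (one per metric)."""
    return list(csv.DictReader(io.StringIO(csv_text)))


# Hypothetical sample; the real header row may differ from the table above.
sample = """Category,Metric,run-a,run-b,Delta,Change
Cost,model_router ($),0.0058,0.0052,-0.0006,-10.3%
Quality,router_win_rate,0.4000,0.5500,+0.1500,+37.5%
"""

rows = read_comparison(sample)
cost_rows = [r for r in rows if r["Category"] == "Cost"]
print(cost_rows[0]["Metric"], float(cost_rows[0]["Delta"]))  # prints: model_router ($) -0.0006
```

For a real export, pass the file contents instead of the sample, for example `read_comparison(open("comparison.csv").read())`.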

Common scenarios

Compare different baseline models

# Run 1: router vs gpt-5
python scripts/run_eval.py --output-dir results/vs-gpt5

# Run 2: router vs gpt-4o (edit configs/default.yaml: deployment_name: gpt-4o)
python scripts/run_eval.py --output-dir results/vs-gpt4o

# Compare
python scripts/compare_results.py results/vs-gpt5 results/vs-gpt4o

This shows which baseline Model Router beats more decisively, which can help you decide which baseline suits your use case.

Compare different datasets

python scripts/run_eval.py --dataset data/customer_support.jsonl --output-dir results/support
python scripts/run_eval.py --dataset data/code_review.jsonl --output-dir results/code

python scripts/compare_results.py results/support results/code

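Useful for spotting that Model Router is great for one workload (say, customer support) but neutral or worse for another (say, deep code review). Because cost is reported as a total per endpoint, check the request counts first when the two datasets differ in size.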

Track improvement over time

python scripts/compare_results.py results/2026-04-week1 results/2026-04-week2

Good for catching regressions after Azure model updates or pricing changes.
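If you compare runs every week, a small check can flag metrics that moved in the wrong direction. This is a sketch under stated assumptions: the 5% tolerance is arbitrary, `is_regression` is a hypothetical helper that is not part of the scripts, and the caller supplies the direction (lower is better for cost and latency, higher is better for win rate).

```python
def is_regression(old, new, higher_is_better, tolerance_pct=5.0):
    """Return True if the metric got worse by more than tolerance_pct percent.

    `old` and `new` are the same metric from the earlier and later run.
    """
    if old == 0:
        return False  # percent change is undefined; inspect these by hand
    change_pct = (new - old) / old * 100
    if higher_is_better:
        return change_pct < -tolerance_pct
    return change_pct > tolerance_pct


# Latency mean dropped from 320.5 to 295.2: an improvement, not a regression.
print(is_regression(320.5, 295.2, higher_is_better=False))  # prints: False
# Win rate falling from 0.55 to 0.40 would be flagged.
print(is_regression(0.55, 0.40, higher_is_better=True))  # prints: True
```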

What's compared

| Category | Metrics                       |
|----------|-------------------------------|
| Cost     | Total cost per endpoint       |
| Latency  | Mean, p90, p99 per endpoint   |
| Quality  | Win rate, loss rate, tie rate |
| Requests | Total request count           |

Requirements

Both directories must contain a results.json file (generated by the evaluation pipeline). If a run was interrupted and never finished, resume it with --resume first — only completed runs produce a results.json.
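A quick pre-flight check can confirm both run directories are complete before you compare them. This is a minimal sketch using only the standard library; `missing_results` is a hypothetical helper, not part of the scripts.

```python
from pathlib import Path


def missing_results(*run_dirs):
    """Return the run directories that do not contain a results.json file."""
    return [str(d) for d in run_dirs if not (Path(d) / "results.json").is_file()]


# Usage (paths are the ones used in this guide):
# incomplete = missing_results("results/run-a", "results/run-b")
# if incomplete:
#     print("No results.json in:", ", ".join(incomplete))
```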
