A plain-language guide to every output file, chart, and metric the evaluation produces — so you can answer the only question that really matters: "Should I use Model Router for my workload?"
First time here? Run the QUICKSTART demo first. The dashboard you generated there has exactly the same charts as a real evaluation — just filled with synthetic data — so you can use this guide alongside the demo output.
Open dashboard.html and look at three things:
- Cost comparison chart → Did Model Router save money? (Lower bar = cheaper.)
- Latency comparison chart → Did Model Router add lag? (Look at p95/p99, not just mean.)
- Pairwise win rates / absolute scores → Did quality hold up?
If cost dropped, latency stayed flat or improved, and quality is within 0.3 points of the baseline, Model Router is a win for your workload. The rest of this page explains why each chart says what it does.
| Term | Plain English |
|---|---|
| Router | The Model Router endpoint — the system being evaluated |
| Baseline | The fixed model you compared against (e.g. GPT‑5) |
| p50 / p90 / p95 / p99 | Latency percentiles. p95 = 95% of requests were faster than this. The high percentiles tell you about worst-case user experience, not the average |
| Pairwise win rate | Out of all prompts, what % did the judge prefer the router's answer for? |
| Absolute score | A 1–5 grade per response on accuracy/completeness/clarity/helpfulness, given by the judge model |
| Tie | The judge couldn't pick a clear winner — either both responses were similar quality, or position bias was detected and the result was discarded |
| Position bias | LLM judges tend to prefer whichever response they see first. We run each comparison twice with swapped order to detect and neutralise this |
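The dual-ordering check described in the last two rows can be sketched in a few lines. This is an illustration only, not the evaluation's actual judge code: the function name `resolve_pairwise` and the verdict labels `"first"`, `"second"`, and `"tie"` are assumptions, as is the choice to treat any disagreement between the two runs as a tie.

```python
def resolve_pairwise(verdict_router_first: str, verdict_baseline_first: str) -> str:
    """Combine two judge verdicts for the same prompt, one per presentation order.

    Each verdict is "first", "second", or "tie" and names the position the
    judge preferred. Returns "router", "baseline", or "tie".
    """
    # Translate position-based verdicts into endpoint names.
    run_a = {"first": "router", "second": "baseline", "tie": "tie"}[verdict_router_first]
    run_b = {"first": "baseline", "second": "router", "tie": "tie"}[verdict_baseline_first]

    # Only a verdict that survives the order swap counts as a win.
    if run_a == run_b:
        return run_a

    # The two runs disagree: likely position bias or a near-tie, so discard it.
    return "tie"
```

A judge that picks the first response in both orderings ends up with `"tie"` here, because its preference followed the position rather than the content.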
After an evaluation run, the output directory contains:
| File | What it is |
|---|---|
| `dashboard.html` | Start here. Self-contained HTML with all charts and metrics; open in any browser, no server needed |
| `report.md` | Markdown summary, good for pasting into docs, PRs, or Slack |
| `detailed_results.csv` | Per-prompt data: latency, tokens, model, status; open in Excel for further analysis |
| `results.json` | Machine-readable metrics for programmatic consumption (CI gates, dashboards) |
| `raw_results.jsonl` | Full API responses including response text; useful when debugging quality issues |
| `chart_*.png` | Individual chart images; paste into slides |
Bar chart comparing total estimated cost (USD) for Model Router vs baseline. The router cost includes the markup on input tokens plus the underlying model's per-token pricing. Lower is better.
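As a rough sketch of how such a cost estimate could be assembled from token counts: the function name, the prices, and the markup value below are placeholders rather than real Model Router pricing, and treating the markup as a flat per-million-token charge on input tokens is an assumption based only on the description above.

```python
def estimate_cost_usd(
    prompt_tokens: int,
    completion_tokens: int,
    input_price_per_1m: float,
    output_price_per_1m: float,
    router_markup_per_1m: float = 0.0,
) -> float:
    """Estimate the cost of a run from token counts and per-million-token prices.

    Leave router_markup_per_1m at 0.0 for the baseline endpoint.
    """
    # Assumption: the router markup applies to input (prompt) tokens only.
    input_cost = prompt_tokens / 1_000_000 * (input_price_per_1m + router_markup_per_1m)
    output_cost = completion_tokens / 1_000_000 * output_price_per_1m
    return input_cost + output_cost


# Placeholder prices, not real pricing.
baseline_cost = estimate_cost_usd(1_000_000, 500_000, 4.0, 8.0)
router_cost = estimate_cost_usd(1_000_000, 500_000, 1.0, 4.0, router_markup_per_1m=0.5)
```

In a real run the router's per-token prices vary by prompt, since they depend on which underlying model served each request, so the estimate would be summed per prompt rather than computed once.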
Bar chart showing mean, p50, p90, p95, p99 latency for both endpoints. The mean can hide a long tail — always check p95/p99 to understand the worst user experience, not just the average.
Histogram showing the spread of per-prompt latencies. A tighter distribution means more predictable response times, which often matters more in production than raw average speed.
Grouped bar chart breaking down mean latency by prompt category (if your dataset has them). Helps identify which types of prompts the router handles faster or slower than the baseline.
Stacked bar chart showing prompt tokens vs completion tokens for each endpoint. If the router and baseline produce very different completion lengths, that affects both cost and latency — this chart shows it directly.
Pie chart showing which underlying models the router picked, and how often. This is the chart that explains how Model Router achieved its cost/latency numbers — e.g. routing simple prompts to cheaper models and complex ones to more capable models.
Bar chart showing the percentage of prompts where Model Router won, lost, or tied against the baseline in a head-to-head judge comparison. Uses dual-ordering to cancel position bias (see Glossary).
Bar chart comparing mean scores (1–5) across four dimensions: Accuracy, Completeness, Clarity, and Helpfulness. Useful when the pairwise rate is close — absolute scores reveal whether "tie" means "both great" or "both mediocre".
| Metric | Description |
|---|---|
| `estimated_cost_usd` | Total estimated cost based on token counts and configured pricing |
| `cost_per_prompt_avg_usd` | Average cost per prompt |
| `cost_savings_ratio` | `(baseline_cost - router_cost) / baseline_cost`; positive means router is cheaper |
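The two derived cost metrics are simple arithmetic on the totals. A minimal sketch, assuming you already have the total cost for each endpoint and the prompt count (the function name `cost_metrics` is illustrative, not part of the tool):

```python
def cost_metrics(baseline_cost_usd: float, router_cost_usd: float, prompt_count: int) -> dict:
    """Derive per-prompt cost and the savings ratio from total costs."""
    if baseline_cost_usd <= 0 or prompt_count <= 0:
        raise ValueError("baseline cost and prompt count must be positive")
    return {
        # Average router cost per prompt.
        "cost_per_prompt_avg_usd": router_cost_usd / prompt_count,
        # Positive means the router is cheaper, negative means it is more expensive.
        "cost_savings_ratio": (baseline_cost_usd - router_cost_usd) / baseline_cost_usd,
    }
```

For example, a baseline total of 8.00 USD against a router total of 6.00 USD gives a savings ratio of 0.25: the router was 25% cheaper.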
| Metric | Description |
|---|---|
| `mean_ms` | Average response time (in milliseconds) |
| `median_ms` | 50th percentile (p50) |
| `p90_ms` / `p95_ms` / `p99_ms` | Tail latency percentiles |
| `latency_diff_mean_ms` | `router_mean - baseline_mean`; negative means router is faster |
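To see why the tail percentiles matter, here is a small sketch that computes the same set of latency metrics from a list of per-prompt latencies. It uses the nearest-rank percentile definition, which is an assumption; the evaluation may interpolate differently, so values on small samples can differ slightly. The function names are illustrative.

```python
import math


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(latencies_ms: list[float]) -> dict:
    """Summarise per-prompt latencies using the metric names from the table above."""
    return {
        "mean_ms": sum(latencies_ms) / len(latencies_ms),
        "median_ms": percentile(latencies_ms, 50),
        "p90_ms": percentile(latencies_ms, 90),
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
    }


# 18 fast requests and 2 very slow ones: the mean looks moderate, the tail does not.
router = latency_summary([100.0] * 18 + [2000.0] * 2)
baseline = latency_summary([300.0] * 20)
latency_diff_mean_ms = router["mean_ms"] - baseline["mean_ms"]
```

In this made-up sample the router's mean is slightly lower than the baseline's, so `latency_diff_mean_ms` is negative, yet its p95 is far higher. That is the situation the mean alone would hide.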
| Metric | Description |
|---|---|
| `router_win_rate` | % of prompts where router response was preferred |
| `baseline_win_rate` | % of prompts where baseline was preferred |
| `tie_rate` | % of prompts with no clear winner |
| `router_mean_overall` | Mean absolute score (1–5) across all dimensions |
| `baseline_mean_overall` | Same for baseline |
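The three rates always sum to 100%, because every prompt ends as a router win, a baseline win, or a tie. A minimal sketch of that bookkeeping, assuming one outcome label per prompt (the labels and the function name are illustrative, and expressing rates on a 0–100 scale is an assumption):

```python
from collections import Counter


def win_rates(outcomes: list[str]) -> dict:
    """Turn per-prompt outcomes ("router", "baseline", "tie") into percentage rates."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    counts = Counter(outcomes)
    total = len(outcomes)
    return {
        "router_win_rate": 100 * counts["router"] / total,
        "baseline_win_rate": 100 * counts["baseline"] / total,
        "tie_rate": 100 * counts["tie"] / total,
    }
```

A high tie rate is not good or bad on its own; compare the absolute scores to see whether the tied responses were both strong or both weak.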
| Metric | Description |
|---|---|
| `model_distribution` | Dict mapping model name → count of prompts served |
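A dict of this shape is easy to rebuild or post-process yourself. The sketch below counts prompts per model and converts the counts to percentage shares; the model names are placeholders and the helper names are illustrative, not part of the tool.

```python
from collections import Counter


def model_distribution(models: list[str]) -> dict[str, int]:
    """Count how many prompts each underlying model served."""
    return dict(Counter(models))


def model_shares(distribution: dict[str, int]) -> dict[str, float]:
    """Convert prompt counts into percentage shares of all prompts."""
    total = sum(distribution.values())
    return {model: 100 * count / total for model, count in distribution.items()}


# Placeholder model names for illustration.
dist = model_distribution(["model-small", "model-small", "model-small", "model-large"])
shares = model_shares(dist)
```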
report.md follows this structure:
- Header — evaluation name, timestamp, prompt count
- Cost summary — total and per-prompt costs, savings ratio
- Latency summary — percentile table, per-category breakdown
- Model distribution — which models the router picked
- Quality summary — win rates, absolute scores (if judge was enabled)
- Configuration — endpoints, pricing, concurrency settings used (so the run is reproducible)
A "Model Router is working" result typically has:
- Cost savings > 0% — the router found cheaper models for at least some prompts
- Latency diff near 0 or negative — router isn't adding significant overhead
- Win rate ≥ 40%, loss rate < 20% — quality is maintained or improved
- Absolute scores within 0.3 of baseline — no significant quality degradation
If cost is way down but quality also dropped, the router is just picking cheap-and-bad models for your workload — that's not a win, and you may need a stricter routing policy or a different baseline.
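The checklist above translates into a small gate you could run in CI. This is a sketch under several assumptions: the metrics arrive as a flat dict keyed by the metric names used in this guide (the real `results.json` layout may be nested differently), win rates are on a 0–100 scale, "near 0" latency is expressed through a hypothetical `latency_tolerance_ms` parameter, and "within 0.3" is read as the router scoring no more than 0.3 below the baseline.

```python
def check_router_result(metrics: dict, latency_tolerance_ms: float = 50.0) -> dict:
    """Apply the four rule-of-thumb checks and report each one separately."""
    checks = {
        # Router found cheaper models for at least some prompts.
        "cost_savings": metrics["cost_savings_ratio"] > 0,
        # Router is not adding significant overhead (tolerance is a chosen value).
        "latency": metrics["latency_diff_mean_ms"] <= latency_tolerance_ms,
        # Win rate of at least 40% and loss rate (baseline wins) under 20%.
        "win_rate": metrics["router_win_rate"] >= 40
        and metrics["baseline_win_rate"] < 20,
        # Router's mean score is at most 0.3 below the baseline's.
        "absolute_scores": metrics["baseline_mean_overall"]
        - metrics["router_mean_overall"]
        <= 0.3,
    }
    checks["all_passed"] = all(checks.values())
    return checks
```

Reporting each check separately makes the cheap-and-bad case easy to spot: `cost_savings` passes while `win_rate` and `absolute_scores` fail.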
If you also ran python scripts/run_foundry_eval.py, the results/foundry-eval/ directory contains:
| File | What it is |
|---|---|
| `report.md` | Grader results table with mean scores and pass rates |
| `results.json` | Machine-readable: `grader_summary`, `per_item_scores`, `result_counts` |
| `foundry_input.jsonl` | The transformed data that was uploaded to Foundry |
The grader_summary in results.json contains per-grader statistics:
| Field | Description |
|---|---|
| `mean` | Average score across all prompts |
| `pass_rate` | Percentage of prompts that passed the threshold |
| `count` | Number of prompts scored |
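These three fields can be reproduced from a grader's per-prompt scores, which is handy when you want to re-check a summary or try a different threshold. A minimal sketch, assuming a plain list of numeric scores and a pass rate on a 0–100 scale (the function name is illustrative):

```python
def summarize_grader(scores: list[float], pass_threshold: float) -> dict:
    """Compute mean, pass_rate, and count for one grader's per-prompt scores."""
    if not scores:
        raise ValueError("need at least one score")
    # A prompt passes when its score meets or exceeds the threshold.
    passed = sum(1 for score in scores if score >= pass_threshold)
    return {
        "mean": sum(scores) / len(scores),
        "pass_rate": 100 * passed / len(scores),
        "count": len(scores),
    }
```

Note that a decent mean can coexist with a poor pass rate when a few very high scores offset many failures, so read the two together.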
| Grader | Type | Pass threshold | What it measures |
|---|---|---|---|
| `quality_absolute_router` | `score_model` | ≥ 3/5 | Router response quality |
| `quality_absolute_baseline` | `score_model` | ≥ 3/5 | Baseline response quality |
| `quality_pairwise` | `score_model` | ≥ 3/5 | Head-to-head: is router ≥ baseline? |
| `mr_cost_comparison` | `python` | ≥ 0.5 | Cost savings ratio: `1 - (router/baseline)` |
| `mr_latency_comparison` | `python` | ≥ 0.5 | Latency improvement ratio |
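The 0.5 threshold on the cost grader is stricter than it may look. The sketch below applies the formula from the table to a pair of costs; the function names are illustrative, and whether Foundry evaluates this per prompt or on totals is not stated here, so treat the inputs as whichever pair of costs the grader sees.

```python
def cost_grader_score(router_cost: float, baseline_cost: float) -> float:
    """Cost savings ratio as defined for the grader: 1 - (router / baseline)."""
    if baseline_cost <= 0:
        raise ValueError("baseline cost must be positive")
    return 1 - router_cost / baseline_cost


def passes(score: float, threshold: float = 0.5) -> bool:
    """A grader passes when its score meets or exceeds the threshold."""
    return score >= threshold
```

Because `1 - (router/baseline)` equals `(baseline - router) / baseline`, a score of 0.5 means the router cost at most half of the baseline. A router that is 25% cheaper still saves money but scores 0.25 and fails this grader, so a low pass rate here does not by itself mean the router is more expensive.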
The per_item_scores array shows each prompt's individual grader results, useful for identifying which prompts the router handles well or poorly.
Run python scripts/cross_validate.py to compare local and Foundry results side by side. Both pipelines should agree on quality direction, cost savings (±5%), and latency trends. Larger disagreements usually mean the local judge model is materially different from the Foundry grader model.
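If you want to spot-check that agreement by hand, the comparison can be sketched as below. This is not the logic of `cross_validate.py`: the function name is hypothetical, both inputs are assumed to be flat dicts using the metric names from this guide, and ±5% is read as an absolute difference of 0.05 in the savings ratio.

```python
def pipelines_agree(local: dict, foundry: dict, savings_tolerance: float = 0.05) -> dict:
    """Compare quality direction and cost savings between two sets of metrics."""

    def quality_direction(metrics: dict) -> int:
        # +1 if the router scored higher, -1 if lower, 0 if equal.
        diff = metrics["router_mean_overall"] - metrics["baseline_mean_overall"]
        return (diff > 0) - (diff < 0)

    savings_gap = abs(local["cost_savings_ratio"] - foundry["cost_savings_ratio"])
    return {
        "quality_direction": quality_direction(local) == quality_direction(foundry),
        "cost_savings": savings_gap <= savings_tolerance,
    }
```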