How To: Interpret Results

A plain-language guide to every output file, chart, and metric the evaluation produces — so you can answer the only question that really matters: "Should I use Model Router for my workload?"

First time here? Run the QUICKSTART demo first. The dashboard you generated there has exactly the same charts as a real evaluation — just filled with synthetic data — so you can use this guide alongside the demo output.

How to read a run in 60 seconds

Open `dashboard.html` and look at three things:

Cost comparison chart → Did Model Router save money? (Lower bar = cheaper.)
Latency comparison chart → Did Model Router add lag? (Look at p95/p99, not just mean.)
Pairwise win rates / absolute scores → Did quality hold up?

If cost dropped, latency stayed flat or improved, and the mean absolute quality score is within 0.3 points of the baseline, Model Router is a win for your workload. The rest of this page explains why each chart says what it does.

Glossary (terms that show up in the report)

Term	Plain English
Router	The Model Router endpoint — the system being evaluated
Baseline	The fixed model you compared against (e.g. GPT‑5)
p50 / p90 / p95 / p99	Latency percentiles. p95 is the latency that 95% of requests met or beat. The high percentiles describe the slowest requests your users see, not the average
Pairwise win rate	The percentage of prompts for which the judge preferred the router's answer over the baseline's
Absolute score	A 1–5 grade per response on accuracy/completeness/clarity/helpfulness, given by the judge model
Tie	The judge couldn't pick a clear winner — either both responses were similar quality, or position bias was detected and the result was discarded
Position bias	LLM judges tend to prefer whichever response they see first. We run each comparison twice with swapped order to detect and neutralise this

Output files

After an evaluation run, the output directory contains:

File	What it is
`dashboard.html`	Start here. Self-contained HTML with all charts and metrics — open in any browser, no server needed
`report.md`	Markdown summary — good for pasting into docs, PRs, or Slack
`detailed_results.csv`	Per-prompt data: latency, tokens, model, status — open in Excel for further analysis
`results.json`	Machine-readable metrics for programmatic consumption (CI gates, dashboards)
`raw_results.jsonl`	Full API responses including response text — useful when debugging quality issues
`chart_*.png`	Individual chart images — paste into slides

Charts

Cost comparison

Bar chart comparing total estimated cost (USD) for Model Router vs baseline. The router cost includes the markup on input tokens plus the underlying model's per-token pricing. Lower is better.
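
To make the cost bars concrete, here is a minimal sketch of how an estimate like this could be assembled from token counts. The function name `estimate_cost_usd`, the prices, and the idea that the router markup is a flat per-million-token fee on input are all illustrative assumptions, not the tool's actual code or pricing; use the pricing configured in your own run.

```python
def estimate_cost_usd(prompt_tokens, completion_tokens,
                      input_price_per_1m, output_price_per_1m,
                      input_markup_per_1m=0.0):
    """Estimate cost in USD from token counts and per-million-token prices.

    input_markup_per_1m models a router surcharge on input tokens
    (assumed here to be a flat per-million-token fee).
    """
    input_rate = input_price_per_1m + input_markup_per_1m
    input_cost = prompt_tokens * input_rate / 1_000_000
    output_cost = completion_tokens * output_price_per_1m / 1_000_000
    return input_cost + output_cost


# Hypothetical prices, for illustration only.
baseline_cost = estimate_cost_usd(
    200_000, 50_000, input_price_per_1m=1.25, output_price_per_1m=10.0
)
router_cost = estimate_cost_usd(
    200_000, 50_000, input_price_per_1m=0.25, output_price_per_1m=2.0,
    input_markup_per_1m=0.14,
)
print(f"baseline ${baseline_cost:.3f} vs router ${router_cost:.3f}")
```

With these made-up numbers the router bar would be lower even after the markup, because the cheaper underlying model outweighs the surcharge.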

Latency comparison

Bar chart showing mean, p50, p90, p95, and p99 latency for both endpoints. The mean can hide a long tail, so always check p95/p99 to understand the slowest requests your users see, not just the average.
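
The following sketch shows why the mean and the tail percentiles can tell different stories. It uses a simple nearest-rank percentile on synthetic latencies; the evaluation tool may use a different percentile method (for example, interpolation), so treat the helper `percentile` as an illustration rather than the tool's implementation.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic latencies in ms: 18 fast requests and 2 slow outliers.
latencies_ms = [400] * 18 + [4000, 6000]

summary = {
    "mean_ms": mean(latencies_ms),
    "p50_ms": percentile(latencies_ms, 50),
    "p95_ms": percentile(latencies_ms, 95),
    "p99_ms": percentile(latencies_ms, 99),
}
print(summary)
```

Here the median is 400 ms and the mean is 860 ms, yet p95 is 4000 ms: one request in twenty is ten times slower than the typical one.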

Latency distribution

Histogram showing the spread of per-prompt latencies. A tighter distribution means more predictable response times, which often matters more in production than raw average speed.

Per-category latency

Grouped bar chart breaking down mean latency by prompt category (if your dataset has them). Helps identify which types of prompts the router handles faster or slower than the baseline.
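
If you want to reproduce this breakdown yourself from per-prompt data, a sketch along these lines could work. The column names `category`, `endpoint`, and `latency_ms` are assumptions made for the example; check the header row of your own `detailed_results.csv` before adapting it.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Stand-in for detailed_results.csv; the column names are assumptions.
sample_csv = """category,endpoint,latency_ms
coding,router,900
coding,baseline,1200
coding,router,1100
coding,baseline,1400
chat,router,500
chat,baseline,450
"""


def mean_latency_by_category(csv_text):
    """Return {(category, endpoint): mean latency in ms}."""
    buckets = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["category"], row["endpoint"])
        buckets[key].append(float(row["latency_ms"]))
    return {key: mean(vals) for key, vals in buckets.items()}


per_category = mean_latency_by_category(sample_csv)
for (category, endpoint), value in sorted(per_category.items()):
    print(f"{category:8s} {endpoint:9s} {value:7.1f} ms")
```

In this synthetic sample the router is faster on coding prompts but slightly slower on chat prompts, which is the kind of split the chart is meant to surface.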

Token usage breakdown

Stacked bar chart showing prompt tokens vs completion tokens for each endpoint. If the router and baseline produce very different completion lengths, that affects both cost and latency — this chart shows it directly.

Model distribution

Pie chart showing which underlying models the router picked, and how often. This is the chart that explains how Model Router achieved its cost/latency numbers — e.g. routing simple prompts to cheaper models and complex ones to more capable models.

Pairwise win rates (quality)

Bar chart showing the percentage of prompts where Model Router won, lost, or tied against the baseline in a head-to-head judge comparison. Uses dual-ordering to cancel position bias (see Glossary).
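
As a rough illustration of dual ordering, the sketch below combines two judge passes per prompt (one with the router's answer shown first, one with the baseline's shown first) into a single outcome. It assumes each verdict has already been mapped back to `"router"`, `"baseline"`, or `"tie"`, and that any disagreement between the two passes is counted as a tie; the function names are hypothetical and the tool's exact resolution rule may differ.

```python
from collections import Counter


def resolve_pair(verdict_router_first, verdict_baseline_first):
    """Combine two judge passes into one outcome.

    If the passes disagree, the preference depended on presentation
    order, so the comparison is counted as a tie.
    """
    if verdict_router_first == verdict_baseline_first:
        return verdict_router_first
    return "tie"


def win_rates(pairs):
    """Return win/loss/tie percentages over all prompt pairs."""
    outcomes = Counter(resolve_pair(a, b) for a, b in pairs)
    total = len(pairs)
    return {
        "router_win_rate": 100 * outcomes["router"] / total,
        "baseline_win_rate": 100 * outcomes["baseline"] / total,
        "tie_rate": 100 * outcomes["tie"] / total,
    }


judge_passes = [
    ("router", "router"),
    ("router", "baseline"),  # verdict flips with order, so it becomes a tie
    ("baseline", "baseline"),
    ("tie", "tie"),
    ("router", "router"),
]
rates = win_rates(judge_passes)
print(rates)
```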

Absolute score comparison

Bar chart comparing mean scores (1–5) across four dimensions: Accuracy, Completeness, Clarity, and Helpfulness. Useful when the pairwise rate is close — absolute scores reveal whether "tie" means "both great" or "both mediocre".
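
A minimal sketch of how per-dimension means and an overall mean could be derived from per-response judge scores is shown here. The dictionary layout and the choice to average the four dimension means equally are assumptions for illustration; the scores are synthetic.

```python
from statistics import mean

DIMENSIONS = ("accuracy", "completeness", "clarity", "helpfulness")


def mean_scores(per_response_scores):
    """Average each dimension across responses, then average the dimensions."""
    result = {d: mean(s[d] for s in per_response_scores) for d in DIMENSIONS}
    result["overall"] = mean(result[d] for d in DIMENSIONS)
    return result


# Synthetic 1-5 judge scores for two router responses.
router_scores = [
    {"accuracy": 4, "completeness": 4, "clarity": 5, "helpfulness": 4},
    {"accuracy": 5, "completeness": 3, "clarity": 4, "helpfulness": 4},
]
router_means = mean_scores(router_scores)
print(router_means)
```

Running the same function on the baseline's scores gives the second set of bars; a tie-heavy run where both overall means sit near 4.5 reads very differently from one where both sit near 2.5.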

Key metrics

Cost

Metric	Description
`estimated_cost_usd`	Total estimated cost based on token counts and configured pricing
`cost_per_prompt_avg_usd`	Average cost per prompt
`cost_savings_ratio`	`(baseline_cost - router_cost) / baseline_cost` — positive means router is cheaper
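
The savings formula in the table can be checked by hand with a few lines. The dollar amounts and the prompt count below are made up for the example.

```python
def cost_savings_ratio(baseline_cost, router_cost):
    """(baseline_cost - router_cost) / baseline_cost; positive means the router is cheaper."""
    if baseline_cost <= 0:
        raise ValueError("baseline_cost must be positive")
    return (baseline_cost - router_cost) / baseline_cost


# Illustrative totals for a 100-prompt run.
prompt_count = 100
cheaper = cost_savings_ratio(baseline_cost=0.75, router_cost=0.45)
pricier = cost_savings_ratio(baseline_cost=0.75, router_cost=0.90)
router_cost_per_prompt = 0.45 / prompt_count

print(f"{cheaper:.0%} savings, {pricier:.0%} when the router costs more")
```

A ratio of 0.40 means the router run cost 40% less than the baseline; a negative ratio means the router was more expensive.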

Latency

Metric	Description
`mean_ms`	Average response time (in milliseconds)
`median_ms`	50th percentile (p50)
`p90_ms` / `p95_ms` / `p99_ms`	Tail latency percentiles
`latency_diff_mean_ms`	`router_mean - baseline_mean` — negative means router is faster

Quality

Metric	Description
`router_win_rate`	% of prompts where router response was preferred
`baseline_win_rate`	% of prompts where baseline was preferred
`tie_rate`	% of prompts with no clear winner
`router_mean_overall`	Mean absolute score (1–5) across all dimensions
`baseline_mean_overall`	Same for baseline

Model distribution

Metric	Description
`model_distribution`	Dict mapping model name → count of prompts served
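
Because `model_distribution` holds raw counts, converting it to shares makes it easier to compare runs of different sizes. This sketch uses placeholder model names, not real routing output.

```python
def distribution_shares(model_distribution):
    """Convert {model: count} into {model: percent}, largest share first."""
    total = sum(model_distribution.values())
    if total == 0:
        return {}
    ranked = sorted(model_distribution.items(), key=lambda kv: kv[1], reverse=True)
    return {model: round(100 * count / total, 1) for model, count in ranked}


# Placeholder model names and counts for illustration.
model_distribution = {"model-large": 10, "model-small": 60, "model-medium": 30}
shares = distribution_shares(model_distribution)
print(shares)
```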

Reading the Markdown report

`report.md` follows this structure:

Header — evaluation name, timestamp, prompt count
Cost summary — total and per-prompt costs, savings ratio
Latency summary — percentile table, per-category breakdown
Model distribution — which models the router picked
Quality summary — win rates, absolute scores (if judge was enabled)
Configuration — endpoints, pricing, concurrency settings used (so the run is reproducible)

What good results look like

A "Model Router is working" result typically has:

Cost savings > 0% — the router found cheaper models for at least some prompts
Latency diff near 0 or negative — router isn't adding significant overhead
Win rate ≥ 40%, loss rate < 20% — quality is maintained or improved
Absolute scores within 0.3 of baseline — no significant quality degradation

If cost is way down but quality also dropped, the router is just picking cheap-and-bad models for your workload — that's not a win, and you may need a stricter routing policy or a different baseline.
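
The four criteria above can be turned into a simple automated check, for example as a CI gate. This is a sketch under stated assumptions: the function `assess_run` is hypothetical, "loss rate" is taken to mean `baseline_win_rate`, "near 0" latency is interpreted as at most 100 ms slower (an arbitrary tolerance you should tune), and "within 0.3" is treated as the router scoring no more than 0.3 below the baseline.

```python
def assess_run(cost_savings_ratio, latency_diff_mean_ms,
               router_win_rate, baseline_win_rate,
               router_mean_overall, baseline_mean_overall,
               latency_tolerance_ms=100.0):
    """Return (overall_pass, per-criterion results) for one evaluation run."""
    checks = {
        "cost": cost_savings_ratio > 0,
        "latency": latency_diff_mean_ms <= latency_tolerance_ms,
        "win_rate": router_win_rate >= 40 and baseline_win_rate < 20,
        "scores": baseline_mean_overall - router_mean_overall <= 0.3,
    }
    return all(checks.values()), checks


# Synthetic example: cheaper, slightly faster, quality holds.
good_ok, good_checks = assess_run(0.35, -40.0, 45.0, 15.0, 4.2, 4.3)

# Synthetic example: much cheaper, but quality collapsed.
bad_ok, bad_checks = assess_run(0.60, -200.0, 20.0, 50.0, 3.4, 4.3)

print(good_ok, good_checks)
print(bad_ok, bad_checks)
```

The second example passes the cost and latency checks but fails both quality checks, which is the cheap-and-bad pattern described above.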

Foundry cloud evaluation results

If you also ran `python scripts/run_foundry_eval.py`, the `results/foundry-eval/` directory contains:

File	What it is
`report.md`	Grader results table with mean scores and pass rates
`results.json`	Machine-readable: `grader_summary`, `per_item_scores`, `result_counts`
`foundry_input.jsonl`	The transformed data that was uploaded to Foundry

Foundry grader summary

The `grader_summary` in `results.json` contains per-grader statistics:

Field	Description
`mean`	Average score across all prompts
`pass_rate`	Percentage of prompts that passed the threshold
`count`	Number of prompts scored
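
To show how the three fields relate, this sketch recomputes them from a list of raw scores and a pass threshold. It assumes `pass_rate` is expressed as a percentage (0 to 100) and that a score equal to the threshold passes; the helper name and scores are illustrative.

```python
from statistics import mean


def summarize_grader(scores, pass_threshold):
    """Compute mean, pass_rate (percent), and count for one grader."""
    passed = sum(1 for score in scores if score >= pass_threshold)
    return {
        "mean": mean(scores),
        "pass_rate": 100 * passed / len(scores),
        "count": len(scores),
    }


# Synthetic 1-5 scores with a pass threshold of 3.
grader_stats = summarize_grader([5, 4, 2, 3, 1], pass_threshold=3)
print(grader_stats)
```

Note that a mean at or above the threshold does not imply a high pass rate: here the mean is exactly 3, yet two of five prompts fail.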

Foundry graders explained

Grader	Type	Pass threshold	What it measures
`quality_absolute_router`	`score_model`	≥ 3/5	Router response quality
`quality_absolute_baseline`	`score_model`	≥ 3/5	Baseline response quality
`quality_pairwise`	`score_model`	≥ 3/5	Head-to-head: is router ≥ baseline?
`mr_cost_comparison`	`python`	≥ 0.5	Cost savings ratio: `1 - (router/baseline)`
`mr_latency_comparison`	`python`	≥ 0.5	Latency improvement ratio

Per-item scores

The `per_item_scores` array shows each prompt's individual grader results, useful for identifying which prompts the router handles well or poorly.
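
One way to use this array is to pull out the prompts where the router scored below the pass threshold. The record layout below (an `item_id` plus a `scores` dict keyed by grader name) is an assumption made for the example; inspect your own `results.json` for the real field names before adapting it.

```python
def weakest_router_items(per_item_scores,
                         grader="quality_absolute_router",
                         threshold=3):
    """Return items scoring below the threshold, worst first."""
    failing = [
        item for item in per_item_scores
        if item["scores"][grader] < threshold
    ]
    return sorted(failing, key=lambda item: item["scores"][grader])


# Hypothetical records; the real schema may differ.
per_item_scores = [
    {"item_id": "p1", "scores": {"quality_absolute_router": 5}},
    {"item_id": "p2", "scores": {"quality_absolute_router": 2}},
    {"item_id": "p3", "scores": {"quality_absolute_router": 1}},
]
weak_ids = [item["item_id"] for item in weakest_router_items(per_item_scores)]
print(weak_ids)
```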

Cross-validation

Run `python scripts/cross_validate.py` to compare local and Foundry results side by side. Both pipelines should agree on quality direction, cost savings (±5%), and latency trends. Larger disagreements usually mean the local judge model is materially different from the Foundry grader model.
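
The agreement criteria can be sketched as a small comparison function. This is not the logic of `cross_validate.py`; the function name, the input keys (including the hypothetical `quality_delta`, meaning router score minus baseline score), and the reading of "±5%" as an absolute difference of 0.05 in the savings ratio are all assumptions.

```python
def direction(value):
    """Return -1, 0, or 1 depending on the sign of value."""
    return (value > 0) - (value < 0)


def pipelines_agree(local, foundry, cost_tolerance=0.05):
    """Compare two result summaries on the three agreement criteria."""
    return {
        "quality_direction": direction(local["quality_delta"])
        == direction(foundry["quality_delta"]),
        "cost_savings": abs(
            local["cost_savings_ratio"] - foundry["cost_savings_ratio"]
        ) <= cost_tolerance,
        "latency_trend": direction(local["latency_diff_mean_ms"])
        == direction(foundry["latency_diff_mean_ms"]),
    }


# Synthetic summaries for illustration.
local = {"quality_delta": -0.1, "cost_savings_ratio": 0.40, "latency_diff_mean_ms": -35.0}
foundry = {"quality_delta": -0.2, "cost_savings_ratio": 0.43, "latency_diff_mean_ms": -50.0}
foundry_off = {"quality_delta": 0.3, "cost_savings_ratio": 0.50, "latency_diff_mean_ms": -50.0}

agreement = pipelines_agree(local, foundry)
disagreement = pipelines_agree(local, foundry_off)
print(agreement)
print(disagreement)
```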

