How To: Run a Live Evaluation

This guide walks you through running a real evaluation of Microsoft Foundry Model Router against a baseline model using your own Azure endpoints. By the end you'll have an HTML dashboard showing how Model Router compares to your chosen baseline on cost, latency, and quality — for prompts you care about.

Brand new to this repo? Read the QUICKSTART first — it explains what Model Router is, what this tool measures, and how to try the offline demo before spending any API calls.

What is a "live evaluation"?

A live evaluation calls real Azure endpoints with real prompts and records what happens. Specifically:

For every prompt, the tool calls Model Router (the system being measured) and a baseline model of your choice (e.g. GPT‑5 directly).
It records the response text, latency, token counts, and cost for each.
A separate judge model (LLM-as-a-judge) reads both responses and scores them on accuracy, completeness, clarity, and helpfulness.
Charts, a Markdown report, a CSV, and a JSON summary are written to disk.
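
As a rough mental model of the steps above, here is a minimal sketch of what one prompt's record could look like. It is not the tool's actual code: CallRecord, timed_call, evaluate_prompt, the flat per-1K-token price, and the stand-in model and judge functions are all assumptions made for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# The four judge criteria named in this guide.
JUDGE_CRITERIA = ("accuracy", "completeness", "clarity", "helpfulness")

# Hypothetical shape of a model call: prompt in, (text, prompt_tokens, completion_tokens) out.
ModelFn = Callable[[str], Tuple[str, int, int]]


@dataclass
class CallRecord:
    response_text: str
    latency_s: float
    prompt_tokens: int
    completion_tokens: int
    cost_usd: float


def timed_call(model: ModelFn, prompt: str, usd_per_1k_tokens: float) -> CallRecord:
    """Call one model and record text, latency, token counts, and cost."""
    start = time.perf_counter()
    text, prompt_tokens, completion_tokens = model(prompt)
    latency_s = time.perf_counter() - start
    # Assumed flat price; real pricing usually differs for input and output tokens.
    cost_usd = (prompt_tokens + completion_tokens) / 1000 * usd_per_1k_tokens
    return CallRecord(text, latency_s, prompt_tokens, completion_tokens, cost_usd)


def evaluate_prompt(prompt, router, baseline, judge, usd_per_1k_tokens=0.01):
    """Run one prompt through both systems, then ask the judge to score both answers."""
    router_rec = timed_call(router, prompt, usd_per_1k_tokens)
    baseline_rec = timed_call(baseline, prompt, usd_per_1k_tokens)
    scores = judge(prompt, router_rec.response_text, baseline_rec.response_text)
    return {"prompt": prompt, "router": router_rec, "baseline": baseline_rec, "scores": scores}


def fake_model(prompt: str) -> Tuple[str, int, int]:
    # Stand-in for a real endpoint call: echoes the prompt and invents token counts.
    return f"echo: {prompt}", len(prompt.split()), 3


def fake_judge(prompt, router_text, baseline_text) -> Dict[str, Dict[str, int]]:
    # Stand-in for the judge model: gives every criterion the same score.
    return {side: {c: 5 for c in JUDGE_CRITERIA} for side in ("router", "baseline")}


result = evaluate_prompt("What is 2 + 2?", fake_model, fake_model, fake_judge)
print(result["router"].prompt_tokens, sorted(result["scores"]["router"]))
```

In a real run the two stand-in functions would be replaced by calls to your Model Router and baseline deployments, and the judge by a call to the judge model.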

A live evaluation consumes billable Azure API tokens: typically a few cents for a quick test and a few dollars for a full benchmark. Use --dry-run first to confirm everything is wired up before spending money.

Prerequisites

Python 3.9+ — check with python --version
A Microsoft Foundry Model Router deployment in your Azure subscription
An Azure OpenAI baseline model deployment (e.g. GPT‑5, GPT‑4o) — this is what the router gets compared against
API keys for both endpoints (you'll paste these into a .env file)
(Recommended) a Python virtual environment — see QUICKSTART
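
If you prefer to check the first and last prerequisites from Python itself, a small sketch like the following would do it. The helper names meets_minimum and in_virtualenv are made up for this example; only the 3.9 minimum comes from the list above.

```python
import sys

MIN_PYTHON = (3, 9)  # minimum version listed in the prerequisites


def meets_minimum(version_info=sys.version_info, minimum=MIN_PYTHON) -> bool:
    """Compare only (major, minor) against the required minimum."""
    return tuple(version_info[:2]) >= tuple(minimum)


def in_virtualenv() -> bool:
    # Inside a venv, sys.prefix points at the venv while sys.base_prefix
    # still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("Python OK:", meets_minimum(), "| virtual environment active:", in_virtualenv())
```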

Step 1: Install

git clone https://github.com/microsoft-foundry/Model-Router-Auto-Evaluation.git
cd Model-Router-Auto-Evaluation
pip install -e .

Or use the setup script (creates a venv and installs everything):

# Windows
.\scripts\setup.ps1

# Linux / macOS
bash scripts/setup.sh

Step 2: Configure credentials

cp .env.example .env       # macOS / Linux
copy .env.example .env     # Windows

Open .env and fill in your real values:

AZURE_MODEL_ROUTER_ENDPOINT=https://your-resource.services.ai.azure.com/models
AZURE_MODEL_ROUTER_KEY=your-model-router-key
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com
AZURE_OPENAI_KEY=your-azure-openai-key

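Security note: .env is listed in .gitignore, so Git skips it by default. It can still end up in a commit if you force-add it (git add -f) or if it was already tracked before the ignore rule existed, so check git status before pushing. The endpoint URLs in .env.example are placeholders; they don't connect to anything until you replace them.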
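
Before moving on, it can help to confirm that all four variables are present and no longer hold the placeholder values. The sketch below is one way to do that with the standard library only. It is not how the tool itself reads .env; the parser is deliberately simplified, and treating any value containing "your-" as a placeholder is an assumption based on the example values above.

```python
REQUIRED_VARS = (
    "AZURE_MODEL_ROUTER_ENDPOINT",
    "AZURE_MODEL_ROUTER_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_KEY",
)


def parse_env_text(text: str) -> dict:
    """Parse simple NAME=value lines; skips blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        values[name.strip()] = value.strip().strip('"').strip("'")
    return values


def unresolved_vars(values: dict) -> list:
    """Return required names that are missing, empty, or still look like placeholders."""
    return [
        name for name in REQUIRED_VARS
        if not values.get(name) or "your-" in values[name]
    ]


sample = """\
AZURE_MODEL_ROUTER_ENDPOINT=https://contoso.services.ai.azure.com/models
AZURE_MODEL_ROUTER_KEY=your-model-router-key
# AZURE_OPENAI_KEY is missing on purpose
AZURE_OPENAI_ENDPOINT=https://contoso.openai.azure.com
"""
# Only variable names are printed, never the secret values.
print(unresolved_vars(parse_env_text(sample)))
```

To check your real file, pass the contents of .env (for example, pathlib.Path(".env").read_text()) to parse_env_text instead of the sample string.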

Where do these values come from?

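AZURE_MODEL_ROUTER_ENDPOINT and AZURE_MODEL_ROUTER_KEY: Azure Portal → your Foundry resource → Keys and Endpoint
AZURE_OPENAI_ENDPOINT and AZURE_OPENAI_KEY: Azure Portal → your Azure OpenAI resource → Keys and Endpoint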

Step 3: Configure the evaluation (optional)

The default config (configs/default.yaml) works out of the box. The settings most people change first:

| Setting | Where in the YAML | Default | What it controls |
| --- | --- | --- | --- |
| Baseline model | `endpoints.baseline.deployment_name` | `gpt-5` | Which model the router is compared against |
| Number of prompts | `evaluation.sample_size` | `null` (all) | How many dataset prompts to use |
| Judge enabled | `judge.enabled` | `true` | Whether to run quality scoring (costs extra API calls) |
| Concurrency | `concurrency.max_parallel_requests` | `5` | How many prompts run in parallel |

To change the baseline model, edit configs/default.yaml:

endpoints:
  baseline:
    deployment_name: "gpt-4o"   # any model deployed in your Azure OpenAI resource
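
The dotted names in the "Where in the YAML" column are just paths through nested keys. The sketch below shows that mapping on a plain Python dictionary that mirrors the documented defaults. The helpers get_by_path and with_override are hypothetical and are not part of the tool; in practice you would simply edit the YAML file as shown above.

```python
import copy


def get_by_path(config: dict, dotted_path: str):
    """Follow a dotted path such as 'judge.enabled' through nested dictionaries."""
    node = config
    for key in dotted_path.split("."):
        node = node[key]
    return node


def with_override(config: dict, dotted_path: str, value) -> dict:
    """Return a copy of config with one setting changed; the input is left untouched."""
    updated = copy.deepcopy(config)
    *parents, leaf = dotted_path.split(".")
    node = updated
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value
    return updated


# Mirrors the defaults listed in the settings table.
defaults = {
    "endpoints": {"baseline": {"deployment_name": "gpt-5"}},
    "evaluation": {"sample_size": None},
    "judge": {"enabled": True},
    "concurrency": {"max_parallel_requests": 5},
}

custom = with_override(defaults, "endpoints.baseline.deployment_name", "gpt-4o")
print(get_by_path(custom, "endpoints.baseline.deployment_name"))
```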

Tip: Start with configs/quick_test.yaml (10 prompts) to verify everything works end-to-end before running a bigger benchmark.

Step 4: Validate (dry run, no API calls)

Before making real API calls, validate your config and dataset:

python scripts/run_eval.py --dry-run

This checks:

All ${ENV_VAR} references in the YAML resolve from .env
The dataset file loads and has valid id + prompt fields
Config values are within accepted ranges
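
To make the first check concrete, here is a minimal sketch of how ${ENV_VAR} references in a config file could be matched against the environment. This is an illustration of the idea, not the tool's implementation: the regular expression, the unresolved_refs helper, and the YAML key names in the sample (endpoint, api_key) are all assumptions.

```python
import re

# Matches ${NAME} where NAME is a typical environment variable identifier.
ENV_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def unresolved_refs(yaml_text: str, env: dict) -> list:
    """Return the sorted names referenced as ${NAME} that are missing or empty in env."""
    referenced = set(ENV_REF.findall(yaml_text))
    return sorted(name for name in referenced if not env.get(name))


sample_yaml = """\
endpoints:
  baseline:
    endpoint: ${AZURE_OPENAI_ENDPOINT}
    api_key: ${AZURE_OPENAI_KEY}
"""
sample_env = {"AZURE_OPENAI_ENDPOINT": "https://contoso.openai.azure.com"}

print(unresolved_refs(sample_yaml, sample_env))
```

Against a real setup you would pass the text of your config file and a mapping such as os.environ (after the .env values have been loaded into it).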

Fix any errors here before moving on — they only get more expensive once real calls start.

Step 5: Run the evaluation

# Full eval with default config
python scripts/run_eval.py

# With a custom dataset
python scripts/run_eval.py --dataset my_prompts.jsonl

# Subset of prompts (great for first runs)
python scripts/run_eval.py --sample-size 50

# Custom output directory (so runs don't overwrite each other)
python scripts/run_eval.py --output-dir results/my-eval

You'll see two progress bars — first the eval phase (router + baseline), then the judge phase:

Evaluating: 100%|████████████████| 10/10 [00:42<00:00, 4.2s/prompt]
Judging:    100%|████████████████| 10/10 [01:15<00:00, 7.5s/prompt]
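
If you use the --dataset option with your own JSONL file, each line should be one JSON object, and Step 4 says each record needs an id and a prompt. The sketch below builds a tiny dataset in memory and checks those two fields. The validator is a hypothetical helper for illustration, and the rule that both fields must be non-empty is an assumption about what "valid" means here; the dry run remains the authoritative check.

```python
import json

REQUIRED_FIELDS = ("id", "prompt")


def validate_jsonl_lines(lines) -> list:
    """Return a list of human-readable problems; an empty list means the lines look fine."""
    errors = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {number}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(row, dict):
            errors.append(f"line {number}: expected a JSON object")
            continue
        for field in REQUIRED_FIELDS:
            value = row.get(field)
            if value is None or not str(value).strip():
                errors.append(f"line {number}: missing or empty '{field}'")
    return errors


rows = [
    {"id": "q1", "prompt": "Summarize the causes of the 2008 financial crisis."},
    {"id": "q2", "prompt": "Write a SQL query that counts orders per customer."},
]
# JSONL is simply one json.dumps(...) result per line.
jsonl_text = "\n".join(json.dumps(row) for row in rows)

print(validate_jsonl_lines(jsonl_text.splitlines()))
```

Writing jsonl_text to my_prompts.jsonl (for example with pathlib.Path("my_prompts.jsonl").write_text(jsonl_text + "\n", encoding="utf-8")) gives you a file you can pass to --dataset.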

Step 6: View results

The path is printed at the end of the run, e.g. results/default/dashboard.html. Open that file in any browser; no server is needed because the HTML is self-contained.
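
If you would rather open the dashboard from a script than by double-clicking it, a sketch along these lines should work with the standard library. The helper names are made up for this example, and it assumes the dashboard is named dashboard.html inside the output directory, as in the example path above.

```python
import webbrowser
from pathlib import Path


def dashboard_uri(output_dir: str) -> str:
    """Build a file:// URI for the dashboard inside an output directory."""
    # resolve() makes the path absolute, which as_uri() requires.
    return (Path(output_dir) / "dashboard.html").resolve().as_uri()


def open_dashboard(output_dir: str) -> bool:
    """Ask the default browser to open the dashboard; returns True if a browser was launched."""
    return webbrowser.open(dashboard_uri(output_dir))


print(dashboard_uri("results/default"))
# Call open_dashboard("results/default") after a run has finished.
```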

👉 See how-to-interpret-results.md for a guide to every chart and metric in plain language.

Common options

# Resume an interrupted run (Ctrl+C, crash, end of day)
python scripts/run_eval.py --resume --output-dir results/my-eval

# Use a different config preset
python scripts/run_eval.py --config configs/large_scale.yaml

# Disable the judge (cost + latency only — faster, cheaper)
# Edit configs/default.yaml: judge.enabled: false
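
When you launch runs from another script, it can be convenient to assemble the command line programmatically and give each run its own output directory. The sketch below uses only the flags shown in this guide. The helpers build_command and timestamped_output_dir are hypothetical, and the assumption that these flags can be freely combined is untested here; confirm with --dry-run first.

```python
import sys
from datetime import datetime


def timestamped_output_dir(prefix: str = "results/eval", now=None) -> str:
    """Return a directory name such as results/eval-20250102-030405."""
    now = now or datetime.now()
    return f"{prefix}-{now:%Y%m%d-%H%M%S}"


def build_command(config=None, dataset=None, sample_size=None,
                  output_dir=None, dry_run=False, resume=False) -> list:
    """Assemble an argument list for scripts/run_eval.py from the documented flags."""
    cmd = [sys.executable, "scripts/run_eval.py"]
    if config:
        cmd += ["--config", str(config)]
    if dataset:
        cmd += ["--dataset", str(dataset)]
    if sample_size is not None:
        cmd += ["--sample-size", str(sample_size)]
    if output_dir:
        cmd += ["--output-dir", str(output_dir)]
    if dry_run:
        cmd.append("--dry-run")
    if resume:
        cmd.append("--resume")
    return cmd


cmd = build_command(config="configs/quick_test.yaml",
                    output_dir=timestamped_output_dir(),
                    dry_run=True)
print(" ".join(cmd[1:]))
# To actually run it from the repository root:
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.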

Next steps

Submit results to Foundry for cloud grading — managed graders with portal visibility
Bring your own dataset — JSONL, CSV, or SQL
Scale to 1,000 prompts — checkpointing, multi-session runs
Compare two runs — A/B test configs or models
Cross-validate local vs Foundry results: python scripts/cross_validate.py
