The Foundry Evaluations service integration is a separate post-processing layer — the core eval flow is untouched. See how-to-foundry-eval-sdk.md for details and faq.md for Foundry troubleshooting.
Foundry Evaluation natively supports quality grading via score_model graders — LLM-as-a-judge evaluators that score text responses.
It does not natively support cost or latency benchmarking as built-in evaluation dimensions. To cover them, we use python graders that score pre-computed metrics from the local eval run. This gives you a unified quality + cost + latency view in a single Foundry eval, while keeping the door open for native support when it arrives.
This repo extends Foundry to cover all three dimensions (quality, cost, latency) in a single eval run using a pattern built on supported SDK primitives.
We use Foundry's python grader type to run custom cost and latency scoring code inside the Foundry eval pipeline. The key insight: cost and latency data is pre-computed during the local eval run and embedded into the JSONL data that gets uploaded to Foundry. The Python graders then perform simple arithmetic on that pre-computed data.
Local eval run (scripts/run_eval.py)
├── Calls Model Router + Baseline for each prompt
├── Records cost (tokens × pricing) and latency (wall clock) per request
└── Writes raw_results.jsonl + results.json
Transformer (src/foundry/transformer.py)
├── Reads raw_results.jsonl
├── Enriches each row with: router_cost_usd, baseline_cost_usd,
│ router_latency_ms, baseline_latency_ms, router_response, baseline_response
└── Writes foundry_input.jsonl
Foundry eval submission (src/foundry/runner.py)
├── Uploads foundry_input.jsonl as a Foundry file
├── Creates eval with 5 graders:
│ ├── quality_absolute_router (score_model) — rates router response 1-5
│ ├── quality_absolute_baseline (score_model) — rates baseline response 1-5
│ ├── quality_pairwise (score_model) — head-to-head comparison
│ ├── mr_cost_comparison (python) — scores cost savings
│ └── mr_latency_comparison (python) — scores latency improvement
└── All 5 graders run in a single eval run, producing unified results
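The transformer step in the flow above can be sketched roughly as follows. This is a minimal illustration, not the contents of src/foundry/transformer.py: the nested `router` / `baseline` layout of a raw row, the `prompt` key, and the `{"item": ...}` wrapper around each output line are all assumptions. Only the six enriched field names come from this document.

```python
import json

# Hypothetical raw-result layout: the real schema written by scripts/run_eval.py
# is not shown here, so the input keys below are assumptions.
def enrich_row(raw: dict) -> dict:
    """Flatten one raw result into the fields the graders read."""
    router, baseline = raw["router"], raw["baseline"]
    return {
        "prompt": raw["prompt"],
        "router_response": router["response"],
        "baseline_response": baseline["response"],
        "router_cost_usd": router["cost_usd"],
        "baseline_cost_usd": baseline["cost_usd"],
        "router_latency_ms": router["latency_ms"],
        "baseline_latency_ms": baseline["latency_ms"],
    }

def transform(raw_lines):
    """Convert raw JSONL lines into upload-ready JSONL lines.

    Each output line wraps the row in an "item" key (an assumption about
    the expected upload format).
    """
    out = []
    for line in raw_lines:
        if line.strip():  # skip blank lines
            out.append(json.dumps({"item": enrich_row(json.loads(line))}))
    return out

# Example with a single made-up row.
raw_line = json.dumps({
    "prompt": "Summarize this paragraph.",
    "router": {"response": "Short summary.", "cost_usd": 0.002, "latency_ms": 800},
    "baseline": {"response": "Longer summary.", "cost_usd": 0.004, "latency_ms": 1000},
})
foundry_lines = transform([raw_line])
```

Writing `foundry_lines` to foundry_input.jsonl, one entry per line, would give the file that the submission step uploads.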
Both custom graders return a float in [0.0, 1.0]:
| Score | Meaning |
|---|---|
| 1.0 | Router is free / instant compared to baseline |
| 0.75 | Router saves ~50% vs baseline |
| 0.5 | Parity — router and baseline are equivalent |
| 0.25 | Router costs ~50% more / is ~50% slower |
| 0.0 | Router costs 2× as much / takes 2× as long (or worse) |
Formula (cost): score = 0.5 + ((baseline_cost - router_cost) / baseline_cost) × 0.5
Formula (latency): score = 0.5 + ((baseline_latency - router_latency) / baseline_latency) × 0.5
The default pass threshold is ≥ 0.5 (router is at least as good as baseline).
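As a worked check of the formula and the score table, here is a small sketch. The function name `relative_score` is hypothetical. The clamp to [0.0, 1.0] and the parity fallback for a non-positive baseline are assumptions added so the result stays in the stated range; neither is spelled out in the formula itself.

```python
PASS_THRESHOLD = 0.5  # default threshold stated above

def relative_score(baseline: float, router: float) -> float:
    """score = 0.5 + ((baseline - router) / baseline) * 0.5, clamped to [0, 1]."""
    if baseline <= 0:
        # Assumption: with no usable baseline, report parity instead of dividing by zero.
        return 0.5
    raw = 0.5 + ((baseline - router) / baseline) * 0.5
    return max(0.0, min(1.0, raw))

# Reproduce the score table with a baseline of 1.0 (any unit: USD or ms).
table = {router: relative_score(1.0, router) for router in (0.0, 0.5, 1.0, 1.5, 2.0, 3.0)}
for router, score in table.items():
    verdict = "pass" if score >= PASS_THRESHOLD else "fail"
    print(f"router={router:.1f} baseline=1.0 score={score:.2f} {verdict}")
```

The same function covers both dimensions: pass costs in USD for the cost score, or latencies in milliseconds for the latency score.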
- Uses official SDK primitives: the `python` grader type is a supported Foundry feature, not a hack. Cost and latency scores flow through the same eval pipeline as quality, producing unified results in the Foundry portal.
- Single eval run, three dimensions: quality (`score_model`), cost (`python`), and latency (`python`) graders all execute in one `evals.create()` / `runs.create()` call. Users see one results page, not three separate workflows.
- Pre-computed data, not live measurement: the transformer bakes cost and latency data into the JSONL before upload. The Python graders just do arithmetic. This is deterministic and reproducible.
- Portal integration: results appear in the Microsoft Foundry portal alongside quality scores, with pass/fail indicators, mean scores, and per-item drill-down.
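To make the `python` grader idea concrete, a cost grader definition might look roughly like the sketch below. It is not the code in src/foundry/custom_evaluators.py. The dictionary field names (`type`, `name`, `source`, `pass_threshold`) and the `grade(sample, item)` entry point are assumptions based on the OpenAI-style Evals grader schema, and should be checked against the SDK version in use. The grader source deliberately uses no imports.

```python
# Grader source is shipped as a string and executed by the eval service.
# Assumption: the service calls grade(sample, item), where item is one JSONL row.
COST_GRADER_SOURCE = '''
def grade(sample, item):
    baseline = float(item.get("baseline_cost_usd", 0.0))
    router = float(item.get("router_cost_usd", 0.0))
    if baseline <= 0:
        return 0.5  # assumption: no usable baseline means parity
    score = 0.5 + ((baseline - router) / baseline) * 0.5
    return max(0.0, min(1.0, score))
'''

# Assumed shape of one entry in the eval's grader list.
cost_grader = {
    "type": "python",
    "name": "mr_cost_comparison",
    "source": COST_GRADER_SOURCE,
    "pass_threshold": 0.5,
}

# Local smoke test: run the same source the service would receive.
namespace = {}
exec(cost_grader["source"], namespace)
local_score = namespace["grade"](
    {}, {"baseline_cost_usd": 0.004, "router_cost_usd": 0.002}
)
print(local_score)
```

Running the source locally before submission is a cheap way to catch syntax errors that would otherwise only surface inside the eval run.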
| Aspect | Reality |
|---|---|
| Cost/latency are not "evaluated by Foundry" | Foundry just runs your Python code. The actual cost/latency data comes from the local eval run — Foundry doesn't measure anything. |
| Pass/fail thresholds are a convention | The 0.5 threshold ("router must be at least as good as baseline to pass") is a convention we chose, not a Foundry standard. Teams should tune thresholds to their requirements. |
| No Foundry-native cost/latency benchmarking exists | This fills a gap. If Foundry adds native support later, this pattern becomes tech debt. |
| Scores are relative, not absolute | A cost score of 0.95 means "router is much cheaper than this specific baseline," not "router is cheap in absolute terms." |
| Data freshness | Cost and latency data reflect the local eval run conditions (network, load, pricing at the time). Re-running the local eval may produce different numbers. |
When communicating this approach to customers or stakeholders:
"Foundry Evaluations supports quality grading natively via `score_model` graders. For cost and latency, which Foundry doesn't evaluate out of the box, we use `python` graders that score pre-computed metrics from the local eval run. This gives you a unified quality + cost + latency view in a single Foundry eval, while keeping the door open for native support when it arrives."
This framing is honest, positions the approach as a pattern (not a product feature), and doesn't overstate what Foundry is doing under the hood.
| File | Role |
|---|---|
| `src/foundry/custom_evaluators.py` | Python grader source code for cost and latency |
| `src/foundry/graders.py` | Builds `score_model` criteria for quality graders |
| `src/foundry/transformer.py` | Enriches local results into Foundry JSONL format |
| `src/foundry/runner.py` | Orchestrates eval creation, file upload, run execution |
| `configs/foundry.yaml` | Grader configuration and thresholds |
| `configs/grader_prompts/` | Quality grader prompt templates |
- If Foundry adds native cost/latency evaluation, migrate from `python` graders to the built-in equivalent and remove the custom evaluator code.
- The `python` grader sandbox has limited library access, so the grader code intentionally uses only built-in Python (no imports).
- The score formula could be made configurable via `configs/foundry.yaml` if teams want different sensitivity curves.
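One way the configurable formula could look is sketched below. The `sensitivity` key and the `configurable_score` function are hypothetical; nothing like them exists in configs/foundry.yaml today. A plain dictionary stands in for the parsed YAML so the example needs no third-party library. A sensitivity of 0.5 reproduces the current formula, and larger values reward or penalize the same relative difference more steeply.

```python
DEFAULT_SENSITIVITY = 0.5  # matches the current fixed formula

def configurable_score(baseline: float, router: float, config: dict) -> float:
    """Relative score with a tunable slope, clamped to [0.0, 1.0].

    config is a stand-in for values parsed from configs/foundry.yaml;
    the "sensitivity" key is a hypothetical name.
    """
    sensitivity = float(config.get("sensitivity", DEFAULT_SENSITIVITY))
    if baseline <= 0:
        return 0.5  # assumption: no usable baseline means parity
    raw = 0.5 + ((baseline - router) / baseline) * sensitivity
    return max(0.0, min(1.0, raw))

default_score = configurable_score(1.0, 0.5, {})                  # current behaviour
steep_score = configurable_score(1.0, 0.5, {"sensitivity": 1.0})  # steeper curve
print(default_score, steep_score)
```

Parity still maps to 0.5 for any sensitivity, so the meaning of the default pass threshold would not change.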