Long runs get interrupted — by Ctrl+C, by network blips, by the laptop closing at the end of the day. This guide explains the checkpoint system that makes those interruptions safe, and how to tune the tool for larger benchmarks (1,000+ prompts).
Why this matters: A 1,000-prompt run takes hours and costs real money. Without checkpointing, an interruption near the end means re-paying for everything. With checkpointing, you pick up exactly where you left off.
Every result is flushed to disk immediately after the API call returns. If the process is interrupted (Ctrl+C, crash, machine reboot, network failure), all completed work is saved.
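As an illustration of what "flushed to disk immediately" can look like, here is a minimal sketch of an append-only JSONL writer. The function name `append_record` and the record fields are assumptions made for the example, not the tool's actual code.

```python
import json
import os
from pathlib import Path


def append_record(path: Path, record: dict) -> None:
    """Append one JSON line and push it to disk before returning."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
        f.flush()             # move Python's buffer to the operating system
        os.fsync(f.fileno())  # ask the OS to write the data to the device


# Hypothetical usage: field names are illustrative only.
# append_record(Path("results/my-run/checkpoint_eval.jsonl"),
#               {"prompt_id": "p1", "endpoint": "baseline"})
```

Appending one line per result means an interruption can damage at most the line being written at that moment; earlier lines stay intact.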
Two checkpoint files are maintained in the output directory while the run is in progress:
| File | Contents |
|---|---|
| `checkpoint_eval.jsonl` | One line per API response (router + baseline) |
| `checkpoint_judge.jsonl` | One line per judge evaluation |
A prompt is "completed" only when both endpoints (model_router + baseline) have returned. A half-finished prompt is automatically re-evaluated on resume — you'll never see a row with one endpoint missing.
python scripts/run_eval.py --resume --output-dir results/my-run

The runner:
- Loads the checkpoint files
- Identifies which prompt IDs are already done
- Runs only the remaining prompts
- Merges results and generates the full report
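The resume steps above can be sketched roughly as follows. This is an assumption-laden illustration: the record fields `prompt_id` and `endpoint`, the prompt field `id`, and both function names are hypothetical, and the real runner may store its checkpoints differently.

```python
import json
from pathlib import Path

REQUIRED_ENDPOINTS = frozenset({"model_router", "baseline"})


def load_checkpoint(path: Path) -> list:
    """Read a JSONL checkpoint, skipping blank or truncated lines."""
    records = []
    if not path.exists():
        return records
    with path.open(encoding="utf-8") as f:
        for line in f:
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                continue  # likely a partial write from an interrupted run
    return records


def remaining_prompts(prompts: list, records: list) -> list:
    """Keep prompts that do not yet have a record for every endpoint."""
    endpoints_by_id = {}
    for rec in records:
        endpoints_by_id.setdefault(rec["prompt_id"], set()).add(rec["endpoint"])
    done = {pid for pid, eps in endpoints_by_id.items() if REQUIRED_ENDPOINTS <= eps}
    return [p for p in prompts if p["id"] not in done]
```

A prompt with only one endpoint recorded is treated as not done, which matches the rule that half-finished prompts are re-evaluated on resume.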
Pressing Ctrl+C triggers a graceful shutdown instead of a hard kill:
⚠ Evaluation interrupted. 523/1000 prompts saved to checkpoint.
Resume with: python scripts/run_eval.py --resume --output-dir results/default
In-flight API calls finish, results are flushed, and the exact resume command is printed for you to copy. On successful completion of the full run, checkpoint files are automatically deleted.
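One common way to get this behaviour with asyncio is to have the signal handler set a flag instead of raising, so that work already started can finish. The sketch below is sequential for brevity and uses hypothetical names (`run_until_interrupted`, `evaluate`); it is not the runner's actual implementation.

```python
import asyncio
import signal


async def run_until_interrupted(prompt_ids, evaluate, stop: asyncio.Event) -> list:
    """Evaluate prompts in order until `stop` is set; return the IDs that finished."""
    saved = []
    for pid in prompt_ids:
        if stop.is_set():
            break              # stop starting new work
        await evaluate(pid)    # a call already in flight is allowed to finish
        saved.append(pid)      # a real runner would write the checkpoint here
    return saved


async def main(prompt_ids, evaluate) -> None:
    stop = asyncio.Event()
    loop = asyncio.get_running_loop()
    # add_signal_handler is available on Unix event loops only.
    loop.add_signal_handler(signal.SIGINT, stop.set)
    saved = await run_until_interrupted(prompt_ids, evaluate, stop)
    if len(saved) < len(prompt_ids):
        print(f"Evaluation interrupted. {len(saved)}/{len(prompt_ids)} prompts saved to checkpoint.")
```

Because Ctrl+C only sets the event, the loop exits at the next prompt boundary rather than in the middle of an API call.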
python scripts/run_eval.py --config configs/large_scale.yaml

Key differences from the default, tuned for sustained throughput rather than first-run friendliness:
| Setting | Default | Large Scale |
|---|---|---|
| `max_parallel_requests` | 5 | 10 |
| `request_timeout_seconds` | 60 | 120 |
| `max_retries` | 3 | 5 |
| `judge.max_parallel` | 3 | 5 |
| `judge.timeout_seconds` | 90 | 120 |
| `judge.max_retries` | 2 | 3 |
| Prompts | Eval phase | Judge phase | Total |
|---|---|---|---|
| 100 | ~3 min | ~10 min | ~13 min |
| 500 | ~17 min | ~55 min | ~72 min |
| 1,000 | ~35 min | ~110 min | ~2.5 hours |
These assume ~5 seconds per API call. Actual times vary by model latency and rate limits.
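For a back-of-envelope check of the eval-phase column, the sketch below assumes two sequential calls per prompt (router, then baseline), the default `max_parallel_requests` of 5, and no retries or throttling. The function name is hypothetical and the judge phase is not modelled.

```python
def eval_phase_minutes(prompts: int, seconds_per_call: float = 5.0,
                       calls_per_prompt: int = 2, max_parallel: int = 5) -> float:
    """Rough eval-phase duration: total call time divided by concurrency."""
    total_call_seconds = prompts * calls_per_prompt * seconds_per_call
    return total_call_seconds / max_parallel / 60


for n in (100, 500, 1000):
    print(n, round(eval_phase_minutes(n), 1))
# 100 3.3
# 500 16.7
# 1000 33.3
```

These figures land close to the eval-phase estimates in the table; rate limiting and slower models push real runs higher.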
If you see 429 Too Many Requests errors, your endpoint is throttling you. Options, in order of preference:
- Reduce concurrency: lower `max_parallel_requests` in your config
- Increase retries: the built-in exponential backoff handles transient 429s automatically
- Run across sessions: use `--resume` to split a run across multiple sessions
- Request higher Azure quota: the long-term fix for sustained large-scale runs
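To make the retry option concrete, here is a minimal sketch of exponential backoff around a rate-limited call, with delays doubling from 1 second. `RateLimited` and `call_with_backoff` are hypothetical names for illustration; the tool's own retry code may differ, for example by adding jitter or honouring a `Retry-After` header.

```python
import asyncio


class RateLimited(Exception):
    """Hypothetical error raised when the endpoint answers 429."""


async def call_with_backoff(call, max_retries: int = 3, base_delay: float = 1.0,
                            sleep=asyncio.sleep):
    """Await `call()`, retrying on RateLimited with delays of 1s, 2s, 4s, ..."""
    for attempt in range(max_retries + 1):
        try:
            return await call()
        except RateLimited:
            if attempt == max_retries:
                raise  # retries exhausted: surface the error
            await sleep(base_delay * 2 ** attempt)
```

With `max_retries=3`, a call is attempted at most four times and waits at most 1 + 2 + 4 = 7 seconds in total before giving up.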
It is perfectly fine to run a benchmark across several days:
# Session 1: start the run
python scripts/run_eval.py --config configs/large_scale.yaml --sample-size 1000
# (interrupted — e.g. end of day)
# Session 2: resume next day
python scripts/run_eval.py --resume --config configs/large_scale.yaml --output-dir results/large-scale

All results are held in memory for metrics computation. For 1,000 prompts this is typically 50–200 MB, well within normal limits. The checkpoint files also serve as a disk-backed record, so memory is never the bottleneck.
| Component | Mechanism | Purpose |
|---|---|---|
| Eval API calls | `asyncio.Semaphore` (configurable) | Prevent overwhelming endpoints |
| Judge API calls | Separate `asyncio.Semaphore` | Independent limit for judge model |
| Per-prompt | Sequential (router → baseline) | Fair latency comparison: no interference between the two endpoints |
| Overall dispatch | `asyncio.as_completed()` | Maximum throughput within semaphore limits |
All I/O is async — no threads are blocked. Retry logic uses exponential backoff (1s, 2s, 4s, ...).
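The concurrency pattern in the table can be sketched as follows. The names `evaluate_prompt`, `run_all`, and `call` are hypothetical, and whether the real semaphore limits prompts or individual calls is an assumption; here it limits prompts.

```python
import asyncio


async def evaluate_prompt(pid, call, sem: asyncio.Semaphore):
    """Run both endpoints for one prompt, one after the other."""
    async with sem:                                # cap concurrent prompts
        router = await call("model_router", pid)   # router first
        baseline = await call("baseline", pid)     # then baseline
    return pid, router, baseline


async def run_all(prompt_ids, call, max_parallel: int = 5) -> list:
    """Dispatch every prompt and collect results in completion order."""
    sem = asyncio.Semaphore(max_parallel)
    tasks = [asyncio.create_task(evaluate_prompt(pid, call, sem)) for pid in prompt_ids]
    results = []
    for finished in asyncio.as_completed(tasks):
        results.append(await finished)  # handled as soon as each one finishes
    return results
```

Handling results in completion order lets a fast prompt be checkpointed without waiting for a slow one that was scheduled earlier.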