Evaluating LLMs rigorously is expensive, fragile, and time-consuming. The standard workflow — writing one-off scripts, managing API keys by hand, retrying crashed jobs, paying for duplicate work, and stitching together ad-hoc comparison tables — does not scale beyond a single researcher or a single model.
Every serious LLM evaluation project eventually hits the same walls:
- Prompt brittleness — a prompt tuned for GPT-4o produces garbage from Llama. Evaluation results reflect prompt engineering choices, not model quality.
- Rubric subjectivity — hand-written scoring criteria are inconsistent, hard to version, and impossible to audit.
- Single-judge bias — one judge model introduces systematic scoring biases that compound across thousands of evaluations.
- Cost unpredictability — a poorly scoped run can burn hundreds of dollars before you know it went wrong.
- Fragile pipelines — a single API timeout at hour three means starting over, paying again, getting different outputs.
- No reproducibility — notebooks and ad-hoc scripts accumulate state that cannot be re-run cleanly six months later.
- Vendor dependency — code written for the OpenAI SDK must be rewritten to test Anthropic, Google, or an open-source model.
CoEval was built to change that.
| Problem | CoEval Solution |
|---|---|
| Evaluation logic scattered across notebooks | Single declarative YAML drives every phase |
| Hardcoded prompts biased toward a single model | Automated attribute and rubric generation, model-specific prompt overrides |
| High API bills with no visibility | Pre-run cost estimation with Batch API discounts (up to 50% off) |
| Crashed runs require starting over | Granular resume at task × model × record level |
| Judge bias from single-model evaluation | Configurable multi-judge ensembles with consistency metrics |
| No insight into what went wrong | 8 interactive HTML reports, JSONL audit trail, repair tooling |
| Vendor lock-in | 15+ interfaces — cloud, local, virtual benchmark |
| Manual comparison across dozens of models | Kendall τ ranking, Spearman ρ correlation, differentiation scores |
- LLM developers stress-testing a fine-tune against frontier baselines before shipping.
- ML engineers validating a RAG pipeline across diverse task dimensions and input distributions.
- Researchers producing reproducible, publication-grade benchmarks that can be re-run by reviewers.
- Platform teams running recurring evaluation suites across model versions with full cost and quality auditability.
- Practitioners who need to compare open-source and proprietary models on their specific task without rebuilding evaluation infrastructure each time.
CoEval is organized around five capability clusters that cover the full evaluation lifecycle.
- Declarative YAML configuration — models, tasks, roles, sampling, rubrics, and prompts in one file; no code required to launch an evaluation
- Interactive wizard (`coeval wizard`) — LLM-assisted config generation with conversational refinement; describe your evaluation goal in plain English
- Auto attribute discovery — LLMs infer relevant task dimensions from your task description; override or extend at will
- Auto rubric generation — build evaluation rubrics automatically from task descriptions, statically from your own criteria, or by extending an existing rubric from a prior run
- Nuanced attribute sampling — inject diversity dimensions (tone, domain, urgency, style) per benchmark item to eliminate distribution collapse and keep synthetic data naturalistic
- Per-model prompt overrides — tailor generation and evaluation instructions at model granularity without duplicating task definitions
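To make the configuration features above concrete, here is a minimal sketch of what such a YAML might look like. The key names (`models`, `roles`, `tasks`, `sampling_attributes`, `prompt_overrides`, `rubric`) are illustrative assumptions that mirror the terms used in this list, not CoEval's verified schema.

```yaml
# Illustrative sketch only — key names are assumptions, not CoEval's exact schema.
experiment: support-reply-bench
models:
  - name: gpt-4o
    interface: openai
    roles: [teacher, judge]
  - name: llama-3-70b
    interface: ollama
    roles: [student]
    prompt_overrides:                 # per-model prompt tailoring
      generation: "Reply concisely; do not add preambles."
tasks:
  - name: customer-support-reply
    description: "Draft a helpful reply to a customer support email."
    sampling_attributes: [tone, domain, urgency]   # diversity dimensions per item
rubric:
  mode: auto                          # build the rubric from the task description
```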
- `interface: auto` — automatic cheapest-available provider selection based on your key file; no hardcoded provider choices
- Five-phase pipeline — attribute mapping → rubric mapping → data generation → response collection → evaluation; each phase is independently checkpointed and resumable
- Flexible role assignment — any model can be teacher, student, judge, or all three simultaneously; assign role-specific temperature and token budgets per model
- Batch API acceleration — OpenAI, Anthropic, Gemini, and Azure OpenAI batching enabled with a single config flag; up to 50% cost reduction on Phase 4 and Phase 5
- Concurrent execution — configurable worker pools per phase; HuggingFace models queued for GPU; all other interfaces run concurrently
- Per-phase mode control — `New`/`Keep`/`Extend`/`Model` per phase; skip phases you've already run, append only missing items, or reuse an existing model's output
- Virtual benchmark teachers (`interface: benchmark`) — inject pre-ingested dataset responses (XSum, CodeSearchNet, AESLC, MMLU, HumanEval, …) as Phase 3 ground truth without any LLM calls
- Label accuracy mode — judge-free exact-match evaluation for classification and information extraction tasks via `label_attributes`; no judge calls needed, no judge bias
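The orchestration controls above might combine in a config along these lines; the phase names, mode values, `interface: benchmark`, and `label_attributes` come from this list, but the surrounding structure is an assumption.

```yaml
# Illustrative sketch — structure assumed; mode names and field names follow the docs above.
phases:
  attribute_mapping:   { mode: Keep }              # reuse phase 1 output as-is
  rubric_mapping:      { mode: Keep }
  data_generation:     { mode: Extend }            # append only missing items
  response_collection: { mode: New, batch: true }  # Batch API pricing where supported
  evaluation:          { mode: New, batch: true }
models:
  - name: xsum-reference
    interface: benchmark                           # virtual teacher from a pre-ingested dataset
    roles: [teacher]
tasks:
  - name: topic-classification
    label_attributes: [topic]                      # judge-free exact-match evaluation
```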
- Pre-run cost estimation (`coeval plan`) — samples 2 real API calls per model by default to measure actual latency and throughput; falls back to token-count heuristics with `--estimate-samples 0`
- Batch discount modeling — estimation automatically applies provider-specific batch discounts (50% for OpenAI, Anthropic, Gemini, Azure OpenAI)
- Per-model API quota — `quota: block` sets hard call ceilings per model; the pipeline stops cleanly at the ceiling rather than crashing mid-phase
- `--estimate-only` — validate config, print itemized cost table, write `cost_estimate.json`, and exit before any evaluation work starts
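The cost guardrails above might be layered onto the same config roughly as below; the exact `quota` and `batch` syntax is an assumption drawn from the feature names, and the `--estimate-only` flag described above then lets you validate and price the run before any evaluation work starts.

```yaml
# Illustrative cost guardrails — the exact quota/batch syntax is an assumption.
models:
  - name: claude-3-5-sonnet
    interface: anthropic
    roles: [judge]
    quota: block      # hard per-model call ceiling; the pipeline stops cleanly at the limit
batch: true           # enable provider Batch APIs for Phases 4 and 5 (up to 50% off)
```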
- Model probe (`coeval probe`) — lightweight availability check for all 15+ interfaces using model listing endpoints; consumes no generation tokens and exits before spending a dollar
- Progress dashboard (`coeval status`) — live phase artifact counts, pending batch jobs, and recent log errors; add `--fetch-batches` to poll provider APIs for completion
- Config HTML preview (`coeval describe`) — browser-ready summary of models, tasks, rubrics, phase execution plan, prompt templates, and cost budget; shareable for team review before running
- JSONL repair tool (`coeval repair`) — scans for malformed records, removes corrupted phase checkpoints, and prepares for targeted re-runs with `--continue`
- Structured logging — configurable log level per experiment; full JSONL audit trail written per phase alongside every result
- Granular checkpointing — each JSONL record written atomically; interruption mid-phase loses at most one record
- `--continue` — resume any experiment from its last checkpoint; reads `phases_completed` from `meta.json`, skips done phases, and fills gaps in in-progress phases
- `--resume EXPERIMENT_ID` — fork phases 1–2 from an existing run; reuse attribute and rubric work across multiple student/judge variants without regenerating data
- `--only-models` — re-run specific models in phases 3–5 without touching others; ideal for adding a new model to an existing benchmark
- Repair + continue workflow — identify failed records with `coeval repair --dry-run`, apply fixes, then `--continue` fills the gaps without touching intact data
- Multi-judge ensemble — average scores across N judges for robust, bias-resistant evaluation; ensemble size configurable per experiment
- Single and per-factor evaluation modes — holistic score (one call per response) or one call per rubric dimension per response for maximum granularity
- 8 interactive HTML reports — score distributions, teacher analysis, judge consistency, student rankings, interaction matrices, coverage summaries, calibration curves, and robust summaries
- Excel export — complete results in a structured workbook for downstream statistical analysis or stakeholder review
- Calibration analysis — linear α/β calibration fit between judge scores and benchmark ground-truth; corrects for systematic over/under-scoring
- Agreement metrics — Spearman ρ, Kendall τ, and inter-judge consistency (ACR); quantify how reliably models agree on relative quality
- Paper tables — Spearman ρ comparisons, student rankings, sampling ablation, positional bias detection, calibration effect, and cost comparison tables ready for publication
- Benchmark score integration — correlate LLM judge scores with BERTScore, BLEU, ROUGE, and exact-match ground-truth from real datasets
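On the analysis side, a multi-judge, per-factor setup might be declared roughly as follows; `judges`, `evaluation_mode`, and `reports` are assumed key names that mirror the capabilities listed above, not the exact schema.

```yaml
# Illustrative evaluation settings — key names are assumptions.
judges: [gpt-4o, claude-3-5-sonnet, gemini-1.5-pro]  # ensemble; scores averaged across judges
evaluation_mode: per_factor   # one judge call per rubric dimension per response
reports:
  html: true                  # interactive reports: distributions, rankings, calibration, ...
  excel: true                 # structured workbook for downstream analysis
```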
Q: What is CoEval and what problem does it solve?
A: CoEval is a declarative LLM benchmarking framework that orchestrates a fleet of language models as teachers, students, and judges. It addresses the core problems of LLM evaluation at scale — prompt brittleness, rubric subjectivity, single-judge bias, fragile pipelines, and vendor lock-in — through a structured five-phase pipeline driven entirely by a single YAML config.
Q: Who is CoEval designed for?
A: CoEval is built for LLM developers stress-testing fine-tunes, ML engineers validating RAG pipelines, researchers producing reproducible benchmarks, and platform teams running recurring evaluation suites. Essentially anyone who needs rigorous, auditable model comparisons without rebuilding evaluation infrastructure from scratch.
Q: Do I need to write code to run an evaluation?
A: No. The entire experiment is described in a single YAML file — models, tasks, rubrics, sampling, and prompts. The `coeval wizard` command can even generate this config interactively from a plain-English description of your evaluation goal.
Q: How does CoEval reduce API costs?
A: CoEval integrates with the OpenAI, Anthropic, Gemini, and Azure OpenAI Batch APIs, which provide a 50% per-token discount on Phases 4 and 5. The `coeval plan` command runs a pre-run cost estimate (with optional live sample calls) so you know the cost before committing, and setting `quota: block` per model prevents runaway spending.
Q: What happens if my evaluation run crashes mid-way?
A: Each JSONL record is written atomically, so a crash loses at most one in-flight record. Running `coeval run --config my.yaml --continue` resumes from exactly where the run stopped, reading `phases_completed` from `meta.json` and skipping all already-completed work.
Q: How does CoEval handle vendor lock-in?
A: CoEval abstracts over 15+ model interfaces — OpenAI, Anthropic, Gemini, Azure, Bedrock, Vertex, OpenRouter, Groq, DeepSeek, Mistral, Ollama, HuggingFace, and more. Switching providers requires changing a single `interface:` field in the YAML; no pipeline code changes are needed.
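For instance, pointing an existing experiment at a local runtime instead of a hosted API is, in spirit, a one-field change (model names and surrounding keys here are illustrative):

```yaml
models:
  - name: llama-3-8b
    # interface: openrouter   # before: hosted API
    interface: ollama         # after: local runtime; only this field changes
    roles: [student]
```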