Evaluating LLMs rigorously is expensive, fragile, and time-consuming. The standard workflow — writing one-off scripts, managing API keys by hand, retrying crashed jobs, paying for duplicate work, and stitching together ad-hoc comparison tables — does not scale beyond a single researcher or a single model.
Every serious LLM evaluation project eventually hits the same walls:
- Prompt brittleness — a prompt tuned for GPT-4o produces garbage from Llama. Evaluation results reflect prompt engineering choices, not model quality.
- Rubric subjectivity — hand-written scoring criteria are inconsistent, hard to version, and impossible to audit.
- Single-judge bias — one judge model introduces systematic scoring biases that compound across thousands of evaluations.
- Cost unpredictability — a poorly scoped run can burn hundreds of dollars before you know it went wrong.
- Fragile pipelines — a single API timeout at hour three means starting over, paying again, getting different outputs.
- No reproducibility — notebooks and ad-hoc scripts accumulate state that cannot be re-run cleanly six months later.
- Vendor dependency — code written for the OpenAI SDK must be rewritten to test Anthropic, Google, or an open-source model.
CoEval was built to change that.
| Problem | CoEval Solution |
|---|---|
| Evaluation logic scattered across notebooks | Single declarative YAML drives every phase |
| Hardcoded prompts biased toward a single model | Automated attribute and rubric generation, model-specific prompt overrides |
| High API bills with no visibility | Pre-run cost estimation with Batch API discounts (up to 50% off) |
| Crashed runs require starting over | Granular resume at task × model × record level |
| Judge bias from single-model evaluation | Configurable multi-judge ensembles with consistency metrics |
| No insight into what went wrong | 8 interactive HTML reports, JSONL audit trail, repair tooling |
| Vendor lock-in | 15+ interfaces — cloud, local, virtual benchmark |
| Manual comparison across dozens of models | Kendall τ ranking, Spearman ρ correlation, differentiation scores |
- LLM developers stress-testing a fine-tune against frontier baselines before shipping.
- ML engineers validating a RAG pipeline across diverse task dimensions and input distributions.
- Researchers producing reproducible, publication-grade benchmarks that can be re-run by reviewers.
- Platform teams running recurring evaluation suites across model versions with full cost and quality auditability.
- Practitioners who need to compare open-source and proprietary models on their specific task without rebuilding evaluation infrastructure each time.
CoEval is organized around five capability clusters that cover the full evaluation lifecycle.
- Declarative YAML configuration — models, tasks, roles, sampling, rubrics, and prompts in one file; no code required to launch an evaluation
- Interactive wizard (`coeval wizard`) — LLM-assisted config generation with conversational refinement; describe your evaluation goal in plain English
- Auto attribute discovery — LLMs infer relevant task dimensions from your task description; override or extend at will
- Auto rubric generation — build evaluation rubrics automatically from task descriptions, statically from your own criteria, or by extending an existing rubric from a prior run
- Nuanced attribute sampling — inject diversity dimensions (tone, domain, urgency, style) per benchmark item to eliminate distribution collapse and keep synthetic data naturalistic
- Per-model prompt overrides — tailor generation and evaluation instructions at model granularity without duplicating task definitions
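To make the configuration features above concrete, here is a minimal sketch of what such a YAML might look like. The key names (`models`, `roles`, `tasks`, `sampling_attributes`, `prompt_overrides`, `rubric`) are illustrative assumptions that mirror the terms used in this list, not CoEval's verified schema.

```yaml
# Illustrative sketch only — key names are assumptions, not CoEval's exact schema.
experiment: support-reply-bench
models:
  - name: gpt-4o
    interface: openai
    roles: [teacher, judge]
  - name: llama-3-70b
    interface: ollama
    roles: [student]
    prompt_overrides:                 # per-model prompt tailoring
      generation: "Reply concisely; do not add preambles."
tasks:
  - name: customer-support-reply
    description: "Draft a helpful reply to a customer support email."
    sampling_attributes: [tone, domain, urgency]   # diversity dimensions per item
rubric:
  mode: auto                          # build the rubric from the task description
```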
- `interface: auto` — automatic cheapest-available provider selection based on your key file; no hardcoded provider choices
- Five-phase pipeline — attribute mapping → rubric mapping → data generation → response collection → evaluation; each phase is independently checkpointed and resumable
- Flexible role assignment — any model can be teacher, student, judge, or all three simultaneously; assign role-specific temperature and token budgets per model
- Batch API acceleration — OpenAI, Anthropic, Gemini, and Azure OpenAI batching enabled with a single config flag; up to 50% cost reduction on Phase 4 and Phase 5
- Concurrent execution — configurable worker pools per phase; HuggingFace models queued for GPU; all other interfaces run concurrently
- Per-phase mode control — `New`/`Keep`/`Extend`/`Model` per phase; skip phases you've already run, append only missing items, or reuse an existing model's output
- Virtual benchmark teachers (`interface: benchmark`) — inject pre-ingested dataset responses (XSum, CodeSearchNet, AESLC, MMLU, HumanEval, …) as Phase 3 ground truth without any LLM calls
- Label accuracy mode — judge-free exact-match evaluation for classification and information extraction tasks via `label_attributes`; no judge calls needed, no judge bias
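The orchestration controls above might combine in a config along these lines; the phase names, mode values, `interface: benchmark`, and `label_attributes` come from this list, but the surrounding structure is an assumption.

```yaml
# Illustrative sketch — structure assumed; mode names and field names follow the docs above.
phases:
  attribute_mapping:   { mode: Keep }              # reuse phase 1 output as-is
  rubric_mapping:      { mode: Keep }
  data_generation:     { mode: Extend }            # append only missing items
  response_collection: { mode: New, batch: true }  # Batch API pricing where supported
  evaluation:          { mode: New, batch: true }
models:
  - name: xsum-reference
    interface: benchmark                           # virtual teacher from a pre-ingested dataset
    roles: [teacher]
tasks:
  - name: topic-classification
    label_attributes: [topic]                      # judge-free exact-match evaluation
```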
- Pre-run cost estimation (`coeval plan`) — samples 2 real API calls per model by default to measure actual latency and throughput; falls back to token-count heuristics with `--estimate-samples 0`
- Batch discount modeling — estimation automatically applies provider-specific batch discounts (50% for OpenAI, Anthropic, Gemini, Azure OpenAI)
- Per-model API quota — `quota: block` sets hard call ceilings per model; the pipeline stops cleanly at the ceiling rather than crashing mid-phase
- `--estimate-only` — validate config, print itemized cost table, write `cost_estimate.json`, and exit before any evaluation work starts
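The cost guardrails above might be layered onto the same config roughly as below; the exact `quota` and `batch` syntax is an assumption drawn from the feature names, and the `--estimate-only` flag described above then lets you validate and price the run before any evaluation work starts.

```yaml
# Illustrative cost guardrails — the exact quota/batch syntax is an assumption.
models:
  - name: claude-3-5-sonnet
    interface: anthropic
    roles: [judge]
    quota: block      # hard per-model call ceiling; the pipeline stops cleanly at the limit
batch: true           # enable provider Batch APIs for Phases 4 and 5 (up to 50% off)
```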
- Model probe (`coeval probe`) — lightweight availability check for all 15+ interfaces using model listing endpoints; consumes no generation tokens and exits before spending a dollar
- Progress dashboard (`coeval status`) — live phase artifact counts, pending batch jobs, and recent log errors; add `--fetch-batches` to poll provider APIs for completion
- Config HTML preview (`coeval describe`) — browser-ready summary of models, tasks, rubrics, phase execution plan, prompt templates, and cost budget; shareable for team review before running
- JSONL repair tool (`coeval repair`) — scans for malformed records, removes corrupted phase checkpoints, and prepares for targeted re-runs with `--continue`
- Structured logging — configurable log level per experiment; full JSONL audit trail written per phase alongside every result
- Granular checkpointing — each JSONL record written atomically; interruption mid-phase loses at most one record
- `--continue` — resume any experiment from its last checkpoint; reads `phases_completed` from `meta.json`, skips done phases, and fills gaps in in-progress phases
- `--resume EXPERIMENT_ID` — fork phases 1–2 from an existing run; reuse attribute and rubric work across multiple student/judge variants without regenerating data
- `--only-models` — re-run specific models in phases 3–5 without touching others; ideal for adding a new model to an existing benchmark
- Repair + continue workflow — identify failed records with `coeval repair --dry-run`, apply fixes, then `--continue` fills the gaps without touching intact data
- Multi-judge ensemble — average scores across N judges for robust, bias-resistant evaluation; ensemble size configurable per experiment
- Single and per-factor evaluation modes — holistic score (one call per response) or one call per rubric dimension per response for maximum granularity
- 8 interactive HTML reports — score distributions, teacher analysis, judge consistency, student rankings, interaction matrices, coverage summaries, calibration curves, and robust summaries
- Excel export — complete results in a structured workbook for downstream statistical analysis or stakeholder review
- Calibration analysis — linear α/β calibration fit between judge scores and benchmark ground-truth; corrects for systematic over/under-scoring
- Agreement metrics — Spearman ρ, Kendall τ, and inter-judge consistency (ACR); quantify how reliably models agree on relative quality
- Paper tables — Spearman ρ comparisons, student rankings, sampling ablation, positional bias detection, calibration effect, and cost comparison tables ready for publication
- Benchmark score integration — correlate LLM judge scores with BERTScore, BLEU, ROUGE, and exact-match ground-truth from real datasets
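On the analysis side, a multi-judge, per-factor setup might be declared roughly as follows; `judges`, `evaluation_mode`, and `reports` are assumed key names that mirror the capabilities listed above, not the exact schema.

```yaml
# Illustrative evaluation settings — key names are assumptions.
judges: [gpt-4o, claude-3-5-sonnet, gemini-1.5-pro]  # ensemble; scores averaged across judges
evaluation_mode: per_factor   # one judge call per rubric dimension per response
reports:
  html: true                  # interactive reports: distributions, rankings, calibration, ...
  excel: true                 # structured workbook for downstream analysis
```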
Q: What is CoEval and what problem does it solve?
A: CoEval is a declarative LLM benchmarking framework that orchestrates a fleet of language models as teachers, students, and judges. It addresses the core problems of LLM evaluation at scale — prompt brittleness, rubric subjectivity, single-judge bias, fragile pipelines, and vendor lock-in — through a structured five-phase pipeline driven entirely by a single YAML config.
Q: Who is CoEval designed for?
A: CoEval is built for LLM developers stress-testing fine-tunes, ML engineers validating RAG pipelines, researchers producing reproducible benchmarks, and platform teams running recurring evaluation suites. Essentially anyone who needs rigorous, auditable model comparisons without rebuilding evaluation infrastructure from scratch.
Q: Do I need to write code to run an evaluation?
A: No. The entire experiment is described in a single YAML file — models, tasks, rubrics, sampling, and prompts. The `coeval wizard` command can even generate this config interactively from a plain-English description of your evaluation goal.
Q: How does CoEval reduce API costs?
A: CoEval integrates with the OpenAI, Anthropic, Gemini, and Azure OpenAI Batch APIs, which provide a 50% per-token discount on Phases 4 and 5. The `coeval plan` command runs a pre-run cost estimate (with optional live sample calls) so you know the cost before committing, and setting `quota: block` per model prevents runaway spending.
Q: What happens if my evaluation run crashes mid-way?
A: Each JSONL record is written atomically, so a crash loses at most one in-flight record. Running `coeval run --config my.yaml --continue` resumes from exactly where the run stopped, reading `phases_completed` from `meta.json` and skipping all already-completed work.
Q: How does CoEval handle vendor lock-in?
A: CoEval abstracts over 15+ model interfaces — OpenAI, Anthropic, Gemini, Azure, Bedrock, Vertex, OpenRouter, Groq, DeepSeek, Mistral, Ollama, HuggingFace, and more. Switching providers requires changing a single `interface:` field in the YAML; no pipeline code changes are needed.
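For instance, pointing an existing experiment at a local runtime instead of a hosted API is, in spirit, a one-field change (model names and surrounding keys here are illustrative):

```yaml
models:
  - name: llama-3-8b
    # interface: openrouter   # before: hosted API
    interface: ollama         # after: local runtime; only this field changes
    roles: [student]
```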