A comprehensive framework for evaluating LLM outputs with LLM-as-judge, consistency testing, and RAG evaluation patterns.
How do you know your LLM is working correctly?
Production AI systems need rigorous evaluation. This framework provides the tools to:
- Measure accuracy, consistency, and quality of LLM outputs
- Evaluate RAG systems for retrieval quality, faithfulness, and relevance
- Compare models across multiple dimensions (quality, cost, latency)
- Use the LLM-as-judge pattern for scalable evaluation
Built for engineering managers and AI teams who need to ship reliable AI systems.
```bash
# Install
pip install -e .

# Check your model is available
llm-eval check --model ollama/llama3.2

# Run your first evaluation
llm-eval run --model ollama/llama3.2 --dataset simple_qa

# Compare models
llm-eval compare --models ollama/llama3.2,ollama/mistral --dataset simple_qa
```

Evaluators:

- AccuracyEvaluator: Check factual accuracy against expected answers
- ConsistencyEvaluator: Test whether the model gives consistent responses across runs
- LatencyEvaluator: Measure response times against SLO thresholds
- CostEvaluator: Track token usage and estimate costs (a combined example for the latency and cost evaluators follows this list)
- LLMJudgeEvaluator: Use LLM-as-judge pattern for quality assessment
- RAGEvaluator: Evaluate retrieval, faithfulness, and relevance
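The latency and cost evaluators consume run metadata rather than the response text. Here is a minimal sketch, assuming `LatencyEvaluator` takes an SLO threshold and `CostEvaluator` a per-token price, with metadata passed through `evaluate()`'s keyword arguments; the names `slo_ms`, `cost_per_1k_tokens`, and the `latency_ms` kwarg are illustrative assumptions, not the confirmed API:

```python
from llm_eval import OllamaRunner, LatencyEvaluator, CostEvaluator

prompt = "Summarize the plot of Hamlet in one sentence."

runner = OllamaRunner(model="llama3.2")
result = runner.run(prompt)

# slo_ms / cost_per_1k_tokens are assumed parameter names -- check the
# evaluator docstrings for the real constructor signatures.
latency_eval = LatencyEvaluator(slo_ms=2000)
cost_eval = CostEvaluator(cost_per_1k_tokens=0.0)  # local inference is free

# RunResult carries latency metadata, forwarded here via **kwargs.
latency_result = latency_eval.evaluate(
    prompt=prompt,
    response=result.response,
    latency_ms=result.latency_ms,
)
print(f"Latency score: {latency_result.score}")
```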
Runners:

- OllamaRunner: Local inference with Ollama
- OpenAIRunner: OpenAI API (GPT-4, GPT-3.5, etc.)
- AnthropicRunner: Anthropic API (Claude 3.5, Claude 3)
Reporters:

- ConsoleReporter: Beautiful terminal output with rich formatting
- JSONReporter: Machine-readable output for automation
- HTMLReporter: Visual reports with charts
Architecture:

```
┌─────────────────────────────────────────────────────────────────┐
│ CLI Interface │
│ llm-eval run/compare/report │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Dataset │───▶│ Runner │───▶│Evaluators│───▶│ Reporter │ │
│ │ Loader │ │ │ │ │ │ │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ JSON/CSV │ │ Ollama │ │ Accuracy │ │ Console │ │
│ │ YAML │ │ OpenAI │ │ Cost │ │ JSON │ │
│ │ HF │ │ Anthropic│ │ Judge │ │ HTML │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
```
Quick start in Python:

```python
from llm_eval import OllamaRunner, AccuracyEvaluator

# Initialize
runner = OllamaRunner(model="llama3.2")
evaluator = AccuracyEvaluator(mode="contains")

# Run
result = runner.run("What is the capital of France?")
eval_result = evaluator.evaluate(
    prompt="What is the capital of France?",
    response=result.response,
    expected_facts=["Paris"]
)
print(f"Score: {eval_result.score}")  # 1.0 if Paris is mentioned
```

LLM-as-judge evaluation:

```python
from llm_eval import OpenAIRunner, LLMJudgeEvaluator
# Use GPT-4 as the judge
judge_runner = OpenAIRunner(model="gpt-4o")
evaluator = LLMJudgeEvaluator(
    judge_fn=judge_runner.get_run_function(),
    criteria=[
        "Is the response helpful?",
        "Is it accurate?",
        "Is it well-structured?",
    ]
)

# Evaluate any response
result = evaluator.evaluate(
    prompt="Explain quantum computing",
    response="Quantum computing uses qubits..."
)
print(f"Judge Score: {result.score}")
print(f"Details: {result.details}")
```

RAG evaluation:

```python
from llm_eval import RAGEvaluator, OllamaRunner
runner = OllamaRunner(model="llama3.2")

evaluator = RAGEvaluator(
    retrieval_threshold=0.7,
    faithfulness_threshold=0.8,
    use_llm_judge=True
)

result = evaluator.evaluate(
    prompt="What is the return policy?",
    response="Returns accepted within 30 days.",
    retrieved_docs=[
        {"content": "Return Policy: 30 day returns.", "score": 0.95}
    ],
    judge_fn=runner.get_run_function()
)

# Access component scores
print(f"Retrieval: {result.details['retrieval']['score']}")
print(f"Faithfulness: {result.details['faithfulness']['score']}")
print(f"Relevance: {result.details['relevance']['score']}")
```

Consistency testing:

```python
from llm_eval import ConsistencyEvaluator, OllamaRunner
runner = OllamaRunner(model="llama3.2")
evaluator = ConsistencyEvaluator(n_runs=5, similarity_threshold=0.85)

result = evaluator.evaluate(
    prompt="What is 2 + 2?",
    response="",  # Not used when run_fn is provided
    run_fn=runner.get_run_function()
)

print(f"Consistency: {result.score:.2%}")
print(f"Consistent pairs: {result.details['consistent_pairs']}/{result.details['total_pairs']}")
```

CLI usage:

```bash
# List available datasets
llm-eval datasets

# List available evaluators
llm-eval evaluators

# Run evaluation with specific evaluators
llm-eval run \
  --model ollama/llama3.2 \
  --dataset simple_qa \
  --evaluators accuracy,latency,cost \
  --output results.json

# Generate HTML report
llm-eval report --input results.json --output report.html --format html

# Compare models
llm-eval compare \
  --models ollama/llama3.2,openai/gpt-4o-mini \
  --dataset reasoning \
  --limit 10
```

Create an eval_config.yaml file:

```yaml
model: ollama/llama3.2
dataset: simple_qa

evaluators:
  - accuracy
  - consistency
  - llm_judge

accuracy:
  mode: contains
  case_sensitive: false

consistency:
  n_runs: 5
  similarity_threshold: 0.85

llm_judge:
  criteria:
    - Is the response helpful?
    - Is it accurate?
  min_passing_score: 3.5

output:
  format: html
  path: evaluation_report.html
```

Then run:

```bash
llm-eval run --config eval_config.yaml
```

Built-in datasets:

| Dataset | Samples | Categories |
|---|---|---|
| simple_qa | 5 | Geography, Math, Literature, Science, History |
| reasoning | 2 | Logic, Math reasoning |
| coding | 2 | Python |
| safety | 2 | Harmful requests, Legitimate security |
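For example, to smoke-test the bundled safety dataset with a single evaluator (all flags as documented in the CLI section above):

```bash
llm-eval run --model ollama/llama3.2 --dataset safety --evaluators accuracy
```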
Create custom datasets as JSON or YAML:

```json
{
  "name": "my_evals",
  "samples": [
    {
      "id": "test_001",
      "prompt": "What is the capital of France?",
      "expected": "Paris",
      "expected_facts": ["Paris"],
      "category": "geography"
    }
  ]
}
```
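The same sample expressed as YAML, assuming the loader accepts the identical field names shown in the JSON example:

```yaml
name: my_evals
samples:
  - id: test_001
    prompt: What is the capital of France?
    expected: Paris
    expected_facts:
      - Paris
    category: geography
```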
To add a custom evaluator, subclass BaseEvaluator:

```python
from llm_eval.evaluators.base import BaseEvaluator, EvalResult, EvalStatus

class MyCustomEvaluator(BaseEvaluator):
    def evaluate(self, prompt: str, response: str, **kwargs) -> EvalResult:
        # Your evaluation logic goes here; a trivial non-empty check
        # keeps the template runnable
        score = 1.0 if response.strip() else 0.0
        return self._create_result(
            score=score,
            status=EvalStatus.PASSED if score > 0.8 else EvalStatus.FAILED,
            message=f"Custom score: {score}",
            details={"custom_metric": score}
        )
```
To add a custom runner, subclass BaseRunner:

```python
from llm_eval.runners.base import BaseRunner, RunResult
import time

class MyCustomRunner(BaseRunner):
    def run(self, prompt: str, **kwargs) -> RunResult:
        start = time.perf_counter()
        # Your LLM call goes here; an echo stub keeps the template runnable
        response = f"Echo: {prompt}"
        return RunResult(
            response=response,
            model=self.model,
            latency_ms=(time.perf_counter() - start) * 1000
        )
```
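Custom runners and evaluators compose exactly like the built-in ones. Given the two templates above, the wiring looks like this (BaseRunner is assumed to accept a model name, mirroring the built-in runners):

```python
# MyCustomRunner and MyCustomEvaluator are the templates defined above;
# the model argument is an assumption based on the built-in runners.
runner = MyCustomRunner(model="my-model")
evaluator = MyCustomEvaluator()

result = runner.run("What is the capital of France?")
eval_result = evaluator.evaluate(
    prompt="What is the capital of France?",
    response=result.response,
)
print(eval_result.score, eval_result.status)
```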
Acknowledgments:

- Anthropic's evaluation practices: The LLM-as-judge pattern and evaluation criteria are inspired by Anthropic's approach to Constitutional AI and model evaluation
- OpenAI Evals: The framework structure draws from OpenAI's open-source evaluation framework
- RAGAS: The RAG evaluation metrics (retrieval, faithfulness, relevance) build on the RAGAS research
Existing evaluation tools tend to be:

- Too complex for quick experiments (OpenAI Evals)
- Too narrow in scope (single evaluator)
- Poorly suited to local models
This framework provides:
- Simple API: Get started in minutes
- Local-first: Works great with Ollama
- Production-ready patterns: LLM-as-judge, RAG eval, consistency testing
- Extensible: Easy to add custom evaluators
Contributions welcome! Areas of interest:
- Additional evaluators (toxicity, bias, etc.)
- More reporter formats
- Performance optimizations
- Documentation improvements
License: MIT
