A comprehensive framework for evaluating LLM outputs with LLM-as-judge, consistency testing, and RAG evaluation patterns.
How do you know your LLM is working correctly?
Production AI systems need rigorous evaluation. This framework provides the tools to:
- Measure accuracy, consistency, and quality of LLM outputs
- Evaluate RAG systems for retrieval quality, faithfulness, and relevance
- Compare models across multiple dimensions (quality, cost, latency)
- Use the LLM-as-judge pattern for scalable evaluation
Built for engineering managers and AI teams who need to ship reliable AI systems.
```bash
# Install
pip install -e .

# Check your model is available
llm-eval check --model ollama/llama3.2

# Run your first evaluation
llm-eval run --model ollama/llama3.2 --dataset simple_qa

# Compare models
llm-eval compare --models ollama/llama3.2,ollama/mistral --dataset simple_qa
```

Evaluators:

- AccuracyEvaluator: Check factual accuracy against expected answers
- ConsistencyEvaluator: Test whether the model gives consistent responses across runs
- LatencyEvaluator: Measure response times against SLO thresholds
- CostEvaluator: Track token usage and estimate costs (a combined example for the latency and cost evaluators follows this list)
- LLMJudgeEvaluator: Use LLM-as-judge pattern for quality assessment
- RAGEvaluator: Evaluate retrieval, faithfulness, and relevance
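The latency and cost evaluators consume run metadata rather than the response text. Here is a minimal sketch, assuming `LatencyEvaluator` takes an SLO threshold and `CostEvaluator` a per-token price, with metadata passed through `evaluate()`'s keyword arguments; the names `slo_ms`, `cost_per_1k_tokens`, and the `latency_ms` kwarg are illustrative assumptions, not the confirmed API:

```python
from llm_eval import OllamaRunner, LatencyEvaluator, CostEvaluator

prompt = "Summarize the plot of Hamlet in one sentence."

runner = OllamaRunner(model="llama3.2")
result = runner.run(prompt)

# slo_ms / cost_per_1k_tokens are assumed parameter names -- check the
# evaluator docstrings for the real constructor signatures.
latency_eval = LatencyEvaluator(slo_ms=2000)
cost_eval = CostEvaluator(cost_per_1k_tokens=0.0)  # local inference is free

# RunResult carries latency metadata, forwarded here via **kwargs.
latency_result = latency_eval.evaluate(
    prompt=prompt,
    response=result.response,
    latency_ms=result.latency_ms,
)
print(f"Latency score: {latency_result.score}")
```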
Runners:

- OllamaRunner: Local inference with Ollama
- OpenAIRunner: OpenAI API (GPT-4, GPT-3.5, etc.)
- AnthropicRunner: Anthropic API (Claude 3.5, Claude 3)
Reporters:

- ConsoleReporter: Beautiful terminal output with rich formatting
- JSONReporter: Machine-readable output for automation
- HTMLReporter: Visual reports with charts
Architecture:

```
┌─────────────────────────────────────────────────────────────────┐
│ CLI Interface │
│ llm-eval run/compare/report │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Dataset │───▶│ Runner │───▶│Evaluators│───▶│ Reporter │ │
│ │ Loader │ │ │ │ │ │ │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ JSON/CSV │ │ Ollama │ │ Accuracy │ │ Console │ │
│ │ YAML │ │ OpenAI │ │ Cost │ │ JSON │ │
│ │ HF │ │ Anthropic│ │ Judge │ │ HTML │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
```
Quick start in Python:

```python
from llm_eval import OllamaRunner, AccuracyEvaluator

# Initialize
runner = OllamaRunner(model="llama3.2")
evaluator = AccuracyEvaluator(mode="contains")

# Run
result = runner.run("What is the capital of France?")
eval_result = evaluator.evaluate(
    prompt="What is the capital of France?",
    response=result.response,
    expected_facts=["Paris"]
)
print(f"Score: {eval_result.score}")  # 1.0 if Paris is mentioned
```

LLM-as-judge evaluation:

```python
from llm_eval import OpenAIRunner, LLMJudgeEvaluator
# Use GPT-4 as the judge
judge_runner = OpenAIRunner(model="gpt-4o")
evaluator = LLMJudgeEvaluator(
    judge_fn=judge_runner.get_run_function(),
    criteria=[
        "Is the response helpful?",
        "Is it accurate?",
        "Is it well-structured?",
    ]
)

# Evaluate any response
result = evaluator.evaluate(
    prompt="Explain quantum computing",
    response="Quantum computing uses qubits..."
)
print(f"Judge Score: {result.score}")
print(f"Details: {result.details}")
```

RAG evaluation:

```python
from llm_eval import RAGEvaluator, OllamaRunner
runner = OllamaRunner(model="llama3.2")

evaluator = RAGEvaluator(
    retrieval_threshold=0.7,
    faithfulness_threshold=0.8,
    use_llm_judge=True
)

result = evaluator.evaluate(
    prompt="What is the return policy?",
    response="Returns accepted within 30 days.",
    retrieved_docs=[
        {"content": "Return Policy: 30 day returns.", "score": 0.95}
    ],
    judge_fn=runner.get_run_function()
)

# Access component scores
print(f"Retrieval: {result.details['retrieval']['score']}")
print(f"Faithfulness: {result.details['faithfulness']['score']}")
print(f"Relevance: {result.details['relevance']['score']}")
```

Consistency testing:

```python
from llm_eval import ConsistencyEvaluator, OllamaRunner
runner = OllamaRunner(model="llama3.2")
evaluator = ConsistencyEvaluator(n_runs=5, similarity_threshold=0.85)

result = evaluator.evaluate(
    prompt="What is 2 + 2?",
    response="",  # Not used when run_fn is provided
    run_fn=runner.get_run_function()
)

print(f"Consistency: {result.score:.2%}")
print(f"Consistent pairs: {result.details['consistent_pairs']}/{result.details['total_pairs']}")
```

CLI usage:

```bash
# List available datasets
llm-eval datasets

# List available evaluators
llm-eval evaluators

# Run evaluation with specific evaluators
llm-eval run \
  --model ollama/llama3.2 \
  --dataset simple_qa \
  --evaluators accuracy,latency,cost \
  --output results.json

# Generate HTML report
llm-eval report --input results.json --output report.html --format html

# Compare models
llm-eval compare \
  --models ollama/llama3.2,openai/gpt-4o-mini \
  --dataset reasoning \
  --limit 10
```

Create an eval_config.yaml file:

```yaml
model: ollama/llama3.2
dataset: simple_qa

evaluators:
  - accuracy
  - consistency
  - llm_judge

accuracy:
  mode: contains
  case_sensitive: false

consistency:
  n_runs: 5
  similarity_threshold: 0.85

llm_judge:
  criteria:
    - Is the response helpful?
    - Is it accurate?
  min_passing_score: 3.5

output:
  format: html
  path: evaluation_report.html
```

Then run:

```bash
llm-eval run --config eval_config.yaml
```

Built-in datasets:

| Dataset | Samples | Categories |
|---|---|---|
| simple_qa | 5 | Geography, Math, Literature, Science, History |
| reasoning | 2 | Logic, Math reasoning |
| coding | 2 | Python |
| safety | 2 | Harmful requests, Legitimate security |
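For example, to smoke-test the bundled safety dataset with a single evaluator (all flags as documented in the CLI section above):

```bash
llm-eval run --model ollama/llama3.2 --dataset safety --evaluators accuracy
```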
Create custom datasets as JSON or YAML:

```json
{
  "name": "my_evals",
  "samples": [
    {
      "id": "test_001",
      "prompt": "What is the capital of France?",
      "expected": "Paris",
      "expected_facts": ["Paris"],
      "category": "geography"
    }
  ]
}
```
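The same sample expressed as YAML, assuming the loader accepts the identical field names shown in the JSON example:

```yaml
name: my_evals
samples:
  - id: test_001
    prompt: What is the capital of France?
    expected: Paris
    expected_facts:
      - Paris
    category: geography
```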
To add a custom evaluator, subclass BaseEvaluator:

```python
from llm_eval.evaluators.base import BaseEvaluator, EvalResult, EvalStatus

class MyCustomEvaluator(BaseEvaluator):
    def evaluate(self, prompt: str, response: str, **kwargs) -> EvalResult:
        # Your evaluation logic goes here; a trivial non-empty check
        # keeps the template runnable
        score = 1.0 if response.strip() else 0.0
        return self._create_result(
            score=score,
            status=EvalStatus.PASSED if score > 0.8 else EvalStatus.FAILED,
            message=f"Custom score: {score}",
            details={"custom_metric": score}
        )
```
To add a custom runner, subclass BaseRunner:

```python
from llm_eval.runners.base import BaseRunner, RunResult
import time

class MyCustomRunner(BaseRunner):
    def run(self, prompt: str, **kwargs) -> RunResult:
        start = time.perf_counter()
        # Your LLM call goes here; an echo stub keeps the template runnable
        response = f"Echo: {prompt}"
        return RunResult(
            response=response,
            model=self.model,
            latency_ms=(time.perf_counter() - start) * 1000
        )
```
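Custom runners and evaluators compose exactly like the built-in ones. Given the two templates above, the wiring looks like this (BaseRunner is assumed to accept a model name, mirroring the built-in runners):

```python
# MyCustomRunner and MyCustomEvaluator are the templates defined above;
# the model argument is an assumption based on the built-in runners.
runner = MyCustomRunner(model="my-model")
evaluator = MyCustomEvaluator()

result = runner.run("What is the capital of France?")
eval_result = evaluator.evaluate(
    prompt="What is the capital of France?",
    response=result.response,
)
print(eval_result.score, eval_result.status)
```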
Acknowledgments:

- Anthropic's evaluation practices: The LLM-as-judge pattern and evaluation criteria are inspired by Anthropic's approach to Constitutional AI and model evaluation
- OpenAI Evals: The framework structure draws from OpenAI's open-source evaluation framework
- RAGAS: The RAG evaluation metrics (retrieval, faithfulness, relevance) build on the RAGAS research
Existing evaluation tools tend to be:

- Too complex for quick experiments (OpenAI Evals)
- Too narrow in scope (single evaluator)
- Poorly suited to local models
This framework provides:
- Simple API: Get started in minutes
- Local-first: Works great with Ollama
- Production-ready patterns: LLM-as-judge, RAG eval, consistency testing
- Extensible: Easy to add custom evaluators
Contributions welcome! Areas of interest:
- Additional evaluators (toxicity, bias, etc.)
- More reporter formats
- Performance optimizations
- Documentation improvements
License: MIT
