LLM Eval Harness

A comprehensive framework for evaluating LLM outputs with LLM-as-judge, consistency testing, and RAG evaluation patterns.

[Screenshot: Evaluation Report]

Why This Exists

How do you know your LLM is working correctly?

Production AI systems need rigorous evaluation. This framework provides the tools to:

  • Measure accuracy, consistency, and quality of LLM outputs
  • Evaluate RAG systems for retrieval quality, faithfulness, and relevance
  • Compare models across multiple dimensions (quality, cost, latency)
  • Use the LLM-as-judge pattern for scalable evaluation

Built for engineering managers and AI teams who need to ship reliable AI systems.

Quick Start

# Install
pip install -e .

# Check your model is available
llm-eval check --model ollama/llama3.2

# Run your first evaluation
llm-eval run --model ollama/llama3.2 --dataset simple_qa

# Compare models
llm-eval compare --models ollama/llama3.2,ollama/mistral --dataset simple_qa

Features

Evaluators

  • AccuracyEvaluator: Check factual accuracy against expected answers
  • ConsistencyEvaluator: Test if model gives consistent responses across runs
  • LatencyEvaluator: Measure response times against SLO thresholds
  • CostEvaluator: Track token usage and estimate costs
  • LLMJudgeEvaluator: Use LLM-as-judge pattern for quality assessment
  • RAGEvaluator: Evaluate retrieval, faithfulness, and relevance
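
Every evaluator exposes the same evaluate(prompt, response, **kwargs) -> EvalResult interface (see Extending below), so several evaluators can score the same response in one loop. A minimal sketch, assuming evaluators simply ignore keyword arguments they do not use:

from llm_eval import OllamaRunner, AccuracyEvaluator, ConsistencyEvaluator

runner = OllamaRunner(model="llama3.2")
result = runner.run("What is the capital of France?")

# Each evaluator returns an EvalResult with score, status, message, and details
evaluators = [
    AccuracyEvaluator(mode="contains"),
    ConsistencyEvaluator(n_runs=3, similarity_threshold=0.85),
]

for evaluator in evaluators:
    eval_result = evaluator.evaluate(
        prompt="What is the capital of France?",
        response=result.response,
        expected_facts=["Paris"],          # used by AccuracyEvaluator
        run_fn=runner.get_run_function(),  # used by ConsistencyEvaluator
    )
    print(f"{type(evaluator).__name__}: {eval_result.score}")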

Runners

  • OllamaRunner: Local inference with Ollama
  • OpenAIRunner: OpenAI API (GPT-4, GPT-3.5, etc.)
  • AnthropicRunner: Anthropic API (Claude 3.5, Claude 3)
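
All runners implement run(prompt) -> RunResult, so switching backends is a one-line change. A minimal sketch (the Anthropic model name below is an assumption; pass whichever model your account can access):

from llm_eval import OllamaRunner, OpenAIRunner, AnthropicRunner

# Pick a backend; the rest of the code stays the same
runner = OllamaRunner(model="llama3.2")
# runner = OpenAIRunner(model="gpt-4o-mini")
# runner = AnthropicRunner(model="claude-3-5-sonnet-latest")  # model name is an assumption

result = runner.run("Summarize the return policy in one sentence.")
print(result.response)
print(f"{result.latency_ms:.0f} ms")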

Reporters

  • ConsoleReporter: Beautiful terminal output with rich formatting
  • JSONReporter: Machine-readable output for automation
  • HTMLReporter: Visual reports with charts
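
Reporters are normally driven through the CLI (llm-eval report, shown below), but a rough sketch of calling one directly might look like this; the constructor argument and report() method name are assumptions, not documented API:

from llm_eval import JSONReporter

# Hypothetical usage: output_path and report() are assumed names
reporter = JSONReporter(output_path="results.json")
reporter.report(eval_results)  # eval_results: a list of EvalResult objects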

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        CLI Interface                             │
│                    llm-eval run/compare/report                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐  │
│  │ Dataset  │───▶│  Runner  │───▶│Evaluators│───▶│ Reporter │  │
│  │  Loader  │    │          │    │          │    │          │  │
│  └──────────┘    └──────────┘    └──────────┘    └──────────┘  │
│       │               │               │               │         │
│       ▼               ▼               ▼               ▼         │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐     │
│  │ JSON/CSV │   │  Ollama  │   │ Accuracy │   │  Console │     │
│  │   YAML   │   │  OpenAI  │   │   Cost   │   │   JSON   │     │
│  │    HF    │   │ Anthropic│   │  Judge   │   │   HTML   │     │
│  └──────────┘   └──────────┘   └──────────┘   └──────────┘     │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
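
In code, the pipeline reads left to right just like the diagram: load samples, generate responses with a runner, score them with evaluators, then report. A minimal sketch of that loop, using an inline sample list in place of the dataset loader:

from llm_eval import OllamaRunner, AccuracyEvaluator

# Inline stand-in for the Dataset Loader stage
samples = [
    {"prompt": "What is the capital of France?", "expected_facts": ["Paris"]},
    {"prompt": "What is 2 + 2?", "expected_facts": ["4"]},
]

runner = OllamaRunner(model="llama3.2")
evaluator = AccuracyEvaluator(mode="contains")

results = []
for sample in samples:
    run = runner.run(sample["prompt"])
    results.append(evaluator.evaluate(
        prompt=sample["prompt"],
        response=run.response,
        expected_facts=sample["expected_facts"],
    ))

print(f"Mean score: {sum(r.score for r in results) / len(results):.2f}")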

Usage Examples

Basic Evaluation

from llm_eval import OllamaRunner, AccuracyEvaluator

# Initialize
runner = OllamaRunner(model="llama3.2")
evaluator = AccuracyEvaluator(mode="contains")

# Run
result = runner.run("What is the capital of France?")
eval_result = evaluator.evaluate(
    prompt="What is the capital of France?",
    response=result.response,
    expected_facts=["Paris"]
)

print(f"Score: {eval_result.score}")  # 1.0 if Paris is mentioned

LLM-as-Judge Pattern

from llm_eval import OpenAIRunner, LLMJudgeEvaluator

# Use GPT-4 as the judge
judge_runner = OpenAIRunner(model="gpt-4o")
evaluator = LLMJudgeEvaluator(
    judge_fn=judge_runner.get_run_function(),
    criteria=[
        "Is the response helpful?",
        "Is it accurate?",
        "Is it well-structured?",
    ]
)

# Evaluate any response
result = evaluator.evaluate(
    prompt="Explain quantum computing",
    response="Quantum computing uses qubits..."
)

print(f"Judge Score: {result.score}")
print(f"Details: {result.details}")

RAG Evaluation

from llm_eval import RAGEvaluator, OllamaRunner

runner = OllamaRunner(model="llama3.2")
evaluator = RAGEvaluator(
    retrieval_threshold=0.7,
    faithfulness_threshold=0.8,
    use_llm_judge=True
)

result = evaluator.evaluate(
    prompt="What is the return policy?",
    response="Returns accepted within 30 days.",
    retrieved_docs=[
        {"content": "Return Policy: 30 day returns.", "score": 0.95}
    ],
    judge_fn=runner.get_run_function()
)

# Access component scores
print(f"Retrieval: {result.details['retrieval']['score']}")
print(f"Faithfulness: {result.details['faithfulness']['score']}")
print(f"Relevance: {result.details['relevance']['score']}")

Consistency Testing

from llm_eval import ConsistencyEvaluator, OllamaRunner

runner = OllamaRunner(model="llama3.2")
evaluator = ConsistencyEvaluator(n_runs=5, similarity_threshold=0.85)

result = evaluator.evaluate(
    prompt="What is 2 + 2?",
    response="",  # Not used when run_fn provided
    run_fn=runner.get_run_function()
)

print(f"Consistency: {result.score:.2%}")
print(f"Consistent pairs: {result.details['consistent_pairs']}/{result.details['total_pairs']}")

CLI Usage

# List available datasets
llm-eval datasets

# List available evaluators
llm-eval evaluators

# Run evaluation with specific evaluators
llm-eval run \
  --model ollama/llama3.2 \
  --dataset simple_qa \
  --evaluators accuracy,latency,cost \
  --output results.json

# Generate HTML report
llm-eval report --input results.json --output report.html --format html

# Compare models
llm-eval compare \
  --models ollama/llama3.2,openai/gpt-4o-mini \
  --dataset reasoning \
  --limit 10

Configuration

Create an eval_config.yaml file:

model: ollama/llama3.2
dataset: simple_qa
evaluators:
  - accuracy
  - consistency
  - llm_judge

accuracy:
  mode: contains
  case_sensitive: false

consistency:
  n_runs: 5
  similarity_threshold: 0.85

llm_judge:
  criteria:
    - Is the response helpful?
    - Is it accurate?
  min_passing_score: 3.5

output:
  format: html
  path: evaluation_report.html

Then run:

llm-eval run --config eval_config.yaml

Built-in Datasets

Dataset     Samples   Categories
simple_qa   5         Geography, Math, Literature, Science, History
reasoning   2         Logic, Math reasoning
coding      2         Python
safety      2         Harmful requests, Legitimate security

Create custom datasets as JSON or YAML:

{
  "name": "my_evals",
  "samples": [
    {
      "id": "test_001",
      "prompt": "What is the capital of France?",
      "expected": "Paris",
      "expected_facts": ["Paris"],
      "category": "geography"
    }
  ]
}
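
Custom files can also be loaded directly in Python and fed through the same loop shown in the Architecture section; json here is the standard library, and the field names match the example above:

import json

with open("my_evals.json") as f:
    dataset = json.load(f)

# Each sample carries the fields the evaluators expect (prompt, expected_facts, ...)
for sample in dataset["samples"]:
    print(sample["id"], sample["prompt"], sample["expected_facts"])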

Extending

Custom Evaluator

from llm_eval.evaluators.base import BaseEvaluator, EvalResult, EvalStatus

class MyCustomEvaluator(BaseEvaluator):
    def evaluate(self, prompt: str, response: str, **kwargs) -> EvalResult:
        # Your evaluation logic; calculate_score is your own helper, not part of BaseEvaluator
        score = self.calculate_score(response)

        return self._create_result(
            score=score,
            status=EvalStatus.PASSED if score > 0.8 else EvalStatus.FAILED,
            message=f"Custom score: {score}",
            details={"custom_metric": score}
        )

Custom Runner

from llm_eval.runners.base import BaseRunner, RunResult
import time

class MyCustomRunner(BaseRunner):
    def run(self, prompt: str, **kwargs) -> RunResult:
        start = time.perf_counter()
        # Your LLM call logic; my_llm is a placeholder for your own client
        response = my_llm.generate(prompt)

        return RunResult(
            response=response,
            model=self.model,
            latency_ms=(time.perf_counter() - start) * 1000
        )

Inspired By

  • Anthropic's evaluation practices - The LLM-as-judge pattern and evaluation criteria are inspired by Anthropic's approach to Constitutional AI and model evaluation
  • OpenAI Evals - The framework structure draws from OpenAI's open-source evaluation framework
  • RAGAS - RAG evaluation metrics (retrieval, faithfulness, relevance) build on the RAGAS research

Why Build This?

Existing evaluation tools tend to be:

  • Too complex for quick experiments (OpenAI Evals)
  • Too narrow in scope (a single evaluator)
  • Weak at supporting local models

This framework provides:

  • Simple API - Get started in minutes
  • Local-first - Works great with Ollama
  • Production-ready patterns - LLM-as-judge, RAG eval, consistency testing
  • Extensible - Easy to add custom evaluators

Contributing

Contributions welcome! Areas of interest:

  • Additional evaluators (toxicity, bias, etc.)
  • More reporter formats
  • Performance optimizations
  • Documentation improvements

License

MIT
