| layout | default |
|---|---|
| title | Langfuse Tutorial - Chapter 4: Evaluation |
| nav_order | 4 |
| has_children | false |
| parent | Langfuse Tutorial |
Welcome to Chapter 4: Evaluation. In this part of the tutorial, you will build an intuitive mental model first, then move into concrete implementation details and practical production tradeoffs.
Use LLM judges and human feedback to measure and improve output quality.
Building an LLM application is only half the job. The other half is knowing whether it actually works well. Evaluation helps you quantify output quality so you can iterate with confidence. Langfuse supports LLM-as-judge workflows, human annotations, and automated evaluation pipelines. In this chapter you will:
- Understand the evaluation pipeline from trace to score.
- Learn common evaluation frameworks (factuality, helpfulness, safety).
- Build multiple LLM-as-judge patterns.
- Set up human-in-the-loop evaluation workflows.
- Run continuous evaluation on production traces.
- Walk through a complete evaluation pipeline example.
Evaluation in Langfuse follows a straightforward pipeline. Traces flow in from your application, evaluators score them, and the results appear in dashboards where you can filter, compare, and act:
graph LR
A[App sends Traces] --> B[Select Traces<br/>by tag, name, time]
B --> C{Evaluator Type}
C -->|Automated| D[LLM-as-Judge<br/>scores 0-1]
C -->|Manual| E[Human Annotation<br/>via UI queue]
C -->|Programmatic| F[Code-based Checks<br/>regex, length, format]
D --> G[Scores attached<br/>to Traces]
E --> G
F --> G
G --> H[Dashboard<br/>filter, compare, alert]
style A fill:#e0f2fe,stroke:#0284c7
style D fill:#fef3c7,stroke:#d97706
style E fill:#f3e8ff,stroke:#9333ea
style F fill:#dcfce7,stroke:#16a34a
style G fill:#fce7f3,stroke:#db2777
style H fill:#e0f2fe,stroke:#0284c7
The key insight is that scores are first-class objects in Langfuse. Every score is attached to a trace (and optionally to a specific span), has a name, a numeric or categorical value, and an optional comment. This makes it easy to filter, aggregate, and compare.
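As a concrete illustration, attaching a score with the Python SDK is a single call; the trace and observation IDs below are placeholders for values your application already has:

```python
from langfuse import Langfuse

langfuse = Langfuse()

# Attach a numeric score to a trace. observation_id is optional and scopes
# the score to a specific span or generation inside the trace.
langfuse.score(
    trace_id="trace-id-from-your-app",   # placeholder
    observation_id="generation-id",      # placeholder, optional
    name="helpfulness",
    value=0.8,
    comment="Answers the question directly but omits one edge case.",
)
```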
Before writing any evaluation code, decide what dimensions of quality matter for your application. Here are the most common frameworks:
Does the output contain accurate information? This is critical for knowledge-intensive applications like RAG, search, and question answering.
| Score | Meaning |
|---|---|
| 1.0 | Fully correct, all claims supported by the context |
| 0.7 | Mostly correct, minor inaccuracies |
| 0.3 | Partially correct, some unsupported claims |
| 0.0 | Incorrect or hallucinated |
Does the output actually address the user's need? A factually correct answer that misses the point is not helpful.
| Score | Meaning |
|---|---|
| 1.0 | Directly answers the question, actionable |
| 0.7 | Addresses the question but lacks detail |
| 0.3 | Tangentially related |
| 0.0 | Does not address the question at all |
Does the output avoid harmful, biased, or inappropriate content?
| Score | Meaning |
|---|---|
| 1.0 | Completely safe and appropriate |
| 0.7 | Minor concerns (e.g., slightly informal for the context) |
| 0.3 | Contains problematic content |
| 0.0 | Harmful, biased, or dangerous |
For RAG applications: did the retrieval step surface the right documents?
| Score | Meaning |
|---|---|
| 1.0 | All retrieved documents are directly relevant |
| 0.7 | Most documents are relevant |
| 0.3 | Few documents are relevant |
| 0.0 | No relevant documents retrieved |
Is the output well-structured and easy to follow? This matters for long-form generation, summaries, and explanations.
You can define as many evaluation dimensions as you need. The key is consistency -- use the same scale and criteria across your team.
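One lightweight way to enforce that consistency is to define the rubric once as data and generate judge instructions from it. The snippet below is an illustrative sketch; the criterion names and wording are assumptions to adapt, not a Langfuse API:

```python
# Shared rubric: criterion name -> description, reused by every judge prompt.
RUBRIC = {
    "factuality": "Are all claims supported by the provided context?",
    "helpfulness": "Does the answer directly address the user's need?",
    "safety": "Is the answer free of harmful, biased, or inappropriate content?",
}

def build_judge_instructions(criteria: list[str]) -> str:
    """Render judge instructions from the shared rubric on a 0.0-1.0 scale."""
    lines = "\n".join(f"- {name}: {RUBRIC[name]}" for name in criteria)
    return (
        "You are an evaluation judge. Score the answer on each criterion "
        f"from 0.0 to 1.0:\n{lines}\n"
        "Return ONLY valid JSON with one key per criterion."
    )
```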
The most powerful automated evaluation approach uses another LLM to judge your application's output. Here are several patterns, from simple to sophisticated.
The simplest pattern asks the judge to return a single number:
from langfuse import Langfuse
from openai import OpenAI
langfuse = Langfuse()
client = OpenAI()
def judge_helpfulness(question: str, answer: str) -> float:
"""Score an answer's helpfulness on a 0-1 scale."""
resp = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{
"role": "system",
"content": (
"You are an evaluation judge. Score the following answer on "
"helpfulness (0.0 to 1.0). Consider whether it directly addresses "
"the question, is actionable, and is easy to understand.\n\n"
"Return ONLY a decimal number between 0.0 and 1.0."
),
},
{
"role": "user",
"content": f"Question: {question}\n\nAnswer: {answer}",
},
],
temperature=0,
)
    return float(resp.choices[0].message.content.strip())

A judge that explains its reasoning is more trustworthy and easier to debug:
import json
def judge_with_reasoning(question: str, answer: str, criteria: str) -> dict:
"""Score an answer and provide reasoning."""
resp = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{
"role": "system",
"content": (
"You are an evaluation judge. Evaluate the answer based on the "
"given criteria. Return a JSON object with two fields:\n"
'- "score": a decimal between 0.0 and 1.0\n'
'- "reasoning": a brief explanation of your score\n\n'
"Return ONLY valid JSON."
),
},
{
"role": "user",
"content": (
f"Criteria: {criteria}\n\n"
f"Question: {question}\n\n"
f"Answer: {answer}"
),
},
],
temperature=0,
)
return json.loads(resp.choices[0].message.content.strip())
# Usage
result = judge_with_reasoning(
question="How do I reset my password?",
answer="Go to Settings > Security > Change Password.",
criteria="helpfulness: does the answer directly address the question?",
)
# result = {"score": 0.9, "reasoning": "Directly provides the steps needed."}

When you have a ground truth or reference answer, you can ask the judge to compare:
def judge_factuality(question: str, answer: str, reference: str) -> dict:
"""Score factual correctness against a reference answer."""
resp = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{
"role": "system",
"content": (
"Compare the candidate answer to the reference answer. "
"Score factual correctness from 0.0 to 1.0.\n"
"- 1.0: All facts match the reference.\n"
"- 0.5: Some facts match, some are missing or wrong.\n"
"- 0.0: Contradicts the reference or is entirely wrong.\n\n"
"Return JSON with 'score' and 'reasoning' fields."
),
},
{
"role": "user",
"content": (
f"Question: {question}\n\n"
f"Reference answer: {reference}\n\n"
f"Candidate answer: {answer}"
),
},
],
temperature=0,
)
    return json.loads(resp.choices[0].message.content.strip())

Evaluate multiple dimensions in a single call to reduce cost and latency:
def judge_multi_criteria(question: str, answer: str, context: str) -> dict:
"""Evaluate an answer on multiple criteria at once."""
resp = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{
"role": "system",
"content": (
"Evaluate the answer on these criteria (0.0 to 1.0 each):\n"
"1. factuality: Are all claims supported by the context?\n"
"2. helpfulness: Does it address the user's question?\n"
"3. coherence: Is it well-structured and clear?\n"
"4. safety: Is it free of harmful content?\n\n"
"Return a JSON object with keys: factuality, helpfulness, "
"coherence, safety. Each value is a decimal 0.0-1.0."
),
},
{
"role": "user",
"content": (
f"Context: {context}\n\n"
f"Question: {question}\n\n"
f"Answer: {answer}"
),
},
],
temperature=0,
)
return json.loads(resp.choices[0].message.content.strip())
# Usage
scores = judge_multi_criteria(
question="How do I reset my password?",
answer="Go to Settings > Security. Click Change Password.",
context="Users can reset their password via Settings > Security > Change Password.",
)
# Attach all scores to the trace
for criterion, value in scores.items():
    trace.score(name=criterion, value=value)  # `trace` is a trace object created earlier, e.g. via langfuse.trace(...)

A few best practices for LLM-as-judge evaluators:

- Use low temperature (0 or 0.1) for consistent scores.
- Be explicit about the scale: "0.0 to 1.0" is clearer than "score this."
- Ask for JSON output so you can parse results reliably.
- Include reasoning to make judgments auditable.
- Use a strong model for judging (GPT-4o or Claude) even if your application uses a smaller model.
- Validate the output: wrap the judge call in a try/except and handle parsing failures gracefully.
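Putting the last point into practice, a defensive wrapper around any judge call might look like the sketch below; `judge_fn` is a placeholder for one of the judge functions defined earlier:

```python
import json

def safe_judge(judge_fn, default: float = 0.5, **kwargs) -> dict:
    """Call a judge function, validate its output, and fail gracefully.

    judge_fn is any callable returning a dict with a "score" field (for
    example judge_with_reasoning above). On parsing or validation errors,
    return a neutral default instead of crashing the evaluation pipeline.
    """
    try:
        result = judge_fn(**kwargs)
        score = float(result["score"])
        # Clamp out-of-range values rather than trusting the model blindly.
        result["score"] = min(max(score, 0.0), 1.0)
        return result
    except (KeyError, TypeError, ValueError, json.JSONDecodeError) as exc:
        return {"score": default, "reasoning": f"judge failed: {exc}"}

# Usage:
# result = safe_judge(judge_with_reasoning, question=q, answer=a, criteria="helpfulness")
```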
Automated judges are fast but imperfect. Human evaluation provides the ground truth that calibrates your automated systems.
Langfuse provides annotation queues where team members can review traces and assign scores:
- Go to Scores in the sidebar.
- Create an annotation queue (e.g., "Weekly QA Review").
- Configure filters to select which traces enter the queue (e.g., `tag = production`, `score:helpfulness < 0.7`).
- Team members open the queue, review each trace, and assign scores.
You can also collect feedback from your end users and pipe it back to Langfuse:
# In your application's feedback endpoint
def submit_user_feedback(trace_id: str, thumbs_up: bool, comment: str = ""):
"""Record user feedback as a score on the trace."""
langfuse.score(
trace_id=trace_id,
name="user_feedback",
value=1.0 if thumbs_up else 0.0,
comment=comment,
    )

A minimal feedback flow:
- When your app returns a response, include the `trace_id` in the response metadata.
- Show a thumbs-up / thumbs-down button in the UI.
- When the user clicks, call your backend, which calls `langfuse.score(...)`.
# FastAPI example
from fastapi import FastAPI
from pydantic import BaseModel
from langfuse import Langfuse
langfuse = Langfuse()
app = FastAPI()
class FeedbackRequest(BaseModel):
trace_id: str
thumbs_up: bool
comment: str = ""
@app.post("/feedback")
def feedback(req: FeedbackRequest):
langfuse.score(
trace_id=req.trace_id,
name="user_feedback",
value=1.0 if req.thumbs_up else 0.0,
comment=req.comment,
)
    return {"status": "ok"}

The best evaluation systems use both:
- Automated judges run on every trace for broad coverage.
- Human reviewers audit a sample for calibration.
- Compare human scores to automated scores to measure judge accuracy.
- Retrain or adjust judge prompts when human and automated scores diverge.
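A simple way to quantify that calibration is to compare paired human and judge scores on the same traces. The sketch below assumes you have already pulled both sets of scores into aligned lists:

```python
def judge_agreement(
    human: list[float], automated: list[float], tolerance: float = 0.2
) -> dict:
    """Compare human scores to automated judge scores on the same traces.

    Both lists are assumed to use the same 0-1 scale and to be aligned by
    trace. Returns the mean absolute difference and the fraction of traces
    where the two scores agree within the given tolerance.
    """
    assert human and len(human) == len(automated), "need paired, non-empty score lists"
    diffs = [abs(h - a) for h, a in zip(human, automated)]
    return {
        "mean_abs_diff": sum(diffs) / len(diffs),
        "agreement_rate": sum(d <= tolerance for d in diffs) / len(diffs),
    }

# If agreement_rate drops over time, revisit the judge prompt or its criteria.
stats = judge_agreement([1.0, 0.7, 0.3], [0.9, 0.4, 0.3])
```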
Not every check needs an LLM. Some quality signals can be computed with simple code:
def evaluate_format(output: str) -> dict:
"""Check basic format requirements."""
scores = {}
# Length check
word_count = len(output.split())
scores["appropriate_length"] = 1.0 if 10 <= word_count <= 500 else 0.0
# No PII leak (simple regex check)
import re
has_email = bool(re.search(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}", output))
has_phone = bool(re.search(r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b", output))
scores["no_pii_leak"] = 0.0 if (has_email or has_phone) else 1.0
# Contains required disclaimer
scores["has_disclaimer"] = 1.0 if "not financial advice" in output.lower() else 0.0
return scores
# Attach to trace
format_scores = evaluate_format(llm_output)
for name, value in format_scores.items():
    trace.score(name=name, value=value)

Programmatic checks are fast, free, and deterministic. Use them for formatting, length, PII detection, profanity filters, and any rule that can be expressed in code.
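For example, if your application promises structured JSON output, a format-compliance check might look like the sketch below; the required keys are placeholders for whatever schema you actually guarantee:

```python
import json

def evaluate_json_format(
    output: str, required_keys: tuple[str, ...] = ("title", "summary")
) -> float:
    """Return 1.0 if the output is valid JSON containing the required keys, else 0.0.

    The default required_keys are placeholders; use the fields your
    downstream consumers actually depend on.
    """
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(data, dict):
        return 0.0
    return 1.0 if all(key in data for key in required_keys) else 0.0
```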
Running evaluations manually is fine for development, but production systems need continuous, automated evaluation.
Here is a script that fetches recent traces and evaluates them:
from langfuse import Langfuse
from openai import OpenAI
from datetime import datetime, timedelta
import json
langfuse = Langfuse()
client = OpenAI()
def evaluate_recent_traces(hours: int = 24, tag: str = "production"):
"""Fetch and evaluate traces from the last N hours."""
traces = langfuse.fetch_traces(
tags=[tag],
from_timestamp=datetime.now() - timedelta(hours=hours),
limit=100,
)
results = {"total": 0, "evaluated": 0, "avg_helpfulness": 0}
for trace in traces.data:
# Skip traces that already have a helpfulness score
existing_scores = [s.name for s in (trace.scores or [])]
if "helpfulness" in existing_scores:
continue
# Find the LLM span output
if not trace.output:
continue
# Run the judge
try:
resp = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{
"role": "system",
"content": (
"Score this support response for helpfulness (0.0-1.0). "
"Return JSON: {\"score\": 0.X, \"reasoning\": \"...\"}"
),
},
{"role": "user", "content": str(trace.output)},
],
temperature=0,
)
result = json.loads(resp.choices[0].message.content)
langfuse.score(
trace_id=trace.id,
name="helpfulness",
value=result["score"],
comment=result.get("reasoning", ""),
)
results["evaluated"] += 1
results["avg_helpfulness"] += result["score"]
except Exception as e:
print(f"Failed to evaluate trace {trace.id}: {e}")
results["total"] += 1
if results["evaluated"] > 0:
results["avg_helpfulness"] /= results["evaluated"]
return results
# Run daily
stats = evaluate_recent_traces(hours=24)
print(f"Evaluated {stats['evaluated']}/{stats['total']} traces")
print(f"Average helpfulness: {stats['avg_helpfulness']:.2f}")

Run the batch script on a schedule using your preferred task runner:
- Cron job: `0 2 * * * python evaluate.py` (runs at 2 AM daily).
- GitHub Actions: scheduled workflow with a cron trigger.
- Airflow / Dagster / Prefect: integrate into your existing data pipeline.
Combine continuous evaluation with alerts to catch regressions early. The check below calls a send_slack_alert helper that is not defined elsewhere in this chapter.
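A minimal sketch of such a helper, assuming a Slack incoming webhook URL stored in an environment variable:

```python
import os
import requests  # assumes the requests package is available in your environment

def send_slack_alert(message: str) -> None:
    """Post an alert to a Slack channel via an incoming webhook.

    SLACK_WEBHOOK_URL is a hypothetical environment variable; set it to the
    incoming-webhook URL for your alerts channel.
    """
    webhook_url = os.environ["SLACK_WEBHOOK_URL"]
    requests.post(webhook_url, json={"text": message}, timeout=10)
```

With that helper in place, the daily quality check is straightforward: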
def check_quality_and_alert(min_threshold: float = 0.6):
"""Alert if average helpfulness drops below threshold."""
stats = evaluate_recent_traces(hours=24)
if stats["avg_helpfulness"] < min_threshold:
send_slack_alert(
f"Quality alert: average helpfulness dropped to "
f"{stats['avg_helpfulness']:.2f} (threshold: {min_threshold}). "
f"Evaluated {stats['evaluated']} traces."
        )

Let's put everything together in a complete example that evaluates a RAG application on multiple dimensions:
from langfuse import Langfuse
from langfuse.decorators import observe, langfuse_context
from openai import OpenAI
import json
langfuse = Langfuse()
client = OpenAI()
# -- The Application --
@observe()
def rag_answer(question: str, user_id: str) -> dict:
"""A simple RAG pipeline that returns the answer and trace ID."""
langfuse_context.update_current_trace(
user_id=user_id,
tags=["rag", "production"],
)
# Simulated retrieval
context = [
"To reset your password, go to Settings > Security > Change Password.",
"If you forgot your password, use the 'Forgot Password' link on the login page.",
]
resp = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{
"role": "system",
"content": f"Answer using this context:\n{json.dumps(context)}",
},
{"role": "user", "content": question},
],
)
answer = resp.choices[0].message.content
langfuse_context.update_current_observation(
usage={
"input": resp.usage.prompt_tokens,
"output": resp.usage.completion_tokens,
},
)
return {"answer": answer, "context": context}
# -- The Evaluators --
def evaluate_trace(trace_id: str, question: str, answer: str, context: list[str]):
"""Run all evaluators on a single trace."""
# 1. Multi-criteria LLM judge
judge_resp = client.chat.completions.create(
model="gpt-4o-mini",
messages=[
{
"role": "system",
"content": (
"Evaluate the answer on these criteria (0.0 to 1.0 each):\n"
"- factuality: Are all claims supported by the context?\n"
"- helpfulness: Does it address the user's question?\n"
"- coherence: Is it clear and well-structured?\n"
"- safety: Is it free of harmful content?\n\n"
"Return ONLY a JSON object with these four keys."
),
},
{
"role": "user",
"content": (
f"Context: {json.dumps(context)}\n\n"
f"Question: {question}\n\n"
f"Answer: {answer}"
),
},
],
temperature=0,
)
try:
llm_scores = json.loads(judge_resp.choices[0].message.content)
except json.JSONDecodeError:
llm_scores = {}
# Attach LLM judge scores
for criterion, value in llm_scores.items():
langfuse.score(
trace_id=trace_id,
name=criterion,
value=float(value),
)
# 2. Programmatic checks
import re
word_count = len(answer.split())
langfuse.score(
trace_id=trace_id,
name="appropriate_length",
value=1.0 if 5 <= word_count <= 300 else 0.0,
)
has_pii = bool(re.search(r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b", answer))
langfuse.score(
trace_id=trace_id,
name="no_pii_leak",
value=0.0 if has_pii else 1.0,
)
return llm_scores
# -- Run the Pipeline --
# Step 1: Generate an answer
question = "How do I reset my password?"
result = rag_answer(question, user_id="user_789")
langfuse_context.flush()  # flush the decorator-managed client so the trace is sent
# Step 2: Get the trace ID (ingestion is asynchronous, so the trace may take a moment to appear)
traces = langfuse.fetch_traces(tags=["rag"], limit=1)
trace_id = traces.data[0].id
# Step 3: Evaluate
scores = evaluate_trace(
trace_id=trace_id,
question=question,
answer=result["answer"],
context=result["context"],
)
print("Evaluation scores:")
for criterion, value in scores.items():
print(f" {criterion}: {value}")
langfuse.flush()

When you open this trace in the Langfuse UI, you will see the answer alongside six scores: four from the LLM judge (factuality, helpfulness, coherence, safety) and two from programmatic checks (appropriate_length, no_pii_leak).
Here is a reference of common metrics and when to use them:
| Metric | Type | Best for | Scale |
|---|---|---|---|
| Factuality | LLM judge | RAG, Q&A | 0-1 |
| Helpfulness | LLM judge | All user-facing apps | 0-1 |
| Coherence | LLM judge | Long-form generation | 0-1 |
| Safety | LLM judge | All apps | 0-1 |
| Relevance | LLM judge | RAG retrieval quality | 0-1 |
| Appropriate length | Programmatic | All apps | 0 or 1 |
| No PII leak | Programmatic | Apps handling user data | 0 or 1 |
| Format compliance | Programmatic | Structured output apps | 0 or 1 |
| User feedback | Human | All user-facing apps | 0 or 1 |
| Expert review | Human | High-stakes domains | 0-1 |
- Use consistent scoring scales across all evaluators (0-1 recommended).
- Start simple: a single helpfulness judge is better than no evaluation at all.
- Calibrate with humans: periodically compare automated scores to human judgments.
- Version your judge prompts in Langfuse, just like your application prompts (see the sketch after this list).
- Track cost of evaluation: LLM judges have their own token costs. Budget for it.
- Correlate with business metrics: compare evaluation scores against user retention, conversion, or support escalation rates.
- Re-evaluate after changes: run evaluations after prompt updates, model swaps, or pipeline changes.
- Do not over-optimize for a single metric: balance factuality, helpfulness, safety, and cost.
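To version a judge prompt in Langfuse, you can manage it with prompt management just like an application prompt. In the sketch below, the prompt name `helpfulness-judge` and its `{{question}}`/`{{answer}}` variables are assumptions about how you would set it up (see Chapter 3):

```python
from langfuse import Langfuse

langfuse = Langfuse()

# Fetch the current production version of the judge prompt.
# "helpfulness-judge" is a hypothetical prompt created via the UI or
# langfuse.create_prompt(...), containing {{question}} and {{answer}} variables.
judge_prompt = langfuse.get_prompt("helpfulness-judge")

# Fill in the variables before sending the instructions to the judge model.
judge_instructions = judge_prompt.compile(
    question="How do I reset my password?",
    answer="Go to Settings > Security > Change Password.",
)
```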
- The evaluation pipeline from traces to scores to dashboards.
- Common evaluation frameworks: factuality, helpfulness, safety, relevance, coherence.
- Four LLM-as-judge patterns (simple, with reasoning, reference-based, multi-criteria).
- How to set up human-in-the-loop evaluation with annotation queues and user feedback.
- Programmatic checks for format, length, and PII detection.
- How to run continuous evaluation with batch scripts and alerting.
- A complete evaluation pipeline example combining automated and programmatic scoring.
| Previous: Chapter 3 -- Prompt Management | Next: Analytics and Cost Tracking |
Most teams struggle here because the hard part is not writing more evaluation code, but deciding clear, consistent scoring criteria and drawing the boundary between automated and human review so behavior stays predictable as complexity grows.
In practical terms, this chapter helps you avoid three common failures:
- coupling evaluation logic too tightly to a single judge model or prompt
- missing the handoff boundaries between trace collection, scoring, and review
- shipping prompt or pipeline changes without a clear rollback or observability strategy
After working through this chapter, you should be able to reason about evaluation as an operating subsystem of your LLM application, with explicit contracts for inputs, state transitions, and outputs.
Use the implementation notes around `langfuse.score()`, trace IDs, and judge prompts as your checklist when adapting these patterns to your own repository.
Under the hood, an evaluation pipeline usually follows a repeatable control path:
- Context bootstrap: initialize the Langfuse client, judge-model credentials, and other prerequisites.
- Input normalization: shape the trace data (question, context, answer) so each evaluator receives a stable contract.
- Core execution: run the LLM judges and programmatic checks and collect their intermediate scores.
- Policy and safety checks: enforce limits, auth scopes, and failure boundaries.
- Output composition: attach canonical scores back to the trace for downstream consumers.
- Operational telemetry: emit logs/metrics needed for debugging and performance tuning.
When debugging, walk this sequence in order and confirm each stage has explicit success/failure conditions.
Use the following upstream sources to verify implementation details while reading this chapter:
- Langfuse Repository (github.com). Why it matters: the authoritative reference for the SDK and server implementation.
- Langfuse Releases (github.com). Why it matters: tracks SDK and API changes that may affect the evaluation APIs used here.
- Langfuse Docs (langfuse.com). Why it matters: the official documentation for scores, evaluations, and annotation queues.
Suggested trace strategy:
- Search the upstream code and docs for `score` and `fetch_traces` to map concrete implementation paths.
- Compare docs claims against actual runtime/config code before reusing patterns in production.