| title | Evaluation Developer Guide |
|---|---|
| description | Learn how to build custom evaluators using the LLM Observability SDK. |
This guide covers how to build custom evaluators with the LLM Observability SDK and use them in LLM Experiments and in production.
An evaluation measures a specific quality of your LLM application's output, such as accuracy, tone, or harmfulness. You write the evaluation logic inside an evaluator, which receives context about the LLM interaction and returns a result.
To test your LLM application against a dataset before deploying, run your evaluators in LLM Experiments. In Experiments, evaluators run automatically: the SDK calls your evaluator on each dataset record. Use evaluators through the SDK.
To monitor the quality of your live LLM responses, run evaluators in production. You can run evaluators manually with submit_evaluation(), or automatically with custom LLM-as-a-judge evaluations. Use evaluators through the SDK, HTTP API, or the Datadog UI.
For production, there are two approaches:
- Manual evaluations (this guide): You run evaluators in your application code and submit results with LLMObs.submit_evaluation() or the HTTP API. This gives you full control over evaluation logic and timing.
- Custom LLM-as-a-judge evaluations: You configure evaluations in the Datadog UI using natural language prompts. Datadog automatically runs them on production traces in real time, with no code changes required.
This guide focuses on manual evaluations. For managed LLM-as-a-judge evaluations, see Custom LLM-as-a-Judge Evaluations.
The evaluation system has four main components:
- EvaluatorContext: The input to an evaluator. Contains the LLM's input, output, expected output, and span identifiers. In Experiments, the SDK builds this automatically from each dataset record. In production, you construct the EvaluatorContext yourself.
- EvaluatorResult: The output of an evaluator. Contains a typed value, optional reasoning, a pass/fail assessment, metadata, and tags. You can also return a plain value (str, float, int, bool, dict) instead.
- Metric type: Determines how the evaluation value is interpreted and displayed: categorical (string labels), score (numeric), boolean (pass/fail), or json (structured data).
- SummaryEvaluatorContext (Experiments only): After all dataset records are evaluated, summary evaluators receive the aggregated results to compute statistics like averages or pass rates.
The typical flow:
- Experiments: Dataset record → EvaluatorContext → Evaluator → EvaluatorResult → (after all records) SummaryEvaluatorContext → Summary evaluator → summary result
- Production: Span data → EvaluatorContext (built manually) → Evaluator → EvaluatorResult → LLMObs.submit_evaluation() or HTTP API
There are two ways to define an evaluator using LLM Observability: class-based and function-based. In addition to these evaluators, LLM Observability has integrations with open source evaluation frameworks, such as DeepEval and Pydantic, that can be used in LLM Observability Experiments.
| | Class-based | Function-based |
|---|---|---|
| Best for | Reusable evaluators with custom configuration or state. | One-off evaluators with straightforward logic. |
| Receives | An EvaluatorContext object with full span context (input, output, expected output, metadata, span/trace IDs). | input_data, output_data, and expected_output as separate arguments. |
| Supports summary evaluators | Yes (BaseSummaryEvaluator). | No. |
If you are unsure, start with class-based evaluators. They can do everything function-based evaluators can, and also support custom configuration, state, and summary evaluators.
Class-based evaluators provide a structured way to implement reusable evaluation logic with custom configuration.
Subclass BaseEvaluator to create an evaluator that runs on a single span or dataset record. Implement the evaluate method, which receives an EvaluatorContext and returns an EvaluatorResult (or a plain value).
{{< code-block lang="python" >}}
from ddtrace.llmobs import BaseEvaluator, EvaluatorContext, EvaluatorResult


class SemanticSimilarityEvaluator(BaseEvaluator):
    """Evaluates semantic similarity between output and expected output."""

    def __init__(self, threshold: float = 0.8):
        super().__init__(name="semantic_similarity")
        self.threshold = threshold

    def evaluate(self, context: EvaluatorContext) -> EvaluatorResult:
        # compute_similarity is a placeholder for your own similarity function.
        score = compute_similarity(context.output_data, context.expected_output)
        return EvaluatorResult(
            value=score,
            reasoning=f"Similarity score: {score:.2f}",
            assessment="pass" if score >= self.threshold else "fail",
            metadata={"threshold": self.threshold},
            tags={"type": "semantic"},
        )
{{< /code-block >}}
- Call super().__init__(name="evaluator_name") to set the evaluator's label.
- Implement evaluate(context: EvaluatorContext) with your evaluation logic.
- Return an EvaluatorResult for rich results, or a plain value (str, float, int, bool, dict).
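The SemanticSimilarityEvaluator example above calls compute_similarity, which the snippet does not define and which is not part of the SDK. As a minimal sketch, assuming a character-level ratio is an acceptable stand-in for a real embedding-based metric, a hypothetical helper could be built on the standard library's difflib:

```python
from difflib import SequenceMatcher


def compute_similarity(output, expected) -> float:
    """Hypothetical helper: returns a similarity ratio in [0.0, 1.0].

    Illustration only. A production evaluator would more likely
    compare embeddings than raw characters.
    """
    if output is None or expected is None:
        return 0.0
    # Normalize case and surrounding whitespace before comparing.
    a = str(output).strip().lower()
    b = str(expected).strip().lower()
    return SequenceMatcher(None, a, b).ratio()
```

Returning 0.0 for a missing value keeps the evaluator from raising when a dataset record has no expected output; whether that is the right behavior depends on your dataset.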
Subclass BaseSummaryEvaluator to create an evaluator that operates on the aggregated results of an entire experiment run. It receives a SummaryEvaluatorContext containing all inputs, outputs, and per-evaluator results.
{{< code-block lang="python" >}}
from ddtrace.llmobs import BaseSummaryEvaluator, SummaryEvaluatorContext


class AverageScoreEvaluator(BaseSummaryEvaluator):
    """Computes average score across all evaluation results."""

    def __init__(self, target_evaluator: str):
        super().__init__(name="average_score")
        self.target_evaluator = target_evaluator

    def evaluate(self, context: SummaryEvaluatorContext):
        scores = context.evaluation_results.get(self.target_evaluator, [])
        if not scores:
            return None
        return sum(scores) / len(scores)
{{< /code-block >}}
- Call super().__init__(name="evaluator_name") to set the evaluator's label.
- Access per-evaluator results through context.evaluation_results, which maps evaluator names to lists of results.
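Other aggregate statistics follow the same pattern as the average. The sketch below computes a pass rate, using a plain dict as a stand-in for context.evaluation_results. The function name pass_rate and the assumption that a result list may contain missing (None) entries are illustrative, not taken from the SDK.

```python
from typing import Any, Dict, List, Optional


def pass_rate(
    evaluation_results: Dict[str, List[Any]],
    evaluator_name: str,
    threshold: float,
) -> Optional[float]:
    """Fraction of numeric results at or above threshold, or None if empty."""
    results = evaluation_results.get(evaluator_name, [])
    # Keep numeric values only; bool is excluded because it subclasses int.
    scores = [
        r for r in results
        if isinstance(r, (int, float)) and not isinstance(r, bool)
    ]
    if not scores:
        return None
    return sum(1 for s in scores if s >= threshold) / len(scores)


# Plain dict standing in for context.evaluation_results.
results = {"semantic_similarity": [0.9, 0.75, None, 0.6]}
rate = pass_rate(results, "semantic_similarity", threshold=0.7)
```

Inside a BaseSummaryEvaluator subclass, the same logic would sit in evaluate and read from context.evaluation_results.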
The LLMJudge class enables automated evaluation of LLM outputs using another LLM as the judge. It supports OpenAI, Azure OpenAI, Anthropic, Amazon Bedrock, and custom LLM clients with structured output formats.
| Parameter | Type | Required | Description |
|---|---|---|---|
| user_prompt | str | Yes | Prompt template with {{field.path}} syntax for span context injection. |
| system_prompt | str | No | System prompt to set the judge's behavior or persona. |
| structured_output | StructuredOutput | No | Output format specification. See structured output types. |
| provider | str | Conditional | LLM provider: "openai", "azure_openai", "anthropic", or "bedrock". Required if client is not provided. |
| model | str | No | Model identifier (for example, "gpt-4o", "claude-sonnet-4-20250514"). |
| model_params | dict | No | Additional parameters passed to the LLM API (for example, temperature). |
| client | callable | Conditional | Custom LLM client function. Required if provider is not provided. |
| name | str | No | Evaluator name for identification in results. |
| client_options | dict | No | Provider-specific configuration (for example, API keys). |
The user_prompt supports {{field.path}} syntax to inject context from the evaluated span. Nested paths are supported.
- {{input_data}}: The span's input data.
- {{output_data}}: The span's output data.
- {{expected_output}}: Expected output for comparison (if available).
- {{metadata.key}}: Nested metadata fields (for example, {{metadata.topic}}).
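To make the {{field.path}} substitution concrete, here is a rough sketch of how nested paths could be resolved against a context dict. It is not the SDK's implementation: render_prompt is a hypothetical name, and leaving unresolved placeholders untouched is an assumption.

```python
import re
from typing import Any, Dict

# Matches {{name}} or {{nested.path}}, allowing inner whitespace.
_PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def render_prompt(template: str, context: Dict[str, Any]) -> str:
    """Replace each {{field.path}} with the value found by walking context."""

    def lookup(match: "re.Match[str]") -> str:
        value: Any = context
        for key in match.group(1).split("."):
            if not isinstance(value, dict) or key not in value:
                # Assumption: leave unresolved placeholders as written.
                return match.group(0)
            value = value[key]
        return str(value)

    return _PLACEHOLDER.sub(lookup, template)


prompt = render_prompt(
    "Topic: {{metadata.topic}}\nResponse: {{output_data}}",
    {"output_data": "Paris", "metadata": {"topic": "geography"}},
)
```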
| Output type | Description |
|---|---|
| BooleanStructuredOutput | Returns True/False with optional pass/fail assessment. |
| ScoreStructuredOutput | Returns a numeric score within a defined range, with optional thresholds. |
| CategoricalStructuredOutput | Returns one of a predefined set of categories, with optional pass values. |
| Dict[str, JSONType] | Custom JSON schema for arbitrary structured output. |
All structured output types accept reasoning=True to include an explanation in results, and reasoning_description to customize the reasoning field's description.
{{< code-block lang="python" >}}
from ddtrace.llmobs import LLMJudge, BooleanStructuredOutput

judge = LLMJudge(
    provider="openai",
    model="gpt-4o",
    user_prompt="Is this response factually accurate? Response: {{output_data}}",
    structured_output=BooleanStructuredOutput(
        description="Whether the response is factually accurate",
        reasoning=True,
        pass_when=True,
    ),
)
{{< /code-block >}}
{{< code-block lang="python" >}}
from ddtrace.llmobs import LLMJudge, ScoreStructuredOutput

judge = LLMJudge(
    provider="anthropic",
    model="claude-sonnet-4-20250514",
    user_prompt="Rate the helpfulness of this response (1-10): {{output_data}}",
    structured_output=ScoreStructuredOutput(
        description="Helpfulness score",
        min_score=1,
        max_score=10,
        reasoning=True,
        min_threshold=7,  # Scores >= 7 pass
    ),
)
{{< /code-block >}}
{{< code-block lang="python" >}}
from ddtrace.llmobs import LLMJudge, CategoricalStructuredOutput

judge = LLMJudge(
    provider="openai",
    model="gpt-4o",
    user_prompt="Classify the sentiment: {{output_data}}",
    structured_output=CategoricalStructuredOutput(
        categories={
            "positive": "The response has a positive sentiment.",
            "neutral": "The response has a neutral sentiment.",
            "negative": "The response has a negative sentiment.",
        },
        reasoning=True,
        pass_values=["positive", "neutral"],
    ),
)
{{< /code-block >}}
{{< code-block lang="python" >}}
from ddtrace.llmobs import LLMJudge, BooleanStructuredOutput

judge = LLMJudge(
    provider="azure_openai",
    model="gpt-4o",
    user_prompt="Is this response factually accurate? Response: {{output_data}}",
    structured_output=BooleanStructuredOutput(
        description="Whether the response is factually accurate",
        reasoning=True,
        pass_when=True,
    ),
    client_options={
        "azure_endpoint": "https://your-resource.openai.azure.com",
        "api_version": "2024-10-21",
        "azure_deployment": "gpt-4o",
    },
)
{{< /code-block >}}
The azure_openai provider accepts the following client_options:
| Option | Environment variable | Description |
|---|---|---|
| api_key | AZURE_OPENAI_API_KEY | Azure OpenAI API key. |
| azure_endpoint | AZURE_OPENAI_ENDPOINT | Azure OpenAI endpoint URL. |
| api_version | AZURE_OPENAI_API_VERSION | API version. Defaults to "2024-10-21". |
| azure_deployment | AZURE_OPENAI_DEPLOYMENT | Deployment name. Falls back to the model parameter. |
{{< code-block lang="python" >}}
from ddtrace.llmobs import LLMJudge, BooleanStructuredOutput


def my_llm_client(provider, messages, json_schema, model, model_params):
    # call_my_llm is a placeholder for your own LLM client call.
    response = call_my_llm(messages, model)
    return response


judge = LLMJudge(
    client=my_llm_client,
    model="my-custom-model",
    user_prompt="Is this response accurate? {{output_data}}",
    structured_output=BooleanStructuredOutput(
        description="Accuracy check",
        reasoning=True,
        pass_when=True,
    ),
)
{{< /code-block >}}
- Requires either a provider ("openai", "azure_openai", "anthropic", or "bedrock") or a custom client.
- Set API keys using client_options={"api_key": "..."} or environment variables (OPENAI_API_KEY, ANTHROPIC_API_KEY). For Azure OpenAI, set AZURE_OPENAI_API_KEY and AZURE_OPENAI_ENDPOINT. For Bedrock, configure AWS credentials through environment variables or client_options.
- Use reasoning=True in structured outputs to include an explanation in results.
- Define pass/fail criteria with pass_when (boolean), pass_values (categorical), or min_threshold/max_threshold (score).
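The pass/fail criteria in the last item can be read as simple predicates over the judge's value. The sketch below is one plausible interpretation, not the SDK's code: the function names are hypothetical, min_threshold is inclusive as in the score example above ("Scores >= 7 pass"), and treating max_threshold as inclusive is an assumption.

```python
from typing import Optional, Sequence


def assess_boolean(value: bool, pass_when: bool) -> str:
    """Pass when the judge's boolean equals the configured pass_when."""
    return "pass" if value == pass_when else "fail"


def assess_category(value: str, pass_values: Sequence[str]) -> str:
    """Pass when the chosen category is one of pass_values."""
    return "pass" if value in pass_values else "fail"


def assess_score(
    value: float,
    min_threshold: Optional[float] = None,
    max_threshold: Optional[float] = None,
) -> str:
    """Pass when the score lies within the configured bounds."""
    # Both bounds are treated as inclusive (an assumption for max_threshold).
    if min_threshold is not None and value < min_threshold:
        return "fail"
    if max_threshold is not None and value > max_threshold:
        return "fail"
    return "pass"
```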
Use LLMObs.publish_evaluator() to push a locally-defined LLMJudge configuration to Datadog as a custom LLM-as-a-judge draft. This lets you define and validate an evaluator in experiments, then promote it to production without manually recreating the configuration in the UI.
| Parameter | Type | Required | Description |
|---|---|---|---|
| evaluator | LLMJudge | Yes | The LLMJudge instance to publish. |
| ml_app | str | Yes | The LLM application name. |
| eval_name | str | No | The name to use for the evaluator in Datadog. If omitted, defaults to the name set on the LLMJudge instance. |
| variable_mapping | dict[str, str] | No | Remaps variable names in user_prompt to Datadog span field paths in the published evaluator. |
{{< code-block lang="python" >}}
from ddtrace.llmobs import LLMObs
from ddtrace.llmobs._evaluators import BooleanStructuredOutput, LLMJudge

LLMObs.enable(
    ml_app="my-ml-app",
    api_key="<DD_API_KEY>",
    app_key="<DD_APP_KEY>",
)

judge = LLMJudge(
    provider="openai",
    model="gpt-4o",
    system_prompt="You are a helpful evaluator.",
    user_prompt=(
        "Does the output correctly answer the question?\n"
        "Input: {{input_data}}\n"
        "Output: {{output_data}}"
    ),
    structured_output=BooleanStructuredOutput("correctness", pass_when=True),
    name="my-correctness-judge",
)

result = LLMObs.publish_evaluator(
    judge,
    ml_app="my-ml-app",
    variable_mapping={"input_data": "span_input", "output_data": "span_output"},
)
print(result["ui_url"])
{{< /code-block >}}
LLMObs.publish_evaluator() returns {"ui_url": "..."}, which links to the evaluator in Datadog.
LLMObs.publish_evaluator() creates or updates the evaluator draft. Activate it from the Datadog UI to run it in production.
The SDK provides built-in evaluators for common evaluation patterns. These are class-based evaluators that you can use directly without writing custom logic.
StringCheckEvaluator performs string comparison operations between output_data and expected_output.
| Operation | Description |
|---|---|
| eq | Exact match (default) |
| ne | Not equals |
| contains | output_data contains expected_output (case-sensitive) |
| icontains | output_data contains expected_output (case-insensitive) |
{{< code-block lang="python" >}}
from ddtrace.llmobs._evaluators import StringCheckEvaluator

# Exact, case-sensitive match
evaluator = StringCheckEvaluator(operation="eq", case_sensitive=True)

# Case-insensitive containment with whitespace stripping enabled
evaluator = StringCheckEvaluator(operation="icontains", strip_whitespace=True)

# Extract the value to compare from a structured output
evaluator = StringCheckEvaluator(
    operation="eq",
    output_extractor=lambda x: x.get("message", "") if isinstance(x, dict) else str(x),
)
{{< /code-block >}}
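For reference, the four string-check operations in the table above can be expressed in a few lines of plain Python. This is an illustrative re-implementation under the table's descriptions, not the SDK's source; string_check is a hypothetical name, and using casefold() for the case-insensitive comparison is an assumption.

```python
def string_check(output: str, expected: str, operation: str = "eq") -> bool:
    """Illustrative version of the eq, ne, contains, and icontains operations."""
    if operation == "eq":
        return output == expected
    if operation == "ne":
        return output != expected
    if operation == "contains":
        return expected in output
    if operation == "icontains":
        # casefold() is a more aggressive lower() intended for caseless matching.
        return expected.casefold() in output.casefold()
    raise ValueError(f"Unknown operation: {operation!r}")
```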
RegexMatchEvaluator validates output against a regex pattern.
| Match mode | Description |
|---|---|
| search | Partial match anywhere in string (default) |
| match | Match from start of string |
| fullmatch | Match entire string |
{{< code-block lang="python" >}}
import re

from ddtrace.llmobs._evaluators import RegexMatchEvaluator

# Validate an email address: the whole string must match
evaluator = RegexMatchEvaluator(
    pattern=r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$",
    match_mode="fullmatch"
)

# Case-insensitive search for either keyword
evaluator = RegexMatchEvaluator(
    pattern=r"success|completed",
    flags=re.IGNORECASE
)
{{< /code-block >}}
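The three match modes share their names with functions in Python's standard re module. Assuming the evaluator follows the same semantics (the table's descriptions suggest so, but this is not confirmed here), the difference looks like this:

```python
import re

pattern = re.compile(r"\d{3}")

# search: the pattern may match anywhere in the string.
found_anywhere = pattern.search("order 123 shipped") is not None

# match: the pattern must match at the start of the string.
at_start_miss = pattern.match("order 123 shipped") is not None
at_start_hit = pattern.match("123 shipped") is not None

# fullmatch: the pattern must consume the entire string.
whole_miss = pattern.fullmatch("123 shipped") is not None
whole_hit = pattern.fullmatch("123") is not None
```

With fullmatch, explicit ^ and $ anchors in the pattern are redundant but harmless.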
LengthEvaluator validates output length constraints.
| Count type | Description |
|---|---|
| characters | Count characters (default) |
| words | Count words |
| lines | Count lines |
{{< code-block lang="python" >}}
from ddtrace.llmobs._evaluators import LengthEvaluator

# Between 50 and 200 characters
evaluator = LengthEvaluator(min_length=50, max_length=200, count_type="characters")

# Between 10 and 100 words
evaluator = LengthEvaluator(min_length=10, max_length=100, count_type="words")
{{< /code-block >}}
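How words and lines are counted can matter at the boundaries. The sketch below shows one plausible reading of the three count types for the length check; it assumes words are whitespace-separated tokens, lines are split on line boundaries, and both bounds are inclusive. The SDK may count differently, and count_units and within_length are hypothetical names.

```python
from typing import Optional


def count_units(text: str, count_type: str = "characters") -> int:
    """Count characters, whitespace-separated words, or lines in text."""
    if count_type == "characters":
        return len(text)
    if count_type == "words":
        # split() with no arguments splits on runs of whitespace.
        return len(text.split())
    if count_type == "lines":
        return len(text.splitlines())
    raise ValueError(f"Unknown count type: {count_type!r}")


def within_length(
    text: str,
    min_length: Optional[int] = None,
    max_length: Optional[int] = None,
    count_type: str = "characters",
) -> bool:
    """Return True when the count falls inside the inclusive bounds."""
    n = count_units(text, count_type)
    if min_length is not None and n < min_length:
        return False
    if max_length is not None and n > max_length:
        return False
    return True
```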
JSONEvaluator validates that output is valid JSON, and optionally checks for required keys.
{{< code-block lang="python" >}}
from ddtrace.llmobs._evaluators import JSONEvaluator

# Check only that the output parses as JSON
evaluator = JSONEvaluator()

# Also require specific keys
evaluator = JSONEvaluator(required_keys=["name", "status", "data"])
{{< /code-block >}}
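The JSON check above amounts to parsing the output and, optionally, looking for keys. A minimal standard-library sketch of that idea follows. It assumes required keys are checked at the top level of a JSON object only; check_json is a hypothetical name, and the SDK's behavior for non-object JSON is not specified here.

```python
import json
from typing import Iterable, Optional


def check_json(output: str, required_keys: Optional[Iterable[str]] = None) -> bool:
    """Return True if output parses as JSON and contains all required keys."""
    try:
        parsed = json.loads(output)
    except (TypeError, ValueError):
        # json.JSONDecodeError is a subclass of ValueError.
        return False
    if required_keys is None:
        return True
    # Assumption: required keys only apply to a top-level JSON object.
    if not isinstance(parsed, dict):
        return False
    return all(key in parsed for key in required_keys)
```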
SemanticSimilarityEvaluator measures semantic similarity between output_data and expected_output using embeddings. It returns a similarity score between 0.0 and 1.0.
{{< code-block lang="python" >}}
from ddtrace.llmobs._evaluators import SemanticSimilarityEvaluator
from openai import OpenAI

client = OpenAI()


def get_embedding(text):
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )
    return response.data[0].embedding


evaluator = SemanticSimilarityEvaluator(
    embedding_fn=get_embedding,
    threshold=0.8  # Minimum similarity score to pass
)
{{< /code-block >}}
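Embedding-based similarity is commonly computed as cosine similarity between the two vectors. This guide does not state which metric the built-in evaluator uses, so treat the following as a sketch of the general technique rather than the SDK's implementation. Raw cosine similarity ranges from -1 to 1; this sketch does not clamp or rescale it.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("Embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # A zero vector has no direction; report no similarity.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```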
For straightforward evaluation logic, define a function instead of a class. Function-based evaluators receive the input, output, and expected output directly as arguments.
{{< code-block lang="python" >}}
from ddtrace.llmobs import EvaluatorResult


def exact_match_evaluator(input_data, output_data, expected_output):
    """Checks if output exactly matches expected output."""
    matches = output_data == expected_output
    return EvaluatorResult(
        value=matches,
        reasoning="Exact match" if matches else "Output differs from expected",
        assessment="pass" if matches else "fail",
    )
{{< /code-block >}}
Function signature:
{{< code-block lang="python" >}}
def evaluator_function(
    input_data: Any,
    output_data: Any,
    expected_output: Any
) -> Union[JSONType, EvaluatorResult]:
    ...
{{< /code-block >}}
You can return either:
- A plain value (str, float, int, bool, dict), or
- An EvaluatorResult for rich results with reasoning and metadata
Pass your evaluators to LLMObs.experiment() to run them against every record in a dataset. The SDK automatically builds an EvaluatorContext for each record and calls your evaluator. After all records are processed, any summary evaluators run on the aggregated results.
{{< code-block lang="python" >}}
from ddtrace.llmobs import LLMObs, Dataset, DatasetRecord

dataset = Dataset(
    name="qa_dataset",
    records=[
        DatasetRecord(
            input_data={"question": "What is 2+2?"},
            expected_output="4"
        ),
        DatasetRecord(
            input_data={"question": "What is the capital of France?"},
            expected_output="Paris"
        ),
    ]
)


def qa_task(input_data, config):
    # generate_answer is a placeholder for your application logic.
    return generate_answer(input_data["question"])


# These evaluators are the class-based and function-based examples from this guide.
semantic_eval = SemanticSimilarityEvaluator(threshold=0.7)
summary_eval = AverageScoreEvaluator("semantic_similarity")

experiment = LLMObs.experiment(
    name="qa_experiment",
    task=qa_task,
    dataset=dataset,
    evaluators=[semantic_eval, exact_match_evaluator],
    summary_evaluators=[summary_eval]
)

experiment.run()
{{< /code-block >}}
RemoteEvaluator lets you reference a custom LLM-as-a-judge evaluation configured in the Datadog UI by name, and run it as part of a local experiment. This allows you to reuse your production evaluators in offline experiments without reimplementing the evaluation logic in Python.
| Parameter | Type | Description |
|---|---|---|
| eval_name | str | The name of the LLM-as-a-judge evaluator as configured in Datadog. |
| transform_fn | Optional[Callable] | A function that maps an EvaluatorContext to a dict of template variable values. |
{{< code-block lang="python" >}}
from ddtrace.llmobs import LLMObs, RemoteEvaluator

evaluator = RemoteEvaluator(eval_name="quality-assessment")

experiment = LLMObs.experiment(
    name="my-experiment",
    task=my_task,
    dataset=dataset,
    evaluators=[evaluator],
)
experiment.run()
{{< /code-block >}}
When you configure an LLM-as-a-judge in the Datadog UI, the prompt template uses variables such as {{span_input}} and {{span_output}}. By default, RemoteEvaluator maps the following:
- input_data → span_input
- output_data → span_output
- expected_output → meta.expected_output
If your dataset records have a different structure (for example, input_data is a dict with multiple keys), provide a transform_fn to control exactly which values are sent for each template variable:
{{< code-block lang="python" >}}
from ddtrace.llmobs import RemoteEvaluator, EvaluatorContext


def my_transform(context: EvaluatorContext) -> dict:
    # input_data is a dict: {"user_query": str, "retrieved_docs": list[str]}
    return {
        "span_input": context.input_data.get("user_query"),  # → {{span_input}} in the prompt
        "span_output": context.output_data,  # → {{span_output}} in the prompt
        "meta": {
            "retrieved_docs": context.input_data.get("retrieved_docs"),  # → {{meta.retrieved_docs}}
        },
    }


evaluator = RemoteEvaluator(
    eval_name="quality-assessment",
    transform_fn=my_transform,
)
{{< /code-block >}}
If the backend evaluator encounters an error, a RemoteEvaluatorError is raised. Inspect backend_error for details:
{{< code-block lang="python" >}}
from ddtrace.llmobs import RemoteEvaluator, RemoteEvaluatorError, EvaluatorContext

evaluator = RemoteEvaluator(eval_name="quality-assessment")
context = EvaluatorContext(
    input_data={"query": "What is the capital of France?"},
    output_data="Paris",
)

try:
    result = evaluator.evaluate(context)
except RemoteEvaluatorError as e:
    print(e.backend_error)
    # {"type": "...", "message": "...", "recommended_resolution": "..."}
{{< /code-block >}}
To submit evaluations from your application code, construct the EvaluatorContext yourself, call the evaluator, and submit the result with LLMObs.submit_evaluation(). You can also submit evaluations through the HTTP API.
For the full submit_evaluation() arguments and span-joining options, see the external evaluations documentation. For the HTTP API specification, see the Evaluations API reference.
{{< code-block lang="python" >}}
from ddtrace.llmobs import LLMObs, EvaluatorContext
from ddtrace.llmobs.decorators import llm

# SemanticSimilarityEvaluator is the class-based example defined earlier in this guide.
evaluator = SemanticSimilarityEvaluator(threshold=0.8)


@llm(model_name="claude", name="invoke_llm", model_provider="anthropic")
def llm_call(input_text):
    completion = ...  # Your LLM application logic

# Build the evaluation context from the span data
context = EvaluatorContext(
input_data=input_text,
output_data=completion,
expected_output=None,
)
# Run the evaluator
result = evaluator.evaluate(context)
# Submit the result to Datadog
LLMObs.submit_evaluation(
span=LLMObs.export_span(),
ml_app="chatbot",
label=evaluator.name,
metric_type="score",
value=result.value,
assessment=result.assessment,
reasoning=result.reasoning,
)
return completion
{{< /code-block >}}
EvaluatorContext is a frozen dataclass containing all the information needed to run an evaluation.
| Field | Type | Description |
|---|---|---|
| input_data | Any | The input provided to the LLM application (for example, a prompt). |
| output_data | Any | The actual output from the LLM application. |
| expected_output | Any | The expected or ideal output the LLM should have produced. |
| metadata | Dict[str, Any] | Additional metadata. |
| span_id | str | The span's unique identifier. |
| trace_id | str | The trace's unique identifier. |
In Experiments, the SDK populates this automatically from each dataset record. In production, you construct it yourself from your span data.
EvaluatorResult lets you return rich evaluation results with additional context. It is used in both Experiments and production.
| Field | Type | Description |
|---|---|---|
| value | Union[str, float, int, bool, dict] | The evaluation value. Type depends on metric_type. |
| reasoning | Optional[str] | A text explanation of the evaluation result. |
| assessment | Optional[str] | An assessment of this evaluation. Accepted values are pass and fail. |
| metadata | Optional[Dict[str, Any]] | Additional metadata about the evaluation. |
| tags | Optional[Dict[str, str]] | Tags to apply to the evaluation metric. |
SummaryEvaluatorContext is a frozen dataclass providing aggregated evaluation results across all dataset records in an experiment. It is only used by summary evaluators.
| Field | Type | Description |
|---|---|---|
| inputs | List[Any] | List of all input data from the experiment. |
| outputs | List[Any] | List of all output data from the experiment. |
| expected_outputs | List[Any] | List of all expected outputs from the experiment. |
| evaluation_results | Dict[str, List[Any]] | Dictionary mapping evaluator names to their results. |
| metadata | Dict[str, Any] | Additional metadata associated with the experiment. |
The metric type is set when submitting an evaluation (through submit_evaluation() or the HTTP API) and determines how the value is validated and displayed in Datadog.
| Metric type | Value type | Use case |
|---|---|---|
| categorical | str | Classifying outputs into categories (for example, "Positive", "Negative", "Neutral") |
| score | float or int | Numeric scores or ratings (for example, 0.0-1.0, 1-10) |
| boolean | bool | Pass/fail or yes/no evaluations |
| json | dict | Structured evaluation data (for example, multi-dimensional rubrics or detailed breakdowns) |
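When an evaluator returns a plain value, you need to pick the matching metric_type before submitting. A small helper along these lines can do that, following the table above. The name infer_metric_type is hypothetical and not part of the SDK. Note the ordering: in Python, bool is a subclass of int, so the boolean check has to come first.

```python
from typing import Any


def infer_metric_type(value: Any) -> str:
    """Map a Python value to a metric type name from the table above."""
    # bool must be checked before int: isinstance(True, int) is True.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "score"
    if isinstance(value, str):
        return "categorical"
    if isinstance(value, dict):
        return "json"
    raise TypeError(f"Unsupported evaluation value type: {type(value).__name__}")
```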
Evaluation labels must follow these conventions:
- Must start with a letter
- Must only contain ASCII alphanumerics, underscores, or hyphens
- Spaces and other unsupported characters are converted to underscores
- Unicode is not supported
- Must not exceed 200 characters (fewer than 100 is preferred)
- Must be unique for a given LLM application (ml_app) and organization
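If labels are generated dynamically, it can help to check them client-side before submitting. The following is a rough sketch of the character and length rules listed above; the helper names are hypothetical, and uniqueness per ml_app and organization cannot be checked locally.

```python
import re

MAX_LABEL_LENGTH = 200

# A letter first, then ASCII alphanumerics, underscores, or hyphens.
_VALID_LABEL = re.compile(r"[A-Za-z][A-Za-z0-9_-]*")


def is_valid_label(label: str) -> bool:
    """Check the start-character, character-set, and length rules."""
    return (
        len(label) <= MAX_LABEL_LENGTH
        and _VALID_LABEL.fullmatch(label) is not None
    )


def normalize_label(label: str) -> str:
    """Replace unsupported characters with underscores, mirroring the rule above."""
    return re.sub(r"[^A-Za-z0-9_-]", "_", label)
```

Normalizing does not fix a label that starts with a digit or exceeds the length limit, so run is_valid_label on the result as well.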
Set the jobs parameter to run tasks and evaluators concurrently on multiple threads, allowing experiments to complete faster when processing multiple dataset records.
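Conceptually, running with multiple jobs resembles mapping the task over records with a thread pool. The sketch below illustrates that idea with the standard library only. It is not the SDK's scheduler, and run_concurrently is a hypothetical name.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, List


def run_concurrently(
    task: Callable[[Any], Any],
    records: List[Any],
    jobs: int = 1,
) -> List[Any]:
    """Apply task to every record using up to jobs worker threads."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # map() yields results in input order, even if tasks finish out of order.
        return list(pool.map(task, records))


outputs = run_concurrently(lambda q: q.upper(), ["a", "b", "c"], jobs=2)
```

Threads help most when the task is I/O-bound, as LLM API calls typically are. Tasks and evaluators that share mutable state need to be thread-safe.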
When submitting evaluations for OpenTelemetry-instrumented spans, include the source:otel tag in the evaluation. See the external evaluations documentation for examples.