---
title: Evaluation Developer Guide
description: Learn how to build custom evaluators using the LLM Observability SDK.
further_reading:
  - link: /llm_observability/evaluations/external_evaluations
    tag: Documentation
    text: Learn about submitting external evaluations
  - link: /llm_observability/setup/sdk/python
    tag: Documentation
    text: Learn about the LLM Observability SDK for Python
  - link: /llm_observability/instrumentation/api
    tag: Documentation
    text: Learn about the HTTP API Reference
---

Overview

This guide covers how to build custom evaluators with the LLM Observability SDK and use them in LLM Experiments and in production.

Key concepts

An evaluation measures a specific quality of your LLM application's output, such as accuracy, tone, or harmfulness. You write the evaluation logic inside an evaluator, which receives context about the LLM interaction and returns a result.

Running evaluators in an Experiment

To test your LLM application against a dataset before deploying, run your evaluators in LLM Experiments. In Experiments, evaluators run automatically: the SDK calls your evaluator on each dataset record. Use evaluators through the SDK.

Running evaluators in production

To monitor the quality of your live LLM responses, run evaluators in production. You can run evaluators manually with submit_evaluation(), or automatically with custom LLM-as-a-judge evaluations. Use evaluators through the SDK, HTTP API, or the Datadog UI.

For production, there are two approaches:

  • Manual evaluations (this guide): You run evaluators in your application code and submit results with LLMObs.submit_evaluation() or the HTTP API. This gives you full control over evaluation logic and timing.
  • Custom LLM-as-a-judge evaluations: You configure evaluations in the Datadog UI using natural language prompts. Datadog automatically runs them on production traces in real time, with no code changes required.

This guide focuses on manual evaluations. For managed LLM-as-a-judge evaluations, see Custom LLM-as-a-Judge Evaluations.

Evaluation components

The evaluation system has four main components:

  • EvaluatorContext: The input to an evaluator. Contains the LLM's input, output, expected output, and span identifiers. In Experiments, the SDK builds this automatically from each dataset record. In production, you construct the EvaluatorContext yourself.
  • EvaluatorResult: The output of an evaluator. Contains a typed value, optional reasoning, a pass/fail assessment, metadata, and tags. You can also return a plain value (str, float, int, bool, dict) instead.
  • Metric type: Determines how the evaluation value is interpreted and displayed: categorical (string labels), score (numeric), boolean (pass/fail), or json (structured data).
  • SummaryEvaluatorContext: Experiments only. After all dataset records are evaluated, summary evaluators receive the aggregated results to compute statistics like averages or pass rates.

The typical flow:

  • Experiments: Dataset record → EvaluatorContext → Evaluator → EvaluatorResult → (after all records) SummaryEvaluatorContext → Summary evaluator → summary result
  • Production: Span data → EvaluatorContext (built manually) → Evaluator → EvaluatorResult → LLMObs.submit_evaluation() or HTTP API

Building evaluators

There are two ways to define an evaluator using LLM Observability: class-based and function-based. In addition to these evaluators, LLM Observability has integrations with open source evaluation frameworks, such as DeepEval and Pydantic, that can be used in LLM Observability Experiments.

| | Class-based | Function-based |
|---|---|---|
| Best for | Reusable evaluators with custom configuration or state. | One-off evaluators with straightforward logic. |
| Receives | An EvaluatorContext object with full span context (input, output, expected output, metadata, span/trace IDs). | input_data, output_data, and expected_output as separate arguments. |
| Supports summary evaluators | Yes (BaseSummaryEvaluator). | No. |

If you are unsure, start with class-based evaluators. They provide the same capabilities as function-based evaluators.

Class-based evaluators

Class-based evaluators provide a structured way to implement reusable evaluation logic with custom configuration.

BaseEvaluator

Subclass BaseEvaluator to create an evaluator that runs on a single span or dataset record. Implement the evaluate method, which receives an EvaluatorContext and returns an EvaluatorResult (or a plain value).

{{< code-block lang="python" >}}
from ddtrace.llmobs import BaseEvaluator, EvaluatorContext, EvaluatorResult


class SemanticSimilarityEvaluator(BaseEvaluator):
    """Evaluates semantic similarity between output and expected output."""

    def __init__(self, threshold: float = 0.8):
        super().__init__(name="semantic_similarity")
        self.threshold = threshold

    def evaluate(self, context: EvaluatorContext) -> EvaluatorResult:
        # compute_similarity is a placeholder for your own similarity function.
        score = compute_similarity(context.output_data, context.expected_output)

        return EvaluatorResult(
            value=score,
            reasoning=f"Similarity score: {score:.2f}",
            assessment="pass" if score >= self.threshold else "fail",
            metadata={"threshold": self.threshold},
            tags={"type": "semantic"},
        )
{{< /code-block >}}

  • Call super().__init__(name="evaluator_name") to set the evaluator's label.
  • Implement evaluate(context: EvaluatorContext) with your evaluation logic.
  • Return an EvaluatorResult for rich results, or a plain value (str, float, int, bool, dict).
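The example above calls compute_similarity without defining it. As a minimal stand-in, assuming that character-level similarity from the standard library is acceptable for a first pass, something like the following could fill that gap. The use of difflib is an assumption made for illustration and is not how the SDK measures similarity; for true semantic comparison, an embedding-based function would be a better fit.

```python
from difflib import SequenceMatcher


def compute_similarity(output, expected) -> float:
    """Return a lexical similarity ratio between 0.0 and 1.0.

    Hypothetical stand-in: this compares characters, not meaning.
    """
    if output is None or expected is None:
        return 0.0
    # Normalize both sides so case and surrounding whitespace do not affect the score.
    a = str(output).strip().lower()
    b = str(expected).strip().lower()
    return SequenceMatcher(None, a, b).ratio()
```

Because the function returns 0.0 when either side is missing, records without an expected output fail the threshold check instead of raising an error.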

BaseSummaryEvaluator

Summary evaluators are only available in experiments.

Subclass BaseSummaryEvaluator to create an evaluator that operates on the aggregated results of an entire experiment run. It receives a SummaryEvaluatorContext containing all inputs, outputs, and per-evaluator results.

{{< code-block lang="python" >}}
from ddtrace.llmobs import BaseSummaryEvaluator, SummaryEvaluatorContext


class AverageScoreEvaluator(BaseSummaryEvaluator):
    """Computes average score across all evaluation results."""

    def __init__(self, target_evaluator: str):
        super().__init__(name="average_score")
        self.target_evaluator = target_evaluator

    def evaluate(self, context: SummaryEvaluatorContext):
        scores = context.evaluation_results.get(self.target_evaluator, [])
        if not scores:
            return None
        return sum(scores) / len(scores)
{{< /code-block >}}

  • Call super().__init__(name="evaluator_name") to set the evaluator's label.
  • Access per-evaluator results through context.evaluation_results, which maps evaluator names to lists of results.
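Besides averages, a pass rate is another common summary statistic. The sketch below works on a plain dictionary shaped like context.evaluation_results (evaluator name mapped to a list of results), so the same logic could be placed inside the evaluate method of a BaseSummaryEvaluator subclass. The function name and the choice to skip None entries are assumptions made for illustration.

```python
from typing import Any, Dict, List, Optional


def pass_rate(evaluation_results: Dict[str, List[Any]], evaluator_name: str) -> Optional[float]:
    """Fraction of truthy results for one evaluator, ignoring missing (None) entries."""
    results = [r for r in evaluation_results.get(evaluator_name, []) if r is not None]
    if not results:
        return None
    return sum(1 for r in results if r) / len(results)


# Hypothetical data in the shape described above: evaluator name -> list of results.
example = {"exact_match": [True, False, True, None]}
print(pass_rate(example, "exact_match"))
```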

LLMJudge

The LLMJudge class enables automated evaluation of LLM outputs using another LLM as the judge. It supports OpenAI, Azure OpenAI, Anthropic, Amazon Bedrock, and custom LLM clients with structured output formats.

Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| user_prompt | str | Yes | Prompt template with {{field.path}} syntax for span context injection. |
| system_prompt | str | No | System prompt to set the judge's behavior or persona. |
| structured_output | StructuredOutput | No | Output format specification. See structured output types. |
| provider | str | Conditional | LLM provider: "openai", "azure_openai", "anthropic", or "bedrock". Required if client is not provided. |
| model | str | No | Model identifier (for example, "gpt-4o", "claude-sonnet-4-20250514"). |
| model_params | dict | No | Additional parameters passed to the LLM API (for example, temperature). |
| client | callable | Conditional | Custom LLM client function. Required if provider is not provided. |
| name | str | No | Evaluator name for identification in results. |
| client_options | dict | No | Provider-specific configuration (for example, API keys). |

Template variables

The user_prompt supports {{field.path}} syntax to inject context from the evaluated span. Nested paths are supported.

  • {{input_data}} — The span's input data.
  • {{output_data}} — The span's output data.
  • {{expected_output}} — Expected output for comparison (if available).
  • {{metadata.key}} — Nested metadata fields (for example, {{metadata.topic}}).
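To make the placeholder syntax concrete, here is a rough sketch of how {{field.path}} substitution with nested paths can work. It is an illustration only: render_prompt is a hypothetical helper, and the SDK's own rendering may treat missing fields or non-string values differently.

```python
import re
from typing import Any, Dict

# Matches {{name}} or {{nested.path}}, allowing optional spaces inside the braces.
_PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][\w.]*)\s*\}\}")


def render_prompt(template: str, context: Dict[str, Any]) -> str:
    """Replace {{field.path}} placeholders with values looked up in a nested dict."""

    def lookup(match):
        value: Any = context
        for part in match.group(1).split("."):
            if not isinstance(value, dict) or part not in value:
                # Leave unknown placeholders untouched so they are easy to spot.
                return match.group(0)
            value = value[part]
        return str(value)

    return _PLACEHOLDER.sub(lookup, template)


prompt = render_prompt(
    "Topic: {{metadata.topic}}\nAnswer: {{output_data}}",
    {"output_data": "Paris", "metadata": {"topic": "geography"}},
)
print(prompt)
```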

Structured output types

| Output type | Description |
|---|---|
| BooleanStructuredOutput | Returns True/False with optional pass/fail assessment. |
| ScoreStructuredOutput | Returns a numeric score within a defined range, with optional thresholds. |
| CategoricalStructuredOutput | Returns one of a predefined set of categories, with optional pass values. |
| Dict[str, JSONType] | Custom JSON schema for arbitrary structured output. |

All structured output types accept reasoning=True to include an explanation in results, and reasoning_description to customize the reasoning field's description.

Example: Boolean evaluation

{{< code-block lang="python" >}}
from ddtrace.llmobs import LLMJudge, BooleanStructuredOutput

judge = LLMJudge(
    provider="openai",
    model="gpt-4o",
    user_prompt="Is this response factually accurate? Response: {{output_data}}",
    structured_output=BooleanStructuredOutput(
        description="Whether the response is factually accurate",
        reasoning=True,
        pass_when=True,
    ),
)
{{< /code-block >}}

Example: Score-based evaluation with thresholds

{{< code-block lang="python" >}}
from ddtrace.llmobs import LLMJudge, ScoreStructuredOutput

judge = LLMJudge(
    provider="anthropic",
    model="claude-sonnet-4-20250514",
    user_prompt="Rate the helpfulness of this response (1-10): {{output_data}}",
    structured_output=ScoreStructuredOutput(
        description="Helpfulness score",
        min_score=1,
        max_score=10,
        reasoning=True,
        min_threshold=7,  # Scores >= 7 pass
    ),
)
{{< /code-block >}}

Example: Categorical evaluation

{{< code-block lang="python" >}}
from ddtrace.llmobs import LLMJudge, CategoricalStructuredOutput

judge = LLMJudge(
    provider="openai",
    model="gpt-4o",
    user_prompt="Classify the sentiment: {{output_data}}",
    structured_output=CategoricalStructuredOutput(
        categories={
            "positive": "The response has a positive sentiment.",
            "neutral": "The response has a neutral sentiment.",
            "negative": "The response has a negative sentiment.",
        },
        reasoning=True,
        pass_values=["positive", "neutral"],
    ),
)
{{< /code-block >}}

Example: Azure OpenAI

{{< code-block lang="python" >}}
from ddtrace.llmobs import LLMJudge, BooleanStructuredOutput

judge = LLMJudge(
    provider="azure_openai",
    model="gpt-4o",
    user_prompt="Is this response factually accurate? Response: {{output_data}}",
    structured_output=BooleanStructuredOutput(
        description="Whether the response is factually accurate",
        reasoning=True,
        pass_when=True,
    ),
    client_options={
        "azure_endpoint": "https://your-resource.openai.azure.com",
        "api_version": "2024-10-21",
        "azure_deployment": "gpt-4o",
    },
)
{{< /code-block >}}

The azure_openai provider accepts the following client_options:

| Option | Environment variable | Description |
|---|---|---|
| api_key | AZURE_OPENAI_API_KEY | Azure OpenAI API key. |
| azure_endpoint | AZURE_OPENAI_ENDPOINT | Azure OpenAI endpoint URL. |
| api_version | AZURE_OPENAI_API_VERSION | API version. Defaults to "2024-10-21". |
| azure_deployment | AZURE_OPENAI_DEPLOYMENT | Deployment name. Falls back to the model parameter. |

Example: Custom LLM client

{{< code-block lang="python" >}}
from ddtrace.llmobs import LLMJudge, BooleanStructuredOutput


def my_llm_client(provider, messages, json_schema, model, model_params):
    # call_my_llm is a placeholder for your own LLM call.
    response = call_my_llm(messages, model)
    return response


judge = LLMJudge(
    client=my_llm_client,
    model="my-custom-model",
    user_prompt="Is this response accurate? {{output_data}}",
    structured_output=BooleanStructuredOutput(
        description="Accuracy check",
        reasoning=True,
        pass_when=True,
    ),
)
{{< /code-block >}}

Key points

  • Requires either a provider ("openai", "azure_openai", "anthropic", or "bedrock") or a custom client.
  • Set API keys using client_options={"api_key": "..."} or environment variables (OPENAI_API_KEY, ANTHROPIC_API_KEY). For Azure OpenAI, set AZURE_OPENAI_API_KEY and AZURE_OPENAI_ENDPOINT. For Bedrock, configure AWS credentials through environment variables or client_options.
  • Use reasoning=True in structured outputs to include an explanation in results.
  • Define pass/fail criteria with pass_when (boolean), pass_values (categorical), or min_threshold/max_threshold (score).

Publishing an LLMJudge as a Datadog managed evaluation

Use LLMObs.publish_evaluator() to push a locally-defined LLMJudge configuration to Datadog as a custom LLM-as-a-judge draft. This lets you define and validate an evaluator in experiments, then promote it to production without manually recreating the configuration in the UI.

| Parameter | Type | Required | Description |
|---|---|---|---|
| evaluator | LLMJudge | Yes | The LLMJudge instance to publish. |
| ml_app | str | Yes | The LLM application name. |
| eval_name | str | No | The name to use for the evaluator in Datadog. If omitted, defaults to the name set on the LLMJudge instance. |
| variable_mapping | dict[str, str] | No | Remaps variable names in user_prompt to Datadog span field paths in the published evaluator. |

{{< code-block lang="python" >}}
from ddtrace.llmobs import LLMObs
from ddtrace.llmobs._evaluators import BooleanStructuredOutput, LLMJudge

LLMObs.enable(
    ml_app="my-ml-app",
    api_key="<DD_API_KEY>",
    app_key="<DD_APP_KEY>",
)

judge = LLMJudge(
    provider="openai",
    model="gpt-4o",
    system_prompt="You are a helpful evaluator.",
    user_prompt=(
        "Does the output correctly answer the question?\n"
        "Input: {{input_data}}\n"
        "Output: {{output_data}}"
    ),
    structured_output=BooleanStructuredOutput("correctness", pass_when=True),
    name="my-correctness-judge",
)

result = LLMObs.publish_evaluator(
    judge,
    ml_app="my-ml-app",
    variable_mapping={"input_data": "span_input", "output_data": "span_output"},
)
print(result["ui_url"])
{{< /code-block >}}

LLMObs.publish_evaluator() returns {"ui_url": "..."}, which links to the evaluator in Datadog.

Each call to LLMObs.publish_evaluator() creates or updates the evaluator draft. Activate it from the Datadog UI to run it in production.

Built-in evaluators

The SDK provides built-in evaluators for common evaluation patterns. These are class-based evaluators that you can use directly without writing custom logic.

StringCheckEvaluator

Performs string comparison operations between output_data and expected_output.

| Operation | Description |
|---|---|
| eq | Exact match (default) |
| ne | Not equals |
| contains | output_data contains expected_output (case-sensitive) |
| icontains | output_data contains expected_output (case-insensitive) |

{{< code-block lang="python" >}}
from ddtrace.llmobs._evaluators import StringCheckEvaluator

# Perform an exact match (default)
evaluator = StringCheckEvaluator(operation="eq", case_sensitive=True)

# Check whether output_data contains expected_output (case-insensitive)
evaluator = StringCheckEvaluator(operation="icontains", strip_whitespace=True)

# Extract field from dict output before comparison
evaluator = StringCheckEvaluator(
    operation="eq",
    output_extractor=lambda x: x.get("message", "") if isinstance(x, dict) else str(x),
)
{{< /code-block >}}
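The four operations can be read as the plain-Python comparisons below. This is a hedged sketch of the semantics in the table, not the SDK's implementation: string_check is a hypothetical helper, and it ignores the case_sensitive, strip_whitespace, and output_extractor options shown above.

```python
def string_check(output: str, expected: str, operation: str = "eq") -> bool:
    """Apply one of the four comparison operations to a pair of strings."""
    if operation == "eq":
        return output == expected
    if operation == "ne":
        return output != expected
    if operation == "contains":
        # Case-sensitive substring test.
        return expected in output
    if operation == "icontains":
        # Case-insensitive substring test.
        return expected.lower() in output.lower()
    raise ValueError(f"Unknown operation: {operation}")
```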

RegexMatchEvaluator

Validates output against a regex pattern.

| Match mode | Description |
|---|---|
| search | Partial match anywhere in string (default) |
| match | Match from start of string |
| fullmatch | Match entire string |

{{< code-block lang="python" >}}
import re

from ddtrace.llmobs._evaluators import RegexMatchEvaluator

# Validate email format
evaluator = RegexMatchEvaluator(
    pattern=r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$",
    match_mode="fullmatch",
)

# Validate output pattern (case-insensitive)
evaluator = RegexMatchEvaluator(
    pattern=r"success|completed",
    flags=re.IGNORECASE,
)
{{< /code-block >}}
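The three match modes share their names with functions in Python's re module. Assuming they behave the same way (the names suggest so, but this guide does not state it), the difference can be seen with the standard library alone:

```python
import re

pattern = re.compile(r"\d+")
text = "order 42 shipped"

print(bool(pattern.search(text)))     # True: "42" appears somewhere in the string
print(bool(pattern.match(text)))      # False: the string does not start with digits
print(bool(pattern.fullmatch(text)))  # False: the whole string is not digits
print(bool(pattern.fullmatch("42")))  # True: the entire string matches
```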

LengthEvaluator

Validates output length constraints.

| Count type | Description |
|---|---|
| characters | Count characters (default) |
| words | Count words |
| lines | Count lines |

{{< code-block lang="python" >}}
from ddtrace.llmobs._evaluators import LengthEvaluator

# Ensure response is 50-200 characters
evaluator = LengthEvaluator(min_length=50, max_length=200, count_type="characters")

# Validate word count
evaluator = LengthEvaluator(min_length=10, max_length=100, count_type="words")
{{< /code-block >}}
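As an illustration of what each count type measures, the sketch below counts characters, whitespace-separated words, and lines, then applies inclusive bounds. How the SDK tokenizes words or treats the bounds is not specified in this guide, so the splitting rules and the inclusive comparison are assumptions, and both helper names are hypothetical.

```python
from typing import Optional


def count_length(text: str, count_type: str = "characters") -> int:
    """Count characters, whitespace-separated words, or lines in text."""
    if count_type == "characters":
        return len(text)
    if count_type == "words":
        return len(text.split())
    if count_type == "lines":
        return len(text.splitlines())
    raise ValueError(f"Unknown count type: {count_type}")


def within_bounds(
    text: str,
    min_length: Optional[int] = None,
    max_length: Optional[int] = None,
    count_type: str = "characters",
) -> bool:
    """Return True when the count falls inside the (assumed inclusive) bounds."""
    n = count_length(text, count_type)
    if min_length is not None and n < min_length:
        return False
    if max_length is not None and n > max_length:
        return False
    return True
```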

JSONEvaluator

Validates that output is valid JSON, and optionally checks for required keys.

{{< code-block lang="python" >}}
from ddtrace.llmobs._evaluators import JSONEvaluator

# Validate JSON syntax
evaluator = JSONEvaluator()

# Validate that required keys exist
evaluator = JSONEvaluator(required_keys=["name", "status", "data"])
{{< /code-block >}}
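The two checks described above (valid JSON, then required keys) can be sketched with the standard json module. This is an illustration under the assumption that required keys are looked up at the top level of a JSON object; check_json is a hypothetical helper, not the SDK's code.

```python
import json
from typing import Iterable, Optional


def check_json(output: str, required_keys: Optional[Iterable[str]] = None) -> bool:
    """Return True when output parses as JSON and contains every required top-level key."""
    try:
        parsed = json.loads(output)
    except (TypeError, ValueError):
        # json.JSONDecodeError is a ValueError; TypeError covers non-string input.
        return False
    if required_keys is None:
        return True
    return isinstance(parsed, dict) and all(key in parsed for key in required_keys)
```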

SemanticSimilarityEvaluator

Measures semantic similarity between output_data and expected_output using embeddings. Returns a similarity score between 0.0 and 1.0.

{{< code-block lang="python" >}}
from ddtrace.llmobs._evaluators import SemanticSimilarityEvaluator
from openai import OpenAI

client = OpenAI()


def get_embedding(text):
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small",
    )
    return response.data[0].embedding


evaluator = SemanticSimilarityEvaluator(
    embedding_fn=get_embedding,
    threshold=0.8,  # Minimum similarity score to pass
)
{{< /code-block >}}
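Embedding-based similarity is commonly computed as the cosine of the angle between two vectors. The guide does not say which measure the built-in evaluator uses, so treat the following as a sketch of one common choice rather than the SDK's method. Raw cosine similarity ranges from -1.0 to 1.0; how the evaluator arrives at its 0.0 to 1.0 range is not specified here.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("Embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction, so report no similarity.
        return 0.0
    return dot / (norm_a * norm_b)
```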

Function-based evaluators

For straightforward evaluation logic, define a function instead of a class. Function-based evaluators receive the input, output, and expected output directly as arguments.

{{< code-block lang="python" >}}
from ddtrace.llmobs import EvaluatorResult


def exact_match_evaluator(input_data, output_data, expected_output):
    """Checks if output exactly matches expected output."""
    matches = output_data == expected_output
    return EvaluatorResult(
        value=matches,
        reasoning="Exact match" if matches else "Output differs from expected",
        assessment="pass" if matches else "fail",
    )
{{< /code-block >}}

Function signature:

{{< code-block lang="python" >}}
def evaluator_function(
    input_data: Any,
    output_data: Any,
    expected_output: Any,
) -> Union[JSONType, EvaluatorResult]:
    ...
{{< /code-block >}}

You can return either:

  • A plain value (str, float, int, bool, dict), or
  • An EvaluatorResult for rich results with reasoning and metadata
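For the plain-value case, a function evaluator needs no SDK imports at all. The sketch below follows the three-argument signature shown above and returns a bool; the function name and the case-insensitive containment rule are illustrative choices, not part of the SDK.

```python
def contains_expected(input_data, output_data, expected_output) -> bool:
    """Return True when the expected answer appears in the output, ignoring case."""
    if expected_output is None or output_data is None:
        return False
    return str(expected_output).lower() in str(output_data).lower()
```

Returning a plain value keeps the evaluator short; switch to an EvaluatorResult when you also want to record reasoning or an explicit pass/fail assessment.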

Using evaluators in experiments

Pass your evaluators to LLMObs.experiment() to run them against every record in a dataset. The SDK automatically builds an EvaluatorContext for each record and calls your evaluator. After all records are processed, any summary evaluators run on the aggregated results.

{{< code-block lang="python" >}}
from ddtrace.llmobs import LLMObs, Dataset, DatasetRecord

# Create dataset
dataset = Dataset(
    name="qa_dataset",
    records=[
        DatasetRecord(
            input_data={"question": "What is 2+2?"},
            expected_output="4",
        ),
        DatasetRecord(
            input_data={"question": "What is the capital of France?"},
            expected_output="Paris",
        ),
    ],
)


# Define task (generate_answer is a placeholder for your application logic)
def qa_task(input_data, config):
    return generate_answer(input_data["question"])


# Create evaluators (the classes and function defined in the previous sections)
semantic_eval = SemanticSimilarityEvaluator(threshold=0.7)
summary_eval = AverageScoreEvaluator("semantic_similarity")

# Run experiment
experiment = LLMObs.experiment(
    name="qa_experiment",
    task=qa_task,
    dataset=dataset,
    evaluators=[semantic_eval, exact_match_evaluator],
    summary_evaluators=[summary_eval],
)

experiment.run()
{{< /code-block >}}

Using managed evaluators

RemoteEvaluator lets you reference a custom LLM-as-a-judge evaluation configured in the Datadog UI by name, and run it as part of a local experiment. This allows you to reuse your production evaluators in offline experiments without reimplementing the evaluation logic in Python.

| Parameter | Type | Description |
|---|---|---|
| eval_name | str | The name of the LLM-as-a-judge evaluator as configured in Datadog. |
| transform_fn | Optional[Callable] | A function that maps an EvaluatorContext to a dict of template variable values. |

{{< code-block lang="python" >}}
from ddtrace.llmobs import LLMObs, RemoteEvaluator

evaluator = RemoteEvaluator(eval_name="quality-assessment")

experiment = LLMObs.experiment(
    name="my-experiment",
    task=my_task,
    dataset=dataset,
    evaluators=[evaluator],
)
experiment.run()
{{< /code-block >}}

Mapping dataset data to prompt variables with transform_fn

When you configure an LLM-as-a-judge in the Datadog UI, the prompt template uses variables such as {{span_input}} and {{span_output}}. By default, RemoteEvaluator maps the following:

  • input_data → span_input
  • output_data → span_output
  • expected_output → meta.expected_output

If your dataset records have a different structure—for example, input_data is a dict with multiple keys—provide a transform_fn to control exactly which values are sent for each template variable:

{{< code-block lang="python" >}}
from ddtrace.llmobs import RemoteEvaluator, EvaluatorContext


def my_transform(context: EvaluatorContext) -> dict:
    # input_data is a dict: {"user_query": str, "retrieved_docs": list[str]}
    return {
        "span_input": context.input_data.get("user_query"),  # → {{span_input}} in the prompt
        "span_output": context.output_data,  # → {{span_output}} in the prompt
        "meta": {
            "retrieved_docs": context.input_data.get("retrieved_docs"),  # → {{meta.retrieved_docs}}
        },
    }


evaluator = RemoteEvaluator(
    eval_name="quality-assessment",
    transform_fn=my_transform,
)
{{< /code-block >}}

If the backend evaluator encounters an error, a RemoteEvaluatorError is raised. Inspect backend_error for details:

{{< code-block lang="python" >}}
from ddtrace.llmobs import RemoteEvaluator, RemoteEvaluatorError, EvaluatorContext

evaluator = RemoteEvaluator(eval_name="quality-assessment")
context = EvaluatorContext(
    input_data={"query": "What is the capital of France?"},
    output_data="Paris",
)

try:
    result = evaluator.evaluate(context)
except RemoteEvaluatorError as e:
    print(e.backend_error)  # {"type": "...", "message": "...", "recommended_resolution": "..."}
{{< /code-block >}}

Using evaluators in production

This section covers evaluations you run and submit manually from your application code. To have Datadog run evaluations automatically on production traces, see Custom LLM-as-a-Judge Evaluations instead.

To submit evaluations from your application code, construct the EvaluatorContext yourself, call the evaluator, and submit the result with LLMObs.submit_evaluation(). You can also submit evaluations through the HTTP API.

For the full submit_evaluation() arguments and span-joining options, see the external evaluations documentation. For the HTTP API specification, see the Evaluations API reference.

{{< code-block lang="python" >}}
from ddtrace.llmobs import LLMObs, EvaluatorContext
from ddtrace.llmobs.decorators import llm

# SemanticSimilarityEvaluator is the class-based evaluator defined earlier in this guide.
evaluator = SemanticSimilarityEvaluator(threshold=0.8)


@llm(model_name="claude", name="invoke_llm", model_provider="anthropic")
def llm_call(input_text):
    completion = ...  # Your LLM application logic

    # Build the evaluation context from the span data
    context = EvaluatorContext(
        input_data=input_text,
        output_data=completion,
        expected_output=None,
    )

    # Run the evaluator
    result = evaluator.evaluate(context)

    # Submit the result to Datadog
    LLMObs.submit_evaluation(
        span=LLMObs.export_span(),
        ml_app="chatbot",
        label=evaluator.name,
        metric_type="score",
        value=result.value,
        assessment=result.assessment,
        reasoning=result.reasoning,
    )

    return completion
{{< /code-block >}}

Data model reference

EvaluatorContext

A frozen dataclass containing all the information needed to run an evaluation.

| Field | Type | Description |
|---|---|---|
| input_data | Any | The input provided to the LLM application (for example, a prompt). |
| output_data | Any | The actual output from the LLM application. |
| expected_output | Any | The expected or ideal output the LLM should have produced. |
| metadata | Dict[str, Any] | Additional metadata. |
| span_id | str | The span's unique identifier. |
| trace_id | str | The trace's unique identifier. |

In Experiments, the SDK populates this automatically from each dataset record. In production, you construct it yourself from your span data.

EvaluatorResult

Allows you to return rich evaluation results with additional context. Used in both Experiments and production.

| Field | Type | Description |
|---|---|---|
| value | Union[str, float, int, bool, dict] | The evaluation value. Type depends on metric_type. |
| reasoning | Optional[str] | A text explanation of the evaluation result. |
| assessment | Optional[str] | An assessment of this evaluation. Accepted values are pass and fail. |
| metadata | Optional[Dict[str, Any]] | Additional metadata about the evaluation. |
| tags | Optional[Dict[str, str]] | Tags to apply to the evaluation metric. |

SummaryEvaluatorContext

A frozen dataclass providing aggregated evaluation results across all dataset records in an experiment. Only used by summary evaluators.

| Field | Type | Description |
|---|---|---|
| inputs | List[Any] | List of all input data from the experiment. |
| outputs | List[Any] | List of all output data from the experiment. |
| expected_outputs | List[Any] | List of all expected outputs from the experiment. |
| evaluation_results | Dict[str, List[Any]] | Dictionary mapping evaluator names to their results. |
| metadata | Dict[str, Any] | Additional metadata associated with the experiment. |

Metric types

The metric type is set when submitting an evaluation (through submit_evaluation() or the HTTP API) and determines how the value is validated and displayed in Datadog.

| Metric type | Value type | Use case |
|---|---|---|
| categorical | str | Classifying outputs into categories (for example, "Positive", "Negative", "Neutral") |
| score | float or int | Numeric scores or ratings (for example, 0.0-1.0, 1-10) |
| boolean | bool | Pass/fail or yes/no evaluations |
| json | dict | Structured evaluation data (for example, multi-dimensional rubrics or detailed breakdowns) |
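When an evaluator can return different value types, it can help to derive the metric type from the value before submitting. The helper below is a hypothetical convenience that follows the table above and is not part of the SDK. Note the ordering: in Python, bool is a subclass of int, so the boolean check has to come first or True would be reported as a score.

```python
def infer_metric_type(value) -> str:
    """Map a Python value to the metric type that accepts it."""
    # bool must be checked before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "score"
    if isinstance(value, str):
        return "categorical"
    if isinstance(value, dict):
        return "json"
    raise TypeError(f"Unsupported evaluation value type: {type(value).__name__}")
```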

Best practices

Naming conventions

Evaluation labels must follow these conventions:

  • Must start with a letter
  • Must only contain ASCII alphanumerics, underscores, or hyphens
  • Spaces and other unsupported characters are converted to underscores
  • Unicode is not supported
  • Must not exceed 200 characters (fewer than 100 is preferred)
  • Must be unique for a given LLM application (ml_app) and organization
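The character and length rules above can be checked locally before submitting an evaluation. The following is a minimal sketch with hypothetical helper names; it covers only the rules that can be verified offline (uniqueness per ml_app and organization cannot), and the choice to raise an error when a cleaned label still does not start with a letter is an assumption.

```python
import re

# A letter first, then ASCII alphanumerics, underscores, or hyphens.
_LABEL_RE = re.compile(r"[A-Za-z][A-Za-z0-9_-]*")
MAX_LABEL_LENGTH = 200


def is_valid_label(label: str) -> bool:
    """Check the start-character, allowed-character, and length rules."""
    return len(label) <= MAX_LABEL_LENGTH and _LABEL_RE.fullmatch(label) is not None


def sanitize_label(label: str) -> str:
    """Replace unsupported characters with underscores and truncate to the limit."""
    cleaned = re.sub(r"[^A-Za-z0-9_-]", "_", label)[:MAX_LABEL_LENGTH]
    if not is_valid_label(cleaned):
        raise ValueError(f"Label must start with a letter: {label!r}")
    return cleaned
```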

Concurrent execution

Set the jobs parameter of experiment.run() to run tasks and evaluators concurrently on multiple threads, so experiments complete faster when processing multiple dataset records.

Asynchronous evaluators are not yet supported for concurrent execution. Only synchronous evaluators benefit from parallel execution.

OpenTelemetry integration

When submitting evaluations for OpenTelemetry-instrumented spans, include the source:otel tag in the evaluation. See the external evaluations documentation for examples.

Further Reading

{{< partial name="whats-next/whats-next.html" >}}