title

Evaluation Developer Guide

description

Learn how to build custom evaluators using the LLM Observability SDK.

further_reading

link	tag	text
/llm_observability/evaluations/external_evaluations	Documentation	Learn about submitting external evaluations

link	tag	text
/llm_observability/setup/sdk/python	Documentation	Learn about the LLM Observability SDK for Python

link	tag	text
/llm_observability/instrumentation/api	Documentation	Learn about the HTTP API Reference

Overview

This guide covers how to build custom evaluators with the LLM Observability SDK and use them in LLM Experiments and in production.

Key concepts

An evaluation measures a specific quality of your LLM application's output, such as accuracy, tone, or harmfulness. You write the evaluation logic inside an evaluator, which receives context about the LLM interaction and returns a result.

Running evaluators in an Experiment

To test your LLM application against a dataset before deploying, run your evaluators in LLM Experiments. In Experiments, evaluators run automatically: the SDK calls each evaluator once for every dataset record. Use evaluators through the SDK.

Running evaluators in production

To monitor the quality of your live LLM responses, run evaluators in production. You can run evaluators manually with submit_evaluation(), or automatically with custom LLM-as-a-judge evaluations. Use evaluators through the SDK, HTTP API, or the Datadog UI.

For production, there are two approaches:

Manual evaluations (this guide): You run evaluators in your application code and submit results with LLMObs.submit_evaluation() or the HTTP API. This gives you full control over evaluation logic and timing.
Custom LLM-as-a-judge evaluations: You configure evaluations in the Datadog UI using natural language prompts. Datadog automatically runs them on production traces in real time, with no code changes required.

This guide focuses on manual evaluations. For managed LLM-as-a-judge evaluations, see Custom LLM-as-a-Judge Evaluations.

Evaluation components

The evaluation system has four main components:

EvaluatorContext: The input to an evaluator. Contains the LLM's input, output, expected output, and span identifiers. In Experiments, the SDK builds this automatically from each dataset record. In production, you construct the EvaluatorContext yourself.
EvaluatorResult: The output of an evaluator. Contains a typed value, optional reasoning, a pass/fail assessment, metadata, and tags. You can also return a plain value (str, float, int, bool, dict) instead.
Metric type: Determines how the evaluation value is interpreted and displayed: categorical (string labels), score (numeric), boolean (pass/fail), or json (structured data).
SummaryEvaluatorContext: Experiments only. After all dataset records are evaluated, summary evaluators receive the aggregated results to compute statistics like averages or pass rates.

The typical flow:

Experiments: Dataset record → EvaluatorContext → Evaluator → EvaluatorResult → (after all records) SummaryEvaluatorContext → Summary evaluator → summary result
Production: Span data → EvaluatorContext (built manually) → Evaluator → EvaluatorResult → LLMObs.submit_evaluation() or HTTP API

Building evaluators

There are two ways to define an evaluator using LLM Observability: class-based and function-based. In addition to these evaluators, LLM Observability integrates with open source evaluation frameworks, such as DeepEval and Pydantic, which you can use in LLM Observability Experiments.

	Class-based	Function-based
Best for	Reusable evaluators with custom configuration or state.	One-off evaluators with straightforward logic.
Receives	An `EvaluatorContext` object with full span context (input, output, expected output, metadata, span/trace IDs).	`input_data`, `output_data`, and `expected_output` as separate arguments.
Supports summary evaluators	Yes (`BaseSummaryEvaluator`).	No.

If you are unsure, start with class-based evaluators. They cover everything function-based evaluators can do, and they also support custom configuration, state, and summary evaluators.

Class-based evaluators

Class-based evaluators provide a structured way to implement reusable evaluation logic with custom configuration.

BaseEvaluator

Subclass BaseEvaluator to create an evaluator that runs on a single span or dataset record. Implement the evaluate method, which receives an EvaluatorContext and returns an EvaluatorResult (or a plain value).

{{< code-block lang="python" >}}
from ddtrace.llmobs import BaseEvaluator, EvaluatorContext, EvaluatorResult

class SemanticSimilarityEvaluator(BaseEvaluator):
    """Evaluates semantic similarity between output and expected output."""

    def __init__(self, threshold: float = 0.8):
        super().__init__(name="semantic_similarity")
        self.threshold = threshold

    def evaluate(self, context: EvaluatorContext) -> EvaluatorResult:
        # compute_similarity is a placeholder for your own similarity function
        score = compute_similarity(context.output_data, context.expected_output)

        return EvaluatorResult(
            value=score,
            reasoning=f"Similarity score: {score:.2f}",
            assessment="pass" if score >= self.threshold else "fail",
            metadata={"threshold": self.threshold},
            tags={"type": "semantic"},
        )
{{< /code-block >}}

Call super().__init__(name="evaluator_name") to set the evaluator's label.
Implement evaluate(context: EvaluatorContext) with your evaluation logic.
Return an EvaluatorResult for rich results, or a plain value (str, float, int, bool, dict).

BaseSummaryEvaluator

Summary evaluators are only available in experiments.

Subclass BaseSummaryEvaluator to create an evaluator that operates on the aggregated results of an entire experiment run. It receives a SummaryEvaluatorContext containing all inputs, outputs, and per-evaluator results.

{{< code-block lang="python" >}}
from ddtrace.llmobs import BaseSummaryEvaluator, SummaryEvaluatorContext

class AverageScoreEvaluator(BaseSummaryEvaluator):
    """Computes average score across all evaluation results."""

    def __init__(self, target_evaluator: str):
        super().__init__(name="average_score")
        self.target_evaluator = target_evaluator

    def evaluate(self, context: SummaryEvaluatorContext):
        scores = context.evaluation_results.get(self.target_evaluator, [])
        if not scores:
            return None
        return sum(scores) / len(scores)
{{< /code-block >}}

Call super().__init__(name="evaluator_name") to set the evaluator's label.
Access per-evaluator results through context.evaluation_results, which maps evaluator names to lists of results.
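The overview mentions pass rates as another common summary statistic. The sketch below computes one over a plain dictionary shaped like `context.evaluation_results`. It is an illustration only: `pass_rate` is a hypothetical helper, and it assumes each per-record result is a plain boolean rather than an `EvaluatorResult` object.

```python
from typing import Any, Dict, List, Optional


def pass_rate(
    evaluation_results: Dict[str, List[Any]], evaluator_name: str
) -> Optional[float]:
    """Return the fraction of truthy results for one evaluator, or None if empty."""
    results = evaluation_results.get(evaluator_name, [])
    if not results:
        return None
    return sum(1 for result in results if result) / len(results)


# Hypothetical data shaped like context.evaluation_results
evaluation_results = {"exact_match": [True, False, True, True]}
rate = pass_rate(evaluation_results, "exact_match")
print(rate)
```

The same logic can be placed inside the `evaluate` method of a `BaseSummaryEvaluator` subclass, reading from `context.evaluation_results` instead of a local dictionary.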

LLMJudge

The LLMJudge class enables automated evaluation of LLM outputs using another LLM as the judge. It supports OpenAI, Azure OpenAI, Anthropic, Amazon Bedrock, and custom LLM clients with structured output formats.

Parameters

Parameter	Type	Required	Description
`user_prompt`	`str`	Yes	Prompt template with `{{field.path}}` syntax for span context injection.
`system_prompt`	`str`	No	System prompt to set the judge's behavior or persona.
`structured_output`	`StructuredOutput`	No	Output format specification. See structured output types.
`provider`	`str`	Conditional	LLM provider: `"openai"`, `"azure_openai"`, `"anthropic"`, or `"bedrock"`. Required if `client` is not provided.
`model`	`str`	No	Model identifier (for example, `"gpt-4o"`, `"claude-sonnet-4-20250514"`).
`model_params`	`dict`	No	Additional parameters passed to the LLM API (for example, `temperature`).
`client`	callable	Conditional	Custom LLM client function. Required if `provider` is not provided.
`name`	`str`	No	Evaluator name for identification in results.
`client_options`	`dict`	No	Provider-specific configuration (for example, API keys).

Template variables

The user_prompt supports {{field.path}} syntax to inject context from the evaluated span. Nested paths are supported.

{{input_data}} — The span's input data.
{{output_data}} — The span's output data.
{{expected_output}} — Expected output for comparison (if available).
{{metadata.key}} — Nested metadata fields (for example, {{metadata.topic}}).
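To make the `{{field.path}}` syntax concrete, the following sketch resolves dotted paths against a plain dictionary. It is not the SDK's implementation: `render_prompt` and the `context` dictionary are hypothetical, and leaving unresolved placeholders untouched is an assumed behavior.

```python
import re
from typing import Any, Dict

# Matches {{field.path}} placeholders; whitespace inside the braces is tolerated.
_PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def render_prompt(template: str, context: Dict[str, Any]) -> str:
    """Replace each {{field.path}} with the value found at that dotted path."""

    def resolve(match: "re.Match[str]") -> str:
        value: Any = context
        for key in match.group(1).split("."):
            if not isinstance(value, dict) or key not in value:
                # Assumption: unresolved placeholders are left as-is.
                return match.group(0)
            value = value[key]
        return str(value)

    return _PLACEHOLDER.sub(resolve, template)


context = {
    "input_data": "What is the capital of France?",
    "output_data": "Paris",
    "metadata": {"topic": "geography"},
}
prompt = render_prompt(
    "Topic: {{metadata.topic}}\nQuestion: {{input_data}}\nAnswer: {{output_data}}",
    context,
)
print(prompt)
```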

Structured output types

Output type	Description
`BooleanStructuredOutput`	Returns `True`/`False` with optional pass/fail assessment.
`ScoreStructuredOutput`	Returns a numeric score within a defined range, with optional thresholds.
`CategoricalStructuredOutput`	Returns one of a predefined set of categories, with optional pass values.
`Dict[str, JSONType]`	Custom JSON schema for arbitrary structured output.

All structured output types accept reasoning=True to include an explanation in results, and reasoning_description to customize the reasoning field's description.

Example: Boolean evaluation

{{< code-block lang="python" >}}
from ddtrace.llmobs import LLMJudge, BooleanStructuredOutput

judge = LLMJudge(
    provider="openai",
    model="gpt-4o",
    user_prompt="Is this response factually accurate? Response: {{output_data}}",
    structured_output=BooleanStructuredOutput(
        description="Whether the response is factually accurate",
        reasoning=True,
        pass_when=True,
    ),
)
{{< /code-block >}}

Example: Score-based evaluation with thresholds

{{< code-block lang="python" >}}
from ddtrace.llmobs import LLMJudge, ScoreStructuredOutput

judge = LLMJudge(
    provider="anthropic",
    model="claude-sonnet-4-20250514",
    user_prompt="Rate the helpfulness of this response (1-10): {{output_data}}",
    structured_output=ScoreStructuredOutput(
        description="Helpfulness score",
        min_score=1,
        max_score=10,
        reasoning=True,
        min_threshold=7,  # Scores >= 7 pass
    ),
)
{{< /code-block >}}

Example: Categorical evaluation

{{< code-block lang="python" >}}
from ddtrace.llmobs import LLMJudge, CategoricalStructuredOutput

judge = LLMJudge(
    provider="openai",
    model="gpt-4o",
    user_prompt="Classify the sentiment: {{output_data}}",
    structured_output=CategoricalStructuredOutput(
        categories={
            "positive": "The response has a positive sentiment.",
            "neutral": "The response has a neutral sentiment.",
            "negative": "The response has a negative sentiment.",
        },
        reasoning=True,
        pass_values=["positive", "neutral"],
    ),
)
{{< /code-block >}}

Example: Azure OpenAI

{{< code-block lang="python" >}}
from ddtrace.llmobs import LLMJudge, BooleanStructuredOutput

judge = LLMJudge(
    provider="azure_openai",
    model="gpt-4o",
    user_prompt="Is this response factually accurate? Response: {{output_data}}",
    structured_output=BooleanStructuredOutput(
        description="Whether the response is factually accurate",
        reasoning=True,
        pass_when=True,
    ),
    client_options={
        "azure_endpoint": "https://your-resource.openai.azure.com",
        "api_version": "2024-10-21",
        "azure_deployment": "gpt-4o",
    },
)
{{< /code-block >}}

The azure_openai provider accepts the following client_options:

Option	Environment variable	Description
`api_key`	`AZURE_OPENAI_API_KEY`	Azure OpenAI API key.
`azure_endpoint`	`AZURE_OPENAI_ENDPOINT`	Azure OpenAI endpoint URL.
`api_version`	`AZURE_OPENAI_API_VERSION`	API version. Defaults to `"2024-10-21"`.
`azure_deployment`	`AZURE_OPENAI_DEPLOYMENT`	Deployment name. Falls back to the `model` parameter.

Example: Custom LLM client

{{< code-block lang="python" >}}
from ddtrace.llmobs import LLMJudge, BooleanStructuredOutput

def my_llm_client(provider, messages, json_schema, model, model_params):
    # call_my_llm is a placeholder for your own LLM client call
    response = call_my_llm(messages, model)
    return response

judge = LLMJudge(
    client=my_llm_client,
    model="my-custom-model",
    user_prompt="Is this response accurate? {{output_data}}",
    structured_output=BooleanStructuredOutput(
        description="Accuracy check",
        reasoning=True,
        pass_when=True,
    ),
)
{{< /code-block >}}

Key points

Requires either a provider ("openai", "azure_openai", "anthropic", or "bedrock") or a custom client.
Set API keys using client_options={"api_key": "..."} or environment variables (OPENAI_API_KEY, ANTHROPIC_API_KEY). For Azure OpenAI, set AZURE_OPENAI_API_KEY and AZURE_OPENAI_ENDPOINT. For Bedrock, configure AWS credentials through environment variables or client_options.
Use reasoning=True in structured outputs to include an explanation in results.
Define pass/fail criteria with pass_when (boolean), pass_values (categorical), or min_threshold/max_threshold (score).
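As an illustration of how score thresholds can translate into a pass/fail assessment, consider the sketch below. `assess_score` is a hypothetical helper, not part of the SDK. The inclusive lower bound follows the "Scores >= 7 pass" comment in the score example; treating the upper bound as inclusive and returning `None` when no threshold is set are assumptions.

```python
from typing import Optional


def assess_score(
    score: float,
    min_threshold: Optional[float] = None,
    max_threshold: Optional[float] = None,
) -> Optional[str]:
    """Return "pass" or "fail", or None when no threshold is configured."""
    if min_threshold is None and max_threshold is None:
        return None
    # Assumption: both bounds are inclusive.
    if min_threshold is not None and score < min_threshold:
        return "fail"
    if max_threshold is not None and score > max_threshold:
        return "fail"
    return "pass"


print(assess_score(8, min_threshold=7))
print(assess_score(5, min_threshold=7))
```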

Publishing an LLMJudge as a Datadog managed evaluation

Use LLMObs.publish_evaluator() to push a locally-defined LLMJudge configuration to Datadog as a custom LLM-as-a-judge draft. This lets you define and validate an evaluator in experiments, then promote it to production without manually recreating the configuration in the UI.

Parameter	Type	Required	Description
`evaluator`	`LLMJudge`	Yes	The `LLMJudge` instance to publish.
`ml_app`	`str`	Yes	The LLM application name.
`eval_name`	`str`	No	The name to use for the evaluator in Datadog. If omitted, defaults to the `name` set on the `LLMJudge` instance.
`variable_mapping`	`dict[str, str]`	No	Remaps variable names in `user_prompt` to Datadog span field paths in the published evaluator.

{{< code-block lang="python" >}}
from ddtrace.llmobs import LLMObs
from ddtrace.llmobs._evaluators import BooleanStructuredOutput, LLMJudge

LLMObs.enable(
    ml_app="my-ml-app",
    api_key="<DD_API_KEY>",
    app_key="<DD_APP_KEY>",
)

judge = LLMJudge(
    provider="openai",
    model="gpt-4o",
    system_prompt="You are a helpful evaluator.",
    user_prompt=(
        "Does the output correctly answer the question?\n"
        "Input: {{input_data}}\n"
        "Output: {{output_data}}"
    ),
    structured_output=BooleanStructuredOutput("correctness", pass_when=True),
    name="my-correctness-judge",
)

result = LLMObs.publish_evaluator(
    judge,
    ml_app="my-ml-app",
    variable_mapping={"input_data": "span_input", "output_data": "span_output"},
)
print(result["ui_url"])
{{< /code-block >}}

LLMObs.publish_evaluator() returns {"ui_url": "..."}, which links to the evaluator in Datadog.

Each call to LLMObs.publish_evaluator() creates or updates the evaluator draft. Activate it from the Datadog UI to run it in production.

Built-in evaluators

The SDK provides built-in evaluators for common evaluation patterns. These are class-based evaluators that you can use directly without writing custom logic.

StringCheckEvaluator

Performs string comparison operations between output_data and expected_output.

Operation	Description
`eq`	Exact match (default)
`ne`	Not equals
`contains`	`output_data` contains `expected_output` (case-sensitive)
`icontains`	`output_data` contains `expected_output` (case-insensitive)

{{< code-block lang="python" >}}
from ddtrace.llmobs._evaluators import StringCheckEvaluator

# Perform an exact match (default)
evaluator = StringCheckEvaluator(operation="eq", case_sensitive=True)

# Check whether output_data contains expected_output (case-insensitive)
evaluator = StringCheckEvaluator(operation="icontains", strip_whitespace=True)

# Extract field from dict output before comparison
evaluator = StringCheckEvaluator(
    operation="eq",
    output_extractor=lambda x: x.get("message", "") if isinstance(x, dict) else str(x),
)
{{< /code-block >}}
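The four operations in the table can be read as the plain string comparisons sketched below. `string_check` is a hypothetical stand-in written for illustration; it does not reproduce the evaluator's `case_sensitive`, `strip_whitespace`, or `output_extractor` options.

```python
def string_check(output_data: str, expected_output: str, operation: str = "eq") -> bool:
    """Compare two strings using one of the four operations from the table."""
    if operation == "eq":
        return output_data == expected_output
    if operation == "ne":
        return output_data != expected_output
    if operation == "contains":
        return expected_output in output_data
    if operation == "icontains":
        return expected_output.lower() in output_data.lower()
    raise ValueError(f"Unknown operation: {operation}")


print(string_check("The capital is Paris.", "paris", "contains"))   # False
print(string_check("The capital is Paris.", "paris", "icontains"))  # True
```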

RegexMatchEvaluator

Validates output against a regex pattern.

Match mode	Description
`search`	Partial match anywhere in string (default)
`match`	Match from start of string
`fullmatch`	Match entire string

{{< code-block lang="python" >}}
from ddtrace.llmobs._evaluators import RegexMatchEvaluator
import re

# Validate email format
evaluator = RegexMatchEvaluator(
    pattern=r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$",
    match_mode="fullmatch"
)

# Validate output pattern (case-insensitive)
evaluator = RegexMatchEvaluator(
    pattern=r"success|completed",
    flags=re.IGNORECASE
)
{{< /code-block >}}
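The three match modes share their names with functions in Python's standard `re` module. The sketch below shows how those functions differ on the same input. That the evaluator delegates to them directly is an assumption.

```python
import re

pattern = re.compile(r"success|completed", re.IGNORECASE)
text = "Job COMPLETED successfully"

found_anywhere = pattern.search(text) is not None     # True: "COMPLETED" appears mid-string
found_at_start = pattern.match(text) is not None      # False: the string starts with "Job"
matches_whole = pattern.fullmatch(text) is not None   # False: the pattern does not cover the whole string

print(found_anywhere, found_at_start, matches_whole)
```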

LengthEvaluator

Validates output length constraints.

Count type	Description
`characters`	Count characters (default)
`words`	Count words
`lines`	Count lines

{{< code-block lang="python" >}}
from ddtrace.llmobs._evaluators import LengthEvaluator

# Ensure response is 50-200 characters
evaluator = LengthEvaluator(min_length=50, max_length=200, count_type="characters")

# Validate word count
evaluator = LengthEvaluator(min_length=10, max_length=100, count_type="words")
{{< /code-block >}}
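For intuition about the three count types, the sketch below counts characters, whitespace-separated words, and lines with plain Python. `count_length` is a hypothetical helper, and the exact tokenization the evaluator uses (for example, how it splits words) is an assumption.

```python
def count_length(text: str, count_type: str = "characters") -> int:
    """Count characters, whitespace-separated words, or lines in text."""
    if count_type == "characters":
        return len(text)
    if count_type == "words":
        return len(text.split())
    if count_type == "lines":
        return len(text.splitlines())
    raise ValueError(f"Unknown count type: {count_type}")


sample = "hello world\nsecond line"
print(count_length(sample, "characters"))
print(count_length(sample, "words"))
print(count_length(sample, "lines"))
```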

JSONEvaluator

Validates that output is valid JSON, and optionally checks for required keys.

{{< code-block lang="python" >}}
from ddtrace.llmobs._evaluators import JSONEvaluator

# Validate JSON syntax
evaluator = JSONEvaluator()

# Validate that required keys exist
evaluator = JSONEvaluator(required_keys=["name", "status", "data"])
{{< /code-block >}}
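The two checks described above (valid syntax, then required keys) can be sketched with the standard `json` module. `check_json` is a hypothetical helper, and its return shape and messages are illustrative rather than the evaluator's actual output.

```python
import json
from typing import Sequence, Tuple


def check_json(output: str, required_keys: Sequence[str] = ()) -> Tuple[bool, str]:
    """Return (passed, reason) for a JSON syntax check plus optional key check."""
    try:
        parsed = json.loads(output)
    except (TypeError, json.JSONDecodeError) as exc:
        return False, f"Invalid JSON: {exc}"
    if required_keys:
        if not isinstance(parsed, dict):
            return False, "Top-level JSON value is not an object"
        missing = [key for key in required_keys if key not in parsed]
        if missing:
            return False, f"Missing keys: {missing}"
    return True, "Valid JSON"


print(check_json('{"name": "a", "status": "ok"}', ["name", "status", "data"]))
```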

SemanticSimilarityEvaluator

Measures semantic similarity between output_data and expected_output using embeddings. Returns a similarity score between 0.0 and 1.0.

{{< code-block lang="python" >}}
from ddtrace.llmobs._evaluators import SemanticSimilarityEvaluator
from openai import OpenAI

client = OpenAI()

def get_embedding(text):
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )
    return response.data[0].embedding

evaluator = SemanticSimilarityEvaluator(
    embedding_fn=get_embedding,
    threshold=0.8  # Minimum similarity score to pass
)
{{< /code-block >}}
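Embedding-based similarity is commonly computed as the cosine of the angle between two vectors. The sketch below shows that calculation with the standard library. Whether the built-in evaluator uses cosine similarity, and how it maps the result into the 0.0 to 1.0 range, is an assumption; `cosine_similarity` is a hypothetical helper.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("Embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a zero vector is treated as having no similarity.
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```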

Function-based evaluators

For straightforward evaluation logic, define a function instead of a class. Function-based evaluators receive the input, output, and expected output directly as arguments.

{{< code-block lang="python" >}}
from ddtrace.llmobs import EvaluatorResult

def exact_match_evaluator(input_data, output_data, expected_output):
    """Checks if output exactly matches expected output."""
    matches = output_data == expected_output
    return EvaluatorResult(
        value=matches,
        reasoning="Exact match" if matches else "Output differs from expected",
        assessment="pass" if matches else "fail",
    )
{{< /code-block >}}

Function signature:

{{< code-block lang="python" >}}
def evaluator_function(
    input_data: Any,
    output_data: Any,
    expected_output: Any
) -> Union[JSONType, EvaluatorResult]:
    ...
{{< /code-block >}}

You can return either:

A plain value (str, float, int, bool, dict), or
An EvaluatorResult for rich results with reasoning and metadata
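As a complement to the `EvaluatorResult` example, here is a sketch of a function-based evaluator that returns a plain `float`. The `keyword_coverage` name and the idea of storing expected keywords as a list in `expected_output` are hypothetical choices made for this illustration.

```python
from typing import Any


def keyword_coverage(input_data: Any, output_data: Any, expected_output: Any) -> float:
    """Fraction of expected keywords that appear in the output (0.0 to 1.0)."""
    keywords = [str(keyword).lower() for keyword in expected_output]
    if not keywords:
        return 0.0
    text = str(output_data).lower()
    return sum(1 for keyword in keywords if keyword in text) / len(keywords)


score = keyword_coverage(
    input_data={"question": "Name two primary colors."},
    output_data="Red and blue are primary colors.",
    expected_output=["red", "blue", "yellow"],
)
print(score)
```

Because the function only depends on its three arguments, you can unit test it without running an experiment.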

Using evaluators in experiments

Pass your evaluators to LLMObs.experiment() to run them against every record in a dataset. The SDK automatically builds an EvaluatorContext for each record and calls your evaluator. After all records are processed, any summary evaluators run on the aggregated results.

{{< code-block lang="python" >}}
from ddtrace.llmobs import LLMObs, Dataset, DatasetRecord

# Create dataset
dataset = Dataset(
    name="qa_dataset",
    records=[
        DatasetRecord(
            input_data={"question": "What is 2+2?"},
            expected_output="4"
        ),
        DatasetRecord(
            input_data={"question": "What is the capital of France?"},
            expected_output="Paris"
        ),
    ]
)

# Define task (generate_answer is a placeholder for your application logic)
def qa_task(input_data, config):
    return generate_answer(input_data["question"])

# Create evaluators (classes and functions defined in the sections above)
semantic_eval = SemanticSimilarityEvaluator(threshold=0.7)
summary_eval = AverageScoreEvaluator("semantic_similarity")

# Run experiment
experiment = LLMObs.experiment(
    name="qa_experiment",
    task=qa_task,
    dataset=dataset,
    evaluators=[semantic_eval, exact_match_evaluator],
    summary_evaluators=[summary_eval]
)

experiment.run()
{{< /code-block >}}

Using managed evaluators

RemoteEvaluator lets you reference a custom LLM-as-a-judge evaluation configured in the Datadog UI by name, and run it as part of a local experiment. This allows you to reuse your production evaluators in offline experiments without reimplementing the evaluation logic in Python.

Parameter	Type	Description
`eval_name`	`str`	The name of the LLM-as-a-judge evaluator as configured in Datadog.
`transform_fn`	`Optional[Callable]`	A function that maps an `EvaluatorContext` to a dict of template variable values.

{{< code-block lang="python" >}}
from ddtrace.llmobs import LLMObs, RemoteEvaluator

evaluator = RemoteEvaluator(eval_name="quality-assessment")

experiment = LLMObs.experiment(
    name="my-experiment",
    task=my_task,
    dataset=dataset,
    evaluators=[evaluator],
)
experiment.run()
{{< /code-block >}}

Mapping dataset data to prompt variables with `transform_fn`

When you configure an LLM-as-a-judge in the Datadog UI, the prompt template uses variables such as {{span_input}} and {{span_output}}. By default, RemoteEvaluator maps the following:

input_data → span_input
output_data → span_output
expected_output → meta.expected_output
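Expressed as a dictionary, the default mapping could look like the sketch below. This is an illustration under assumptions, not the SDK's code: `default_template_variables` is a hypothetical helper, and nesting `expected_output` under a `meta` key is inferred from the `meta.expected_output` path.

```python
from typing import Any, Dict


def default_template_variables(
    input_data: Any, output_data: Any, expected_output: Any = None
) -> Dict[str, Any]:
    """Build template variables following the default mapping listed above."""
    return {
        "span_input": input_data,                      # {{span_input}}
        "span_output": output_data,                    # {{span_output}}
        "meta": {"expected_output": expected_output},  # {{meta.expected_output}}
    }


variables = default_template_variables(
    input_data="What is the capital of France?",
    output_data="Paris",
    expected_output="Paris",
)
print(variables)
```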

If your dataset records have a different structure—for example, input_data is a dict with multiple keys—provide a transform_fn to control exactly which values are sent for each template variable:

{{< code-block lang="python" >}}
from ddtrace.llmobs import RemoteEvaluator, EvaluatorContext

def my_transform(context: EvaluatorContext) -> dict:
    # input_data is a dict: {"user_query": str, "retrieved_docs": list[str]}
    return {
        "span_input": context.input_data.get("user_query"),  # → {{span_input}} in the prompt
        "span_output": context.output_data,  # → {{span_output}} in the prompt
        "meta": {
            "retrieved_docs": context.input_data.get("retrieved_docs"),  # → {{meta.retrieved_docs}}
        },
    }

evaluator = RemoteEvaluator(
    eval_name="quality-assessment",
    transform_fn=my_transform,
)
{{< /code-block >}}

If the backend evaluator encounters an error, a RemoteEvaluatorError is raised. Inspect backend_error for details:

{{< code-block lang="python" >}}
from ddtrace.llmobs import RemoteEvaluator, RemoteEvaluatorError, EvaluatorContext

evaluator = RemoteEvaluator(eval_name="quality-assessment")
context = EvaluatorContext(
    input_data={"query": "What is the capital of France?"},
    output_data="Paris",
)

try:
    result = evaluator.evaluate(context)
except RemoteEvaluatorError as e:
    print(e.backend_error)  # {"type": "...", "message": "...", "recommended_resolution": "..."}
{{< /code-block >}}

Using evaluators in production

This section covers evaluations you run and submit manually from your application code. To have Datadog run evaluations automatically on production traces, see Custom LLM-as-a-Judge Evaluations instead.

To submit evaluations from your application code, construct the EvaluatorContext yourself, call the evaluator, and submit the result with LLMObs.submit_evaluation(). You can also submit evaluations through the HTTP API.

For the full submit_evaluation() arguments and span-joining options, see the external evaluations documentation. For the HTTP API specification, see the Evaluations API reference.

{{< code-block lang="python" >}}
from ddtrace.llmobs import LLMObs, EvaluatorContext
from ddtrace.llmobs.decorators import llm

evaluator = SemanticSimilarityEvaluator(threshold=0.8)

@llm(model_name="claude", name="invoke_llm", model_provider="anthropic")
def llm_call(input_text):
    completion = ...  # Your LLM application logic

    # Build the evaluation context from the span data
    context = EvaluatorContext(
        input_data=input_text,
        output_data=completion,
        expected_output=None,
    )

    # Run the evaluator
    result = evaluator.evaluate(context)

    # Submit the result to Datadog
    LLMObs.submit_evaluation(
        span=LLMObs.export_span(),
        ml_app="chatbot",
        label=evaluator.name,
        metric_type="score",
        value=result.value,
        assessment=result.assessment,
        reasoning=result.reasoning,
    )

    return completion
{{< /code-block >}}

Data model reference

EvaluatorContext

A frozen dataclass containing all the information needed to run an evaluation.

Field	Type	Description
`input_data`	`Any`	The input provided to the LLM application (for example, a prompt).
`output_data`	`Any`	The actual output from the LLM application.
`expected_output`	`Any`	The expected or ideal output the LLM should have produced.
`metadata`	`Dict[str, Any]`	Additional metadata.
`span_id`	`str`	The span's unique identifier.
`trace_id`	`str`	The trace's unique identifier.

In Experiments, the SDK populates this automatically from each dataset record. In production, you construct it yourself from your span data.

EvaluatorResult

Allows you to return rich evaluation results with additional context. Used in both Experiments and production.

Field	Type	Description
`value`	`Union[str, float, int, bool, dict]`	The evaluation value. Type depends on `metric_type`.
`reasoning`	`Optional[str]`	A text explanation of the evaluation result.
`assessment`	`Optional[str]`	An assessment of this evaluation. Accepted values are `pass` and `fail`.
`metadata`	`Optional[Dict[str, Any]]`	Additional metadata about the evaluation.
`tags`	`Optional[Dict[str, str]]`	Tags to apply to the evaluation metric.

SummaryEvaluatorContext

A frozen dataclass providing aggregated evaluation results across all dataset records in an experiment. Only used by summary evaluators.

Field	Type	Description
`inputs`	`List[Any]`	List of all input data from the experiment.
`outputs`	`List[Any]`	List of all output data from the experiment.
`expected_outputs`	`List[Any]`	List of all expected outputs from the experiment.
`evaluation_results`	`Dict[str, List[Any]]`	Dictionary mapping evaluator names to their results.
`metadata`	`Dict[str, Any]`	Additional metadata associated with the experiment.

Metric types

The metric type is set when submitting an evaluation (through submit_evaluation() or the HTTP API) and determines how the value is validated and displayed in Datadog.

Metric type	Value type	Use case
`categorical`	`str`	Classifying outputs into categories (for example, "Positive", "Negative", "Neutral")
`score`	`float` or `int`	Numeric scores or ratings (for example, 0.0-1.0, 1-10)
`boolean`	`bool`	Pass/fail or yes/no evaluations
`json`	`dict`	Structured evaluation data (for example, multi-dimensional rubrics or detailed breakdowns)
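When a result's Python type is known only at runtime, a small helper can pick the matching metric type from the table above. `infer_metric_type` is a hypothetical helper shown as a sketch; the SDK may perform its own validation.

```python
from typing import Any


def infer_metric_type(value: Any) -> str:
    """Map a Python value to the metric type listed for it in the table."""
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "score"
    if isinstance(value, str):
        return "categorical"
    if isinstance(value, dict):
        return "json"
    raise TypeError(f"Unsupported evaluation value type: {type(value).__name__}")


print(infer_metric_type(True))
print(infer_metric_type(0.92))
print(infer_metric_type("Positive"))
print(infer_metric_type({"clarity": 4, "tone": 5}))
```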

Best practices

Naming conventions

Evaluation labels must follow these conventions:

Must start with a letter
Must only contain ASCII alphanumerics, underscores, or hyphens
Spaces and other unsupported characters are converted to underscores
Unicode is not supported
Must not exceed 200 characters (fewer than 100 is preferred)
Must be unique for a given LLM application (ml_app) and organization
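You can check the syntactic rules above locally before submitting an evaluation. The sketch below is a hypothetical pre-check, not part of the SDK: `normalize_label` replaces unsupported characters with underscores and raises an error for the rules that cannot be repaired automatically. It does not check uniqueness.

```python
import re

MAX_LABEL_LENGTH = 200
# Anything other than ASCII letters, digits, underscores, or hyphens.
_UNSUPPORTED = re.compile(r"[^A-Za-z0-9_-]")


def normalize_label(label: str) -> str:
    """Replace unsupported characters with underscores and validate the result."""
    normalized = _UNSUPPORTED.sub("_", label)
    if not normalized or not normalized[0].isalpha():
        raise ValueError("Label must start with a letter")
    if len(normalized) > MAX_LABEL_LENGTH:
        raise ValueError(f"Label must not exceed {MAX_LABEL_LENGTH} characters")
    return normalized


print(normalize_label("answer relevance"))
```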

Concurrent execution

Set the jobs parameter to run tasks and evaluators concurrently on multiple threads, allowing experiments to complete faster when processing multiple dataset records.

Asynchronous evaluators are not yet supported for concurrent execution. Only synchronous evaluators benefit from parallel execution.
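To illustrate why synchronous evaluators benefit from threads, the sketch below runs one evaluator function over several records with a standard-library thread pool. It is a conceptual illustration only: `run_concurrently` and the record dictionaries are hypothetical, and the SDK's own scheduling behind `jobs` may differ.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List


def exact_match(input_data: Any, output_data: Any, expected_output: Any) -> bool:
    """A synchronous function-based evaluator."""
    return output_data == expected_output


def run_concurrently(
    evaluator: Callable[[Any, Any, Any], Any],
    records: List[Dict[str, Any]],
    jobs: int = 4,
) -> List[Any]:
    """Evaluate every record on a pool of `jobs` threads, preserving input order."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(
            pool.map(
                lambda r: evaluator(r["input_data"], r["output_data"], r["expected_output"]),
                records,
            )
        )


records = [
    {"input_data": "2+2", "output_data": "4", "expected_output": "4"},
    {"input_data": "3+3", "output_data": "5", "expected_output": "6"},
    {"input_data": "1+1", "output_data": "2", "expected_output": "2"},
]
results = run_concurrently(exact_match, records, jobs=2)
print(results)
```

Threads help most when the evaluator spends its time waiting on I/O, such as a call to an LLM or embedding API.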

OpenTelemetry integration

When submitting evaluations for OpenTelemetry-instrumented spans, include the source:otel tag in the evaluation. See the external evaluations documentation for examples.
