Custom Evaluators

Custom evaluators let you score agent traces with your own logic. An evaluator is any program that reads EvalInput JSON from stdin and writes EvalResult JSON to stdout. This simple protocol means you can write evaluators in Python, JavaScript/TypeScript, or any language that can read/write JSON.

Quick Start

1. Scaffold an evaluator

agentevals evaluator init my_evaluator

This creates a directory with boilerplate code and an evaluator.yaml manifest:

my_evaluator/
├── my_evaluator.py     # scoring logic (implement your checks here)
└── evaluator.yaml      # metadata manifest

You can also specify a language:

agentevals evaluator init my_evaluator --runtime js    # JavaScript
agentevals evaluator init my_evaluator.ts              # TypeScript (inferred from extension)

2. Install the SDK (Python only)

pip install agentevals-evaluator-sdk

3. Write an evaluator

# evaluators/response_quality.py
from agentevals_evaluator_sdk import evaluator, EvalInput, EvalResult

@evaluator
def response_quality(input: EvalInput) -> EvalResult:
    scores = []
    for inv in input.invocations:
        if not inv.final_response:
            scores.append(0.0)
        elif len(inv.final_response.strip()) < input.config.get("min_length", 10):
            scores.append(0.5)
        else:
            scores.append(1.0)

    return EvalResult(
        score=sum(scores) / len(scores) if scores else 0.0,
        per_invocation_scores=scores,
    )

if __name__ == "__main__":
    response_quality.run()

The @evaluator decorator marks your function as an evaluator. Call .run() to execute it as a stdin/stdout script. Your function receives an EvalInput and returns an EvalResult. The decorated function can still be called directly in tests.

4. Add it to your eval config

# eval_config.yaml
evaluators:
  - name: tool_trajectory_avg_score   # built-in metric
    type: builtin

  - name: response_quality            # your custom evaluator
    type: code
    path: ./evaluators/response_quality.py
    threshold: 0.7
    config:
      min_length: 20

5. Run

agentevals run traces/my_trace.json \
  --config eval_config.yaml \
  --eval-set eval_set.json

Eval Config Reference

Each evaluator entry in the evaluators list uses the following fields. The type field determines which other fields are valid.

type: code (local scripts)

| Field | Required | Default | Description |
| --- | --- | --- | --- |
| name | yes | | Unique name for the evaluator (used in output) |
| type | yes | | code for local code files |
| path | yes | | Path to the evaluator file (.py, .js, or .ts) |
| threshold | no | 0.5 | Score at or above this value means PASSED |
| timeout | no | 30 | Subprocess timeout in seconds |
| config | no | {} | Arbitrary key-value pairs passed to the evaluator |

type: openai_eval (OpenAI Evals API)

| Field | Required | Default | Description |
| --- | --- | --- | --- |
| name | yes | | Unique name for the evaluator (used in output) |
| type | yes | | openai_eval for OpenAI Evals API graders |
| threshold | no | 0.5 | Maps to pass_threshold in the OpenAI grader |
| timeout | no | 120 | Max seconds to wait for the OpenAI eval run |
| grader | yes | | OpenAI grader config (see OpenAI Evals Graders) |

Protocol

Every evaluator — regardless of language — communicates via the same JSON protocol over stdin/stdout.

Input (EvalInput)

{
  "protocol_version": "1.0",
  "metric_name": "response_quality",
  "threshold": 0.7,
  "config": { "min_length": 20 },
  "invocations": [
    {
      "invocation_id": "inv-001",
      "user_content": "What is 2+2?",
      "final_response": "The answer is 4.",
      "intermediate_steps": {
        "tool_calls": [
          { "name": "calculator", "args": { "expression": "2+2" } }
        ],
        "tool_responses": [
          { "name": "calculator", "output": "4" }
        ]
      }
    }
  ],
  "expected_invocations": null
}

| Field | Type | Description |
| --- | --- | --- |
| protocol_version | string | Wire-format version ("MAJOR.MINOR"). Current: "1.0" |
| metric_name | string | Name of this evaluator |
| threshold | float | Pass/fail threshold |
| config | object | User-provided config from the YAML |
| invocations | array | Agent turns to evaluate |
| expected_invocations | array or null | Golden reference turns (from eval set) |

Each invocation contains:

| Field | Type | Description |
| --- | --- | --- |
| invocation_id | string | Unique turn identifier |
| user_content | string | What the user said |
| final_response | string or null | The agent's final response |
| intermediate_steps | object | Steps between user input and final response |

The intermediate_steps object contains:

| Field | Type | Description |
| --- | --- | --- |
| tool_calls | array | Tools the agent called |
| tool_responses | array | Responses the agent received from tools |
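
To illustrate how the invocations and expected_invocations fields might be used together, here is a minimal sketch that compares each final response with its golden reference. It assumes that expected invocations share the invocation shape and line up with invocations by position; neither is stated above, and exact_match_scores is a hypothetical helper name.

```python
def exact_match_scores(eval_input: dict) -> list[float]:
    """Score each turn 1.0 when its final response equals the golden one."""
    expected = eval_input.get("expected_invocations")
    if not expected:
        # expected_invocations may be null when no eval set was supplied.
        return []

    scores = []
    # Assumption: actual and golden turns are aligned by index.
    for actual, golden in zip(eval_input["invocations"], expected):
        actual_text = (actual.get("final_response") or "").strip()
        golden_text = (golden.get("final_response") or "").strip()
        matched = bool(actual_text) and actual_text == golden_text
        scores.append(1.0 if matched else 0.0)
    return scores


sample = {
    "invocations": [
        {"invocation_id": "inv-001", "final_response": "The answer is 4."},
        {"invocation_id": "inv-002", "final_response": "5"},
    ],
    "expected_invocations": [
        {"invocation_id": "inv-001", "final_response": "The answer is 4."},
        {"invocation_id": "inv-002", "final_response": "6"},
    ],
}
print(exact_match_scores(sample))  # [1.0, 0.0]
```

Returning an empty list when there is no golden reference leaves it to the caller to decide whether that means a score of 0.0 or a NOT_EVALUATED status.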

Output (EvalResult)

{
  "score": 0.85,
  "status": null,
  "per_invocation_scores": [1.0, 0.7],
  "details": { "issues": ["inv-002: response too short"] }
}

| Field | Required | Description |
| --- | --- | --- |
| score | yes | Overall score between 0.0 and 1.0 |
| status | no | "PASSED", "FAILED", or "NOT_EVALUATED". If omitted, derived from score vs threshold. |
| per_invocation_scores | no | Per-turn scores (same order as input invocations) |
| details | no | Arbitrary metadata for debugging |
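
The status derivation can be sketched as follows. This is an illustration, not the CLI's actual code: resolve_status is a hypothetical name, and treating a null status the same as an omitted one is an assumption based on the example above.

```python
from typing import Optional

VALID_STATUSES = {"PASSED", "FAILED", "NOT_EVALUATED"}


def resolve_status(score: float, threshold: float, status: Optional[str] = None) -> str:
    """Return the explicit status if present, else derive it from the score."""
    if status is not None:
        if status not in VALID_STATUSES:
            raise ValueError(f"unknown status: {status!r}")
        return status
    # A score at or above the threshold counts as a pass.
    return "PASSED" if score >= threshold else "FAILED"


print(resolve_status(0.85, 0.7))                   # PASSED
print(resolve_status(0.7, 0.7))                    # PASSED
print(resolve_status(0.2, 0.7))                    # FAILED
print(resolve_status(0.2, 0.7, "NOT_EVALUATED"))   # NOT_EVALUATED
```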

Protocol Versioning

The protocol_version field uses "MAJOR.MINOR" format (currently "1.0"). This allows the CLI and SDK to evolve independently while maintaining compatibility:

  • Additive only -- new fields may be added to EvalInput or EvalResult; existing fields are never removed or renamed within the same major version.
  • Defaults required -- every new field must have a default value. Older deserializers silently ignore unknown fields (Pydantic's default behavior), so an evaluator built against an older SDK will still work with a newer CLI.
  • MINOR bumps -- additive changes (new optional fields). No action required by evaluator authors.
  • MAJOR bumps -- breaking changes (removed fields, type changes). The SDK's @evaluator decorator will log a warning if it sees a major version it does not recognize.

The CLI and SDK are independent packages. Install them at whatever versions you need:

pip install agentevals            # CLI -- may speak protocol 1.1
pip install agentevals-evaluator-sdk   # SDK -- may speak protocol 1.0

As long as the major version matches, they are compatible.
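
The compatibility rule amounts to comparing the MAJOR component of two version strings. A minimal sketch, with hypothetical helper names parse_version and is_compatible:

```python
def parse_version(version: str) -> tuple[int, int]:
    """Split a "MAJOR.MINOR" string into integers; raises ValueError if malformed."""
    major, minor = version.split(".")
    return int(major), int(minor)


def is_compatible(cli_version: str, sdk_version: str) -> bool:
    """Two protocol versions are compatible when their major versions match."""
    return parse_version(cli_version)[0] == parse_version(sdk_version)[0]


print(is_compatible("1.1", "1.0"))  # True
print(is_compatible("2.0", "1.3"))  # False
```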

Writing Evaluators in Other Languages

You don't need the Python SDK. Any program that reads JSON from stdin and writes JSON to stdout works.

JavaScript / TypeScript

// evaluators/tool_check.js
// Reading file descriptor 0 (stdin) also works where /dev/stdin does not exist, e.g. on Windows.
const input = JSON.parse(require("fs").readFileSync(0, "utf8"));

let score = 1.0;
for (const inv of input.invocations) {
  if (inv.intermediate_steps.tool_calls.length === 0) {
    score -= 0.5;
  }
}

console.log(JSON.stringify({
  score: Math.max(0, score),
  per_invocation_scores: [],
}));

Reference it in your eval config:

evaluators:
  - name: tool_check
    type: code
    path: ./evaluators/tool_check.js

Any language

Write a program that:

  1. Reads all of stdin as a UTF-8 string
  2. Parses it as JSON (matching the EvalInput schema)
  3. Writes a JSON object to stdout (matching the EvalResult schema)
  4. Exits with code 0 on success, non-zero on failure
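
As an illustration, the four steps above can be followed in Python without the SDK. The scoring rule (1.0 for a non-empty final response) and the function names are made up for this sketch, and the demo uses in-memory streams so it runs without piped input.

```python
import io
import json
from typing import TextIO


def evaluate(eval_input: dict) -> dict:
    """Score each invocation 1.0 if it has a non-empty final response, else 0.0."""
    scores = []
    for inv in eval_input.get("invocations", []):
        response = inv.get("final_response") or ""
        scores.append(1.0 if response.strip() else 0.0)
    overall = sum(scores) / len(scores) if scores else 0.0
    return {"score": overall, "per_invocation_scores": scores}


def main(stdin: TextIO, stdout: TextIO) -> int:
    """Run the protocol steps and return the process exit code."""
    try:
        eval_input = json.loads(stdin.read())  # steps 1 and 2: read and parse
        result = evaluate(eval_input)
    except (ValueError, AttributeError, TypeError):
        return 1                               # step 4: non-zero on failure
    json.dump(result, stdout)                  # step 3: write EvalResult JSON
    return 0                                   # step 4: zero on success


# Demo with in-memory streams standing in for stdin and stdout.
demo_in = io.StringIO(json.dumps({
    "invocations": [
        {"invocation_id": "inv-001", "final_response": "The answer is 4."},
        {"invocation_id": "inv-002", "final_response": None},
    ],
}))
demo_out = io.StringIO()
exit_code = main(demo_in, demo_out)
print(exit_code, demo_out.getvalue())
```

In a real evaluator script, the demo would be replaced by `sys.exit(main(sys.stdin, sys.stdout))`. Where the platform's default encoding is not UTF-8, calling `sys.stdin.reconfigure(encoding="utf-8")` before reading is one way to satisfy step 1.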

The file extension determines which interpreter is used:

| Extension | Command |
| --- | --- |
| .py | python <file> |
| .js, .ts | node <file> |
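
The mapping can be pictured as a simple lookup. The sketch below is only an illustration of the table: INTERPRETERS and build_command are hypothetical names, and the real runtimes may resolve the interpreter path differently.

```python
from pathlib import Path

# Mirrors the extension-to-command table.
INTERPRETERS = {".py": "python", ".js": "node", ".ts": "node"}


def build_command(path: str) -> list[str]:
    """Return the argv list used to launch an evaluator file."""
    suffix = Path(path).suffix
    if suffix not in INTERPRETERS:
        raise ValueError(f"unsupported evaluator extension: {suffix!r}")
    return [INTERPRETERS[suffix], path]


print(build_command("evaluators/tool_check.js"))
# ['node', 'evaluators/tool_check.js']
```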

Discovering Evaluators

List available evaluators

agentevals evaluator list                    # all sources
agentevals evaluator list --source builtin   # only ADK built-in metrics
agentevals evaluator list --source github    # only community evaluators

This shows evaluators from all registered sources: ADK built-in metrics and the community GitHub repository.

Remote Evaluators

You can reference evaluators from the community repository directly in your eval config. They are downloaded and cached automatically on first use.

evaluators:
  - name: tool_trajectory_avg_score
    type: builtin

  - name: response_quality
    type: remote
    source: github
    ref: evaluators/response_quality/response_quality.py
    threshold: 0.7

| Field | Required | Default | Description |
| --- | --- | --- | --- |
| name | yes | | Unique name for the evaluator (used in output) |
| type | yes | | remote for evaluators fetched from a registry |
| source | no | github | Evaluator source (github, or custom) |
| ref | yes | | Path within the source (e.g. path in the GitHub repo) |
| threshold | no | 0.5 | Score at or above this value means PASSED |
| timeout | no | 30 | Subprocess timeout in seconds |
| config | no | {} | Arbitrary key-value pairs passed to the evaluator |
| executor | no | local | Execution environment (local or docker in the future) |

Remote evaluators are cached in ~/.cache/agentevals/evaluators/. To force a re-download, delete the cached file.

OpenAI Evals API Graders

You can delegate grading to the OpenAI Evals API instead of running scoring logic locally. This requires pip install "agentevals-cli[openai]" and OPENAI_API_KEY to be set.

Text Similarity Grader

Compares the agent's response against a golden reference using text similarity metrics. Requires an eval set.

evaluators:
  - name: response_similarity
    type: openai_eval
    threshold: 0.8
    grader:
      type: text_similarity
      evaluation_metric: fuzzy_match

The grader.evaluation_metric field selects the similarity algorithm:

| Metric | Description |
| --- | --- |
| fuzzy_match | Approximate string matching using edit distance |
| bleu | N-gram overlap score, commonly used for translation quality |
| gleu | Google's variant of BLEU with sentence-level scoring |
| meteor | Alignment-based metric considering synonyms and paraphrases |
| cosine | Cosine similarity on vectorized text |
| rouge_1 through rouge_5 | Unigram through 5-gram overlap (F-measure) |
| rouge_l | Longest common subsequence overlap (F-measure) |

Label Model Grader

Scores responses without a golden set. The model reads each response and assigns a label from a fixed list. Passing labels are defined in the config.

evaluators:
  - name: quality_check
    type: openai_eval
    grader:
      type: label_model
      model: gpt-4o-mini
      input:
        - role: user
          content: "Rate this response: {{ item.actual_response }}"
      labels: [good, bad]
      passing_labels: [good]

The threshold field is not used for label_model. A response passes if its assigned label is in passing_labels.

How it works

Under the hood, agentevals creates an ephemeral eval on OpenAI, submits the actual and expected responses as JSONL items, polls for results, and cleans up. The agent's response and the golden reference are both placed in the item namespace (with include_sample_schema: false), so OpenAI only grades the provided text without generating any model outputs.
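
A rough sketch of what such JSONL items could look like, assuming each line wraps its fields in an item object. actual_response matches the template variable used in the label_model example; expected_response and build_jsonl are hypothetical names chosen for illustration, and the payload agentevals really sends may differ.

```python
import json


def build_jsonl(actual: list[str], expected: list[str]) -> str:
    """Pair actual and golden responses into one JSON object per line."""
    lines = []
    for actual_text, expected_text in zip(actual, expected):
        row = {
            "item": {
                "actual_response": actual_text,
                # Hypothetical field name for the golden reference.
                "expected_response": expected_text,
            }
        }
        lines.append(json.dumps(row))
    return "\n".join(lines)


print(build_jsonl(["The answer is 4."], ["4"]))
```

Because both texts travel inside the item, the grader only compares the provided strings and no model output has to be sampled.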

Configuring the GitHub source

By default, evaluators are fetched from the official community repository. Override with environment variables:

export AGENTEVALS_EVALUATOR_REPO="your-org/your-evaluators-repo"
export AGENTEVALS_EVALUATOR_BRANCH="main"

Contributing Evaluators to the Community

  1. Scaffold a new evaluator:

agentevals evaluator init my_evaluator

  2. Implement your scoring logic and update the evaluator.yaml manifest with a description, tags, and your name.

  3. Copy the my_evaluator/ directory into the evaluators/ folder of the community repository and open a PR.

The community repo uses per-evaluator manifests. A CI workflow compiles all evaluators/*/evaluator.yaml files into a single index.yaml on merge, which is what agentevals evaluator list fetches.

Architecture

Custom evaluators use a layered architecture designed for extensibility.

┌─────────────────────────────────────────────┐
│  Eval Config (YAML)                         │
│  type: code | remote | openai_eval          │
└──────────────┬─────────────┬────────────────┘
               │             │
     code/remote         openai_eval
               │             │
               ▼             ▼
┌──────────────────────┐  ┌──────────────────────┐
│  EvaluatorResolver   │  │  OpenAI Evals API    │
│  remote → local      │  │  create eval + run   │
│  (passthrough: code) │  │  poll → get results  │
└──────────┬───────────┘  └──────────────────────┘
           │
           ▼
┌──────────────────────────┐
│  CustomEvaluatorRunner   │
│  ADK Evaluator adapter   │
│  Invocation ↔ EvalInput  │
└──────────┬───────────────┘
           │
           ▼
┌──────────────────────────┐
│  EvaluatorBackend (ABC)  │
│  "local"  → Subprocess   │
│  "docker" → (future)     │
└──────────┬───────────────┘
           │
           ▼
┌──────────────────────────┐
│  Runtime registry        │
│  PythonRuntime (.py)     │
│  NodeRuntime (.js, .ts)  │
└──────────────────────────┘

  • type: openai_eval takes a separate path: it calls the OpenAI Evals API directly (create eval, create run, poll, collect results) and returns a MetricResult. It does not go through the subprocess/backend stack.
  • EvaluatorSource is the registry abstraction. Implementations (BuiltinEvaluatorSource, GitHubEvaluatorSource) list and fetch evaluators from different registries.
  • EvaluatorResolver downloads remote evaluators and converts RemoteEvaluatorDef to CodeEvaluatorDef with a local cached path.
  • EvaluatorBackend is the execution abstraction. The executor field in config selects which factory to use ("local" → SubprocessBackend). New executors (e.g. DockerBackend) register via register_executor().
  • SubprocessBackend runs a local file as a child process, piping JSON over stdin/stdout.
  • Runtime is an internal detail of SubprocessBackend that maps file extensions to interpreter commands.
  • CustomEvaluatorRunner adapts any EvaluatorBackend into ADK's Evaluator interface, handling the conversion between ADK's Invocation objects and the simpler EvalInput/EvalResult protocol.
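
To make the subprocess step concrete, here is a minimal sketch of piping JSON to a child process and reading the result back. It is not the actual SubprocessBackend; run_evaluator is a hypothetical helper, and the inline script stands in for an evaluator file.

```python
import json
import subprocess
import sys


def run_evaluator(command: list[str], eval_input: dict, timeout: float = 30) -> dict:
    """Send eval_input as JSON on stdin and parse the child's stdout as JSON."""
    completed = subprocess.run(
        command,
        input=json.dumps(eval_input),
        capture_output=True,
        text=True,
        timeout=timeout,  # raises subprocess.TimeoutExpired when exceeded
    )
    if completed.returncode != 0:
        raise RuntimeError(
            f"evaluator exited with {completed.returncode}: {completed.stderr.strip()}"
        )
    return json.loads(completed.stdout)


# Inline stand-in for an evaluator: scores 1.0 when any invocation is present.
script = (
    "import json, sys; data = json.load(sys.stdin); "
    'print(json.dumps({"score": float(len(data["invocations"]) > 0)}))'
)
result = run_evaluator(
    [sys.executable, "-c", script],
    {"invocations": [{"invocation_id": "inv-001"}]},
)
print(result)  # {'score': 1.0}
```

Treating a non-zero exit code as a failure mirrors the protocol rule that evaluators exit with 0 on success.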

Adding a new language runtime

To support a new language (e.g., Go), add a Runtime subclass in custom_evaluators.py:

import shutil
from pathlib import Path


class GoRuntime(Runtime):
    @property
    def extensions(self) -> tuple[str, ...]:
        return (".go",)

    def build_command(self, path: Path) -> list[str]:
        go = shutil.which("go")
        if not go:
            raise RuntimeError("Go not found on PATH")
        return [go, "run", str(path)]

Then register it:

_RUNTIMES: list[Runtime] = [
    PythonRuntime(),
    NodeRuntime(),
    GoRuntime(),       # new
]

No other files need to change — the extension validator and evaluator pick it up automatically.

Adding a new executor

To support a different execution environment (e.g., Docker), you need two things:

  1. Implement the backend in custom_evaluators.py:

class DockerBackend(EvaluatorBackend):
    def __init__(self, path: Path, timeout: int = 30):
        self._path = path
        self._timeout = timeout

    async def run(self, eval_input: EvalInput, metric_name: str) -> EvalResult:
        # Build/run container, pipe JSON, return result
        ...

  2. Register it:

from agentevals.custom_evaluators import register_executor

register_executor("docker", lambda path, timeout: DockerBackend(path, timeout))

Users then set executor: docker in their config:

evaluators:
  - name: untrusted_evaluator
    type: code
    path: ./evaluators/untrusted.py
    executor: docker
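
One way to picture how the executor name could be resolved is a name-to-factory registry. The sketch below is a self-contained stand-in, not the library's code: only the register_executor(name, factory) call shape comes from the text above, while _EXECUTORS, create_backend, and EchoBackend are hypothetical.

```python
from pathlib import Path
from typing import Callable

# Hypothetical internals: maps an executor name to a backend factory.
ExecutorFactory = Callable[[Path, int], object]
_EXECUTORS: dict[str, ExecutorFactory] = {}


def register_executor(name: str, factory: ExecutorFactory) -> None:
    """Make a backend factory available under the given executor name."""
    _EXECUTORS[name] = factory


def create_backend(executor: str, path: Path, timeout: int = 30) -> object:
    """Look up the factory named in the config and build a backend with it."""
    try:
        factory = _EXECUTORS[executor]
    except KeyError:
        known = sorted(_EXECUTORS)
        raise ValueError(f"unknown executor {executor!r}; registered: {known}") from None
    return factory(path, timeout)


class EchoBackend:
    """Toy backend that only records its constructor arguments."""

    def __init__(self, path: Path, timeout: int) -> None:
        self.path = path
        self.timeout = timeout


register_executor("echo", lambda path, timeout: EchoBackend(path, timeout))
backend = create_backend("echo", Path("evaluators/untrusted.py"), timeout=10)
print(type(backend).__name__, backend.timeout)  # EchoBackend 10
```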

Adding a new evaluator source

To support a different evaluator registry (e.g., a custom API), implement EvaluatorSource:

from pathlib import Path

from agentevals.evaluator.sources import EvaluatorSource, EvaluatorInfo, register_source

class MyRegistrySource(EvaluatorSource):
    @property
    def source_name(self) -> str:
        return "my-registry"

    async def list_evaluators(self) -> list[EvaluatorInfo]: ...
    async def fetch_evaluator(self, ref: str, dest: Path) -> Path: ...

register_source(MyRegistrySource())

Users can then reference evaluators from the new source:

evaluators:
  - name: my_evaluator
    type: remote
    source: my-registry
    ref: some/ref/path.py