Custom evaluators let you score agent traces with your own logic. An evaluator is any program that reads EvalInput JSON from stdin and writes EvalResult JSON to stdout. This simple protocol means you can write evaluators in Python, JavaScript/TypeScript, or any language that can read/write JSON.

    agentevals evaluator init my_evaluator

This creates a directory with boilerplate code and an evaluator.yaml manifest:

    my_evaluator/
    ├── my_evaluator.py   # scoring logic (implement your checks here)
    └── evaluator.yaml    # metadata manifest

You can also specify a language:

    agentevals evaluator init my_evaluator --runtime js   # JavaScript
    agentevals evaluator init my_evaluator.ts             # TypeScript (inferred from extension)

Install the Python SDK:

    pip install agentevals-evaluator-sdk

Then write the evaluator:

    # evaluators/response_quality.py
    from agentevals_evaluator_sdk import evaluator, EvalInput, EvalResult

    @evaluator
    def response_quality(input: EvalInput) -> EvalResult:
        scores = []
        for inv in input.invocations:
            if not inv.final_response:
                scores.append(0.0)
            elif len(inv.final_response.strip()) < input.config.get("min_length", 10):
                scores.append(0.5)
            else:
                scores.append(1.0)
        return EvalResult(
            score=sum(scores) / len(scores) if scores else 0.0,
            per_invocation_scores=scores,
        )

    if __name__ == "__main__":
        response_quality.run()

The @evaluator decorator marks your function as an evaluator. Call .run() to execute it as a stdin/stdout script. Your function receives an EvalInput and returns an EvalResult. The decorated function can still be called directly in tests.
Reference the evaluator in your eval config:

    # eval_config.yaml
    evaluators:
      - name: tool_trajectory_avg_score   # built-in metric
        type: builtin
      - name: response_quality            # your custom evaluator
        type: code
        path: ./evaluators/response_quality.py
        threshold: 0.7
        config:
          min_length: 20

Run the evaluation:

    agentevals run traces/my_trace.json \
      --config eval_config.yaml \
      --eval-set eval_set.json

Each evaluator entry in the evaluators list uses the following fields. The type field determines which other fields are valid.

For type: code:

| Field | Required | Default | Description |
|---|---|---|---|
| name | yes | | Unique name for the evaluator (used in output) |
| type | yes | | code for local code files |
| path | yes | | Path to the evaluator file (.py, .js, or .ts) |
| threshold | no | 0.5 | Score at or above this value means PASSED |
| timeout | no | 30 | Subprocess timeout in seconds |
| config | no | {} | Arbitrary key-value pairs passed to the evaluator |


For type: openai_eval:

| Field | Required | Default | Description |
|---|---|---|---|
| name | yes | | Unique name for the evaluator (used in output) |
| type | yes | | openai_eval for OpenAI Evals API graders |
| threshold | no | 0.5 | Maps to pass_threshold in the OpenAI grader |
| timeout | no | 120 | Max seconds to wait for the OpenAI eval run |
| grader | yes | | OpenAI grader config (see OpenAI Evals Graders) |

Every evaluator — regardless of language — communicates via the same JSON protocol over stdin/stdout.

    {
      "protocol_version": "1.0",
      "metric_name": "response_quality",
      "threshold": 0.7,
      "config": { "min_length": 20 },
      "invocations": [
        {
          "invocation_id": "inv-001",
          "user_content": "What is 2+2?",
          "final_response": "The answer is 4.",
          "intermediate_steps": {
            "tool_calls": [
              { "name": "calculator", "args": { "expression": "2+2" } }
            ],
            "tool_responses": [
              { "name": "calculator", "output": "4" }
            ]
          }
        }
      ],
      "expected_invocations": null
    }

| Field | Type | Description |
|---|---|---|
| protocol_version | string | Wire-format version ("MAJOR.MINOR"). Current: "1.0" |
| metric_name | string | Name of this evaluator |
| threshold | float | Pass/fail threshold |
| config | object | User-provided config from the YAML |
| invocations | array | Agent turns to evaluate |
| expected_invocations | array or null | Golden reference turns (from eval set) |

Each invocation contains:

| Field | Type | Description |
|---|---|---|
| invocation_id | string | Unique turn identifier |
| user_content | string | What the user said |
| final_response | string or null | The agent's final response |
| intermediate_steps | object | Steps between user input and final response |

The intermediate_steps object contains:

| Field | Type | Description |
|---|---|---|
| tool_calls | array | Tools the agent called |
| tool_responses | array | Responses the agent received from tools |

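
As a rough illustration of how these nested fields fit together, the sketch below walks a payload shaped like the EvalInput example and pairs each tool call with a tool response of the same name. The helper summarize_tools is hypothetical, and pairing calls to responses by name (in order) is an assumption made for this example; the protocol as described here does not specify how the two arrays correspond.

```python
import json

# Sample payload shaped like the EvalInput example in this section.
SAMPLE = json.loads("""
{
  "protocol_version": "1.0",
  "metric_name": "response_quality",
  "threshold": 0.7,
  "config": {"min_length": 20},
  "invocations": [
    {
      "invocation_id": "inv-001",
      "user_content": "What is 2+2?",
      "final_response": "The answer is 4.",
      "intermediate_steps": {
        "tool_calls": [{"name": "calculator", "args": {"expression": "2+2"}}],
        "tool_responses": [{"name": "calculator", "output": "4"}]
      }
    }
  ],
  "expected_invocations": null
}
""")


def summarize_tools(eval_input: dict) -> dict:
    """Map each invocation_id to a list of (tool name, args, output) tuples.

    Hypothetical helper: responses are matched to calls by tool name, in order.
    """
    summary = {}
    for inv in eval_input.get("invocations", []):
        steps = inv.get("intermediate_steps") or {}
        # Queue outputs per tool name so repeated calls consume them in order.
        outputs = {}
        for resp in steps.get("tool_responses", []):
            outputs.setdefault(resp["name"], []).append(resp.get("output"))
        rows = []
        for call in steps.get("tool_calls", []):
            pending = outputs.get(call["name"], [])
            output = pending.pop(0) if pending else None
            rows.append((call["name"], call.get("args", {}), output))
        summary[inv["invocation_id"]] = rows
    return summary


if __name__ == "__main__":
    print(summarize_tools(SAMPLE))
```

A call with no matching response is reported with an output of None rather than raising, which keeps the helper usable on partial traces.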

    {
      "score": 0.85,
      "status": null,
      "per_invocation_scores": [1.0, 0.7],
      "details": { "issues": ["inv-002: response too short"] }
    }

| Field | Required | Description |
|---|---|---|
| score | yes | Overall score between 0.0 and 1.0 |
| status | no | "PASSED", "FAILED", or "NOT_EVALUATED". If omitted, derived from score vs threshold. |
| per_invocation_scores | no | Per-turn scores (same order as input invocations) |
| details | no | Arbitrary metadata for debugging |

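
The fallback for a missing status can be pictured with the short sketch below. It follows the rule stated in the config tables (a score at or above the threshold means PASSED); the function name derive_status is hypothetical, and treating a null status the same as an omitted one is an assumption based on the example result above.

```python
from typing import Optional


def derive_status(score: float, threshold: float, status: Optional[str] = None) -> str:
    """Return the evaluator's explicit status, or derive one from the score.

    Assumption: a null/None status is handled the same way as an omitted one.
    """
    if status is not None:
        return status
    # A score at or above the threshold counts as a pass.
    return "PASSED" if score >= threshold else "FAILED"


if __name__ == "__main__":
    print(derive_status(0.85, 0.7))
    print(derive_status(0.1, 0.7, "NOT_EVALUATED"))
```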
The protocol_version field uses "MAJOR.MINOR" format (currently "1.0"). This allows the CLI and SDK to evolve independently while maintaining compatibility:
- Additive only -- new fields may be added to EvalInput or EvalResult; existing fields are never removed or renamed within the same major version.
- Defaults required -- every new field must have a default value. Older deserializers silently ignore unknown fields (Pydantic's default behavior), so an evaluator built against an older SDK will still work with a newer CLI.
- MINOR bumps -- additive changes (new optional fields). No action required by evaluator authors.
- MAJOR bumps -- breaking changes (removed fields, type changes). The SDK's @evaluator decorator will log a warning if it sees a major version it does not recognize.

The CLI and SDK are independent packages. Install them at whatever versions you need:

    pip install agentevals                 # CLI -- may speak protocol 1.1
    pip install agentevals-evaluator-sdk   # SDK -- may speak protocol 1.0

As long as the major version matches, they are compatible.
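
For evaluators written without the SDK, the same compatibility rule can be checked by hand. The sketch below is only an illustration of the "MAJOR.MINOR" convention described above; the function names and the KNOWN_FIELDS set are hypothetical and are not part of agentevals.

```python
def parse_version(version: str) -> tuple[int, int]:
    """Split a "MAJOR.MINOR" string into integers."""
    major, minor = version.split(".")
    return int(major), int(minor)


def is_compatible(cli_version: str, sdk_version: str) -> bool:
    """Two sides are compatible when their major versions match."""
    return parse_version(cli_version)[0] == parse_version(sdk_version)[0]


# Fields this (hypothetical) evaluator understands; anything else is ignored,
# mirroring the "older deserializers ignore unknown fields" rule.
KNOWN_FIELDS = {"protocol_version", "metric_name", "threshold", "config",
                "invocations", "expected_invocations"}


def drop_unknown_fields(payload: dict) -> dict:
    """Keep only the fields this evaluator was written against."""
    return {key: value for key, value in payload.items() if key in KNOWN_FIELDS}


if __name__ == "__main__":
    print(is_compatible("1.1", "1.0"))
    print(drop_unknown_fields({"protocol_version": "1.1", "some_new_field": 1}))
```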
You don't need the Python SDK. Any program that reads JSON from stdin and writes JSON to stdout works.

    // evaluators/tool_check.js
    const input = JSON.parse(require("fs").readFileSync("/dev/stdin", "utf8"));

    let score = 1.0;
    for (const inv of input.invocations) {
      if (inv.intermediate_steps.tool_calls.length === 0) {
        score -= 0.5;
      }
    }

    console.log(JSON.stringify({
      score: Math.max(0, score),
      per_invocation_scores: [],
    }));

Reference it in your eval config like any other code evaluator:

    evaluators:
      - name: tool_check
        type: code
        path: ./evaluators/tool_check.js

Write a program that:
- Reads all of stdin as a UTF-8 string
- Parses it as JSON (matching the EvalInput schema)
- Writes a JSON object to stdout (matching the EvalResult schema)
- Exits with code 0 on success, non-zero on failure

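
The four steps above can be sketched in plain Python with only the standard library. This is a minimal example rather than a reference implementation: the scoring rule (1.0 for a non-empty final response) and the names evaluate and run are made up for illustration, and run takes its streams as arguments so it can be exercised in tests.

```python
import json
import sys
from typing import BinaryIO, TextIO


def evaluate(eval_input: dict) -> dict:
    """Score 1.0 for each invocation with a non-empty final response, else 0.0."""
    scores = [
        1.0 if (inv.get("final_response") or "").strip() else 0.0
        for inv in eval_input.get("invocations", [])
    ]
    return {
        "score": sum(scores) / len(scores) if scores else 0.0,
        "per_invocation_scores": scores,
    }


def run(stdin: BinaryIO, stdout: TextIO, stderr: TextIO) -> int:
    """Read EvalInput JSON, write EvalResult JSON, and return an exit code."""
    try:
        raw = stdin.read().decode("utf-8")   # read all of stdin as UTF-8
        result = evaluate(json.loads(raw))   # parse it as JSON and score it
    except Exception as exc:
        print(f"evaluator failed: {exc}", file=stderr)
        return 1                             # non-zero exit code on failure
    json.dump(result, stdout)                # write the result as JSON
    return 0


# To use this file as an evaluator script, end it with:
#     if __name__ == "__main__":
#         sys.exit(run(sys.stdin.buffer, sys.stdout, sys.stderr))
```

Errors go to stderr so that stdout carries nothing but the result JSON.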
The file extension determines which interpreter is used:

| Extension | Command |
|---|---|
| .py | python <file> |
| .js, .ts | node <file> |

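
When debugging an evaluator outside the CLI, it can help to reproduce the interpreter invocation by hand. The sketch below pipes an EvalInput to a .py evaluator with subprocess and parses the result. It is not the agentevals implementation: run_evaluator, demo, and the embedded DEMO_EVALUATOR script are hypothetical, and the 30-second default only mirrors the documented timeout default.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path

# Tiny stand-in evaluator, used only so this example can run on its own.
DEMO_EVALUATOR = '''
import json, sys
data = json.loads(sys.stdin.buffer.read().decode("utf-8"))
json.dump({"score": 1.0 if data["invocations"] else 0.0}, sys.stdout)
'''


def run_evaluator(path: Path, eval_input: dict, timeout: float = 30.0) -> dict:
    """Pipe eval_input to a Python evaluator file and parse its stdout as JSON."""
    proc = subprocess.run(
        [sys.executable, str(path)],                    # python <file>
        input=json.dumps(eval_input).encode("utf-8"),   # EvalInput on stdin
        capture_output=True,
        timeout=timeout,   # raises subprocess.TimeoutExpired if exceeded
        check=False,
    )
    if proc.returncode != 0:
        stderr_text = proc.stderr.decode("utf-8", "replace")
        raise RuntimeError(f"evaluator exited with {proc.returncode}: {stderr_text}")
    return json.loads(proc.stdout.decode("utf-8"))      # EvalResult from stdout


def demo() -> dict:
    """Write the stand-in evaluator to a temp directory and run it once."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "demo_evaluator.py"
        script.write_text(DEMO_EVALUATOR, encoding="utf-8")
        return run_evaluator(script, {"invocations": [{"invocation_id": "inv-001"}]})


if __name__ == "__main__":
    print(demo())
```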

    agentevals evaluator list                    # all sources
    agentevals evaluator list --source builtin   # only ADK built-in metrics
    agentevals evaluator list --source github    # only community evaluators

This shows evaluators from all registered sources: ADK built-in metrics and the community GitHub repository.
You can reference evaluators from the community repository directly in your eval config. They are downloaded and cached automatically on first use.

    evaluators:
      - name: tool_trajectory_avg_score
        type: builtin
      - name: response_quality
        type: remote
        source: github
        ref: evaluators/response_quality/response_quality.py
        threshold: 0.7

| Field | Required | Default | Description |
|---|---|---|---|
| name | yes | | Unique name for the evaluator (used in output) |
| type | yes | | remote for evaluators fetched from a registry |
| source | no | github | Evaluator source (github, or custom) |
| ref | yes | | Path within the source (e.g. path in the GitHub repo) |
| threshold | no | 0.5 | Score at or above this value means PASSED |
| timeout | no | 30 | Subprocess timeout in seconds |
| config | no | {} | Arbitrary key-value pairs passed to the evaluator |
| executor | no | local | Execution environment (local, or docker in the future) |

Remote evaluators are cached in ~/.cache/agentevals/evaluators/. To force a re-download, delete the cached file.
You can delegate grading to the OpenAI Evals API instead of running scoring logic locally. This requires installing the OpenAI extra (pip install "agentevals-cli[openai]") and setting the OPENAI_API_KEY environment variable.
The text_similarity grader compares the agent's response against a golden reference using text similarity metrics. It requires an eval set.

    evaluators:
      - name: response_similarity
        type: openai_eval
        threshold: 0.8
        grader:
          type: text_similarity
          evaluation_metric: fuzzy_match

The grader.evaluation_metric field selects the similarity algorithm:

| Metric | Description |
|---|---|
| fuzzy_match | Approximate string matching using edit distance |
| bleu | N-gram overlap score, commonly used for translation quality |
| gleu | Google's variant of BLEU with sentence-level scoring |
| meteor | Alignment-based metric considering synonyms and paraphrases |
| cosine | Cosine similarity on vectorized text |
| rouge_1 through rouge_5 | Unigram through 5-gram overlap (F-measure) |
| rouge_l | Longest common subsequence overlap (F-measure) |

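
To give a feel for what an edit-distance similarity looks like, here is a small self-contained sketch. It is not OpenAI's fuzzy_match implementation: the Levenshtein distance is a standard algorithm, but normalizing by the longer string's length is an assumption made here so the result falls between 0.0 and 1.0.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similarity(actual: str, expected: str) -> float:
    """Illustrative score: 1.0 for identical strings, 0.0 for fully different."""
    longest = max(len(actual), len(expected))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(actual, expected) / longest


if __name__ == "__main__":
    print(similarity("The answer is 4.", "The answer is four."))
```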
The label_model grader scores responses without a golden set. The model reads each response and assigns a label from a fixed list. Passing labels are defined in the config.

    evaluators:
      - name: quality_check
        type: openai_eval
        grader:
          type: label_model
          model: gpt-4o-mini
          input:
            - role: user
              content: "Rate this response: {{ item.actual_response }}"
          labels: [good, bad]
          passing_labels: [good]

The threshold field is not used for label_model. A response passes if its assigned label is in passing_labels.
Under the hood, agentevals creates an ephemeral eval on OpenAI, submits the actual and expected responses as JSONL items, polls for results, and cleans up. The agent's response and the golden reference are both placed in the item namespace (with include_sample_schema: false), so OpenAI only grades the provided text without generating any model outputs.
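
The JSONL payload described above might look roughly like the output of the sketch below. The actual_response key comes from the grader template shown earlier; the expected_response key and the build_jsonl helper are assumptions for illustration, since the exact item schema agentevals submits is not spelled out here.

```python
import json


def build_jsonl(actual: list[str], expected: list[str]) -> str:
    """Serialize response pairs as one JSON object per line.

    Assumption: each line wraps its fields in an "item" object, and the
    golden reference is stored under the hypothetical key "expected_response".
    """
    if len(actual) != len(expected):
        raise ValueError("actual and expected must have the same length")
    lines = [
        json.dumps({"item": {"actual_response": a, "expected_response": e}})
        for a, e in zip(actual, expected)
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_jsonl(["The answer is 4."], ["4"]))
```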
By default, evaluators are fetched from the official community repository. Override with environment variables:

    export AGENTEVALS_EVALUATOR_REPO="your-org/your-evaluators-repo"
    export AGENTEVALS_EVALUATOR_BRANCH="main"

To contribute an evaluator to the community repository:

- Scaffold a new evaluator:

      agentevals evaluator init my_evaluator

- Implement your scoring logic and update the evaluator.yaml manifest with a description, tags, and your name.
- Copy the my_evaluator/ directory into the evaluators/ folder of the community repository and open a PR.

The community repo uses per-evaluator manifests. A CI workflow compiles all evaluators/*/evaluator.yaml files into a single index.yaml on merge, which is what agentevals evaluator list fetches.
Custom evaluators use a layered architecture designed for extensibility.

    ┌─────────────────────────────────────────────┐
    │             Eval Config (YAML)              │
    │      type: code | remote | openai_eval      │
    └──────────────┬─────────────┬────────────────┘
                   │             │
              code/remote   openai_eval
                   │             │
                   ▼             ▼
    ┌──────────────────────┐  ┌──────────────────────┐
    │  EvaluatorResolver   │  │  OpenAI Evals API    │
    │  remote → local      │  │  create eval + run   │
    │  (passthrough: code) │  │  poll → get results  │
    └──────────┬───────────┘  └──────────────────────┘
               │
               ▼
    ┌──────────────────────────┐
    │  CustomEvaluatorRunner   │
    │  ADK Evaluator adapter   │
    │  Invocation ↔ EvalInput  │
    └──────────┬───────────────┘
               │
               ▼
    ┌──────────────────────────┐
    │  EvaluatorBackend (ABC)  │
    │  "local" → Subprocess    │
    │  "docker" → (future)     │
    └──────────┬───────────────┘
               │
               ▼
    ┌──────────────────────────┐
    │  Runtime registry        │
    │  PythonRuntime (.py)     │
    │  NodeRuntime (.js, .ts)  │
    └──────────────────────────┘

- type: openai_eval takes a separate path: it calls the OpenAI Evals API directly (create eval, create run, poll, collect results) and returns a MetricResult. It does not go through the subprocess/backend stack.
- EvaluatorSource is the registry abstraction. Implementations (BuiltinEvaluatorSource, GitHubEvaluatorSource) list and fetch evaluators from different registries.
- EvaluatorResolver downloads remote evaluators and converts RemoteEvaluatorDef to CodeEvaluatorDef with a local cached path.
- EvaluatorBackend is the execution abstraction. The executor field in config selects which factory to use ("local" → SubprocessBackend). New executors (e.g. DockerBackend) register via register_executor().
- SubprocessBackend runs a local file as a child process, piping JSON over stdin/stdout.
- Runtime is an internal detail of SubprocessBackend that maps file extensions to interpreter commands.
- CustomEvaluatorRunner adapts any EvaluatorBackend into ADK's Evaluator interface, handling the conversion between ADK's Invocation objects and the simpler EvalInput/EvalResult protocol.

To support a new language (e.g., Go), add a Runtime subclass in custom_evaluators.py:

    import shutil
    from pathlib import Path

    class GoRuntime(Runtime):
        @property
        def extensions(self) -> tuple[str, ...]:
            return (".go",)

        def build_command(self, path: Path) -> list[str]:
            # Resolve the Go toolchain up front so a missing install fails clearly.
            go = shutil.which("go")
            if not go:
                raise RuntimeError("Go not found on PATH")
            return [go, "run", str(path)]

Then register it:

    _RUNTIMES: list[Runtime] = [
        PythonRuntime(),
        NodeRuntime(),
        GoRuntime(),  # new
    ]

No other files need to change; the extension validator and evaluator pick it up automatically.
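
How a registry like this could resolve a file to an interpreter command is sketched below. The Runtime, PythonRuntime, and NodeRuntime classes here are simplified stand-ins written for this example, not the real classes in custom_evaluators.py, and resolve_command is a hypothetical helper name. The stand-in uses sys.executable where the table above lists python.

```python
import shutil
import sys
from abc import ABC, abstractmethod
from pathlib import Path


class Runtime(ABC):
    """Stand-in for the Runtime base class described above."""

    @property
    @abstractmethod
    def extensions(self) -> tuple[str, ...]: ...

    @abstractmethod
    def build_command(self, path: Path) -> list[str]: ...


class PythonRuntime(Runtime):
    @property
    def extensions(self) -> tuple[str, ...]:
        return (".py",)

    def build_command(self, path: Path) -> list[str]:
        return [sys.executable, str(path)]


class NodeRuntime(Runtime):
    @property
    def extensions(self) -> tuple[str, ...]:
        return (".js", ".ts")

    def build_command(self, path: Path) -> list[str]:
        node = shutil.which("node")
        if not node:
            raise RuntimeError("Node.js not found on PATH")
        return [node, str(path)]


_RUNTIMES: list[Runtime] = [PythonRuntime(), NodeRuntime()]


def resolve_command(path: Path) -> list[str]:
    """Pick the first runtime whose extensions include the file's suffix."""
    for runtime in _RUNTIMES:
        if path.suffix.lower() in runtime.extensions:
            return runtime.build_command(path)
    supported = sorted(ext for rt in _RUNTIMES for ext in rt.extensions)
    raise ValueError(f"unsupported extension {path.suffix!r}; expected one of {supported}")


if __name__ == "__main__":
    print(resolve_command(Path("evaluators/response_quality.py")))
```

Because lookup is driven entirely by the list, appending another Runtime instance is enough to make its extensions resolvable.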
To support a different execution environment (e.g., Docker), you need two things:
- Implement the backend in custom_evaluators.py:

      class DockerBackend(EvaluatorBackend):
          def __init__(self, path: Path, timeout: int = 30):
              self._path = path
              self._timeout = timeout

          async def run(self, eval_input: EvalInput, metric_name: str) -> EvalResult:
              # Build/run container, pipe JSON, return result
              ...

- Register it:

      from agentevals.custom_evaluators import register_executor

      register_executor("docker", lambda path, timeout: DockerBackend(path, timeout))

Users then set executor: docker in their config:

    evaluators:
      - name: untrusted_evaluator
        type: code
        path: ./evaluators/untrusted.py
        executor: docker

To support a different evaluator registry (e.g., a custom API), implement EvaluatorSource:

    from pathlib import Path

    from agentevals.evaluator.sources import EvaluatorSource, EvaluatorInfo, register_source

    class MyRegistrySource(EvaluatorSource):
        @property
        def source_name(self) -> str:
            return "my-registry"

        async def list_evaluators(self) -> list[EvaluatorInfo]: ...

        async def fetch_evaluator(self, ref: str, dest: Path) -> Path: ...

    register_source(MyRegistrySource())

Users can then reference evaluators from the new source:

    evaluators:
      - name: my_evaluator
        type: remote
        source: my-registry
        ref: some/ref/path.py