Commit bbf0ae3

rename to evaluators
Signed-off-by: Peter Jausovec <peter.jausovec@solo.io>
1 parent 4d08807 commit bbf0ae3

14 files changed

Lines changed: 157 additions & 136 deletions

.github/workflows/build-index.yaml

Lines changed: 5 additions & 5 deletions
@@ -1,10 +1,10 @@
-name: Build grader index
+name: Build evaluator index

 on:
   push:
     branches: [main]
     paths:
-      - "graders/**/grader.yaml"
+      - "evaluators/**/evaluator.yaml"

 permissions:
   contents: write
@@ -20,7 +20,7 @@ jobs:
           sudo wget -qO /usr/local/bin/yq https://github.com/mikefarah/yq/releases/latest/download/yq_linux_amd64
           sudo chmod +x /usr/local/bin/yq

-      - name: Build index.yaml from grader manifests
+      - name: Build index.yaml from evaluator manifests
         run: |
           set -euo pipefail

@@ -31,10 +31,10 @@ jobs:
           # Source: .github/workflows/build-index.yaml
           # Generated: ${TIMESTAMP}

-          graders:
+          evaluators:
           EOF

-          for manifest in graders/*/grader.yaml; do
+          for manifest in evaluators/*/evaluator.yaml; do
             dir=$(dirname "$manifest")
             name=$(yq '.name' "$manifest")
             description=$(yq '.description' "$manifest")
Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
+name: Validate evaluators
+
+on:
+  pull_request:
+    paths:
+      - "evaluators/**"
+      - "scripts/validate_evaluator.py"
+      - "scripts/test_input.json"
+
+jobs:
+  validate:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+
+      - name: Install dependencies
+        run: |
+          pip install pyyaml
+          # TODO: switch to `pip install agentevals-grader-sdk` once published to PyPI
+          pip install "agentevals-grader-sdk @ git+https://github.com/agentevals-dev/agentevals.git#subdirectory=packages/grader-sdk-py"
+
+      - name: Discover and validate all evaluators
+        run: |
+          evaluator_dirs=$(find evaluators -mindepth 1 -maxdepth 1 -type d | sort)
+          if [ -z "$evaluator_dirs" ]; then
+            echo "No evaluator directories found."
+            exit 0
+          fi
+          python scripts/validate_evaluator.py $evaluator_dirs
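
The discovery step in this workflow lists the immediate subdirectories of `evaluators/` and hands them to the validation script. A rough Python equivalent of the `find ... | sort` pipeline might look like the sketch below. The helper name `discover_evaluator_dirs` is hypothetical and does not exist in the repository, and unlike `find -type d` it follows symlinks.

```python
from pathlib import Path


def discover_evaluator_dirs(root: str = "evaluators") -> list[str]:
    """Return the immediate subdirectories of `root`, sorted by path.

    Roughly mirrors `find evaluators -mindepth 1 -maxdepth 1 -type d | sort`.
    Hypothetical helper for illustration; not part of the repository.
    """
    root_path = Path(root)
    if not root_path.is_dir():
        return []
    # Keep only direct children that are directories; plain files are skipped.
    return sorted(str(p) for p in root_path.iterdir() if p.is_dir())
```

An empty result corresponds to the workflow's early `exit 0` branch, where there is nothing to validate.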

.github/workflows/validate-graders.yaml

Lines changed: 0 additions & 31 deletions
This file was deleted.

README.md

Lines changed: 37 additions & 34 deletions
@@ -1,18 +1,18 @@
-# agentevals Community Graders
+# agentevals Community Evaluators

-Community-maintained graders for [agentevals](https://github.com/agentevals-dev/agentevals) -- the agent evaluation framework built on Google ADK.
+Community-maintained evaluators for [agentevals](https://github.com/agentevals-dev/agentevals) -- the agent evaluation framework built on Google ADK.

-Graders are standalone scoring programs that evaluate agent traces. They read `EvalInput` JSON from stdin and write `EvalResult` JSON to stdout. This repository is the official index of community-contributed graders.
+Evaluators are standalone scoring programs that evaluate agent traces. They read `EvalInput` JSON from stdin and write `EvalResult` JSON to stdout. This repository is the official index of community-contributed evaluators.

-## Using community graders
+## Using community evaluators

-### Browse available graders
+### Browse available evaluators

 ```bash
-agentevals grader list --source github
+agentevals evaluator list --source github
 ```

-### Reference a community grader in your eval config
+### Reference a community evaluator in your eval config

 Add a `type: remote` entry to your `eval_config.yaml`:

@@ -23,15 +23,15 @@ metrics:
   - name: response_quality
     type: remote
     source: github
-    ref: graders/response_quality/response_quality.py
+    ref: evaluators/response_quality/response_quality.py
     threshold: 0.7
     config:
       min_response_length: 20

   - name: tool_coverage
     type: remote
     source: github
-    ref: graders/tool_coverage/tool_coverage.py
+    ref: evaluators/tool_coverage/tool_coverage.py
     threshold: 1.0
     config:
       min_tool_calls: 1
@@ -45,34 +45,34 @@ agentevals run traces/my_trace.json \
   --eval-set eval_set.json
 ```

-The grader is downloaded automatically and cached in `~/.cache/agentevals/graders/`.
+The evaluator is downloaded automatically and cached in `~/.cache/agentevals/evaluators/`.

-## Contributing a grader
+## Contributing an evaluator

-### 1. Scaffold a new grader
+### 1. Scaffold a new evaluator

 ```bash
 pip install agentevals
-agentevals grader init my_grader
+agentevals evaluator init my_evaluator
 ```

 This creates a directory ready to be added to this repo:

 ```
-my_grader/
-├── my_grader.py # your scoring logic
-└── grader.yaml # metadata manifest
+my_evaluator/
+├── my_evaluator.py # your scoring logic
+└── evaluator.yaml # metadata manifest
 ```

 ### 2. Implement your scoring logic

-Edit `my_grader.py`. Your function receives an `EvalInput` with the agent's invocations and returns an `EvalResult` with a score between 0.0 and 1.0.
+Edit `my_evaluator.py`. Your function receives an `EvalInput` with the agent's invocations and returns an `EvalResult` with a score between 0.0 and 1.0.

 ```python
 from agentevals_grader_sdk import grader, EvalInput, EvalResult

 @grader
-def my_grader(input: EvalInput) -> EvalResult:
+def my_evaluator(input: EvalInput) -> EvalResult:
     scores = []
     for inv in input.invocations:
         # Your scoring logic here
@@ -82,19 +82,22 @@ def my_grader(input: EvalInput) -> EvalResult:
         score=sum(scores) / len(scores) if scores else 0.0,
         per_invocation_scores=scores,
     )
+
+if __name__ == "__main__":
+    my_evaluator.run()
 ```

 Install the SDK standalone with `pip install agentevals-grader-sdk` (no heavy dependencies).

 ### 3. Update the manifest

-Edit `grader.yaml` with a description, tags, and your name:
+Edit `evaluator.yaml` with a description, tags, and your name:

 ```yaml
-name: my_grader
-description: What this grader checks
+name: my_evaluator
+description: What this evaluator checks
 language: python
-entrypoint: my_grader.py
+entrypoint: my_evaluator.py
 tags: [quality, tools]
 author: your-github-username
 ```
@@ -105,21 +108,21 @@ Run the validation script to catch issues before submitting:

 ```bash
 pip install agentevals-grader-sdk pyyaml
-python scripts/validate_grader.py graders/my_grader
+python scripts/validate_evaluator.py evaluators/my_evaluator
 ```

 This checks:
 - **Manifest schema** -- required fields, entrypoint exists, name matches directory
 - **Syntax and imports** -- compiles cleanly, uses `@grader` decorator
-- **Smoke run** -- runs the grader with synthetic input and validates the `EvalResult` output (correct types for `score`, `details`, `status`, etc.)
+- **Smoke run** -- runs the evaluator with synthetic input and validates the `EvalResult` output (correct types for `score`, `details`, `status`, etc.)

 You can also test with a full eval run:

 ```yaml
 metrics:
-  - name: my_grader
+  - name: my_evaluator
     type: code
-    path: ./graders/my_grader/my_grader.py
+    path: ./evaluators/my_evaluator/my_evaluator.py
     threshold: 0.5
 ```

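
The "Manifest schema" check described in the README hunk above (required fields, entrypoint exists, name matches directory) could be sketched roughly as follows. This is not the contents of `scripts/validate_evaluator.py`, which this commit does not show. The function name `check_manifest` and the `REQUIRED_FIELDS` tuple are assumptions for illustration, and the manifest is assumed to be already parsed into a dict (for example with PyYAML's `yaml.safe_load`).

```python
from pathlib import Path

# Assumed set of required keys, taken from the example manifest in the README;
# the real validator may require a different set.
REQUIRED_FIELDS = ("name", "description", "language", "entrypoint")


def check_manifest(manifest: dict, evaluator_dir: Path) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not manifest.get(field):
            problems.append(f"missing required field: {field}")

    # The manifest name is expected to equal the directory name.
    name = manifest.get("name")
    if name and name != evaluator_dir.name:
        problems.append(
            f"name {name!r} does not match directory {evaluator_dir.name!r}"
        )

    # The entrypoint is resolved relative to the evaluator directory.
    entrypoint = manifest.get("entrypoint")
    if entrypoint and not (evaluator_dir / entrypoint).is_file():
        problems.append(f"entrypoint not found: {entrypoint}")

    return problems
```

Returning a list of problems instead of raising on the first one lets a contributor fix every manifest issue in a single pass.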
@@ -130,13 +133,13 @@ agentevals run traces/sample.json --config eval_config.yaml --eval-set eval_set.
 ### 5. Submit a pull request

 1. Fork this repository
-2. Copy your grader directory into `graders/`:
+2. Copy your evaluator directory into `evaluators/`:

    ```
-   graders/
-   ├── my_grader/
-   │   ├── grader.yaml
-   │   └── my_grader.py
+   evaluators/
+   ├── my_evaluator/
+   │   ├── evaluator.yaml
+   │   └── my_evaluator.py
    ├── response_quality/
    │   └── ...
    └── tool_coverage/
@@ -145,16 +148,16 @@ graders/

 3. Open a PR against `main`

-CI will automatically validate your grader (manifest, syntax, and smoke run). Once merged, a separate workflow regenerates `index.yaml`, and your grader becomes available to everyone via `agentevals grader list`.
+CI will automatically validate your evaluator (manifest, syntax, and smoke run). Once merged, a separate workflow regenerates `index.yaml`, and your evaluator becomes available to everyone via `agentevals evaluator list`.

 ## Supported languages

-Graders can be written in any language that reads JSON from stdin and writes JSON to stdout.
+Evaluators can be written in any language that reads JSON from stdin and writes JSON to stdout.

 | Language | Extension | SDK available |
 |---|---|---|
 | Python | `.py` | `pip install agentevals-grader-sdk` |
 | JavaScript | `.js` | No SDK yet -- just read stdin, write stdout |
 | TypeScript | `.ts` | No SDK yet -- just read stdin, write stdout |

-See the [custom graders documentation](https://github.com/agentevals-dev/agentevals/blob/main/docs/custom-graders.md) for the full protocol reference.
+See the [custom evaluators documentation](https://github.com/agentevals-dev/agentevals/blob/main/docs/custom-evaluators.md) for the full protocol reference.
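
The README's statement that an evaluator only needs to read JSON from stdin and write JSON to stdout can be illustrated without the SDK. The sketch below is an assumption-laden example, not the documented protocol. The keys `invocations`, `score`, and `per_invocation_scores` mirror the attribute names in the README's SDK example, but the exact wire format is defined by the linked protocol reference and may differ. The names `evaluate` and `main` are hypothetical.

```python
import json
import sys


def evaluate(payload: dict) -> dict:
    """Score an EvalInput-like dict and return an EvalResult-like dict.

    Placeholder logic: an invocation scores 1.0 if it is a non-empty
    object and 0.0 otherwise. Replace this with real scoring.
    """
    invocations = payload.get("invocations") or []
    scores = [1.0 if inv else 0.0 for inv in invocations]
    return {
        # Mean of the per-invocation scores, or 0.0 when there are none.
        "score": sum(scores) / len(scores) if scores else 0.0,
        "per_invocation_scores": scores,
    }


def main(stdin=sys.stdin, stdout=sys.stdout) -> None:
    """Read one JSON document from stdin and write one JSON result to stdout."""
    result = evaluate(json.load(stdin))
    json.dump(result, stdout)
    stdout.write("\n")
```

To use this as a standalone script, call `main()` under an `if __name__ == "__main__":` guard, in the same spirit as the `.run()` calls this commit adds to the SDK-based evaluators.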
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+name: peters_evaluator
+description: 'sample evaluator that returns a 0.123 score'
+language: python
+entrypoint: peters_evaluator.py
+tags: ["test", "example"]
+author: 'peterj'
Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+"""Custom evaluator: peters_evaluator
+
+Usage in eval_config.yaml:
+
+    metrics:
+      - name: peters_evaluator
+        type: code
+        path: ./peters_evaluator/peters_evaluator.py
+        threshold: 0.5
+"""
+
+from agentevals_grader_sdk import grader, EvalInput, EvalResult
+
+
+@grader
+def peters_evaluator(input: EvalInput) -> EvalResult:
+    return EvalResult(score=0.123, details={"message": "All good"})
+
+
+if __name__ == "__main__":
+    peters_evaluator.run()
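
The sample evaluator above returns a fixed score of 0.123 with a `details` dict, which makes it a convenient target for the kind of output check the README's "Smoke run" step describes. A minimal sketch of such a check follows, assuming the result has already been parsed from JSON into a dict. The function name `check_result` is hypothetical, the rules are inferred from the README (score between 0.0 and 1.0, `details` either a dict or absent), and `status` is left out because its expected type is not stated in this commit.

```python
from numbers import Real


def check_result(result: dict) -> list[str]:
    """Return a list of problems with an EvalResult-like dict; empty means OK."""
    problems = []

    score = result.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, Real):
        problems.append("score must be a number")
    elif not 0.0 <= score <= 1.0:
        problems.append("score must be between 0.0 and 1.0")

    # details is optional, but when present it is assumed to be an object.
    details = result.get("details")
    if details is not None and not isinstance(details, dict):
        problems.append("details must be an object or null")

    return problems
```

For the sample evaluator's output, `check_result({"score": 0.123, "details": {"message": "All good"}})` returns an empty list.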
File renamed without changes.

graders/response_quality/response_quality.py renamed to evaluators/response_quality/response_quality.py

Lines changed: 5 additions & 1 deletion
@@ -1,4 +1,4 @@
-"""Community grader: response_quality
+"""Community evaluator: response_quality

 Checks that every invocation has a non-empty response, meets a configurable
 minimum length, and doesn't just parrot back the user input.
@@ -48,3 +48,7 @@ def response_quality(input: EvalInput) -> EvalResult:
         per_invocation_scores=scores,
         details={"issues": issues} if issues else None,
     )
+
+
+if __name__ == "__main__":
+    response_quality.run()
Lines changed: 5 additions & 1 deletion
@@ -1,4 +1,4 @@
-"""Community grader: tool_coverage
+"""Community evaluator: tool_coverage

 Verifies that each invocation made at least a minimum number of tool calls.
 Useful for ensuring agents actually use their tools rather than hallucinating
@@ -34,3 +34,7 @@ def tool_coverage(input: EvalInput) -> EvalResult:
         per_invocation_scores=scores,
         details={"missing_tools": details} if details else None,
     )
+
+
+if __name__ == "__main__":
+    tool_coverage.run()
