README.md: 1 addition & 1 deletion
@@ -240,7 +240,7 @@ evaluators:
     threshold: 0.7
 ```

-Evaluators with a `requirements.txt` get automatic virtual environment management. You can also use `type: remote` for community evaluators from GitHub, or `type: openai_eval` to delegate grading to the [OpenAI Evals API](https://developers.openai.com/api/reference/resources/evals/methods/create) (requires `pip install "agentevals-cli[openai]"`).
+Evaluators with a `requirements.txt` get automatic virtual environment management. You can also use `type: remote` for community evaluators from GitHub, or `type: openai_eval` to delegate grading to the [OpenAI Evals API](https://developers.openai.com/api/reference/resources/evals/methods/create) (requires `pip install "agentevals-cli[openai]"`). Supported grader types: `text_similarity` and `string_check`.

 See the [Custom Evaluators guide](docs/custom-evaluators.md) for the full protocol reference, SDK helpers, and how to contribute evaluators.
docs/custom-evaluators.md: 26 additions & 0 deletions
@@ -317,6 +317,32 @@ The `grader.evaluation_metric` field selects the similarity algorithm:
 | `rouge_1` through `rouge_5` | Unigram through 5-gram overlap (F-measure) |
 | `rouge_l` | Longest common subsequence overlap (F-measure) |

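For readers unfamiliar with the ROUGE rows above, the following is a minimal Python sketch of an n-gram overlap F-measure. It assumes lowercase whitespace tokenization and clipped n-gram counts; the function names `ngrams` and `rouge_n_f1` are hypothetical, and the scores OpenAI computes may differ in tokenization and other details.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(candidate, reference, n=1):
    """ROUGE-N F-measure over lowercase whitespace tokens (an assumption)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count per n-gram (clipped matches).
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

With `n=1` this corresponds to the unigram case and with `n=5` to the 5-gram case; a score like this is what a `threshold` such as 0.7 would be compared against.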
+### String Check Grader
+
+Checks whether the agent response contains, equals, or matches a fixed reference string. No eval set is needed.
+
+```yaml
+evaluators:
+  - name: response_contains_hello
+    type: openai_eval
+    threshold: 0.8
+    grader:
+      type: string_check
+      reference: "hello"
+      operation: ilike
+```
+
+The `operation` field controls how the check is applied:
+
+| Operation | Description |
+|---|---|
+| `eq` | Exact match (case-sensitive) |
+| `ne` | Does not equal (case-sensitive) |
+| `like` | Contains the reference (case-sensitive) |
+| `ilike` | Contains the reference (case-insensitive) |
+
+Each invocation either passes or fails. The `threshold` field is not used by `string_check`.
+
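The four operations in the table can be read as the following local Python sketch. It only illustrates the semantics as described in the table; the grading itself is delegated to OpenAI, and the function name `string_check` here is a hypothetical helper, not part of agentevals.

```python
def string_check(response: str, reference: str, operation: str) -> bool:
    """Local illustration of the four string_check operations."""
    if operation == "eq":
        return response == reference  # exact match, case-sensitive
    if operation == "ne":
        return response != reference  # not equal, case-sensitive
    if operation == "like":
        return reference in response  # substring, case-sensitive
    if operation == "ilike":
        return reference.lower() in response.lower()  # substring, case-insensitive
    raise ValueError(f"unsupported operation: {operation!r}")
```

For the YAML example above, a response such as `"Hello, world"` would pass under `ilike` but fail under `like`, because only the case differs.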
 ### How it works

 Under the hood, agentevals creates an ephemeral eval on OpenAI, submits the actual and expected responses as JSONL items, polls for results, and cleans up. The agent's response and the golden reference are both placed in the `item` namespace (with `include_sample_schema: false`), so OpenAI only grades the provided text without generating any model outputs.
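As a rough illustration of the payloads this paragraph describes, the Python sketch below builds the JSONL items, a data source config, and a `string_check` grader config without calling the API. The field names `actual_response` and `expected_response`, the helper names, and the exact schema are assumptions for illustration; the actual agentevals implementation may differ.

```python
import json


def build_jsonl_items(pairs):
    """Serialize (actual, expected) pairs as JSONL, one item per line.

    Both texts live under "item", so nothing needs to be sampled from a model.
    The field names are hypothetical.
    """
    lines = [
        json.dumps({"item": {"actual_response": actual, "expected_response": expected}})
        for actual, expected in pairs
    ]
    return "\n".join(lines)


def build_data_source_config():
    """Custom data source with the sample schema disabled, as the text describes."""
    return {
        "type": "custom",
        "item_schema": {
            "type": "object",
            "properties": {
                "actual_response": {"type": "string"},
                "expected_response": {"type": "string"},
            },
            "required": ["actual_response"],
        },
        "include_sample_schema": False,
    }


def build_string_check_grader(reference, operation):
    """Grader config that reads the agent response from the item namespace."""
    return {
        "type": "string_check",
        "name": "string_check",
        "input": "{{item.actual_response}}",  # template reference into the item
        "reference": reference,
        "operation": operation,
    }
```

The remaining steps (creating the eval, starting a run, polling, and deleting the eval) would wrap these payloads in calls to the OpenAI Evals API and are omitted here.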