evals/README.md
+13 -13 (13 additions & 13 deletions)
@@ -28,17 +28,17 @@ Run these after any changes to the provider, mock, or shared utilities to catch
 # From evals/

 # Run a single suite (all test cases)
-npm run eval:aiconfig-create # ai-configs/aiconfig-create
-npm run eval:aiconfig-update # ai-configs/aiconfig-update
-npm run eval:aiconfig-tools # ai-configs/aiconfig-tools
-npm run eval:aiconfig-variations # ai-configs/aiconfig-variations
+npm run eval:configs-create # agentcontrol/configs-create
+npm run eval:configs-update # agentcontrol/configs-update
+npm run eval:agentcontrol-tools # agentcontrol/tools
+npm run eval:configs-variations # agentcontrol/configs-variations
 npm run eval:flag-create # feature-flags/launchdarkly-flag-create

 # Quick smoke check — first test case only (~15-20s, ~$0.05)
-npm run eval:aiconfig-create:single
-npm run eval:aiconfig-update:single
-npm run eval:aiconfig-tools:single
-npm run eval:aiconfig-variations:single
+npm run eval:configs-create:single
+npm run eval:configs-update:single
+npm run eval:agentcontrol-tools:single
+npm run eval:configs-variations:single
 npm run eval:flag-create:single

 # Aggregate and CI operations
@@ -147,7 +147,7 @@ This handles agents that call `get-foo` before AND after mutation; using `indexO

 ### Cross-model evaluation (`run-models.js`)

-The cross-model runner evaluates all suites against one or more model aliases without touching the canonical `eval-scores.json`. Results are written to `<suite>/results.<alias>.json` (e.g., `aiconfig-create/results.haiku.json`).
+The cross-model runner evaluates all suites against one or more model aliases without touching the canonical `eval-scores.json`. Results are written to `<suite>/results.<alias>.json` (e.g., `configs-create/results.haiku.json`).

 ```bash
 npm run eval:haiku # claude-haiku-4-5-20251001
@@ -222,7 +222,7 @@ Read the SKILL.md and note every MCP tool it references. Verify each tool exists
 mkdir <skill-name>
 ```

-Use the same name as the skill directory (e.g., `aiconfig-create`). Create `promptfooconfig.yaml`:
+Use the same name as the skill directory (e.g., `configs-create`). Create `promptfooconfig.yaml`:
@@ -264,7 +264,7 @@ Add an entry to `scripts/_manifest.js`:
 ```js
 {
   suite: "<skill-name>",
-  skillKey: "<domain>/<skill-name>", // e.g. "ai-configs/aiconfig-create"
+  skillKey: "<domain>/<skill-name>", // e.g. "agentcontrol/configs-create"
   skillDir: "skills/<domain>/<skill-name>",
   readme: "skills/<domain>/<skill-name>/README.md",
 },
@@ -364,7 +364,7 @@ Running `npm run eval:all` writes a summary at the repo root:
   "updatedAt": "2026-05-19T00:00:00Z",
   "lastCommit": "fc69376",
   "skills": {
-    "ai-configs/aiconfig-create": {
+    "agentcontrol/configs-create": {
       "score": 100,
       "passed": 4,
       "total": 4,
@@ -377,6 +377,6 @@ Running `npm run eval:all` writes a summary at the repo root:
 ```

 - `lastCommit` — the short git SHA at the time of the last `eval:all` run. Used by `eval:diff` to determine which suites have changed since scores were recorded.
-- `skillKey` — the canonical key is `<domain>/<skill-name>` (e.g., `ai-configs/aiconfig-create`).
+- `skillKey` — the canonical key is `<domain>/<skill-name>` (e.g., `agentcontrol/configs-create`).

 Run `node scripts/aggregate.js` (without `--run`) to rebuild this file from existing `<suite>/results.json` files without making any API calls.
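
Review note: an `eval-scores.json` written before this rename would still be keyed by the old `ai-configs/...` names. Rebuilding with the aggregate script is presumably the normal path. If a one-off key migration were ever needed, it might look like the sketch below. The `RENAMED_KEYS` map is inferred from the script comments in this diff, and `migrateSkillKeys` is a hypothetical helper that does not exist in the repo.

```javascript
// Old -> new skill keys, inferred from the script comments in this diff.
const RENAMED_KEYS = new Map([
  ["ai-configs/aiconfig-create", "agentcontrol/configs-create"],
  ["ai-configs/aiconfig-update", "agentcontrol/configs-update"],
  ["ai-configs/aiconfig-tools", "agentcontrol/tools"],
  ["ai-configs/aiconfig-variations", "agentcontrol/configs-variations"],
]);

// Hypothetical helper (not part of the repo): returns a copy of a parsed
// eval-scores.json object with renamed skill keys. Keys that are not in the
// map pass through unchanged, and the input object is not mutated.
function migrateSkillKeys(scores) {
  const skills = {};
  for (const [key, entry] of Object.entries(scores.skills ?? {})) {
    skills[RENAMED_KEYS.get(key) ?? key] = entry;
  }
  return { ...scores, skills };
}

// Sample input shaped like the summary excerpt shown in the README.
const before = {
  updatedAt: "2026-05-19T00:00:00Z",
  lastCommit: "fc69376",
  skills: {
    "ai-configs/aiconfig-create": { score: 100, passed: 4, total: 4 },
  },
};

const after = migrateSkillKeys(before);
console.log(Object.keys(after.skills));
```

Run with `node`, this prints the single migrated key, `agentcontrol/configs-create`. A `Map` is used rather than a plain object so that a skill key can never collide with an inherited property name.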