
Commit 266c11b

chore: update evals readme
1 parent 0719412 commit 266c11b

3 files changed

Lines changed: 188 additions & 1183 deletions

File tree

evals/README.md

Lines changed: 132 additions & 79 deletions
````diff
@@ -1,83 +1,111 @@
 # SuperDoc AI Eval Suite
 
-Tests whether LLMs correctly use SuperDoc's 193 document editing tools. Two levels: tool quality (does the model pick the right tool?) and execution (does the document actually change?).
+Promptfoo-based evaluation suite for SuperDoc document-editing tools.
+
+It has two layers:
+
+- Tool quality: does the model choose the right tool with the right arguments?
+- Execution: does the document actually change correctly when the full agent loop runs?
 
 ## Quick start
 
+Run these commands from the repo root:
+
 ```bash
-cp .env.example .env     # add AI_GATEWAY_API_KEY (+ optional OPENAI_API_KEY)
-pnpm run extract-tools   # extract tool definitions from SDK (run once after clone)
-pnpm run eval            # Level 1: tool quality (6 providers, 47 tests)
-pnpm run eval:e2e        # Level 2: execution (3 providers, 21 real DOCX tests)
-pnpm run eval:view       # open results in browser
+pnpm install
+pnpm run generate:all    # if packages/sdk/tools/*.json are missing
+cp evals/.env.example evals/.env
+pnpm --filter @superdoc-testing/evals run extract-tools
+pnpm --filter @superdoc-testing/evals run eval:openai   # Level 1
+pnpm --prefix apps/cli run build                        # required for Level 2
+pnpm --filter @superdoc-testing/evals run eval:e2e      # Level 2
+pnpm --filter @superdoc-testing/evals run view
 ```
 
+Edit `evals/.env` before running:
+
+- `OPENAI_API_KEY` for `eval` and `eval:openai`
+- `AI_GATEWAY_API_KEY` for `eval:e2e`
+- `ANTHROPIC_API_KEY` for `analyze`
+- `GOOGLE_API_KEY` only if you enable a native Google provider in `promptfooconfig.yaml`
+
+If you prefer to work inside `evals/`, the same scripts are available as `pnpm run <script>`.
+
 ## Two levels of testing
 
-### Level 1: Tool Quality
+### Level 1: Tool quality
 
-Give the LLM a task and tool definitions. Check the response: did it pick the right tool with valid arguments? No document execution. Fast, cheap.
+Give the model a task plus a small essential tool bundle. Check whether it chooses the right tools and arguments. No real document execution.
 
-- **47 tests** across 9 categories (reading, mutations, formatting, structure, tables, comments, tracked changes, lists, hygiene)
-- **6 providers** via Vercel AI Gateway: GPT-4o, GPT-4.1, GPT-4.1-mini, GPT-5.4, Claude Haiku 4.5, Gemini 2.5 Flash
+- **31 tests** across 12 categories
+- **2 prompts**: `prompts/agent.txt` and `prompts/minimal.txt`
+- **3 active providers** via native Promptfoo OpenAI providers: GPT-4o, GPT-4.1-mini, GPT-5.4
+- **186 evaluations per full run**: 31 tests x 2 prompts x 3 providers
 - Config: `promptfooconfig.yaml`
+- Tool bundle: `lib/essential.json` (generated, gitignored)
 
 ### Level 2: Execution (E2E)
 
-Run the full agent loop on real .docx files. Open document, LLM picks tools, CLI executes them. Assert the document content changed correctly.
+Run the full agent loop on real `.docx` fixtures. Open the document, let the model pick tools, execute them through the SDK/CLI, and assert on the resulting document text.
 
-- **21 tests** on 3 fixture documents (document.docx, memorandum.docx, table-doc.docx)
+- **21 tests** on 3 fixture documents: `document.docx`, `memorandum.docx`, `table-doc.docx`
 - **3 providers** via Vercel AI SDK + AI Gateway: GPT-5.4, Claude Haiku 4.5, Gemini 2.5 Pro
 - Config: `promptfooconfig.e2e.yaml`
-- Provider: `providers/superdoc-agent-gateway.mjs` (uses `generateText()` + `jsonSchema()` + `stepCountIs(10)`)
+- Provider: `providers/superdoc-agent-gateway.mjs`
 
 ## Commands
 
 | Command | What it does |
-|---------|-------------|
-| `pnpm run eval` | Level 1: tool quality (all providers) |
-| `pnpm run eval:e2e` | Level 2: execution via AI Gateway |
-| `pnpm run eval:openai` | Level 1 filtered to OpenAI models only |
-| `pnpm run eval:view` | Open Promptfoo web UI with results |
-| `pnpm run eval:export` | Run eval + save to `results/latest.json` |
-| `pnpm run eval:repeat` | Run 3x, no cache (variance testing) |
-| `pnpm run extract-tools` | Re-extract tools from SDK |
-| `pnpm run baseline:save` | Snapshot current results for comparison |
-| `pnpm run baseline:compare` | Compare two snapshots for regressions |
+|---------|--------------|
+| `pnpm run extract-tools` | Generate `lib/essential.json` from SDK tool catalogs |
+| `pnpm run eval` | Level 1 across all active providers in `promptfooconfig.yaml` |
+| `pnpm run eval:openai` | Level 1 filtered to `GPT-*` providers; currently equivalent to `eval` |
+| `pnpm run eval:e2e` | Level 2 execution tests via AI Gateway |
+| `pnpm run eval:repeat` | Repeat Level 1 three times with Promptfoo cache disabled |
+| `pnpm run view` | Open the Promptfoo results UI |
+| `pnpm run analyze` | Generate an HTML analysis dashboard from `results/latest.json` |
+| `pnpm run eval:analyze` | Run Level 1, then generate the HTML analysis dashboard |
+| `pnpm run baseline:save <label>` | Save `results/latest.json` as a versioned baseline |
+| `pnpm run baseline:compare <a> <b>` | Compare two saved baselines |
 
 ## Structure
 
-```
+```text
 evals/
-  promptfooconfig.yaml          Level 1: tool quality (6 providers via AI Gateway)
-  promptfooconfig.e2e.yaml      Level 2: execution (3 providers via AI Gateway)
+  promptfooconfig.yaml          Level 1 tool-quality config
+  promptfooconfig.e2e.yaml      Level 2 execution config
   prompts/
-    agent.txt                   System prompt (127 lines, tool categories + API guide)
-    minimal.txt                 Minimal baseline prompt (for GDPval)
+    agent.txt                   Main system prompt
+    minimal.txt                 Minimal baseline prompt
   tests/
-    tool-quality.yaml           47 tests: tool selection + argument validation
-    execution.yaml              21 tests: real DOCX editing + content assertions
+    tool-quality.yaml           31 tool-selection / argument-shape tests
+    execution.yaml              21 real DOCX editing tests
   providers/
-    superdoc-agent-gateway.mjs  Vercel AI SDK provider (any model via config.modelId)
-    superdoc-agent.mjs          OpenAI-only provider (legacy, direct API)
-    utils.mjs                   Shared: SDK loading, file management, caching
+    superdoc-agent-gateway.mjs  AI SDK + AI Gateway execution provider
+    superdoc-agent.mjs          Legacy direct OpenAI execution provider
+    vercel-tools.mjs            Capture-only AI SDK provider for tool-call experiments
+    utils.mjs                   Shared SDK loading, file management, caching
   lib/
-    checks.cjs                  18 assertion functions (noHallucinatedParams, validOpNames, etc.)
-    normalize.cjs               Cross-provider tool call format normalization
+    checks.cjs                  Assertion helpers for tool-call validation
+    normalize.cjs               Cross-provider tool call normalization
     extract.mjs                 SDK tool extraction script
-    essential.json              Extracted tool definitions (6 essential + discover_tools)
-    save-baseline.mjs           Save versioned result snapshot
-    compare-baselines.mjs       Compare two snapshots for regressions
+    essential.json              Generated tool bundle: 7 essential tools + discover_tools
+    save-baseline.mjs           Save versioned result snapshots
+    compare-baselines.mjs       Compare baseline snapshots
+    analyze-results.mjs         Generate HTML analysis from eval output
   fixtures/
-    document.docx               Bullet lists (read, replace, insert)
-    memorandum.docx             Legal memo (dates, amounts, party names)
-    table-doc.docx              Tables (headers, cell content)
-    contract.docx               Long contract (future stress tests)
-    comments-doc.docx           Document with comments (future)
+    document.docx               Bullet-list fixture
+    memorandum.docx             Legal memo fixture
+    table-doc.docx              Table fixture
+    contract.docx               Longer contract fixture
+    comments-doc.docx           Comment fixture
   results/
-    latest.json                 Most recent eval output
-    .cache/                     Response cache (keyed by model+fixture+task)
-    baselines/                  Versioned snapshots
+    latest.json                 Latest Level 1 output
+    latest-openai.json          Latest Level 1 filtered OpenAI output
+    latest-e2e.json             Latest Level 2 output
+    analysis.html               Generated analysis dashboard
+    .cache/                     Provider cache
+    baselines/                  Saved snapshots
   output/                       Saved DOCX files from keepFile tests
 ```
 
````
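The `normalize.cjs` helper listed under `lib/` above converts Anthropic `tool_use` and Google `functionCall` payloads into the OpenAI tool-call shape so the same assertions run unchanged across providers. A hypothetical sketch of that idea; the function name and exact field handling here are assumptions, not the actual module:

```javascript
// Hypothetical sketch of cross-provider tool-call normalization in the
// spirit of lib/normalize.cjs. Field names follow the public Anthropic,
// Google, and OpenAI response shapes; the real module may differ.
function normalizeToolCall(raw) {
  // Anthropic content block: { type: "tool_use", name, input }
  if (raw.type === "tool_use") {
    return { name: raw.name, arguments: JSON.stringify(raw.input ?? {}) };
  }
  // Google part: { functionCall: { name, args } }
  if (raw.functionCall) {
    return {
      name: raw.functionCall.name,
      arguments: JSON.stringify(raw.functionCall.args ?? {}),
    };
  }
  // OpenAI tool call: { function: { name, arguments } } is already the target shape
  if (raw.function) {
    return { name: raw.function.name, arguments: raw.function.arguments };
  }
  return null;
}
```

Once every provider is mapped to `{ name, arguments }`, a single set of checks can `JSON.parse` the arguments regardless of which model produced the call.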

````diff
@@ -103,7 +131,7 @@ evals/
       metric: argument_accuracy
 ```
 
-`tool-call-f1` checks tool selection (F1 score). `file://lib/checks.cjs:functionName` runs a specific assertion function.
+`tool-call-f1` checks tool selection. `file://lib/checks.cjs:functionName` runs a named assertion helper.
 
 ### Execution test (Level 2)
 
````
````diff
@@ -121,43 +149,57 @@ evals/
       value: '$150,000,000'
 ```
 
-Every execution test asserts: new content exists, old content gone, unrelated content intact.
+Execution tests should assert all three:
+
+- New content exists
+- Old content is gone
+- Unrelated content is still intact
 
-## Assertion functions (`lib/checks.cjs`)
+## Assertion helpers (`lib/checks.cjs`)
 
 | Function | What it checks |
-|----------|---------------|
-| `noHallucinatedParams` | No `doc` or `sessionId` in tool args |
-| `validOpNames` | Ops are `text.rewrite`/`text.insert`/`text.delete`, not bare `replace`/`insert`/`delete` |
-| `stepFields` | Every step has `op` and `where` |
-| `noRequireAny` | Mutations use `require: "first"`, not `"any"` |
-| `noMixedBatch` | No `text.rewrite` + `format.apply` in same batch |
-| `correctFormatArgs` | `format.apply` uses `{inline: {bold: true}}`, not `{bold: true}` |
-| `textSearchArgs` | `query_match` has `select.type: "text"` + `select.pattern` |
-| `nodeSearchArgs` | `query_match` has `select.type: "node"` + correct `nodeType` |
-| `noTextInsertForStructure` | Headings/paragraphs use standalone tools, not `text.insert` |
-| `validDiscoverGroups` | `discover_tools` groups are valid names |
-| `isTrackedMode` | Tracked changes set `changeMode: "tracked"` |
-| `isNotTrackedMode` | Direct edits do not set `changeMode: "tracked"` |
-| `atomicMultiStep` | Multi-step has `atomic: true` + 2+ steps |
-
-Each function: `(output, context) => { pass, score, reason }` or `true` (skip).
+|----------|----------------|
+| `noHallucinatedParams` | No non-empty `doc` or `sessionId` arguments |
+| `validOpNames` | Mutation ops use `text.rewrite` / `text.insert` / `text.delete` |
+| `stepFields` | Every mutation step has `op` and `where` |
+| `noRequireAny` | Mutations do not use `require: "any"` |
+| `noMixedBatch` | Text edits and `format.apply` are not mixed in one batch |
+| `correctFormatArgs` | `format.apply` nests formatting under `args.inline` |
+| `textSearchArgs` | `query_match` uses a valid text selector |
+| `nodeSearchArgs` | `query_match` uses a valid node selector |
+| `nodeSearchOrBlocksList` | Listing nodes uses `query_match` or `blocks_list` correctly |
+| `noTextInsertForStructure` | Headings/paragraphs use standalone create tools, not `text.insert` |
+| `validDiscoverGroups` | `discover_tools` loads valid group names |
+| `isTrackedMode` | Tracked changes use `changeMode: "tracked"` |
+| `isNotTrackedMode` | Direct edits do not use tracked mode |
+| `atomicMultiStep` | Multi-step mutations are atomic and grouped together |
+| `usesDeleteOp` | The mutation includes a delete-style op |
+| `usesRewriteOp` | The mutation includes `text.rewrite` |
 
````
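As the pre-change README spelled out, each helper in `lib/checks.cjs` takes `(output, context)` and returns `{ pass, score, reason }`, or `true` to skip. A minimal hypothetical helper in that style; the function name and the normalized tool-call shape it assumes are illustrative, not part of the actual suite:

```javascript
// Hypothetical checks.cjs-style assertion helper (illustrative only).
// It scans normalized tool calls and fails if any call smuggles in a
// sessionId argument, in the spirit of noHallucinatedParams.
function noSessionIdArg(output, context) {
  const calls = Array.isArray(output) ? output : [];
  const offenders = calls.filter((call) => {
    try {
      const args = JSON.parse(call.arguments ?? "{}");
      return Boolean(args.sessionId);
    } catch {
      return false; // unparseable arguments are another check's concern
    }
  });
  if (offenders.length > 0) {
    return {
      pass: false,
      score: 0,
      reason: `sessionId present in ${offenders.length} tool call(s)`,
    };
  }
  return { pass: true, score: 1, reason: "no sessionId arguments" };
}
```

Promptfoo invokes such a helper through a `file://lib/checks.cjs:functionName` assertion, so the return object's `reason` surfaces directly in the results UI.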

````diff
 ## Adding a new model
 
-Add a provider to any config YAML:
+### Level 1: native Promptfoo providers
+
+Add another native provider to `promptfooconfig.yaml`:
 
 ```yaml
-# In promptfooconfig.yaml (Level 1)
-- id: vercel:anthropic/claude-sonnet-4.6
-  label: Claude Sonnet 4.6
-  delay: 1000
+- id: openai:chat:gpt-4.1
+  label: GPT-4.1
   config:
     temperature: 0
+    seed: 42
     tools: file://lib/essential.json
-    maxTokens: 1024
+    tool_choice: required
+    timeout: 30000
+```
 
-# In promptfooconfig.e2e.yaml (Level 2)
+`promptfooconfig.yaml` also includes commented native Anthropic and Google examples.
+
+### Level 2: AI Gateway execution providers
+
+Add another entry to `promptfooconfig.e2e.yaml`:
+
+```yaml
 - id: file://providers/superdoc-agent-gateway.mjs
   label: Claude Sonnet 4.6 (Gateway)
   config:
@@ -166,10 +208,21 @@ Add a provider to any config YAML:
 
 ## Notes
 
-- All providers route through **Vercel AI Gateway** (`AI_GATEWAY_API_KEY`). One key, all models.
-- Run `pnpm run generate:all` from repo root if `extract-tools` fails (SDK artifacts need regenerating).
-- `prompts/agent.txt` is the canonical system prompt. Update it when changing tool documentation.
-- Promptfoo caches responses. Changing assertions re-runs on cached data for free. Clear: `npx promptfoo cache clear`.
-- `normalize.cjs` converts Anthropic `tool_use` and Google `functionCall` formats to OpenAI format so all assertions work across providers.
-- Execution provider caches results in `results/.cache/` (keyed by model+fixture+task). Disable: `PROMPTFOO_CACHE_ENABLED=false`.
-- Files prefixed with `__` (e.g. `__promptfooconfig.gdpval.yaml`) are disabled/legacy configs kept for reference.
+- `lib/essential.json` is generated and gitignored. If it is missing, run `pnpm run extract-tools`.
+- If `extract-tools` fails because `packages/sdk/tools/*.json` are missing, run `pnpm run generate:all` from the repo root first.
+- Level 1 currently uses native OpenAI Promptfoo providers. Level 2 uses a custom provider that routes through Vercel AI Gateway.
+- `pnpm run view` is the correct script name. There is no `eval:view` script in the current package.
+- `pnpm run analyze` reads `results/latest.json`, writes `results/analysis.html`, and requires `ANTHROPIC_API_KEY`.
+- Promptfoo caches model responses. Clear Promptfoo's cache with `npx promptfoo cache clear`.
+- The custom execution provider also caches results in `results/.cache/`. Disable it with `PROMPTFOO_CACHE_ENABLED=false`.
+
+## Exit codes and troubleshooting
+
+- Promptfoo exits non-zero when tests fail. By default it uses pass-rate threshold `100` and failed-test exit code `100`, so a run can write results successfully and still return exit status `100`.
+- To treat a failing eval run as a successful shell command, set either `PROMPTFOO_PASS_RATE_THRESHOLD=0` or `PROMPTFOO_FAILED_TEST_EXIT_CODE=0`.
+- If Promptfoo crashes with a missing `better-sqlite3` binding, approve and rebuild native packages:
+
+```bash
+pnpm approve-builds
+pnpm rebuild better-sqlite3
+```
````

package.json

Lines changed: 23 additions & 1 deletion
````diff
@@ -135,7 +135,29 @@
       "@vue/server-renderer": "3.5.25",
       "@vue/shared": "3.5.25",
       "vite": "npm:rolldown-vite@7.3.1"
-    }
+    },
+    "ignoredBuiltDependencies": [
+      "@parcel/watcher",
+      "@playwright/browser-chromium",
+      "@swc/core",
+      "lmdb",
+      "msgpackr-extract",
+      "msw",
+      "onnxruntime-node",
+      "protobufjs"
+    ],
+    "onlyBuiltDependencies": [
+      "@vscode/vsce-sign",
+      "better-sqlite3",
+      "canvas",
+      "esbuild",
+      "keytar",
+      "lefthook",
+      "puppeteer",
+      "sharp",
+      "unrs-resolver",
+      "vue-demi"
+    ]
   },
   "dependencies": {
     "superdoc": "link:../../../packages/superdoc"
````
