Tests whether LLMs correctly use SuperDoc's 193 document editing tools. Two levels: tool quality (does the model pick the right tool?) and execution (does the document actually change?).
+Promptfoo-based evaluation suite for SuperDoc document-editing tools.
+
+It has two layers:
+
+- Tool quality: does the model choose the right tool with the right arguments?
+- Execution: does the document actually change correctly when the full agent loop runs?
Run the full agent loop on real .docx files. Open document, LLM picks tools, CLI executes them. Assert the document content changed correctly.
+Run the full agent loop on real `.docx` fixtures. Open the document, let the model pick tools, execute them through the SDK/CLI, and assert on the resulting document text.

-- **21 tests** on 3 fixture documents (document.docx, memorandum.docx, table-doc.docx)
+- **21 tests** on 3 fixture documents: `document.docx`, `memorandum.docx`, `table-doc.docx`
 - **3 providers** via Vercel AI SDK + AI Gateway: GPT-5.4, Claude Haiku 4.5, Gemini 2.5 Pro
| `noTextInsertForStructure` | Headings/paragraphs use standalone create tools, not `text.insert` |
+| `validDiscoverGroups` | `discover_tools` loads valid group names |
+| `isTrackedMode` | Tracked changes use `changeMode: "tracked"` |
+| `isNotTrackedMode` | Direct edits do not use tracked mode |
+| `atomicMultiStep` | Multi-step mutations are atomic and grouped together |
+| `usesDeleteOp` | The mutation includes a delete-style op |
+| `usesRewriteOp` | The mutation includes `text.rewrite` |
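
To make the tracked-mode rows above concrete, here is a minimal sketch of how such a check could be written. It is not the repository's implementation: the function names `usesTrackedMode` and `avoidsTrackedMode`, the tool name `example_edit`, and the idea that `changeMode` is a top-level tool argument are all assumptions made for illustration. It also assumes tool calls have already been normalized to the OpenAI shape (`function.name` plus a JSON string in `function.arguments`).

```javascript
// Hypothetical sketch, not the repository's implementation.
// Assumes OpenAI-style tool calls:
//   { type: "function", function: { name: "...", arguments: "<JSON string>" } }
// and that `changeMode` is a top-level argument (an assumption).

// Parse the arguments of one tool call, tolerating objects and bad JSON.
function parseArgs(call) {
  const raw = call?.function?.arguments;
  if (typeof raw !== "string") return raw ?? {};
  try {
    return JSON.parse(raw);
  } catch {
    return {}; // malformed JSON is treated as "no arguments"
  }
}

// True when at least one call was made and every call requests tracked mode.
function usesTrackedMode(toolCalls) {
  if (!Array.isArray(toolCalls) || toolCalls.length === 0) return false;
  return toolCalls.every((call) => parseArgs(call).changeMode === "tracked");
}

// Inverse check for direct edits: no call may request tracked mode.
function avoidsTrackedMode(toolCalls) {
  if (!Array.isArray(toolCalls) || toolCalls.length === 0) return false;
  return toolCalls.every((call) => parseArgs(call).changeMode !== "tracked");
}

// Usage with a single hypothetical tool call.
const calls = [
  {
    type: "function",
    function: {
      name: "example_edit", // hypothetical tool name
      arguments: JSON.stringify({ changeMode: "tracked" }),
    },
  },
];
console.log(usesTrackedMode(calls)); // true
console.log(avoidsTrackedMode(calls)); // false
```

Returning `false` for an empty call list keeps a model that made no tool call at all from passing either check by accident.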

 ## Adding a new model

-Add a provider to any config YAML:
+### Level 1: native Promptfoo providers
+
+Add another native provider to `promptfooconfig.yaml`:

```yaml
-# In promptfooconfig.yaml (Level 1)
-- id: vercel:anthropic/claude-sonnet-4.6
-  label: Claude Sonnet 4.6
-  delay: 1000
+- id: openai:chat:gpt-4.1
+  label: GPT-4.1
   config:
     temperature: 0
+    seed: 42
     tools: file://lib/essential.json
-    maxTokens: 1024
+    tool_choice: required
+    timeout: 30000
```

-# In promptfooconfig.e2e.yaml (Level 2)
+`promptfooconfig.yaml` also includes commented native Anthropic and Google examples.
+
+### Level 2: AI Gateway execution providers
+
+Add another entry to `promptfooconfig.e2e.yaml`:
+
```yaml
 - id: file://providers/superdoc-agent-gateway.mjs
   label: Claude Sonnet 4.6 (Gateway)
   config:
```
@@ -166,10 +208,21 @@ Add a provider to any config YAML:

 ## Notes

-- All providers route through **Vercel AI Gateway** (`AI_GATEWAY_API_KEY`). One key, all models.
-- Run `pnpm run generate:all` from repo root if `extract-tools` fails (SDK artifacts need regenerating).
-- `prompts/agent.txt` is the canonical system prompt. Update it when changing tool documentation.
-- Promptfoo caches responses. Changing assertions re-runs on cached data for free. Clear: `npx promptfoo cache clear`.
-- `normalize.cjs` converts Anthropic `tool_use` and Google `functionCall` formats to OpenAI format so all assertions work across providers.
-- Execution provider caches results in `results/.cache/` (keyed by model+fixture+task). Disable: `PROMPTFOO_CACHE_ENABLED=false`.
-- Files prefixed with `__` (e.g. `__promptfooconfig.gdpval.yaml`) are disabled/legacy configs kept for reference.
+- `lib/essential.json` is generated and gitignored. If it is missing, run `pnpm run extract-tools`.
+- If `extract-tools` fails because `packages/sdk/tools/*.json` are missing, run `pnpm run generate:all` from the repo root first.
+- Level 1 currently uses native OpenAI Promptfoo providers. Level 2 uses a custom provider that routes through Vercel AI Gateway.
+- `pnpm run view` is the correct script name. There is no `eval:view` script in the current package.
+- `pnpm run analyze` reads `results/latest.json`, writes `results/analysis.html`, and requires `ANTHROPIC_API_KEY`.
+- Promptfoo caches model responses. Clear Promptfoo's cache with `npx promptfoo cache clear`.
+- The custom execution provider also caches results in `results/.cache/`. Disable it with `PROMPTFOO_CACHE_ENABLED=false`.
+
+## Exit codes and troubleshooting
+
+- Promptfoo exits non-zero when tests fail. By default it uses pass-rate threshold `100` and failed-test exit code `100`, so a run can write results successfully and still return exit status `100`.
+- To treat a failing eval run as a successful shell command, set either `PROMPTFOO_PASS_RATE_THRESHOLD=0` or `PROMPTFOO_FAILED_TEST_EXIT_CODE=0`.
+- If Promptfoo crashes with a missing `better-sqlite3` binding, approve and rebuild native packages: