Commit 0719412

feat(evals): add Promptfoo-based AI tool call evaluation suite (#2351)
* feat(evals): add Promptfoo-based AI tool call evaluation suite
  Add automated evaluation infrastructure for validating LLM tool call quality across SuperDoc's Document Engine API. Tests whether models select the correct tools and construct valid arguments when given document editing tasks. The suite extracts 6 essential tool definitions from the SDK and runs them against multiple OpenAI models and cross-provider comparisons (Anthropic, Google). Includes deterministic assertions for tool selection, argument accuracy, and production correctness rules learned from the labs agent implementation.

* docs(evals): simplify README

* feat(evals): enhance GDPval benchmark configuration and test assertions
  Updated the GDPval benchmark configuration to include distinct prompts for SuperDoc tool-augmented and baseline models. Enhanced the test assertions in the GDPval workflows to provide clearer scoring criteria for model responses, focusing on the specificity and executable nature of the responses. Adjusted thresholds for scoring to better reflect the quality of tool calls and text descriptions in document editing tasks.

* chore(evals): update GPT model version in GDPval configuration
  Changed the model identifier from GPT-4o to GPT-5.4 in the GDPval benchmark configuration for both SuperDoc tool-augmented and baseline prompts, ensuring alignment with the latest model updates.

* feat(evals): add execution tests and enhance configuration for SuperDoc agent
  Introduced a new execution test suite for the SuperDoc agent, validating real document editing capabilities through the CLI. Added a new script command for executing these tests and updated the GDPval configuration to reflect the latest GPT model version. Included necessary dependencies and created a new provider for the SuperDoc agent to facilitate the execution of tool calls against DOCX files.

* feat(evals): enhance SuperDoc agent with document copy and round-trip validation
  Updated the SuperDoc agent to create temporary copies of documents for editing, ensuring original fixtures remain unaltered. Implemented round-trip validation to verify that edits persist after saving and re-opening DOCX files. Added a new memorandum fixture and expanded execution tests to cover various document editing scenarios, enhancing overall test coverage and reliability.

* feat(evals): add keepFile option to SuperDoc agent for document preservation
  Enhanced the SuperDoc agent to include a `keepFile` option, allowing users to save edited documents to a specified output directory. Updated the logic to create the output directory if it doesn't exist and modified the cleanup process to conditionally copy the edited document based on this new option. Adjusted execution tests to validate the new functionality, ensuring comprehensive coverage of document editing scenarios.

* feat(evals): increase maxConcurrency for SuperDoc agent tests and update execution logic
  Enhanced the SuperDoc agent's execution configuration by increasing the `maxConcurrency` from 1 to 5, allowing for more efficient concurrent test execution. Updated the cleanup process to ensure isolated state directories are properly managed, improving resource handling during tests. Adjusted execution tests to reflect these changes, ensuring robust validation of document editing capabilities.

* feat(evals): refactor SuperDoc agent evaluation scripts and enhance tool configuration
  Refactored the SuperDoc agent's evaluation scripts to streamline the execution process and improve clarity. Removed the deprecated cross-provider configuration and consolidated tool evaluation logic into a unified structure. Introduced new assertion checks for tool quality and argument accuracy, ensuring comprehensive validation of document editing tasks. Updated the test suite to reflect these changes, enhancing overall test coverage and reliability.

* feat(evals): add AI Gateway support and new execution configuration for SuperDoc agent
  Introduced the AI Gateway API key in the environment configuration to enable optional integration with Vercel AI Gateway. Added a new script command for executing evaluations through the gateway, enhancing the SuperDoc agent's capabilities. Created a new YAML configuration file for execution tests via the AI Gateway, allowing for testing across multiple models. Updated the package dependencies to include the necessary SDK for AI Gateway functionality.

* feat(evals): enhance SuperDoc agent with usage tracking and new customer prompt tests
  Updated the SuperDoc agent to include tracking of total usage and steps during text generation, improving performance insights. Added a series of customer prompt tests in YAML format to validate various document editing tasks, ensuring comprehensive coverage of real-world scenarios. This enhancement aims to bolster the agent's capabilities and testing framework.

* feat(evals): streamline evaluation configuration and remove deprecated files
  Removed the JavaScript assertion file and context builder, simplifying the evaluation framework. Updated the prompt configuration to eliminate unused metrics and added new document fixtures for testing. Enhanced execution tests to validate document editing capabilities with the new fixtures, ensuring comprehensive coverage of various scenarios.

* fix(evals): update model labels and refine execution test descriptions
  Updated the model labels in the execution gateway configuration for clarity and accuracy. Refined the execution test descriptions to better reflect the specific tasks being validated, enhancing the readability and intent of the tests. Commented out deprecated Google provider configurations to streamline the YAML files.

* chore(evals): clean up evaluation configurations and remove obsolete files
  Updated the .gitignore to exclude temporary files and removed deprecated YAML configuration files related to GDPval and execution tests. Streamlined the package.json by eliminating unused evaluation scripts, enhancing overall project organization and clarity.

* chore(evals): update pnpm-lock.yaml and .gitignore for dependency management
  Updated pnpm-lock.yaml to reflect new versions of dependencies, including @types/node and added new SDK entries for SuperDoc. Modified .gitignore to exclude additional temporary files and states, improving project cleanliness and organization.

* feat(evals): implement caching mechanism for SuperDoc agent evaluations
  Added a caching system to the SuperDoc agent and gateway providers to improve performance by storing and retrieving results based on a generated cache key. Updated the utility functions to handle cache operations, ensuring efficient reuse of previous evaluation results. Modified the evaluation logic to check for cached results before executing tasks, enhancing overall efficiency in the evaluation framework. Additionally, updated the package.json to reflect changes in evaluation scripts and added a new YAML configuration for end-to-end tests via the AI Gateway.

* feat(evals): expand evaluation framework with two-level testing and enhanced documentation
  Updated the evaluation framework to include two levels of testing: tool quality and execution. Enhanced the README to clarify testing processes, commands, and configurations. Introduced new YAML files for tool quality and execution tests, detailing the number of tests and providers involved. Improved command descriptions for better usability and added new document fixtures for comprehensive testing of document editing capabilities.

* feat(evals): add Vercel tools provider and enhance evaluation scripts
  Introduced a new Vercel tools provider for the SuperDoc evaluation framework, enabling structured tool calls with the Vercel AI SDK. Updated the package.json to include a new script for evaluating tools with the Vercel configuration. Enhanced the prompt configuration by adding a new YAML file for tool evaluations and refined existing evaluation scripts to support the new provider. Additionally, made minor adjustments to the presentation HTML for improved accessibility and clarity.

* chore(deps): update pnpm-lock.yaml to remove naive-ui and add @superdoc/common dependency
  Removed outdated naive-ui entries and added @superdoc/common as a workspace dependency in pnpm-lock.yaml, ensuring the project reflects the latest dependency structure.

* feat(evals): enhance evaluation scripts and update Vercel tools provider
  Refined evaluation scripts in package.json to output results to specific JSON files for better organization. Updated the Vercel tools provider to support live discovery of tools and improved error handling. Enhanced YAML configurations for tool evaluations, including clearer descriptions and adjustments to thresholds for tool-call metrics. Added caching functionality to optimize performance and ensure efficient reuse of evaluation results.

* fix: remove unused Vercel AI SDK evaluation configuration file and update tool quality test descriptions for clarity and consistency

* feat: add analysis functionality for eval results and update tool quality tests
  - Introduced a new script `analyze-results.mjs` for generating a visual HTML dashboard from evaluation results using the Claude Agent SDK.
  - Added new npm scripts: `analyze` and `eval:analyze` for easier result analysis.
  - Updated `package.json` to include the `@anthropic-ai/claude-agent-sdk` dependency.
  - Enhanced tool quality tests to allow for node search using either `query_match` or `blocks_list` for improved flexibility in evaluations.

* chore(deps): update @types/node version across multiple dependencies in pnpm-lock.yaml
  - Updated the version of @types/node from 25.5.0 to 22.19.2 for various dependencies to ensure compatibility and reduce potential conflicts.
  - Added the @anthropic-ai/claude-agent-sdk dependency with version 0.2.76.
  - Adjusted the version of esbuild in the vitest dependency to 0.27.2.

* fix(evals): address code review findings in eval assertions and caching
  - correctFormatArgs: validate args.inline on all format.apply steps, not just bold
  - nodeSearchOrBlocksList: enforce expected nodeType match for both query_match and blocks_list
  - compare-baselines: match tests by identity (description+provider+prompt) instead of array index
  - cacheKey: include prompt hash so prompt changes invalidate cached results
  - Remove unused eval:tools script referencing deleted config
1 parent edcb3c6 commit 0719412

30 files changed

Lines changed: 8430 additions & 155 deletions

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -71,6 +71,7 @@ perf-baseline-results.json
 .claude
 plans/
 
+tests/layout-snapshots
 tests/layout/candidate/
 tests/layout/reference/
 tests/layout/reports/

CLAUDE.md

Lines changed: 18 additions & 0 deletions
@@ -112,6 +112,24 @@ Many packages use `.js` files with JSDoc `@typedef` for type definitions (e.g.,
 - `pnpm dev` - Start dev server (from examples/)
 - `pnpm run generate:all` - Generate all derived artifacts (schemas, SDK clients, tool catalogs, reference docs)
 
+## AI Eval Suite
+
+The `evals/` directory contains a Promptfoo-based evaluation suite for validating AI tool call quality.
+
+| Command | What it does | Cost |
+|---------|-------------|------|
+| `pnpm --filter @superdoc-testing/evals run eval` | Run deterministic evals (reading + argument tests) | ~$0.30 |
+| `pnpm --filter @superdoc-testing/evals run eval:reading` | Run reading tool tests only | ~$0.15 |
+| `pnpm --filter @superdoc-testing/evals run eval:gdpval` | Run GDPval benchmark (Model+SuperDoc vs Model-Only) | ~$1-2 |
+| `pnpm --filter @superdoc-testing/evals run eval:view` | Open Promptfoo web UI with results | Free |
+| `pnpm --filter @superdoc-testing/evals run baseline:save <label>` | Save versioned results snapshot | Free |
+
+Tool definitions are extracted from `packages/sdk/tools/` via `evals/tools/extract.mjs`. Run `pnpm run generate:all` first if SDK artifacts are missing.
+
+Test files are YAML in `evals/tests/`. Each test has a `vars.task` prompt and JavaScript assertions that check tool call structure (Level 1: tool selection + argument accuracy, not execution).
+
+The system prompt at `evals/prompts/agent.txt` is a copy of the proven prompt from `examples/eval-demo/lib/agent.ts`. Update both when changing the prompt.
+
 ## Generated Artifacts
 
 These directories are produced by `pnpm run generate:all`:

evals/.env.example

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_API_KEY=...
AI_GATEWAY_API_KEY=...

evals/.gitignore

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
lib/essential.json
results/
.promptfoo/
node_modules/
.env
.env.local
__*
fixtures/.state*
fixtures/tmp-*

evals/README.md

Lines changed: 175 additions & 0 deletions
@@ -0,0 +1,175 @@
# SuperDoc AI Eval Suite

Tests whether LLMs correctly use SuperDoc's 193 document editing tools. Two levels: tool quality (does the model pick the right tool?) and execution (does the document actually change?).

## Quick start

```bash
cp .env.example .env # add AI_GATEWAY_API_KEY (+ optional OPENAI_API_KEY)
pnpm run extract-tools # extract tool definitions from SDK (run once after clone)
pnpm run eval # Level 1: tool quality (6 providers, 47 tests)
pnpm run eval:e2e # Level 2: execution (3 providers, 21 real DOCX tests)
pnpm run eval:view # open results in browser
```

## Two levels of testing

### Level 1: Tool Quality

Give the LLM a task and tool definitions. Check the response: did it pick the right tool with valid arguments? No document execution. Fast, cheap.

- **47 tests** across 9 categories (reading, mutations, formatting, structure, tables, comments, tracked changes, lists, hygiene)
- **6 providers** via Vercel AI Gateway: GPT-4o, GPT-4.1, GPT-4.1-mini, GPT-5.4, Claude Haiku 4.5, Gemini 2.5 Flash
- Config: `promptfooconfig.yaml`

### Level 2: Execution (E2E)

Run the full agent loop on real .docx files. Open document, LLM picks tools, CLI executes them. Assert the document content changed correctly.

- **21 tests** on 3 fixture documents (document.docx, memorandum.docx, table-doc.docx)
- **3 providers** via Vercel AI SDK + AI Gateway: GPT-5.4, Claude Haiku 4.5, Gemini 2.5 Pro
- Config: `promptfooconfig.e2e.yaml`
- Provider: `providers/superdoc-agent-gateway.mjs` (uses `generateText()` + `jsonSchema()` + `stepCountIs(10)`)
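
The agent loop described above (the model picks tools, something executes them, and the run stops after a bounded number of steps) can be sketched without any SDK. This is a minimal illustration, not the code in `providers/superdoc-agent-gateway.mjs`: `runAgentLoop`, `callModel`, and `executeTool` are hypothetical names, the message shapes are assumed, and the model is injected so the example runs offline. The default cap of 10 mirrors `stepCountIs(10)`.

```javascript
// Hypothetical sketch of a bounded agent loop. The model call and the tool
// executor are injected, so no network access or SDK is needed to run it.
async function runAgentLoop({ task, callModel, executeTool, maxSteps = 10 }) {
  const messages = [{ role: "user", content: task }];
  let steps = 0;

  while (steps < maxSteps) {
    steps += 1;
    const reply = await callModel(messages);

    // A reply without tool calls is treated as the final answer.
    if (!reply.toolCalls || reply.toolCalls.length === 0) {
      return { text: reply.text, steps, stoppedByCap: false };
    }

    // Record the tool calls, run each one, and feed the results back.
    messages.push({ role: "assistant", toolCalls: reply.toolCalls });
    for (const call of reply.toolCalls) {
      const result = await executeTool(call.name, call.args);
      messages.push({ role: "tool", name: call.name, content: JSON.stringify(result) });
    }
  }

  // The step cap was reached before the model produced a final answer.
  return { text: "", steps, stoppedByCap: true };
}
```

The step cap matters for cost control: a model that keeps requesting tools would otherwise loop indefinitely.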

## Commands

| Command | What it does |
|---------|-------------|
| `pnpm run eval` | Level 1: tool quality (all providers) |
| `pnpm run eval:e2e` | Level 2: execution via AI Gateway |
| `pnpm run eval:openai` | Level 1 filtered to OpenAI models only |
| `pnpm run eval:view` | Open Promptfoo web UI with results |
| `pnpm run eval:export` | Run eval + save to `results/latest.json` |
| `pnpm run eval:repeat` | Run 3x, no cache (variance testing) |
| `pnpm run extract-tools` | Re-extract tools from SDK |
| `pnpm run baseline:save` | Snapshot current results for comparison |
| `pnpm run baseline:compare` | Compare two snapshots for regressions |
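
Comparing snapshots is only meaningful if each test in one snapshot is paired with the same test in the other. The commit message notes that `compare-baselines` matches tests by identity (description + provider + prompt) rather than by array index. A minimal sketch of that idea follows; the row shape (`description`, `provider`, `prompt`, `pass`) and both function names are assumptions for illustration, not the actual snapshot format.

```javascript
// Build a stable identity for one result row (assumed field names).
// JSON.stringify on an array keeps the three parts unambiguous.
function testIdentity(row) {
  return JSON.stringify([row.description, row.provider, row.prompt]);
}

// Return rows that passed in `before` but fail in `after`,
// regardless of where they sit in either array.
function findRegressions(before, after) {
  const previous = new Map(before.map((row) => [testIdentity(row), row]));
  const regressions = [];
  for (const row of after) {
    const old = previous.get(testIdentity(row));
    if (old && old.pass && !row.pass) regressions.push(row);
  }
  return regressions;
}
```

Index-based matching would report false regressions whenever tests are added, removed, or reordered between snapshots; keying on identity avoids that.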

## Structure

```
evals/
promptfooconfig.yaml        Level 1: tool quality (6 providers via AI Gateway)
promptfooconfig.e2e.yaml    Level 2: execution (3 providers via AI Gateway)
prompts/
agent.txt                   System prompt (127 lines, tool categories + API guide)
minimal.txt                 Minimal baseline prompt (for GDPval)
tests/
tool-quality.yaml           47 tests: tool selection + argument validation
execution.yaml              21 tests: real DOCX editing + content assertions
providers/
superdoc-agent-gateway.mjs  Vercel AI SDK provider (any model via config.modelId)
superdoc-agent.mjs          OpenAI-only provider (legacy, direct API)
utils.mjs                   Shared: SDK loading, file management, caching
lib/
checks.cjs                  18 assertion functions (noHallucinatedParams, validOpNames, etc.)
normalize.cjs               Cross-provider tool call format normalization
extract.mjs                 SDK tool extraction script
essential.json              Extracted tool definitions (6 essential + discover_tools)
save-baseline.mjs           Save versioned result snapshot
compare-baselines.mjs       Compare two snapshots for regressions
fixtures/
document.docx               Bullet lists (read, replace, insert)
memorandum.docx             Legal memo (dates, amounts, party names)
table-doc.docx              Tables (headers, cell content)
contract.docx               Long contract (future stress tests)
comments-doc.docx           Document with comments (future)
results/
latest.json                 Most recent eval output
.cache/                     Response cache (keyed by model+fixture+task)
baselines/                  Versioned snapshots
output/                     Saved DOCX files from keepFile tests
```
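
The response cache listed above is described as keyed by model+fixture+task, and the commit message adds that the key also includes a prompt hash so that prompt edits invalidate cached results. One way such a key could be derived is sketched below. The `cacheKey` shown here, its field names, the choice of SHA-256, and the 32-character truncation are all assumptions for illustration, not the implementation in `providers/utils.mjs`.

```javascript
const { createHash } = require("node:crypto");

// Hex-encoded SHA-256 of a UTF-8 string.
function sha256(text) {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

// Hypothetical cache key: model + fixture + task + hash of the system prompt.
// Serialising as a JSON array keeps field boundaries unambiguous, so
// ("ab", "c") and ("a", "bc") cannot collide.
function cacheKey({ modelId, fixture, task, prompt }) {
  const payload = JSON.stringify([modelId, fixture, task, sha256(prompt)]);
  return sha256(payload).slice(0, 32);
}
```

Hashing the prompt rather than embedding it keeps the pre-hash payload small while still changing the key whenever any character of the prompt changes.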

## Writing tests

### Tool quality test (Level 1)

```yaml
- description: 'Replace uses text.rewrite, not bare replace'
  metadata: { category: mutation }
  vars:
    task: 'Replace "old title" with "new title" in the document.'
  assert:
    - type: tool-call-f1
      value: [query_match, apply_mutations]
      threshold: 0.5
      metric: tool_selection
    - type: javascript
      value: file://lib/checks.cjs:validOpNames
      metric: argument_accuracy
    - type: javascript
      value: file://lib/checks.cjs:noHallucinatedParams
      metric: argument_accuracy
```

`tool-call-f1` checks tool selection (F1 score). `file://lib/checks.cjs:functionName` runs a specific assertion function.
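
For intuition about the threshold, an F1 score over tool names can be computed as below. This is a sketch of the standard set-based F1 definition (harmonic mean of precision and recall), not Promptfoo's implementation; `toolCallF1` is a hypothetical helper.

```javascript
// Set-based F1 between the expected tool names and the names actually called.
function toolCallF1(expected, actual) {
  const want = new Set(expected);
  const got = new Set(actual);
  if (want.size === 0 || got.size === 0) return 0;

  // Count called tools that were expected.
  let hits = 0;
  for (const name of got) {
    if (want.has(name)) hits += 1;
  }
  if (hits === 0) return 0;

  const precision = hits / got.size; // share of called tools that were expected
  const recall = hits / want.size;   // share of expected tools that were called
  return (2 * precision * recall) / (precision + recall);
}
```

Under this definition, with `value: [query_match, apply_mutations]`, a response that calls only `query_match` has precision 1 and recall 0.5, giving an F1 of about 0.67, which still clears a `threshold` of 0.5.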

### Execution test (Level 2)

```yaml
- description: 'Replace: $25M to $50M, $150M untouched'
  vars:
    fixture: memorandum.docx
    task: 'Replace "$25,000,000" with "$50,000,000".'
  assert:
    - type: contains
      value: '$50,000,000'
    - type: not-contains
      value: '$25,000,000'
    - type: contains
      value: '$150,000,000'
```

Every execution test asserts: new content exists, old content gone, unrelated content intact.
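
The same three-part pattern can be expressed as one check over the document text. The helper below is a hypothetical illustration (`checkEdit` is not part of the suite, which uses Promptfoo's built-in `contains` and `not-contains` assertions as in the YAML above); it can be handy when reasoning about why a given edit would pass or fail.

```javascript
// Verify an edit against plain document text:
// the new content is present, the old content is gone,
// and a piece of unrelated content survived.
function checkEdit(text, { added, removed, untouched }) {
  const failures = [];
  if (!text.includes(added)) failures.push(`missing new content: ${added}`);
  if (text.includes(removed)) failures.push(`old content still present: ${removed}`);
  if (!text.includes(untouched)) failures.push(`unrelated content changed: ${untouched}`);
  return { pass: failures.length === 0, failures };
}
```

The third condition is what catches over-eager edits, such as a model that rewrites every dollar amount instead of only the requested one.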

## Assertion functions (`lib/checks.cjs`)

| Function | What it checks |
|----------|---------------|
| `noHallucinatedParams` | No `doc` or `sessionId` in tool args |
| `validOpNames` | Ops are `text.rewrite`/`text.insert`/`text.delete`, not bare `replace`/`insert`/`delete` |
| `stepFields` | Every step has `op` and `where` |
| `noRequireAny` | Mutations use `require: "first"`, not `"any"` |
| `noMixedBatch` | No `text.rewrite` + `format.apply` in same batch |
| `correctFormatArgs` | `format.apply` uses `{inline: {bold: true}}`, not `{bold: true}` |
| `textSearchArgs` | `query_match` has `select.type: "text"` + `select.pattern` |
| `nodeSearchArgs` | `query_match` has `select.type: "node"` + correct `nodeType` |
| `noTextInsertForStructure` | Headings/paragraphs use standalone tools, not `text.insert` |
| `validDiscoverGroups` | `discover_tools` groups are valid names |
| `isTrackedMode` | Tracked changes set `changeMode: "tracked"` |
| `isNotTrackedMode` | Direct edits do not set `changeMode: "tracked"` |
| `atomicMultiStep` | Multi-step has `atomic: true` + 2+ steps |

Each function: `(output, context) => { pass, score, reason }` or `true` (skip).
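
As an illustration of that signature, a check in the spirit of `noHallucinatedParams` might look like the following. This is a sketch, not the code in `lib/checks.cjs`: it assumes `output` is an array of OpenAI-style tool calls (`{ function: { name, arguments } }` with `arguments` as a JSON string), and it returns `true` to skip when there are no tool calls to inspect.

```javascript
// Parameters the model should never supply (taken from the table above).
const FORBIDDEN_PARAMS = ["doc", "sessionId"];

// Sketch of an assertion with the (output, context) => result shape.
// `context` is accepted for signature compatibility but unused here.
function noHallucinatedParamsSketch(output, context) {
  const calls = Array.isArray(output) ? output : [];
  if (calls.length === 0) return true; // nothing to check: skip

  for (const call of calls) {
    const fn = (call && call.function) || {};
    const name = fn.name || "(unnamed tool)";

    let args;
    try {
      args = JSON.parse(fn.arguments);
    } catch {
      return { pass: false, score: 0, reason: `${name}: arguments are not valid JSON` };
    }
    if (args === null || typeof args !== "object") {
      return { pass: false, score: 0, reason: `${name}: arguments are not an object` };
    }

    const bad = FORBIDDEN_PARAMS.filter((key) => key in args);
    if (bad.length > 0) {
      return { pass: false, score: 0, reason: `${name}: forbidden params ${bad.join(", ")}` };
    }
  }
  return { pass: true, score: 1, reason: "no forbidden params" };
}
```

Returning a `reason` string on failure is what makes the Promptfoo results view useful for debugging, since it names the offending tool and parameter.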

## Adding a new model

Add a provider to any config YAML:

```yaml
# In promptfooconfig.yaml (Level 1)
- id: vercel:anthropic/claude-sonnet-4.6
  label: Claude Sonnet 4.6
  delay: 1000
  config:
    temperature: 0
    tools: file://lib/essential.json
    maxTokens: 1024

# In promptfooconfig.e2e.yaml (Level 2)
- id: file://providers/superdoc-agent-gateway.mjs
  label: Claude Sonnet 4.6 (Gateway)
  config:
    modelId: anthropic/claude-sonnet-4.6
```

## Notes

- All providers route through **Vercel AI Gateway** (`AI_GATEWAY_API_KEY`). One key, all models.
- Run `pnpm run generate:all` from repo root if `extract-tools` fails (SDK artifacts need regenerating).
- `prompts/agent.txt` is the canonical system prompt. Update it when changing tool documentation.
- Promptfoo caches responses. Changing assertions re-runs on cached data for free. Clear: `npx promptfoo cache clear`.
- `normalize.cjs` converts Anthropic `tool_use` and Google `functionCall` formats to OpenAI format so all assertions work across providers.
- Execution provider caches results in `results/.cache/` (keyed by model+fixture+task). Disable: `PROMPTFOO_CACHE_ENABLED=false`.
- Files prefixed with `__` (e.g. `__promptfooconfig.gdpval.yaml`) are disabled/legacy configs kept for reference.
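
To illustrate the kind of normalization mentioned in the notes, the sketch below maps three provider shapes onto one OpenAI-style list. It is an example under stated assumptions, not the contents of `normalize.cjs`: the input shapes follow the publicly documented response formats (Anthropic `tool_use` content blocks with `name`/`input`, Google `functionCall` parts with `name`/`args`, OpenAI tool calls with `function.name` and a JSON-string `function.arguments`), and `normalizeToolCalls` is a hypothetical name.

```javascript
// Build an OpenAI-style tool call; arguments are serialised to a JSON string.
function toOpenAI(name, args) {
  return { type: "function", function: { name, arguments: JSON.stringify(args ?? {}) } };
}

// Convert a mixed list of provider-specific items into OpenAI-style tool calls.
// Items that are not tool calls (for example plain text blocks) are dropped.
function normalizeToolCalls(items) {
  const out = [];
  for (const item of Array.isArray(items) ? items : []) {
    if (!item) continue;
    if (item.type === "tool_use") {
      out.push(toOpenAI(item.name, item.input)); // Anthropic content block
    } else if (item.functionCall) {
      out.push(toOpenAI(item.functionCall.name, item.functionCall.args)); // Google part
    } else if (item.function) {
      out.push(item); // already OpenAI-style
    }
  }
  return out;
}
```

With a single target shape, each assertion only needs to read `function.name` and parse `function.arguments`, whichever provider produced the call.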
7.44 KB
Binary file not shown.

evals/fixtures/contract.docx

25.6 KB
Binary file not shown.

evals/fixtures/document.docx

85 KB
Binary file not shown.
16.2 KB
Binary file not shown.
14.2 KB
Binary file not shown.
