Commit 0719412
feat(evals): add Promptfoo-based AI tool call evaluation suite (#2351)
* feat(evals): add Promptfoo-based AI tool call evaluation suite
Add automated evaluation infrastructure for validating LLM tool call
quality across SuperDoc's Document Engine API. Tests whether models
select the correct tools and construct valid arguments when given
document editing tasks.
The suite extracts 6 essential tool definitions from the SDK and
runs them against multiple OpenAI models and cross-provider comparisons
(Anthropic, Google). Includes deterministic assertions for tool
selection, argument accuracy, and production correctness rules learned
from the labs agent implementation.
* docs(evals): simplify README
* feat(evals): enhance GDPval benchmark configuration and test assertions
Updated the GDPval benchmark configuration to include distinct prompts for SuperDoc tool-augmented and baseline models. Enhanced the test assertions in the GDPval workflows to provide clearer scoring criteria for model responses, focusing on the specificity and executable nature of the responses. Adjusted thresholds for scoring to better reflect the quality of tool calls and text descriptions in document editing tasks.
* chore(evals): update GPT model version in GDPval configuration
Changed the model identifier from GPT-4o to GPT-5.4 in the GDPval benchmark configuration for both SuperDoc tool-augmented and baseline prompts, ensuring alignment with the latest model updates.
* feat(evals): add execution tests and enhance configuration for SuperDoc agent
Introduced a new execution test suite for the SuperDoc agent, validating real document editing capabilities through the CLI. Added a new script command for executing these tests and updated the GDPval configuration to reflect the latest GPT model version. Included necessary dependencies and created a new provider for the SuperDoc agent to facilitate the execution of tool calls against DOCX files.
* feat(evals): enhance SuperDoc agent with document copy and round-trip validation
Updated the SuperDoc agent to create temporary copies of documents for editing, ensuring original fixtures remain unaltered. Implemented round-trip validation to verify that edits persist after saving and re-opening DOCX files. Added a new memorandum fixture and expanded execution tests to cover various document editing scenarios, enhancing overall test coverage and reliability.
* feat(evals): add keepFile option to SuperDoc agent for document preservation
Enhanced the SuperDoc agent to include a `keepFile` option, allowing users to save edited documents to a specified output directory. Updated the logic to create the output directory if it doesn't exist and modified the cleanup process to conditionally copy the edited document based on this new option. Adjusted execution tests to validate the new functionality, ensuring comprehensive coverage of document editing scenarios.
* feat(evals): increase maxConcurrency for SuperDoc agent tests and update execution logic
Enhanced the SuperDoc agent's execution configuration by increasing the `maxConcurrency` from 1 to 5, allowing for more efficient concurrent test execution. Updated the cleanup process to ensure isolated state directories are properly managed, improving resource handling during tests. Adjusted execution tests to reflect these changes, ensuring robust validation of document editing capabilities.
* feat(evals): refactor SuperDoc agent evaluation scripts and enhance tool configuration
Refactored the SuperDoc agent's evaluation scripts to streamline the execution process and improve clarity. Removed the deprecated cross-provider configuration and consolidated tool evaluation logic into a unified structure. Introduced new assertion checks for tool quality and argument accuracy, ensuring comprehensive validation of document editing tasks. Updated the test suite to reflect these changes, enhancing overall test coverage and reliability.
* feat(evals): add AI Gateway support and new execution configuration for SuperDoc agent
Introduced the AI Gateway API key in the environment configuration to enable optional integration with Vercel AI Gateway. Added a new script command for executing evaluations through the gateway, enhancing the SuperDoc agent's capabilities. Created a new YAML configuration file for execution tests via the AI Gateway, allowing for testing across multiple models. Updated the package dependencies to include the necessary SDK for AI Gateway functionality.
* feat(evals): enhance SuperDoc agent with usage tracking and new customer prompt tests
Updated the SuperDoc agent to track total usage and step counts during text generation. Added a series of customer prompt tests in YAML format that cover a range of real-world document editing tasks.
* feat(evals): streamline evaluation configuration and remove deprecated files
Removed the JavaScript assertion file and context builder, simplifying the evaluation framework. Updated the prompt configuration to eliminate unused metrics and added new document fixtures for testing. Enhanced execution tests to validate document editing capabilities with the new fixtures, ensuring comprehensive coverage of various scenarios.
* fix(evals): update model labels and refine execution test descriptions
Updated the model labels in the execution gateway configuration for clarity and accuracy. Refined the execution test descriptions to better reflect the specific tasks being validated, enhancing the readability and intent of the tests. Commented out deprecated Google provider configurations to streamline the YAML files.
* chore(evals): clean up evaluation configurations and remove obsolete files
Updated the .gitignore to exclude temporary files and removed deprecated YAML configuration files related to GDPval and execution tests. Streamlined the package.json by eliminating unused evaluation scripts, enhancing overall project organization and clarity.
* chore(evals): update pnpm-lock.yaml and .gitignore for dependency management
Updated pnpm-lock.yaml to reflect new versions of dependencies, including @types/node and added new SDK entries for SuperDoc. Modified .gitignore to exclude additional temporary files and states, improving project cleanliness and organization.
* feat(evals): implement caching mechanism for SuperDoc agent evaluations
Added a caching system to the SuperDoc agent and gateway providers to improve performance by storing and retrieving results based on a generated cache key. Updated the utility functions to handle cache operations, ensuring efficient reuse of previous evaluation results. Modified the evaluation logic to check for cached results before executing tasks, enhancing overall efficiency in the evaluation framework. Additionally, updated the package.json to reflect changes in evaluation scripts and added a new YAML configuration for end-to-end tests via the AI Gateway.
* feat(evals): expand evaluation framework with two-level testing and enhanced documentation
Updated the evaluation framework to include two levels of testing: tool quality and execution. Enhanced the README to clarify testing processes, commands, and configurations. Introduced new YAML files for tool quality and execution tests, detailing the number of tests and providers involved. Improved command descriptions for better usability and added new document fixtures for comprehensive testing of document editing capabilities.
* feat(evals): add Vercel tools provider and enhance evaluation scripts
Introduced a new Vercel tools provider for the SuperDoc evaluation framework, enabling structured tool calls with the Vercel AI SDK. Updated the package.json to include a new script for evaluating tools with the Vercel configuration. Enhanced the prompt configuration by adding a new YAML file for tool evaluations and refined existing evaluation scripts to support the new provider. Additionally, made minor adjustments to the presentation HTML for improved accessibility and clarity.
* chore(deps): update pnpm-lock.yaml to remove naive-ui and add @superdoc/common dependency
Removed outdated naive-ui entries and added @superdoc/common as a workspace dependency in pnpm-lock.yaml, ensuring the project reflects the latest dependency structure.
* feat(evals): enhance evaluation scripts and update Vercel tools provider
Refined evaluation scripts in package.json to output results to specific JSON files for better organization. Updated the Vercel tools provider to support live discovery of tools and improved error handling. Enhanced YAML configurations for tool evaluations, including clearer descriptions and adjustments to thresholds for tool-call metrics. Added caching functionality to optimize performance and ensure efficient reuse of evaluation results.
* fix: remove unused Vercel AI SDK evaluation configuration file and update tool quality test descriptions for clarity and consistency
* feat: add analysis functionality for eval results and update tool quality tests
- Introduced a new script `analyze-results.mjs` for generating a visual HTML dashboard from evaluation results using the Claude Agent SDK.
- Added new npm scripts: `analyze` and `eval:analyze` for easier result analysis.
- Updated `package.json` to include the `@anthropic-ai/claude-agent-sdk` dependency.
- Enhanced tool quality tests to allow for node search using either `query_match` or `blocks_list` for improved flexibility in evaluations.
* chore(deps): update @types/node version across multiple dependencies in pnpm-lock.yaml
- Updated the version of @types/node from 25.5.0 to 22.19.2 for various dependencies to ensure compatibility and reduce potential conflicts.
- Added the @anthropic-ai/claude-agent-sdk dependency with version 0.2.76.
- Adjusted the version of esbuild in the vitest dependency to 0.27.2.
* fix(evals): address code review findings in eval assertions and caching
- correctFormatArgs: validate args.inline on all format.apply steps, not just bold
- nodeSearchOrBlocksList: enforce expected nodeType match for both query_match and blocks_list
- compare-baselines: match tests by identity (description+provider+prompt) instead of array index
- cacheKey: include prompt hash so prompt changes invalidate cached results
- Remove unused eval:tools script referencing deleted config
1 parent edcb3c6 commit 0719412
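Two of the review fixes above (a prompt hash in the cache key, and matching baseline tests by identity rather than array index) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function names `hashString`, `cacheKey`, `testIdentity`, and `compareBaselines`, the result fields (`description`, `provider`, `prompt`, `pass`), and the FNV-1a hash are hypothetical and not taken from the repository, which does not show its implementation on this page.

```javascript
// Hypothetical sketch; names and result shapes are NOT from the SuperDoc repo.

// FNV-1a (32-bit) over UTF-16 code units. Used only as a dependency-free
// stand-in for whatever hash the real cacheKey uses.
function hashString(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16).padStart(8, '0');
}

// Cache key that includes a hash of the prompt, so editing the prompt
// produces a different key and stale cached results are not reused.
function cacheKey({ provider, description, prompt }) {
  return [provider, description, hashString(prompt)].join('::');
}

// Identity of a test result: description + provider + prompt.
// JSON.stringify of an array avoids ambiguity from naive concatenation.
function testIdentity(result) {
  return JSON.stringify([result.description, result.provider, result.prompt]);
}

// Pair each current result with its baseline by identity, not by position,
// so reordering, adding, or removing tests does not misalign the comparison.
function compareBaselines(baseline, current) {
  const baselineById = new Map(baseline.map((r) => [testIdentity(r), r]));
  return current.map((r) => {
    const previous = baselineById.get(testIdentity(r));
    return {
      description: r.description,
      provider: r.provider,
      before: previous ? previous.pass : null, // null: no baseline entry
      after: r.pass,
    };
  });
}
```

With index-based matching, inserting one test at the top of a suite would shift every later comparison by one; keying on identity keeps each pair aligned and reports new tests with `before: null`.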
30 files changed
Lines changed: 8430 additions & 155 deletions
File tree
- evals
- fixtures
- lib
- presentation
- prompts
- providers
- tests
5 binary files not shown.