Commit e000efd

fix(evals): resolve SDK from local workspace with automatic prepare pipeline (#2420)
1 parent 39e1477 commit e000efd

6 files changed

Lines changed: 481 additions & 69 deletions

evals/README.md

Lines changed: 38 additions & 5 deletions
@@ -13,14 +13,19 @@ Run these commands from the repo root:
 
 ```bash
 pnpm install
-pnpm run generate:all # if packages/sdk/tools/*.json are missing
 cp evals/.env.example evals/.env
 pnpm --filter @superdoc-testing/evals run eval:openai # Level 1
-pnpm --prefix apps/cli run build # required for Level 2
-pnpm --filter @superdoc-testing/evals run eval:e2e # Level 2
+pnpm --filter @superdoc-testing/evals run eval:e2e # Level 2
 pnpm --filter @superdoc-testing/evals run view
 ```
 
+Tool artifacts are **automatically regenerated** before each eval run via pre-hooks:
+
+- **Level 1** (`eval`, `eval:openai`): regenerates SDK tool catalogs on the fast path, and automatically falls back to a one-time full bootstrap if prerequisites are missing
+- **Level 2** (`eval:e2e`): runs full `generate:all` (doc-api → CLI contract → SDK), then builds SDK + CLI
+
+You do not need to build or bootstrap manually unless you are running `eval:repeat` or `eval:analyze` (which bypass pre-hooks). On a fresh clone, the first Level 1 run may take longer because it auto-bootstraps the doc-api contract before switching back to the fast path.
+
 Edit `evals/.env` before running:
 
 - `OPENAI_API_KEY` for `eval` and `eval:openai`
@@ -48,6 +53,8 @@ Both levels use the same **9 grouped public tools** from the SDK:
 
 Level 1 loads the generated SDK provider bundle through a thin Promptfoo adapter that returns the bundle's `tools` array. Level 2 uses `sdk.chooseTools()`. The system prompt comes from `packages/sdk/tools/system-prompt.md`.
 
+Both levels resolve `@superdoc-dev/sdk` from the **local workspace** (`workspace:*`), not from npm. A prepare script (`scripts/prepare-local-sdk.mjs`) runs automatically before each eval to regenerate tool artifacts, build the SDK/CLI, and verify the tool surface. See [Local SDK resolution](#local-sdk-resolution) for details.
+
 ## Two levels of testing
 
 ### Level 1: Tool quality
@@ -95,6 +102,8 @@ evals/
   tests/
     tool-quality.yaml  28 tool-selection / argument-shape tests
     execution.yaml  21 real DOCX editing tests
+  scripts/
+    prepare-local-sdk.mjs  Pre-run pipeline: generate, build, verify
   providers/
     superdoc-agent-gateway.mjs  AI SDK + AI Gateway execution provider
     superdoc-agent.mjs  Legacy direct OpenAI execution provider
@@ -222,14 +231,38 @@ Add another entry to `promptfooconfig.e2e.yaml`:
 modelId: anthropic/claude-sonnet-4.6
 ```
 
+## Local SDK resolution
+
+Evals depend on `@superdoc-dev/sdk` via `workspace:*`, so pnpm always resolves to the local workspace package at `packages/sdk/langs/node/` — never to the published npm version.
+
+A prepare script (`scripts/prepare-local-sdk.mjs`) runs as a pre-hook before `eval`, `eval:openai`, and `eval:e2e`. It:
+
+1. **Regenerates** artifacts — full `generate:all` chain for Level 2, SDK-only catalog regeneration for Level 1
+   Level 1 automatically falls back to the full chain if the generated doc-api contract is missing.
+2. **Builds** the SDK and CLI (Level 2 only)
+3. **Verifies** all expected output files exist
+4. **Guards** that `@superdoc-dev/sdk` resolves from the workspace, not npm (Level 2 only)
+5. **Validates** the tool surface matches the expected 9 grouped public tools
+
+The provider cache (`results/.cache/`) includes an SDK fingerprint — a hash of the tool catalogs, system prompt, full SDK `dist/` tree, and CLI binary. Switching branches or changing local tool/runtime artifacts automatically invalidates stale cache entries.
+
+To skip preparation during rapid iteration (when you know your builds are current):
+
+```bash
+SKIP_PREPARE=1 pnpm run eval:e2e
+```
+
+To intentionally test the published npm SDK, change the dependency in `evals/package.json` from `workspace:*` to a version number.
+
 ## Notes
 
-- If `packages/sdk/tools/*.json` are missing, run `pnpm run generate:all` from the repo root first.
+- Level 2 runs `generate:all` automatically. Level 1 uses the fast SDK-only path once prerequisites exist, and auto-falls-back to full bootstrap on a fresh clone.
 - Level 1 currently uses native OpenAI Promptfoo providers. Level 2 uses a custom provider that routes through Vercel AI Gateway.
 - `pnpm run view` is the correct script name. There is no `eval:view` script in the current package.
 - `pnpm run analyze` reads `results/latest.json`, writes `results/analysis.html`, and requires `ANTHROPIC_API_KEY`.
 - Promptfoo caches model responses. Clear Promptfoo's cache with `npx promptfoo cache clear`.
-- The custom execution provider also caches results in `results/.cache/`. Disable it with `PROMPTFOO_CACHE_ENABLED=false`.
+- The custom execution provider also caches results in `results/.cache/`. The cache key includes an SDK fingerprint, so local tool/runtime changes automatically invalidate old entries. Disable provider caching entirely with `PROMPTFOO_CACHE_ENABLED=false`.
+- `eval:repeat` and `eval:analyze` bypass the pre-hook (they call `npx promptfoo` directly). Run `node scripts/prepare-local-sdk.mjs` manually before these if needed.
 
 ## Exit codes and troubleshooting
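
Reviewer note: the README changes above describe a five-step prepare pipeline whose steps depend on the level being run. `scripts/prepare-local-sdk.mjs` itself is not shown in this diff, so the following is only a rough sketch of the step selection as the README describes it. The function `planPrepare`, its option names, and the step labels are all hypothetical, and treating `SKIP_PREPARE` as enabled only when it equals `'1'` is an assumption.

```javascript
// Hypothetical sketch of the step selection described in the README.
// planPrepare and the step labels are illustrative names, not taken from
// scripts/prepare-local-sdk.mjs.
function planPrepare({ argv = [], env = {}, hasDocApiContract = true } = {}) {
  // Assumption: SKIP_PREPARE=1 short-circuits the whole pipeline.
  if (env.SKIP_PREPARE === '1') return [];

  // The Level 1 pre-hooks (preeval, preeval:openai) pass --light.
  const light = argv.includes('--light');
  const steps = [];

  if (light && hasDocApiContract) {
    steps.push('generate:sdk-catalogs'); // Level 1 fast path
  } else {
    steps.push('generate:all'); // Level 2, or Level 1 fallback on a fresh clone
  }

  if (!light) steps.push('build:sdk', 'build:cli'); // Level 2 only
  steps.push('verify:outputs');
  if (!light) steps.push('guard:workspace-resolution'); // Level 2 only
  steps.push('validate:tool-surface');
  return steps;
}

const level1 = planPrepare({ argv: ['--light'] });
const level1Fresh = planPrepare({ argv: ['--light'], hasDocApiContract: false });
const level2 = planPrepare();
const skipped = planPrepare({ env: { SKIP_PREPARE: '1' } });
console.log({ level1, level1Fresh, level2, skipped });
```

Keeping the plan as a pure function of flags and environment would make the Level 1 fallback easy to unit test without touching the filesystem, though the real script may be organized quite differently.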

evals/package.json

Lines changed: 5 additions & 2 deletions
@@ -4,8 +4,11 @@
   "private": true,
   "type": "module",
   "scripts": {
-    "test": "node lib/checks.test.mjs",
+    "test": "node lib/checks.test.mjs && node --test providers/utils.test.mjs",
     "view": "npx promptfoo view",
+    "preeval": "node scripts/prepare-local-sdk.mjs --light",
+    "preeval:openai": "node scripts/prepare-local-sdk.mjs --light",
+    "preeval:e2e": "node scripts/prepare-local-sdk.mjs",
     "eval": "npx promptfoo eval --env-file .env -o results/latest.json",
     "eval:e2e": "npx promptfoo eval --env-file .env -c promptfooconfig.e2e.yaml -o results/latest-e2e.json",
     "eval:openai": "npx promptfoo eval --env-file .env --filter-providers 'GPT-*' -o results/latest-openai.json",
@@ -17,7 +20,7 @@
   },
   "devDependencies": {
     "@anthropic-ai/claude-agent-sdk": "^0.2.76",
-    "@superdoc-dev/sdk": "1.0.0-alpha.38",
+    "@superdoc-dev/sdk": "workspace:*",
     "ai": "^6.0.116",
     "openai": "^6.25.0",
     "promptfoo": "^0.121.1"
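
Reviewer note: the new `preeval*` entries rely on the package manager running a `pre<name>` script before `<name>` (npm does this, and pnpm does when its `enable-pre-post-scripts` setting is on). The sketch below models that lookup to show which scripts pick up the prepare step. `expandWithPreHook` is a hypothetical helper, not part of pnpm, and the `eval:repeat` body is a placeholder because that script is not visible in this hunk.

```javascript
// Illustrative model of pre-script lookup; expandWithPreHook is a
// hypothetical helper and not a pnpm or npm API.
function expandWithPreHook(scripts, name) {
  if (!Object.hasOwn(scripts, name)) throw new Error(`unknown script: ${name}`);
  const pre = `pre${name}`;
  // A pre-hook runs only when a script named exactly `pre<name>` exists.
  return Object.hasOwn(scripts, pre) ? [pre, name] : [name];
}

const scripts = {
  preeval: 'node scripts/prepare-local-sdk.mjs --light',
  'preeval:openai': 'node scripts/prepare-local-sdk.mjs --light',
  'preeval:e2e': 'node scripts/prepare-local-sdk.mjs',
  eval: 'npx promptfoo eval --env-file .env -o results/latest.json',
  'eval:e2e': 'npx promptfoo eval --env-file .env -c promptfooconfig.e2e.yaml -o results/latest-e2e.json',
  // Placeholder: the real eval:repeat command is outside this diff hunk.
  'eval:repeat': 'echo "placeholder"',
};

const e2eOrder = expandWithPreHook(scripts, 'eval:e2e');
const repeatOrder = expandWithPreHook(scripts, 'eval:repeat');
console.log(e2eOrder, repeatOrder);
```

On this reading, `eval:repeat` would skip preparation simply because no `preeval:repeat` entry is defined, which is consistent with the README note about running the prepare script manually.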

evals/providers/utils.mjs

Lines changed: 98 additions & 4 deletions
@@ -3,8 +3,17 @@
  */
 
 import { createHash } from 'node:crypto';
-import { copyFileSync, existsSync, mkdirSync, readFileSync, rmSync, unlinkSync, writeFileSync } from 'node:fs';
-import { dirname, resolve } from 'node:path';
+import {
+  copyFileSync,
+  existsSync,
+  mkdirSync,
+  readdirSync,
+  readFileSync,
+  rmSync,
+  unlinkSync,
+  writeFileSync,
+} from 'node:fs';
+import { dirname, relative, resolve, sep } from 'node:path';
 import { fileURLToPath } from 'node:url';
 
 const __dirname = dirname(fileURLToPath(import.meta.url));
@@ -65,12 +74,97 @@ export function cleanArgs(args) {
   return rest;
 }
 
+// --- SDK fingerprint (for cache invalidation) ---
+
+const SDK_TOOLS_DIR = resolve(EVALS_ROOT, '..', 'packages/sdk/tools');
+const SDK_DIST_DIR = resolve(EVALS_ROOT, '..', 'packages/sdk/langs/node/dist');
+const SDK_FINGERPRINT_FILES = [
+  resolve(SDK_TOOLS_DIR, 'tools.vercel.json'),
+  resolve(SDK_TOOLS_DIR, 'tools.openai.json'),
+  PATHS.prompt,
+  PATHS.cliBin,
+];
+const SDK_FINGERPRINT_DIRECTORIES = [SDK_DIST_DIR];
+
+function normalizeFingerprintPath(path) {
+  return (path || '.').split(sep).join('/');
+}
+
+function updateHashWithFile(hash, filePath, rootPath = dirname(filePath)) {
+  const fingerprintPath = normalizeFingerprintPath(relative(rootPath, filePath));
+  hash.update(`file:${fingerprintPath}\n`);
+  hash.update(readFileSync(filePath));
+}
+
+function updateHashWithDirectory(hash, dirPath, rootPath = dirPath) {
+  const fingerprintPath = normalizeFingerprintPath(relative(rootPath, dirPath));
+  hash.update(`dir:${fingerprintPath}\n`);
+
+  let entries;
+  try {
+    entries = readdirSync(dirPath, { withFileTypes: true })
+      .sort((a, b) => a.name.localeCompare(b.name));
+  } catch {
+    hash.update(`missing-dir:${dirPath}\n`);
+    return;
+  }
+
+  for (const entry of entries) {
+    const entryPath = resolve(dirPath, entry.name);
+    const entryFingerprintPath = normalizeFingerprintPath(relative(rootPath, entryPath));
+
+    if (entry.isDirectory()) {
+      updateHashWithDirectory(hash, entryPath, rootPath);
+      continue;
+    }
+
+    if (entry.isFile()) {
+      updateHashWithFile(hash, entryPath, rootPath);
+      continue;
+    }
+
+    hash.update(`other:${entryFingerprintPath}\n`);
+  }
+}
+
+/**
+ * Compute the artifact fingerprint used to invalidate cached eval results when
+ * the local tool surface or runtime artifacts change.
+ *
+ * @param {{files?: string[], directories?: string[]}} [options]
+ * @returns {string}
+ */
+export function computeSdkFingerprint({
+  files = SDK_FINGERPRINT_FILES,
+  directories = SDK_FINGERPRINT_DIRECTORIES,
+} = {}) {
+  const hash = createHash('sha256');
+  for (const file of [...files].sort()) {
+    try {
+      updateHashWithFile(hash, file);
+    } catch {
+      hash.update(`missing:${file}`);
+    }
+  }
+
+  for (const directory of [...directories].sort()) {
+    updateHashWithDirectory(hash, directory);
+  }
+
+  return hash.digest('hex').slice(0, 12);
+}
+
+const SDK_FINGERPRINT = computeSdkFingerprint();
+
 // --- Cache ---
 
-/** Generate a cache key from model + fixture + task + prompt hash. */
+/** Generate a cache key from model + fixture + task + prompt hash + SDK fingerprint. */
 export function cacheKey(model, fixture, task, prompt) {
   const promptSig = prompt ? createHash('sha256').update(prompt).digest('hex').slice(0, 8) : '';
-  const hash = createHash('sha256').update(`${model}|${fixture}|${task}|${promptSig}`).digest('hex').slice(0, 16);
+  const hash = createHash('sha256')
+    .update(`${model}|${fixture}|${task}|${promptSig}|${SDK_FINGERPRINT}`)
+    .digest('hex')
+    .slice(0, 16);
   return hash;
 }
 
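
Reviewer note: the change above folds an artifact fingerprint into the cache key so that stale entries stop matching once local artifacts change. The following simplified sketch shows that effect in isolation. It hashes an in-memory map instead of reading from disk, and `fingerprintOf`, `cacheKeyWith`, and the model, fixture, and task values are made up for illustration. Only the key layout mirrors `cacheKey` in the diff.

```javascript
import { createHash } from 'node:crypto';

// Simplified stand-in for computeSdkFingerprint: hashes an in-memory map of
// relative path -> contents rather than walking the filesystem.
function fingerprintOf(files) {
  const hash = createHash('sha256');
  // Sorting paths keeps the digest independent of insertion order.
  for (const path of Object.keys(files).sort()) {
    hash.update(`file:${path}\n`);
    hash.update(files[path]);
  }
  return hash.digest('hex').slice(0, 12);
}

// Same key layout as cacheKey in the diff, with the fingerprint passed in
// explicitly so its effect on the key can be observed.
function cacheKeyWith(fingerprint, model, fixture, task, prompt) {
  const promptSig = prompt ? createHash('sha256').update(prompt).digest('hex').slice(0, 8) : '';
  return createHash('sha256')
    .update(`${model}|${fixture}|${task}|${promptSig}|${fingerprint}`)
    .digest('hex')
    .slice(0, 16);
}

const v1 = fingerprintOf({ 'index.js': 'entry\n', 'runtime/process.js': 'v1\n' });
const v1Reordered = fingerprintOf({ 'runtime/process.js': 'v1\n', 'index.js': 'entry\n' });
const v2 = fingerprintOf({ 'index.js': 'entry\n', 'runtime/process.js': 'v2\n' });

// Hypothetical request parameters, identical across both keys.
const args = ['example-model', 'sample.docx', 'insert a heading', 'system prompt'];
const keyV1 = cacheKeyWith(v1, ...args);
const keyV2 = cacheKeyWith(v2, ...args);
console.log({ v1, v2, keyV1, keyV2 });
```

With identical request parameters, only the fingerprint differs between the two keys, so a cached result recorded under the first artifact set would not be looked up under the second.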

evals/providers/utils.test.mjs

Lines changed: 75 additions & 0 deletions
@@ -0,0 +1,75 @@
+import assert from 'node:assert/strict';
+import { mkdirSync, mkdtempSync, rmSync, writeFileSync } from 'node:fs';
+import { tmpdir } from 'node:os';
+import { dirname, resolve } from 'node:path';
+import test from 'node:test';
+
+import { computeSdkFingerprint } from './utils.mjs';
+
+function withTempDir(run) {
+  const tempDir = mkdtempSync(resolve(tmpdir(), 'superdoc-evals-utils-'));
+  try {
+    run(tempDir);
+  } finally {
+    rmSync(tempDir, { recursive: true, force: true });
+  }
+}
+
+function writeFile(path, contents) {
+  mkdirSync(dirname(path), { recursive: true });
+  writeFileSync(path, contents);
+}
+
+test('computeSdkFingerprint changes when a nested SDK dist file changes', () => {
+  withTempDir((root) => {
+    const sdkDistDir = resolve(root, 'sdk/dist');
+    const promptFile = resolve(root, 'packages/sdk/tools/system-prompt.md');
+    const cliFile = resolve(root, 'apps/cli/dist/index.js');
+
+    writeFile(resolve(sdkDistDir, 'index.js'), "export { run } from './runtime/process.js';\n");
+    writeFile(resolve(sdkDistDir, 'runtime/process.js'), 'export const run = () => "v1";\n');
+    writeFile(promptFile, 'system prompt\n');
+    writeFile(cliFile, 'console.log("cli");\n');
+
+    const before = computeSdkFingerprint({
+      files: [promptFile, cliFile],
+      directories: [sdkDistDir],
+    });
+
+    writeFile(resolve(sdkDistDir, 'runtime/process.js'), 'export const run = () => "v2";\n');
+
+    const after = computeSdkFingerprint({
+      files: [promptFile, cliFile],
+      directories: [sdkDistDir],
+    });
+
+    assert.notEqual(before, after);
+  });
+});
+
+test('computeSdkFingerprint changes when a new SDK dist file is added', () => {
+  withTempDir((root) => {
+    const sdkDistDir = resolve(root, 'sdk/dist');
+    const promptFile = resolve(root, 'packages/sdk/tools/system-prompt.md');
+    const cliFile = resolve(root, 'apps/cli/dist/index.js');
+
+    writeFile(resolve(sdkDistDir, 'index.js'), "export { run } from './runtime/process.js';\n");
+    writeFile(resolve(sdkDistDir, 'runtime/process.js'), 'export const run = () => "ready";\n');
+    writeFile(promptFile, 'system prompt\n');
+    writeFile(cliFile, 'console.log("cli");\n');
+
+    const before = computeSdkFingerprint({
+      files: [promptFile, cliFile],
+      directories: [sdkDistDir],
+    });
+
+    writeFile(resolve(sdkDistDir, 'generated/client.js'), 'export const generated = true;\n');
+
+    const after = computeSdkFingerprint({
+      files: [promptFile, cliFile],
+      directories: [sdkDistDir],
+    });
+
+    assert.notEqual(before, after);
+  });
+});
