fix(evals): resolve SDK from local workspace with automatic prepare pipeline (#2420)

harbournick · web-flow · commit e000efd551c6 · 2026-03-17T15:57:31.000-07:00
diff --git a/evals/README.md b/evals/README.md
@@ -13,14 +13,19 @@ Run these commands from the repo root:
 
 ```bash
 pnpm install
-pnpm run generate:all                                  # if packages/sdk/tools/*.json are missing
 cp evals/.env.example evals/.env
 pnpm --filter @superdoc-testing/evals run eval:openai  # Level 1
-pnpm --prefix apps/cli run build                       # required for Level 2
-pnpm --filter @superdoc-testing/evals run eval:e2e    # Level 2
+pnpm --filter @superdoc-testing/evals run eval:e2e     # Level 2
 pnpm --filter @superdoc-testing/evals run view
 ```
 
+Tool artifacts are **automatically regenerated** before each eval run via pre-hooks:
+
+- **Level 1** (`eval`, `eval:openai`): regenerates SDK tool catalogs on the fast path, and automatically falls back to a one-time full bootstrap if prerequisites are missing
+- **Level 2** (`eval:e2e`): runs full `generate:all` (doc-api → CLI contract → SDK), then builds SDK + CLI
+
+You do not need to build or bootstrap manually unless you are running `eval:repeat` or `eval:analyze` (which bypass pre-hooks). On a fresh clone, the first Level 1 run may take longer because it auto-bootstraps the doc-api contract before switching back to the fast path.
+
 Edit `evals/.env` before running:
 
 - `OPENAI_API_KEY` for `eval` and `eval:openai`
@@ -48,6 +53,8 @@ Both levels use the same **9 grouped public tools** from the SDK:
 
 Level 1 loads the generated SDK provider bundle through a thin Promptfoo adapter that returns the bundle's `tools` array. Level 2 uses `sdk.chooseTools()`. The system prompt comes from `packages/sdk/tools/system-prompt.md`.
 
+Both levels resolve `@superdoc-dev/sdk` from the **local workspace** (`workspace:*`), not from npm. A prepare script (`scripts/prepare-local-sdk.mjs`) runs automatically before each eval to regenerate tool artifacts, build the SDK/CLI, and verify the tool surface. See [Local SDK resolution](#local-sdk-resolution) for details.
+
 ## Two levels of testing
 
 ### Level 1: Tool quality
@@ -95,6 +102,8 @@ evals/
   tests/
     tool-quality.yaml               28 tool-selection / argument-shape tests
     execution.yaml                  21 real DOCX editing tests
+  scripts/
+    prepare-local-sdk.mjs           Pre-run pipeline: generate, build, verify
   providers/
     superdoc-agent-gateway.mjs      AI SDK + AI Gateway execution provider
     superdoc-agent.mjs              Legacy direct OpenAI execution provider
@@ -222,14 +231,38 @@ Add another entry to `promptfooconfig.e2e.yaml`:
     modelId: anthropic/claude-sonnet-4.6
 ```
 
+## Local SDK resolution
+
+Evals depend on `@superdoc-dev/sdk` via `workspace:*`, so pnpm always resolves to the local workspace package at `packages/sdk/langs/node/` — never to the published npm version.
+
+A prepare script (`scripts/prepare-local-sdk.mjs`) runs as a pre-hook before `eval`, `eval:openai`, and `eval:e2e`. It:
+
+1. **Regenerates** artifacts — full `generate:all` chain for Level 2, SDK-only catalog regeneration for Level 1
+   Level 1 automatically falls back to the full chain if the generated doc-api contract is missing.
+2. **Builds** the SDK and CLI (Level 2 only)
+3. **Verifies** all expected output files exist
+4. **Guards** that `@superdoc-dev/sdk` resolves from the workspace, not npm (Level 2 only)
+5. **Validates** the tool surface matches the expected 9 grouped public tools
+
+The provider cache (`results/.cache/`) includes an SDK fingerprint — a hash of the tool catalogs, system prompt, full SDK `dist/` tree, and CLI binary. Switching branches or changing local tool/runtime artifacts automatically invalidates stale cache entries.
+
+To skip preparation during rapid iteration (when you know your builds are current):
+
+```bash
+SKIP_PREPARE=1 pnpm run eval:e2e
+```
+
+To intentionally test the published npm SDK, change the dependency in `evals/package.json` from `workspace:*` to a version number.
+
 ## Notes
 
-- If `packages/sdk/tools/*.json` are missing, run `pnpm run generate:all` from the repo root first.
+- Level 2 runs `generate:all` automatically. Level 1 uses the fast SDK-only path once prerequisites exist, and falls back automatically to the full bootstrap on a fresh clone.
 - Level 1 currently uses native OpenAI Promptfoo providers. Level 2 uses a custom provider that routes through Vercel AI Gateway.
 - `pnpm run view` is the correct script name. There is no `eval:view` script in the current package.
 - `pnpm run analyze` reads `results/latest.json`, writes `results/analysis.html`, and requires `ANTHROPIC_API_KEY`.
 - Promptfoo caches model responses. Clear Promptfoo's cache with `npx promptfoo cache clear`.
-- The custom execution provider also caches results in `results/.cache/`. Disable it with `PROMPTFOO_CACHE_ENABLED=false`.
+- The custom execution provider also caches results in `results/.cache/`. The cache key includes an SDK fingerprint, so local tool/runtime changes automatically invalidate old entries. Disable provider caching entirely with `PROMPTFOO_CACHE_ENABLED=false`.
+- `eval:repeat` and `eval:analyze` bypass the pre-hook (they call `npx promptfoo` directly). Run `node scripts/prepare-local-sdk.mjs` from the `evals/` directory before these if needed.
 
 ## Exit codes and troubleshooting
 
diff --git a/evals/package.json b/evals/package.json
@@ -4,8 +4,11 @@
   "private": true,
   "type": "module",
   "scripts": {
-    "test": "node lib/checks.test.mjs",
+    "test": "node lib/checks.test.mjs && node --test providers/utils.test.mjs",
     "view": "npx promptfoo view",
+    "preeval": "node scripts/prepare-local-sdk.mjs --light",
+    "preeval:openai": "node scripts/prepare-local-sdk.mjs --light",
+    "preeval:e2e": "node scripts/prepare-local-sdk.mjs",
     "eval": "npx promptfoo eval --env-file .env -o results/latest.json",
     "eval:e2e": "npx promptfoo eval --env-file .env -c promptfooconfig.e2e.yaml -o results/latest-e2e.json",
     "eval:openai": "npx promptfoo eval --env-file .env --filter-providers 'GPT-*' -o results/latest-openai.json",
@@ -17,7 +20,7 @@
   },
   "devDependencies": {
     "@anthropic-ai/claude-agent-sdk": "^0.2.76",
-    "@superdoc-dev/sdk": "1.0.0-alpha.38",
+    "@superdoc-dev/sdk": "workspace:*",
     "ai": "^6.0.116",
     "openai": "^6.25.0",
     "promptfoo": "^0.121.1"
diff --git a/evals/providers/utils.mjs b/evals/providers/utils.mjs
@@ -3,8 +3,17 @@
  */
 
 import { createHash } from 'node:crypto';
-import { copyFileSync, existsSync, mkdirSync, readFileSync, rmSync, unlinkSync, writeFileSync } from 'node:fs';
-import { dirname, resolve } from 'node:path';
+import {
+  copyFileSync,
+  existsSync,
+  mkdirSync,
+  readdirSync,
+  readFileSync,
+  rmSync,
+  unlinkSync,
+  writeFileSync,
+} from 'node:fs';
+import { dirname, relative, resolve, sep } from 'node:path';
 import { fileURLToPath } from 'node:url';
 
 const __dirname = dirname(fileURLToPath(import.meta.url));
@@ -65,12 +74,97 @@ export function cleanArgs(args) {
   return rest;
 }
 
+// --- SDK fingerprint (for cache invalidation) ---
+
+const SDK_TOOLS_DIR = resolve(EVALS_ROOT, '..', 'packages/sdk/tools');
+const SDK_DIST_DIR = resolve(EVALS_ROOT, '..', 'packages/sdk/langs/node/dist');
+const SDK_FINGERPRINT_FILES = [
+  resolve(SDK_TOOLS_DIR, 'tools.vercel.json'),
+  resolve(SDK_TOOLS_DIR, 'tools.openai.json'),
+  PATHS.prompt,
+  PATHS.cliBin,
+];
+const SDK_FINGERPRINT_DIRECTORIES = [SDK_DIST_DIR];
+
+function normalizeFingerprintPath(path) {
+  return (path || '.').split(sep).join('/');
+}
+
+function updateHashWithFile(hash, filePath, rootPath = dirname(filePath)) {
+  const fingerprintPath = normalizeFingerprintPath(relative(rootPath, filePath));
+  hash.update(`file:${fingerprintPath}\n`);
+  hash.update(readFileSync(filePath));
+}
+
+function updateHashWithDirectory(hash, dirPath, rootPath = dirPath) {
+  const fingerprintPath = normalizeFingerprintPath(relative(rootPath, dirPath));
+  hash.update(`dir:${fingerprintPath}\n`);
+
+  let entries;
+  try {
+    entries = readdirSync(dirPath, { withFileTypes: true })
+      .sort((a, b) => a.name.localeCompare(b.name));
+  } catch {
+    hash.update(`missing-dir:${dirPath}\n`);
+    return;
+  }
+
+  for (const entry of entries) {
+    const entryPath = resolve(dirPath, entry.name);
+    const entryFingerprintPath = normalizeFingerprintPath(relative(rootPath, entryPath));
+
+    if (entry.isDirectory()) {
+      updateHashWithDirectory(hash, entryPath, rootPath);
+      continue;
+    }
+
+    if (entry.isFile()) {
+      updateHashWithFile(hash, entryPath, rootPath);
+      continue;
+    }
+
+    hash.update(`other:${entryFingerprintPath}\n`);
+  }
+}
+
+/**
+ * Compute the artifact fingerprint used to invalidate cached eval results when
+ * the local tool surface or runtime artifacts change.
+ *
+ * @param {{files?: string[], directories?: string[]}} [options]
+ * @returns {string}
+ */
+export function computeSdkFingerprint({
+  files = SDK_FINGERPRINT_FILES,
+  directories = SDK_FINGERPRINT_DIRECTORIES,
+} = {}) {
+  const hash = createHash('sha256');
+  for (const file of [...files].sort()) {
+    try {
+      updateHashWithFile(hash, file);
+    } catch {
+      hash.update(`missing:${file}\n`);
+    }
+  }
+
+  for (const directory of [...directories].sort()) {
+    updateHashWithDirectory(hash, directory);
+  }
+
+  return hash.digest('hex').slice(0, 12);
+}
+
+const SDK_FINGERPRINT = computeSdkFingerprint();
+
 // --- Cache ---
 
-/** Generate a cache key from model + fixture + task + prompt hash. */
+/** Generate a cache key from model + fixture + task + prompt hash + SDK fingerprint. */
 export function cacheKey(model, fixture, task, prompt) {
   const promptSig = prompt ? createHash('sha256').update(prompt).digest('hex').slice(0, 8) : '';
-  const hash = createHash('sha256').update(`${model}|${fixture}|${task}|${promptSig}`).digest('hex').slice(0, 16);
+  const hash = createHash('sha256')
+    .update(`${model}|${fixture}|${task}|${promptSig}|${SDK_FINGERPRINT}`)
+    .digest('hex')
+    .slice(0, 16);
   return hash;
 }
 
diff --git a/evals/providers/utils.test.mjs b/evals/providers/utils.test.mjs
@@ -0,0 +1,75 @@
+import assert from 'node:assert/strict';
+import { mkdirSync, mkdtempSync, rmSync, writeFileSync } from 'node:fs';
+import { tmpdir } from 'node:os';
+import { dirname, resolve } from 'node:path';
+import test from 'node:test';
+
+import { computeSdkFingerprint } from './utils.mjs';
+
+function withTempDir(run) {
+  const tempDir = mkdtempSync(resolve(tmpdir(), 'superdoc-evals-utils-'));
+  try {
+    run(tempDir);
+  } finally {
+    rmSync(tempDir, { recursive: true, force: true });
+  }
+}
+
+function writeFile(path, contents) {
+  mkdirSync(dirname(path), { recursive: true });
+  writeFileSync(path, contents);
+}
+
+test('computeSdkFingerprint changes when a nested SDK dist file changes', () => {
+  withTempDir((root) => {
+    const sdkDistDir = resolve(root, 'sdk/dist');
+    const promptFile = resolve(root, 'packages/sdk/tools/system-prompt.md');
+    const cliFile = resolve(root, 'apps/cli/dist/index.js');
+
+    writeFile(resolve(sdkDistDir, 'index.js'), "export { run } from './runtime/process.js';\n");
+    writeFile(resolve(sdkDistDir, 'runtime/process.js'), 'export const run = () => "v1";\n');
+    writeFile(promptFile, 'system prompt\n');
+    writeFile(cliFile, 'console.log("cli");\n');
+
+    const before = computeSdkFingerprint({
+      files: [promptFile, cliFile],
+      directories: [sdkDistDir],
+    });
+
+    writeFile(resolve(sdkDistDir, 'runtime/process.js'), 'export const run = () => "v2";\n');
+
+    const after = computeSdkFingerprint({
+      files: [promptFile, cliFile],
+      directories: [sdkDistDir],
+    });
+
+    assert.notEqual(before, after);
+  });
+});
+
+test('computeSdkFingerprint changes when a new SDK dist file is added', () => {
+  withTempDir((root) => {
+    const sdkDistDir = resolve(root, 'sdk/dist');
+    const promptFile = resolve(root, 'packages/sdk/tools/system-prompt.md');
+    const cliFile = resolve(root, 'apps/cli/dist/index.js');
+
+    writeFile(resolve(sdkDistDir, 'index.js'), "export { run } from './runtime/process.js';\n");
+    writeFile(resolve(sdkDistDir, 'runtime/process.js'), 'export const run = () => "ready";\n');
+    writeFile(promptFile, 'system prompt\n');
+    writeFile(cliFile, 'console.log("cli");\n');
+
+    const before = computeSdkFingerprint({
+      files: [promptFile, cliFile],
+      directories: [sdkDistDir],
+    });
+
+    writeFile(resolve(sdkDistDir, 'generated/client.js'), 'export const generated = true;\n');
+
+    const after = computeSdkFingerprint({
+      files: [promptFile, cliFile],
+      directories: [sdkDistDir],
+    });
+
+    assert.notEqual(before, after);
+  });
+});
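
The tests above check that the fingerprint changes when SDK artifacts change. As a minimal sketch of why that matters for caching, the snippet below mirrors the shape of `cacheKey()` from `utils.mjs` but takes the fingerprint as an explicit argument so the effect is easy to observe. The helper name `cacheKeyWithFingerprint` and the sample model, fixture, and fingerprint values are hypothetical, chosen only for illustration.

```javascript
import { createHash } from 'node:crypto';

// Hypothetical helper: same composition as cacheKey() in the diff, with the
// fingerprint passed in instead of read from a module-level constant.
function cacheKeyWithFingerprint(model, fixture, task, prompt, fingerprint) {
  // Short signature of the prompt text (empty when no prompt is given).
  const promptSig = prompt
    ? createHash('sha256').update(prompt).digest('hex').slice(0, 8)
    : '';
  return createHash('sha256')
    .update(`${model}|${fixture}|${task}|${promptSig}|${fingerprint}`)
    .digest('hex')
    .slice(0, 16);
}

// Identical inputs except for the fingerprint (sample values only).
const before = cacheKeyWithFingerprint('model-a', 'fixture.docx', 'insert-heading', 'prompt', 'aaaaaaaaaaaa');
const after = cacheKeyWithFingerprint('model-a', 'fixture.docx', 'insert-heading', 'prompt', 'bbbbbbbbbbbb');

// A changed fingerprint is expected to yield a different key, so entries
// written under the old key are simply never looked up again.
console.log(before !== after);
```

Because the fingerprint is folded into the key rather than stored next to the cached value, stale entries need no explicit eviction step. They remain on disk under `results/.cache/` until removed.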
diff --git a/evals/scripts/prepare-local-sdk.mjs b/evals/scripts/prepare-local-sdk.mjs
diff --git a/pnpm-lock.yaml b/pnpm-lock.yaml
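
The body of `prepare-local-sdk.mjs` is not shown in this diff. Going only by the README text in this commit (regenerate, build, verify, guard, validate, with `--light` for Level 1 and `SKIP_PREPARE=1` to skip), the step selection could be sketched roughly as follows. The function name `planPrepareSteps`, its option names, and the step labels are all hypothetical. The sketch also assumes that `SKIP_PREPARE=1` skips every step and that the Level 1 fallback does not trigger a build, neither of which the diff confirms.

```javascript
// Hypothetical sketch: planPrepareSteps and the step labels are illustrative
// names, not identifiers taken from prepare-local-sdk.mjs.
function planPrepareSteps({ light = false, contractExists = true, skip = false } = {}) {
  // Assumption: SKIP_PREPARE=1 skips the whole pipeline.
  if (skip) return [];

  const steps = [];

  // Level 2 always runs the full chain. Level 1 (--light) only does so when
  // the generated doc-api contract is missing, for example on a fresh clone.
  const needsFullChain = !light || !contractExists;
  steps.push(needsFullChain ? 'generate:all' : 'generate:sdk-catalogs');

  if (!light) steps.push('build:sdk-and-cli'); // Level 2 only
  steps.push('verify:output-files');
  if (!light) steps.push('guard:workspace-resolution'); // Level 2 only
  steps.push('validate:tool-surface');

  return steps;
}

// Derive the mode the same way the package.json pre-hooks pass it.
const light = process.argv.includes('--light');
const skip = process.env.SKIP_PREPARE === '1';
console.log(planPrepareSteps({ light, skip }));
```

Keeping the plan as a pure function of its flags makes the Level 1 and Level 2 differences easy to test without running any generators or builds.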