Commit 23f8175

test: add eval framework for installer agent testing (#36)
* feat: add eval framework for testing installer agent

  Introduces a structured evaluation system to validate the WorkOS installer agent against framework fixtures. Phase 1 includes:
  - Core types and interfaces for grading results
  - File and build graders with pattern matching
  - Next.js-specific grader checking AuthKit integration
  - Fixture manager for temp dir setup/cleanup
  - Eval runner orchestrating fixture → agent → grade flow
  - CLI entry point with --framework and --verbose flags
  - Minimal Next.js 14 App Router fixture

  The agent executor is stubbed to validate framework structure first. Run with: pnpm eval

* feat: expand eval framework to full 15-scenario test matrix

  Add CLI with filtering (--framework, --state, --json), matrix reporter, and graders for all 5 frameworks. Create fixtures for fresh, existing, and existing-auth0 states across Next.js, React SPA, React Router, TanStack Start, and Vanilla JS.

* feat: add retry logic, debug tooling, and history tracking to evals

  - Add history.ts for results persistence with compare functionality
  - Extend CLI with --debug, --keep-on-fail, --retry, --no-retry flags
  - Add history and compare subcommands (pnpm eval:history, eval:compare)
  - Implement retry loop in runner for handling LLM non-determinism
  - Add verbose failure output with expected/actual values
  - Create README documentation for eval framework usage

* feat(evals): wire AgentExecutor to use Claude Agent SDK

  Replace stub implementation with real agent execution:
  - Add env-loader for credentials from .env.local
  - Configure SDK with direct auth mode (bypasses gateway)
  - Capture tool calls and output from message stream
  - Add ToolCall interface to types

* style: format eval files with prettier

* fix(evals): update nextjs grader to match actual SDK patterns

  - Use glob + content matching for callback route (path is configurable)
  - Remove process.env.WORKOS_ check (SDK abstracts env access)
  - Add checkFileWithPattern helper for flexible file discovery

* fix(evals): let agent configure redirect URI per framework

* feat(evals): add --keep flag to preserve temp directory for manual testing

* fix(evals): update tanstack grader and skill to match SDK patterns

  - Grader: support src/ directory (v1.132+) in addition to app/
  - Grader: check for authkitMiddleware instead of createServerFn
  - Grader: fix package name to @workos/authkit-tanstack-react-start
  - Grader: remove AuthKitProvider requirement (optional for server-only)
  - Grader: support both flat and nested route patterns for callback
  - Skill: add directory detection guidance (src/ vs app/)
  - Skill: fix handleAuth() → handleCallbackRoute()
  - Skill: add SDK exports reference section

* fix(evals): update React grader to match SDK patterns

  - Remove callback component check (SDK handles OAuth internally)
  - Use glob pattern to find useAuth anywhere in src/**/*.tsx
  - Support both Vite (main.tsx) and CRA (index.tsx) entry points
  - Add comprehensive header documenting SDK patterns

* fix(evals): correct React Router grader package name and patterns

  - Fix package name: @workos-inc/authkit-react-router (was @workos-inc/authkit)
  - Use glob patterns instead of hardcoded file paths
  - Check for authLoader in callback routes (flexible location)
  - Check for authkitLoader in route files for auth state
  - Remove unnecessary ProtectedRoute.tsx/auth.ts checks (SDK has ensureSignedIn)
  - Support both app/ and src/ directory structures

* fix(evals): correct Vanilla JS grader for createClient pattern

  - Remove callback.html/callback.js checks (SDK handles OAuth internally)
  - Remove auth.js with getAuthorizationUrl (old pattern)
  - Check for createClient from @workos-inc/authkit-js or CDN WorkOS.createClient
  - Check for auth methods (signIn, signOut, getUser, getAccessToken)
  - Support both bundled (ESM import) and CDN (script tag) patterns

* feat(evals): add parallel execution and live Ink dashboard

  Phase 1: Parallel execution infrastructure
  - Add ParallelRunner with p-limit concurrency control
  - Auto-detect concurrency based on CPU/memory
  - Graceful shutdown with fixture cleanup on SIGINT/SIGTERM

  Phase 2: Event-driven live dashboard
  - Add EvalEventEmitter for scenario lifecycle events
  - Create Ink/React dashboard with real-time status updates
  - TTY detection: dashboard for interactive, logging for CI/pipes
  - Add --no-dashboard flag to disable live UI

* chore(deps): add p-limit for parallel execution

* feat(evals): add structured JSON log export and CLI commands

  Add debugging export layer that writes detailed JSON logs during eval runs:
  - LogWriter subscribes to eval events and writes incrementally
  - Includes all retry attempts, tool calls, and agent output (truncated at 10KB)
  - New CLI commands: `pnpm eval:logs` to list, `pnpm eval:show` to view
  - Survives interrupts by writing after each scenario completes

* fix(evals): add scenario labels to verbose output

  Prefix verbose console output with [framework/state] labels so parallel execution logs are easier to follow.

* fix(evals): use latest versions for tanstack-start fixtures

  Switch from pinned semver ranges to `latest` tag for tanstack packages. This ensures evals test against what users actually install, surfacing upstream breakage as signal rather than hiding it behind old versions.

  Note: tanstack-start is currently broken upstream (incompatible internal deps). Tests will auto-heal when they publish a fix.

* fix(evals): rebuild tanstack-start fixtures for new Vite-based architecture

  TanStack Start moved from vinxi to pure Vite. Old fixtures used the deprecated vinxi-based structure which no longer works with latest.

  Changes:
  - Replace vinxi scripts with vite dev/build
  - Replace @tanstack/start with @tanstack/react-start
  - Move app/ to src/ directory structure
  - Add vite.config.ts with tanstackStart plugin
  - Update to React 19
  - Add .gitignore for fixture artifacts

  All three fixtures now build successfully with latest TanStack versions.

* refactor(evals): consolidate fixtures from 15 to 10 scenarios

  Remove redundant `fresh` fixture variants since they provided no meaningful test coverage difference from `existing` fixtures.

  Renamed: existing → example, existing-auth0 → example-auth0

  This reduces the test matrix from 5×3=15 to 5×2=10 scenarios while maintaining coverage for both greenfield installs and Auth0 migrations.

* fix(evals): include root.tsx in React Router authkitLoader check

  The grader pattern only checked routes/**/*.tsx but in React Router v7 Framework mode, authkitLoader belongs in app/root.tsx (the root layout). Updated pattern to match both root.tsx and routes/**/*.

* fix(evals): align result matrix table columns
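The commit message mentions a retry loop in the runner for handling LLM non-determinism, and a log that records all retry attempts. The runner itself is not shown on this page, so the following is only a minimal sketch of that idea. `AttemptResult`, `RetryOutcome`, and `runWithRetry` are hypothetical names chosen for illustration, not the framework's actual API.

```typescript
// Hypothetical types: the real runner's interfaces are not visible in this diff.
interface AttemptResult {
  passed: boolean;
  detail: string;
}

interface RetryOutcome {
  passed: boolean;
  attempts: AttemptResult[];
}

// Run `attempt` up to `maxAttempts` times and stop at the first pass.
// Every attempt is kept so that a log writer could record the full history.
async function runWithRetry(
  attempt: (attemptNumber: number) => Promise<AttemptResult>,
  maxAttempts: number,
): Promise<RetryOutcome> {
  const attempts: AttemptResult[] = [];
  for (let n = 1; n <= maxAttempts; n++) {
    let result: AttemptResult;
    try {
      result = await attempt(n);
    } catch (err) {
      // A thrown error counts as a failed attempt rather than aborting the run.
      result = {
        passed: false,
        detail: err instanceof Error ? err.message : String(err),
      };
    }
    attempts.push(result);
    if (result.passed) {
      return { passed: true, attempts };
    }
  }
  return { passed: false, attempts };
}
```

Treating a thrown error as one failed attempt is a design choice of this sketch; a real runner might instead distinguish infrastructure errors from grading failures.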
1 parent 147792f commit 23f8175

124 files changed

Lines changed: 4280 additions & 31 deletions

.env.local.example

Lines changed: 6 additions & 2 deletions
@@ -1,5 +1,9 @@
 # Local development configuration
 # Copy this to .env.local and fill in values
 
-# WorkOS Client ID for local development
-WORKOS_CLIENT_ID=client_xxx
+# Required for running evals
+ANTHROPIC_API_KEY=sk-ant-...
+
+# WorkOS credentials (optional for evals - placeholders used if missing)
+WORKOS_API_KEY=sk_test_...
+WORKOS_CLIENT_ID=client_...
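
The comments in this file say that ANTHROPIC_API_KEY is required for evals while the WorkOS credentials fall back to placeholders. The env-loader's code is not part of the rendered diff, so the sketch below only illustrates that rule. `EvalCredentials`, `resolveEvalCredentials`, and the placeholder strings are assumptions made for this example.

```typescript
// Hypothetical shape; the real env-loader's types are not shown in this diff.
interface EvalCredentials {
  anthropicApiKey: string;
  workosApiKey: string;
  workosClientId: string;
  usedPlaceholders: string[];
}

// Placeholder values are assumptions chosen for illustration only.
const PLACEHOLDERS = {
  WORKOS_API_KEY: "sk_test_placeholder",
  WORKOS_CLIENT_ID: "client_placeholder",
} as const;

// Resolve credentials from an env-like map (for example process.env after
// .env.local has been loaded). Required keys throw; optional keys fall back.
function resolveEvalCredentials(
  env: Record<string, string | undefined>,
): EvalCredentials {
  const anthropicApiKey = env["ANTHROPIC_API_KEY"]?.trim();
  if (!anthropicApiKey) {
    throw new Error("ANTHROPIC_API_KEY is required for running evals");
  }

  const usedPlaceholders: string[] = [];
  const pick = (name: keyof typeof PLACEHOLDERS): string => {
    const value = env[name]?.trim();
    if (value) return value;
    usedPlaceholders.push(name);
    return PLACEHOLDERS[name];
  };

  return {
    anthropicApiKey,
    workosApiKey: pick("WORKOS_API_KEY"),
    workosClientId: pick("WORKOS_CLIENT_ID"),
    usedPlaceholders,
  };
}
```

Returning the list of substituted keys lets a caller warn that a run used placeholders instead of real WorkOS credentials.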

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -28,3 +28,6 @@ src/version.ts
 .idea
 *.sublime-*
 dist/
+
+# Eval results
+tests/eval-results/

package.json

Lines changed: 8 additions & 2 deletions
@@ -63,6 +63,7 @@
     "@vitest/coverage-v8": "^4.0.18",
     "@vitest/ui": "^4.0.18",
     "dotenv": "^17.2.3",
+    "p-limit": "^7.2.0",
     "prettier": "^3.8.0",
     "tsx": "^4.20.3",
     "typescript": "^5.9.3",
@@ -71,7 +72,7 @@
   "engines": {
     "node": ">=20.20"
   },
-  "packageManager": "pnpm@10.23.0+sha512.21c4e5698002ade97e4efe8b8b4a89a8de3c85a37919f957e7a0f30f38fbc5bbdd05980ffe29179b2fb6e6e691242e098d945d1601772cad0fef5fb6411e2a4b",
+  "packageManager": "pnpm@10.28.2",
   "scripts": {
     "clean": "rm -rf ./dist",
     "prebuild": "pnpm clean",
@@ -85,7 +86,12 @@
     "test": "vitest run",
     "test:watch": "vitest",
     "test:coverage": "vitest run --coverage",
-    "typecheck": "tsc --noEmit"
+    "typecheck": "tsc --noEmit",
+    "eval": "tsx tests/evals/index.ts",
+    "eval:history": "tsx tests/evals/index.ts history",
+    "eval:compare": "tsx tests/evals/index.ts compare",
+    "eval:logs": "tsx tests/evals/index.ts logs",
+    "eval:show": "tsx tests/evals/index.ts show"
   },
   "author": "WorkOS",
   "license": "MIT"
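
All five new scripts invoke the same entry point, tests/evals/index.ts, and differ only in an optional first argument (history, compare, logs, show). The entry point's source is not rendered here, so this is only a sketch of how such a dispatch could look. `parseCommand`, `ParsedCommand`, and the default name "run" are hypothetical.

```typescript
// "run" is an assumed name for the default mode used by plain `pnpm eval`.
type Subcommand = "run" | "history" | "compare" | "logs" | "show";

// Subcommand names taken from the package.json scripts in this diff.
const SUBCOMMANDS: readonly string[] = ["history", "compare", "logs", "show"];

interface ParsedCommand {
  subcommand: Subcommand;
  rest: string[]; // remaining args, e.g. flags such as --framework
}

// Treat the first argument as a subcommand when it is a known name;
// otherwise everything is passed through to the default eval run.
function parseCommand(args: string[]): ParsedCommand {
  const [first, ...rest] = args;
  if (first !== undefined && SUBCOMMANDS.includes(first)) {
    return { subcommand: first as Subcommand, rest };
  }
  return { subcommand: "run", rest: [...args] };
}
```

With this shape, `pnpm eval:history` would correspond to `parseCommand(["history"])`, and `pnpm eval --framework nextjs` would fall through to the default run with its flags intact.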

pnpm-lock.yaml

Lines changed: 40 additions & 0 deletions
Generated file; diff not rendered by default.