Commit 23f8175
test: add eval framework for installer agent testing (#36)
* feat: add eval framework for testing installer agent
Introduces a structured evaluation system to validate the WorkOS installer
agent against framework fixtures. Phase 1 includes:
- Core types and interfaces for grading results
- File and build graders with pattern matching
- Next.js-specific grader checking AuthKit integration
- Fixture manager for temp dir setup/cleanup
- Eval runner orchestrating fixture → agent → grade flow
- CLI entry point with --framework and --verbose flags
- Minimal Next.js 14 App Router fixture
The agent executor is stubbed to validate framework structure first.
Run with: pnpm eval
* feat: expand eval framework to full 15-scenario test matrix
Add CLI with filtering (--framework, --state, --json), matrix reporter,
and graders for all 5 frameworks. Create fixtures for fresh, existing,
and existing-auth0 states across Next.js, React SPA, React Router,
TanStack Start, and Vanilla JS.
* feat: add retry logic, debug tooling, and history tracking to evals
- Add history.ts for results persistence with compare functionality
- Extend CLI with --debug, --keep-on-fail, --retry, --no-retry flags
- Add history and compare subcommands (pnpm eval:history, eval:compare)
- Implement retry loop in runner for handling LLM non-determinism
- Add verbose failure output with expected/actual values
- Create README documentation for eval framework usage
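A retry loop of this kind can be sketched as follows. This is a minimal illustration, not the runner's actual code: `runWithRetry`, `AttemptRecord`, and the `maxAttempts` parameter are hypothetical names chosen for the example.

```typescript
// Hypothetical record of one attempt; the commit's real types are not shown here.
interface AttemptRecord {
  attempt: number;
  passed: boolean;
}

// Calls `attemptFn` until it reports a pass or `maxAttempts` is exhausted.
// Every attempt is kept so that failures can be inspected afterwards.
async function runWithRetry(
  attemptFn: (attempt: number) => Promise<boolean>,
  maxAttempts: number,
): Promise<{ passed: boolean; attempts: AttemptRecord[] }> {
  const attempts: AttemptRecord[] = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const passed = await attemptFn(attempt);
    attempts.push({ attempt, passed });
    if (passed) {
      return { passed: true, attempts };
    }
  }
  return { passed: false, attempts };
}
```

Keeping the full attempt list, rather than only the final outcome, is what makes a flaky pass distinguishable from a clean one when results are compared across runs.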
* feat(evals): wire AgentExecutor to use Claude Agent SDK
Replace stub implementation with real agent execution:
- Add env-loader for credentials from .env.local
- Configure SDK with direct auth mode (bypasses gateway)
- Capture tool calls and output from message stream
- Add ToolCall interface to types
* style: format eval files with prettier
* fix(evals): update nextjs grader to match actual SDK patterns
- Use glob + content matching for callback route (path is configurable)
- Remove process.env.WORKOS_ check (SDK abstracts env access)
- Add checkFileWithPattern helper for flexible file discovery
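The glob-plus-content approach might look roughly like the sketch below. It works on an in-memory map of paths to file contents and uses a small hand-written glob converter in place of a real glob library; `globToRegExp` and `findFileWithContent` are hypothetical names, and the actual `checkFileWithPattern` signature may differ.

```typescript
// Converts a small glob subset ("**/" and "*") into a RegExp.
// A stand-in for a real glob library, used only to keep the sketch self-contained.
function globToRegExp(glob: string): RegExp {
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  const pattern = escaped
    .replace(/\*\*\//g, "\u0000") // placeholder: "**/" means zero or more directories
    .replace(/\*/g, "[^/]*") // "*" matches within a single path segment
    .replace(/\u0000/g, "(?:.*/)?");
  return new RegExp(`^${pattern}$`);
}

// Returns the first path that matches the glob and whose content matches `content`.
// Pass `content` without the "g" flag so that `test` stays stateless.
function findFileWithContent(
  files: Record<string, string>,
  glob: string,
  content: RegExp,
): string | undefined {
  const pathPattern = globToRegExp(glob);
  return Object.keys(files).find(
    (path) => pathPattern.test(path) && content.test(files[path]),
  );
}
```

Because the callback route path is configurable, the check passes wherever a matching route file lives under `app/`, as long as its content matches.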
* fix(evals): let agent configure redirect URI per framework
* feat(evals): add --keep flag to preserve temp directory for manual testing
* fix(evals): update tanstack grader and skill to match SDK patterns
- Grader: support src/ directory (v1.132+) in addition to app/
- Grader: check for authkitMiddleware instead of createServerFn
- Grader: fix package name to @workos/authkit-tanstack-react-start
- Grader: remove AuthKitProvider requirement (optional for server-only)
- Grader: support both flat and nested route patterns for callback
- Skill: add directory detection guidance (src/ vs app/)
- Skill: fix handleAuth() → handleCallbackRoute()
- Skill: add SDK exports reference section
* fix(evals): update React grader to match SDK patterns
- Remove callback component check (SDK handles OAuth internally)
- Use glob pattern to find useAuth anywhere in src/**/*.tsx
- Support both Vite (main.tsx) and CRA (index.tsx) entry points
- Add comprehensive header documenting SDK patterns
* fix(evals): correct React Router grader package name and patterns
- Fix package name: @workos-inc/authkit-react-router (was @workos-inc/authkit)
- Use glob patterns instead of hardcoded file paths
- Check for authLoader in callback routes (flexible location)
- Check for authkitLoader in route files for auth state
- Remove unnecessary ProtectedRoute.tsx/auth.ts checks (SDK has ensureSignedIn)
- Support both app/ and src/ directory structures
* fix(evals): correct Vanilla JS grader for createClient pattern
- Remove callback.html/callback.js checks (SDK handles OAuth internally)
- Remove auth.js check for getAuthorizationUrl (old pattern)
- Check for createClient from @workos-inc/authkit-js or CDN WorkOS.createClient
- Check for auth methods (signIn, signOut, getUser, getAccessToken)
- Support both bundled (ESM import) and CDN (script tag) patterns
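As a rough illustration of the checks listed above, the sketch below detects either `createClient` pattern and reports which auth methods a source file calls. The function names and regular expressions are assumptions made for this example, not the grader's actual implementation.

```typescript
// Bundled pattern: a named import of createClient from the npm package.
const ESM_IMPORT =
  /import\s*\{[^}]*\bcreateClient\b[^}]*\}\s*from\s*["']@workos-inc\/authkit-js["']/;
// CDN pattern: a call on the global exposed by the script tag.
const CDN_CALL = /\bWorkOS\.createClient\s*\(/;

type ClientPattern = "esm" | "cdn" | "none";

// Classifies how (or whether) a source file obtains the client.
function detectCreateClient(source: string): ClientPattern {
  if (ESM_IMPORT.test(source)) return "esm";
  if (CDN_CALL.test(source)) return "cdn";
  return "none";
}

const AUTH_METHODS = ["signIn", "signOut", "getUser", "getAccessToken"];

// Lists the auth methods that appear as method calls, in AUTH_METHODS order.
function findAuthMethods(source: string): string[] {
  return AUTH_METHODS.filter((name) =>
    new RegExp("\\." + name + "\\s*\\(").test(source),
  );
}
```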
* feat(evals): add parallel execution and live Ink dashboard
Phase 1: Parallel execution infrastructure
- Add ParallelRunner with p-limit concurrency control
- Auto-detect concurrency based on CPU/memory
- Graceful shutdown with fixture cleanup on SIGINT/SIGTERM
Phase 2: Event-driven live dashboard
- Add EvalEventEmitter for scenario lifecycle events
- Create Ink/React dashboard with real-time status updates
- TTY detection: dashboard for interactive, logging for CI/pipes
- Add --no-dashboard flag to disable live UI
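The concurrency control in Phase 1 can be pictured with the sketch below. It uses a small hand-rolled worker pool rather than p-limit so that the example has no dependencies, and the heuristic in `pickConcurrency` (one core held back, an assumed 2 GiB budget per scenario) is purely illustrative; the commit does not state the real thresholds.

```typescript
// Assumed memory budget per running scenario (illustrative only).
const DEFAULT_MEM_PER_SCENARIO = 2 * 1024 * 1024 * 1024;

// Picks a concurrency level bounded by both CPU count and free memory.
function pickConcurrency(
  cpuCount: number,
  freeMemBytes: number,
  memPerScenarioBytes: number = DEFAULT_MEM_PER_SCENARIO,
): number {
  const byCpu = Math.max(1, cpuCount - 1);
  const byMem = Math.max(1, Math.floor(freeMemBytes / memPerScenarioBytes));
  return Math.min(byCpu, byMem);
}

// Runs tasks with at most `limit` in flight, preserving input order in the results.
async function runWithLimit<T>(
  tasks: Array<() => Promise<T>>,
  limit: number,
): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const index = next++; // claimed synchronously, so no two workers share an index
      results[index] = await tasks[index]();
    }
  }
  const workerCount = Math.max(1, Math.min(limit, tasks.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

A caller would pass something like `os.cpus().length` and `os.freemem()` into `pickConcurrency`, then hand the result to `runWithLimit` along with one task per scenario.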
* chore(deps): add p-limit for parallel execution
* feat(evals): add structured JSON log export and CLI commands
Add debugging export layer that writes detailed JSON logs during eval runs:
- LogWriter subscribes to eval events and writes incrementally
- Includes all retry attempts, tool calls, and agent output (truncated at 10KB)
- New CLI commands: `pnpm eval:logs` to list, `pnpm eval:show` to view
- Survives interrupts by writing after each scenario completes
* fix(evals): add scenario labels to verbose output
Prefix verbose console output with [framework/state] labels so parallel
execution logs are easier to follow.
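A minimal sketch of such prefixing, assuming a hypothetical `labelLines` helper that is applied to each chunk of verbose output:

```typescript
// Prefixes every line of `output` with a [framework/state] label so that
// interleaved logs from parallel scenarios can be told apart.
function labelLines(framework: string, state: string, output: string): string {
  const label = `[${framework}/${state}]`;
  return output
    .split("\n")
    .map((line) => `${label} ${line}`)
    .join("\n");
}
```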
* fix(evals): use latest versions for tanstack-start fixtures
Switch from pinned semver ranges to `latest` tag for tanstack packages.
This ensures evals test against what users actually install, surfacing
upstream breakage as signal rather than hiding it behind old versions.
Note: tanstack-start is currently broken upstream (incompatible internal
deps). The evals should pass again once upstream publishes a fix.
* fix(evals): rebuild tanstack-start fixtures for new Vite-based architecture
TanStack Start moved from vinxi to pure Vite. Old fixtures used the
deprecated vinxi-based structure which no longer works with latest.
Changes:
- Replace vinxi scripts with vite dev/build
- Replace @tanstack/start with @tanstack/react-start
- Move app/ to src/ directory structure
- Add vite.config.ts with tanstackStart plugin
- Update to React 19
- Add .gitignore for fixture artifacts
All three fixtures now build successfully with latest TanStack versions.
* refactor(evals): consolidate fixtures from 15 to 10 scenarios
Remove redundant `fresh` fixture variants because they added no
meaningful test coverage beyond the `existing` fixtures.
Renamed: existing → example, existing-auth0 → example-auth0
This reduces the test matrix from 5×3=15 to 5×2=10 scenarios while
maintaining coverage for both greenfield installs and Auth0 migrations.
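The 5×2 matrix can be expressed as a cross product of frameworks and states. The sketch below uses the fixture directory names as identifiers; `buildMatrix` and its filter shape are hypothetical and only mirror the `--framework` and `--state` flags in spirit.

```typescript
const FRAMEWORKS = ["nextjs", "react", "react-router", "tanstack-start", "vanilla-js"];
const STATES = ["example", "example-auth0"];

interface Scenario {
  framework: string;
  state: string;
}

// Builds every framework/state pair, optionally narrowed by a filter.
function buildMatrix(filter: { framework?: string; state?: string } = {}): Scenario[] {
  const scenarios: Scenario[] = [];
  for (const framework of FRAMEWORKS) {
    if (filter.framework && filter.framework !== framework) continue;
    for (const state of STATES) {
      if (filter.state && filter.state !== state) continue;
      scenarios.push({ framework, state });
    }
  }
  return scenarios;
}
```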
* fix(evals): include root.tsx in React Router authkitLoader check
The grader pattern only checked routes/**/*.tsx but in React Router v7
Framework mode, authkitLoader belongs in app/root.tsx (the root layout).
Updated pattern to match both root.tsx and routes/**/*.
* fix(evals): align result matrix table columns
1 parent 147792f commit 23f8175
124 files changed
Lines changed: 4280 additions & 31 deletions
File tree
- skills/workos-authkit-tanstack-start
- src/utils
- tests
- evals
- dashboard
- graders
- fixtures
- nextjs
- example-auth0
- app
- about
- api/auth/[auth0]
- dashboard
- example
- app
- about
- api/hello
- dashboard
- react-router
- example-auth0
- app
- routes
- example
- app
- routes
- react
- example-auth0
- src
- pages
- example
- src
- pages
- tanstack-start
- example-auth0
- src
- routes
- example
- src
- routes
- vanilla-js
- example-auth0
- example