feat(wizard-ci): real-TUI e2e + snapshot review #2012
Draft
gewenyu99 wants to merge 18 commits into
Adds two modes to the existing wizard-ci, as an alternative to classic --ci (LoggingUI: agent-only, stdout-grep). --e2e drives the whole interactive flow headlessly through the wizard-ci-tools control plane and asserts on structured state; --replay plays a recorded run back in the terminal.

Core files:
- services/wizard-ci/e2e.ts: runE2e() handles /tmp app-copy isolation, env hygiene (strips host CLAUDE*/ANTHROPIC* so the spawned agent auths with the phx key instead of deferring to the host), scoped --project-id, and the happy-path policy (skip mcp+slack, delete skills, continue past health issues). It spawns the wizard repo's headless harness, then asserts the structured result (runPhase=completed, posthog dep/.env, reached keep-skills, skillsComplete). replayRecording() shells out to the wizard repo's terminal replayer.
- services/wizard-ci/index.ts: wires --e2e (positional app, --project-id, --keep-skills) and --replay (--step/--delay) into the CLI + --help.

The engine lives in the wizard repo (store + driver must run in-process); point WIZARD_PATH at it. See PostHog/wizard PR for src/lib/ci-driver + harness.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
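To make the env hygiene and the structured-result assertion concrete, here is a minimal sketch. It is not the code in e2e.ts: the `E2eResult` field names other than `runPhase` and `skillsComplete`, and both function names, are assumptions made for illustration.

```typescript
// Hypothetical shape of the structured result. Only runPhase and
// skillsComplete are named in the commit message; the rest are assumed.
interface E2eResult {
  runPhase: string;
  skillsComplete: boolean;
  hasPosthogDependency: boolean; // assumed field name
  hasEnvFile: boolean; // assumed field name
  reachedScreens: string[]; // assumed field name
}

// Drop host CLAUDE*/ANTHROPIC* variables so a spawned agent cannot
// silently fall back to the host's credentials.
function stripHostAgentEnv(
  env: Record<string, string | undefined>
): Record<string, string> {
  const clean: Record<string, string> = {};
  for (const key of Object.keys(env)) {
    const value = env[key];
    if (value === undefined) continue;
    if (/^(CLAUDE|ANTHROPIC)/.test(key)) continue;
    clean[key] = value;
  }
  return clean;
}

// Collect every failed expectation instead of stopping at the first one,
// so a single run reports everything that went wrong.
function checkE2eResult(result: E2eResult): string[] {
  const failures: string[] = [];
  if (result.runPhase !== "completed") {
    failures.push("runPhase=" + result.runPhase);
  }
  if (!result.hasPosthogDependency) failures.push("posthog dependency missing");
  if (!result.hasEnvFile) failures.push(".env missing");
  if (result.reachedScreens.indexOf("keep-skills") === -1) {
    failures.push("never reached keep-skills");
  }
  if (!result.skillsComplete) failures.push("skillsComplete is false");
  return failures;
}
```

A caller would pass `stripHostAgentEnv(process.env)` as the child process environment and treat a non-empty array from `checkE2eResult` as a failed run.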
…nitions
Run each CI-e2e test definition (for now: integration on express-todo) as a real --e2e agent run, render every key-moment frame of the recording to a real-Ink ANSI snapshot, and diff against a committed baseline. Surfaces run-to-run differences (e.g. the agent enqueuing tasks differently) side-by-side for a human to review: same screens every run, deltas flagged. No mocks: real agent, real recording, real render.

- services/wizard-ci/snapshots.ts: the flow (run → render → diff → report)
- services/wizard-ci/ansi-html.ts: dependency-free ANSI→HTML for the side-by-side
- services/wizard-ci/snapshots/express-todo/: committed baseline (47 frames)
- pnpm wizard-ci-snapshots (+ mprocs entry); --update to accept a new baseline

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The snapshots.ts header now lists what the flow needs in .env (POSTHOG_PERSONAL_API_KEY, POSTHOG_WIZARD_PROJECT_ID, POSTHOG_REGION) and notes that WIZARD_PATH must point at a checkout containing e2e-harness/.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A real agent emits frames a little differently run to run (different number of status updates → shifted indices), so drift is expected. Print the per-frame diffs + report.html and exit 0; only a genuine failure (run died, no recording) exits non-zero. Accept a new baseline with --update.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
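The "drift is reported, only a genuine failure exits non-zero" policy can be sketched as follows. This is an illustration under assumptions, not the code in snapshots.ts: the `Frames` type, `diffFrames`, and `exitCodeFor` are hypothetical names, and frames are assumed to be keyed by name with their captured text as the value.

```typescript
// Hypothetical representation: frame name -> captured screen text.
type Frames = Record<string, string>;

interface Drift {
  changed: string[];
  added: string[];
  removed: string[];
}

function has(frames: Frames, name: string): boolean {
  return Object.prototype.hasOwnProperty.call(frames, name);
}

// Classify every frame as changed, added, or removed relative to the baseline.
function diffFrames(baseline: Frames, current: Frames): Drift {
  const drift: Drift = { changed: [], added: [], removed: [] };
  for (const name of Object.keys(current)) {
    if (!has(baseline, name)) drift.added.push(name);
    else if (baseline[name] !== current[name]) drift.changed.push(name);
  }
  for (const name of Object.keys(baseline)) {
    if (!has(current, name)) drift.removed.push(name);
  }
  return drift;
}

// Drift never changes the exit code; only a missing or empty recording does.
function exitCodeFor(current: Frames | null): number {
  const hasRecording = current !== null && Object.keys(current).length > 0;
  return hasRecording ? 0 : 1;
}
```

Keeping the diff and the exit decision in separate functions means the report can list any amount of drift while the process status reflects only whether a recording was produced.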
After the diff, prompt "Replay <name> snapshots in the terminal? [y/N]" and, on yes, launch the replay stepper directly on the run's recording, with no copy/paste. TTY-only (auto-declines in CI so nothing hangs); the replayer inherits stdio for its own Enter-to-step loop.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
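A TTY-only yes/no prompt that auto-declines when there is no terminal might look like the sketch below. `parseYesNo` and `confirm` are hypothetical names, and the check on `process.stdin.isTTY` / `process.stdout.isTTY` is one assumed way to detect an interactive session; the PR's actual detection is not shown here.

```typescript
import * as readline from "node:readline";

// Default is "no": only an explicit y / yes counts as consent.
function parseYesNo(answer: string): boolean {
  return /^y(es)?$/i.test(answer.trim());
}

// Resolve to false immediately when not attached to a terminal, so a CI
// run never blocks waiting for input.
function confirm(
  question: string,
  isTty: boolean = Boolean(process.stdin.isTTY && process.stdout.isTTY)
): Promise<boolean> {
  if (!isTty) return Promise.resolve(false);
  const rl = readline.createInterface({
    input: process.stdin,
    output: process.stdout,
  });
  return new Promise<boolean>((resolve) => {
    rl.question(question + " [y/N] ", (answer) => {
      rl.close();
      resolve(parseYesNo(answer));
    });
  });
}
```

Closing the readline interface before resolving matters when the follow-up process inherits stdio: it releases stdin so that process can run its own input loop.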
Document handing the Wizard to an agent to run/drive/explore it headlessly, pointing at the runbook (wizard repo e2e-harness/EXPLORING-AS-AN-AGENT.md) with a copy-paste example prompt that targets wasp-lang/open-saas; the agent works out how to build + run the target itself.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… comments
The agentic-exploration section belongs in the wizard repo's README, not here. Also trim snapshots.ts / index.ts comments to concise current-behavior.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…wright)
services/wizard-ci/screenshot.ts rasterizes the side-by-side report (one PNG per key-moment frame, baseline │ current, plus a full-flow strip) for attaching to a review PR. Reuses the report HTML (ansi-html), so no new ANSI logic. Adds playwright as a dev dep.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…shots
snapshot-review.ts runs the e2e, renders the report to side-by-side PNGs, and opens a review PR whose body embeds them (raw URLs), changed frames first, instead of running the agent evaluator. --dry-run writes the bundle locally. wizard-snapshots.yml dispatches it (bot token, setup-wizard-deps, Playwright).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
`wizard-ci --e2e` and `wizard-ci-snapshots` run the wizard repo's tui-snapshots: the real wizard TUI, driven by store state manipulation, captured per screen as text. --e2e asserts on the result JSON it emits; snapshots diff the captured screens against a committed baseline; snapshot-review rasterizes them to a side-by-side image PR. Drops the recording/replay plumbing (the --replay flag, the render step, ansi-html), since the captured screens are already clean text. WIZARD_PATH defaults to a sibling wizard checkout.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…as a comment
Comment `/wizard-ci [app] [wizard_ref]` on a PR to run the real-TUI e2e. The workflow acks with 👀, checks out the PR, runs snapshot-review, and posts a comment on the PR with the flow strip and a link to the full side-by-side review (--comment-pr). Restricted to repo members/owner/collaborators. Manual workflow_dispatch still works; no auto-run.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
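Parsing the `/wizard-ci [app] [wizard_ref]` comment and gating it on the commenter's relationship to the repo could be sketched like this. The function name, the defaults (`express-todo`, `main`), and the choice to gate on GitHub's `author_association` values are assumptions for illustration; the workflow's actual parsing step is not shown in this conversation.

```typescript
interface WizardCiRequest {
  app: string;
  wizardRef: string;
}

// Assumed defaults; the workflow may choose different ones.
const DEFAULT_APP = "express-todo";
const DEFAULT_WIZARD_REF = "main";

// author_association values that GitHub reports for trusted commenters.
const ALLOWED_ASSOCIATIONS = ["OWNER", "MEMBER", "COLLABORATOR"];

// Returns null when the comment is not a /wizard-ci command or the
// commenter is not allowed to trigger a run.
function parseWizardCiComment(
  body: string,
  authorAssociation: string
): WizardCiRequest | null {
  if (ALLOWED_ASSOCIATIONS.indexOf(authorAssociation) === -1) return null;
  // Command, then up to two optional whitespace-separated arguments.
  const match = /^\/wizard-ci(?:\s+(\S+))?(?:\s+(\S+))?\s*$/.exec(body.trim());
  if (!match) return null;
  return {
    app: match[1] || DEFAULT_APP,
    wizardRef: match[2] || DEFAULT_WIZARD_REF,
  };
}
```

Anchoring the pattern at both ends rejects look-alikes such as `/wizard-cix` and comments with extra arguments, so only an exact command triggers a run.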
Comment on lines +116 to +119
    - name: Install dependencies
      run: pnpm install --frozen-lockfile

    - name: Install Chromium for Playwright
Comment on lines +119 to +122
    - name: Install Chromium for Playwright
      run: pnpm exec playwright install --with-deps chromium

    - name: Setup wizard dependencies
Comment on lines +122 to +132
    - name: Setup wizard dependencies
      # Exports WIZARD_PATH / CONTEXT_MILL_PATH / MCP_PATH.
      uses: ./.github/actions/setup-wizard-deps
      with:
        wizard_ref: ${{ steps.req.outputs.wizard_ref }}
        context_mill_ref: ${{ inputs.context_mill_ref || 'main' }}
        posthog_ref: ${{ inputs.posthog_ref || 'master' }}
        app_token: ${{ steps.app-token.outputs.token }}
        save_cache: 'false'

    - name: Render snapshots + report (review PR, and a comment when triggered by /wizard-ci)
Comment on lines +132 to +145
    - name: Render snapshots + report (review PR, and a comment when triggered by /wizard-ci)
      env:
        GH_TOKEN: ${{ steps.app-token.outputs.token }}
        POSTHOG_PERSONAL_API_KEY: ${{ secrets.GH_APP_POSTHOG_WIZARD_CI_BOT_POSTHOG_PERSONAL_KEY }}
        POSTHOG_WIZARD_PROJECT_ID: ${{ inputs.project_id || github.event.client_payload.project_id || vars.WIZARD_SNAPSHOTS_PROJECT_ID }}
        POSTHOG_REGION: ${{ inputs.posthog_region || 'us' }}
        APP: ${{ steps.req.outputs.app }}
        COMMENT_PR: ${{ steps.req.outputs.comment_pr }}
      run: |
        if [ -n "$COMMENT_PR" ]; then
          pnpm wizard-ci-snapshot-review "$APP" --comment-pr "$COMMENT_PR"
        else
          pnpm wizard-ci-snapshot-review "$APP"
        fi
…rue)
A standalone snapshots job, independent of the evaluator (the eval still runs as normal). Dispatch "Wizard CI" with snapshots=true to also open a real-TUI review PR for the app, using the same app token + setup-wizard-deps + PostHog key as the evaluator, with the project hard-coded to 2 (the bot key's project). Because wizard-ci.yml is on main, this is dispatchable from a PR branch (pre-merge). The /wizard-ci comment trigger stays in wizard-snapshots.yml.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…snapshot)
The wizard-ci job's Execute step runs the headless eval by default or, with the snapshots=true input, the real-TUI snapshot review for the app. One switch, the same job; no separate parallel job.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…sterize)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…R step)
git('rev-parse HEAD'), not the array form — the helper runs git ${cmd}. Pass cwd
to getRepoRoot too. tsc clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
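The string-versus-array distinction in this fix is easy to see in a small sketch. The helper below is a hypothetical reconstruction based only on the description "the helper runs git ${cmd}"; `buildGitCommand`, the `cwd` default, and the `--show-toplevel` query are assumptions, not the repo's code.

```typescript
import { execSync } from "node:child_process";

// The helper interpolates a single command string, so callers pass
// "rev-parse HEAD", not ["rev-parse", "HEAD"].
function buildGitCommand(cmd: string): string {
  return "git " + cmd;
}

// Hypothetical helper: run git in the given directory and return trimmed stdout.
function git(cmd: string, cwd: string = process.cwd()): string {
  return execSync(buildGitCommand(cmd), { cwd, encoding: "utf8" }).trim();
}

// Passing cwd through keeps the lookup relative to the checkout being
// inspected rather than wherever the process happened to start.
function getRepoRoot(cwd: string = process.cwd()): string {
  return git("rev-parse --show-toplevel", cwd);
}

// Why the array form breaks: interpolating an array joins it with commas.
const arrayFormAsString = String(["rev-parse", "HEAD"]); // "rev-parse,HEAD"
```

With a `string` parameter type, the array form is also rejected by the compiler, which is consistent with the "tsc clean" note in the commit message.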
…able text)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
How to test

Runs the full real agent flow against express-todo through the real wizard TUI, captures each key moment, diffs against the committed baseline, and writes `report.html` (`WIZARD_PATH` auto-resolves to a sibling `wizard-e2e`; creds from `.env`).

Side-by-side image review PR:

Or comment `/wizard-ci [app] [wizard_ref]` on a PR: the workflow runs the e2e and posts the report back as a comment (members only). Set the repo variable `WIZARD_SNAPSHOTS_PROJECT_ID` for the default project id.

What this is

`wizard-ci --e2e` and `wizard-ci-snapshots` drive the real wizard TUI (via the wizard repo's `tui-snapshots`): the real `startTUI`, driven by state manipulation, captured per key moment as text.

- `--e2e` asserts on the result JSON the run emits (run completed, posthog dep / `.env`, reached `keep-skills`).
- `wizard-ci-snapshots` diffs the captured real-TUI screens against a committed baseline; drift is surfaced but never fails the run.
- `snapshot-review` rasterizes the screens to a side-by-side image PR and, when triggered by `/wizard-ci`, posts the report back as a comment on the PR.

Drops the recording/replay plumbing (the `--replay` flag, the render step, `ansi-html`), since the captured screens are already clean text. Pairs with PostHog/wizard#702.