feat(testing-framework): POC — Gherkin + JS dual front-end flow-IR with reusable prompt flows by ScriptedAlchemy · Pull Request #2639 · web-infra-dev/midscene

ScriptedAlchemy · 2026-06-10T00:56:36Z

What is this?

A POC that lets you write UI tests as plain-English Gherkin (Given / When / Then) with no step-definition code at all — the AI executes each step directly. It builds on the v2 testing-framework from #2589.

Classic Cucumber needs a JS function registered for every step. Here, the steps are the prompts:

Scenario: Checkout as admin
  When I run the "Login" flow with role "admin"
  And I remember the price of the "Trail Backpack" product as "price"
  When I add the "Trail Backpack" to the cart and open the cart
  Then the cart total equals {price}

- When steps → the Midscene UI agent acts on the page (aiAct)
- Then steps → a general agent judges the screenshot and returns a fail-closed verdict
- I remember ... as "price" → extracts a value into a variable table; {price} is substituted into later prompts mechanically, so the model never sees a placeholder
- I run the "Login" flow → calls a reusable, parameterized prompt flow (own scope, declared args/returns, optional once-per-run memoization). This is the answer to chaining prompts across complex multi-step journeys
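
The mechanical `{price}` substitution described above can be pictured with a minimal sketch. The names `VarTable` and `resolvePrompt` are hypothetical and not taken from the package, the price value is made up, and throwing on an unknown variable is an assumption about what a sensible resolver would do:

```typescript
// Hypothetical sketch: these names are illustrative, not the package's API.
type VarTable = Map<string, string>;

// Replace every {name} placeholder with its captured value before the
// prompt is sent anywhere. Throwing on an unknown name (an assumption)
// guarantees that an unresolved placeholder never reaches the model.
function resolvePrompt(template: string, vars: VarTable): string {
  return template.replace(
    /\{([A-Za-z_][A-Za-z0-9_]*)\}/g,
    (_match: string, name: string) => {
      const value = vars.get(name);
      if (value === undefined) {
        throw new Error(`Unknown variable "${name}" in step: ${template}`);
      }
      return value;
    },
  );
}

const vars: VarTable = new Map();
// Made-up value, as if captured by: I remember ... as "price"
vars.set("price", "$129.00");

const resolved = resolvePrompt("the cart total equals {price}", vars);
console.log(resolved); // the cart total equals $129.00
```

Using a replacer function rather than a replacement string keeps characters such as `$` in captured values from being interpreted as substitution patterns.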

Three interchangeable authoring styles, one engine

All three compile to the same internal flow-IR and run identically (the demo proves trace-for-trace parity). Decision rule: you probably only need style 1. Plain .feature files run end-to-end with nothing else — styles 2 and 3 are an alternative and an optional escape hatch, never requirements.

| Style | For whom | What it looks like |
| --- | --- | --- |
| 1. Pure Gherkin | non-engineers, living docs (fully sufficient on its own) | `.feature` files only, end to end |
| 2. Pure JS/TS | engineering-owned suites wanting loops/types/computed args | `defineFlow()` / `scenario()` builders |
| 3. Gherkin + overlay (optional escape hatch) | only when Gherkin is the contract AND a few scenarios need programmatic exceptions: bind-time computed values (dates/env-derived data), env-specific tweaks without forking the feature file (skip in CI, verify→soft, inserted steps) | `.feature` stays the source of truth; a sparse JS overlay patches anchored steps, drift caught at bind time so the prose↔JS seam can't silently rot |
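
To illustrate what "drift caught at bind time" could mean for style 3, here is a rough sketch. The real `bindFeature` signature is not shown in this PR text, so `StepPatch`, `BoundStep`, and `bindOverlay` are hypothetical names, and matching anchors by exact step text is an assumption:

```typescript
// Hypothetical shapes: the POC's real bindFeature API is not shown here.
interface StepPatch {
  anchor: string; // exact step text the patch attaches to (assumed matching rule)
  skip?: boolean; // e.g. skip in CI
  soft?: boolean; // downgrade a verification to a soft check
}

interface BoundStep {
  text: string;
  skip: boolean;
  soft: boolean;
}

// Bind a sparse overlay onto the feature's steps. Every patch must hit an
// existing step, so a reworded .feature line fails here, at bind time,
// instead of leaving a patch that silently no longer applies.
function bindOverlay(steps: string[], patches: StepPatch[]): BoundStep[] {
  const byAnchor = new Map<string, StepPatch>();
  for (const patch of patches) {
    if (steps.indexOf(patch.anchor) === -1) {
      throw new Error(`Overlay drift: no step matches anchor "${patch.anchor}"`);
    }
    if (byAnchor.has(patch.anchor)) {
      throw new Error(`Duplicate overlay anchor "${patch.anchor}"`);
    }
    byAnchor.set(patch.anchor, patch);
  }
  // Steps without a patch pass through unchanged.
  return steps.map((text) => {
    const patch = byAnchor.get(text);
    return { text, skip: patch?.skip ?? false, soft: patch?.soft ?? false };
  });
}

const checkoutSteps = [
  'I run the "Login" flow with role "admin"',
  "the cart total equals {price}",
];
const bound = bindOverlay(checkoutSteps, [
  { anchor: "the cart total equals {price}", soft: true },
]);
console.log(bound);
```

The overlay stays sparse: it lists only the steps it changes, and the feature file remains the source of truth for everything else.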

The example/ directory has one folder per style, each laid out as a realistic multi-file suite (shared flows reused by separate cart/checkout test modules) — see example/README.md.

Try it (no API key needed)

pnpm --filter @midscene/testing-framework demo            # offline, scripted agents, narrated walkthrough
pnpm --filter @midscene/testing-framework demo -- --live  # real browser + model via `codex login` (codex app-server)

The offline demo narrates every resolved prompt, variable capture, flow call, and verdict, then diffs the three styles against each other.

Status / validation

- 119+ unit tests (fake agents, no model/browser), package build, repo lint: all green
- Live run verified end-to-end on gpt-5.5 via Midscene's codex://app-server provider
- Design notes and open questions: packages/testing-framework/POC-GHERKIN.md

Feedback wanted

- Is the flow-call composition model (explicit args/returns, fresh scope, depth cap) the right reuse semantics?
- Which style(s) should graduate into the v2 framework proper?
- Naming: step conventions like `I remember ... as "var"` and `I run the "X" flow with ...`
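
For readers weighing the first question, the composition model (explicit args/returns, fresh scope, depth cap) can be sketched roughly as follows. Everything here is illustrative: `FlowDef`, `makeRunner`, and the cap of 8 are assumptions, and flow bodies are plain functions standing in for prompt steps:

```typescript
// Hypothetical sketch of the reuse semantics under discussion.
type Scope = Map<string, string>;
type Values = Record<string, string>;
type CallFn = (name: string, args: Values) => Values;

interface FlowDef {
  args: string[]; // declared inputs
  returns: string[]; // declared outputs
  body: (scope: Scope, call: CallFn) => void; // stand-in for prompt steps
}

const MAX_DEPTH = 8; // assumed value; the POC's actual cap is not stated here

function makeRunner(flows: Map<string, FlowDef>): CallFn {
  const callAt = (depth: number): CallFn => (name, args) => {
    if (depth >= MAX_DEPTH) {
      throw new Error(`Flow call depth exceeded ${MAX_DEPTH} at "${name}"`);
    }
    const flow = flows.get(name);
    if (flow === undefined) throw new Error(`Unknown flow "${name}"`);

    // Fresh scope: the callee sees only its declared args, never caller variables.
    const scope: Scope = new Map();
    for (const arg of flow.args) {
      const value = args[arg];
      if (value === undefined) {
        throw new Error(`Flow "${name}" is missing arg "${arg}"`);
      }
      scope.set(arg, value);
    }

    flow.body(scope, callAt(depth + 1));

    // Only declared returns cross back to the caller.
    const out: Values = {};
    for (const key of flow.returns) {
      const value = scope.get(key);
      if (value === undefined) {
        throw new Error(`Flow "${name}" did not set return "${key}"`);
      }
      out[key] = value;
    }
    return out;
  };
  return callAt(0);
}

const demoFlows = new Map<string, FlowDef>();
demoFlows.set("Login", {
  args: ["role"],
  returns: ["user"],
  body: (scope) => {
    scope.set("user", `${scope.get("role")}-user`);
    scope.set("scratch", "not returned"); // stays inside the flow
  },
});
demoFlows.set("Loop", {
  args: [],
  returns: [],
  body: (_scope, call) => {
    call("Loop", {}); // unbounded recursion, stopped by the depth cap
  },
});

const runFlow = makeRunner(demoFlows);
console.log(runFlow("Login", { role: "admin" })); // { user: 'admin-user' }
```

The trade-off this makes visible: fresh scopes keep flows predictable and reusable, at the cost of having to declare every value that crosses the boundary.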

Shared flow IR (variable table, named flows with scoped args/returns, keyword-to-node policy) with three authoring surfaces: .feature files via @cucumber/gherkin, a fluent typed JS API, and bindFeature sparse overlays with drift validation. Includes offline demo and unit tests with fake agents.

Offline-by-default demo narrating the login/checkout journey through pure Gherkin, pure JS, and bound overlay modes with scripted fake agents, proving identical traces across front-ends and diffing overlay changes. Experimental --live mode runs against the static demo shop.

…wip)

In-progress increment: codex-backed general agent for the POC demo's live mode; validation and real-run verification still pending.

Completes the codex live-mode increment: lazy-load @midscene/core/ai-model in CodexGeneralAgent (keeps the package index importable under vitest), fail fast when a capture extracts an empty value, add the missing back-to-shop step the real journey exposed, note live-mode verdict nondeterminism in the trace comparison, and document the codex setup (codex login, auto-configured MIDSCENE_MODEL_* env) in POC-GHERKIN.md. Verified live against codex gpt-5.5: all three modes pass.

Share engine step bookkeeping and getReportFile between runCase and the IR executor, merge prompt/capture step scaffolding, dedupe var-record stringification and identifier regexes, drop dead API (PromptStepIR.role, unused executor options, FlowRegistry.names), clean up codex screenshot temp files per call, fix nested-JSON verdict parsing with a regression test, and memoize the demo's codex CLI probe.
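
As background for the nested-JSON verdict fix mentioned above, a fail-closed parser might look like the sketch below. It is not the PR's code: the `pass` and `reason` field names and both function names are assumptions. The point it illustrates is that a balanced-brace scan survives nested objects, and that anything unparseable is treated as a failure:

```typescript
// Hypothetical sketch; field names and function names are assumptions.
interface Verdict {
  pass: boolean;
  reason: string;
}

// Find the first balanced {...} span, ignoring braces inside JSON strings.
// A scan like this handles nested objects, which a lazy regex would truncate.
function firstJsonObject(text: string): string | undefined {
  const start = text.indexOf("{");
  if (start === -1) return undefined;
  let depth = 0;
  let inString = false;
  let escaped = false;
  for (let i = start; i < text.length; i++) {
    const ch = text[i];
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') inString = true;
    else if (ch === "{") depth++;
    else if (ch === "}") {
      depth--;
      if (depth === 0) return text.slice(start, i + 1);
    }
  }
  return undefined; // unbalanced braces
}

// Fail closed: only an explicit boolean `pass` can produce a passing verdict.
function parseVerdict(reply: string): Verdict {
  const failClosed = (reason: string): Verdict => ({ pass: false, reason });
  const json = firstJsonObject(reply);
  if (json === undefined) return failClosed("no JSON verdict in reply");
  try {
    const parsed: unknown = JSON.parse(json);
    if (typeof parsed === "object" && parsed !== null) {
      const record = parsed as Record<string, unknown>;
      if (typeof record.pass === "boolean") {
        const reason = typeof record.reason === "string" ? record.reason : "";
        return { pass: record.pass, reason };
      }
    }
    return failClosed("verdict JSON has no boolean pass field");
  } catch {
    return failClosed("verdict JSON did not parse");
  }
}

const nestedReply =
  'Verdict: {"pass": true, "reason": "totals match", "details": {"total": "$1"}}';
console.log(parseVerdict(nestedReply).pass); // true
console.log(parseVerdict("looks good to me").pass); // false
```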

- implement memo: 'once-per-run' flow memoization with a shareable memoStore on RunScenarioOptions; only fully successful completions are cached, hits replay returns with a narrated info step
- make verdict-channel instructions adapter-supplied (verdictInstructions on GeneralAgentAdapter) so Pi keeps report_verdict wording while codex prompts demand its JSON reply channel; adapter-neutral fail-closed reason
- bindFeature now throws on duplicate anchors targeting the same step instead of silently merging overlays
- introduce structural UiAgentLike and use it across the engine/executor, removing the `as unknown as Agent` casts from fakes and demo agents
- write codex screenshot temp files under midscene_run/tmp (getMidsceneRunSubDir) instead of mkdtemp, keeping per-call deletion
- make feature()/FeatureIR symmetric with the Gherkin CompiledFeature ({ name, scenarios, flows }); CompiledFeature is now an alias
- update POC-GHERKIN.md to match
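
The once-per-run memoization rule in this commit (cache only fully successful completions, replay returns on a hit) can be sketched as below. The `runOncePerRun` helper, the `MemoStore` shape, and keying by flow name plus sorted args are all assumptions, not the actual `memoStore` on `RunScenarioOptions`:

```typescript
// Hypothetical sketch of once-per-run memoization semantics.
type Returns = Record<string, string>;
type MemoStore = Map<string, Returns>;

// Assumed cache key: flow name plus its args in a stable order.
function memoKey(flow: string, args: Record<string, string>): string {
  const sorted = Object.keys(args)
    .sort()
    .map((k) => [k, args[k]]);
  return `${flow}:${JSON.stringify(sorted)}`;
}

function runOncePerRun(
  store: MemoStore,
  flow: string,
  args: Record<string, string>,
  run: () => Returns,
): { returns: Returns; replayed: boolean } {
  const key = memoKey(flow, args);
  const cached = store.get(key);
  if (cached !== undefined) {
    // Hit: replay the stored returns without executing the flow again.
    return { returns: { ...cached }, replayed: true };
  }
  // If run() throws, the error propagates and nothing is stored,
  // so only fully successful completions are ever cached.
  const returns = run();
  store.set(key, { ...returns });
  return { returns, replayed: false };
}

const memoStore: MemoStore = new Map();
let loginRuns = 0;
const login = (): Returns => {
  loginRuns += 1;
  return { user: "admin" };
};

const first = runOncePerRun(memoStore, "Login", { role: "admin" }, login);
const second = runOncePerRun(memoStore, "Login", { role: "admin" }, login);
console.log(first.replayed, second.replayed, loginRuns); // false true 1
```

Passing the store in explicitly mirrors the "shareable" wording: scenarios that receive the same store share hits, while a fresh store gives a clean run.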

cloudflare-workers-and-pages · 2026-06-10T01:02:28Z

Deploying midscene with Cloudflare Pages

Latest commit:	`4b1a261`
Status:	✅ Deploy successful!
Preview URL:	https://e535d003.midscene.pages.dev
Branch Preview URL:	https://poc-gherkin-ai-flows.midscene.pages.dev


…with three style folders

- example/ now shows one realistic suite authored three interchangeable ways: style-1-gherkin (shared flows/*.feature + independent feature modules), style-2-js (shared defineFlow module + per-module *.flows.ts), style-3-overlay (sparse bindFeature patch over style-1's checkout.feature)
- add compileSuite() to the gherkin front-end: glob a suite directory (or file list), merge all @flow definitions into one registry, fail loudly on duplicate flow names across files
- add a second shared flow ("Add product to cart") and cart-inspection scenarios; extend demo-app with a second product, quantity controls and a header cart badge; scripted agents cover the new steps
- demo runs the suite module-by-module per style, narrating each source file, and keeps the Gherkin-vs-JS trace parity proof and the overlay diff
- rich first-reader comments per style (what flows, captures and overlays are); example/README.md is the orientation point; POC-GHERKIN.md updated
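
The "merge all @flow definitions into one registry, fail loudly on duplicate flow names" behavior can be pictured with a small sketch. This is not `compileSuite()` itself: `SuiteFile`, `mergeFlows`, and the file paths are hypothetical, and parsing of `.feature` files is left out entirely:

```typescript
// Hypothetical sketch: input is already-parsed flow names per file.
interface SuiteFile {
  path: string;
  flowNames: string[];
}

// Merge every file's flows into one registry (flow name → defining file).
// A name defined twice is an error naming both files, not a silent override.
function mergeFlows(files: SuiteFile[]): Map<string, string> {
  const owner = new Map<string, string>();
  for (const file of files) {
    for (const name of file.flowNames) {
      const existing = owner.get(name);
      if (existing !== undefined) {
        throw new Error(
          `Duplicate flow "${name}" defined in ${existing} and ${file.path}`,
        );
      }
      owner.set(name, file.path);
    }
  }
  return owner;
}

// Illustrative paths only.
const registry = mergeFlows([
  { path: "flows/login.feature", flowNames: ["Login"] },
  { path: "flows/cart.feature", flowNames: ["Add product to cart"] },
]);
console.log(registry.size); // 2
```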

…escape hatch

Pure .feature files are fully sufficient on their own; style 2 is for engineering-owned dynamic suites, and style 3 (bindFeature overlay) only earns its keep for bind-time computed values, per-environment tweaks without forking the feature file, and a drift-validated seam between prose and JS. Adds a blunt "Which style do I need?" decision section to example/README.md ("you probably only need style 1"), reframes the read-this-first comments in styles 1 and 3, and aligns POC-GHERKIN.md's mode-selection table. Docs/comments only; no behavior changes.

ScriptedAlchemy · 2026-06-10T07:11:24Z

Closing in favor of a fresh implementation: the POC validated the concept (AI-executed Gherkin, reusable prompt flows, three routing modes), and the follow-up design session concluded we should build it as a new standalone package — @midscene/bdd — that embeds real cucumber-js instead of extending packages/testing-framework. New PR to follow from feat/midscene-bdd. Branch kept for reference.

ScriptedAlchemy added 7 commits June 9, 2026 23:10

feat(testing-framework): wire codex app-server agent into live demo (…

ab2b22f

refactor(testing-framework): remove AI slop from flow-IR POC

9d6496c

ScriptedAlchemy added 2 commits June 10, 2026 03:54

ScriptedAlchemy closed this Jun 10, 2026

ScriptedAlchemy mentioned this pull request Jun 10, 2026

feat(bdd): @midscene/bdd — AI-native BDD runner on cucumber-js #2646

Draft

ScriptedAlchemy wants to merge 9 commits into claude/upbeat-noether-u2JRq from poc/gherkin-ai-flows
