
feat(testing-framework): POC — Gherkin + JS dual front-end flow-IR with reusable prompt flows#2639

Closed
ScriptedAlchemy wants to merge 9 commits into claude/upbeat-noether-u2JRq from poc/gherkin-ai-flows

ScriptedAlchemy (Collaborator) commented Jun 10, 2026

What is this?

A POC that lets you write UI tests as plain-English Gherkin (Given / When / Then) with no step-definition code at all — the AI executes each step directly. It builds on the v2 testing-framework from #2589.

Classic Cucumber needs a JS function registered for every step. Here, the steps are the prompts:

Scenario: Checkout as admin
  When I run the "Login" flow with role "admin"
  And I remember the price of the "Trail Backpack" product as "price"
  When I add the "Trail Backpack" to the cart and open the cart
  Then the cart total equals {price}
  • When steps → the Midscene UI agent acts on the page (aiAct)
  • Then steps → a general agent judges a fail-closed verdict from the screenshot
  • I remember ... as "price" → extracts a value into a variable table; {price} is substituted into later prompts mechanically — the model never sees a placeholder
  • I run the "Login" flow → calls a reusable, parameterized prompt flow (own scope, declared args/returns, optional once-per-run memoization) — the answer to chaining prompts across complex multi-step journeys
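
The keyword policy and the mechanical `{price}` substitution described above can be sketched roughly as follows. This is a minimal illustration, not the POC's code: `VarTable`, `classifyStep`, and `resolvePrompt` are hypothetical names, the regexes are guesses at the step conventions, and mapping `Given` to an action is an assumption the description does not state.

```typescript
// Hypothetical sketch: these names are illustrative, not the POC's API.
type VarTable = Map<string, string>;
type StepKind = "act" | "verify" | "capture" | "flow-call";

// Decide what a step does from its effective keyword and its text.
// Assumption: "Given" is treated like "When" (an action on the page).
function classifyStep(keyword: "Given" | "When" | "Then", text: string): StepKind {
  if (/^I run the "[^"]+" flow\b/.test(text)) return "flow-call";
  if (/^I remember .+ as "[^"]+"$/.test(text)) return "capture";
  return keyword === "Then" ? "verify" : "act";
}

// Replace every {name} with its captured value before the prompt is sent,
// so the model only ever sees concrete text. Unknown names fail loudly.
function resolvePrompt(text: string, vars: VarTable): string {
  return text.replace(/\{([A-Za-z_][A-Za-z0-9_]*)\}/g, (_match, name: string) => {
    const value = vars.get(name);
    if (value === undefined) {
      throw new Error(`Unknown variable {${name}}`);
    }
    return value;
  });
}
```

With a variable table holding `price`, a call such as `resolvePrompt("the cart total equals {price}", vars)` yields a prompt with the captured value inlined; a step that references a variable nobody captured is rejected before any model call.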

Three interchangeable authoring styles, one engine

All three compile to the same internal flow-IR and run identically (the demo proves trace-for-trace parity). Decision rule: you probably only need style 1. Plain .feature files run end-to-end with nothing else — styles 2 and 3 are an alternative and an optional escape hatch, never requirements.

| Style | For whom | What it looks like |
| --- | --- | --- |
| 1. Pure Gherkin | non-engineers, living docs (fully sufficient on its own) | .feature files only, end to end |
| 2. Pure JS/TS | engineering-owned suites wanting loops/types/computed args | defineFlow() / scenario() builders |
| 3. Gherkin + overlay (optional escape hatch) | only when Gherkin is the contract AND a few scenarios need programmatic exceptions: bind-time computed values (dates/env-derived data), env-specific tweaks without forking the feature file (skip in CI, verify→soft, inserted steps) | .feature stays the source of truth; a sparse JS overlay patches anchored steps, drift caught at bind time so the prose↔JS seam can't silently rot |

The example/ directory has one folder per style, each laid out as a realistic multi-file suite (shared flows reused by separate cart/checkout test modules) — see example/README.md.
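
To illustrate what "drift caught at bind time" means for style 3, here is a minimal sketch of an overlay binder. It does not reproduce the POC's `bindFeature` signature, which this description does not show; `Step`, `Patch`, `Overlay`, and `bindOverlay` are hypothetical, and anchoring by exact step text is an assumption.

```typescript
// Hypothetical shapes: the POC's real bindFeature API may differ.
interface Step { text: string; skip?: boolean; soft?: boolean }
interface Patch { skip?: boolean; soft?: boolean }
type Overlay = Array<{ anchor: string; patch: Patch }>;

// Apply a sparse overlay to compiled steps. Every anchor must match a step
// (otherwise the feature text has drifted away from the overlay), and two
// patches may not target the same step.
function bindOverlay(steps: Step[], overlay: Overlay): Step[] {
  const patches = new Map<string, Patch>();
  for (const { anchor, patch } of overlay) {
    if (patches.has(anchor)) {
      throw new Error(`Duplicate anchor: "${anchor}"`);
    }
    if (!steps.some((s) => s.text === anchor)) {
      throw new Error(`Overlay drift: no step matches anchor "${anchor}"`);
    }
    patches.set(anchor, patch);
  }
  // Unpatched steps pass through unchanged; patched steps get the overrides.
  return steps.map((s) => ({ ...s, ...(patches.get(s.text) ?? {}) }));
}
```

Because validation happens when the overlay is bound rather than when a step runs, rewording a step in the .feature file without updating the overlay fails immediately instead of silently dropping the patch.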

Try it (no API key needed)

pnpm --filter @midscene/testing-framework demo            # offline, scripted agents, narrated walkthrough
pnpm --filter @midscene/testing-framework demo -- --live  # real browser + model via `codex login` (codex app-server)

The offline demo narrates every resolved prompt, variable capture, flow call, and verdict, then diffs the three styles against each other.

Status / validation

  • 119+ unit tests (fake agents — no model/browser), package build, repo lint: all green
  • Live run verified end-to-end on gpt-5.5 via Midscene's codex://app-server provider
  • Design notes and open questions: packages/testing-framework/POC-GHERKIN.md

Feedback wanted

  • Is the flow-call composition model (explicit args/returns, fresh scope, depth cap) the right reuse semantics?
  • Which style(s) should graduate into the v2 framework proper?
  • Naming: step conventions like I remember ... as "var" and I run the "X" flow with ...
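
To make the first question concrete, the reuse semantics under discussion (explicit args/returns, fresh scope, depth cap, once-per-run memoization) can be sketched as below. This is a simplified model, not the POC's executor: `FlowDef`, `FlowRunner`, and `MAX_DEPTH` are hypothetical, flow bodies are synchronous callbacks standing in for prompt steps, and keying the memo on the declared argument values is an assumption.

```typescript
// Hypothetical sketch of the flow-call semantics; not the POC's executor.
type Vars = Record<string, string>;
type CallFlow = (name: string, args: Vars) => Vars;

interface FlowDef {
  name: string;
  args: string[];      // declared inputs
  returns: string[];   // declared outputs
  memo?: "once-per-run";
  // Stands in for the flow's prompt steps; may call other flows.
  body: (scope: Vars, call: CallFlow) => void;
}

const MAX_DEPTH = 8; // assumed cap on nested flow calls

class FlowRunner {
  private memoStore = new Map<string, Vars>();
  constructor(private flows: Map<string, FlowDef>) {}

  call(name: string, args: Vars, depth = 0): Vars {
    if (depth >= MAX_DEPTH) throw new Error(`Flow depth cap (${MAX_DEPTH}) exceeded at "${name}"`);
    const flow = this.flows.get(name);
    if (!flow) throw new Error(`Unknown flow "${name}"`);

    // Fresh scope: only declared args are copied in, nothing leaks from the caller.
    const scope: Vars = {};
    for (const arg of flow.args) {
      const value = args[arg];
      if (value === undefined) throw new Error(`Flow "${name}" is missing arg "${arg}"`);
      scope[arg] = value;
    }

    const key = `${name}:${JSON.stringify(flow.args.map((a) => scope[a]))}`;
    if (flow.memo === "once-per-run") {
      const hit = this.memoStore.get(key);
      if (hit) return { ...hit }; // replay cached returns
    }

    // A throwing body skips the cache write, so only successful runs are memoized.
    flow.body(scope, (n, a) => this.call(n, a, depth + 1));

    const out: Vars = {};
    for (const ret of flow.returns) {
      const value = scope[ret];
      if (value === undefined) throw new Error(`Flow "${name}" did not set return "${ret}"`);
      out[ret] = value;
    }
    if (flow.memo === "once-per-run") this.memoStore.set(key, out);
    return { ...out };
  }
}
```

In this model a memoized "Login" flow called twice with the same role runs its body once, and a flow that calls itself without a base case fails at the depth cap rather than recursing indefinitely.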

ScriptedAlchemy added 7 commits June 9, 2026 23:10
Shared flow IR (variable table, named flows with scoped args/returns,
keyword-to-node policy) with three authoring surfaces: .feature files
via @cucumber/gherkin, a fluent typed JS API, and bindFeature sparse
overlays with drift validation. Includes offline demo and unit tests
with fake agents.
Offline-by-default demo narrating the login/checkout journey through
pure Gherkin, pure JS, and bound overlay modes with scripted fake
agents, proving identical traces across front-ends and diffing overlay
changes. Experimental --live mode runs against the static demo shop.
…wip)

In-progress increment: codex-backed general agent for the POC demo's
live mode; validation and real-run verification still pending.
Completes the codex live-mode increment: lazy-load @midscene/core/ai-model
in CodexGeneralAgent (keeps the package index importable under vitest),
fail fast when a capture extracts an empty value, add the missing
back-to-shop step the real journey exposed, note live-mode verdict
nondeterminism in the trace comparison, and document the codex setup
(codex login, auto-configured MIDSCENE_MODEL_* env) in POC-GHERKIN.md.
Verified live against codex gpt-5.5: all three modes pass.
Share engine step bookkeeping and getReportFile between runCase and the
IR executor, merge prompt/capture step scaffolding, dedupe var-record
stringification and identifier regexes, drop dead API (PromptStepIR.role,
unused executor options, FlowRegistry.names), clean up codex screenshot
temp files per call, fix nested-JSON verdict parsing with a regression
test, and memoize the demo's codex CLI probe.
- implement memo: 'once-per-run' flow memoization with a shareable
  memoStore on RunScenarioOptions; only fully successful completions are
  cached, hits replay returns with a narrated info step
- make verdict-channel instructions adapter-supplied (verdictInstructions
  on GeneralAgentAdapter) so Pi keeps report_verdict wording while codex
  prompts demand its JSON reply channel; adapter-neutral fail-closed reason
- bindFeature now throws on duplicate anchors targeting the same step
  instead of silently merging overlays
- introduce structural UiAgentLike and use it across the engine/executor,
  removing the `as unknown as Agent` casts from fakes and demo agents
- write codex screenshot temp files under midscene_run/tmp
  (getMidsceneRunSubDir) instead of mkdtemp, keeping per-call deletion
- make feature()/FeatureIR symmetric with the Gherkin CompiledFeature
  ({ name, scenarios, flows }); CompiledFeature is now an alias
- update POC-GHERKIN.md to match
cloudflare-workers-and-pages (Bot) commented Jun 10, 2026

Deploying midscene with Cloudflare Pages

Latest commit: 4b1a261
Status: ✅  Deploy successful!
Preview URL: https://e535d003.midscene.pages.dev
Branch Preview URL: https://poc-gherkin-ai-flows.midscene.pages.dev


ScriptedAlchemy added 2 commits June 10, 2026 03:54
…with three style folders

- example/ now shows one realistic suite authored three interchangeable
  ways: style-1-gherkin (shared flows/*.feature + independent feature
  modules), style-2-js (shared defineFlow module + per-module *.flows.ts),
  style-3-overlay (sparse bindFeature patch over style-1's checkout.feature)
- add compileSuite() to the gherkin front-end: glob a suite directory (or
  file list), merge all @flow definitions into one registry, fail loudly on
  duplicate flow names across files
- add a second shared flow ("Add product to cart") and cart-inspection
  scenarios; extend demo-app with a second product, quantity controls and a
  header cart badge; scripted agents cover the new steps
- demo runs the suite module-by-module per style, narrating each source
  file, and keeps the Gherkin-vs-JS trace parity proof and the overlay diff
- rich first-reader comments per style (what flows, captures and overlays
  are); example/README.md is the orientation point; POC-GHERKIN.md updated
…escape hatch

Pure .feature files are fully sufficient on their own; style 2 is for
engineering-owned dynamic suites, and style 3 (bindFeature overlay) only
earns its keep for bind-time computed values, per-environment tweaks
without forking the feature file, and a drift-validated seam between prose
and JS. Adds a blunt "Which style do I need?" decision section to
example/README.md ("you probably only need style 1"), reframes the
read-this-first comments in styles 1 and 3, and aligns POC-GHERKIN.md's
mode-selection table. Docs/comments only — no behavior changes.
ScriptedAlchemy (Collaborator, Author) commented

Closing in favor of a fresh implementation: the POC validated the concept (AI-executed Gherkin, reusable prompt flows, three routing modes), and the follow-up design session concluded we should build it as a new standalone package — @midscene/bdd — that embeds real cucumber-js instead of extending packages/testing-framework. New PR to follow from feat/midscene-bdd. Branch kept for reference.
