Lykhoyda
diff --git a/.changeset/maestro-structured-step-results.md b/.changeset/maestro-structured-step-results.md
Lines changed: 8 additions & 0 deletions
diff --git a/docs/superpowers/plans/2026-06-14-211-maestro-structured-results.md b/docs/superpowers/plans/2026-06-14-211-maestro-structured-results.md
Lines changed: 627 additions & 0 deletions
diff --git a/docs/superpowers/specs/2026-06-14-211-maestro-structured-results-design.md b/docs/superpowers/specs/2026-06-14-211-maestro-structured-results-design.md
Lines changed: 163 additions & 0 deletions
diff --git a/scripts/cdp-bridge/dist/domain/maestro-step-parser.js b/scripts/cdp-bridge/dist/domain/maestro-step-parser.js
Lines changed: 129 additions & 0 deletions
@@ -0,0 +1,8 @@
+---
+"rn-dev-agent-cdp": patch
+"rn-dev-agent-plugin": patch
+---
+
+`maestro_run` now returns structured per-step results and partial progress on timeout (GH #211).
+
+The result gains `steps[]` (`{index,name,verb,status,durationMs}`), `failedStep`, `reason` (sanitized `{kind,selector}` — never the raw runner log), `lastStep` (progress marker), `timedOut`, and `outputTruncated`. On timeout the partial steps are returned instead of a bare failure, and the failure headline names the failing/last step. Parsed from maestro-runner stdout (the JVM Maestro CLI fallback degrades fail-open to empty steps); `tapOn` latencies for #263 now derive from the shared parser. Additive — `output` is preserved for `run-action` consumers.
@@ -0,0 +1,163 @@
+# Design — #211: `maestro_run` structured step results + partial-progress-on-timeout
+
+**Date:** 2026-06-14
+**Issue:** [#211](https://github.com/Lykhoyda/rn-dev-agent/issues/211)
+**Status:** Approved design → ready for plan
+**Branch:** `feat/211-maestro-structured-results` (off `main`; #263 already merged)
+
+## Problem
+
+`maestro_run` works, but verification is harder than it should be:
+
+1. **Output truncated mid-flow** — success is only confirmable via top-level `passed: true` + a separate `cdp_navigation_state`. Per-step pass/fail/durations and terminal assertions aren't visible. The reporter re-ran `grep 'message=' reports/<ts>/junit-report.xml` after *nearly every* failed run (~30/session) to find the failing step.
+2. **Timeout returns a bare failure** — a flow that exceeds the cap yields no verdict and no "how far did it get".
+
+Issue item 3 (iOS `clearState` needing `--app-file`) **already shipped** in #276/#201 (`resolveAppFileForClearState`) — out of scope here.
+
+## Goal
+
+Add **structured, additive** fields to the `maestro_run` result so the failing step, reason, and per-step durations are visible without grepping report files, and so a timeout returns partial progress. The per-step durations also become the clean data source #263's degraded-tap-latency heuristic already wants.
+
+## Scope
+
+- **IN:** structured step results; partial progress on timeout.
+- **OUT:** iOS `clearState` (shipped); JUnit/report-file parsing (chose stdout-only); `screenshots[]` (YAGNI — the reporter's own suggestion had it empty); bumping the default timeout (the partial-progress *return* is the fix, not a bigger cap).
+- **Source:** maestro-runner **stdout** only. The JVM Maestro CLI fallback (iOS-no-adb) emits a different format and degrades **fail-open** to `steps: []` + raw `output`.
+
+## Architecture
+
+Three files; one new pure module. The win is that the per-step data is **already in the stdout** maestro-runner prints — the exact lines #263 parses — so #211's parser is a *generalization* of #263's, and `parseTapLatencies` collapses to a filter over it.
+
+### 1. NEW `src/domain/maestro-step-parser.ts` (pure, no I/O, fail-open)
+
+```ts
+export interface ReasonSummary {
+  kind: 'SELECTOR_NOT_FOUND' | 'TIMEOUT' | 'ASSERTION_FAILED';
+  selector: string | null;
+}
+
+export interface MaestroStep {
+  index: number;                  // 0-based observed order — disambiguates loops / runFlow repeats
+  name: string;                   // full step text minus the trailing (N.Ns), e.g. `tapOn: id="submit"`
+  verb: string;                   // first token after the glyph, trailing ':' stripped, e.g. `tapOn`
+  status: 'pass' | 'fail';
+  durationMs: number;
+}
+
+export function stripAnsi(s: string): string;                 // remove SGR codes before matching
+export function parseSteps(output: string): MaestroStep[];    // completed steps only (those with a (N.Ns))
+export function findFailedStep(steps: MaestroStep[]): MaestroStep | null;     // last status==='fail'
+export function lastObservedStep(steps: MaestroStep[]): MaestroStep | null;   // steps.at(-1)
+export function summarizeReason(output: string): ReasonSummary | null;        // sanitized — NO raw
+
+export interface StepSummary {
+  steps: MaestroStep[];
+  failedStep: MaestroStep | null;   // terminal failure only; null unless opts.failed
+  reason: ReasonSummary | null;     // null unless opts.failed
+  lastStep: MaestroStep | null;     // last observed (completed) step — the progress marker
+}
+export function buildStepSummary(output: string, opts: { failed: boolean }): StepSummary;
+```
+
+**Line grammar (verified against the #263 fixtures):** each step prints as
+`  {✓|✗} {verb}[: {selector}] (N.Ns)`. Parser rules:
+
+- `stripAnsi()` first (belt-and-suspenders; `execFile` is not a TTY so color is *usually* off, but unverified against the real binary — see Risks).
+- Anchor on a leading status glyph `✓`/`✗` after trimming.
+- **Require a trailing `(N.Ns)`** — this excludes the summary line `✗ rn-maestro-run 23.8s` (no parens) and the count lines `3 steps passing` (no glyph). Belt-and-suspenders: also skip a line whose verb is `rn-maestro-run`.
+- `verb` = first whitespace-delimited token after the glyph, **with a trailing `:` stripped** (`tapOn:` → `tapOn`). This is load-bearing for the #263 refactor (filter `verb === 'tapOn'`).
+- `name` = the line minus the glyph and the trailing `(N.Ns)`.
+- `durationMs` = `round(seconds * 1000)`.
+- `verb` is the FIRST token, so a verb-name *inside a selector value* (`✓ assertVisible: text="tapOn …"`) is recorded as `assertVisible` — preserves #263 review-finding #2.
+- Garbage / empty / CLI-fallback format → `[]`. Never throws.
+
+**`failedStep` is terminal-only.** `findFailedStep` returns the last `✗` step, but `buildStepSummary` only populates `failedStep`/`reason` when `opts.failed` is true. maestro-runner logs transient retries; a step that fails-then-retries-✓ on a run that ultimately **passed** must NOT report a `failedStep` (mirrors `parseMaestroFailure`'s END→START terminal-preference, GH#118). The handler passes `failed = !passed`, so on the success path `failedStep` is always null even if a transient `✗` appears in `steps`.
+
+**`reason` is sanitized — never carries `raw`.** `summarizeReason` calls `parseMaestroFailure` but **projects to `{ kind, selector }`**, explicitly dropping the `raw: string` field that every `MaestroFailure` variant carries. Returning the parser's object directly would re-embed the full unsliced runner log into the result, defeating the 2000/4000-char `output` slice. (UNKNOWN → `null`.)
+
+### 2. REFACTOR `src/domain/tap-latency.ts`
+
+```ts
+import { parseSteps } from './maestro-step-parser.js';
+export function parseTapLatencies(output: string): number[] {
+  return parseSteps(output)
+    .filter((s) => s.verb === 'tapOn' && s.status === 'pass')
+    .map((s) => s.durationMs);
+}
+```
+
+`gh-263-tap-latency.test.js` is the regression guard — the `DEGRADED` fixture must still yield `[2800, 3000]` and the single-failed-tap fixture `[]`. `classifyRuntimeDegradation`, `median`, `resolveFloorMs`, `augmentFailureWithDegradation` are unchanged. (The ≥2-sample gate `MIN_SAMPLES_FOR_DEGRADED` is unaffected — it operates on the filtered latency array.)
+
+### 3. WIRE `src/tools/maestro-run.ts`
+
+The current `meta` object (the payload passed to `okResult`/`warnResult`/`failResult`) is extended with the **same field set on all three paths**. Because `okResult(x)`/`warnResult(x,…)` place `x` in `envelope.data` while `failResult(msg,x)` places `x` in `envelope.meta`, the structured fields appear under `data.*` on pass/warn and `meta.*` on fail — and `output` is preserved on every path (`run-action.ts:144` reads `data.output` then `meta.output`).
+
+Added fields (stable set, present on every path):
+
+```ts
+steps: MaestroStep[]
+failedStep: MaestroStep | null
+reason: ReasonSummary | null
+lastStep: MaestroStep | null
+timedOut: boolean
+outputTruncated: boolean
+```
+
+- **success** (exit 0, `passed`): `buildStepSummary(output, { failed: false })` → `steps` + `lastStep`; `failedStep:null, reason:null, timedOut:false, outputTruncated:false`.
+- **warn** (exit 0 but `outputIndicatesFlowFailure`): `buildStepSummary(output, { failed: true })`; `timedOut:false, outputTruncated:false`; existing `augmentFailureWithDegradation` (#263) unchanged.
+- **catch** (non-zero / timeout / overflow): parse the partial `combined` (stdout+stderr Node attaches to the thrown error); `buildStepSummary(combined, { failed: true })`; existing `#263` augmentation unchanged. Timeout vs overflow discrimination:
+  ```ts
+  const killed = (err as any).killed === true;
+  const overflow = (err as any).code === 'ERR_CHILD_PROCESS_STDIO_MAXBUFFER';
+  const timedOut = killed && !overflow;
+  const outputTruncated = overflow;
+  ```
+  `err.killed` is the authoritative timeout discriminator (empirical Node probe: timeout → `killed:true, signal:'SIGTERM', code:null`; normal non-zero → `killed:false, code:N`; a SIGTERM-trapping child can leave `code` non-null while killed, so `code` is used only to *subtract* the maxBuffer case). On a pure timeout `failedStep` is `null` (nothing asserted-failed) and `lastStep` is the last **completed** step — the progress marker (an in-flight step has no `(N.Ns)` yet, so it isn't parsed).
+
+## Result shape (consumer view)
+
+```jsonc
+// success/warn → envelope.data ; fail/timeout → envelope.meta
+{
+  "passed": false,
+  "flowFile": "/tmp/rn-maestro-run-….yaml",
+  "platform": "ios",
+  "runner": "maestro-runner",
+  "output": "…sliced 2000/4000…",          // unchanged — back-compat
+  "steps": [
+    { "index": 0, "name": "launchApp", "verb": "launchApp", "status": "pass", "durationMs": 2300 },
+    { "index": 1, "name": "tapOn: id=\"submit\"", "verb": "tapOn", "status": "fail", "durationMs": 12700 }
+  ],
+  "failedStep": { "index": 1, "name": "tapOn: id=\"submit\"", "verb": "tapOn", "status": "fail", "durationMs": 12700 },
+  "reason": { "kind": "SELECTOR_NOT_FOUND", "selector": "submit" },
+  "lastStep": { "index": 1, "name": "tapOn: id=\"submit\"", "verb": "tapOn", "status": "fail", "durationMs": 12700 },
+  "timedOut": false,
+  "outputTruncated": false,
+  "runtimeDegraded": { "medianTapMs": 1800, "floorMs": 1500, "sampleCount": 3 }  // #263, only when degraded
+}
+```
+
+## Testing (TDD)
+
+- **NEW `test/unit/gh-211-maestro-step-parser.test.js`** — pure parser + helpers:
+  - verbs/status/durations; verb has NO trailing colon; index is observed order.
+  - excludes `✗ rn-maestro-run 23.8s` summary line and `N steps passing/failing` count lines.
+  - verb-in-selector (`assertVisible: text="tapOn …"`) → verb `assertVisible`.
+  - empty / garbage / CLI-format → `[]`; never throws.
+  - `stripAnsi` removes SGR codes; an ANSI-wrapped glyph line still parses.
+  - `findFailedStep` = last `✗`; `lastObservedStep` = `steps.at(-1)`.
+  - `summarizeReason` returns `{ kind, selector }` and **contains no `raw`** (assert the key is absent); UNKNOWN → null.
+  - `buildStepSummary(out,{failed:false})` → `failedStep:null,reason:null`; fail-then-retry-✓ output with `{failed:false}` → `failedStep:null`.
+- **REGRESSION `test/unit/gh-263-tap-latency.test.js`** stays green (proves `parseTapLatencies` unchanged).
+- **NEW `test/unit/gh-211-maestro-run-structured-results.test.js`** — exercise the pure assembly seam directly (no `execFile` mocking): success/warn metas via `buildStepSummary`; catch-path via a fake error `{ killed:true, code:null, stdout:'…partial…', stderr:'' }` asserting `timedOut:true, failedStep:null, lastStep=<last ✓>`; a maxBuffer fake `{ killed:true, code:'ERR_CHILD_PROCESS_STDIO_MAXBUFFER' }` asserting `timedOut:false, outputTruncated:true`.
+- **Patch changeset.**
+
+## Risks / open items
+
+- **ANSI (unverified against the real binary).** No ANSI handling exists in the repo and `execFile` is not a TTY (color usually off), but not guaranteed. Mitigation: ship `stripAnsi()` + test now; during device-verify run `~/.maestro-runner/bin/maestro-runner --platform ios test <flow> | grep -c $'\x1b'` to settle whether the strip is load-bearing or belt-and-suspenders.
+- **`runFlow` sub-flows.** `runFlow` is allowlisted/used (GH#186). No captured fixture shows how maestro-runner renders sub-flow child glyphs. `steps[]` is documented as a **flat observed list** — no parent/child hierarchy promised; `index` disambiguates repeats. Confirm rendering during device-verify.
+- **CLI fallback** produces no structured steps (different format) → `steps: []`, `output` intact. Acceptable: maestro-runner is the default fast path; fail-open matches #263.
+
+## Provenance
+
+Plan reviewed pre-code via `/brainstorm codex,antigravity` (2026-06-14). Codex + Claude file-grounded research caught: the `reason`-re-embeds-`raw` blocker, the maxBuffer-vs-timeout blocker, the `verb` trailing-colon trap, the `data` vs `meta` envelope placement, terminal-only `failedStep`, and ANSI/`runFlow` edges — all folded into this design. (Antigravity hung with no output.)
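Reviewer note: the design places the structured fields under `envelope.data` on pass/warn and under `envelope.meta` on fail/timeout, so a consumer has to look in both places. A minimal sketch of that lookup follows. The `Envelope` and `StructuredFields` types and the `readStructured`/`describeRun` helpers are hypothetical names for illustration, not part of the PR; only the field names come from the design.

```typescript
// Hypothetical consumer-side helper. The envelope shape is an assumption drawn
// from the design text: data.* on pass/warn, meta.* on fail/timeout.
interface MaestroStep {
  index: number;
  name: string;
  verb: string;
  status: 'pass' | 'fail';
  durationMs: number;
}

interface StructuredFields {
  steps: MaestroStep[];
  failedStep: MaestroStep | null;
  lastStep: MaestroStep | null;
  timedOut: boolean;
  outputTruncated: boolean;
}

interface Envelope {
  data?: Partial<StructuredFields>;
  meta?: Partial<StructuredFields>;
}

// Prefer data.* when it carries steps, otherwise fall back to meta.*.
// Missing fields get neutral defaults so callers never see undefined.
function readStructured(env: Envelope): StructuredFields {
  const src: Partial<StructuredFields> = env.data?.steps ? env.data : (env.meta ?? {});
  return {
    steps: src.steps ?? [],
    failedStep: src.failedStep ?? null,
    lastStep: src.lastStep ?? null,
    timedOut: src.timedOut ?? false,
    outputTruncated: src.outputTruncated ?? false,
  };
}

// One-line verdict built only from structured fields, never from raw output.
function describeRun(env: Envelope): string {
  const r = readStructured(env);
  if (r.timedOut) return `timed out after ${r.lastStep ? r.lastStep.name : 'no completed step'}`;
  if (r.failedStep) return `failed at ${r.failedStep.name}`;
  return `passed, ${r.steps.length} step(s)`;
}
```

A caller that already knows which path it is on can read `data` or `meta` directly. The fallback is only useful for generic consumers, which is the same pattern the design describes for `run-action.ts` reading `data.output` then `meta.output`.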
@@ -0,0 +1,129 @@
+// src/domain/maestro-step-parser.ts
+// GH #211: structure maestro_run results from maestro-runner stdout. Pure, no
+// I/O, fail-open: unparseable output yields []. Generalizes the #263 step-line
+// parser (tap-latency.ts derives parseTapLatencies from parseSteps).
+import { parseMaestroFailure } from './maestro-error-parser.js';
+// Strip ANSI SGR/color escape sequences. execFile output is usually un-colored
+// (child stdout is a pipe, not a TTY) but maestro-runner is not guaranteed to
+// honor that, and a glyph-anchored match breaks on a colored `✓`. Built via
+// fromCharCode(27) (ESC) to keep a raw control char out of the source/regex.
+const ANSI_RE = new RegExp(String.fromCharCode(27) + '\\[[0-9;]*m', 'g');
+export function stripAnsi(s) {
+    return s.replace(ANSI_RE, '');
+}
+// `  {✓|✗} <name> (N.Ns)` — the trailing (N.Ns) is REQUIRED, which excludes the
+// `✗ rn-maestro-run 23.8s` summary line and the `N steps passing` count lines.
+// The name is `\S.*\S|\S` (must start AND end non-whitespace), so a
+// duration-looking token inside a selector value (`text="took (2.0s)"`) stays
+// in the name and only the real `$`-anchored trailing duration is captured. It
+// also drops the overlapping whitespace quantifiers (`\s+(.*?)\s*`) that made
+// the prior pattern backtrack catastrophically (ReDoS) on a glyph +
+// long-whitespace line; the combined stdout+stderr carries untrusted multi-MB app logs (multi-LLM review).
+const STEP_RE = /^([✓✗])\s+(\S.*\S|\S)\s*\(([\d.]+)s\)\s*$/;
+// Bound any text interpolated into results/headline so a pathological step name
+// or selector (e.g. a multi-KB inputText value) can't balloon the failure
+// message and defeat the sliced `output` field (codex-pair review).
+const MAX_FIELD = 200;
+function cap(s) {
+    return s.length > MAX_FIELD ? s.slice(0, MAX_FIELD) + '…' : s;
+}
+// Cap the returned steps so a pathological run (a multi-MB stdout/stderr with
+// many step-shaped log lines) can't bloat the MCP response past the `output`
+// slice. Keep the most recent steps — failures and partial-progress live at the
+// tail — with their true `index` preserved (a gap signals truncation).
+const MAX_STEPS = 1000;
+export function parseSteps(output) {
+    if (!output || typeof output !== 'string')
+        return [];
+    const steps = [];
+    let index = 0;
+    for (const raw of stripAnsi(output).split('\n')) {
+        const m = STEP_RE.exec(raw.trim());
+        if (!m)
+            continue;
+        const name = m[2].trim();
+        const verb = name.split(/\s+/)[0].replace(/:$/, '');
+        if (verb === 'rn-maestro-run')
+            continue; // belt-and-suspenders vs a future summary format
+        const seconds = Number(m[3]);
+        if (!Number.isFinite(seconds))
+            continue;
+        steps.push({
+            index: index++,
+            name: cap(name),
+            verb,
+            status: m[1] === '✓' ? 'pass' : 'fail',
+            durationMs: Math.round(seconds * 1000),
+        });
+    }
+    return steps.length > MAX_STEPS ? steps.slice(-MAX_STEPS) : steps;
+}
+// The TERMINAL failed step: the last parsed step iff it failed. maestro-runner
+// stops at the first real failure, so the terminal ✗ is the last parsed step; a
+// transient ✗ that was retried-✓ before a later timeout is NOT reported, because
+// the last parsed step is then the recovery ✓ (codex-pair review).
+export function findFailedStep(steps) {
+    const last = steps.length ? steps[steps.length - 1] : null;
+    return last && last.status === 'fail' ? last : null;
+}
+export function lastObservedStep(steps) {
+    return steps.length ? steps[steps.length - 1] : null;
+}
+// Project parseMaestroFailure to {kind, selector}, DROPPING its `raw` field —
+// every MaestroFailure variant carries `raw` = the full unsliced output, which
+// must not be re-embedded into the result (it would defeat the output slice).
+export function summarizeReason(output) {
+    const f = parseMaestroFailure(output);
+    if (f.kind === 'UNKNOWN')
+        return null;
+    const selector = 'selector' in f ? (f.selector ?? null) : null;
+    return { kind: f.kind, selector: selector === null ? null : cap(selector) };
+}
+// failedStep/reason are populated ONLY when the run's terminal verdict is fail
+// (opts.failed). maestro-runner logs transient retries; a fail-then-retry-✓ on
+// a PASSED run must not report a failedStep (mirrors parseMaestroFailure GH#118).
+export function buildStepSummary(output, opts) {
+    const steps = parseSteps(output);
+    return {
+        steps,
+        failedStep: opts.failed ? findFailedStep(steps) : null,
+        reason: opts.failed ? summarizeReason(output) : null,
+        lastStep: lastObservedStep(steps),
+    };
+}
+// execFile timeout kills the child (killed===true, signal 'SIGTERM', code null).
+// A 10MB maxBuffer overflow ALSO rejects with killed===true but code
+// 'ERR_CHILD_PROCESS_STDIO_MAXBUFFER' — that's truncation, not a timeout, so it
+// must not be mislabeled. `killed` is authoritative; `code` only subtracts the
+// overflow case (a SIGTERM-trapping child can leave a non-null exit code).
+export function classifyExecError(err) {
+    const e = err;
+    const killed = e?.killed === true;
+    const overflow = e?.code === 'ERR_CHILD_PROCESS_STDIO_MAXBUFFER';
+    return { timedOut: killed && !overflow, outputTruncated: overflow };
+}
+// Headline for a failed maestro_run, built from STRUCTURED data so it never
+// re-embeds raw runner/app output. The raw fallbackMsg (err.message, which
+// execFile populates with stderr) is used ONLY when there is no structured
+// signal — e.g. a spawn/system error with no step output. Raw output still
+// lives in the bounded `output` field.
+export function formatFailureHeadline(summary, cls, fallbackMsg) {
+    if (cls.timedOut) {
+        return `Maestro flow timed out${summary.lastStep ? ` after step "${summary.lastStep.name}"` : ''}`;
+    }
+    if (cls.outputTruncated) {
+        return 'Maestro flow output exceeded the 10MB buffer';
+    }
+    if (summary.failedStep) {
+        const r = summary.reason;
+        const reasonStr = r ? ` (${r.kind}${r.selector ? `: ${r.selector}` : ''})` : '';
+        return `Maestro flow failed at step "${summary.failedStep.name}"${reasonStr}`;
+    }
+    // No terminal ✗ step line (e.g. it was truncated) but a recognizable error
+    // string survived — prefer the structured, raw-free reason over the raw msg.
+    if (summary.reason) {
+        const r = summary.reason;
+        return `Maestro flow failed (${r.kind}${r.selector ? `: ${r.selector}` : ''})`;
+    }
+    return `Maestro flow failed: ${fallbackMsg.slice(0, 500)}`;
+}
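Reviewer note: to make the step-line grammar and the terminal-failure rule concrete, here is a small self-contained sketch. It reuses the `STEP_RE` pattern from the file above. The sample lines are invented for illustration and are not a captured maestro-runner log, and `demoParse` and `terminalFailure` are hypothetical names. The shipped `findFailedStep` returns the last parsed step only if it failed, which is narrower than the "last `✗`" wording in the design doc; the sketch follows the shipped behaviour.

```typescript
// Illustrative sketch only. Sample lines are made up, not a captured log;
// demoParse and terminalFailure are hypothetical names mirroring the diff.
const STEP_LINE = /^([✓✗])\s+(\S.*\S|\S)\s*\(([\d.]+)s\)\s*$/;

interface DemoStep {
  verb: string;
  name: string;
  status: 'pass' | 'fail';
  durationMs: number;
}

function demoParse(output: string): DemoStep[] {
  const steps: DemoStep[] = [];
  for (const line of output.split('\n')) {
    const m = STEP_LINE.exec(line.trim());
    if (!m) continue; // summary and count lines lack the trailing (N.Ns)
    const seconds = Number(m[3]);
    if (!Number.isFinite(seconds)) continue; // e.g. "1.2.3" parses to NaN
    const name = m[2];
    steps.push({
      verb: name.split(/\s+/)[0].replace(/:$/, ''), // first token, ':' stripped
      name,
      status: m[1] === '✓' ? 'pass' : 'fail',
      durationMs: Math.round(seconds * 1000),
    });
  }
  return steps;
}

// Terminal failure: the last parsed step, and only if it failed.
function terminalFailure(steps: DemoStep[]): DemoStep | null {
  const last = steps.length ? steps[steps.length - 1] : null;
  return last && last.status === 'fail' ? last : null;
}

// A failing run: three step lines, then a summary line and a count line.
const failingRun = [
  '  ✓ launchApp (2.3s)',
  '  ✓ assertVisible: text="took (2.0s)" (0.4s)',
  '  ✗ tapOn: id="submit" (12.7s)',
  '✗ rn-maestro-run 23.8s',
  '2 steps passing',
].join('\n');

// A transient failure that was retried successfully.
const retriedRun = [
  '  ✗ tapOn: id="submit" (3.0s)',
  '  ✓ tapOn: id="submit" (1.2s)',
].join('\n');

const failingSteps = demoParse(failingRun);
const retriedSteps = demoParse(retriedRun);
const failingTerminal = terminalFailure(failingSteps); // the tapOn step
const retriedTerminal = terminalFailure(retriedSteps); // null: last step passed
```

In the failing run the summary and count lines are skipped, and the `(2.0s)` inside the selector value stays part of the step name because only the trailing duration is anchored to the end of the line. In the retried run no terminal failure is reported because the last parsed step is the recovery `✓`.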