Stage 2: judge intermediate scenario screenshots, not just the last

## Context

Layer 3 currently scores at most two screens per run:
- **Stage 1** — the post-launch home screen (substrate-leak + renders-cleanly rubric).
- **Stage 2** — the *last* screenshot captured during the scripted mobile-mcp walk (domain-content + substrate-leak rubric).

Stage 2's walker (`src/validation/stage2.ts` → `runStage2Scenario`) already captures a screenshot at every meaningful step of the scenario — they all sit in `scenario.screenshots: string[]`. The judge then drops everything except the last one at `src/validation/stage2-judge.ts:167`:

```ts
const representative = scenario.screenshots[scenario.screenshots.length - 1]!;
const layer3 = await runLayer3({ screenshotPath: representative, rubric: args.rubric, spec: args.spec });
```

This leaves mid-flow screens (list view, detail view, post-toggle confirmation) unjudged even though they're already on disk.

## Problem

Mid-flow screens are exactly where domain-rename regressions show up: "Edit Shop" in a form field, "Shopkeeper" in a tab bar, etc. Today, if those leak only on the list/detail screen and the toggle screen looks fine, Stage 2 PASSes and the leak ships. The screenshots that would have caught it are on disk but are never sent to the judge.

## Proposal

Judge **N representative screenshots** per Stage 2 walk, not just one, and surface per-screenshot scores in `report.json`.

**Minimal version:**
1. In `runOnePlatform` (`src/validation/stage2-judge.ts:113`), iterate `runLayer3` over a curated subset of `scenario.screenshots` — e.g. first screen with domain content, the post-toggle screen, and the last screen. Don't blindly judge all of them (cost: median-of-3 × N × criteria).
2. Extend `Stage2PlatformReport` (`src/agents/types.ts:83`) so `layer3Scores` becomes an array of `{ screenshot, scores }` entries instead of a single flat score set. Either keep `representativeScreenshot` for backward compatibility with the HTML report, or remove it after migrating `src/report/render.ts:155`.
3. `Stage2PlatformReport.pass` flips from `scenario.ok && layer3.pass` to `scenario.ok && everyJudgedScreen.pass`. Any judged screen failing the rubric fails Stage 2.
4. Mark each judged screen in the HTML report so the operator can see which step's screenshot triggered the failure.
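
The steps above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: `Layer3Result`, `JudgedScreenshot`, `JudgeFn`, `judgeScreens`, `stage2Pass`, and `firstFailure` are hypothetical names, and the judge is injected as a stand-in for `runLayer3` so the sketch runs without a vision model. Treating an empty judged list as a failure is also an assumption; the proposal only says every judged screen must pass.

```ts
// Hypothetical shapes for illustration; the real types live in src/agents/types.ts.
interface Layer3Result {
  pass: boolean;
  scores: Record<string, number>;
}

// One entry per judged screenshot (step 2 of the proposal).
interface JudgedScreenshot {
  screenshot: string; // path of the screenshot on disk
  scores: Record<string, number>;
  pass: boolean;
}

// Stand-in for runLayer3, injected so the sketch is runnable in isolation.
type JudgeFn = (screenshotPath: string) => Promise<Layer3Result>;

// Step 1: run the judge over a curated subset of screenshots, in order.
async function judgeScreens(
  screenshots: string[],
  judge: JudgeFn,
): Promise<JudgedScreenshot[]> {
  const judged: JudgedScreenshot[] = [];
  for (const screenshot of screenshots) {
    const result = await judge(screenshot);
    judged.push({ screenshot, scores: result.scores, pass: result.pass });
  }
  return judged;
}

// Step 3: any judged screen failing the rubric fails Stage 2.
// Assumption: an empty list fails instead of passing vacuously.
function stage2Pass(scenarioOk: boolean, judged: JudgedScreenshot[]): boolean {
  return scenarioOk && judged.length > 0 && judged.every((j) => j.pass);
}

// Step 4: find the first failing screen so the report can mark it.
function firstFailure(judged: JudgedScreenshot[]): JudgedScreenshot | undefined {
  return judged.find((j) => !j.pass);
}
```

With this shape, a leak that appears only on the list screen fails the platform even when the final toggle screen is clean, and `firstFailure` gives the HTML report the screenshot to highlight.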

**Open questions:**
- Which screens count as "representative"? Probably annotate scenario steps with a `judgeable: true` flag so the scenario author picks (queue.ts knows which step shows domain content).
- Cost ceiling: Stage 2 today is ~one Layer 3 call × 2 platforms = ~6 vision samples (assuming median-of-3 sampling per call). Judging 3 screens per platform pushes that to ~18 samples per run. Worth measuring; may want a config knob.
- Does this also let us drop walk-app's hypothetical `judge` command? Yes: once Stage 2 covers the right screens, there's no gap for walk-app to fill (see prior discussion: the right home for mid-flow judging is the path that already owns `report.json`, not a second skill).
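
One way the first two open questions could fit together is sketched below. Everything here is an assumption for discussion: the `ScenarioStep` shape, the `selectJudgeable` and `visionSamples` helpers, and the `maxScreens` knob do not exist in the codebase, and the default of 3 samples per call is taken from the median-of-3 figure mentioned above.

```ts
// Hypothetical step shape; `judgeable` is the flag floated in the open question.
interface ScenarioStep {
  name: string;
  screenshot: string; // path of the screenshot captured at this step
  judgeable?: boolean;
}

// Pick the screenshots to judge, capped by a config knob (maxScreens).
// When no step is flagged, fall back to today's behaviour: the last screenshot.
function selectJudgeable(steps: ScenarioStep[], maxScreens: number): string[] {
  const flagged = steps
    .filter((s) => s.judgeable === true)
    .map((s) => s.screenshot);
  if (flagged.length === 0) {
    const last = steps[steps.length - 1];
    return last ? [last.screenshot] : [];
  }
  // Keeps the earliest flagged steps when the cap is hit (an arbitrary choice).
  return flagged.slice(0, Math.max(0, maxScreens));
}

// Rough cost estimate: screens × platforms × samples per Layer 3 call.
function visionSamples(
  screens: number,
  platforms: number,
  samplesPerCall: number = 3,
): number {
  return screens * platforms * samplesPerCall;
}
```

Under these assumptions, `visionSamples(1, 2)` gives 6 and `visionSamples(3, 2)` gives 18, matching the estimates in the cost bullet.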

## Out of scope

- Adding new screens to the scenario itself. That's `src/validation/scenarios/queue.ts` work and a different change.
- Live walk-app integration with the judge. Decided against in conversation 2026-05-28 — Stage 2 extension is the cleaner fix.

Stage 2: judge intermediate scenario screenshots, not just the last #114

Context

Problem

Proposal

Out of scope

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Stage 2: judge intermediate scenario screenshots, not just the last #114

Description

Context

Problem

Proposal

Out of scope

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions