Context
Layer 3 currently scores at most two screens per run:
- Stage 1 — the post-launch home screen (substrate-leak + renders-cleanly rubric).
- Stage 2 — the last screenshot captured during the scripted mobile-mcp walk (domain-content + substrate-leak rubric).
Stage 2's walker (src/validation/stage2.ts → runStage2Scenario) already captures a screenshot at every meaningful step of the scenario — they all sit in scenario.screenshots: string[]. The judge then drops everything except the last one at src/validation/stage2-judge.ts:167:
```ts
const representative = scenario.screenshots[scenario.screenshots.length - 1]!;
const layer3 = await runLayer3({ screenshotPath: representative, rubric: args.rubric, spec: args.spec });
```
This leaves mid-flow screens (list view, detail view, post-toggle confirmation) unjudged even though they're already on disk.
Problem
Mid-flow screens are exactly where domain-rename regressions show up — "Edit Shop" in a form field, "Shopkeeper" in a tab bar, etc. Today, if those leak only on the list/detail screen and the toggle screen looks fine, Stage 2 PASSes and the leak ships. The screenshots that would have caught it are on disk, just never sent to the judge.
Proposal
Judge N representative screenshots per Stage 2 walk, not just one, and surface per-screenshot scores in report.json.
Minimal version:
- In `runOnePlatform` (src/validation/stage2-judge.ts:113), iterate `runLayer3` over a curated subset of `scenario.screenshots`, e.g. the first screen with domain content, the post-toggle screen, and the last screen. Don't blindly judge all of them (cost: median-of-3 × N × criteria).
- Extend `Stage2PlatformReport` (src/agents/types.ts:83) so `layer3Scores` becomes an array of `{ screenshot, scores }` instead of one flat field. Either keep `representativeScreenshot` for backward compatibility with the HTML report, or remove it after migrating src/report/render.ts:155.
- `Stage2PlatformReport.pass` flips from `scenario.ok && layer3.pass` to `scenario.ok && everyJudgedScreen.pass`. Any judged screen failing the rubric fails Stage 2.
- Mark each judged screen in the HTML report so the operator can see which step's screenshot triggered the failure.
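A minimal sketch of the minimal version above, not the actual implementation. The types are simplified stand-ins for what lives in src/agents/types.ts, and `JudgedScreen`, `JudgeFn`, `judgeScreens`, `stage2Pass`, and `firstFailure` are hypothetical names. The judge is injected as a function so the loop does not assume anything about `runLayer3` beyond "takes a screenshot, returns pass plus scores". Treating an empty judged list as a failure is an extra assumption, not something the proposal states.

```ts
// Simplified stand-in for the Layer 3 result; the real shape may differ.
type Layer3Result = { pass: boolean; scores: Record<string, number> };

// Hypothetical per-screenshot entry replacing the flat layer3Scores field.
type JudgedScreen = {
  screenshot: string;
  scores: Record<string, number>;
  pass: boolean;
};

// Injected judge; in the real code this would wrap runLayer3 with rubric and spec.
type JudgeFn = (screenshotPath: string) => Promise<Layer3Result>;

// Judge each curated screenshot in walk order, one entry per screen.
async function judgeScreens(
  screenshots: string[],
  judge: JudgeFn,
): Promise<JudgedScreen[]> {
  const judged: JudgedScreen[] = [];
  for (const screenshot of screenshots) {
    const result = await judge(screenshot);
    judged.push({ screenshot, scores: result.scores, pass: result.pass });
  }
  return judged;
}

// Stage 2 passes only if the walk succeeded and every judged screen passed.
// Assumption: an empty list fails, so a walk with nothing judged cannot pass silently.
function stage2Pass(scenarioOk: boolean, judged: JudgedScreen[]): boolean {
  return scenarioOk && judged.length > 0 && judged.every((s) => s.pass);
}

// First failing screenshot, for marking the offending step in the HTML report.
function firstFailure(judged: JudgedScreen[]): string | undefined {
  return judged.find((s) => !s.pass)?.screenshot;
}
```

Sequential judging keeps the report entries in walk order; running the calls concurrently would be a separate decision with its own rate-limit and cost implications.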
Open questions:
- Which screens count as "representative"? Probably annotate scenario steps with a `judgeable: true` flag so the scenario author picks (queue.ts knows which step shows domain content).
- Cost ceiling: Stage 2 today is one Layer 3 call per platform × 2 platforms × 3 samples (median-of-3) ≈ 6 vision samples per run. Judging 3 screens per platform pushes that to ~18 samples per run. Worth measuring; may want a config knob.
- Does this also let us drop walk-app's hypothetical `judge` command? Yes — once Stage 2 covers the right screens, there's no gap for walk-app to fill (see prior discussion: the right home for mid-flow judging is the path that already owns `report.json`, not a second skill).
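One way the first two open questions could fit together, sketched under assumptions: only the `judgeable` flag comes from the text above. The `ScenarioStep` shape, `selectJudgedScreens`, `estimateVisionSamples`, and the `maxScreens` knob are hypothetical, and always keeping the last screen is one possible policy, not a decided one.

```ts
// Hypothetical step shape; only `judgeable` is taken from the open question above.
type ScenarioStep = { name: string; screenshot: string; judgeable?: boolean };

// Pick author-flagged screenshots in walk order, always keep the final screen,
// and cap the total at maxScreens (the possible config knob).
function selectJudgedScreens(steps: ScenarioStep[], maxScreens: number): string[] {
  const lastStep = steps[steps.length - 1];
  if (lastStep === undefined || maxScreens <= 0) return [];
  const last = lastStep.screenshot;
  const flagged = steps
    .filter((s) => s.judgeable === true)
    .map((s) => s.screenshot)
    .filter((p) => p !== last);
  // Drop duplicate paths while preserving walk order.
  const unique = flagged.filter((p, i) => flagged.indexOf(p) === i);
  // Reserve one slot for the last screen.
  return unique.slice(0, maxScreens - 1).concat(last);
}

// Vision samples per run: screens × platforms × samples per Layer 3 call.
// samplesPerCall defaults to 3 to mirror the median-of-3 figure above.
function estimateVisionSamples(
  screensPerPlatform: number,
  platforms: number,
  samplesPerCall = 3,
): number {
  return screensPerPlatform * platforms * samplesPerCall;
}

// Matches the figures in the cost question: 1 screen -> 6, 3 screens -> 18.
const today = estimateVisionSamples(1, 2); // 6
const proposed = estimateVisionSamples(3, 2); // 18
```

With a cap in place, the cost of a run is bounded by `maxScreens` regardless of how many steps a scenario author flags.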
Out of scope
- Adding new screens to the scenario itself. That's src/validation/scenarios/queue.ts work and a different change.
- Live walk-app integration with the judge. Decided against in conversation 2026-05-28 — Stage 2 extension is the cleaner fix.