Stage 2: judge intermediate scenario screenshots, not just the last #114

Description

@dadachi

Context

Layer 3 currently scores at most two screens per run:

  • Stage 1 — the post-launch home screen (substrate-leak + renders-cleanly rubric).
  • Stage 2 — the last screenshot captured during the scripted mobile-mcp walk (domain-content + substrate-leak rubric).

Stage 2's walker (runStage2Scenario in src/validation/stage2.ts) already captures a screenshot at every meaningful step of the scenario; they all sit in scenario.screenshots: string[]. The judge then drops everything except the last one at src/validation/stage2-judge.ts:167:

```ts
const representative = scenario.screenshots[scenario.screenshots.length - 1]!;
const layer3 = await runLayer3({ screenshotPath: representative, rubric: args.rubric, spec: args.spec });
```

This leaves mid-flow screens (list view, detail view, post-toggle confirmation) unjudged even though they're already on disk.

Problem

Mid-flow screens are exactly where domain-rename regressions show up — "Edit Shop" in a form field, "Shopkeeper" in a tab bar, etc. Today, if those leak only on the list/detail screen and the toggle screen looks fine, Stage 2 PASSes and the leak ships. The screenshots that would have caught it are on disk, just never sent to the judge.

Proposal

Judge N representative screenshots per Stage 2 walk, not just one, and surface per-screenshot scores in report.json.

Minimal version:

  1. In runOnePlatform (src/validation/stage2-judge.ts:113), iterate runLayer3 over a curated subset of scenario.screenshots — e.g. first screen with domain content, the post-toggle screen, and the last screen. Don't blindly judge all of them (cost: median-of-3 × N × criteria).
  2. Extend Stage2PlatformReport (src/agents/types.ts:83) so layer3Scores becomes an array of { screenshot, scores } entries instead of a single flat set of scores. Either keep representativeScreenshot for backward compatibility with the HTML report, or remove it once src/report/render.ts:155 has been migrated.
  3. Stage2PlatformReport.pass flips from scenario.ok && layer3.pass to scenario.ok && everyJudgedScreen.pass. Any judged screen failing the rubric fails Stage 2.
  4. Mark each judged screen in the HTML report so the operator can see which step's screenshot triggered the failure.
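
A minimal sketch of steps 1 to 4, under stated assumptions: the `JudgedScreen` and `Stage2PlatformReportSketch` shapes, the injected `Judge` function (a stand-in for runLayer3), and the `buildReport` and `failingScreens` helpers are all hypothetical names for illustration. The real types in src/agents/types.ts and the real runLayer3 signature may differ.

```ts
// Hypothetical result shape; the real runLayer3 return type may differ.
interface Layer3Result {
  pass: boolean;
  scores: Record<string, number>;
}

// One entry per judged screenshot (step 2 of the proposal).
interface JudgedScreen {
  screenshot: string; // path of the screenshot on disk
  pass: boolean;
  scores: Record<string, number>;
}

// Hypothetical stand-in for Stage2PlatformReport.
interface Stage2PlatformReportSketch {
  pass: boolean;
  layer3Scores: JudgedScreen[];
  representativeScreenshot: string; // kept until the HTML report migrates
}

// Injected judge so the sketch runs without a vision model.
type Judge = (screenshotPath: string) => Promise<Layer3Result>;

// Step 3: Stage 2 passes only if the walk succeeded and every judged screen passed.
function buildReport(scenarioOk: boolean, judged: JudgedScreen[]): Stage2PlatformReportSketch {
  if (judged.length === 0) {
    throw new Error("at least one screenshot must be judged");
  }
  return {
    pass: scenarioOk && judged.every((screen) => screen.pass),
    layer3Scores: judged,
    representativeScreenshot: judged[judged.length - 1]!.screenshot,
  };
}

// Step 1: judge a curated subset sequentially, in walk order.
async function judgeScreens(
  scenarioOk: boolean,
  screenshots: string[],
  judge: Judge,
): Promise<Stage2PlatformReportSketch> {
  const judged: JudgedScreen[] = [];
  for (const screenshot of screenshots) {
    const result = await judge(screenshot);
    judged.push({ screenshot, pass: result.pass, scores: result.scores });
  }
  return buildReport(scenarioOk, judged);
}

// Step 4: the screenshots the HTML report would need to mark as failures.
function failingScreens(report: Stage2PlatformReportSketch): string[] {
  return report.layer3Scores.filter((s) => !s.pass).map((s) => s.screenshot);
}
```

Injecting the judge keeps the aggregation rule testable on its own: a single failing mid-flow screen flips `pass` to false even when the last screen is clean.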

Open questions:

  • Which screens count as "representative"? Probably annotate scenario steps with a `judgeable: true` flag so the scenario author picks (queue.ts knows which step shows domain content).
  • Cost ceiling: Stage 2 today makes one Layer 3 call per platform, which at median-of-3 sampling across 2 platforms is ~6 vision samples. Judging 3 screens per platform pushes that to ~18 samples per run. Worth measuring; may want a config knob.
  • Does this also let us drop walk-app's hypothetical `judge` command? Yes — once Stage 2 covers the right screens, there's no gap for walk-app to fill (see prior discussion: the right home for mid-flow judging is the path that already owns `report.json`, not a second skill).
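
To make the first two open questions concrete, here is a rough sketch of how a `judgeable` flag and a cost knob might interact. Everything in it is an assumption: the `ScenarioStepSketch` shape, the `selectJudgeable` and `visionSamples` helpers, the `maxScreens` knob, and the choice to always reserve one slot for the last screenshot (which preserves today's behaviour) are illustrative, not the existing scenario API.

```ts
// Hypothetical scenario step; the real step type in queue.ts may differ.
interface ScenarioStepSketch {
  name: string;
  screenshot: string;  // path captured at this step
  judgeable?: boolean; // set by the scenario author
}

// Pick the screenshots to judge: flagged steps in walk order, capped by
// maxScreens, with one slot always reserved for the last screenshot.
function selectJudgeable(steps: ScenarioStepSketch[], maxScreens: number): string[] {
  if (steps.length === 0 || maxScreens < 1) return [];
  const last = steps[steps.length - 1]!.screenshot;
  const flagged = steps
    .filter((s) => s.judgeable === true && s.screenshot !== last)
    .map((s) => s.screenshot);
  return [...flagged.slice(0, maxScreens - 1), last];
}

// Vision samples per run, counted the way the cost bullet counts them.
// Multiply further by the criteria count if each criterion is sampled separately.
function visionSamples(screensPerPlatform: number, platforms: number, samplesPerCall = 3): number {
  return screensPerPlatform * platforms * samplesPerCall;
}

const steps: ScenarioStepSketch[] = [
  { name: "home", screenshot: "home.png" },
  { name: "list", screenshot: "list.png", judgeable: true },
  { name: "detail", screenshot: "detail.png", judgeable: true },
  { name: "toggle", screenshot: "toggle.png", judgeable: true },
  { name: "final", screenshot: "final.png" },
];

const picked = selectJudgeable(steps, 3); // ["list.png", "detail.png", "final.png"]
const today = visionSamples(1, 2);        // 6
const proposed = visionSamples(picked.length, 2); // 18
```

With `maxScreens` set to 1 the selection collapses to the last screenshot only, so the knob can fall back to the current cost without a separate code path.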

Out of scope

  • Adding new screens to the scenario itself. That's src/validation/scenarios/queue.ts work and a different change.
  • Live walk-app integration with the judge. Decided against in conversation 2026-05-28 — Stage 2 extension is the cleaner fix.
