From 277c80dc954ed93ffe40e3477fa6161daf03ee00 Mon Sep 17 00:00:00 2001
From: dadachi
Date: Thu, 28 May 2026 14:28:08 +0900
Subject: [PATCH] =?UTF-8?q?docs(README):=20add=20Runtime=20section=20?=
 =?UTF-8?q?=E2=80=94=20wall-clock=20+=20cost=20per=20VISUAL=20level?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Surface the end-to-end runtime for VISUAL=0/1/2 measured on a 2021 M1 Max,
with an Approx cost column (VISUAL=0 ≈ $0.05; V1/V2 pending measurement)
and a note that cost scales with model usage, not wall clock.
---
 README.md | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/README.md b/README.md
index 1e80899..34b998e 100644
--- a/README.md
+++ b/README.md
@@ -119,6 +119,20 @@ After the queue cell's Stage-2 hardening landed, sushi and the task tracker pass

 Both screenshots are real captures from the booted iOS Simulator and Android emulator post-`./gradlew assembleDebug` / `xcodebuild build`, after the agent installed and launched the generated app.

+## Runtime
+
+End-to-end wall-clock per run, measured from `report.json`'s `meta.durationMs` on a 2021 M1 Max with both simulators pre-booted:
+
+| `NATIVEAPPTEMPLATE_VISUAL` | What runs | Observed time | Approx cost |
+|---|---|---|---|
+| `0` (default) | Layer 1 (ripgrep) + Layer 2 *fast* (Rails boot probe, iOS/Android type-check) + reviewer | ~2–3 min | ~$0.05 |
+| `1` | + full `xcodebuild build` + `./gradlew assembleDebug` + home-screen capture + Stage 1 vision judge | ~2.5 min (barbershop-queue · 2026-05-24) | — |
+| `2` | + Rails server boot + scripted-CRUD walk via `mobile-mcp` + Stage 2 vision judge | ~7–8 min (sentova / sushi / task-tracker · 2026-05-23) | — |
+
+Cold builds, first-run cocoapods/gradle dependency resolution, or unbooted simulators add a one-time minute or two. The self-repair loop (opt-in, hard-capped at 5 iterations) can extend a failing run substantially — budget for it if you set `NATIVEAPPTEMPLATE_REPAIR=on`.
+
+**Cost** scales with model usage, not wall-clock. The agent makes real `claude-opus-4-7` API calls (planner, workers, reviewer, judge) and a single VISUAL=2 run consumes tens of thousands of tokens across multiple sub-agents. Set a workspace spend cap (see [Security](#security)) as a backstop — the agent exits non-zero on validation failure, but doesn't gate on spend itself.
+
 ## Architecture

 ```mermaid