callstackincubator
diff --git a/AGENTS.md b/AGENTS.md
Lines changed: 15 additions & 13 deletions
diff --git a/skills/agent-device/SKILL.md b/skills/agent-device/SKILL.md
Lines changed: 1 addition & 1 deletion
diff --git a/skills/dogfood/SKILL.md b/skills/dogfood/SKILL.md
Lines changed: 1 addition & 1 deletion
diff --git a/src/utils/__tests__/args.test.ts b/src/utils/__tests__/args.test.ts
Lines changed: 16 additions & 1 deletion
diff --git a/src/utils/command-schema.ts b/src/utils/command-schema.ts
Lines changed: 16 additions & 5 deletions
diff --git a/test/skillgym/README.md b/test/skillgym/README.md
Lines changed: 5 additions & 5 deletions
@@ -153,7 +153,9 @@ Command-only flags (like `find --first`) that don't flow to the platform layer o
 - Do not duplicate `makeSessionStore`, `makeSession`, or device constants when a shared helper already exists.
 
 ## Testing Matrix
-- Docs/skills only: no tests required.
+- Docs/skills only: no tests required unless a more specific rule below applies.
+- CLI help/guidance changes in `src/utils/command-schema.ts`: run `pnpm exec vitest run src/utils/__tests__/args.test.ts`.
+- SkillGym prompt/assertion changes: run the touched `--case` checks; for broad validation, run cases in batches of 20 or fewer because full-suite runs can hang.
 - Non-TS, no behavior impact: no tests unless requested.
 - Keep tests behavioral; do not assert shapes or cases TypeScript already proves.
 - Any TS change: `pnpm typecheck` or `pnpm check:quick`.
@@ -182,18 +184,18 @@ Command-only flags (like `find --first`) that don't flow to the platform layer o
 - Changing `tsconfig.lib.json`/build tooling without running `pnpm check:tooling`; declaration generation is stricter than `tsc --noEmit`.
 
 ## Docs & Skills
-- For behavior/CLI surface changes, evaluate docs/skills updates.
-- Update `README.md` and relevant `website/docs/**` pages for command behavior/flags/aliases/workflows.
-- Update relevant `skills/**/SKILL.md` when usage examples/workflow recommendations change.
-- Keep skill docs task-first:
-  - top-level `SKILL.md` should stay a thin router, not a full manual.
-  - keep detailed workflows/troubleshooting in a `references/` folder instead of growing the router.
-  - isolate true platform/infra exceptions (for example macOS-only or remote-tenancy-only guidance) in dedicated files.
-  - do not delete high-value operational guidance during refactors; move or condense it unless the behavior is obsolete.
-- Optimize skills for cheap, less capable models:
-  - keep routing explicit, shallow, and easy to follow in one pass.
-  - prefer short task-first steps, concrete commands, and low-ambiguity wording over dense prose.
-  - avoid long reference chains or “figure it out” guidance when a direct next action can be stated.
+- Versioned CLI help is the agent-facing source of truth. Put workflow guidance in `src/utils/command-schema.ts` help topics and assert important copy in `src/utils/__tests__/args.test.ts`.
+- Skills are thin routers. Keep `skills/**/SKILL.md` focused on when to use the skill, version gating, which `agent-device help <topic>` page to read, and a short default loop. Do not duplicate full CLI manuals in skills.
+- For behavior/CLI surface changes, update `README.md`, relevant `website/docs/**`, and router skills only when their short routing guidance or version assumptions change.
+- For command-planning guidance changes, update `test/skillgym/suites/agent-device-smoke-suite.ts` when the change should alter what an agent plans.
+- Keep SkillGym cases behavioral and command-planning oriented. Prefer prompts that assert the user-visible contract and expected command family over brittle exact output, but forbid known bad patterns.
+- Build before SkillGym when local CLI help is needed: `pnpm build`, then `pnpm exec skillgym run ... --case <id>`.
+- Run SkillGym broad validation in batches of 20 cases or fewer using repeated `--case` runs; do not rely on one full-suite invocation for large runs.
+- Preserve current high-value workflow guidance:
+  - iOS Expo Go dogfood: prefer `agent-device open "Expo Go" <url> --platform ios` when the shell is known, then `snapshot -i` to confirm the project UI rather than the runner splash.
+  - `keyboard dismiss` is best-effort on iOS; prefer a visible app dismiss control, or `back --system` only when system navigation is acceptable.
+  - Empty replacement is not a supported clear-field command; do not document or test `fill <target> ""` as clearing. Prefer visible clear/reset controls or report the tool gap.
+  - Mutating commands against one session must run serially. Parallelize only read-only commands or commands on separate sessions/devices.
 - In final summaries, state whether docs/skills were updated; if not, explain why.
 
 ## When Blocked
 
@@ -31,4 +31,4 @@ agent-device help dogfood
 
 Default loop: `open -> snapshot/-i -> get/is/find or press/fill/scroll/wait -> verify -> close`.
 
-Keep refs current, prefer selectors/refs over coordinates, use `fill` to replace text, and use `back` for app-owned navigation. Let `help workflow` provide the exact command shapes.
+Keep refs current, prefer selectors/refs over coordinates, use `fill` to replace text, and use `back` for app-owned navigation. Serialize mutating commands against one session; only parallelize read-only work or separate sessions. Let `help workflow` provide the exact command shapes, including Expo Go, keyboard, and clear-field limits.
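The serialization rule added in this hunk can be checked mechanically before a plan runs. The sketch below is illustrative only and is not part of agent-device: `PlannedCommand` and `sessionsNeedingSerialRun` are hypothetical names, the stateful command list is copied from the workflow help text in this PR, and the check assumes (conservatively) that a stateful command should not share a parallel group with any other command on the same session.

```typescript
// Command names the workflow help text in this PR lists as stateful.
const STATEFUL = new Set([
  'open', 'press', 'fill', 'type', 'scroll', 'back', 'alert', 'replay', 'batch', 'close',
]);

// Hypothetical shape for one planned command; not an agent-device type.
interface PlannedCommand {
  session: string;
  name: string;
}

// Returns the sessions that would be unsafe if `group` ran in parallel.
// Conservative assumption: a stateful command must not share a parallel
// group with any other command on the same session.
function sessionsNeedingSerialRun(group: PlannedCommand[]): string[] {
  // Group the planned commands by session name.
  const bySession = new Map<string, PlannedCommand[]>();
  for (const command of group) {
    const list = bySession.get(command.session) ?? [];
    list.push(command);
    bySession.set(command.session, list);
  }

  // Flag sessions that mix a stateful command with anything else.
  const unsafe: string[] = [];
  bySession.forEach((commands, session) => {
    const hasStateful = commands.some((c) => STATEFUL.has(c.name));
    if (hasStateful && commands.length > 1) unsafe.push(session);
  });
  return unsafe;
}
```

A group with `fill` and `press` on one session is flagged, while read-only commands on one session, or stateful commands spread across separate sessions, pass.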
@@ -22,4 +22,4 @@ agent-device help dogfood
 
 Loop: open named session -> snapshot -i + screenshot -> explore flows -> capture evidence per issue -> close.
 
-Target app is required; infer platform or ask. Default output is `./dogfood-output/`. Findings must come from runtime behavior, not source reads. Re-snapshot after mutations. Use logs, network, trace, perf, overlay screenshots, or react-devtools only when they add evidence.
+Target app is required; infer platform or ask. Default output is `./dogfood-output/`. Findings must come from runtime behavior, not source reads. Re-snapshot after mutations. Keep mutating commands serial within one session; use `help dogfood` and `help workflow` for Expo Go, keyboard, and clear-field limits. Use logs, network, trace, perf, overlay screenshots, or react-devtools only when they add evidence.
@@ -796,9 +796,11 @@ test('usage includes agent workflows, config, environment, and examples footers'
   assert.match(usageText, /Truncated text\/input preview: expand first with snapshot -s @e12/);
   assert.match(usageText, /RN warning\/error overlays can block taps: snapshot -i/);
   assert.match(usageText, /Expo Go\/dev clients: use the provided URL when given/);
-  assert.match(usageText, /if only a target name is given, open that target/);
+  assert.match(usageText, /on iOS prefer open "Expo Go" <url>/);
   assert.match(usageText, /Install flows: install\/install-from-source first/);
   assert.match(usageText, /fill 'id="field-email"' "qa@example\.com" replaces/);
+  assert.match(usageText, /do not use fill <target> ""/);
+  assert.match(usageText, /Run mutating commands serially against one session/);
   assert.match(usageText, /After mutation: diff snapshot -i/);
   assert.match(usageText, /app-owned back uses back/);
   assert.match(usageText, /logs clear --restart\/mark\/path/);
@@ -856,13 +858,24 @@ test('usageForCommand resolves workflow help topic', () => {
   assert.match(help, /report that gap instead of typing\/searching\/navigating/);
   assert.match(help, /If snapshot -i shows one, dismiss\/close its visible control/);
   assert.match(help, /iOS Allow Paste prompt cannot be exercised under XCUITest/);
+  assert.match(help, /Empty replacement is not a supported clear-field command/);
+  assert.match(help, /do not plan fill <target> ""/);
+  assert.match(help, /iOS keyboard dismiss is best-effort/);
+  assert.match(help, /UNSUPPORTED_OPERATION/);
+  assert.match(help, /Stateful commands against one --session must run serially/);
+  assert.match(
+    help,
+    /Do not run open\/press\/fill\/type\/scroll\/back\/alert\/replay\/batch\/close commands in parallel/,
+  );
   assert.match(help, /agent-device clipboard write "some text"/);
   assert.match(help, /trusted ADB keyboard IME/);
   assert.match(help, /if no URL is provided but a target\/app name is provided, open that target/);
   assert.match(help, /do not split clear\/restart/);
   assert.match(help, /do not write network log headers/);
   assert.match(help, /agent-device open exp:\/\/127\.0\.0\.1:8081 --platform ios/);
   assert.match(help, /agent-device open "Expo Go" exp:\/\/127\.0\.0\.1:8081 --platform ios/);
+  assert.match(help, /direct URL open can report success while leaving the runner\/shell focused/);
+  assert.match(help, /verify with snapshot -i after opening/);
   assert.match(help, /agent-device open exp:\/\/127\.0\.0\.1:8081 --platform android/);
   assert.match(help, /apps lookup misses the project but shows Expo Go\/dev-client/);
   assert.match(help, /metro prepare --kind expo/);
@@ -909,6 +922,8 @@ test('usageForCommand resolves dogfood help topic', () => {
   assert.match(help, /Static\/on-load issues can use one screenshot/);
   assert.match(help, /React Native warning\/error overlays can be real findings/);
   assert.match(help, /Expo Go\/dev-client shells/);
+  assert.match(help, /Keep stateful commands serial within the same session/);
+  assert.match(help, /prefer agent-device open "Expo Go" <url>/);
   assert.match(help, /dogfood-output\/report\.md/);
   assert.match(help, /ID, severity, category, title, affected flow\/screen/);
   assert.match(help, /Never delete screenshots, videos, traces, or report artifacts/);
 
@@ -180,9 +180,11 @@ const AGENT_QUICKSTART_LINES = [
   'Read-only visible/state question: use snapshot/get/is/find; use snapshot -i only when refs are needed.',
   'Truncated text/input preview: expand first with snapshot -s @e12, not get text.',
   'RN warning/error overlays can block taps: snapshot -i, dismiss/close, then diff snapshot -i.',
-  'Expo Go/dev clients: use the provided URL when given; if only a target name is given, open that target and do not search project files for a URL.',
+  'Expo Go/dev clients: use the provided URL when given; on iOS prefer open "Expo Go" <url> when the host shell is known.',
   'Install flows: install/install-from-source first, then open the installed id with --relaunch.',
   'Text: fill \'id="field-email"\' "qa@example.com" replaces; type appends after press.',
+  'Clearing text: do not use fill <target> ""; use a visible clear/reset control or report that clearing is unsupported.',
+  'Run mutating commands serially against one session; parallelize only read-only commands or separate sessions.',
   'Clipboard limits: iOS Allow Paste cannot be automated through XCUITest; prefill with clipboard write. Android non-ASCII should use fill/type, not raw adb input.',
   'After mutation: diff snapshot -i. Off-screen hints: scroll, then snapshot -i.',
   'Raw coordinates are fallback-only: use snapshot -i -c --json rects when iOS refs no-op or child refs are missing.',
@@ -283,11 +285,17 @@ Text entry:
     agent-device fill 'id="field-email"' "qa@example.com"
     agent-device press 'id="product-note"'
     agent-device type "Handle with care" --delay-ms 80
-  Debounced field with no result selector: agent-device wait 1000. Keyboard read-only: keyboard status/get. Blocked control: keyboard dismiss.
+  Empty replacement is not a supported clear-field command: do not plan fill <target> "" or fill <target> ''. Prefer a visible clear/reset control; if the app exposes none, report the tool gap instead of inventing a clear command.
+  Debounced field with no result selector: agent-device wait 1000. Keyboard read-only: keyboard status/get. Blocked control: try keyboard dismiss when supported.
+  iOS keyboard dismiss is best-effort and can return UNSUPPORTED_OPERATION when no native dismiss gesture/control is available. Prefer a visible app dismiss control, or use back --system only when system navigation is an acceptable side effect.
   Search-as-you-type fields on iOS can drop characters when driven too fast; use --delay-ms on fill/type before trying clipboard paste.
   iOS Allow Paste prompt cannot be exercised under XCUITest. To test paste-driven app behavior, prefill first with agent-device clipboard write "some text"; test the system prompt manually.
   Android non-ASCII can fail on some system images. Try fill/type normally; agent-device uses safer fallbacks. If the shell reports unsupported non-ASCII input, configure a trusted ADB keyboard IME outside the command plan and restore the previous IME afterward.
 
+Session ordering:
+  Stateful commands against one --session must run serially. Do not run open/press/fill/type/scroll/back/alert/replay/batch/close commands in parallel against the same session.
+  It is fine to parallelize independent read-only collection or commands that use different sessions/devices.
+
 Read-only and waits:
   Read-only visible/state question: use snapshot/get/is/find.
   agent-device snapshot
@@ -334,9 +342,11 @@ React Native dev loop:
     agent-device find "Home"
   Do not use agent-device reload. Use open --relaunch for native startup reset.
   Warning/error overlays can obscure UI and intercept taps. If snapshot -i shows one, dismiss/close its visible control (for example Dismiss or Close) if it is not the task target, then diff snapshot -i or snapshot -i before tapping the real UI.
-  Expo Go is a host shell. Use a provided project URL instead of inventing a bundle id; if no URL is provided but a target/app name is provided, open that target and do not inspect project files to find one. iOS simulators can open a URL directly; use host + URL when targeting a specific host shell:
-    agent-device open exp://127.0.0.1:8081 --platform ios
+  Expo Go is a host shell. Use a provided project URL instead of inventing a bundle id; if no URL is provided but a target/app name is provided, open that target and do not inspect project files to find one. On iOS, prefer host + URL when the host shell is known because direct URL open can report success while leaving the runner/shell focused; verify with snapshot -i after opening:
     agent-device open "Expo Go" exp://127.0.0.1:8081 --platform ios
+    agent-device snapshot -i --platform ios
+  Direct iOS URL open remains valid when no host shell is known, but verify that the app UI loaded:
+    agent-device open exp://127.0.0.1:8081 --platform ios
   Android uses the URL target directly; do not write open <app> <url> there:
     agent-device open exp://127.0.0.1:8081 --platform android
   If apps lookup misses the project but shows Expo Go/dev-client and a project URL is available, open the URL/host shell; if no URL is available, ask instead of inventing an app id.
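The platform split described in this hunk can be summarized as a small planning function. This is a sketch under stated assumptions, not code from the repository: `planExpoOpen` and `Platform` are hypothetical names, and using `snapshot -i` as the verification step after a direct iOS URL open is an assumption drawn from the "verify with snapshot -i after opening" wording rather than an explicit command in the help text.

```typescript
type Platform = 'ios' | 'android';

// Hypothetical planner: returns the command lines an agent might plan for
// opening an Expo project URL. Not part of agent-device.
function planExpoOpen(platform: Platform, url: string, hostShell?: string): string[] {
  if (platform === 'ios' && hostShell) {
    // iOS with a known host shell: prefer host + URL, then confirm the
    // project UI loaded instead of the runner/shell.
    return [
      `agent-device open "${hostShell}" ${url} --platform ios`,
      'agent-device snapshot -i --platform ios',
    ];
  }
  if (platform === 'ios') {
    // Direct URL open remains valid when no host shell is known.
    // Assumption: snapshot -i is the verification step.
    return [
      `agent-device open ${url} --platform ios`,
      'agent-device snapshot -i --platform ios',
    ];
  }
  // Android uses the URL target directly; never plan open <app> <url> there.
  return [`agent-device open ${url} --platform android`];
}
```

Note that a host shell passed for Android is deliberately ignored, matching the "do not write open <app> <url> there" guidance.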
@@ -536,11 +546,12 @@ Loop:
   4. Map top-level navigation, then exercise primary flows and edge states.
   5. For each issue, capture evidence and write the finding immediately, then continue.
   6. Close the session and reconcile the report summary.
+  Keep stateful commands serial within the same session. Parallel runs can pollute text fields, focus, alerts, and navigation state.
 
 Coverage:
   Navigation, forms, empty/error/loading states, offline or retry behavior, permissions, settings, accessibility labels, orientation/keyboard, and obvious performance stalls.
   React Native warning/error overlays can be real findings or test blockers. Capture them, dismiss if unrelated, re-snapshot, and report them.
-  Expo Go/dev-client shells: use the provided exp:// or dev-client URL and record whether the shell, project load, or app UI is being tested.
+  Expo Go/dev-client shells: use the provided exp:// or dev-client URL and record whether the shell, project load, or app UI is being tested. On iOS dogfood, prefer agent-device open "Expo Go" <url> when Expo Go is the known shell, then snapshot -i to confirm the project UI rather than the runner splash.
   Categories: visual, functional, UX, content, performance, diagnostics, permissions, accessibility.
   Severity: critical blocks a core flow/data/crashes; high breaks a major feature; medium has friction or workaround; low is polish.
 
 
@@ -16,7 +16,7 @@ The included suite focuses on the first two layers so it stays stable and CI-saf
 
 - `../../examples/test-app/`: minimal Expo SDK 55 fixture app for broad UI coverage
 - `skillgym.config.ts`: starter config that runs Codex and Claude Haiku against this repo
-- `suites/agent-device-smoke-suite.ts`: 66-case suite for skill routing, fixture-aware planning, and skill-guidance regressions
+- `suites/agent-device-smoke-suite.ts`: planning suite for skill routing, fixture-aware flows, and skill-guidance regressions
 
 ## Current coverage
 
@@ -28,19 +28,19 @@ Fixture smoke cases cover concrete app surfaces:
 - banners, alerts, toggles, and quick actions on Home
 - search debounce, filters, long-list scroll, favorites, and cart updates in Catalog
 - detail navigation, quantity edits, note append, and save-to-cart on Product
-- form validation, success submit, keyboard dismiss, and reset on Checkout form
+- form validation, success submit, iOS keyboard-dismiss fallback, and reset on Checkout form
 - diagnostics load/error/retry plus reset alert handling in Settings
 - accessibility audit via screenshot + snapshot
 
 Skill-guidance regression cases cover distinct command-planning habits:
 
 - read-only inspection versus mutation
 - fresh `@ref` targeting, durable selectors, raw-rect fallbacks, and off-screen scroll recovery
-- text replacement, append semantics, keyboard status, and keyboard dismiss
-- install/open setup, app discovery, session scoping, and app-owned navigation fallbacks
+- text replacement, append semantics, supported field clearing, keyboard status, and keyboard fallback
+- install/open setup, Expo Go host-shell launch, app discovery, session scoping, and app-owned navigation fallbacks
 - Metro reload, logs, network dump, alert fallback, and screenshot evidence
 - performance metrics, React DevTools profiling, gestures, settings, and trace capture
-- remote config, macOS menu bar surfaces, replay update, and batch schema/recording
+- remote config, macOS menu bar surfaces, replay update, same-session mutation ordering, and batch schema/recording
 
 `assertAgentDeviceEvidence` is intentionally soft when a runner does not expose skill-detection telemetry. When telemetry exists, the suite asserts that `agent-device` was loaded; when it is absent, the cases still judge command-planning output instead of failing on missing runner metadata.
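The AGENTS.md change in this PR asks for broad SkillGym validation in batches of 20 cases or fewer. A minimal sketch of that batching step follows; `batchCases` is a hypothetical helper and the case ids are made up. How each batch maps onto `--case` arguments is left to the caller, since the diff does not spell out the full `skillgym run` invocation.

```typescript
// Splits case ids into batches no larger than `size` (20 by default, per the
// AGENTS.md guidance in this PR). Hypothetical helper, not part of SkillGym.
function batchCases(caseIds: string[], size = 20): string[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError('size must be a positive integer');
  }
  const batches: string[][] = [];
  for (let i = 0; i < caseIds.length; i += size) {
    // slice clamps at the end of the array, so the last batch may be shorter.
    batches.push(caseIds.slice(i, i + size));
  }
  return batches;
}

// Example with made-up ids: 45 cases become batches of 20, 20, and 5.
const exampleIds = Array.from({ length: 45 }, (_, i) => `case-${i + 1}`);
const exampleBatches = batchCases(exampleIds);
```

Running each batch as its own invocation keeps a hang confined to at most 20 cases instead of stalling a full-suite run.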