Commit 59d5ca3

cameroncooke and codex authored
feat(testing): Add headless launch mode (#435)
* feat(testing): Add headless launch mode

  Add an opt-in headless launch policy that prevents macOS and simulator launches from stealing focus during automated runs.

  Update UI automation guidance, snapshot normalization, benchmark suite execution, and fixtures so snapshot and unit tests remain stable across current simulator and Xcode environments.

* fix(benchmarks): Stabilize Claude UI baselines

  Pin Claude UI benchmark suites to the expected model and record model, Claude Code, and timing metadata in benchmark output. Tighten suite prompts and tool guidance so benchmark runs follow the intended baseline sequences.

  Increase post-action snapshot settle headroom to reduce timing-sensitive UI automation recovery after action tools, and refresh simulator list fixtures after removing stale Claude UI benchmark simulators.

  Co-Authored-By: OpenAI Codex <noreply@openai.com>

* test(snapshot): Make temp path normalization deterministic

  Allow snapshot normalizer tests to inject the temp directory used for path normalization. This keeps artifact path expectations stable across macOS and Linux CI hosts.

  Co-Authored-By: OpenAI Codex <codex@openai.com>

---------

Co-authored-by: OpenAI Codex <noreply@openai.com>
Co-authored-by: OpenAI Codex <codex@openai.com>
1 parent 9d56189 commit 59d5ca3

170 files changed

Lines changed: 3175 additions & 1587 deletions

AGENTS.md

Lines changed: 1 addition & 0 deletions
@@ -63,6 +63,7 @@ ESM TypeScript project (`type: module`). Key layers:
 
 ## Test Conventions
 - Vitest with colocated `__tests__/` directories using `*.test.ts`
+- Snapshot tests (`*.snapshot.test.ts`) must only assert generated tool output against fixtures. Move helper, parser, schema, setup, or behavior assertions to non-snapshot unit/integration tests.
 - Smoke tests in `src/smoke-tests/__tests__/` (separate Vitest config, serial execution)
 - Use `vi.mock`/`vi.hoisted` for isolation; inject executors and mock file systems
 - MCP integration tests use `McpServer`, `InMemoryTransport`, and `Client`

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -4,6 +4,7 @@
 
 ### Added
 
+- Added `XCODEBUILDMCP_HEADLESS_LAUNCH` opt-in environment variable that suppresses GUI focus-stealing on macOS: `launch_mac_app` and `build_run_macos` use `open -g` (background launch), Simulator.app GUI launches in `open_sim` and `build_run_sim` are skipped (`simctl boot` continues to run the simulator runtime), and `simulator-management keyboard-shortcut` short-circuits with a clear error because System Events keystrokes inherently require foreground focus. Enabled automatically for snapshot test runs so tests no longer steal window focus.
 - Added `--from-result` to the Claude UI benchmark harness so existing `result.json` artifacts can be rendered as text or JSON without rerunning Claude.
 - Added `nextSteps` hint lines to MCP `structuredContent` and CLI `--output json` envelopes so agents can consume follow-up actions without scraping text. CLI JSON renders shell command lines; MCP structured content renders MCP tool-call hints. Structured result schemas that include `nextSteps` now use schema version 2; existing version 1 schema files remain available for current validators.
 - Added `snapshot_ui sinceScreenHash` / CLI `--since-screen-hash` so callers can skip full runtime snapshot output when the screen hash is unchanged.
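The headless launch entry in this changelog hunk describes an environment-variable gate. A minimal sketch of how such a gate might select the arguments for macOS `open` is shown here. Only the variable name and the use of `open -g` come from the changelog; the helper names `isHeadlessLaunch` and `macOpenArgs`, and the set of values treated as "enabled", are assumptions for illustration and are not taken from the project's code.

```typescript
// Hypothetical sketch: helper names and accepted values are assumptions.
type Env = Record<string, string | undefined>;

// Treat "1" or "true" (case-insensitive) as enabled. The changelog does not
// state which values the real implementation accepts.
function isHeadlessLaunch(env: Env): boolean {
  const raw = env.XCODEBUILDMCP_HEADLESS_LAUNCH?.trim().toLowerCase();
  return raw === '1' || raw === 'true';
}

// Build the argument list for macOS `open`. The `-g` flag asks `open` not to
// bring the application to the foreground.
function macOpenArgs(appPath: string, env: Env): string[] {
  return isHeadlessLaunch(env) ? ['-g', appPath] : [appPath];
}
```

Passing the environment in as a parameter, rather than reading a global, keeps the policy easy to exercise in unit tests without mutating process state.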

CLAUDE.md

Lines changed: 3 additions & 0 deletions
@@ -9,6 +9,9 @@
 - Do not add fallback behavior by default. If required context, configuration, runtime state, or dependencies are missing, fail loudly and fix the caller/setup instead of silently switching to an alternate path. Add a fallback only when explicitly requested or when it is a documented product requirement.
 - Follow TypeScript best practices
 
+## Test Conventions
+- Snapshot tests (`*.snapshot.test.ts`) must only assert generated tool output against fixtures. Move helper, parser, schema, setup, or behavior assertions to non-snapshot unit/integration tests.
+
 ## Commands
 - NEVER commit unless user asks
 

benchmarks/claude-ui/README.md

Lines changed: 15 additions & 4 deletions
@@ -51,6 +51,15 @@ Print machine-readable output from a new run:
 npm run bench:claude-ui -- --suite reminders --json
 ```
 
+Request an exact Claude model for controlled comparisons:
+
+```bash
+npm run bench:claude-ui -- --suite weather --model claude-sonnet-4-7
+npm run bench:claude-ui:xcodebuildmcp -- --model claude-sonnet-4-7
+```
+
+The `--model` CLI option overrides `claude.model` from the suite YAML for that run.
+
 Render an existing result without rerunning Claude:
 
 ```bash
@@ -113,6 +122,7 @@ Suites can override the Claude invocation without changing harness code. Omit th
 
 ```yaml
 claude:
+  model: claude-sonnet-4-7
   useMcpServer: false
   tools:
     - Bash
@@ -128,8 +138,6 @@ claude:
   extraArgs:
     - --setting-sources
     - project,local
-    - --model
-    - sonnet
 toolAnalysis:
   matchers:
     - kind: bashCommand
@@ -145,6 +153,8 @@ toolAnalysis:
       shortName: xcodebuild
 ```
 
+`claude.model` is the canonical suite-level model request. Do not put `--model` or `--model=<value>` in `claude.extraArgs`; the config parser rejects those forms so suite config and CLI overrides cannot disagree. Pass `--model <model>` to override the suite model for controlled comparison runs.
+
 `claude.useMcpServer: false` writes an empty per-run MCP config and passes it with `--strict-mcp-config`, so project/user MCP servers cannot leak into CLI-only benchmark runs. The harness still prepares the simulator lifecycle and exports `CLAUDE_UI_BENCHMARK_SIMULATOR_ID`, `CLAUDE_UI_BENCHMARK_RUN_DIR`, and `CLAUDE_UI_BENCHMARK_WORKING_DIRECTORY` to Claude. `appendSystemPrompt` also supports `{simulatorId}`, `{runDirectory}`, and `{workingDirectory}` placeholders.
 
 `claude.pluginDirs` is passed to Claude as one `--plugin-dir` argument per configured path, resolved from the repository root. Use this for suite-specific local/private CLI skills. `claude.isolatedWorkingDirectory: true` runs Claude from the per-run artifact directory instead of the suite working directory, which prevents repository/project skills from being discovered implicitly. When using an isolated working directory, include absolute `{workingDirectory}` paths in prompts for build commands or project files.
@@ -194,6 +204,7 @@ Each suite renders as a structured report with a task-completion banner, aligned
 COMPLETED weather 1m 38.6s
 suite benchmarks/claude-ui/suites/weather.yml
 artifacts out.nosync/claude-benchmarks/weather/20260522T214044Z
+claude model requested=claude-sonnet-4-7 observed=claude-sonnet-4-7 version=1.2.3
 exit claude=0 parser=0
 
 Metrics
@@ -282,8 +293,8 @@ Each run writes:
 - `mcp-workspace/.xcodebuildmcp/config.yaml` — isolated MCP server config with effective suite defaults
 - `claude.jsonl` — Claude stream JSON output
 - `claude.stderr` — Claude stderr
-- `claude-command.log` — command, cwd, simulator ID, exit status, wall clock
+- `claude-command.log` — command, cwd, simulator ID, requested/observed model, `claude --version`, exit status, wall clock
 - `simulator-lifecycle.log` — temporary simulator create, boot, bootstatus, open, readiness, deletion commands, and simulator ID
 - `parsed/` — files written by `parse_claude_conversation.py`
 - `parse.log` / `parse.log.stderr` — parser output
-- `result.json` — full benchmark result
+- `result.json` — full benchmark result, including requested model, observed model when Claude reports it, and `claude --version` output under `run.claude`
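The README change in this diff states that `claude.model` is the single suite-level model request, that `--model` and `--model=<value>` are rejected inside `claude.extraArgs`, and that the CLI `--model` option overrides the suite value. A minimal sketch of that rule follows. The interface shape, the function names `assertNoModelInExtraArgs` and `resolveRequestedModel`, and the error text are assumptions for illustration; the harness's actual parser is not shown in this diff.

```typescript
// Hypothetical sketch: type and function names are assumptions.
interface ClaudeSuiteConfig {
  model?: string;
  extraArgs?: string[];
}

// Reject both `--model` and `--model=<value>` in extraArgs so the requested
// model has exactly one source in the suite config.
function assertNoModelInExtraArgs(config: ClaudeSuiteConfig): void {
  for (const arg of config.extraArgs ?? []) {
    if (arg === '--model' || arg.startsWith('--model=')) {
      throw new Error('Set claude.model instead of passing --model in claude.extraArgs');
    }
  }
}

// A CLI-provided model, when present, overrides the suite-level claude.model.
function resolveRequestedModel(config: ClaudeSuiteConfig, cliModel?: string): string | undefined {
  assertNoModelInExtraArgs(config);
  return cliModel ?? config.model;
}
```

Failing at parse time, instead of letting the later flag silently win, matches the repository guidance in CLAUDE.md to fail loudly rather than fall back.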
Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
+#!/usr/bin/env tsx
+import { access, readdir } from 'node:fs/promises';
+import path from 'node:path';
+import { main } from '../../src/benchmarks/claude-ui/harness.ts';
+
+async function directoryExists(directory: string): Promise<boolean> {
+  try {
+    await access(directory);
+    return true;
+  } catch {
+    return false;
+  }
+}
+
+async function suitePaths(directory: string): Promise<string[]> {
+  if (!(await directoryExists(directory))) return [];
+  const entries = await readdir(directory, { withFileTypes: true });
+  return entries
+    .filter(
+      (entry) => entry.isFile() && (entry.name.endsWith('.yml') || entry.name.endsWith('.yaml')),
+    )
+    .map((entry) => path.join(directory, entry.name))
+    .sort();
+}
+
+async function run(): Promise<number> {
+  const directory = process.argv[2];
+  const maybeLabel = process.argv[3];
+  const label = maybeLabel && !maybeLabel.startsWith('-') ? maybeLabel : directory;
+  const forwardedArgs =
+    maybeLabel && !maybeLabel.startsWith('-') ? process.argv.slice(4) : process.argv.slice(3);
+  if (!directory) {
+    console.error('Usage: run-directory.ts <suite-directory> [label] [benchmark args...]');
+    return 1;
+  }
+
+  const suites = await suitePaths(directory);
+  if (suites.length === 0) {
+    console.error(`No ${label} Claude UI benchmark suites found in ${directory}`);
+    return 1;
+  }
+
+  for (const suite of suites) {
+    const exitCode = await main(['--suite', suite, ...forwardedArgs]);
+    if (exitCode !== 0) return exitCode;
+  }
+  return 0;
+}
+
+run()
+  .then((exitCode) => {
+    process.exitCode = exitCode;
+  })
+  .catch((error) => {
+    console.error(error instanceof Error ? error.message : String(error));
+    process.exitCode = 1;
+  });
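The `run()` function in the script above treats the fourth `argv` entry as an optional label unless it starts with `-`, in which case it is forwarded to the harness along with the rest. The sketch below isolates that split as a pure function so the two cases are easy to see. `splitArgs` and `ParsedArgs` are hypothetical names introduced here for illustration; the committed script performs the same logic inline.

```typescript
// Hypothetical helper mirroring the inline argument handling in run().
interface ParsedArgs {
  directory: string | undefined;
  label: string | undefined;
  forwardedArgs: string[];
}

function splitArgs(argv: string[]): ParsedArgs {
  const directory = argv[2];
  const maybeLabel = argv[3];
  // A label is present only when the value is non-empty and is not a flag.
  const hasLabel =
    typeof maybeLabel === 'string' && maybeLabel.length > 0 && !maybeLabel.startsWith('-');
  return {
    directory,
    // Without an explicit label, the directory doubles as the label.
    label: hasLabel ? maybeLabel : directory,
    // Everything after the label (or after the directory) goes to the harness.
    forwardedArgs: hasLabel ? argv.slice(4) : argv.slice(3),
  };
}
```

One consequence of this scheme is that a label can never begin with `-`, since such a value is always interpreted as the first forwarded benchmark argument.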

benchmarks/claude-ui/suites/contacts.yml

Lines changed: 12 additions & 0 deletions
@@ -4,6 +4,18 @@ workingDirectory: .
 sessionDefaults:
   bundleId: com.apple.MobileAddressBook
   simulatorName: iPhone 17 Pro Max
+claude:
+  model: claude-opus-4-7
+  appendSystemPrompt: >-
+    Match the benchmark baseline tool pattern for Contacts. Start with session_show_defaults, then
+    launch_app_sim. After launch_app_sim, use this tracked UI sequence: snapshot_ui, tap Add,
+    snapshot_ui, tap Continue if visible, type_text first name,
+    snapshot_ui, type_text last name, type_text company, tap add phone, snapshot_ui, type_text phone,
+    tap add email, snapshot_ui, type_text email, tap Done, snapshot_ui. When a UI action reports
+    SNAPSHOT_CAPTURE_FAILED, call snapshot_ui next. Do not use wait_for_ui for Contacts unless a
+    later tool result explicitly asks for wait_for_ui. Do not tap text fields before type_text. Keep
+    assistant text terse: no progress narration beyond what is needed to choose tools, and keep the
+    final answer to one sentence.
 firstRunPromptDismissals:
   labels:
     - Continue

benchmarks/claude-ui/suites/reminders.yml

Lines changed: 21 additions & 0 deletions
@@ -4,6 +4,27 @@ workingDirectory: .
 sessionDefaults:
   bundleId: com.apple.reminders
   simulatorName: iPhone 17 Pro Max
+claude:
+  model: claude-opus-4-7
+  appendSystemPrompt: >-
+    The benchmark harness has already created, booted, opened, and configured the target simulator.
+    Do not call boot_sim or open_sim unless launch_app_sim reports the simulator is unavailable.
+    Match the benchmark baseline tool pattern for Reminders. If you need tool discovery, load all
+    required tools in one ToolSearch using the full MCP names: mcp__xcodebuildmcp-dev__session_show_defaults,
+    mcp__xcodebuildmcp-dev__launch_app_sim, mcp__xcodebuildmcp-dev__snapshot_ui,
+    mcp__xcodebuildmcp-dev__tap, mcp__xcodebuildmcp-dev__wait_for_ui,
+    mcp__xcodebuildmcp-dev__type_text, mcp__xcodebuildmcp-dev__key_press,
+    mcp__xcodebuildmcp-dev__batch. Do not call ToolSearch again after that. Start with
+    session_show_defaults, then launch_app_sim. After launch_app_sim, use
+    this tracked UI sequence: snapshot_ui, tap Add List,
+    wait_for_ui settled, tap Continue if visible, type_text list name, tap Done, tap the new list,
+    type_text first reminder, key_press keyCode 40, type_text second reminder, key_press keyCode 40,
+    type_text third reminder, tap Done, batch-tap the first and third completion circles. Use exactly
+    two key_press calls, both with keyCode 40; do not use any other key_press. After type_text list
+    name exactly once with the visible list-name elementRef, make the next two tracked UI calls tap Done,
+    then tap the new list; do not call snapshot_ui or wait_for_ui between those two taps. Every type_text
+    call must include an elementRef from the latest visible UI state. Keep assistant text terse: no
+    progress narration beyond what is needed to choose tools, and keep the final answer to one sentence.
 firstRunPromptDismissals:
   labels:
     - Continue

benchmarks/claude-ui/suites/weather.yml

Lines changed: 13 additions & 0 deletions
@@ -5,6 +5,19 @@ sessionDefaults:
   projectPath: Weather.xcodeproj
   scheme: Weather
   simulatorName: iPhone 17 Pro Max
+claude:
+  model: claude-opus-4-7
+  appendSystemPrompt: >-
+    Match the benchmark baseline tool pattern for Weather. If you need tool discovery, load all
+    required tools in one ToolSearch using the full MCP names: mcp__xcodebuildmcp-dev__session_show_defaults,
+    mcp__xcodebuildmcp-dev__build_run_sim, mcp__xcodebuildmcp-dev__snapshot_ui,
+    mcp__xcodebuildmcp-dev__tap, mcp__xcodebuildmcp-dev__batch,
+    mcp__xcodebuildmcp-dev__type_text, mcp__xcodebuildmcp-dev__swipe. Do not call ToolSearch
+    again after that. Start with session_show_defaults, then build_run_sim. After build_run_sim, use
+    this tracked UI sequence: snapshot_ui, tap, batch, tap,
+    tap, type_text, tap, tap, swipe, snapshot_ui, tap. Do not call snapshot_ui immediately after the
+    batch; the next two tracked UI calls after batch should be tap, then tap. Keep assistant text terse:
+    no progress narration beyond what is needed to choose tools, and keep the final answer to one sentence.
 baseline:
   totalToolCalls: 14
   trackedToolCalls: 13

manifests/tools/swipe.yaml

Lines changed: 5 additions & 1 deletion
@@ -3,7 +3,11 @@ module: mcp/tools/ui-automation/swipe
 names:
   mcp: swipe
   cli: swipe
-description: Swipe within a scrollable UI element using a visible element reference from the current UI. Optional distance is a normalized stroke fraction greater than 0 and up to 1.
+description: >-
+  Swipe within a scrollable UI element using withinElementRef from a current rs/1 runtime snapshot.
+  withinElementRef is required; do not use elementRef.
+  Optional distance is a normalized stroke fraction greater than 0 and up to 1.
+  Example input: {"withinElementRef":"e7","direction":"up","distance":0.7}.
 outputSchema:
   schema: xcodebuildmcp.output.ui-action-result
   version: "2"
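The updated `swipe` description constrains the input in three ways: `withinElementRef` is required, `elementRef` must not be used, and the optional `distance` must be greater than 0 and at most 1. The sketch below checks those constraints against the example input from the manifest. `validateSwipeInput` and `SwipeInput` are hypothetical names, and the error messages are invented for illustration; the tool's real schema is not part of this diff, and the set of allowed `direction` values is not stated, so the sketch accepts any string.

```typescript
// Hypothetical validator: not the tool's real schema.
interface SwipeInput {
  withinElementRef: string;
  direction: string;
  distance?: number;
}

function validateSwipeInput(input: Record<string, unknown>): SwipeInput {
  // The manifest says to use withinElementRef, not elementRef.
  if ('elementRef' in input) {
    throw new Error('swipe takes withinElementRef, not elementRef');
  }
  const { withinElementRef, direction, distance } = input;
  if (typeof withinElementRef !== 'string' || withinElementRef.length === 0) {
    throw new Error('withinElementRef is required');
  }
  if (typeof direction !== 'string') {
    throw new Error('direction must be a string');
  }
  const result: SwipeInput = { withinElementRef, direction };
  if (distance !== undefined) {
    // Normalized stroke fraction: greater than 0 and up to 1. NaN fails this check.
    if (typeof distance !== 'number' || !(distance > 0 && distance <= 1)) {
      throw new Error('distance must be greater than 0 and at most 1');
    }
    result.distance = distance;
  }
  return result;
}
```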

manifests/tools/tap.yaml

Lines changed: 2 additions & 1 deletion
@@ -3,7 +3,8 @@ module: mcp/tools/ui-automation/tap
 names:
   mcp: tap
   cli: tap
-description: Tap one elementRef from the latest snapshot_ui or wait_for_ui output. For multiple same-screen taps or visible switch toggles with no intermediate assertion, prefer batch. Other same-screen refs may remain usable after success; refresh after navigation, scrolling, sheet changes, or obvious layout changes.
+description: >-
+  Tap one elementRef from the latest snapshot_ui or wait_for_ui output. The elementRef must list the tap action in the snapshot targets; do not use refs from text-only rows. For multiple same-screen taps or visible switch toggles with no intermediate assertion, use batch instead of repeated tap calls. Other same-screen refs may remain usable after success; refresh after navigation, scrolling, sheet changes, or obvious layout changes.
 outputSchema:
   schema: xcodebuildmcp.output.ui-action-result
   version: '2'
