* feat(testing): Add headless launch mode
Add an opt-in headless launch policy that prevents macOS and simulator
launches from stealing focus during automated runs.
Update UI automation guidance, snapshot normalization, benchmark suite
execution, and fixtures so snapshot and unit tests remain stable across
current simulator and Xcode environments.
* fix(benchmarks): Stabilize Claude UI baselines
Pin Claude UI benchmark suites to the expected model and record model,
Claude Code, and timing metadata in benchmark output. Tighten suite
prompts and tool guidance so benchmark runs follow the intended baseline
sequences.
Increase post-action snapshot settle headroom to reduce timing-sensitive
UI automation recovery after action tools, and refresh simulator list
fixtures after removing stale Claude UI benchmark simulators.
Co-Authored-By: OpenAI Codex <noreply@openai.com>
* test(snapshot): Make temp path normalization deterministic
Allow snapshot normalizer tests to inject the temp directory used for path normalization.
This keeps artifact path expectations stable across macOS and Linux CI hosts.
Co-Authored-By: OpenAI Codex <codex@openai.com>
---------
Co-authored-by: OpenAI Codex <noreply@openai.com>
Co-authored-by: OpenAI Codex <codex@openai.com>
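
The temp-directory injection described in the `test(snapshot)` entry above could look roughly like the following. This is a minimal sketch, not the project's normalizer: the function name `normalizeTempPaths` and the `<TMP>` placeholder are hypothetical, and the real code may normalize paths differently.

```typescript
// Hypothetical sketch: `normalizeTempPaths` and the "<TMP>" placeholder are
// illustrative names, not taken from the repository.
function normalizeTempPaths(output: string, tempDir: string): string {
  // Drop trailing separators so "/tmp/" and "/tmp" are treated the same.
  const root = tempDir.replace(/[\\/]+$/, "");
  // A bare "/" would strip to an empty string; leave the output untouched then.
  if (root.length === 0) return output;
  // split/join replaces every literal occurrence without regex escaping.
  return output.split(root).join("<TMP>");
}

// Tests inject the directory instead of reading it from the host, so the
// expected text is identical on macOS and Linux.
const macOutput = normalizeTempPaths(
  "Wrote /var/folders/ab/T/run-1/result.json",
  "/var/folders/ab/T/",
);
const linuxOutput = normalizeTempPaths("Wrote /tmp/run-1/result.json", "/tmp");
```

In production code the caller would presumably pass the host's real temp directory; taking it as a parameter is what lets a test pin the value.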
 - Vitest with colocated `__tests__/` directories using `*.test.ts`
+- Snapshot tests (`*.snapshot.test.ts`) must only assert generated tool output against fixtures. Move helper, parser, schema, setup, or behavior assertions to non-snapshot unit/integration tests.
 - Smoke tests in `src/smoke-tests/__tests__/` (separate Vitest config, serial execution)
 - Use `vi.mock`/`vi.hoisted` for isolation; inject executors and mock file systems
 - MCP integration tests use `McpServer`, `InMemoryTransport`, and `Client`
CHANGELOG.md (1 addition & 0 deletions)
@@ -4,6 +4,7 @@

 ### Added

+- Added `XCODEBUILDMCP_HEADLESS_LAUNCH` opt-in environment variable that suppresses GUI focus-stealing on macOS: `launch_mac_app` and `build_run_macos` use `open -g` (background launch), Simulator.app GUI launches in `open_sim` and `build_run_sim` are skipped (`simctl boot` continues to run the simulator runtime), and `simulator-management keyboard-shortcut` short-circuits with a clear error because System Events keystrokes inherently require foreground focus. Enabled automatically for snapshot test runs so tests no longer steal window focus.
 - Added `--from-result` to the Claude UI benchmark harness so existing `result.json` artifacts can be rendered as text or JSON without rerunning Claude.
 - Added `nextSteps` hint lines to MCP `structuredContent` and CLI `--output json` envelopes so agents can consume follow-up actions without scraping text. CLI JSON renders shell command lines; MCP structured content renders MCP tool-call hints. Structured result schemas that include `nextSteps` now use schema version 2; existing version 1 schema files remain available for current validators.
 - Added `snapshot_ui sinceScreenHash` / CLI `--since-screen-hash` so callers can skip full runtime snapshot output when the screen hash is unchanged.
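
The `XCODEBUILDMCP_HEADLESS_LAUNCH` entry in this hunk can be pictured with a small sketch. Everything here is an assumption apart from the variable name and the `open -g` flag: the changelog does not say which values enable the flag (the sketch guesses `1` and `true`), and the helper names are hypothetical.

```typescript
type Env = Record<string, string | undefined>;

// Assumption: "1" and "true" enable headless mode; the changelog only says
// the variable is opt-in and does not list accepted values.
function isHeadlessLaunch(env: Env): boolean {
  const value = (env["XCODEBUILDMCP_HEADLESS_LAUNCH"] ?? "").trim().toLowerCase();
  return value === "1" || value === "true";
}

// Hypothetical helper: builds arguments for macOS `open`. The `-g` flag asks
// `open` not to bring the application to the foreground.
function buildOpenArgs(appPath: string, env: Env): string[] {
  return isHeadlessLaunch(env) ? ["-g", appPath] : [appPath];
}

// Hypothetical helper: the Simulator.app GUI launch is skipped when headless,
// while booting the simulator runtime would proceed either way.
function shouldOpenSimulatorApp(env: Env): boolean {
  return !isHeadlessLaunch(env);
}
```

Passing the environment as a parameter, rather than reading a global, keeps the policy testable without mutating process state.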
CLAUDE.md (3 additions & 0 deletions)
@@ -9,6 +9,9 @@
 - Do not add fallback behavior by default. If required context, configuration, runtime state, or dependencies are missing, fail loudly and fix the caller/setup instead of silently switching to an alternate path. Add a fallback only when explicitly requested or when it is a documented product requirement.
 - Follow TypeScript best practices

+## Test Conventions
+- Snapshot tests (`*.snapshot.test.ts`) must only assert generated tool output against fixtures. Move helper, parser, schema, setup, or behavior assertions to non-snapshot unit/integration tests.
benchmarks/claude-ui/README.md (15 additions & 4 deletions)
@@ -51,6 +51,15 @@ Print machine-readable output from a new run:
 npm run bench:claude-ui -- --suite reminders --json
 ```

+Request an exact Claude model for controlled comparisons:
+
+```bash
+npm run bench:claude-ui -- --suite weather --model claude-sonnet-4-7
+npm run bench:claude-ui:xcodebuildmcp -- --model claude-sonnet-4-7
+```
+
+The `--model` CLI option overrides `claude.model` from the suite YAML for that run.
+
 Render an existing result without rerunning Claude:

 ```bash
@@ -113,6 +122,7 @@ Suites can override the Claude invocation without changing harness code. Omit th

 ```yaml
 claude:
+  model: claude-sonnet-4-7
   useMcpServer: false
   tools:
     - Bash
@@ -128,8 +138,6 @@ claude:
   extraArgs:
     - --setting-sources
     - project,local
-    - --model
-    - sonnet
 toolAnalysis:
   matchers:
     - kind: bashCommand
@@ -145,6 +153,8 @@ toolAnalysis:
       shortName: xcodebuild
 ```

+`claude.model` is the canonical suite-level model request. Do not put `--model` or `--model=<value>` in `claude.extraArgs`; the config parser rejects those forms so suite config and CLI overrides cannot disagree. Pass `--model <model>` to override the suite model for controlled comparison runs.
+
 `claude.useMcpServer: false` writes an empty per-run MCP config and passes it with `--strict-mcp-config`, so project/user MCP servers cannot leak into CLI-only benchmark runs. The harness still prepares the simulator lifecycle and exports `CLAUDE_UI_BENCHMARK_SIMULATOR_ID`, `CLAUDE_UI_BENCHMARK_RUN_DIR`, and `CLAUDE_UI_BENCHMARK_WORKING_DIRECTORY` to Claude. `appendSystemPrompt` also supports `{simulatorId}`, `{runDirectory}`, and `{workingDirectory}` placeholders.

 `claude.pluginDirs` is passed to Claude as one `--plugin-dir` argument per configured path, resolved from the repository root. Use this for suite-specific local/private CLI skills. `claude.isolatedWorkingDirectory: true` runs Claude from the per-run artifact directory instead of the suite working directory, which prevents repository/project skills from being discovered implicitly. When using an isolated working directory, include absolute `{workingDirectory}` paths in prompts for build commands or project files.
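
The model precedence rule added in this hunk (`claude.model` as the suite default, `--model` as the per-run override, and no `--model` inside `claude.extraArgs`) can be sketched as follows. The type and function names are hypothetical and the error wording is invented for illustration; only the rule itself comes from the README text.

```typescript
// Hypothetical shape: only `model` and `extraArgs` are modeled here.
interface ClaudeSuiteConfig {
  model?: string;
  extraArgs?: string[];
}

// Reject both `--model <value>` and `--model=<value>` in extraArgs so the
// suite config and the CLI override cannot disagree.
function assertNoModelInExtraArgs(extraArgs: string[]): void {
  for (const arg of extraArgs) {
    if (arg === "--model" || arg.startsWith("--model=")) {
      throw new Error(`Use claude.model instead of ${arg} in claude.extraArgs`);
    }
  }
}

// The CLI value wins when present; otherwise fall back to the suite YAML.
function resolveModel(config: ClaudeSuiteConfig, cliModel?: string): string | undefined {
  assertNoModelInExtraArgs(config.extraArgs ?? []);
  return cliModel ?? config.model;
}

const suite: ClaudeSuiteConfig = {
  model: "claude-sonnet-4-7",
  extraArgs: ["--setting-sources", "project,local"],
};
const fromSuite = resolveModel(suite);
const fromCli = resolveModel(suite, "some-other-model"); // placeholder model name
```

Failing at parse time, rather than letting the last `--model` on the command line win, matches the repository's stated preference for failing loudly over silent fallbacks.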
@@ -194,6 +204,7 @@ Each suite renders as a structured report with a task-completion banner, aligned
 - `simulator-lifecycle.log` — temporary simulator create, boot, bootstatus, open, readiness, deletion commands, and simulator ID
 - `parsed/` — files written by `parse_claude_conversation.py`
 - `parse.log` / `parse.log.stderr` — parser output
-- `result.json` — full benchmark result
+- `result.json` — full benchmark result, including requested model, observed model when Claude reports it, and `claude --version` output under `run.claude`
-description: Swipe within a scrollable UI element using a visible element reference from the current UI. Optional distance is a normalized stroke fraction greater than 0 and up to 1.
+description: >-
+  Swipe within a scrollable UI element using withinElementRef from a current rs/1 runtime snapshot.
+  withinElementRef is required; do not use elementRef.
+  Optional distance is a normalized stroke fraction greater than 0 and up to 1.
+  Example input: {"withinElementRef":"e7","direction":"up","distance":0.7}.
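
The constraints in the new swipe description (required `withinElementRef`, no `elementRef`, `distance` greater than 0 and up to 1) could be checked along these lines. This is a sketch under assumptions: `validateSwipeInput` and the error messages are hypothetical, and the set of valid `direction` values is not stated in the diff, so the sketch only requires a string.

```typescript
interface SwipeInput {
  withinElementRef: string;
  direction: string;
  distance?: number;
}

// Hypothetical validator mirroring the rules in the description text.
function validateSwipeInput(raw: Record<string, unknown>): SwipeInput {
  if ("elementRef" in raw) {
    throw new Error("Use withinElementRef, not elementRef");
  }
  const ref = raw["withinElementRef"];
  if (typeof ref !== "string" || ref.length === 0) {
    throw new Error("withinElementRef is required");
  }
  const direction = raw["direction"];
  if (typeof direction !== "string") {
    throw new Error("direction is required");
  }
  let distance: number | undefined;
  const rawDistance = raw["distance"];
  if (rawDistance !== undefined) {
    // The negated range check also rejects NaN.
    if (typeof rawDistance !== "number" || !(rawDistance > 0 && rawDistance <= 1)) {
      throw new Error("distance must be greater than 0 and at most 1");
    }
    distance = rawDistance;
  }
  return {
    withinElementRef: ref,
    direction,
    ...(distance !== undefined ? { distance } : {}),
  };
}

// The example input from the description passes validation.
const example = validateSwipeInput({ withinElementRef: "e7", direction: "up", distance: 0.7 });
```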
-description: Tap one elementRef from the latest snapshot_ui or wait_for_ui output. For multiple same-screen taps or visible switch toggles with no intermediate assertion, prefer batch. Other same-screen refs may remain usable after success; refresh after navigation, scrolling, sheet changes, or obvious layout changes.
+description: >-
+  Tap one elementRef from the latest snapshot_ui or wait_for_ui output. The elementRef must list the tap action in the snapshot targets; do not use refs from text-only rows. For multiple same-screen taps or visible switch toggles with no intermediate assertion, use batch instead of repeated tap calls. Other same-screen refs may remain usable after success; refresh after navigation, scrolling, sheet changes, or obvious layout changes.