diff --git a/AGENTS.md b/AGENTS.md
Lines changed: 1 addition & 0 deletions
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 1 addition & 0 deletions
diff --git a/CLAUDE.md b/CLAUDE.md
Lines changed: 3 additions & 0 deletions
diff --git a/benchmarks/claude-ui/README.md b/benchmarks/claude-ui/README.md
Lines changed: 15 additions & 4 deletions
diff --git a/benchmarks/claude-ui/run-directory.ts b/benchmarks/claude-ui/run-directory.ts
Lines changed: 57 additions & 0 deletions
diff --git a/benchmarks/claude-ui/suites/contacts.yml b/benchmarks/claude-ui/suites/contacts.yml
Lines changed: 12 additions & 0 deletions
diff --git a/benchmarks/claude-ui/suites/reminders.yml b/benchmarks/claude-ui/suites/reminders.yml
Lines changed: 21 additions & 0 deletions
diff --git a/benchmarks/claude-ui/suites/weather.yml b/benchmarks/claude-ui/suites/weather.yml
Lines changed: 13 additions & 0 deletions
diff --git a/manifests/tools/swipe.yaml b/manifests/tools/swipe.yaml
Lines changed: 5 additions & 1 deletion
diff --git a/manifests/tools/tap.yaml b/manifests/tools/tap.yaml
Lines changed: 2 additions & 1 deletion
@@ -63,6 +63,7 @@ ESM TypeScript project (`type: module`). Key layers:
 
 ## Test Conventions
 - Vitest with colocated `__tests__/` directories using `*.test.ts`
+- Snapshot tests (`*.snapshot.test.ts`) must only assert generated tool output against fixtures. Move helper, parser, schema, setup, or behavior assertions to non-snapshot unit/integration tests.
 - Smoke tests in `src/smoke-tests/__tests__/` (separate Vitest config, serial execution)
 - Use `vi.mock`/`vi.hoisted` for isolation; inject executors and mock file systems
 - MCP integration tests use `McpServer`, `InMemoryTransport`, and `Client`
 
@@ -4,6 +4,7 @@
 
 ### Added
 
+- Added `XCODEBUILDMCP_HEADLESS_LAUNCH`, an opt-in environment variable that suppresses GUI focus stealing on macOS. `launch_mac_app` and `build_run_macos` use `open -g` (background launch); Simulator.app GUI launches in `open_sim` and `build_run_sim` are skipped (`simctl boot` still runs the simulator runtime); and `simulator-management keyboard-shortcut` short-circuits with a clear error because System Events keystrokes inherently require foreground focus. Enabled automatically for snapshot test runs so tests no longer steal window focus.
 - Added `--from-result` to the Claude UI benchmark harness so existing `result.json` artifacts can be rendered as text or JSON without rerunning Claude.
 - Added `nextSteps` hint lines to MCP `structuredContent` and CLI `--output json` envelopes so agents can consume follow-up actions without scraping text. CLI JSON renders shell command lines; MCP structured content renders MCP tool-call hints. Structured result schemas that include `nextSteps` now use schema version 2; existing version 1 schema files remain available for current validators.
 - Added `snapshot_ui sinceScreenHash` / CLI `--since-screen-hash` so callers can skip full runtime snapshot output when the screen hash is unchanged.
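
Review note on the `XCODEBUILDMCP_HEADLESS_LAUNCH` entry: the background-launch behavior it describes could look roughly like the sketch below. This is an illustration only. The helper names `isHeadlessLaunch` and `buildOpenArgs` are hypothetical, and the accepted values (`1` or `true`) are an assumption, since the diff does not show how the variable is parsed. The `-g` flag itself is the standard macOS `open` option that keeps the launched app out of the foreground.

```typescript
// Hypothetical helpers; the real implementation is not part of this diff.
type Env = Record<string, string | undefined>;

// Assumption: '1' and 'true' (case-insensitive) enable headless launch.
function isHeadlessLaunch(env: Env): boolean {
  const value = env['XCODEBUILDMCP_HEADLESS_LAUNCH']?.trim().toLowerCase();
  return value === '1' || value === 'true';
}

// Build the argument list for macOS `open`; `-g` launches in the background.
function buildOpenArgs(appPath: string, headless: boolean): string[] {
  return headless ? ['-g', appPath] : [appPath];
}
```

A caller would pass `process.env` to `isHeadlessLaunch` and hand the resulting array to whatever executor spawns `open`, which keeps the decision testable without launching anything.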
 
@@ -9,6 +9,9 @@
 - Do not add fallback behavior by default. If required context, configuration, runtime state, or dependencies are missing, fail loudly and fix the caller/setup instead of silently switching to an alternate path. Add a fallback only when explicitly requested or when it is a documented product requirement.
 - Follow TypeScript best practices
 
+## Test Conventions
+- Snapshot tests (`*.snapshot.test.ts`) must only assert generated tool output against fixtures. Move helper, parser, schema, setup, or behavior assertions to non-snapshot unit/integration tests.
+
 ## Commands
 - NEVER commit unless user asks
 
 
@@ -51,6 +51,15 @@ Print machine-readable output from a new run:
 npm run bench:claude-ui -- --suite reminders --json
 ```
 
+Request an exact Claude model for controlled comparisons:
+
+```bash
+npm run bench:claude-ui -- --suite weather --model claude-sonnet-4-7
+npm run bench:claude-ui:xcodebuildmcp -- --model claude-sonnet-4-7
+```
+
+The `--model` CLI option overrides `claude.model` from the suite YAML for that run.
+
 Render an existing result without rerunning Claude:
 
 ```bash
@@ -113,6 +122,7 @@ Suites can override the Claude invocation without changing harness code. Omit th
 
 ```yaml
 claude:
+  model: claude-sonnet-4-7
   useMcpServer: false
   tools:
     - Bash
@@ -128,8 +138,6 @@ claude:
   extraArgs:
     - --setting-sources
     - project,local
-    - --model
-    - sonnet
 toolAnalysis:
   matchers:
     - kind: bashCommand
@@ -145,6 +153,8 @@ toolAnalysis:
       shortName: xcodebuild
 ```
 
+`claude.model` is the canonical suite-level model request. Do not put `--model` or `--model=<value>` in `claude.extraArgs`; the config parser rejects those forms so suite config and CLI overrides cannot disagree. Pass `--model <model>` to override the suite model for controlled comparison runs.
+
 `claude.useMcpServer: false` writes an empty per-run MCP config and passes it with `--strict-mcp-config`, so project/user MCP servers cannot leak into CLI-only benchmark runs. The harness still prepares the simulator lifecycle and exports `CLAUDE_UI_BENCHMARK_SIMULATOR_ID`, `CLAUDE_UI_BENCHMARK_RUN_DIR`, and `CLAUDE_UI_BENCHMARK_WORKING_DIRECTORY` to Claude. `appendSystemPrompt` also supports `{simulatorId}`, `{runDirectory}`, and `{workingDirectory}` placeholders.
 
 `claude.pluginDirs` is passed to Claude as one `--plugin-dir` argument per configured path, resolved from the repository root. Use this for suite-specific local/private CLI skills. `claude.isolatedWorkingDirectory: true` runs Claude from the per-run artifact directory instead of the suite working directory, which prevents repository/project skills from being discovered implicitly. When using an isolated working directory, include absolute `{workingDirectory}` paths in prompts for build commands or project files.
@@ -194,6 +204,7 @@ Each suite renders as a structured report with a task-completion banner, aligned
 COMPLETED  weather                                             1m 38.6s
   suite     benchmarks/claude-ui/suites/weather.yml
   artifacts out.nosync/claude-benchmarks/weather/20260522T214044Z
+  claude   model requested=claude-sonnet-4-7 observed=claude-sonnet-4-7 version=1.2.3
   exit      claude=0 parser=0
 
 Metrics
@@ -282,8 +293,8 @@ Each run writes:
 - `mcp-workspace/.xcodebuildmcp/config.yaml` — isolated MCP server config with effective suite defaults
 - `claude.jsonl` — Claude stream JSON output
 - `claude.stderr` — Claude stderr
-- `claude-command.log` — command, cwd, simulator ID, exit status, wall clock
+- `claude-command.log` — command, cwd, simulator ID, requested/observed model, `claude --version`, exit status, wall clock
 - `simulator-lifecycle.log` — temporary simulator create, boot, bootstatus, open, readiness, deletion commands, and simulator ID
 - `parsed/` — files written by `parse_claude_conversation.py`
 - `parse.log` / `parse.log.stderr` — parser output
-- `result.json` — full benchmark result
+- `result.json` — full benchmark result, including requested model, observed model when Claude reports it, and `claude --version` output under `run.claude`
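
Review note on the README model rules: a minimal sketch of the precedence described above, where `claude.model` is the suite-level request, the CLI `--model` option overrides it, and `--model` or `--model=<value>` in `claude.extraArgs` is rejected. The type and function names are hypothetical, the harness's real parser is not shown in this diff, and returning `undefined` when no model is requested is an assumption.

```typescript
// Hypothetical shape; only `model` and `extraArgs` are taken from the README text.
interface ClaudeSuiteConfig {
  model?: string;
  extraArgs?: string[];
}

// Reject both `--model` and `--model=<value>` so the suite has one model source.
function assertNoModelInExtraArgs(extraArgs: string[] = []): void {
  for (const arg of extraArgs) {
    if (arg === '--model' || arg.startsWith('--model=')) {
      throw new Error(`claude.extraArgs must not contain ${arg}; set claude.model instead`);
    }
  }
}

// The CLI --model value wins over claude.model.
// Assumption: undefined means no explicit model was requested.
function resolveRequestedModel(config: ClaudeSuiteConfig, cliModel?: string): string | undefined {
  assertNoModelInExtraArgs(config.extraArgs);
  return cliModel ?? config.model;
}
```

Under these assumptions, `resolveRequestedModel({ model: 'claude-opus-4-7' }, 'claude-sonnet-4-7')` yields the CLI value, while a config with `extraArgs: ['--model', 'sonnet']` fails before any run starts.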
@@ -0,0 +1,57 @@
+#!/usr/bin/env tsx
+import { access, readdir } from 'node:fs/promises';
+import path from 'node:path';
+import { main } from '../../src/benchmarks/claude-ui/harness.ts';
+
+async function directoryExists(directory: string): Promise<boolean> {
+  try {
+    await access(directory);
+    return true;
+  } catch {
+    return false;
+  }
+}
+
+async function suitePaths(directory: string): Promise<string[]> {
+  if (!(await directoryExists(directory))) return [];
+  const entries = await readdir(directory, { withFileTypes: true });
+  return entries
+    .filter(
+      (entry) => entry.isFile() && (entry.name.endsWith('.yml') || entry.name.endsWith('.yaml')),
+    )
+    .map((entry) => path.join(directory, entry.name))
+    .sort();
+}
+
+async function run(): Promise<number> {
+  const directory = process.argv[2];
+  const maybeLabel = process.argv[3];
+  const label = maybeLabel && !maybeLabel.startsWith('-') ? maybeLabel : directory;
+  const forwardedArgs =
+    maybeLabel && !maybeLabel.startsWith('-') ? process.argv.slice(4) : process.argv.slice(3);
+  if (!directory) {
+    console.error('Usage: run-directory.ts <suite-directory> [label] [benchmark args...]');
+    return 1;
+  }
+
+  const suites = await suitePaths(directory);
+  if (suites.length === 0) {
+    console.error(`No ${label} Claude UI benchmark suites found in ${directory}`);
+    return 1;
+  }
+
+  for (const suite of suites) {
+    const exitCode = await main(['--suite', suite, ...forwardedArgs]);
+    if (exitCode !== 0) return exitCode;
+  }
+  return 0;
+}
+
+run()
+  .then((exitCode) => {
+    process.exitCode = exitCode;
+  })
+  .catch((error) => {
+    console.error(error instanceof Error ? error.message : String(error));
+    process.exitCode = 1;
+  });
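
Review note on `run-directory.ts`: the positional handling in `run()` can be isolated as a pure function, which makes the label and forwarded-argument split easier to unit test. The sketch below mirrors the logic in the diff. The name `splitDirectoryRunArgs` and the `DirectoryRunArgs` type are hypothetical and do not exist in the PR.

```typescript
// Hypothetical extraction of the argv handling in run(); not part of the PR.
interface DirectoryRunArgs {
  directory: string | undefined;
  label: string | undefined;
  forwardedArgs: string[];
}

// argv[2] is the suite directory. argv[3] is a label only when it is non-empty
// and does not start with '-'. Everything after that is forwarded to the harness.
function splitDirectoryRunArgs(argv: string[]): DirectoryRunArgs {
  const directory: string | undefined = argv[2];
  const maybeLabel: string | undefined = argv[3];
  const hasLabel = maybeLabel !== undefined && maybeLabel !== '' && !maybeLabel.startsWith('-');
  return {
    directory,
    label: hasLabel ? maybeLabel : directory,
    forwardedArgs: hasLabel ? argv.slice(4) : argv.slice(3),
  };
}

// Example: no label, so `--model claude-sonnet-4-7` is forwarded as-is.
const example = splitDirectoryRunArgs([
  'tsx',
  'run-directory.ts',
  'benchmarks/claude-ui/suites',
  '--model',
  'claude-sonnet-4-7',
]);
```

One consequence of this scheme is that a first forwarded argument that does not start with `-` is consumed as the label, so forwarded arguments should begin with an option flag.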
@@ -4,6 +4,18 @@ workingDirectory: .
 sessionDefaults:
   bundleId: com.apple.MobileAddressBook
   simulatorName: iPhone 17 Pro Max
+claude:
+  model: claude-opus-4-7
+  appendSystemPrompt: >-
+    Match the benchmark baseline tool pattern for Contacts. Start with session_show_defaults, then
+    launch_app_sim. After launch_app_sim, use this tracked UI sequence: snapshot_ui, tap Add,
+    snapshot_ui, tap Continue if visible, type_text first name,
+    snapshot_ui, type_text last name, type_text company, tap add phone, snapshot_ui, type_text phone,
+    tap add email, snapshot_ui, type_text email, tap Done, snapshot_ui. When a UI action reports
+    SNAPSHOT_CAPTURE_FAILED, call snapshot_ui next. Do not use wait_for_ui for Contacts unless a
+    later tool result explicitly asks for wait_for_ui. Do not tap text fields before type_text. Keep
+    assistant text terse: no progress narration beyond what is needed to choose tools, and keep the
+    final answer to one sentence.
 firstRunPromptDismissals:
   labels:
     - Continue
 
@@ -4,6 +4,27 @@ workingDirectory: .
 sessionDefaults:
   bundleId: com.apple.reminders
   simulatorName: iPhone 17 Pro Max
+claude:
+  model: claude-opus-4-7
+  appendSystemPrompt: >-
+    The benchmark harness has already created, booted, opened, and configured the target simulator.
+    Do not call boot_sim or open_sim unless launch_app_sim reports the simulator is unavailable.
+    Match the benchmark baseline tool pattern for Reminders. If you need tool discovery, load all
+    required tools in one ToolSearch using the full MCP names: mcp__xcodebuildmcp-dev__session_show_defaults,
+    mcp__xcodebuildmcp-dev__launch_app_sim, mcp__xcodebuildmcp-dev__snapshot_ui,
+    mcp__xcodebuildmcp-dev__tap, mcp__xcodebuildmcp-dev__wait_for_ui,
+    mcp__xcodebuildmcp-dev__type_text, mcp__xcodebuildmcp-dev__key_press,
+    mcp__xcodebuildmcp-dev__batch. Do not call ToolSearch again after that. Start with
+    session_show_defaults, then launch_app_sim. After launch_app_sim, use
+    this tracked UI sequence: snapshot_ui, tap Add List,
+    wait_for_ui settled, tap Continue if visible, type_text list name, tap Done, tap the new list,
+    type_text first reminder, key_press keyCode 40, type_text second reminder, key_press keyCode 40,
+    type_text third reminder, tap Done, batch-tap the first and third completion circles. Use exactly
+    two key_press calls, both with keyCode 40; do not use any other key_press. After type_text list
+    name exactly once with the visible list-name elementRef, make the next two tracked UI calls tap Done,
+    then tap the new list; do not call snapshot_ui or wait_for_ui between those two taps. Every type_text
+    call must include an elementRef from the latest visible UI state. Keep assistant text terse: no
+    progress narration beyond what is needed to choose tools, and keep the final answer to one sentence.
 firstRunPromptDismissals:
   labels:
     - Continue
 
@@ -5,6 +5,19 @@ sessionDefaults:
   projectPath: Weather.xcodeproj
   scheme: Weather
   simulatorName: iPhone 17 Pro Max
+claude:
+  model: claude-opus-4-7
+  appendSystemPrompt: >-
+    Match the benchmark baseline tool pattern for Weather. If you need tool discovery, load all
+    required tools in one ToolSearch using the full MCP names: mcp__xcodebuildmcp-dev__session_show_defaults,
+    mcp__xcodebuildmcp-dev__build_run_sim, mcp__xcodebuildmcp-dev__snapshot_ui,
+    mcp__xcodebuildmcp-dev__tap, mcp__xcodebuildmcp-dev__batch,
+    mcp__xcodebuildmcp-dev__type_text, mcp__xcodebuildmcp-dev__swipe. Do not call ToolSearch
+    again after that. Start with session_show_defaults, then build_run_sim. After build_run_sim, use
+    this tracked UI sequence: snapshot_ui, tap, batch, tap,
+    tap, type_text, tap, tap, swipe, snapshot_ui, tap. Do not call snapshot_ui immediately after the
+    batch; the next two tracked UI calls after batch should be tap, then tap. Keep assistant text terse:
+    no progress narration beyond what is needed to choose tools, and keep the final answer to one sentence.
 baseline:
   totalToolCalls: 14
   trackedToolCalls: 13
 
@@ -3,7 +3,11 @@ module: mcp/tools/ui-automation/swipe
 names:
   mcp: swipe
   cli: swipe
-description: Swipe within a scrollable UI element using a visible element reference from the current UI. Optional distance is a normalized stroke fraction greater than 0 and up to 1.
+description: >-
+  Swipe within a scrollable UI element using withinElementRef from a current rs/1 runtime snapshot.
+  withinElementRef is required; do not use elementRef.
+  Optional distance is a normalized stroke fraction greater than 0 and up to 1.
+  Example input: {"withinElementRef":"e7","direction":"up","distance":0.7}.
 outputSchema:
   schema: xcodebuildmcp.output.ui-action-result
   version: "2"
 
@@ -3,7 +3,8 @@ module: mcp/tools/ui-automation/tap
 names:
   mcp: tap
   cli: tap
-description: Tap one elementRef from the latest snapshot_ui or wait_for_ui output. For multiple same-screen taps or visible switch toggles with no intermediate assertion, prefer batch. Other same-screen refs may remain usable after success; refresh after navigation, scrolling, sheet changes, or obvious layout changes.
+description: >-
+  Tap one elementRef from the latest snapshot_ui or wait_for_ui output. The elementRef must list the tap action in the snapshot targets; do not use refs from text-only rows. For multiple same-screen taps or visible switch toggles with no intermediate assertion, use batch instead of repeated tap calls. Other same-screen refs may remain usable after success; refresh after navigation, scrolling, sheet changes, or obvious layout changes.
 outputSchema:
   schema: xcodebuildmcp.output.ui-action-result
   version: '2'