feat: e2e command perf benchmark harness + nightly CI by thymikee · Pull Request #630 · callstack/agent-device

thymikee · 2026-05-31T07:34:01Z

What

Adds scripts/perf/ — a cheap end-to-end perf benchmark that drives the built CLI through an ordered "Settings tour" of 24 commands for N rounds, on a fully isolated daemon/state-dir and self-cleaning device, and emits a JSON + Markdown report. Plus a scheduled/manual CI job (perf-nightly.yml).

pnpm perf:ios          # or perf:android
# flags: --n <rounds=5> --warmup <1> --keep-artifacts --out-dir --device --udid --serial

How it works

Timing for free: each batchable command is wrapped in its own single-step batch, so we read the daemon-side durationMs (src/core/batch.ts) per command, plus wall-clock around the child process (≈ spawn + socket overhead). No new in-command instrumentation.
Isolation: throwaway --state-dir (own daemon/socket/logs) + --session perf + device selectors. Never touches the default ~/.agent-device daemon or other devices. Teardown = close --shutdown + rm -rf.
N rounds: each round open --relaunch resets to root; warmup rounds dropped; min/median/p95/max reported. Snapshots record element counts (tree-size proxy: root vs deep).
CI (perf-nightly.yml, schedule + workflow_dispatch): reuses the same build artifacts as the device suites — the cached iOS XCUITest runner (setup-apple-replay, ios-runner-prebuilt cache) and the Android replay host — and runs the CLI from source via --experimental-strip-types src/bin.ts (no dist build), matching the replay workflows. Uploads the reports as artifacts.

Findings (n=3, both platforms 24/24 clean)

⚠️ Absolute numbers are inflated by concurrent device load on the dev machine (co-resident sims/emulators). Treat within-run, cross-command comparisons as the signal; CI will give cleaner trends. Times are daemon durationMs medians (ms).

command	iOS	Android	note
snapshot (root / deep)	251 / 328	2025 / 2811	iOS element counts 109 / 201; Android 60 / 47
snapshot -i (root / deep)	2690 / 187	2113 / 1957	iOS root `-i` oddly ~10× the deep `-i`
boot (cold iOS / warm Android)	9944	761
find	1114	2296
get text	24980	4650	iOS ~20× `find`/`is` on the same target
is visible	6257	2137
fill	27672	9947	iOS hangs if the field is focused first
type	12827	479	iOS ~27× Android
scroll	8010	2506
wait text	3854	2050
press	4665	2409
back	949	264
screenshot	1400	3478
logs mark / clear	0 / 0	0 / 1
perf	500	164
record start / stop	332 / 434 (wall)	692 / 2248 (wall)

Standouts worth investigating as bugs / perf issues (follow-up work in progress):

iOS get text is ~20× find/is against the same element.
iOS fill / type / scroll / is are markedly slower than Android.
iOS root snapshot -i (~2.7s) is ~10× the deep snapshot -i (~190ms).

Gotchas the harness had to work around (candidate bugs)

Session lock: passing --session sets lockPolicy='reject' (src/utils/session-binding.ts), so device selectors are rejected on every command except the session-creating open/apps/devices — including boot. Harness establishes via open + selectors and runs boot sessionless. Is rejecting selectors that match the locked device intended?
Android: accessibility dumps time out unless animations are off (harness runs settings animations off); press needs a hittable text="…" target (labels match non-hittable text nodes); the search editable only appears after tapping the search card.
iOS: fill on the search field hangs ("main thread execution timed out") if the field is focused first; filling at root works.
record: an instant start→stop fails to finalize a valid video (simctl recordVideo exited with code 1 / Android "not a playable MP4"); must bracket other commands between them.
Scroll position couples consecutive commands → a freshRoot relaunch resets to top before steps that need a clean root.

Notes

Not run on PRs (perf timing on shared runners is noisy + slow). Scheduled nightly + workflow_dispatch with a rounds input.
Reports land in scripts/perf/.results/ (gitignored).
Follow-up: dig into the iOS slow paths above and the gotchas that look like bugs, one platform at a time.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0dbcc16c57

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-31T07:37:08Z

+  const stderr = r.stderr ?? '';
+  const json = tryParseJson(stdout);
+  const exitCode = r.status ?? -1;
+  return { exitCode, wallClockMs, stdout, stderr, json, ok: exitCode === 0 && jsonOk(json) };


Honor failed batch step results

When a single-step batch --json invocation reports a nested failure as data.results[0].ok === false while the top-level response remains successful, this line still marks the sample as successful because jsonOk only checks top-level ok. In the nightly perf harness, any failed selector/screenshot/logs batch step will be included in medians with no failure note, corrupting the benchmark report; the batch path should inspect the first result's ok and nested error as well.

Useful? React with 👍 / 👎.

Addressed defensively. With today's stop-only batch this can't occur — a failed step yields a top-level ok:false/non-zero exit, already caught by invokeCli. But invokeBatchStep now also downgrades the sample's ok from the step's own result.ok, so a future on-error=continue mode can't silently count a failed step as a successful sample.

chatgpt-codex-connector · 2026-05-31T07:37:08Z

@@ -0,0 +1,81 @@
+import { spawnSync } from 'node:child_process';


Use the shared exec helper for perf invocations

This TypeScript script directly imports spawnSync, but the repo AGENTS.md Hard Rules require TypeScript process execution to go through src/utils/exec.ts helpers outside that module. Since every benchmarked CLI call goes through this path, it bypasses the shared executable normalization and error behavior expected by the repo seam; use runCmdSync(..., { allowFailure: true }) or extend the helper if the harness needs additional options.

Useful? React with 👍 / 👎.

Done — the harness now spawns via runCmdSync from src/utils/exec.ts (with allowFailure), and maxBuffer was added to ExecOptions; spawnSync is no longer imported directly. (Reviewed commit predated that change.)

Adds scripts/perf, a cheap end-to-end perf benchmark that drives the built CLI through an ordered Settings tour of ~24 commands for N rounds, on a fully isolated daemon/state-dir and self-cleaning device, and emits JSON + Markdown reports. Per-command timing comes from wrapping each batchable command in its own single-step batch (daemon durationMs) plus wall-clock around the process. Wires a scheduled + workflow_dispatch CI job (perf-nightly.yml) that reuses the cached iOS XCUITest runner (setup-apple-replay) and the Android replay host, and runs the CLI from source via --experimental-strip-types (no dist build).

Review (P2): repo rule is to spawn processes through src/utils/exec.ts, not node:child_process directly. Switch the perf harness's invokeCli to runCmdSync (allowFailure so non-zero exits are recorded as samples) and add a maxBuffer option to ExecOptions/runCmdSync (snapshot payloads exceed Node's ~1MB default).

github-actions · 2026-05-31T09:26:58Z

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-05-31 12:38 UTC

github-actions · 2026-05-31T09:28:07Z

Size Report

Metric	Base	Current	Diff
JS raw	1.1 MB	1.1 MB	+111 B
JS gzip	355.8 kB	355.9 kB	+52 B
npm tarball	452.7 kB	452.9 kB	+149 B
npm unpacked	1.5 MB	1.5 MB	+564 B

Startup median (7 runs, lower is better):

Scenario	Base	Current	Diff
CLI --version	26.0 ms	26.2 ms	+0.2 ms
CLI --help	41.2 ms	41.3 ms	+0.2 ms

Top changed chunks: no changes in the largest emitted chunks.

…nd is clean The first interaction after open/relaunch pays the one-time iOS XCUITest runner startup (~10s+ cold) and a per-relaunch first-AX-query settle cost (~4s). That was landing on the first measured command each round (snapshot -i), inflating it ~10x vs the next snapshot. Run an untimed warmup snapshot -i after establishSession, after each round's reset-open, and after every freshRoot relaunch, so no measured command absorbs runner startup. Noted in the report header.

- exec.ts: extract spawnRejectionError + commandCloseFailure helpers, deduping the error/close handler clones (Fallow duplication ✗ that surfaced once the maxBuffer change pulled exec.ts into the audit scope). - .fallowrc: exclude scripts/perf/** (non-shipped benchmark tooling, like examples/ test-app) so its naturally-moderate functions don't trip the complexity gate. - config.ts: drop unused exports CLI_BIN/DEFAULT_OUT_DIR; add readIntValue so --n/--rounds/--warmup report the actual flag + reject non-integers clearly. - harness.ts: extract toSample(); type sampleError param as CliResult. - scenario.ts: ScenarioStep is now a discriminated union on execMode (removes step.step!/ step.args ?? []). - comment/legend rewords (platform defaults are local-convenience/CI-overridden; elements = node count). check:fallow now green; typecheck/lint/unit pass.

Defensive belt-and-suspenders for the Codex review note: stop-only batch already surfaces a failed step as a top-level failure (caught by invokeCli), but if an on-error=continue mode ever keeps the batch ok while a step fails, don't silently count that step as a successful sample — derive ok from the step's own result.ok.

chatgpt-codex-connector Bot reviewed May 31, 2026

View reviewed changes

thymikee mentioned this pull request May 31, 2026

perf(ios): make get text ~80x faster for non-editable elements #632

Merged

thymikee added 2 commits May 31, 2026 11:23

thymikee force-pushed the perf-harness branch from 0dbcc16 to 7f154d2 Compare May 31, 2026 09:26

This was referenced May 31, 2026

perf+fix(ios): faster text entry (readiness + typed-query field resolve) + fix fill mis-navigation #633

Merged

Porting to Dart and benchmarking a bit #629

Closed

thymikee added 2 commits May 31, 2026 14:18

thymikee merged commit 45cfad5 into main May 31, 2026
19 checks passed

thymikee deleted the perf-harness branch May 31, 2026 12:38

thymikee mentioned this pull request May 31, 2026

Follow-ups from iOS perf PRs (#630, #632, #633, #634) #635

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: e2e command perf benchmark harness + nightly CI#630

feat: e2e command perf benchmark harness + nightly CI#630
thymikee merged 5 commits into
mainfrom
perf-harness

thymikee commented May 31, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 31, 2026

Uh oh!

thymikee May 31, 2026

Uh oh!

chatgpt-codex-connector Bot May 31, 2026

Uh oh!

thymikee May 31, 2026

Uh oh!

github-actions Bot commented May 31, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 31, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		@@ -0,0 +1,81 @@
		import { spawnSync } from 'node:child_process';

Conversation

thymikee commented May 31, 2026

What

How it works

Findings (n=3, both platforms 24/24 clean)

Gotchas the harness had to work around (candidate bugs)

Notes

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 31, 2026

Choose a reason for hiding this comment

Uh oh!

thymikee May 31, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot May 31, 2026

Choose a reason for hiding this comment

Uh oh!

thymikee May 31, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented May 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented May 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Size Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

github-actions Bot commented May 31, 2026 •

edited

Loading

github-actions Bot commented May 31, 2026 •

edited

Loading