Skip to content

feat: e2e command perf benchmark harness + nightly CI#630

Merged
thymikee merged 5 commits into
mainfrom
perf-harness
May 31, 2026
Merged

feat: e2e command perf benchmark harness + nightly CI#630
thymikee merged 5 commits into
mainfrom
perf-harness

Conversation

@thymikee
Copy link
Copy Markdown
Member

What

Adds scripts/perf/ — a cheap end-to-end perf benchmark that drives the built CLI through an ordered "Settings tour" of 24 commands for N rounds, on a fully isolated daemon/state-dir and self-cleaning device, and emits a JSON + Markdown report. Plus a scheduled/manual CI job (perf-nightly.yml).

pnpm perf:ios          # or perf:android
# flags: --n <rounds=5> --warmup <1> --keep-artifacts --out-dir --device --udid --serial

How it works

  • Timing for free: each batchable command is wrapped in its own single-step batch, so we read the daemon-side durationMs (src/core/batch.ts) per command, plus wall-clock around the child process (≈ spawn + socket overhead). No new in-command instrumentation.
  • Isolation: throwaway --state-dir (own daemon/socket/logs) + --session perf + device selectors. Never touches the default ~/.agent-device daemon or other devices. Teardown = close --shutdown + rm -rf.
  • N rounds: each round open --relaunch resets to root; warmup rounds dropped; min/median/p95/max reported. Snapshots record element counts (tree-size proxy: root vs deep).
  • CI (perf-nightly.yml, schedule + workflow_dispatch): reuses the same build artifacts as the device suites — the cached iOS XCUITest runner (setup-apple-replay, ios-runner-prebuilt cache) and the Android replay host — and runs the CLI from source via --experimental-strip-types src/bin.ts (no dist build), matching the replay workflows. Uploads the reports as artifacts.

Findings (n=3, both platforms 24/24 clean)

⚠️ Absolute numbers are inflated by concurrent device load on the dev machine (co-resident sims/emulators). Treat within-run, cross-command comparisons as the signal; CI will give cleaner trends. Times are daemon durationMs medians (ms).

command iOS Android note
snapshot (root / deep) 251 / 328 2025 / 2811 iOS element counts 109 / 201; Android 60 / 47
snapshot -i (root / deep) 2690 / 187 2113 / 1957 iOS root -i oddly ~10× the deep -i
boot (cold iOS / warm Android) 9944 761
find 1114 2296
get text 24980 4650 iOS ~20× find/is on the same target
is visible 6257 2137
fill 27672 9947 iOS hangs if the field is focused first
type 12827 479 iOS ~27× Android
scroll 8010 2506
wait text 3854 2050
press 4665 2409
back 949 264
screenshot 1400 3478
logs mark / clear 0 / 0 0 / 1
perf 500 164
record start / stop 332 / 434 (wall) 692 / 2248 (wall)

Standouts worth investigating as bugs / perf issues (follow-up work in progress):

  • iOS get text is ~20× find/is against the same element.
  • iOS fill / type / scroll / is are markedly slower than Android.
  • iOS root snapshot -i (~2.7s) is ~10× the deep snapshot -i (~190ms).

Gotchas the harness had to work around (candidate bugs)

  • Session lock: passing --session sets lockPolicy='reject' (src/utils/session-binding.ts), so device selectors are rejected on every command except the session-creating open/apps/devices — including boot. Harness establishes via open + selectors and runs boot sessionless. Is rejecting selectors that match the locked device intended?
  • Android: accessibility dumps time out unless animations are off (harness runs settings animations off); press needs a hittable text="…" target (labels match non-hittable text nodes); the search editable only appears after tapping the search card.
  • iOS: fill on the search field hangs ("main thread execution timed out") if the field is focused first; filling at root works.
  • record: an instant startstop fails to finalize a valid video (simctl recordVideo exited with code 1 / Android "not a playable MP4"); must bracket other commands between them.
  • Scroll position couples consecutive commands → a freshRoot relaunch resets to top before steps that need a clean root.

Notes

  • Not run on PRs (perf timing on shared runners is noisy + slow). Scheduled nightly + workflow_dispatch with a rounds input.
  • Reports land in scripts/perf/.results/ (gitignored).
  • Follow-up: dig into the iOS slow paths above and the gotchas that look like bugs, one platform at a time.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0dbcc16c57

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread scripts/perf/cli.ts
const stderr = r.stderr ?? '';
const json = tryParseJson(stdout);
const exitCode = r.status ?? -1;
return { exitCode, wallClockMs, stdout, stderr, json, ok: exitCode === 0 && jsonOk(json) };
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Honor failed batch step results

When a single-step batch --json invocation reports a nested failure as data.results[0].ok === false while the top-level response remains successful, this line still marks the sample as successful because jsonOk only checks top-level ok. In the nightly perf harness, any failed selector/screenshot/logs batch step will be included in medians with no failure note, corrupting the benchmark report; the batch path should inspect the first result's ok and nested error as well.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed defensively. With today's stop-only batch this can't occur — a failed step yields a top-level ok:false/non-zero exit, already caught by invokeCli. But invokeBatchStep now also downgrades the sample's ok from the step's own result.ok, so a future on-error=continue mode can't silently count a failed step as a successful sample.

Comment thread scripts/perf/cli.ts Outdated
@@ -0,0 +1,81 @@
import { spawnSync } from 'node:child_process';
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Use the shared exec helper for perf invocations

This TypeScript script directly imports spawnSync, but the repo AGENTS.md Hard Rules require TypeScript process execution to go through src/utils/exec.ts helpers outside that module. Since every benchmarked CLI call goes through this path, it bypasses the shared executable normalization and error behavior expected by the repo seam; use runCmdSync(..., { allowFailure: true }) or extend the helper if the harness needs additional options.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — the harness now spawns via runCmdSync from src/utils/exec.ts (with allowFailure), and maxBuffer was added to ExecOptions; spawnSync is no longer imported directly. (Reviewed commit predated that change.)

thymikee added 2 commits May 31, 2026 11:23
Adds scripts/perf, a cheap end-to-end perf benchmark that drives the built
CLI through an ordered Settings tour of ~24 commands for N rounds, on a fully
isolated daemon/state-dir and self-cleaning device, and emits JSON + Markdown
reports. Per-command timing comes from wrapping each batchable command in its
own single-step batch (daemon durationMs) plus wall-clock around the process.

Wires a scheduled + workflow_dispatch CI job (perf-nightly.yml) that reuses the
cached iOS XCUITest runner (setup-apple-replay) and the Android replay host, and
runs the CLI from source via --experimental-strip-types (no dist build).
Review (P2): repo rule is to spawn processes through src/utils/exec.ts, not
node:child_process directly. Switch the perf harness's invokeCli to runCmdSync
(allowFailure so non-zero exits are recorded as samples) and add a maxBuffer
option to ExecOptions/runCmdSync (snapshot payloads exceed Node's ~1MB default).
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 31, 2026

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-05-31 12:38 UTC

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 31, 2026

Size Report

Metric Base Current Diff
JS raw 1.1 MB 1.1 MB +111 B
JS gzip 355.8 kB 355.9 kB +52 B
npm tarball 452.7 kB 452.9 kB +149 B
npm unpacked 1.5 MB 1.5 MB +564 B

Startup median (7 runs, lower is better):

Scenario Base Current Diff
CLI --version 26.0 ms 26.2 ms +0.2 ms
CLI --help 41.2 ms 41.3 ms +0.2 ms

Top changed chunks: no changes in the largest emitted chunks.

…nd is clean

The first interaction after open/relaunch pays the one-time iOS XCUITest runner
startup (~10s+ cold) and a per-relaunch first-AX-query settle cost (~4s). That was
landing on the first measured command each round (snapshot -i), inflating it ~10x
vs the next snapshot. Run an untimed warmup snapshot -i after establishSession, after
each round's reset-open, and after every freshRoot relaunch, so no measured command
absorbs runner startup. Noted in the report header.
thymikee added 2 commits May 31, 2026 14:18
- exec.ts: extract spawnRejectionError + commandCloseFailure helpers, deduping the
  error/close handler clones (Fallow duplication ✗ that surfaced once the maxBuffer
  change pulled exec.ts into the audit scope).
- .fallowrc: exclude scripts/perf/** (non-shipped benchmark tooling, like examples/
  test-app) so its naturally-moderate functions don't trip the complexity gate.
- config.ts: drop unused exports CLI_BIN/DEFAULT_OUT_DIR; add readIntValue so
  --n/--rounds/--warmup report the actual flag + reject non-integers clearly.
- harness.ts: extract toSample(); type sampleError param as CliResult.
- scenario.ts: ScenarioStep is now a discriminated union on execMode (removes step.step!/
  step.args ?? []).
- comment/legend rewords (platform defaults are local-convenience/CI-overridden;
  elements = node count). check:fallow now green; typecheck/lint/unit pass.
Defensive belt-and-suspenders for the Codex review note: stop-only batch already
surfaces a failed step as a top-level failure (caught by invokeCli), but if an
on-error=continue mode ever keeps the batch ok while a step fails, don't silently
count that step as a successful sample — derive ok from the step's own result.ok.
@thymikee thymikee merged commit 45cfad5 into main May 31, 2026
19 checks passed
@thymikee thymikee deleted the perf-harness branch May 31, 2026 12:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant