feat: e2e command perf benchmark harness + nightly CI#630
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0dbcc16c57
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const stderr = r.stderr ?? ''; | ||
| const json = tryParseJson(stdout); | ||
| const exitCode = r.status ?? -1; | ||
| return { exitCode, wallClockMs, stdout, stderr, json, ok: exitCode === 0 && jsonOk(json) }; |
There was a problem hiding this comment.
Honor failed batch step results
When a single-step batch --json invocation reports a nested failure as data.results[0].ok === false while the top-level response remains successful, this line still marks the sample as successful because jsonOk only checks top-level ok. In the nightly perf harness, any failed selector/screenshot/logs batch step will be included in medians with no failure note, corrupting the benchmark report; the batch path should inspect the first result's ok and nested error as well.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Addressed defensively. With today's stop-only batch this can't occur — a failed step yields a top-level ok:false/non-zero exit, already caught by invokeCli. But invokeBatchStep now also downgrades the sample's ok from the step's own result.ok, so a future on-error=continue mode can't silently count a failed step as a successful sample.
| @@ -0,0 +1,81 @@ | |||
| import { spawnSync } from 'node:child_process'; | |||
There was a problem hiding this comment.
Use the shared exec helper for perf invocations
This TypeScript script directly imports spawnSync, but the repo AGENTS.md Hard Rules require TypeScript process execution to go through src/utils/exec.ts helpers outside that module. Since every benchmarked CLI call goes through this path, it bypasses the shared executable normalization and error behavior expected by the repo seam; use runCmdSync(..., { allowFailure: true }) or extend the helper if the harness needs additional options.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Done — the harness now spawns via runCmdSync from src/utils/exec.ts (with allowFailure), and maxBuffer was added to ExecOptions; spawnSync is no longer imported directly. (Reviewed commit predated that change.)
Adds scripts/perf, a cheap end-to-end perf benchmark that drives the built CLI through an ordered Settings tour of ~24 commands for N rounds, on a fully isolated daemon/state-dir and self-cleaning device, and emits JSON + Markdown reports. Per-command timing comes from wrapping each batchable command in its own single-step batch (daemon durationMs) plus wall-clock around the process. Wires a scheduled + workflow_dispatch CI job (perf-nightly.yml) that reuses the cached iOS XCUITest runner (setup-apple-replay) and the Android replay host, and runs the CLI from source via --experimental-strip-types (no dist build).
Review (P2): repo rule is to spawn processes through src/utils/exec.ts, not node:child_process directly. Switch the perf harness's invokeCli to runCmdSync (allowFailure so non-zero exits are recorded as samples) and add a maxBuffer option to ExecOptions/runCmdSync (snapshot payloads exceed Node's ~1MB default).
|
Size Report
Startup median (7 runs, lower is better):
Top changed chunks: no changes in the largest emitted chunks. |
…nd is clean The first interaction after open/relaunch pays the one-time iOS XCUITest runner startup (~10s+ cold) and a per-relaunch first-AX-query settle cost (~4s). That was landing on the first measured command each round (snapshot -i), inflating it ~10x vs the next snapshot. Run an untimed warmup snapshot -i after establishSession, after each round's reset-open, and after every freshRoot relaunch, so no measured command absorbs runner startup. Noted in the report header.
- exec.ts: extract spawnRejectionError + commandCloseFailure helpers, deduping the error/close handler clones (Fallow duplication ✗ that surfaced once the maxBuffer change pulled exec.ts into the audit scope). - .fallowrc: exclude scripts/perf/** (non-shipped benchmark tooling, like examples/ test-app) so its naturally-moderate functions don't trip the complexity gate. - config.ts: drop unused exports CLI_BIN/DEFAULT_OUT_DIR; add readIntValue so --n/--rounds/--warmup report the actual flag + reject non-integers clearly. - harness.ts: extract toSample(); type sampleError param as CliResult. - scenario.ts: ScenarioStep is now a discriminated union on execMode (removes step.step!/ step.args ?? []). - comment/legend rewords (platform defaults are local-convenience/CI-overridden; elements = node count). check:fallow now green; typecheck/lint/unit pass.
Defensive belt-and-suspenders for the Codex review note: stop-only batch already surfaces a failed step as a top-level failure (caught by invokeCli), but if an on-error=continue mode ever keeps the batch ok while a step fails, don't silently count that step as a successful sample — derive ok from the step's own result.ok.
What
Adds
scripts/perf/— a cheap end-to-end perf benchmark that drives the built CLI through an ordered "Settings tour" of 24 commands for N rounds, on a fully isolated daemon/state-dir and self-cleaning device, and emits a JSON + Markdown report. Plus a scheduled/manual CI job (perf-nightly.yml).How it works
batch, so we read the daemon-sidedurationMs(src/core/batch.ts) per command, plus wall-clock around the child process (≈ spawn + socket overhead). No new in-command instrumentation.--state-dir(own daemon/socket/logs) +--session perf+ device selectors. Never touches the default~/.agent-devicedaemon or other devices. Teardown =close --shutdown+rm -rf.open --relaunchresets to root; warmup rounds dropped; min/median/p95/max reported. Snapshots record element counts (tree-size proxy: root vs deep).perf-nightly.yml, schedule +workflow_dispatch): reuses the same build artifacts as the device suites — the cached iOS XCUITest runner (setup-apple-replay,ios-runner-prebuiltcache) and the Android replay host — and runs the CLI from source via--experimental-strip-types src/bin.ts(no dist build), matching the replay workflows. Uploads the reports as artifacts.Findings (n=3, both platforms 24/24 clean)
-ioddly ~10× the deep-ifind/ison the same targetStandouts worth investigating as bugs / perf issues (follow-up work in progress):
get textis ~20×find/isagainst the same element.fill/type/scroll/isare markedly slower than Android.snapshot -i(~2.7s) is ~10× the deepsnapshot -i(~190ms).Gotchas the harness had to work around (candidate bugs)
--sessionsetslockPolicy='reject'(src/utils/session-binding.ts), so device selectors are rejected on every command except the session-creatingopen/apps/devices— includingboot. Harness establishes viaopen+ selectors and runsbootsessionless. Is rejecting selectors that match the locked device intended?settings animations off);pressneeds a hittabletext="…"target (labels match non-hittable text nodes); the search editable only appears after tapping the search card.fillon the search field hangs ("main thread execution timed out") if the field is focused first; filling at root works.start→stopfails to finalize a valid video (simctl recordVideo exited with code 1/ Android "not a playable MP4"); must bracket other commands between them.freshRootrelaunch resets to top before steps that need a clean root.Notes
workflow_dispatchwith aroundsinput.scripts/perf/.results/(gitignored).