
feat(ios): source frameSequence frames from WDA MJPEG stream #2720

Open: quanru wants to merge 2 commits into feat/frame-sequence-transient-ui from feat/ios-mjpeg-frame-source

quanru (Collaborator) commented Jun 25, 2026

Background

frameSequence works well on Android because its screenshotBase64() pulls the latest frame from a continuously-running scrcpy video stream (near-instant). On iOS, screenshotBase64() goes through WDA's takeScreenshot() — a full per-call capture that is too slow to sample a short time window densely, so the effective frame spacing becomes captureLatency + intervalMs and short-lived UI (toasts) slips through the gaps.
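To make the spacing argument concrete, here is a minimal sketch of the arithmetic. The helper name and the latency figures are illustrative assumptions, not measurements from this PR.

```typescript
// Hypothetical helper: how many frames land inside a time window when every
// capture costs `captureLatencyMs` and the loop then waits `intervalMs`.
function framesInWindow(
  windowMs: number,
  intervalMs: number,
  captureLatencyMs: number,
): number {
  // Effective spacing between consecutive frames, as described above.
  const spacingMs = captureLatencyMs + intervalMs;
  // The first frame only exists once one capture has completed.
  if (windowMs < captureLatencyMs) return 0;
  // Every further frame needs one more full spacing.
  return 1 + Math.floor((windowMs - captureLatencyMs) / spacingMs);
}

// Assumed numbers: a toast visible for 1500 ms, sampled with intervalMs = 200.
const slowPath = framesInWindow(1500, 200, 800); // assumed 800 ms per-call capture
const streamPath = framesInWindow(1500, 200, 0); // stream read treated as instant
console.log({ slowPath, streamPath }); // { slowPath: 1, streamPath: 8 }
```

With these assumed values the per-call path gets a single frame inside the window, while a stream-backed read gets eight, which is why short-lived UI is easy to miss on the slow path.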

Key point: iOS already runs WDA's native MJPEG server (mjpegStreamUrl, with mjpegServerFramerate/mjpegServerScreenshotQuality configured) — the iOS analog of scrcpy. It just wasn't wired into frame capture. This PR consumes it as a fast frame source.

Targets feat/frame-sequence-transient-ui (the frameSequence feature branch).

What this PR does

  • core (device): add an optional AbstractInterface.captureFrameSequence({count, intervalMs, abortSignal}) — a fast multi-frame capture implemented by devices that maintain a continuous frame stream. Returns data-URL frames in temporal order; may return fewer than requested.
  • core (agent): captureUIContextSequence prefers captureFrameSequence for the earlier frames and captures the representative (last) frame with a normal full-quality screenshot. If the fast source throws or yields nothing, it falls back to the existing sequential screenshotBase64() loop. Abort is honored throughout.
  • ios: MjpegFrameSource consumes the WDA MJPEG stream (lazy-started, auto-reconnecting with backoff, stopped on destroy()) and keeps the latest decoded JPEG. IOSDevice.captureFrameSequence samples it at the requested cadence — pulling is near-instant, so frames land at intervalMs instead of intervalMs + slowScreenshot.

screenshotBase64() (normal locate/asserts) is unchanged — it still uses full-quality takeScreenshot(). The MJPEG stream is only used for the multi-frame sampling.
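The fast-path-with-fallback behavior described above can be sketched roughly as follows. This is not the code from the PR: `captureSequence`, `FrameDevice`, and the helpers are hypothetical names, and only `screenshotBase64`, `captureFrameSequence`, and its option names are taken from the description.

```typescript
interface Frame { base64: string; capturedAt: number }

// Hypothetical stand-in for the device interface; the fast path is optional.
interface FrameDevice {
  screenshotBase64(): Promise<string>;
  captureFrameSequence?(opt: {
    count: number;
    intervalMs: number;
    abortSignal?: AbortSignal;
  }): Promise<Frame[]>;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

function throwIfAborted(signal?: AbortSignal): void {
  if (signal?.aborted) throw new Error("capture aborted");
}

async function captureSequence(
  device: FrameDevice,
  count: number,
  intervalMs: number,
  abortSignal?: AbortSignal,
): Promise<string[]> {
  throwIfAborted(abortSignal);
  let earlier: string[] = [];

  // Prefer the fast source for every frame except the last one.
  if (device.captureFrameSequence && count > 1) {
    try {
      const frames = await device.captureFrameSequence({
        count: count - 1,
        intervalMs,
        abortSignal,
      });
      earlier = frames.map((frame) => frame.base64);
    } catch (error) {
      // An abort must propagate; any other failure falls through to the fallback.
      if (abortSignal?.aborted) throw error;
      earlier = [];
    }
  }

  // Fallback: the fast source is missing, failed, or yielded nothing.
  if (earlier.length === 0) {
    for (let i = 0; i < count - 1; i++) {
      throwIfAborted(abortSignal);
      earlier.push(await device.screenshotBase64());
      await sleep(intervalMs);
    }
  }

  // The representative (last) frame always comes from a normal screenshot.
  throwIfAborted(abortSignal);
  return [...earlier, await device.screenshotBase64()];
}
```

Because the fast source may return fewer frames than requested, the result in this sketch can be shorter than `count`; only an empty result triggers the sequential loop.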

Why this design

  • Mirrors what already makes Android fast (continuous stream → instant latest-frame), instead of optimizing the slow per-call path.
  • The fast path is an optional capability: Android/web don't implement it and keep working (Android is already fast; web could later implement it over CDP screencast). No behavior change for devices without it.
  • Graceful degradation: any stream failure falls back to sequential capture, so correctness is preserved even if the MJPEG server is unavailable.

Tests

  • packages/ios/tests/unit-test/mjpeg-frame-source.test.ts: extractJpegFrames (single/multiple frames, boundary/header bytes, incomplete frame, split-across-chunks, trailing-FF), plus a streamed MjpegFrameSource decoding the latest frame from a mocked fetch stream. 7/7 pass.
  • packages/core/tests/unit-test/agent-frame-sequence.test.ts — fast-path is used when the device exposes captureFrameSequence (earlier frames from stream + one representative screenshot), and falls back to sequential capture when it fails.
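For readers unfamiliar with MJPEG parsing, the cases listed for extractJpegFrames can be illustrated with a small marker scanner. The sketch below is an assumption about the general technique (find JPEG start `FF D8` and end `FF D9` markers, carry unfinished bytes into the next chunk); `scanJpegFrames` and `concatBytes` are hypothetical names, and the PR's actual function may have a different signature.

```typescript
// Hypothetical scanner: returns every complete JPEG found in `buffer` plus the
// bytes that must be kept and prepended to the next network chunk.
function scanJpegFrames(buffer: Uint8Array): { frames: Uint8Array[]; rest: Uint8Array } {
  const frames: Uint8Array[] = [];
  let pos = 0;
  let start = -1; // index of the open frame's FF D8, or -1 when no frame is open

  while (pos + 1 < buffer.length) {
    if (buffer[pos] === 0xff) {
      const next = buffer[pos + 1];
      if (start < 0 && next === 0xd8) {
        start = pos; // start-of-image marker
        pos += 2;
        continue;
      }
      if (start >= 0 && next === 0xd9) {
        frames.push(buffer.slice(start, pos + 2)); // include the end-of-image marker
        start = -1;
        pos += 2;
        continue;
      }
    }
    pos++; // multipart boundary text, headers, or image payload
  }

  let keepFrom = buffer.length;
  if (start >= 0) {
    keepFrom = start; // incomplete frame: keep it whole
  } else if (buffer.length > 0 && buffer[buffer.length - 1] === 0xff) {
    keepFrom = buffer.length - 1; // trailing FF may be the first half of a marker
  }
  return { frames, rest: buffer.slice(keepFrom) };
}

// Joins leftover bytes with the next chunk before scanning again.
function concatBytes(a: Uint8Array, b: Uint8Array): Uint8Array {
  const out = new Uint8Array(a.length + b.length);
  out.set(a, 0);
  out.set(b, a.length);
  return out;
}
```

A caller would keep `rest` between reads and scan `concatBytes(rest, nextChunk)`. A scanner this simple may cut a frame short if the JPEG embeds another JPEG (for example an EXIF thumbnail), so it is only suitable when the stream is known not to do that.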

Validation

  • npx tsc --noEmit for @midscene/core and @midscene/ios — clean
  • npx nx build core and npx nx build ios — pass
  • Core frameSequence suites (20 tests) + iOS device.test.ts (54) + new MJPEG tests — pass
  • biome lint clean

Note: end-to-end behavior was verified on Android/web in the base PR; the iOS path here is covered by unit tests (no physical iOS device in CI).

On iOS, WDA's takeScreenshot is too slow to sample a short time window
densely, so frameSequence missed transient UI (toasts) that Android
catches via its scrcpy stream. iOS already runs WDA's native MJPEG
server; this consumes it as a fast frame source.

- core(device): add optional AbstractInterface.captureFrameSequence — a
  fast multi-frame capture for devices with a continuous frame stream.
- core(agent): captureUIContextSequence prefers captureFrameSequence for
  the earlier frames (the representative last frame still uses a normal
  full-quality screenshot), and falls back to sequential screenshotBase64
  capture when the stream is unavailable or yields nothing.
- ios: MjpegFrameSource consumes the WDA MJPEG stream (lazy-started,
  auto-reconnecting, stopped on destroy) and keeps the latest decoded
  JPEG; IOSDevice.captureFrameSequence samples it at the requested cadence.

Tests: extractJpegFrames parsing (incl. split-chunk / boundary cases) and
a streamed MjpegFrameSource; agent fast-path selection and fallback.

chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3bf66f1289


  abortSignal?: AbortSignal;
}): Promise<Array<{ base64: string; capturedAt: number }>> {
  const { count, intervalMs, abortSignal } = opt;
  const source = await this.ensureMjpegFrameSource();

P2: Honor cancellation before starting the MJPEG stream

When frameSequence is invoked with an already-aborted signal, this starts and waits for the WDA MJPEG source before the first throwIfAborted() in the loop. On iOS that can turn a cancelled aiAssert into a 3s ensureStarted() wait when the stream is unavailable, or start a long-lived stream before throwing when it is available. Check the signal before ensureMjpegFrameSource() (and ideally while waiting for the first frame) so the fast path preserves the existing prompt cancellation behavior.
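One possible shape for this suggestion is sketched below. It is not the PR's code: `captureFromStream`, `FrameSource`, and `latestFrame` are hypothetical names, and the stream startup is passed in as a callback so the ordering of the checks is visible.

```typescript
interface Frame { base64: string; capturedAt: number }

// Hypothetical view of the stream-backed source: it only exposes the newest frame.
interface FrameSource {
  latestFrame(): Frame | undefined;
}

function throwIfAborted(signal?: AbortSignal): void {
  if (signal?.aborted) throw new Error("frame capture aborted");
}

async function captureFromStream(
  ensureSource: () => Promise<FrameSource>,
  opt: { count: number; intervalMs: number; abortSignal?: AbortSignal },
): Promise<Frame[]> {
  const { count, intervalMs, abortSignal } = opt;

  // Check before any stream startup, so a cancelled call never opens a connection.
  throwIfAborted(abortSignal);
  const source = await ensureSource();
  // Startup may have taken a while; check again before sampling.
  throwIfAborted(abortSignal);

  const frames: Frame[] = [];
  for (let i = 0; i < count; i++) {
    throwIfAborted(abortSignal);
    const frame = source.latestFrame();
    if (frame) frames.push(frame);
    if (i < count - 1) {
      await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
    }
  }
  return frames;
}
```

This sketch does not interrupt a startup wait that is already in progress; covering the "while waiting for the first frame" part of the suggestion would additionally require passing the signal into the startup routine.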


Comment on lines +104 to +105
if (this.latest) {
  return;

P2: Reject stale MJPEG frames after stream loss

After one frame has ever been decoded, ensureStarted() returns immediately even if the MJPEG connection has since ended and the reconnect loop is only retrying. If the port forward or WDA MJPEG server drops after a previous capture, later frameSequence calls can include an old screen as an “earliest” frame instead of throwing and falling back to fresh screenshots. Clear or age-check latest on disconnect/retry so unavailable streams degrade instead of feeding stale UI to the model.
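A minimal sketch of the "clear or age-check" idea follows. `LatestFrameCell`, `maxAgeMs`, and the method names are hypothetical and not part of the PR; the point is only that a frame is handed out when it is recent enough, and dropped when the connection ends.

```typescript
interface Frame { base64: string; capturedAt: number }

// Hypothetical holder for the newest decoded frame.
class LatestFrameCell {
  private latest: Frame | undefined;
  private readonly maxAgeMs: number;

  constructor(maxAgeMs: number) {
    this.maxAgeMs = maxAgeMs;
  }

  // Called for every frame decoded from the stream.
  set(frame: Frame): void {
    this.latest = frame;
  }

  // Called when the connection ends or a reconnect attempt starts.
  clear(): void {
    this.latest = undefined;
  }

  // Returns the frame only if it is recent enough; callers treat `undefined`
  // as "stream unavailable" and fall back to fresh screenshots.
  getFresh(now: number = Date.now()): Frame | undefined {
    if (!this.latest) return undefined;
    return now - this.latest.capturedAt <= this.maxAgeMs ? this.latest : undefined;
  }
}
```

Clearing on disconnect handles the case where the reader notices the drop; the age check additionally covers a connection that stalls without closing.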


Mirror Android scrcpy (disabled by default): the MJPEG frame source is
only used for frameSequence when explicitly enabled, so default behavior
is unchanged (frameSequence falls back to sequential screenshotBase64).

- core(device): add IOSDeviceOpt.wdaMjpegFrameSource.enabled (default off).
- ios: expose the captureFrameSequence capability only when enabled, so
  the agent transparently falls back when it is off.
- ios(agent-tools): plumb wdaMjpegFrameSource through the WDA init args.

Multi-device concurrency: the stream URL is built from the existing
per-device wdaMjpegPort option, so each device streams from its own port
(set a distinct wdaMjpegPort per device, like wdaPort). Added tests for
the opt-in gating and per-device stream URL.
quanru (Collaborator, Author) commented Jun 26, 2026

Follow-up (7d8101d): the WDA MJPEG frame source is now opt-in, disabled by default, mirroring Android scrcpy.

  • Enable via new IOSDevice({ wdaMjpegFrameSource: { enabled: true }, wdaMjpegPort }). When off (default), frameSequence behaves exactly as before — sequential screenshotBase64() capture — so there is no behavior change unless explicitly enabled.
  • The capability (captureFrameSequence) is only attached when enabled, so the agent transparently falls back when it is off.
  • Multi-device concurrency: the stream URL is built from the existing per-device wdaMjpegPort option, so each device streams from its own port (pass a distinct wdaMjpegPort per device, just like wdaPort). WDA-side MJPEG port forwarding is the user's responsibility, same as the WDA control port.

Added tests for the opt-in gating and the per-device stream URL.
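As an illustration of the multi-device setup, the sketch below builds one option object per device with distinct ports. Only the option names `wdaPort`, `wdaMjpegPort`, and `wdaMjpegFrameSource.enabled` come from this thread; the helper names and the base port values 8100 and 9100 are assumptions chosen for the example.

```typescript
// Subset of the device options relevant here (shape assumed from the comment above).
interface IosStreamOptions {
  wdaPort: number;
  wdaMjpegPort: number;
  wdaMjpegFrameSource: { enabled: boolean };
}

// Hypothetical helper: one option object per device, each with its own ports.
function optionsForDevices(
  deviceCount: number,
  baseWdaPort = 8100, // assumed example value
  baseMjpegPort = 9100, // assumed example value
): IosStreamOptions[] {
  return Array.from({ length: deviceCount }, (_, i) => ({
    wdaPort: baseWdaPort + i,
    wdaMjpegPort: baseMjpegPort + i,
    wdaMjpegFrameSource: { enabled: true }, // opt in; the default is off
  }));
}

// Hypothetical guard: two devices sharing a stream port would read each other's frames.
function assertDistinctMjpegPorts(options: IosStreamOptions[]): void {
  const seen = new Set<number>();
  for (const option of options) {
    if (seen.has(option.wdaMjpegPort)) {
      throw new Error(`wdaMjpegPort ${option.wdaMjpegPort} is used by more than one device`);
    }
    seen.add(option.wdaMjpegPort);
  }
}

const deviceOptions = optionsForDevices(2);
assertDistinctMjpegPorts(deviceOptions);
```

Each entry would then be passed to `new IOSDevice(...)` for its device, with the matching MJPEG port forward set up by the user as noted above.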
