test: mitigate e2e simulator hang / retry flakes #9057
Summary of Changes (Gemini Code Assist)
This pull request introduces a series of stability improvements for end-to-end testing, specifically targeting flaky simulator behavior and WebSocket connection issues. By enhancing the recovery logic for both the test runner and the underlying iOS simulator, the changes aim to reduce CI noise caused by transient infrastructure failures.
Code Review
This pull request introduces a patch to mocha-remote-server to handle transient client disconnects with a reconnect grace timer, and updates the E2E test suite to reboot the iOS simulator upon retryable launch failures. Feedback highlights a critical race condition in the server patch where a restarted client process may hang waiting for a run command that is never sent. Additionally, the synchronous reboot of the iOS simulator blocks the Node.js event loop, potentially freezing the WebSocket server; it is recommended to refactor this to be asynchronous and properly awaited.
Codecov Report
❌ Patch coverage is …
Additional details and impacted files
@@ Coverage Diff @@
## main #9057 +/- ##
============================================
+ Coverage 59.71% 62.82% +3.12%
============================================
Files 394 319 -75
Lines 24853 21417 -3436
Branches 4543 3963 -580
============================================
- Hits 14838 13454 -1384
+ Misses 9275 7556 -1719
+ Partials 740 407 -333
dd8a267 to 8a66583
Cap Gradle workers at min(physical_cpus, 6) to limit parallel heap pressure; 5GB daemon heap handles peak packageDebugAndroidTest load. Scales with hardware without overwhelming low-core CI machines.
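The worker cap described above can be sketched as a small pure function. This is only an illustration of the `min(physical_cpus, 6)` rule from the commit message; the function name `gradleWorkerCap` and the floor of one worker are assumptions, and the PR itself presumably applies the cap in the Gradle or CI configuration rather than in TypeScript.

```typescript
// Cap stated in the commit message: never use more than 6 Gradle workers.
const MAX_GRADLE_WORKERS = 6;

// Hypothetical helper: given a physical CPU count, return the worker count.
// Assumption: unknown or invalid counts fall back to a single worker.
function gradleWorkerCap(physicalCpus: number): number {
  const cpus = isFinite(physicalCpus) ? Math.floor(physicalCpus) : 1;
  return Math.min(Math.max(cpus, 1), MAX_GRADLE_WORKERS);
}
```

With this rule a 4-core CI machine gets 4 workers, while a 12-core workstation is held to 6, which keeps the number of parallel heaps bounded.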
…ilures Run modular getSessionId probes before all other analytics tests; drop namespace getSessionId coverage to avoid cross-test session interference.
Inline CI at Metro bundle time so Jet tests on device see the CI runner flag instead of an undefined process.env lookup.
Increase Jet reconnectGraceMs from 15s to 30s so transient WS 1006/1001 drops can recover before fatal exit during long debug+coverage runs.
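A minimal sketch of the grace-window decision, assuming it can be reduced to a pure function of the close code and the time since the drop. The codes 1006 and 1001 and the 30s default come from the commit message; `decideAfterClose` and the choice to fail immediately on any other close code are assumptions, not the Jet implementation.

```typescript
// Close codes named in the commit message:
// 1001 (going away) and 1006 (abnormal closure).
const TRANSIENT_CLOSE_CODES = [1001, 1006];

// Default from the commit message (raised from 15s to 30s).
const DEFAULT_RECONNECT_GRACE_MS = 30000;

type GraceDecision = "wait-for-reconnect" | "fail-run";

// Hypothetical helper: decide what to do `elapsedMs` after the socket closed.
// Assumption: any non-transient close code ends the run immediately.
function decideAfterClose(
  closeCode: number,
  elapsedMs: number,
  graceMs: number = DEFAULT_RECONNECT_GRACE_MS,
): GraceDecision {
  if (TRANSIENT_CLOSE_CODES.indexOf(closeCode) === -1) {
    return "fail-run";
  }
  return elapsedMs < graceMs ? "wait-for-reconnect" : "fail-run";
}
```

A drop that is 20s old would have been fatal under the previous 15s window but is still recoverable under the 30s one.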
Poll simctl boot state up to 120s before simctl install to avoid LaunchServices races when rebooting between Jet retry attempts.
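One way to picture the boot-state poll is as two pure steps: read the device state out of `xcrun simctl list devices --json` output, then decide whether to install, keep waiting, or give up. The sketch below assumes that JSON listing is the probe; the PR may use a different one, and the names `isDeviceBooted` and `nextPollAction` are hypothetical. Only the 120s limit is taken from the commit message.

```typescript
// Minimal shape of the simctl JSON listing (only the fields used here).
interface SimDevice { udid: string; state: string; }
interface SimctlList { devices: { [runtime: string]: SimDevice[] }; }

// Limit from the commit message.
const BOOT_POLL_TIMEOUT_MS = 120000;

// True when the device with `udid` is listed with state "Booted".
function isDeviceBooted(listJson: string, udid: string): boolean {
  const parsed = JSON.parse(listJson) as SimctlList;
  return Object.keys(parsed.devices).some((runtime) =>
    (parsed.devices[runtime] || []).some(
      (device) => device.udid === udid && device.state === "Booted",
    ),
  );
}

type PollAction = "install" | "wait" | "give-up";

// Hypothetical decision step, evaluated once per poll iteration.
function nextPollAction(
  booted: boolean,
  elapsedMs: number,
  timeoutMs: number = BOOT_POLL_TIMEOUT_MS,
): PollAction {
  if (booted) return "install";
  return elapsedMs < timeoutMs ? "wait" : "give-up";
}
```

The surrounding loop would run the probe, call `nextPollAction`, and sleep between iterations, so that `simctl install` only runs once the device reports `Booted`.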
Log loadavg/memory on transient disconnect, proactively pull coverage after reconnect, and default reconnect grace to 30s in the Jet patch.
Send pull-coverage when mocha-remote client reconnects mid-run and log coverage-ready receipt; align server reconnect grace default to 30s.
Ping keepalive on connect, log send readyState failures, and retry coverage-ready upload up to 3 times with backoff after reconnect.
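The retry schedule for the coverage-ready upload might look like the following. The limit of 3 attempts is from the commit message; the exponential shape, the 500ms base delay, and the name `backoffDelaysMs` are assumptions, since the message only says "with backoff".

```typescript
// Attempt limit from the commit message.
const MAX_UPLOAD_ATTEMPTS = 3;

// Hypothetical schedule: the delay to wait before each retry.
// Assumption: exponential backoff starting at `baseMs` (value not from the PR).
function backoffDelaysMs(
  attempts: number = MAX_UPLOAD_ATTEMPTS,
  baseMs: number = 500,
): number[] {
  const delays: number[] = [];
  // N attempts means N - 1 waits: none before the first attempt.
  for (let retry = 0; retry < attempts - 1; retry++) {
    delays.push(baseMs * Math.pow(2, retry));
  }
  return delays;
}
```

Under these assumed defaults, three attempts wait 500ms and then 1000ms between tries before the upload is reported as failed.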
Snapshot load, top, and e2e-related process stats every 10s into resource-monitor.log for correlating flakes with CPU/memory pressure.
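A sketch of what one line of such a log could contain, assuming a plain key=value layout. The 10s interval and the `resource-monitor.log` file name are from the commit message; the line format and the name `formatResourceSnapshot` are hypothetical, and the real monitor also records `top` and per-process stats that are not modelled here.

```typescript
// Interval from the commit message.
const SNAPSHOT_INTERVAL_MS = 10000;

// Hypothetical formatter for one snapshot line.
// Assumption: the log is line-oriented key=value text.
function formatResourceSnapshot(
  timestampIso: string,
  loadavg: number[],
  freeMemBytes: number,
  totalMemBytes: number,
): string {
  const bytesPerMb = 1024 * 1024;
  const load = loadavg.map((value) => value.toFixed(2)).join(",");
  return (
    timestampIso +
    " load=" + load +
    " mem_free_mb=" + Math.round(freeMemBytes / bytesPerMb) +
    " mem_total_mb=" + Math.round(totalMemBytes / bytesPerMb)
  );
}
```

In Node.js the inputs could come from `os.loadavg()`, `os.freemem()`, and `os.totalmem()`, with a timer firing every `SNAPSHOT_INTERVAL_MS` and appending each line to the log file.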
Collate jet-ws, rnfb-e2e, lifecycle, and launch markers from CI logs into flake-summary.txt for faster post-run triage.
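The collation step can be sketched as a filter over log lines, grouped by marker. The marker names are the ones listed in the commit message; matching them as plain substrings, the section layout, and the name `collateFlakeSummary` are assumptions about a format the PR does not spell out here.

```typescript
// Marker names from the commit message.
const FLAKE_MARKERS = ["jet-ws", "rnfb-e2e", "lifecycle", "launch"];

// Hypothetical collation: one section per marker with the matching lines.
// Assumption: a line belongs to a marker if it contains it as a substring.
function collateFlakeSummary(
  logText: string,
  markers: string[] = FLAKE_MARKERS,
): string {
  const lines = logText.split(/\r?\n/);
  const sections = markers.map((marker) => {
    const hits = lines.filter((line) => line.indexOf(marker) !== -1);
    return "== " + marker + " (" + hits.length + ") ==\n" + hits.join("\n");
  });
  return sections.join("\n\n");
}
```

Writing the returned string to `flake-summary.txt` gives a single file where a zero count for a marker is as visible as a burst of hits.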
Stream testing/SpringBoard logs, run resource monitor, tee Detox output, write flake summary, and upload new diagnostic artifacts on failure.
Match FrontBoard/FBSOpenApplication launch errors and treat coverage teardown WebSocket failures as retryable Jet session failures.
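A minimal sketch of the launch-error classifier, assuming it is a pattern match on the failure message. The two tokens `FrontBoard` and `FBSOpenApplication` are from the commit message; the function name and the sample messages in the usage note are illustrative, and the coverage-teardown WebSocket case is not covered here.

```typescript
// Tokens named in the commit message.
const RETRYABLE_LAUNCH_PATTERN = /FrontBoard|FBSOpenApplication/;

// Hypothetical classifier: true when a launch failure looks like a
// FrontBoard / FBSOpenApplication flake rather than a real test failure.
function isRetryableLaunchError(message: string): boolean {
  return RETRYABLE_LAUNCH_PATTERN.test(message);
}
```

For example, a message mentioning `FBSOpenApplicationServiceErrorDomain` would be classified as retryable, while an ordinary assertion failure would not.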
Dump get_app_container/listapps before and after each launch attempt and log the Detox failure reason when launchAppWithRetry gives up.
Time terminateApp during launch retries and reboot the simulator when terminate exceeds RNFB_SLOW_TERMINATE_MS before relaunching.
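The slow-terminate check could be sketched as below. `RNFB_SLOW_TERMINATE_MS` is the variable named in the commit message; the 10s fallback, the helper names, and passing the environment in as a plain object are assumptions made for the sake of a self-contained example.

```typescript
// Assumed fallback; the PR's actual default is not shown in this thread.
const DEFAULT_SLOW_TERMINATE_MS = 10000;

// Read the threshold from an env-like object (e.g. process.env in Node.js).
function slowTerminateThresholdMs(
  env: { [key: string]: string | undefined },
): number {
  const raw = env["RNFB_SLOW_TERMINATE_MS"];
  const parsed = raw === undefined ? NaN : Number(raw);
  return isFinite(parsed) && parsed > 0 ? parsed : DEFAULT_SLOW_TERMINATE_MS;
}

// Reboot only when terminate took strictly longer than the threshold.
function shouldRebootAfterTerminate(
  elapsedMs: number,
  thresholdMs: number,
): boolean {
  return elapsedMs > thresholdMs;
}
```

The caller would record a timestamp before and after `terminateApp`, pass the difference to `shouldRebootAfterTerminate`, and reboot the simulator before relaunching when it returns `true`.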
Use shorter release launch timeout, skip delete on inner retry, and log liveMetro/delete flags to distinguish release stalls from Metro issues.
Mark exhausted inner launch retries as Jet-retryable so debug FrontBoard failures get a full simulator reboot instead of a terminal false.
Emit structured retry-eligibility checks on Jet attempt failure so CI logs show which sub-condition blocked or allowed the second attempt.
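One possible shape for a structured eligibility record is a list of named sub-conditions plus the overall verdict. The two checks below and every name in this sketch are hypothetical; the PR's actual sub-conditions are not listed in this thread.

```typescript
interface RetryCheck { name: string; passed: boolean; }
interface RetryEligibility { eligible: boolean; checks: RetryCheck[]; }

// Hypothetical evaluation: every sub-condition must pass for a retry.
function evaluateRetryEligibility(input: {
  attempt: number;
  maxAttempts: number;
  failureIsRetryable: boolean;
}): RetryEligibility {
  const checks: RetryCheck[] = [
    { name: "attempts-remaining", passed: input.attempt < input.maxAttempts },
    { name: "failure-is-retryable", passed: input.failureIsRetryable },
  ];
  const eligible = checks.every((check) => check.passed);
  return { eligible: eligible, checks: checks };
}

// One greppable log line per failed attempt (prefix is an assumption).
function formatRetryEligibility(result: RetryEligibility): string {
  return "[retry-eligibility] " + JSON.stringify(result);
}
```

Logging each sub-condition separately means a CI log shows which check blocked the second attempt, not only that no retry happened.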
Update OKF bundle with new artifacts, boot-simulator shutdown wait, Jet WS/coverage handshake mitigations, FrontBoard launch flakes, and local stress iteration guidance.
Host-orchestrated Tart VMs with detached iteration, session-scoped artifacts, virtiofs completion polling, and optional SCP harvest (--no-sync-artifacts).
Snapshot host and guest loadavg during Detox runs, upload the log as a CI artifact, and include it in flake-summary triage.
Fix reconnect send race after WS grace recovery under saturated macOS CI load: assign mocha-remote client before connection emit, single server-side pull-coverage, non-throwing Server.send, and Jet retry on No client connected. Regenerate yarn.lock after patch updates; document in okf-bundle.
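A non-throwing send can be sketched as a guard that reports failure through its return value. This is an assumption about the shape of the fix, not the patched `Server.send` itself: `SocketLike`, `safeSend`, and the log messages are hypothetical, while the value 1 for the open state matches `WebSocket.OPEN`.

```typescript
// Minimal socket shape needed for the sketch.
interface SocketLike {
  readyState: number;
  send(data: string): void;
}

// Numeric value of WebSocket.OPEN.
const WS_OPEN = 1;

// Hypothetical guard: returns false instead of throwing when the message
// cannot be delivered, so the caller can decide whether to retry.
function safeSend(
  socket: SocketLike | undefined,
  payload: string,
  log: (message: string) => void,
): boolean {
  if (!socket) {
    log("send skipped: no client connected");
    return false;
  }
  if (socket.readyState !== WS_OPEN) {
    log("send skipped: readyState=" + socket.readyState);
    return false;
  }
  try {
    socket.send(payload);
    return true;
  } catch (err) {
    log("send failed: " + String(err));
    return false;
  }
}
```

Returning `false` rather than throwing lets a missing client during the reconnect window surface as a retryable condition instead of an uncaught exception on the server.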
44505e3 to 205f64e
What an epic. Probably more deflakes to come!
Summary
This PR is intended to hold a continued series of e2e flake fixes.
Test plan