Commit c05c058

Lykhoyda and claude authored
feat(#264): Phase 5 — bridge supervisor split (survive Metro restarts) (#273)
* docs(plan): #264 Phase 5 TDD plan — bridge supervisor split

* docs(plan): #264 Phase 5 — amendments from multi-LLM plan review
  - BLOCKER: child.on('error') funneled into one onDeath pass (ENOENT must not crash the supervisor); integration test added
  - BLOCKER: setEncoding('utf8') on stdin/worker.stdout (UTF-8 codepoint splits corrupted JSON); real Buffer-split integration test
  - initialize out of pending-set + initializeAnswered (crash-before-first-response no longer wedges the handshake)
  - terminal error names resolved logger.logFilePath, not tmpdir()
  - SIGUSR2=exit-1 pinned by source-text test; workerRestarts monotonic
  - onSpawned moved into apply()'s spawn branch

* docs(plan): #264 Phase 5 — codex-pair round-2 amendments (probe buffering, lock gate)

* docs(plan): #264 Phase 5 — fix lock-gate shell snippet (shared cwd via subshells)

* docs(plan): #264 Task 0 findings — (a)/(b) clean degradation, (c) SIGKILL repro confirmed

* feat(#264): LineSplitter — newline-delimited JSON-RPC framing

* feat(#264): SupervisorCore — handshake replay, death errors, bounded respawn budget

* feat(#264): supervisor entry — spawn/pipe/respawn worker, lock + parent-watch ownership

* feat(#264): cdp_status.bridge — supervised / workerRestarts / lastWorkerExit

* feat(#264): MCP entry point → dist/supervisor.js (RN_BRIDGE_SUPERVISOR=0 escape hatch)

* docs(#264): supervisor split — architecture, troubleshooting, changeset

* chore(#264): rebuilt dist (supervisor entry) + gitignore .brainstorm-tmp

* fix(#264/B200): process.ppid property, not nonexistent process.getppid()
  Found by the Phase 5 lock-conflict live gate: both feature-detects always fell back to 0, so (1) every lock recorded ppid:0 and isLockLive's orphan check (livePpid !== body.ppid) reclaimed ANY live holder's lock — the single-instance guarantee was broken since #182 — and (2) the parent-death watch never fired (0 === 0 forever). New runtime test pins the real API.

* feat(#264): carry bridge supervision facts on cdp_status failure paths + rebuilt dist
  Live-gate finding: a sim with no Hermes target takes the connect-failure path, which omitted bridge.* entirely — supervision facts are env-derived and must be visible exactly when the bridge is unhealthy (same rationale as the existing reconnect/autoConnect extras).

* fix(#264): SIGUSR2 hot-reload no longer charges the crash budget
  PR #273 review (Gemini): the reload's exit-1 was indistinguishable from a crash — three reloads in 60s wedged the bridge into terminal mode. The supervisor now flags the core before forwarding SIGUSR2; a flagged exit respawns + replays + counts in telemetry without burning the budget. Also documents the in-order-stdin reliance of the initialized replay.

* fix(#264): per-child worker splitter — dead worker's partial frame can't contaminate the respawn
  PR #273 Codex P2: a worker killed mid-write left an unterminated tail in the shared LineSplitter, prefixing the fresh worker's first line and corrupting the replayed-initialize answer. Deterministic repro fixture (partial-then-echo-worker) + integration test; per-child splitter fix.

* fix(#264): terminal error reaches a never-answered initialize (no silent handshake hang)
  PR #273 Codex P2 round 2: initialize is exempt from pending (replayable by design), so a worker crash-looping to budget exhaustion before its first answer entered terminal mode without any response — the MCP host hung on the handshake forever. The terminal transition now errors the handshake id.

* fix(#264): queued mid-restart requests are delivered exactly once, never double-answered
  codex-pair MED: a request queued during a restart was marked pending before delivery — if the fresh worker crashed pre-drain, the id got a -32000 death error AND a later queued replay (two responses for one JSON-RPC id). Now: queueing != pending (a never-delivered request didn't fail; it drains once), pending starts at delivery (drainQueue), and the terminal transition errors + drops queued ids so nothing hangs.

* fix(#264): strict handshake ordering — initialized replays after the initialize response
  codex-pair MED: the eager initialized replay relied on SDK v1.29.0 not gating tool calls (a hidden version-specific assumption). The swallow branch now emits initialized + drains the queue only after the fresh worker's initialize response arrives.

* fix(#264): pre-handshake worker never receives client traffic; behavioral SIGUSR2 pin; comment hygiene
  codex-pair round 5:
  - the crash-before-first-response replay branch flipped to running and drained the queue before the fresh worker answered initialize; now it stays restarting and a replayForwardId gates the drain on the response (symmetric with the swallow path)
  - SIGUSR2 exit-1 contract pinned behaviorally against dist/index.js instead of a source regex
  - review-history comment references trimmed to durable invariants

* fix(#264): initialized arriving mid-restart replays exactly once
  It was both cached (for the post-handshake replay) and queued, so the fresh worker received two initialized frames. Cache-only during restart.

* fix(#264): drop a dead child's late stdout lines
  Node can emit 'exit' before stdout drains; after onDeath ran (pending ids errored, replacement possibly spawned), a late line could double-answer an errored id or satisfy the replayed-initialize gate in the fresh worker's place. Guard the data handler with the per-child death flag.

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
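The framing fixes named in the commit message (newline-delimited JSON-RPC, `setEncoding('utf8')` on the streams, one splitter per child) can be illustrated roughly as follows. This is a minimal sketch and not the commit's `LineSplitter` source: the `push` method name, the CRLF tolerance, and the blank-line skipping are assumptions. It also assumes chunks arrive as already-decoded strings, which is what `setEncoding('utf8')` on a Node.js readable stream provides.

```typescript
// Hypothetical sketch of a newline-delimited framer; names are illustrative.
class LineSplitter {
  // Unterminated remainder carried over from the previous chunk.
  private tail = "";

  // Feed one decoded string chunk; returns every line completed by it.
  push(chunk: string): string[] {
    const parts = (this.tail + chunk).split("\n");
    // The last element is whatever follows the final newline (possibly "").
    this.tail = parts.pop() ?? "";
    // Tolerate CRLF and skip empty lines (an assumption of this sketch).
    return parts
      .map((line) => line.replace(/\r$/, ""))
      .filter((line) => line.length > 0);
  }
}

// One splitter per spawned worker: when a worker dies mid-write, its
// partial frame is dropped together with its splitter instance.
function splitterForNewWorker(): LineSplitter {
  return new LineSplitter();
}
```

Creating a fresh instance per child is what keeps a dead worker's unterminated tail from prefixing the first line of its replacement.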
1 parent ea3e025 commit c05c058
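
The B200 fix in the commit message hinges on `process.ppid` being a property in Node.js; there is no `process.getppid()`. A sketch of a parent-death check built on that property is shown here, with the pid getter injected so the logic can be exercised without a real parent exit. `makeParentWatch` and its callback are hypothetical names, not taken from the commit, and polling is an assumed mechanism.

```typescript
// Hypothetical sketch: detect that the original parent process is gone
// by noticing that the parent pid no longer matches the one seen at start.
function makeParentWatch(
  getPpid: () => number,
  onOrphaned: () => void,
): () => boolean {
  const initialPpid = getPpid();
  let fired = false;

  // Returns true once the parent change has been observed.
  return function check(): boolean {
    if (!fired && getPpid() !== initialPpid) {
      fired = true; // fire the callback at most once
      onOrphaned();
    }
    return fired;
  };
}
```

In a real process the getter would be `() => process.ppid`, and `check` would run on a timer such as `setInterval(check, 1000)`. A getter that always returns 0 (the bug described above) makes the comparison `0 !== 0`, so the callback never fires.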

27 files changed

Lines changed: 2762 additions & 27 deletions
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+---
+"rn-dev-agent-cdp": minor
+"rn-dev-agent-plugin": minor
+---
+
+#202 Phase 5 / #264 — the bridge now survives Metro restarts (supervisor split).
+
+The MCP entry point is now `dist/supervisor.js`: a thin stdio shim holding zero network sockets (immune to `lsof -ti tcp:8081 | xargs kill -9`, which used to SIGKILL the whole server and cost the session all 77 tools). It spawns the real bridge as a worker, and on worker death: errors in-flight calls with `-32000` ("retry the call"), respawns it (max 3 per rolling 60 s, then a terminal crash-loop error), and replays the cached MCP `initialize` handshake so the session continues seamlessly. Visibility: `cdp_status` → `bridge: { supervised, workerRestarts, lastWorkerExit }`. Opt out with `RN_BRIDGE_SUPERVISOR=0` (legacy single process). `SIGUSR2` now performs a real hot-reload (worker restart + handshake replay).
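
The "max 3 per rolling 60 s" respawn limit in the changeset can be sketched as a sliding-window counter. This is an illustration under stated assumptions, not the `SupervisorCore` implementation: the class name `RespawnBudget`, the `tryConsume` method, and the reading that a fourth death inside the window is the one that goes terminal are all hypothetical.

```typescript
// Hypothetical sketch of a bounded respawn budget over a rolling window.
class RespawnBudget {
  private stamps: number[] = [];
  private readonly max: number;
  private readonly windowMs: number;

  constructor(max = 3, windowMs = 60_000) {
    this.max = max;
    this.windowMs = windowMs;
  }

  // Returns true and records a respawn if the window still has room;
  // returns false when the caller should enter the terminal state instead.
  tryConsume(nowMs: number): boolean {
    // Forget respawns that have aged out of the rolling window.
    this.stamps = this.stamps.filter((t) => nowMs - t < this.windowMs);
    if (this.stamps.length >= this.max) return false;
    this.stamps.push(nowMs);
    return true;
  }
}
```

Per the commit message, a `SIGUSR2` hot-reload does not charge the crash budget; in this sketch that would mean respawning without calling `tryConsume` for a flagged exit.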

.claude-plugin/plugin.json

Lines changed: 1 addition & 1 deletion
@@ -56,7 +56,7 @@
     "cdp": {
       "command": "node",
       "args": [
-        "${CLAUDE_PLUGIN_ROOT}/scripts/cdp-bridge/dist/index.js"
+        "${CLAUDE_PLUGIN_ROOT}/scripts/cdp-bridge/dist/supervisor.js"
       ]
     }
   }
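
With the entry point switched to the supervisor, worker death is answered on the worker's behalf: the changeset says in-flight calls are errored with `-32000` ("retry the call"). A rough sketch of that bookkeeping follows. The `PendingCalls` class and its method names are hypothetical; only the JSON-RPC 2.0 error-response shape and the `-32000` code come from the specification and the commit text.

```typescript
type RpcId = string | number;

interface RpcErrorResponse {
  jsonrpc: "2.0";
  id: RpcId;
  error: { code: number; message: string };
}

// Hypothetical sketch: track request ids delivered to the worker and
// synthesize error responses for the unanswered ones when it dies.
class PendingCalls {
  private readonly ids = new Set<RpcId>();

  // Call when a request has actually been written to the worker.
  onForwarded(id: RpcId): void {
    this.ids.add(id);
  }

  // Call when the worker's response for that id has been relayed.
  onAnswered(id: RpcId): void {
    this.ids.delete(id);
  }

  // Build one -32000 error per unanswered id, then forget them all.
  failAll(message: string): RpcErrorResponse[] {
    const responses = Array.from(this.ids).map((id) => ({
      jsonrpc: "2.0" as const,
      id,
      error: { code: -32000, message },
    }));
    this.ids.clear();
    return responses;
  }
}
```

The commit message notes that `initialize` is deliberately kept out of the pending set because it is replayed rather than failed, and that a request only becomes pending at delivery; a caller of this sketch would follow both rules by invoking `onForwarded` only for delivered, non-`initialize` requests.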

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -31,3 +31,4 @@ scripts/rn-fast-runner/**/DerivedData/
 # cdp-bridge eval harnesses (local-only dev scaffolding — live-gate scripts
 # driven against booted devices, e.g. eval/live-gate-gh253.mjs; not shipped)
 scripts/cdp-bridge/eval/
+.brainstorm-tmp/

CLAUDE.md

Lines changed: 3 additions & 0 deletions
@@ -106,6 +106,7 @@ Repo-local troubleshooting memory (replaces the Experience Engine):
 - **Legacy `AgentDeviceRunner` re-appears on the simulator** → A stale `~/.agent-device/daemon.json` is respawning the upstream runner. Since #202 the plugin terminates stale `AgentDeviceRunner` processes at session-open by default (scoped to the target simulator UDID), clears orphaned `~/.agent-device/daemon.{json,lock}`, and (Phase 4) **uninstalls the legacy runner apps** (`com.callstack.agentdevice.runner` + its xctrunner) from the target simulator — killing the process alone was insufficient because iOS relaunches an installed XCUITest runner mid-flow. This should fully self-heal. If you've opted out via `RN_DEVICE_KILL_LEGACY=0`, clean up one-time: `pkill -f AgentDeviceRunner && rm -f ~/.agent-device/daemon.json ~/.agent-device/daemon.lock && xcrun simctl uninstall <udid> com.callstack.agentdevice.runner && xcrun simctl uninstall <udid> com.callstack.agentdevice.runner.uitests.xctrunner`.
 - **`RnFastRunner` / `RnFastRunnerUITests-Runner` icons appear on the simulator** → Expected, not clutter. iOS device control is an XCUITest rig (D1219), so running it installs two apps: `RnFastRunner` (the minimal host app, bundle `dev.lykhoyda.rndevagent.fastrunner`) and `RnFastRunnerUITests-Runner` (the XCUITest harness — same pattern as WebDriverAgent's `WebDriverAgentRunner`). The Runner hosts the `POST /command` HTTP server on port 22088 and drives YOUR app via `XCUIApplication(bundleIdentifier:)` — it never drives itself. It stays installed/running on purpose so subsequent `device_*` calls are fast; leave it. (Contrast the legacy `AgentDeviceRunner` above, which IS unwanted.)
 - **"Disconnected due to opening a second DevTools window" / React Native DevTools keeps getting kicked** → RN allows exactly one debugger frontend per app, and the bridge auto-reconnects by default (agent-first). To let the visual DevTools hold the seat, set `RN_CDP_AUTOCONNECT=0` (or `.rn-agent/config.json` → `{ "cdp": { "autoConnect": false } }`). The bridge then reconnects only when a CDP tool actually runs, and yields again once you reopen DevTools. Note: **any** CDP tool call — including `cdp_status` — reclaims the seat while it runs; passive mode only stops *background* re-grabs. Check the resolved mode in `cdp_status` → `autoConnect`.
+- **MCP server died when Metro was restarted (all tools gone until session restart)** → Fixed since #202 Phase 5 (#264): the stdio supervisor holds no network sockets, so port-based kills (`lsof -ti tcp:8081 | xargs kill -9`) only take the worker, which respawns automatically (`cdp_status` → `bridge.workerRestarts`). If tools error with "worker is crash-looping", check the bridge log (`LOG_LEVEL=info` writes it) and restart the session. `RN_BRIDGE_SUPERVISOR=0` opts back into the legacy single-process bridge.
 - **"No booted simulator"** → Open Simulator.app or boot one via Xcode
 - **iOS 26.x beta issues** → Use iOS 18 stable runtime (Xcode > Settings > Platforms)
 - **Node.js odd version (v25)** → Switch to Node 22 LTS: `nvm install 22 && nvm use 22`
@@ -162,6 +163,8 @@ One mechanism per capability tier. The device-session honors this contract (the
 
 ### MCP Server (cdp-bridge)
 
+Since #202 Phase 5 (#264), the MCP entry point is a **supervisor split**: `dist/supervisor.js` owns stdio with Claude Code and holds ZERO network sockets, so `lsof -ti tcp:8081 | xargs kill -9` (a documented Metro-recovery step) can no longer kill the bridge. It spawns the real server (`dist/index.js --no-lock`) as a worker, caches the MCP `initialize` handshake, and on worker death errors out in-flight calls (`-32000`, "retry the call"), respawns (max 3 per rolling 60 s, then a terminal crash-loop error naming the worker's last exit), and replays the handshake. The single-instance `Lockfile` + parent-death watch live in the supervisor; the worker keeps the UDID device lock. In-memory state (arbiter lease, ring buffers, CDP connection) is rebuilt on respawn by design. `cdp_status` → `bridge: { supervised, workerRestarts, lastWorkerExit }`. Escape hatch: `RN_BRIDGE_SUPERVISOR=0` runs the legacy single process. `SIGUSR2` to the supervisor = real hot-reload (worker restart + handshake replay).
+
 **76 tools** exposed via MCP (re-audited 2026-05-31; counted from `trackedTool()` calls in `scripts/cdp-bridge/src/index.ts`). Five conceptual families:
 
 **CDP tools** — React internals via Chrome DevTools Protocol over WebSocket:

docs-site/src/content/docs/architecture.mdx

Lines changed: 2 additions & 0 deletions
@@ -96,6 +96,8 @@ One mechanism per capability tier — **L1 + L2 coexist** (drive with XCTest, as
 
 The MCP server is a Node.js process that maintains a persistent WebSocket connection to the React Native app's Hermes engine through Metro's CDP endpoint.
 
+Since #264 the entry point is a **supervisor split**: a thin stdio shim (`dist/supervisor.js`) that holds zero network sockets owns the MCP connection with Claude Code and spawns the real bridge as a respawnable worker. Killing everything on Metro's port (`lsof -ti tcp:8081 | xargs kill -9` — a common recovery step) used to SIGKILL the whole server and cost the session every tool; now it only takes the worker, which the supervisor respawns (bounded: 3 per rolling 60 s), replaying the cached MCP `initialize` handshake so the session continues. In-flight calls at the moment of death fail fast with a "worker restarted — retry the call" error. Supervision state is visible in `cdp_status` → `bridge: { supervised, workerRestarts, lastWorkerExit }`; opt out with `RN_BRIDGE_SUPERVISOR=0`.
+
 **74 tools** across five families:
 - **CDP** — React internals via Chrome DevTools Protocol (component tree, store state, navigation, profiling, network)
 - **Device** — Native interaction (iOS: rn-fast-runner, Android: agent-device)

docs-site/src/content/docs/troubleshooting.mdx

Lines changed: 4 additions & 0 deletions
@@ -27,6 +27,10 @@ The bridge auto-reconnects by default and evicts the visual React Native DevTool
 This is normal. `cdp_reload` automatically reconnects within 15 seconds. If it fails, call `cdp_status` to re-establish the connection.
 </Aside>
 
+<Aside type="tip" title="MCP server died when Metro was restarted">
+Fixed since #264: the bridge entry point is a stdio supervisor that holds no network sockets, so port-based kills (`lsof -ti tcp:8081 | xargs kill -9`) only take the worker process — the supervisor respawns it automatically and the session keeps its tools (`cdp_status` → `bridge.workerRestarts`). If tools error with "worker is crash-looping", check the bridge log (`LOG_LEVEL=info` writes one) and restart the Claude Code session. `RN_BRIDGE_SUPERVISOR=0` opts back into the legacy single-process bridge.
+</Aside>
+
 ## Store state issues
 
 <Aside type="tip" title="cdp_store_state error for Zustand">
