Lykhoyda
diff --git a/.changeset/phase5-supervisor-split.md b/.changeset/phase5-supervisor-split.md
Lines changed: 8 additions & 0 deletions
diff --git a/.claude-plugin/plugin.json b/.claude-plugin/plugin.json
Lines changed: 1 addition & 1 deletion
diff --git a/.gitignore b/.gitignore
Lines changed: 1 addition & 0 deletions
diff --git a/CLAUDE.md b/CLAUDE.md
Lines changed: 3 additions & 0 deletions
diff --git a/docs-site/src/content/docs/architecture.mdx b/docs-site/src/content/docs/architecture.mdx
Lines changed: 2 additions & 0 deletions
diff --git a/docs-site/src/content/docs/troubleshooting.mdx b/docs-site/src/content/docs/troubleshooting.mdx
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,8 @@
+---
+"rn-dev-agent-cdp": minor
+"rn-dev-agent-plugin": minor
+---
+
+#202 Phase 5 / #264 — the bridge now survives Metro restarts (supervisor split).
+
+The MCP entry point is now `dist/supervisor.js`: a thin stdio shim that holds zero network sockets, so it is immune to `lsof -ti tcp:8081 | xargs kill -9` (which used to SIGKILL the whole server and cost the session every MCP tool). It spawns the real bridge as a worker, and when the worker dies it fails in-flight calls with JSON-RPC error `-32000` ("retry the call"), respawns the worker (at most 3 times per rolling 60 s, after which it returns a terminal crash-loop error), and replays the cached MCP `initialize` handshake so the session continues seamlessly. Visibility: `cdp_status` → `bridge: { supervised, workerRestarts, lastWorkerExit }`. Opt out with `RN_BRIDGE_SUPERVISOR=0` (legacy single process). `SIGUSR2` now performs a real hot-reload (worker restart + handshake replay).
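
The respawn budget in the changeset (at most 3 respawns per rolling 60 s, then a terminal crash-loop error) can be modeled as a sliding window of restart timestamps. This is a minimal sketch under stated assumptions, not the bridge's implementation: the class name `RestartBudget`, the method `tryConsume`, and the rule that a restart exactly 60 s old has already left the window are all hypothetical.

```typescript
// Hypothetical sketch of a "max N respawns per rolling window" budget.
// RestartBudget and tryConsume are illustrative names, not the bridge's API.
class RestartBudget {
  // Timestamps (ms) of respawns still inside the window, oldest first.
  private readonly stamps: number[] = [];

  constructor(
    private readonly maxRestarts: number = 3,
    private readonly windowMs: number = 60_000,
  ) {}

  /** Returns true and records the respawn if the budget allows one at `now` (ms). */
  tryConsume(now: number): boolean {
    // Assumption: a respawn exactly `windowMs` old has left the window.
    while (this.stamps.length > 0 && now - this.stamps[0] >= this.windowMs) {
      this.stamps.shift();
    }
    if (this.stamps.length >= this.maxRestarts) {
      // Crash loop: the caller should stop respawning and surface a terminal error.
      return false;
    }
    this.stamps.push(now);
    return true;
  }
}

const budget = new RestartBudget();
for (const t of [0, 10_000, 20_000, 30_000, 60_000]) {
  // prints true, true, true, false, true for these five attempts
  console.log(`respawn at ${t / 1000} s allowed: ${budget.tryConsume(t)}`);
}
```

In this sketch a refused attempt is not recorded, so a crash-looping worker cannot keep pushing the window forward; the fifth attempt succeeds only because the 0 s restart has aged out.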
@@ -56,7 +56,7 @@
     "cdp": {
       "command": "node",
       "args": [
-        "${CLAUDE_PLUGIN_ROOT}/scripts/cdp-bridge/dist/index.js"
+        "${CLAUDE_PLUGIN_ROOT}/scripts/cdp-bridge/dist/supervisor.js"
       ]
     }
   }
 
@@ -31,3 +31,4 @@ scripts/rn-fast-runner/**/DerivedData/
 # cdp-bridge eval harnesses (local-only dev scaffolding — live-gate scripts
 # driven against booted devices, e.g. eval/live-gate-gh253.mjs; not shipped)
 scripts/cdp-bridge/eval/
+.brainstorm-tmp/
@@ -106,6 +106,7 @@ Repo-local troubleshooting memory (replaces the Experience Engine):
 - **Legacy `AgentDeviceRunner` re-appears on the simulator** → A stale `~/.agent-device/daemon.json` is respawning the upstream runner. Since #202 the plugin terminates stale `AgentDeviceRunner` processes at session-open by default (scoped to the target simulator UDID), clears orphaned `~/.agent-device/daemon.{json,lock}`, and (Phase 4) **uninstalls the legacy runner apps** (`com.callstack.agentdevice.runner` + its xctrunner) from the target simulator — killing the process alone was insufficient because iOS relaunches an installed XCUITest runner mid-flow. This should fully self-heal. If you've opted out via `RN_DEVICE_KILL_LEGACY=0`, clean up one-time: `pkill -f AgentDeviceRunner && rm -f ~/.agent-device/daemon.json ~/.agent-device/daemon.lock && xcrun simctl uninstall <udid> com.callstack.agentdevice.runner && xcrun simctl uninstall <udid> com.callstack.agentdevice.runner.uitests.xctrunner`.
 - **`RnFastRunner` / `RnFastRunnerUITests-Runner` icons appear on the simulator** → Expected, not clutter. iOS device control is an XCUITest rig (D1219), so running it installs two apps: `RnFastRunner` (the minimal host app, bundle `dev.lykhoyda.rndevagent.fastrunner`) and `RnFastRunnerUITests-Runner` (the XCUITest harness — same pattern as WebDriverAgent's `WebDriverAgentRunner`). The Runner hosts the `POST /command` HTTP server on port 22088 and drives YOUR app via `XCUIApplication(bundleIdentifier:)` — it never drives itself. It stays installed/running on purpose so subsequent `device_*` calls are fast; leave it. (Contrast the legacy `AgentDeviceRunner` above, which IS unwanted.)
 - **"Disconnected due to opening a second DevTools window" / React Native DevTools keeps getting kicked** → RN allows exactly one debugger frontend per app, and the bridge auto-reconnects by default (agent-first). To let the visual DevTools hold the seat, set `RN_CDP_AUTOCONNECT=0` (or `.rn-agent/config.json` → `{ "cdp": { "autoConnect": false } }`). The bridge then reconnects only when a CDP tool actually runs, and yields again once you reopen DevTools. Note: **any** CDP tool call — including `cdp_status` — reclaims the seat while it runs; passive mode only stops *background* re-grabs. Check the resolved mode in `cdp_status` → `autoConnect`.
+- **MCP server died when Metro was restarted (all tools gone until session restart)** → Fixed since #202 Phase 5 (#264): the stdio supervisor holds no network sockets, so port-based kills (`lsof -ti tcp:8081 | xargs kill -9`) take down only the worker, which respawns automatically (`cdp_status` → `bridge.workerRestarts`). If tools error with "worker is crash-looping", check the bridge log (written when `LOG_LEVEL=info` is set) and restart the session. `RN_BRIDGE_SUPERVISOR=0` opts back into the legacy single-process bridge.
 - **"No booted simulator"** → Open Simulator.app or boot one via Xcode
 - **iOS 26.x beta issues** → Use iOS 18 stable runtime (Xcode > Settings > Platforms)
 - **Node.js odd version (v25)** → Switch to Node 22 LTS: `nvm install 22 && nvm use 22`
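
The `bridge` block that `cdp_status` reports can be read as below to tell a supervised bridge from the legacy one and to see how the last worker died. The field names come from this change; everything else is an assumption: the `{ code, signal }` shape of `lastWorkerExit` (chosen to mirror the arguments of Node's `ChildProcess` `'exit'` event), the legacy bridge reporting `supervised: false`, and the helper `describeBridge` itself, which is hypothetical.

```typescript
// Hypothetical helper for the `bridge` block of `cdp_status`. Field names are from
// the docs; the types (especially lastWorkerExit) are assumptions.
interface WorkerExit {
  code: number | null; // exit code, or null when the worker was killed by a signal
  signal: string | null; // e.g. "SIGKILL", or null on a normal exit
}

interface BridgeStatus {
  supervised: boolean;
  workerRestarts: number;
  lastWorkerExit: WorkerExit | null;
}

function describeBridge(b: BridgeStatus): string {
  if (!b.supervised) {
    return "legacy single-process bridge (RN_BRIDGE_SUPERVISOR=0)";
  }
  if (b.workerRestarts === 0) {
    return "supervised; worker has not restarted";
  }
  const exit = b.lastWorkerExit;
  // Prefer the signal name (a port-based kill -9 shows up as SIGKILL), else the exit code.
  const how = exit === null ? "unknown" : (exit.signal ?? `code ${exit.code}`);
  return `supervised; ${b.workerRestarts} worker restart(s), last exit: ${how}`;
}

// prints "supervised; 1 worker restart(s), last exit: SIGKILL"
console.log(
  describeBridge({
    supervised: true,
    workerRestarts: 1,
    lastWorkerExit: { code: null, signal: "SIGKILL" },
  }),
);
```

Under this assumed shape, a worker taken down by `lsof -ti tcp:8081 | xargs kill -9` would report `{ code: null, signal: "SIGKILL" }`, since Node reports a null exit code for signal-terminated children.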
@@ -162,6 +163,8 @@ One mechanism per capability tier. The device-session honors this contract (the
 
 ### MCP Server (cdp-bridge)
 
+Since #202 Phase 5 (#264), the MCP entry point is a **supervisor split**: `dist/supervisor.js` owns stdio with Claude Code and holds ZERO network sockets, so `lsof -ti tcp:8081 | xargs kill -9` (a documented Metro-recovery step) can no longer kill the bridge. It spawns the real server (`dist/index.js --no-lock`) as a worker, caches the MCP `initialize` handshake, and on worker death errors out in-flight calls (`-32000`, "retry the call"), respawns (max 3 per rolling 60 s, then a terminal crash-loop error naming the worker's last exit), and replays the handshake. The single-instance `Lockfile` + parent-death watch live in the supervisor; the worker keeps the UDID device lock. In-memory state (arbiter lease, ring buffers, CDP connection) is rebuilt on respawn by design. `cdp_status` → `bridge: { supervised, workerRestarts, lastWorkerExit }`. Escape hatch: `RN_BRIDGE_SUPERVISOR=0` runs the legacy single process. `SIGUSR2` to the supervisor = real hot-reload (worker restart + handshake replay).
+
 **76 tools** exposed via MCP (re-audited 2026-05-31; counted from `trackedTool()` calls in `scripts/cdp-bridge/src/index.ts`). Five conceptual families:
 
 **CDP tools** — React internals via Chrome DevTools Protocol over WebSocket:
 
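
The in-flight failure path from the supervisor-split paragraph above (every call the worker has not answered when it dies gets a `-32000` error telling the client to retry) can be sketched as a set of forwarded request ids. The names `InFlightTracker`, `forwarded`, `answered`, and `failAll` are hypothetical; the grounded parts are the JSON-RPC 2.0 error-response shape and the `-32000` code, which sits in the range JSON-RPC reserves for implementation-defined server errors.

```typescript
// Hypothetical sketch: tracking requests the supervisor forwarded to the worker so
// they can be failed with -32000 if the worker dies. Names are illustrative.
type JsonRpcId = string | number;

interface JsonRpcErrorResponse {
  jsonrpc: "2.0";
  id: JsonRpcId;
  error: { code: number; message: string };
}

const WORKER_RESTARTED = -32000; // inside JSON-RPC's implementation-defined server-error range

class InFlightTracker {
  private readonly pending = new Set<JsonRpcId>();

  /** Record a client request (has both `id` and `method`) forwarded to the worker. */
  forwarded(id: JsonRpcId): void {
    this.pending.add(id);
  }

  /** Clear a request once the worker's response for that `id` has been relayed. */
  answered(id: JsonRpcId): void {
    this.pending.delete(id);
  }

  /** On worker exit: one error response per unanswered request, then forget them all. */
  failAll(reason: string): JsonRpcErrorResponse[] {
    const errors = Array.from(
      this.pending,
      (id): JsonRpcErrorResponse => ({
        jsonrpc: "2.0",
        id,
        error: { code: WORKER_RESTARTED, message: `${reason}; retry the call` },
      }),
    );
    this.pending.clear();
    return errors;
  }
}

const tracker = new InFlightTracker();
tracker.forwarded(1);
tracker.forwarded(2);
tracker.answered(1);
// prints a single -32000 error response for id 2
console.log(JSON.stringify(tracker.failAll("worker restarted")));
```

Notifications carry no `id` and expect no reply, so a supervisor following this sketch would only track messages that have both an `id` and a `method`.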
@@ -96,6 +96,8 @@ One mechanism per capability tier — **L1 + L2 coexist** (drive with XCTest, as
 
 The MCP server is a Node.js process that maintains a persistent WebSocket connection to the React Native app's Hermes engine through Metro's CDP endpoint.
 
+Since #264 the entry point is a **supervisor split**: a thin stdio shim (`dist/supervisor.js`) that holds zero network sockets owns the MCP connection with Claude Code and spawns the real bridge as a respawnable worker. Killing everything on Metro's port (`lsof -ti tcp:8081 | xargs kill -9` — a common recovery step) used to SIGKILL the whole server and cost the session every tool; now it only takes the worker, which the supervisor respawns (bounded: 3 per rolling 60 s), replaying the cached MCP `initialize` handshake so the session continues. In-flight calls at the moment of death fail fast with a "worker restarted — retry the call" error. Supervision state is visible in `cdp_status` → `bridge: { supervised, workerRestarts, lastWorkerExit }`; opt out with `RN_BRIDGE_SUPERVISOR=0`.
+
 **74 tools** across five families:
 - **CDP** — React internals via Chrome DevTools Protocol (component tree, store state, navigation, profiling, network)
 - **Device** — Native interaction (iOS: rn-fast-runner, Android: agent-device)
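
The handshake replay described in the supervisor-split paragraph above can be sketched as follows. The supervisor remembers the client's first `initialize` request and resends it to the new worker under a private id. It swallows the worker's reply, because the client already holds one from the first worker. It then sends the `notifications/initialized` notification that completes the MCP handshake. This is a sketch under assumptions: `HandshakeCache`, the `supervisor-replay-` id prefix, and the `send` callback are hypothetical, not the bridge's code.

```typescript
// Hypothetical sketch of caching the client's MCP `initialize` request and replaying
// it to a respawned worker. HandshakeCache, the id prefix and `send` are illustrative.
interface JsonRpcMessage {
  jsonrpc: "2.0";
  id?: string | number;
  method?: string;
  params?: unknown;
  result?: unknown;
  error?: unknown;
}

type Send = (msg: JsonRpcMessage) => void;

class HandshakeCache {
  private initialize: JsonRpcMessage | null = null;
  private pendingReplayId: string | null = null;
  private replays = 0;

  /** Watch client-to-worker traffic and keep the first `initialize` request. */
  observeClient(msg: JsonRpcMessage): void {
    if (msg.method === "initialize" && this.initialize === null) {
      this.initialize = msg;
    }
  }

  /** After a respawn, resend `initialize` to the new worker under a private id. */
  replay(sendToWorker: Send): boolean {
    const init = this.initialize;
    if (init === null) return false; // client never initialized: nothing to replay
    this.replays += 1;
    const id = `supervisor-replay-${this.replays}`;
    this.pendingReplayId = id;
    sendToWorker({ ...init, id });
    return true;
  }

  /**
   * Returns true if `msg` (from the worker) answers the replayed `initialize`.
   * That reply is swallowed and `notifications/initialized` is sent to the worker.
   */
  consumeReplayReply(msg: JsonRpcMessage, sendToWorker: Send): boolean {
    if (this.pendingReplayId === null || msg.method !== undefined || msg.id !== this.pendingReplayId) {
      return false;
    }
    this.pendingReplayId = null;
    sendToWorker({ jsonrpc: "2.0", method: "notifications/initialized" });
    return true;
  }
}

const toWorker: JsonRpcMessage[] = [];
const send: Send = (m) => {
  toWorker.push(m);
};
const cache = new HandshakeCache();
cache.observeClient({ jsonrpc: "2.0", id: 0, method: "initialize", params: {} });
cache.replay(send); // the new worker receives initialize with id "supervisor-replay-1"
cache.consumeReplayReply({ jsonrpc: "2.0", id: "supervisor-replay-1", result: {} }, send);
console.log(toWorker.map((m) => m.method)); // logs initialize, then notifications/initialized
```

Waiting for the worker's reply before sending `notifications/initialized` keeps the replay in the same order a real MCP client follows; a string id prefix is used here on the assumption that it will not collide with the client's own request ids.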
 
@@ -27,6 +27,10 @@ The bridge auto-reconnects by default and evicts the visual React Native DevTool
 This is normal. `cdp_reload` automatically reconnects within 15 seconds. If it fails, call `cdp_status` to re-establish the connection.
 </Aside>
 
+<Aside type="tip" title="MCP server died when Metro was restarted">
+Fixed since #264: the bridge entry point is a stdio supervisor that holds no network sockets, so port-based kills (`lsof -ti tcp:8081 | xargs kill -9`) take down only the worker process. The supervisor respawns it automatically and the session keeps its tools (`cdp_status` → `bridge.workerRestarts`). If tools error with "worker is crash-looping", check the bridge log (the bridge writes one when `LOG_LEVEL=info` is set) and restart the Claude Code session. `RN_BRIDGE_SUPERVISOR=0` opts back into the legacy single-process bridge.
+</Aside>
+
 ## Store state issues
 
 <Aside type="tip" title="cdp_store_state error for Zustand">