fix(ios-qa): tunnel cache survives auth rotate + drop redundant IPv4 listener #1756
sternryan wants to merge 1 commit into
…listener

Two paper-cuts that surfaced once PR garrytan#1673 (tunnel resolution, merged in v1.44.1.0) unblocked live-device QA. Both verified on iPhone 12 Pro / iOS 26.x against the gstack v1.44.1.0 base.

1) Tunnel cache TTL invalidated after first auth rotate
--------------------------------------------------

Symptom: /ios-qa works for the first ~30 seconds against a real device, then every request returns 503 device_not_connected forever.

Root cause: getTunnel() in daemon/src/index.ts cached the tunnel for 30s and re-invoked tunnelProvider on the next call past that window. tunnelProvider -> bootstrapTunnel reads the boot token from the app sandbox at tmp/gstack-ios-qa.token, but StateServer.swift.template's handleAuthRotate *deletes* that file on disk after the first rotate (the rotated bearer only lives in the daemon's memory). The second bootstrap therefore fails with boot_token_unavailable, returns null, and every subsequent proxy call short-circuits to 503.

Fix: hold the tunnel for the daemon lifetime; only drop it when the proxy reports the CoreDevice route is genuinely dead (503 device_disconnected from ECONNREFUSED / EHOSTUNREACH in proxy.ts). The keepalive added in garrytan#1673 (startTunnelKeepalive) prevents route staleness in practice; this is the correct error-handling path for when it does happen.

New tests in daemon/test/tunnel-cache.test.ts (2 tests):
- tunnelProvider invoked exactly once across 25 sequential requests.
- On 503 device_disconnected, the next request re-invokes tunnelProvider (cache was dropped).

2) IPv4 listener always fails EADDRINUSE on iOS
--------------------------------------------

Symptom: every StateServer launch logs
  LISTENER_FAILED family=ipv4 err=POSIXErrorCode(rawValue: 48): Address already in use

Root cause: NWListener(using: NWParameters.tcp, on: port) on iOS binds a dual-stack socket (IN6ADDR_ANY, no IPV6_V6ONLY). The ::1 listener already accepts IPv4 connections via 4-in-6 mapped addresses. The follow-up 127.0.0.1 bind on the same port can never succeed -> EADDRINUSE every launch.

Fix: drop the redundant ipv4Listener from templates/StateServer.swift.template. Keep the AddressFamily enum (single .ipv6 case) so the explicit dual-stack decision is visible to future readers, with a comment block at the top of the listener block documenting the EADDRINUSE finding.

Tested against: iPhone 12 Pro on iOS 26.x via Mac Mini M-series, gstack at v1.44.1.0 + this branch.

Test summary: bun test (ios-qa/daemon): 84 pass, 0 fail. Pre-existing on main (unchanged by this PR): one runtime error from daemon-integration.test.ts importing afterEach without declaring it in the bun:test import statement.

Prerequisite: garrytan#1673 (tunnel resolution), merged in v1.44.1.0.
Downstream validation — iPhone 12 Pro on iOS 26.3.1

Checked out this PR locally, ported the

Fix 1 — Tunnel cache survives auth rotate ✅

Bootstrap-then-sustain test: one bootstrap request, then probes at 21s / 46s / 66s elapsed (the 46s and 66s probes would deterministically fail on

I hit this exact regression in my own QA session before this PR existed: the first 30s of agent-driven work succeeded, then everything died with

Fix 2 — IPv4 listener noise ✅

Pre-fix console log on every app launch:

Post-fix console log on every app launch:

Confirmed end-to-end functionality through the dual-stack ::1 listener: the full agent QA loop runs (screenshot, elements walk, session acquire, tap, state read), and IPv4 callers reach the listener via 4-in-6 mapped addresses as the PR comment explains. No functional regression from dropping the redundant bind.

Notes
Two paper-cuts surfaced once PR #1673 (tunnel resolution, merged in v1.44.1.0) unblocked live-device QA. Both verified today on iPhone 12 Pro / iOS 26.x against gstack v1.44.1.0 + this branch.
Issue 1 — Tunnel cache TTL invalidates after first auth rotate
Symptom: /ios-qa works against a real device for the first ~30s, then every request returns 503 device_not_connected forever. Repro is just "run ios-qa for longer than 30 seconds."

Root cause: getTunnel() in daemon/src/index.ts caches the tunnel for 30s and re-invokes tunnelProvider on the next call after the window. tunnelProvider → bootstrapTunnel reads the boot token from the app sandbox at tmp/gstack-ios-qa.token, but StateServer.swift.template's handleAuthRotate deletes that file on disk the first time the daemon rotates (try? FileManager.default.removeItem(atPath: bootTokenPath)). The rotated bearer only lives in the daemon's memory. Second bootstrap → boot_token_unavailable → null tunnel → 503 forever.

Fix: hold the tunnel for the daemon's lifetime; only drop it when the proxy reports the underlying CoreDevice route is genuinely dead (503 device_disconnected from ECONNREFUSED / EHOSTUNREACH in proxy.ts). The keepalive added in #1673 (startTunnelKeepalive) prevents route staleness in practice; this is the correct error-handling path for when it does fail.

Issue 2 — IPv4 listener always fails EADDRINUSE on iOS
Symptom: every StateServer.start() emits

LISTENER_FAILED family=ipv4 err=POSIXErrorCode(rawValue: 48): Address already in use

(confirmed via NSLog diagnostics added to the listener on iPhone 12 Pro / iOS 26.x today).

Root cause: NWListener(using: NWParameters.tcp, on: port) on iOS binds a dual-stack socket (IN6ADDR_ANY without IPV6_V6ONLY), so the ::1 listener already accepts IPv4 connections via 4-in-6 mapped addresses (e.g. ::ffff:127.0.0.1). The follow-up 127.0.0.1 bind on the same port can never succeed and reliably emits EADDRINUSE on every launch. It's harmless (the dual-stack listener serves everything) but spams the log on every cold start.

Fix: drop the redundant IPv4 listener from templates/StateServer.swift.template. The AddressFamily enum is kept (single .ipv6 case) so the explicit dual-stack decision stays visible to future readers, with a comment block documenting the EADDRINUSE finding.

Test summary
The pre-existing error is daemon-integration.test.ts:191 calling afterEach without importing it from bun:test. It's unrelated to this PR (reproduces cleanly on origin/main at a6fb3172); flagging it here so a future fix-wave can pick it up.

New tests in ios-qa/daemon/test/tunnel-cache.test.ts (2 tests):
- tunnelProvider is called exactly once across 25 sequential requests (no TTL re-bootstrap).
- On 503 device_disconnected, the cached tunnel is dropped and the next call re-invokes tunnelProvider.

Both use the existing tunnelProvider injection pattern; no new mocking framework.

Tested against
iPhone 12 Pro on iOS 26.x via Mac Mini M-series, gstack at v1.44.1.0 + this branch. Issue 1 reproduced before the fix (503s after 30s); resolved with the fix (continuous capture for ~10 minutes, no false-503s). Issue 2 confirmed via NSLog'd listener state callback before and after — the EADDRINUSE log line is gone post-fix.
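For readers who want the shape of the Issue 1 fix without opening the diff, here is a minimal sketch of a tunnel cache with no TTL that is only dropped on explicit invalidation. It is not the code from daemon/src/index.ts: the Tunnel type, the createTunnelCache factory, and the provider signature are assumptions made for illustration. Only the names getTunnel, invalidateTunnel, and tunnelProvider come from this PR.

```typescript
// Hypothetical shapes: the real Tunnel type and provider signature in
// daemon/src/index.ts are not shown in this PR description.
type Tunnel = { host: string; port: number };
type TunnelProvider = () => Promise<Tunnel | null>;

// createTunnelCache is an illustrative factory, not a function from the PR.
function createTunnelCache(tunnelProvider: TunnelProvider) {
  let cached: Tunnel | null = null;

  return {
    // Bootstrap on first use, then reuse the same tunnel with no TTL.
    // A null result is not cached, so a failed bootstrap is retried
    // on the next call.
    async getTunnel(): Promise<Tunnel | null> {
      if (cached === null) {
        cached = await tunnelProvider();
      }
      return cached;
    },

    // Intended to be called only when the proxy reports that the route
    // is dead (503 device_disconnected); the next getTunnel() call then
    // re-invokes tunnelProvider.
    invalidateTunnel(): void {
      cached = null;
    },
  };
}
```

With this shape, 25 sequential getTunnel() calls invoke the provider once, and a call after invalidateTunnel() invokes it a second time, which mirrors the two regression tests listed above. The sketch assumes sequential callers; concurrent first calls could each trigger a bootstrap unless the pending promise is cached instead.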
Files changed
- ios-qa/daemon/src/index.ts: drop TTL, add invalidateTunnel, wire into both handleLoopback and handleTailnet on device_disconnected. Added isDeviceDisconnected helper to safely parse the proxy body.
- ios-qa/daemon/test/tunnel-cache.test.ts: new file, 2 regression tests via tunnelProvider injection.
- ios-qa/templates/StateServer.swift.template: remove second listener, simplify AddressFamily enum to a single case, document the EADDRINUSE finding inline.

Prerequisite: #1673 (tunnel resolution), already merged in v1.44.1.0.
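The file list mentions an isDeviceDisconnected helper that "safely" parses the proxy body. One plausible shape for such a helper is sketched below. The signature and the JSON field name (`error`) are assumptions, since the PR description does not show the body format that proxy.ts emits.

```typescript
// Assumed error body shape: { "error": "device_disconnected" }.
// The field name and this signature are hypothetical; the PR text only
// names the helper and the 503 device_disconnected condition.
function isDeviceDisconnected(status: number, body: string): boolean {
  // Only a 503 can signal a dead CoreDevice route.
  if (status !== 503) return false;
  try {
    const parsed: unknown = JSON.parse(body);
    return (
      typeof parsed === "object" &&
      parsed !== null &&
      (parsed as { error?: unknown }).error === "device_disconnected"
    );
  } catch {
    // Body was not valid JSON: do not drop the tunnel on an
    // unrecognised response.
    return false;
  }
}
```

The conservative default matters here: because re-bootstrapping fails once the boot token has been deleted, a helper like this should return true only for the one response that really means the route is gone, and false for anything it cannot parse.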