
fix(ios-qa): tunnel cache survives auth rotate + drop redundant IPv4 listener#1756

Open
sternryan wants to merge 1 commit into
garrytan:mainfrom
sternryan:fix/ios-qa-tunnel-cache-and-ipv4

@sternryan

Contributor

Two paper-cuts surfaced once PR #1673 (tunnel resolution, merged in v1.44.1.0) unblocked live-device QA. Both verified today on iPhone 12 Pro / iOS 26.x against gstack v1.44.1.0 + this branch.

Issue 1 — Tunnel cache TTL invalidates after first auth rotate

Symptom: /ios-qa works against a real device for the first ~30s, then every request returns 503 device_not_connected forever. Repro is just "run ios-qa for longer than 30 seconds."

Root cause: getTunnel() in daemon/src/index.ts caches the tunnel for 30s and re-invokes tunnelProvider on the next call after the window. tunnelProvider → bootstrapTunnel reads the boot token from the app sandbox at tmp/gstack-ios-qa.token, but StateServer.swift.template's handleAuthRotate deletes that file on disk the first time the daemon rotates (try? FileManager.default.removeItem(atPath: bootTokenPath)). The rotated bearer only lives in the daemon's memory. Second bootstrap → boot_token_unavailable → null tunnel → 503 forever.

Fix: hold the tunnel for the daemon's lifetime; only drop it when the proxy reports the underlying CoreDevice route is genuinely dead (503 device_disconnected from ECONNREFUSED / EHOSTUNREACH in proxy.ts). The keepalive added in #1673 (startTunnelKeepalive) prevents route staleness in practice — this is the correct error-handling path for when it does fail.
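As a rough illustration of the fix described above, here is a minimal sketch of a lifetime-scoped tunnel cache with explicit invalidation. The `Tunnel` shape and the `createTunnelCache` factory are hypothetical names chosen for this example; only `getTunnel`, `invalidateTunnel`, and `tunnelProvider` come from the PR, and the actual code in daemon/src/index.ts may be structured differently.

```typescript
// Hypothetical types for illustration; the real daemon's tunnel shape may differ.
type Tunnel = { host: string; port: number };
type TunnelProvider = () => Promise<Tunnel | null>;

function createTunnelCache(tunnelProvider: TunnelProvider) {
  let cached: Tunnel | null = null;

  return {
    // Bootstrap once, then reuse the tunnel for the daemon's lifetime (no TTL).
    async getTunnel(): Promise<Tunnel | null> {
      if (cached === null) {
        // Assumption: a failed bootstrap (null) is not cached, so a later call retries.
        cached = await tunnelProvider();
      }
      return cached;
    },

    // Intended to be called only when the proxy reports 503 device_disconnected.
    invalidateTunnel(): void {
      cached = null;
    },
  };
}
```

With this shape, the boot token is read at most once per healthy tunnel, so deleting it on auth rotate no longer matters unless the route itself dies. The sketch does not deduplicate concurrent first calls; that is left out for brevity.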

Issue 2 — IPv4 listener always fails EADDRINUSE on iOS

Symptom: every StateServer.start() emits

LISTENER_FAILED family=ipv4 err=POSIXErrorCode(rawValue: 48): Address already in use

(confirmed via NSLog diagnostics added to the listener on iPhone 12 Pro / iOS 26.x today).

Root cause: NWListener(using: NWParameters.tcp, on: port) on iOS binds a dual-stack socket — IN6ADDR_ANY without IPV6_V6ONLY — so the ::1 listener already accepts IPv4 connections via 4-in-6 mapped addresses (e.g. ::ffff:127.0.0.1). The follow-up 127.0.0.1 bind on the same port can never succeed and reliably emits EADDRINUSE on every launch. It's harmless (the dual-stack listener serves everything) but spams the log every cold start.

Fix: drop the redundant IPv4 listener from templates/StateServer.swift.template. The AddressFamily enum is kept (single .ipv6 case) so the explicit dual-stack decision stays visible to future readers, with a comment block documenting the EADDRINUSE finding.
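For readers unfamiliar with the 4-in-6 mapped form mentioned in the root cause, the following sketch shows how a dotted-quad IPv4 address is written as an IPv4-mapped IPv6 address (the `::ffff:0:0/96` prefix from RFC 4291). The helper name `toMappedIPv6` is hypothetical and is not part of the PR; it only illustrates the address notation, not how NWListener binds sockets.

```typescript
// Convert a dotted-quad IPv4 address to its two common IPv4-mapped IPv6 spellings.
// Hypothetical helper for illustration only.
function toMappedIPv6(ipv4: string): { mixed: string; hex: string } {
  const parts = ipv4.split(".");
  const valid =
    parts.length === 4 &&
    parts.every((p) => /^\d{1,3}$/.test(p) && Number(p) <= 255);
  if (!valid) {
    throw new Error(`not a dotted-quad IPv4 address: ${ipv4}`);
  }
  const [a, b, c, d] = parts.map((p) => Number(p));
  // Pack the four octets into two 16-bit groups, printed in hex.
  const hi = ((a << 8) | b).toString(16);
  const lo = ((c << 8) | d).toString(16);
  return { mixed: `::ffff:${ipv4}`, hex: `::ffff:${hi}:${lo}` };
}

// Loopback as seen by a dual-stack listener:
// mixed is "::ffff:127.0.0.1", hex is "::ffff:7f00:1".
const loopback = toMappedIPv6("127.0.0.1");
```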

Test summary

$ bun test (ios-qa/daemon)
 84 pass
 0 fail
 1 error           ← pre-existing on main, unchanged by this PR
 229 expect() calls

The pre-existing error is daemon-integration.test.ts:191 calling afterEach without importing it from bun:test. It's unrelated to this PR (reproduces cleanly on origin/main at a6fb3172) — flagging it here so a future fix-wave can pick it up.

New tests in ios-qa/daemon/test/tunnel-cache.test.ts (2 tests):

  • tunnelProvider is called exactly once across 25 sequential requests (no TTL re-bootstrap).
  • On 503 device_disconnected, the cached tunnel is dropped and the next call re-invokes tunnelProvider.

Both use the existing tunnelProvider injection pattern; no new mocking framework.

Tested against

iPhone 12 Pro on iOS 26.x via Mac Mini M-series, gstack at v1.44.1.0 + this branch. Issue 1 reproduced before the fix (503s after 30s); resolved with the fix (continuous capture for ~10 minutes, no false-503s). Issue 2 confirmed via NSLog'd listener state callback before and after — the EADDRINUSE log line is gone post-fix.

Files changed

  • ios-qa/daemon/src/index.ts — drop TTL, add invalidateTunnel, wire into both handleLoopback and handleTailnet on device_disconnected. Added isDeviceDisconnected helper to safely parse the proxy body.
  • ios-qa/daemon/test/tunnel-cache.test.ts — new file, 2 regression tests via tunnelProvider injection.
  • ios-qa/templates/StateServer.swift.template — remove second listener, simplify AddressFamily enum to a single case, document the EADDRINUSE finding inline.
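The `isDeviceDisconnected` helper listed above is not shown in this conversation. One plausible shape, sketched under assumptions: the proxy's 503 body is JSON with an `error` field, and the helper takes the status and the raw body text. Both the signature and the body shape are guesses made for illustration and may not match the PR's code.

```typescript
// Assumed body shape: { "error": "device_disconnected" }. Not confirmed by the PR text.
function isDeviceDisconnected(status: number, body: string): boolean {
  if (status !== 503) return false;
  try {
    const parsed: unknown = JSON.parse(body);
    return (
      typeof parsed === "object" &&
      parsed !== null &&
      (parsed as { error?: unknown }).error === "device_disconnected"
    );
  } catch {
    // Non-JSON body: treat it as "not a disconnect" so the cached tunnel is kept.
    return false;
  }
}
```

Parsing defensively matters here because a false positive would drop a healthy tunnel and force a re-bootstrap, which is exactly the path that fails once the boot token has been deleted.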

Prerequisite: #1673 (tunnel resolution), already merged in v1.44.1.0.

…listener

Two paper-cuts that surfaced once PR garrytan#1673 (tunnel resolution, merged in
v1.44.1.0) unblocked live-device QA. Both verified on iPhone 12 Pro /
iOS 26.x against the gstack v1.44.1.0 base.

1) Tunnel cache TTL invalidated after first auth rotate
   --------------------------------------------------
   Symptom: /ios-qa works for the first ~30 seconds against a real
   device, then every request returns 503 device_not_connected forever.

   Root cause: getTunnel() in daemon/src/index.ts cached the tunnel for
   30s and re-invoked tunnelProvider on the next call past that window.
   tunnelProvider -> bootstrapTunnel reads the boot token from the app
   sandbox at tmp/gstack-ios-qa.token, but StateServer.swift.template's
   handleAuthRotate *deletes* that file on disk after the first rotate
   (the rotated bearer only lives in the daemon's memory). The second
   bootstrap therefore fails with boot_token_unavailable, returns null,
   and every subsequent proxy call short-circuits to 503.

   Fix: hold the tunnel for the daemon lifetime; only drop it when the
   proxy reports the CoreDevice route is genuinely dead (503
   device_disconnected from ECONNREFUSED / EHOSTUNREACH in proxy.ts).
   The keepalive added in garrytan#1673 (startTunnelKeepalive) prevents route
   staleness in practice; this is the correct error-handling path for
   when it does happen.

   New tests in daemon/test/tunnel-cache.test.ts (2 tests):
   - tunnelProvider invoked exactly once across 25 sequential requests.
   - On 503 device_disconnected, the next request re-invokes
     tunnelProvider (cache was dropped).

2) IPv4 listener always fails EADDRINUSE on iOS
   --------------------------------------------
   Symptom: every StateServer launch logs

     LISTENER_FAILED family=ipv4 err=POSIXErrorCode(rawValue: 48):
     Address already in use

   Root cause: NWListener(using: NWParameters.tcp, on: port) on iOS
   binds a dual-stack socket (IN6ADDR_ANY, no IPV6_V6ONLY). The ::1
   listener already accepts IPv4 connections via 4-in-6 mapped
   addresses. The follow-up 127.0.0.1 bind on the same port can never
   succeed -> EADDRINUSE every launch.

   Fix: drop the redundant ipv4Listener from
   templates/StateServer.swift.template. Keep the AddressFamily enum
   (single .ipv6 case) so the explicit dual-stack decision is visible
   to future readers, with a comment block at the top of the listener
   block documenting the EADDRINUSE finding.

Tested against:
  iPhone 12 Pro on iOS 26.x via Mac Mini M-series, gstack at v1.44.1.0
  + this branch.

Test summary:
  bun test (ios-qa/daemon): 84 pass, 0 fail.
  Pre-existing on main (unchanged by this PR): one runtime error from
  daemon-integration.test.ts calling afterEach without importing it
  from bun:test.

Prerequisite: garrytan#1673 (tunnel resolution), merged in v1.44.1.0.
@sternryan

Contributor Author

Downstream validation — iPhone 12 Pro on iOS 26.3.1

Checked out this PR locally, ported the StateServer.swift.template IPv4 listener removal into a real SwiftUI app's DebugBridge (Principal's Ear), redeployed to a physical iPhone 12 Pro on iOS 26.3.1, and ran the daemon (gstack-ios-qa-daemon built from this PR) against the device. Both fixes work as designed.

Fix 1 — Tunnel cache survives auth rotate ✅

Bootstrap-then-sustain test: one bootstrap request, then probes at 21s / 46s / 66s elapsed (the 46s and 66s probes would deterministically fail on main because the 30s TTL forces a re-bootstrap that hits boot_token_unavailable).

=== T0 bootstrap (11:49:39) ===
keys: {'isCapturing': False, 'lastError': None, ...}   ← success

=== T1 +21s (11:50:00, pre-TTL) ===
keys: {...}   ← success

=== T2 +46s (11:50:25, PAST 30s TTL — would FAIL on main) ===
keys: {...}   ← success

=== T3 +66s (11:50:45) ===
keys: {...}   ← success

=== daemon stderr ===
no bootstrap errors

I hit this exact regression in my own QA session before this PR existed — first 30s of agent-driven work succeeded, then everything died with device_not_connected and the daemon log showed bootstrap error: boot_token_unavailable — couldn't read tmp/gstack-ios-qa.token. With this PR's getTunnel() change (drop TTL, only invalidateTunnel() on 503 from proxy), that failure mode is gone. Single bootstrap holds for the daemon lifetime.
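To make the probe timing concrete, here is a small simulation of the pre-fix behaviour: a 30s TTL cache in front of a bootstrap that can only succeed while the boot token is still on disk. Everything in it (`simulateTtlCache`, the boolean model of the tunnel, the fake clock) is a simplified model written for this example, not code from the daemon.

```typescript
// Simplified model of the pre-fix failure mode; names and shapes are illustrative.
const TTL_SECONDS = 30;

function simulateTtlCache(probeTimes: number[]): boolean[] {
  let bootTokenOnDisk = true; // the app deletes the token file on first auth rotate
  let cachedAt: number | null = null;
  let tunnelUp = false;

  const bootstrap = (): boolean => {
    if (!bootTokenOnDisk) return false; // models boot_token_unavailable
    bootTokenOnDisk = false; // first successful bootstrap triggers the rotate
    return true;
  };

  return probeTimes.map((t) => {
    const expired = cachedAt === null || t - cachedAt >= TTL_SECONDS;
    if (expired) {
      tunnelUp = bootstrap();
      cachedAt = t;
    }
    return tunnelUp;
  });
}

// Probes at the same offsets as T0..T3 above (seconds since bootstrap).
console.log(simulateTtlCache([0, 21, 46, 66]));
```

In this model the probes at 0s and 21s succeed and the probes at 46s and 66s fail, which matches the behaviour described for main. Removing the TTL check, as the PR does, leaves every probe on the first bootstrap.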

Fix 2 — IPv4 listener noise ✅

Pre-fix console log on every app launch:

[gstack-ios-qa] START token=... port=9999 build=0.1.0
[gstack-ios-qa] LISTENER_READY family=ipv6 port=9999
[gstack-ios-qa] LISTENER_FAILED family=ipv4 err=POSIXErrorCode(rawValue: 48): Address already in use

Post-fix console log on every app launch:

[gstack-ios-qa] START token=E4359AAF-0CE5-4A2A-AD5E-414BC7FF02C8 port=9999 build=0.1.0
[gstack-ios-qa] LISTENER_READY family=ipv6 port=9999

Confirmed end-to-end functionality through the dual-stack ::1 listener: full agent QA loop runs (screenshot, elements walk, session acquire, tap, state read) — IPv4 callers reach the listener via 4-in-6 mapped addresses as the PR comment explains. No functional regression from dropping the redundant bind.

Notes

  • Tested against: iPhone 12 Pro, iOS 26.3.1, Xcode 26.0.1, Swift 6.2.
  • Build: pr-1756-cache checkout on ~/.claude/skills/gstack (commit 8663d236); template change ported to local Apps/iOS/DebugBridge/StateServer.swift (the test app's analogous file).
  • Daemon validates against the live device's CoreDevice tunnel using the devicectl info details resolver path from PR #1673 (fix(ios-qa): resolve CoreDevice tunnel via devicectl + keep tunnel alive). Stack works end-to-end.
  • No additional findings or concerns. Ready to merge from my end.
