Dual-backend 1/3 — RNS core + protocol seam#79
Conversation
Path-scoped slice of feat/dual-backend-arch for review (Greptile <=100 files): RNSAPI seam + Compat, RNSBackendPy/Swift, PythonBridge, app/*.py Python RNS sources, xcodeproj, build/support scripts, CI/docs. Stacked: PR2 (app services) and PR3 (UI) build on this. Not independently buildable. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Greptile SummaryThis PR introduces the RNS protocol seam — the
Confidence Score: 3/5The CI workflow clones sibling packages without branch/commit pinning, meaning it can silently test against different library versions than Package.resolved specifies — including the wrong branch for LXST-swift. The CI clone step for LXST-swift omits -b feat/transport-agnostic, so xcodebuild on CI resolves a different branch than what Package.resolved pins. Prior review rounds surfaced a cluster of concurrency bugs many of which the current code addresses; remaining open items are the primary obstacle to production-readiness. .github/workflows/tests.yml (clone pinning), Sources/RNSBackendSwift/SwiftRNSBackend.swift and Sources/PythonBridge/PythonBridge.swift (open items from prior rounds), app/rns_bridge.py (same). Important Files Changed
Sequence DiagramsequenceDiagram
participant UI as SwiftUI / AppServices
participant B as RnsBackend (protocol)
participant Py as PythonRNSBackend
participant Br as PythonBridge (queue / blockingQueue)
participant Py2 as rns_bridge.py (CPython)
participant Sw as SwiftRNSBackend
participant RS as reticulum-swift / LXMFSwift
UI->>B: start(params)
B->>Py: start(params)
Py->>Br: bridge.start(...) [serial queue]
Br->>Py2: PyObject_CallObject(start)
Py2-->>Br: identity_hash, destination_hash
Br-->>Py: LocalInfo
Py-->>UI: LocalInfo
loop Drain loop 200ms
Py->>Br: bridge.drainEvents() [serial queue]
Br->>Py2: drain_events()
Py2-->>Br: event dicts
Br-->>Py: PythonBridge.Event list
Py->>UI: eventContinuation.yield(BackendEvent)
end
UI->>B: propagationSync(timeout)
B->>Py: propagationSync
Py->>Br: bridge.propagationSync [blockingQueue]
Br->>Py2: propagation_sync() blocks up to timeout
Py2-->>Br: PropagationSyncResult dict
Br-->>Py: PropagationSyncResult
Py-->>UI: PropagationSyncResult
Note over Sw,RS: Swift backend path
UI->>B: start(params)
B->>Sw: start(params)
Sw->>RS: ReticulumTransport + LXMRouter init
RS-->>Sw: transport / router
Sw->>Sw: startAnnouncePolling()
Sw-->>UI: LocalInfo
RS->>Sw: RouterDelegate.didReceiveMessage
Sw->>UI: eventContinuation.yield(.inbound)
|
- fetch-wheels.sh: require explicit BLE_RETICULUM_LOCAL env-var for a local ble-reticulum checkout instead of implicitly picking up ~/repos/ble-reticulum (build reproducibility — matches RETICULUM_LOCAL / LXMF_LOCAL). - PythonBridge.swift: replace force-unwraps on CPython allocation calls (PyTuple_New / PyUnicode_FromString / PyDict_New / PyLong) with guard-let throwing / return-false, matching start()'s safe pattern — NULL under iOS memory pressure no longer crashes the process. - rns_bridge.py: drop the dead `terminal` set in the propagation-sync poll loop. - fetch-wheels.sh: correct the stale cryptography `dev1` comment. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
@greptile review |
…ter 2) reset_identity called exit_handler() but, unlike stop(), never cleared RNS.Reticulum._Reticulum__instance (+ the exit-handler flags). Its docstring requires the caller to start() again afterwards, but that follow-on start() hit __init__'s 'Attempt to reinitialise Reticulum' guard, leaving the app unable to restart without a process relaunch. Mirror stop()'s cleanup. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> (cherry picked from commit 1a17901)
|
@greptile review |
…tive (greploop iter 3) - reset_identity now also calls router.exit_handler() and tears down open _links / clears _telephony_destination (mirrors stop()), not just the Reticulum singleton — otherwise a follow-on start() spawns a 2nd LXMRouter racing the same lxmf-storage SQLite. - open-link path: move link.identify() to AFTER link_ready.wait() — identify sends an encrypted proof that a PENDING link can't, so identifying first silently no-op'd and the remote never learned our identity (breaks nomadnet stateful pages). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
@greptile review |
…ter 4) _on_closed pops _links[link_id]; if the link closed before the entry existed the pop missed and open_link then inserted a permanently-zombie entry. Add to _links first, then wire callbacks — matching the inbound (_on_inbound_link) path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
@greptile review |
SwiftRNSBackend.propagationSync hardcoded receivedMessages: 0, so the UI's "N new messages" after a propagation-node sync always showed 0 even when messages were pulled down. Read router.syncState.receivedMessages (the router increments it per retrieved message) after the await instead. Verified end-to-end on the interop harness (red→green): a peer queues a PROPAGATED message on lxmd; with this fix the sync reports received>=1, and reverting to the hardcoded 0 makes the same test fail at received=0. The interop regression test lives on the integration branch (it needs the full Sideband-peer + simulator harness, not present on this branch). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
PythonBridge routes every call through one serial DispatchQueue. propagationSync blocks up to its timeout (default 60s) polling the router, so on that shared queue it stalled every other bridge call — announce, send, setPropagationNode — for the full duration. Move propagationSync onto a dedicated blockingQueue. The underlying Python poll releases the GIL between iterations (time.sleep), so short calls on the main queue acquire the GIL and run within ~0.5s while a sync is in flight. Two queues into CPython is safe — withGIL (PyGILState_Ensure/Release) serializes all Python access regardless of thread, the same pattern the BLE callback path already uses. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* PythonBridge.fetchNomadNetPage ran on the main serial queue, so a 30s NomadNet fetch stalled drainEvents + every other bridge call — the same starvation propagationSync was just fixed for. Route it to blockingQueue too. * _alloc_link_id incremented a counter with a non-atomic read-modify-write while being called from both the bridge thread and RNS callback threads, so two links could get the same id and overwrite each other in _links. Guard it with a dedicated lock (not the shared _lock — some callers already hold that, which would deadlock). * SwiftRNSBackend.stop() called eventContinuation.finish(), permanently terminating the AsyncStream; a later start() on the same instance dropped every event. Keep the continuation open across stop/start, matching PythonRNSBackend. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…s in CI * SwiftRNSBackend.propagationSync ignored its `timeout` parameter — syncFromPropagationNode() has only per-request timeouts, no overall deadline, so a stalled link/response blocked the caller forever. Race the sync against the timeout in a task group; the loser is cancelled. * support/fetch-wheels.sh pulled ble-reticulum from a bare repo URL (HEAD), making builds non-reproducible. Pin to a commit (origin/main@07d9413; the local checkout is 49 commits past v0.2.2, so a tag would regress). * .github/workflows/tests.yml built the Python-backend scheme without ever fetching Python.xcframework + the wheels, so the copy build-phase had nothing to copy. Add a fetch-python.sh + fetch-wheels.sh step before the builds. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
@greptile review |
* PythonRNSBackend.start() relied on the one-shot lazy `events` initializer to spin up the drain loop, but stop() cancels eventDrainTask — so after any stop/start (identity switch, reconnect) the loop never restarted and no Python events reached the host. Call startDrainLoop() in start() (no-ops if already running). * SwiftRNSBackend's per-link stateUpdates drain Task was fire-and-forget and untracked, so stop() (which only cleared `links`) left them running, holding the kept-open eventContinuation and yielding stale linkState events into the next session. Track them in a [Int: Task] and cancel on linkTeardown + stop(). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
@greptile review |
reset_identity cleared the Reticulum singleton + _links but left RNS.Transport's process-global class state (destinations, interfaces, path tables, announce handlers, identities) populated. Its docstring requires the caller to call start() next — but that start() would raise "Attempt to register an already registered destination" (Transport.register_destination scans the stale Transport.destinations list), recoverable only by a process restart. stop() already did this teardown inline; reset_identity diverged and omitted it. Extract the shared block into _clear_transport_class_state() and call it from both, so the two teardown paths can't drift again. Note: resetIdentity is currently dead code (no callers in the app — identity switch goes through switchIdentity), so this hardens a latent bug rather than a live crash; grounded by mirroring the proven stop() path. Verified: py_compile clean; behaviour-preserving refactor of the live stop() teardown. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Both stop() and reset_identity() iterated _links and called link.teardown() while holding _lock. link.teardown() can fire the _on_closed callback synchronously on the calling thread (once exit_handler() has stopped the transport's dispatch threads), and _on_closed does `with _lock: _links.pop(...)`. threading.Lock is non-reentrant, so that re-acquire would deadlock the bridge queue and the user couldn't disconnect/restart without killing the process. Snapshot _links + clear the dict inside the lock, then tear the snapshot down outside it — the same in-lock-snapshot / out-of-lock-side-effect ordering used elsewhere. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
On the flagged |
|
@greptile review |
…data (#79) Two greptile-flagged defects on pr1: - rns_bridge.py: stop() cleared BLE callbacks but not RNode; reset_identity() cleared neither. Stale closures from a torn-down IOSRNodeInterface/IOSBLE driver survived a stop+start / identity-reset. Add clear_rnode_callbacks() to stop() and clear_ble_callbacks()+clear_rnode_callbacks()+handle-nil to reset_identity(), symmetric with the existing BLE cleanup. - SwiftRNSBackend.announce: the delivery-destination announce set appData to raw UTF-8; canonical LXMF (>=0.5.0) is msgpack([name_utf8, stamp_cost]). Emit that (stamp_cost nil, matching the Python backend's register_delivery_identity(display_name=…)) so Sideband/Android decode the name via the msgpack path and read the (absent) stamp cost, instead of the legacy raw-utf8 fallback. Telephony announce keeps its raw LXST format. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
@greptile review |
- reset_identity(): include telephony_destination in the _state.update so it matches stop(); previously _state['telephony_destination'] kept a dead RNS.Destination after a reset, diverging from stop()'s teardown. - Snapshot _state['identity'] under _lock before link.identify(): a concurrent reset_identity()/stop() could null it between the None-check and the call, silently no-op'ing identify(None). Mirrors link_identify(). - Cancel the delayed re-announce daemon across teardown/restart via an _announce_generation token: start() captures the current generation; stop()/reset_identity() bump it; the thread bails before announcing a torn-down (or previous-session) destination. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
openLink is async with no actor isolation, so concurrent callers could interleave the nextLinkId read-increment and the links[linkId] write: two callers grab the same id and the second write orphans the first link (never tracked in links, never cancelled by linkTeardown/stop). Same class of bug the Python backend fixed with _link_id_lock. Guard nextLinkId, links, and linkStateTasks with an NSLock at every access (openLink/linkSend/ linkIdentify/linkTeardown/stop), never held across an await — keeping the backend a class so linkSend stays actor-hop-free on the audio path. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The link registration and its stateUpdates drain task were stored in two separate linkLock sections split by an await. A stop() racing that window cleared linkStateTasks (not yet holding the new task) and links, then openLink re-inserted the task into the emptied map — orphaning it (link actor kept alive, stale events into the next session, linkId absent from links so linkTeardown couldn't reach it). Build the Task and store both writes inside one critical section; the constructor is synchronous so it's lock-safe, and setPacketCallback's await now follows the unlock. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- interfaceIds had no synchronization while links/linkStateTasks did: statusSnapshot() read it (.first(where:)) concurrently with start()'s bring-up loop and addInterface/removeInterface writes from other tasks. Swift Dictionary read+write is UB (release-build crash). Extend linkLock to every interfaceIds access; statusSnapshot takes a COW snapshot under the lock so the map stays lock-free. - startAnnouncePolling() overwrote announcePoller without cancelling it. start() has no idempotency guard, so a second start() without stop() leaked the old poller (still polling its captured, now-orphaned PathTable) and double-yielded every announce. Cancel before reassigning. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Part 1 of 3 of a path-scoped split of
feat/dual-backend-arch, sliced to stay under the 100-file Greptile review limit. The whole feature is the Python-RNS migration + theRnsBackendprotocol seam (embedded-Python and native-Swift backends) + a runtime engine toggle. Net diff vsmainis 190 files; this slice is 44.This slice — RNS core + protocol seam
RNSAPI(seam + Compat),RNSBackendPy/RNSBackendSwift,PythonBridge, theapp/*.pyPython RNS sources,Columba.xcodeproj, build /supportscripts, CI + docs.Stacked series (merge in order)
mainpr1-rns-core(67 files)pr2-app-services(79 files)The three slices' union is byte-identical to
feat/dual-backend-arch. They are chunks of one feature, not independent units: merge 1 → 2 → 3, and an individual slice is not independently buildable (fine for static review; not for per-PR CI).Deferred (carried here, intentionally not applied)
main's newer interop test files are present on disk but not yet wired into theColumbaAppTeststarget.🤖 Generated with Claude Code