Overhaul core components for production resilience and performance #227
This PR makes a Fast Fail Mechanism possible (it can be introduced in a complementary PR). Now that the relay can quickly and cheaply determine whether a twin is locally connected, it can reject incoming messages when it knows for certain that the destination twin is currently offline (disconnected). Instead of blindly accepting the message and queuing it (giving the originating relay the false impression that the message was delivered), it can reject the federation request immediately. What happens when the relay refuses the message?
If the application (e.g. Dashboard) using the source twin retries with an alternate destination upon failure, or has a user waiting in front of it, it can benefit from the fast-fail mechanism: it gets improved performance and responsiveness instead of waiting for the message to time out.
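A minimal sketch of the fast-fail decision described above, assuming a hypothetical in-memory set of locally connected twin IDs. The names `LocalSessions`, `Admission`, and `admit` are illustrative and are not taken from the PR; the real relay may track sessions differently.

```rust
use std::collections::HashSet;

/// Outcome of the admission check for an incoming federation request.
#[derive(Debug, PartialEq, Eq)]
pub enum Admission {
    /// Destination twin is connected to this relay: queue for delivery.
    Accept,
    /// Destination twin is known to be offline: refuse immediately.
    RejectOffline,
}

/// Hypothetical registry of twins with a live connection to this relay.
#[derive(Default)]
pub struct LocalSessions {
    connected: HashSet<u32>,
}

impl LocalSessions {
    /// Record that a twin opened a connection.
    pub fn connect(&mut self, twin: u32) {
        self.connected.insert(twin);
    }

    /// Record that a twin's connection closed.
    pub fn disconnect(&mut self, twin: u32) {
        self.connected.remove(&twin);
    }

    /// Cheap membership check used to decide whether to accept a message.
    pub fn admit(&self, destination: u32) -> Admission {
        if self.connected.contains(&destination) {
            Admission::Accept
        } else {
            Admission::RejectOffline
        }
    }
}
```

A sender that receives `RejectOffline` can move on to an alternate destination right away instead of waiting for a timeout.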
I extended the PR to include the event listener rework.
Added a federation rework.
…imizations
- Switch JSON → bincode for performance
- Rewrite catch-up logic for speed and resilience
- Optimize Redis command generation in relay worker
Rationale:
- Lower serialization overhead and improve end-to-end throughput

…tion
- Adjust APIs and fix integration issues introduced by upgrades
Rationale:
- Keep stack current; performance and maintenance benefits

- events: timeout(12s) for OnlineClient::from_url; capped backoff; clearer catch-up loop
- federation: RETRY_THROTTLE_SECS 15→2; CLAIM_MIN_IDLE_MS 30_000→3_000; XPENDING 100→200; XREAD 10→100; BLOCK 1000→500
- router: reqwest connect_timeout 2s; per-request timeout 3s
- relay: use serve_connection_with_upgrades; adjust timeouts; better logging
Rationale:
- Tighter bounds on latency and faster recovery from failures
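The commit mentions a capped backoff without giving its parameters. One common shape is exponential growth with an upper bound, sketched below. The function name `capped_backoff`, the doubling factor, and any base or cap values passed to it are assumptions, not values from the PR.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): `base * 2^attempt`,
/// never exceeding `cap`.
pub fn capped_backoff(attempt: u32, base: Duration, cap: Duration) -> Duration {
    // 2^attempt; falls back to u32::MAX when the shift amount is 32 or more.
    let factor = 1u32.checked_shl(attempt).unwrap_or(u32::MAX);
    // If the multiplication overflows Duration, the cap applies as well.
    base.checked_mul(factor).map_or(cap, |d| d.min(cap))
}
```

With a base of 500 ms and a cap of 30 s (both illustrative), the delays run 500 ms, 1 s, 2 s, 4 s and so on until they reach 30 s and stay there.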

- Implement PartialOrd/Ord for SessionID using (TwinID, Option<&str>)
- Replace to_string-based sort with sort_unstable
Rationale:
- Fewer allocations and faster set rebuilds with deterministic order
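The ordering change above can be sketched as follows. The field layout of `SessionID` here (a `u32` twin ID plus an optional session name) is an assumption for illustration; only the comparison key `(TwinID, Option<&str>)` and the use of `sort_unstable` come from the commit message.

```rust
use std::cmp::Ordering;

/// Assumed shape: a numeric twin ID plus an optional session name.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SessionID {
    twin: u32,
    session: Option<String>,
}

impl SessionID {
    pub fn new(twin: u32, session: Option<&str>) -> Self {
        Self { twin, session: session.map(str::to_owned) }
    }

    /// Borrowed sort key; no String is allocated while comparing.
    fn key(&self) -> (u32, Option<&str>) {
        (self.twin, self.session.as_deref())
    }
}

impl Ord for SessionID {
    fn cmp(&self, other: &Self) -> Ordering {
        self.key().cmp(&other.key())
    }
}

impl PartialOrd for SessionID {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

/// Sorts in place without building a string per element.
pub fn sorted(mut ids: Vec<SessionID>) -> Vec<SessionID> {
    ids.sort_unstable();
    ids
}
```

Because `None` orders before `Some(_)`, a twin's unnamed session sorts ahead of its named sessions, and the result is the same on every run.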

- Add size/time-based ACK batching (size=128, flush=10ms)
- Final flush on shutdown; import Instant/Duration
Rationale:
- Cuts per-message XDEL round-trips; improves throughput under load
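A minimal sketch of size/time-based batching with the thresholds named above (128 IDs or 10 ms). The type `AckBatcher` and its methods are hypothetical, and the Redis call is left out: the caller would issue one XDEL per returned batch. Time is passed in explicitly so the behavior is easy to test.

```rust
use std::time::{Duration, Instant};

pub const BATCH_SIZE: usize = 128;
pub const FLUSH_INTERVAL: Duration = Duration::from_millis(10);

/// Collects message IDs and releases them in batches.
pub struct AckBatcher {
    pending: Vec<String>,
    /// When the oldest pending ID was queued; None while empty.
    oldest: Option<Instant>,
}

impl AckBatcher {
    pub fn new() -> Self {
        Self { pending: Vec::with_capacity(BATCH_SIZE), oldest: None }
    }

    /// Queue one ID; returns a full batch once BATCH_SIZE is reached.
    pub fn push(&mut self, id: String, now: Instant) -> Option<Vec<String>> {
        if self.oldest.is_none() {
            self.oldest = Some(now);
        }
        self.pending.push(id);
        if self.pending.len() >= BATCH_SIZE {
            Some(self.take())
        } else {
            None
        }
    }

    /// Time-based flush: returns the batch if the oldest ID has waited
    /// at least FLUSH_INTERVAL.
    pub fn poll(&mut self, now: Instant) -> Option<Vec<String>> {
        match self.oldest {
            Some(t) if now.duration_since(t) >= FLUSH_INTERVAL => Some(self.take()),
            _ => None,
        }
    }

    /// Final flush on shutdown: returns whatever is still pending.
    pub fn drain(&mut self) -> Vec<String> {
        self.take()
    }

    fn take(&mut self) -> Vec<String> {
        self.oldest = None;
        std::mem::take(&mut self.pending)
    }
}
```

The size limit bounds memory and command length under load, while the time limit bounds how long an acknowledged message can stay undeleted when traffic is light.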

- Replace naive pool sizing with explicit headroom
  - wiggle = max(10% of workers, 8); fed_size = wiggle * 2
  - blocking = workers + fed_size; ops_headroom = max(blocking/4, 16)
  - pool_size = blocking + ops_headroom; log pool breakdown
- Federation: PARALLEL_PER_CONSUMER = 4; buffer_unordered for claimed/new
- Import futures_util::StreamExt; tidy msg extraction and tests
Rationale:
- Prevent pool exhaustion from BLOCK’ed readers
- Reduce head-of-line blocking without request storms
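The sizing formulas above translate directly into arithmetic. In this sketch the struct `PoolPlan` and function `plan_pool` are hypothetical names, and rounding "10% of workers" down with integer division is an assumption.

```rust
/// Breakdown of the connection pool, mirroring the commit's formulas.
#[derive(Debug, PartialEq, Eq)]
pub struct PoolPlan {
    pub wiggle: usize,
    pub fed_size: usize,
    pub blocking: usize,
    pub ops_headroom: usize,
    pub pool_size: usize,
}

pub fn plan_pool(workers: usize) -> PoolPlan {
    // wiggle = max(10% of workers, 8)
    let wiggle = (workers / 10).max(8);
    // Connections set aside for federation.
    let fed_size = wiggle * 2;
    // Connections that may sit in blocking reads at the same time.
    let blocking = workers + fed_size;
    // Extra connections for short, non-blocking operations.
    let ops_headroom = (blocking / 4).max(16);
    PoolPlan {
        wiggle,
        fed_size,
        blocking,
        ops_headroom,
        pool_size: blocking + ops_headroom,
    }
}
```

For 100 workers this gives wiggle = 10, fed_size = 20, blocking = 120, ops_headroom = 30, and pool_size = 150. For 10 workers the minimums apply: wiggle = 8, fed_size = 16, blocking = 26, ops_headroom = 16, and pool_size = 42.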

- Replace Arc<Mutex<ClientWrapper>> with ArcSwap<Client> for lock-free reads
- Add per-attempt timeout (12s) and URL rotation in connect_with_timeouts
- Singleflight reconnect via reconnect_lock on RPC disconnects
- Deduplicate concurrent cache misses via shared futures (in_flight)
- Retry update_twin/get_twin/get_twin_with_account on ClientError::Rpc(_)
- Remove ClientWrapper and related code
Rationale:
- Avoids serialized client access and long-held locks across awaits
- Fails fast on slow/broken endpoints and prevents thundering herd
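The de-duplication of concurrent cache misses can be illustrated with a small singleflight sketch. The PR does this with shared futures in async code; the version below swaps in blocking threads and a `Condvar` so that it needs only the standard library. `SingleFlight`, `Flight`, and `get` are hypothetical names, and the sketch does not handle a loader that panics (waiters would block forever).

```rust
use std::collections::HashMap;
use std::sync::{Arc, Condvar, Mutex};

/// One in-progress load; waiters block on `done` until `result` is set.
struct Flight<V> {
    result: Mutex<Option<V>>,
    done: Condvar,
}

pub struct SingleFlight<V> {
    in_flight: Mutex<HashMap<u32, Arc<Flight<V>>>>,
}

impl<V: Clone> SingleFlight<V> {
    pub fn new() -> Self {
        Self { in_flight: Mutex::new(HashMap::new()) }
    }

    /// Among callers that overlap in time for the same key, only the first
    /// (the leader) runs `load`; the others wait and share its result.
    pub fn get(&self, key: u32, load: impl FnOnce() -> V) -> V {
        let (flight, leader) = {
            let mut map = self.in_flight.lock().unwrap();
            match map.get(&key).cloned() {
                Some(existing) => (existing, false),
                None => {
                    let fresh = Arc::new(Flight {
                        result: Mutex::new(None),
                        done: Condvar::new(),
                    });
                    map.insert(key, Arc::clone(&fresh));
                    (fresh, true)
                }
            }
        };
        if leader {
            // The map lock is not held while the (slow) load runs.
            let value = load();
            *flight.result.lock().unwrap() = Some(value.clone());
            // Later callers start a new flight instead of joining this one.
            self.in_flight.lock().unwrap().remove(&key);
            flight.done.notify_all();
            value
        } else {
            let mut slot = flight.result.lock().unwrap();
            while slot.is_none() {
                slot = flight.done.wait(slot).unwrap();
            }
            (*slot).clone().expect("result is set by the leader")
        }
    }
}
```

This is not a cache: once a load finishes, its entry is removed, so storing the value for later lookups is left to the caller.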
This update represents a major architectural overhaul of the RMB Relay, focusing on production-grade reliability, performance, and observability. It refactors core components to be self-healing, introduces significant performance optimizations for high-load scenarios, and adds comprehensive metrics for monitoring.
1. Architectural & Reliability Overhaul
- … arc-swap for lock-free, non-blocking client replacement during reconnects. It also uses a "singleflight" pattern to de-duplicate concurrent requests for the same twin during a cache miss, preventing cascading load on the Substrate node.
2. Performance & Efficiency
- XREAD Command Caching: The worker's core I/O loop has been optimized with a "fingerprint" caching strategy for the XREAD command. This avoids expensive command reconstruction on every loop, significantly reducing CPU usage and memory allocations.
- … serde_json to the more efficient bincode for serialization, reducing data size and CPU overhead.
- … fail-fast tag in the message envelope provides instant feedback to the sender if a remote destination is known to be offline, avoiding unnecessary queuing and timeouts.
3. Networking & Protocol Modernization
- … hyper 0.14 to the modern Hyper 1.0 API, along with its ecosystem crates like hyper-util.
- … reqwest client used for federation are now explicitly configured and tuned for HTTP/2, leveraging its performance benefits where available.
- … federation and relays fields in the Envelope protobuf message have been formally marked as deprecated, signaling the protocol's evolution towards relay discovery via the chain and cache.
4. Observability
- … closed, back_pressure_timeout, etc.).
5. Dependencies & Testing
Related Issues