Skip to content

Overhaul core components for production resilience and performance#227

Merged
sameh-farouk merged 20 commits into
mainfrom
main-feat-optimize-message-routing
Sep 21, 2025
Merged

Overhaul core components for production resilience and performance#227
sameh-farouk merged 20 commits into
mainfrom
main-feat-optimize-message-routing

Conversation

@sameh-farouk
Copy link
Copy Markdown
Member

@sameh-farouk sameh-farouk commented Aug 5, 2025

This update represents a major architectural overhaul of the RMB Relay, focusing on production-grade reliability, performance, and observability. It refactors core components to be self-healing, introduces significant performance optimizations for high-load scenarios, and adds comprehensive metrics for monitoring.

1. Architectural & Reliability Overhaul

  • Federation on Redis Streams The inter-relay federation mechanism has been completely rebuilt on top of Redis Streams and consumer groups. This replaces the old, less reliable list-based queue, providing at-least-once delivery semantics and automatic recovery of failed messages via a Pending Entries List (PEL).
  • Resilient Substrate Client & Event Listener Components interacting with the TFChain are now highly resilient to network failures.
    • The Event Listener is now a self-healing service that detects stalled connections, performs reconnections with a capped exponential backoff, and intelligently catches up on missed blocks without requiring a full cache flush for minor disconnects.
    • The Substrate Client now uses arc-swap for lock-free, non-blocking client replacement during reconnects. It also uses a "singleflight" pattern to de-duplicate concurrent requests for the same twin during a cache miss, preventing cascading load on the Substrate node.

2. Performance & Efficiency

  • Local Peer Circuit-Breaker The relay now performs a fast, local, in-memory check to see if a message's destination is a locally connected peer. This bypasses expensive twin lookups and federation logic, significantly speeding up local traffic.
  • Fair Worker Load Balancing The distribution of new connections to workers is now fair and balanced. Workers accept one connection at a time and yield to the scheduler, preventing a single worker from starving others during high connection bursts.
  • Batched Message ACKs WebSocket message acknowledgments are now batched before being sent to Redis. This drastically reduces Redis round-trips under load, improving overall throughput.
  • Optimized XREAD Command Caching The worker's core I/O loop has been optimized with a "fingerprint" caching strategy for the XREAD command. This avoids expensive command reconstruction on every loop, significantly reducing CPU usage and memory allocations.
  • Binary Serialization for Cache The Redis cache has switched from serde_json to the more efficient bincode for serialization, reducing data size and CPU overhead.
  • Federation "Fail-Fast" Support A new fail-fast tag in the message envelope provides instant feedback to the sender if a remote destination is known to be offline, avoiding unnecessary queuing and timeouts.

3. Networking & Protocol Modernization

  • Hyper 1.0 Stack The entire HTTP stack has been migrated from hyper 0.14 to the modern Hyper 1.0 API, along with its ecosystem crates like hyper-util.
  • HTTP/2 Federation Both the main relay server and the reqwest client used for federation are now explicitly configured and tuned for HTTP/2, leveraging its performance benefits where available.
  • Protocol Field Deprecation The federation and relays fields in the Envelope protobuf message have been formally marked as deprecated, signaling the protocol's evolution towards relay discovery via the chain and cache.

4. Observability

  • Comprehensive Prometheus Metrics New metrics have been added across the application for deeper insight into the relay's health and performance:
    • Event Listener: Tracks reconnect cycles, last processed block number, and current reconnecting status.
    • Cache: Exposes hits, misses, total entries, and flush counts.
    • Relay Switch: A new counter tracks session evictions, labeled by reason (closed, back_pressure_timeout, etc.).

5. Dependencies & Testing

  • Updated Dependencies Dozens of dependencies across the project have been updated to their latest versions for security and performance improvements.
  • Expanded Testing The test suite has been updated to reflect the new architecture, with new tests added to cover the Redis Streams federation logic and the resilient client patterns.

Related Issues

@sameh-farouk sameh-farouk marked this pull request as draft August 5, 2025 22:36
@sameh-farouk
Copy link
Copy Markdown
Member Author

sameh-farouk commented Aug 6, 2025

This PR opens the door to possibilities that weren't possible before, allowing Fast Fail Mechanism (can be introducd in complementary PR).

Now that the relay can quickly and cheaply determine if a twin is locally connected, it can reject incoming messages when it knows for certain that the destination twin is currently offline (disconnected). Instead of blindly accepting the message and queuing it (giving the false impression to the originating relay that the message was delivered) it can reject the federation request immediately.

What happens when the relay refuses the message?

  1. The originating relay can immediately retry delivery through other relays that the destination twin is known to be connected to. This ensures message delivery as long as the twin is connected to at least one of its configured relays (as opposed to requiring connection to all relays, as before).
  2. If the destination twin is completely offline (disconnected from all relays), the originating relay can quickly detect this and respond to the source twin that the destination is currently unreachable.
  3. This allows the relay to act as a “fast fail” path when desired—and this behavior can be made configurable.

If the application (e.g. Dashboard) using the source twin retries with an alternate destination upon failure, or used waiting in front of it, it can benefit from the fast-fail mechanism—resulting in improved performance and responsiveness, instead of waiting for the message to time out.

@sameh-farouk sameh-farouk marked this pull request as ready for review August 18, 2025 02:54
@sameh-farouk
Copy link
Copy Markdown
Member Author

sameh-farouk commented Aug 18, 2025

I extended the PR to include the event listener rework.

@sameh-farouk
Copy link
Copy Markdown
Member Author

sameh-farouk commented Aug 20, 2025

Added a federation rework.
The new federation represents a major architectural shift from using Redis Lists to Redis Streams, resulting in significant behavioral changes that make it fundamentally more reliable, scalable, and robust.

…imizations

- Switch JSON → bincode for performance
- Rewrite catch-up logic for speed and resilience
- Optimize Redis command generation in relay worker

Rationale:
- Lower serialization overhead and improve end-to-end throughput
…tion

- Adjust APIs and fix integration issues introduced by upgrades

Rationale:
- Keep stack current; performance and maintenance benefits
- events: timeout(12s) for OnlineClient::from_url; capped backoff; clearer catch-up loop
- federation: RETRY_THROTTLE_SECS 15→2; CLAIM_MIN_IDLE_MS 30_000→3_000; XPENDING 100→200; XREAD 10→100; BLOCK 1000→500
- router: reqwest connect_timeout 2s; per-request timeout 3s
- relay: use serve_connection_with_upgrades; adjust timeouts; better logging

Rationale:
- Tighter bounds on latency and faster recovery from failures
- Implement PartialOrd/Ord for SessionID using (TwinID, Option<&str>)
- Replace to_string-based sort with sort_unstable

Rationale:
- Fewer allocations and faster set rebuilds with deterministic order
- Add size/time-based ACK batching (size=128, flush=10ms)
- Final flush on shutdown; import Instant/Duration

Rationale:
- Cuts per-message XDEL round-trips; improves throughput under load
- Replace naive pool sizing with explicit headroom
  - wiggle = max(10% of workers, 8); fed_size = wiggle * 2
  - blocking = workers + fed_size; ops_headroom = max(blocking/4, 16)
  - pool_size = blocking + ops_headroom; log pool breakdown
- Federation: PARALLEL_PER_CONSUMER = 4; buffer_unordered for claimed/new
- Import futures_util::StreamExt; tidy msg extraction and tests

Rationale:
- Prevent pool exhaustion from BLOCK’ed readers
- Reduce head-of-line blocking without request storms
- Replace Arc<Mutex<ClientWrapper>> with ArcSwap<Client> for lock-free reads
- Add per-attempt timeout (12s) and URL rotation in connect_with_timeouts
- Singleflight reconnect via reconnect_lock on RPC disconnects
- Deduplicate concurrent cache misses via shared futures (in_flight)
- Retry update_twin/get_twin/get_twin_with_account on ClientError::Rpc(_)
- Remove ClientWrapper and related code

Rationale:
- Avoids serialized client access and long-held locks across awaits
- Fails fast on slow/broken endpoints and prevents thundering herd
@sameh-farouk sameh-farouk changed the title Optimize message routing Overhaul core components for production resilience and performance Aug 25, 2025
@sameh-farouk sameh-farouk merged commit 4a44655 into main Sep 21, 2025
1 check passed
@sameh-farouk sameh-farouk deleted the main-feat-optimize-message-routing branch September 21, 2025 15:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment