Overhaul core components for production resilience and performance by sameh-farouk · Pull Request #227 · threefoldtech/zos_rmb

sameh-farouk · 2025-08-05T22:36:13Z

This update represents a major architectural overhaul of the RMB Relay, focusing on production-grade reliability, performance, and observability. It refactors core components to be self-healing, introduces significant performance optimizations for high-load scenarios, and adds comprehensive metrics for monitoring.

1. Architectural & Reliability Overhaul

Federation on Redis Streams The inter-relay federation mechanism has been completely rebuilt on top of Redis Streams and consumer groups. This replaces the old, less reliable list-based queue, providing at-least-once delivery semantics and automatic recovery of failed messages via a Pending Entries List (PEL).
Resilient Substrate Client & Event Listener Components interacting with the TFChain are now highly resilient to network failures.
- The Event Listener is now a self-healing service that detects stalled connections, performs reconnections with a capped exponential backoff, and intelligently catches up on missed blocks without requiring a full cache flush for minor disconnects.
- The Substrate Client now uses arc-swap for lock-free, non-blocking client replacement during reconnects. It also uses a "singleflight" pattern to de-duplicate concurrent requests for the same twin during a cache miss, preventing cascading load on the Substrate node.

2. Performance & Efficiency

Local Peer Circuit-Breaker The relay now performs a fast, local, in-memory check to see if a message's destination is a locally connected peer. This bypasses expensive twin lookups and federation logic, significantly speeding up local traffic.
Fair Worker Load Balancing The distribution of new connections to workers is now fair and balanced. Workers accept one connection at a time and yield to the scheduler, preventing a single worker from starving others during high connection bursts.
Batched Message ACKs WebSocket message acknowledgments are now batched before being sent to Redis. This drastically reduces Redis round-trips under load, improving overall throughput.
Optimized XREAD Command Caching The worker's core I/O loop has been optimized with a "fingerprint" caching strategy for the XREAD command. This avoids expensive command reconstruction on every loop, significantly reducing CPU usage and memory allocations.
Binary Serialization for Cache The Redis cache has switched from serde_json to the more efficient bincode for serialization, reducing data size and CPU overhead.
Federation "Fail-Fast" Support A new fail-fast tag in the message envelope provides instant feedback to the sender if a remote destination is known to be offline, avoiding unnecessary queuing and timeouts.

3. Networking & Protocol Modernization

Hyper 1.0 Stack The entire HTTP stack has been migrated from hyper 0.14 to the modern Hyper 1.0 API, along with its ecosystem crates like hyper-util.
HTTP/2 Federation Both the main relay server and the reqwest client used for federation are now explicitly configured and tuned for HTTP/2, leveraging its performance benefits where available.
Protocol Field Deprecation The federation and relays fields in the Envelope protobuf message have been formally marked as deprecated, signaling the protocol's evolution towards relay discovery via the chain and cache.

4. Observability

Comprehensive Prometheus Metrics New metrics have been added across the application for deeper insight into the relay's health and performance:
- Event Listener: Tracks reconnect cycles, last processed block number, and current reconnecting status.
- Cache: Exposes hits, misses, total entries, and flush counts.
- Relay Switch: A new counter tracks session evictions, labeled by reason (closed, back_pressure_timeout, etc.).

5. Dependencies & Testing

Updated Dependencies Dozens of dependencies across the project have been updated to their latest versions for security and performance improvements.
Expanded Testing The test suite has been updated to reflect the new architecture, with new tests added to cover the Redis Streams federation logic and the resilient client patterns.

Related Issues

…ration

sameh-farouk · 2025-08-06T03:46:42Z

This PR opens the door to possibilities that weren't possible before, allowing Fast Fail Mechanism (can be introducd in complementary PR).

Now that the relay can quickly and cheaply determine if a twin is locally connected, it can reject incoming messages when it knows for certain that the destination twin is currently offline (disconnected). Instead of blindly accepting the message and queuing it (giving the false impression to the originating relay that the message was delivered) it can reject the federation request immediately.

What happens when the relay refuses the message?

The originating relay can immediately retry delivery through other relays that the destination twin is known to be connected to. This ensures message delivery as long as the twin is connected to at least one of its configured relays (as opposed to requiring connection to all relays, as before).
If the destination twin is completely offline (disconnected from all relays), the originating relay can quickly detect this and respond to the source twin that the destination is currently unreachable.
This allows the relay to act as a “fast fail” path when desired—and this behavior can be made configurable.

If the application (e.g. Dashboard) using the source twin retries with an alternate destination upon failure, or used waiting in front of it, it can benefit from the fast-fail mechanism—resulting in improved performance and responsiveness, instead of waiting for the message to time out.

…e relays field

…is offline

sameh-farouk · 2025-08-18T07:45:21Z

I extended the PR to include the event listener rework.

clippy fixes

…ion with peers and relays

…elivery

sameh-farouk · 2025-08-20T08:02:30Z

Added a federation rework.
The new federation represents a major architectural shift from using Redis Lists to Redis Streams, resulting in significant behavioral changes that make it fundamentally more reliable, scalable, and robust.

…andling

…imizations - Switch JSON → bincode for performance - Rewrite catch-up logic for speed and resilience - Optimize Redis command generation in relay worker Rationale: - Lower serialization overhead and improve end-to-end throughput

…tion - Adjust APIs and fix integration issues introduced by upgrades Rationale: - Keep stack current; performance and maintenance benefits

- events: timeout(12s) for OnlineClient::from_url; capped backoff; clearer catch-up loop - federation: RETRY_THROTTLE_SECS 15→2; CLAIM_MIN_IDLE_MS 30_000→3_000; XPENDING 100→200; XREAD 10→100; BLOCK 1000→500 - router: reqwest connect_timeout 2s; per-request timeout 3s - relay: use serve_connection_with_upgrades; adjust timeouts; better logging Rationale: - Tighter bounds on latency and faster recovery from failures

- Implement PartialOrd/Ord for SessionID using (TwinID, Option<&str>) - Replace to_string-based sort with sort_unstable Rationale: - Fewer allocations and faster set rebuilds with deterministic order

- Add size/time-based ACK batching (size=128, flush=10ms) - Final flush on shutdown; import Instant/Duration Rationale: - Cuts per-message XDEL round-trips; improves throughput under load

- Replace naive pool sizing with explicit headroom - wiggle = max(10% of workers, 8); fed_size = wiggle * 2 - blocking = workers + fed_size; ops_headroom = max(blocking/4, 16) - pool_size = blocking + ops_headroom; log pool breakdown - Federation: PARALLEL_PER_CONSUMER = 4; buffer_unordered for claimed/new - Import futures_util::StreamExt; tidy msg extraction and tests Rationale: - Prevent pool exhaustion from BLOCK’ed readers - Reduce head-of-line blocking without request storms

- Replace Arc<Mutex<ClientWrapper>> with ArcSwap<Client> for lock-free reads - Add per-attempt timeout (12s) and URL rotation in connect_with_timeouts - Singleflight reconnect via reconnect_lock on RPC disconnects - Deduplicate concurrent cache misses via shared futures (in_flight) - Retry update_twin/get_twin/get_twin_with_account on ClientError::Rpc(_) - Remove ClientWrapper and related code Rationale: - Avoids serialized client access and long-held locks across awaits - Fails fast on slow/broken endpoints and prevents thundering herd

…nations

feat: optimize message routing by checking local sessions before fede…

47af758

…ration

sameh-farouk marked this pull request as draft August 5, 2025 22:36

sameh-farouk added 3 commits August 18, 2025 05:17

fix: improve relay cache reliability; reduce RPC dependency; deprecat…

8e10918

…e relays field

feat(api): support fast-fail tag; receive immediate response if peer …

facd559

…is offline

fix cargo fmt

678535f

sameh-farouk marked this pull request as ready for review August 18, 2025 02:54

sameh-farouk added 4 commits August 18, 2025 14:36

clippy fix

18c66e2

refactor: improve error handling and metrics in relay components

43b0909

clippy fixes

Optimize the connection to maintain long-lived, performant communicat…

4112843

…ion with peers and relays

feat: refactor federation to use Redis Streams for reliable message d…

59771b6

…elivery

sameh-farouk added 10 commits August 20, 2025 12:10

refactor: extract error envelope builder and improve error response h…

7028495

…andling

feat: upgrade to hyper 1, http 1, reqwest 0.12; update server integra…

bb657c7

…tion - Adjust APIs and fix integration issues introduced by upgrades Rationale: - Keep stack current; performance and maintenance benefits

perf(switch): implement Ord for SessionID; use sort_unstable

c293676

- Implement PartialOrd/Ord for SessionID using (TwinID, Option<&str>) - Replace to_string-based sort with sort_unstable Rationale: - Fewer allocations and faster set rebuilds with deterministic order

perf(relay/api): batch ACKs in ConnectionWriter to reduce Redis RTTs

7c15b90

- Add size/time-based ACK batching (size=128, flush=10ms) - Final flush on shutdown; import Instant/Duration Rationale: - Cuts per-message XDEL round-trips; improves throughput under load

logs: log level adjustment for block number debug->trace

4bb7902

style: fix code formatting

1fbea6f

sameh-farouk changed the title ~~Optimize message routing~~ Overhaul core components for production resilience and performance Aug 25, 2025

sameh-farouk added 2 commits August 28, 2025 18:02

feat: reduce worker read timeout and improve worker fairness under load

de6ff21

rename fast-fail to fail-fast and remove downvoting for offline desti…

462dc2b

…nations

xmonader approved these changes Sep 21, 2025

View reviewed changes

sameh-farouk merged commit 4a44655 into main Sep 21, 2025
1 check passed

sameh-farouk deleted the main-feat-optimize-message-routing branch September 21, 2025 15:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Overhaul core components for production resilience and performance#227

Overhaul core components for production resilience and performance#227
sameh-farouk merged 20 commits into
mainfrom
main-feat-optimize-message-routing

sameh-farouk commented Aug 5, 2025 •

edited

Loading

Uh oh!

sameh-farouk commented Aug 6, 2025 •

edited

Loading

Uh oh!

sameh-farouk commented Aug 18, 2025 •

edited

Loading

Uh oh!

sameh-farouk commented Aug 20, 2025 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

sameh-farouk commented Aug 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

1. Architectural & Reliability Overhaul

2. Performance & Efficiency

3. Networking & Protocol Modernization

4. Observability

5. Dependencies & Testing

Related Issues

Uh oh!

sameh-farouk commented Aug 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sameh-farouk commented Aug 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sameh-farouk commented Aug 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

sameh-farouk commented Aug 5, 2025 •

edited

Loading

sameh-farouk commented Aug 6, 2025 •

edited

Loading

sameh-farouk commented Aug 18, 2025 •

edited

Loading

sameh-farouk commented Aug 20, 2025 •

edited

Loading