Skip to content

Latest commit

 

History

History
191 lines (153 loc) · 14.1 KB

File metadata and controls

191 lines (153 loc) · 14.1 KB

Load Balancing: Challenges & Solutions

Context

OpenTelemetry Protocol with Apache Arrow (OTAP) maintains per-stream state (schemas, dictionaries, other compression metadata) to reduce wire size; longer-lived streams generally compress better but accumulate memory and require coordination when recycled.

In thread-per-core architectures using SO_REUSEPORT, the kernel creates an independent accept queue per listening socket and selects a socket based on a hash of the connection 4-tuple (source/destination IP, source/destination port). This approach reduces accept contention but makes load distribution dependent on connection diversity.

Since gRPC multiplexes multiple logical calls (including bidirectional streaming RPCs) over a single HTTP/2 (and thus single TCP) connection, too few client connections can cause traffic to concentrate on a single reuseport bucket (one core) behind an L4 balancer. Fine-grained distribution requires either L7 (HTTP/2-aware) load balancing or deliberate client fan-out.

Addressing these interactions is crucial for reliable scalability. This document outlines the associated challenges, trade-offs, and practical mitigations for both exporters and servers.

Key Challenges

# Challenge Symptoms Root Cause
1 Skewed listener utilization All (or most) exporter TCP connections land on the same reuseport listener => one core saturated, others idle Too few distinct TCP 4-tuples (e.g. single long-lived gRPC channel) means the kernel's reuseport hash has little entropy; distribution collapses statistically.
2 In-stream vs connection imbalance Recycling an OTAP stream clears state but stays on the same TCP connection, so work remains pinned to the same listener/core gRPC stream lifetime != TCP connection lifetime; reuseport selection happens once per connection handshake. To move work you must rotate the connection or rebalance at L7.
3 Dictionary / stream-state growth High-cardinality attributes inflate Arrow dictionary & per-stream memory Stateful OTAP encoding accumulates per-stream dictionaries until recycled or bounded by receiver memory/admission limits.
4 Single accept listener anti-pattern One thread does all accept() => CPU hotspot, lock contention, cache/NUMA misses, global back-pressure Shared accept queues can become contended and distribute unevenly; SO_REUSEPORT (per-socket queues) improves scalability but has fairness/latency trade-offs.

Table notes:

Solution Space

1. Client-Side Techniques (Exporter)

Technique Purpose Considerations
Increase connection fan-out (num_streams / multiple gRPC channels) Create multiple concurrent TCP connections so reuseport has entropy; improves distribution and concurrency headroom. Start near max(1, CPUs/2) (current otelarrowexporter default) and tune; cap for TLS / resource overhead. More channels consume sockets, TLS handshakes, and exporter CPU; too many can hurt compression efficiency because state is sharded.
Client-side gRPC load-balancing policy (round_robin, etc.) Ensure channels spread across backend endpoints (or front-end L7 proxies) instead of pick_first pinning. OTEL-Arrow exporter defaults to round_robin. Requires name resolution / endpoint list; risk of thundering herd on endpoint change if misconfigured.
Stream lifetime control (max_stream_lifetime) Periodically recycle OTAP streams to bound dictionary growth and help downstream rebalancers shed load; default 30s in current Go exporter (tunable). Recycling forces resending schemas/dictionaries; shorter lifetimes reduce compression efficiency; coordinate with server keepalive.
Back-pressure aware retries Honor exporter/receiver back-pressure signals so clients can open additional channels or shed load when a stream saturates. Requires instrumentation and retry logic; careless retries can amplify load.

Important note: Do not trust uncooperative or malicious clients to recycle streams or limit dictionary growth => enforce server-side caps.

2. Server-Side Techniques

2.a. Custom SO_REUSEPORT BPF Selection

An eBPF program (SO_ATTACH_REUSEPORT_EBPF / BPF_PROG_TYPE_SK_REUSEPORT) can be attached to influence kernel listener selection within a reuseport group. Practical uses include adding randomization, weighting sockets, or leveraging additional header information to mitigate distribution skew.

2.b. Front-End L7 (HTTP/2-aware) Load Balancer

Deploying an HTTP/2-aware proxy (e.g. Envoy, NGINX) that terminates HTTP/2/TLS and creates new backend connections enables finer distribution of individual gRPC calls/streams, effectively addressing L4 connection pinning. This is the recommended approach for achieving balanced load distribution of long-lived streaming RPCs.

Tip: If you already run a thread-per-core backend behind the proxy, you typically expose one service port and rely on reuseport behind the proxy; per-core port sharding is rarely needed.

2.c. eBPF SK_MSG / SOCKMAP Stream-Level Experiments (Advanced)

Sockmap/SK_MSG programs can redirect application payloads between established sockets in the kernel and are used in advanced in-kernel proxies and service meshes. In theory we could parse (plaintext) HTTP/2 frames and steer them across worker sockets, but doing so (especially through TLS) is complex and typically unnecessary if an L7 proxy is available. Treat as research / last resort.

2.d. Internal Topic-Based Split (Pipeline Decoupling)

When SO_REUSEPORT is unavailable (or ineffective) and/or incoming connection count is too low to distribute load well across cores, a complementary approach is to split one pipeline into two pipelines connected by an internal in-memory topic.

Ingress pipeline (n cores)                  Processing/Export pipeline (m cores)
receiver -> (minimal processing) -> topic   topic -> processors -> exporters

Recommended shape:

  • Ingress pipeline (n cores): optimize for receiver-side work only (accept/decode/admission), then publish each received batch to a local in-memory topic.
  • Processing/export pipeline (m cores): consume that topic with balanced subscriptions, then run heavier processing/export logic.

Why this helps:

  • Distribution happens at the batch level (not per signal), so scheduling overhead remains low.
  • Receiver-side connection skew is absorbed by the topic boundary, while downstream worker cores still get balanced batches.
  • n and m can be tuned independently (I/O-heavy ingest vs CPU-heavy processing).
  • Best efficiency is typically achieved when both pipelines are pinned to the same NUMA node, minimizing cross-node memory traffic.

Trade-offs to plan for:

  • Extra queueing boundary (capacity sizing and queue_on_full policy matter).
  • Potential additional latency under sustained backpressure.
  • This is complementary to L7/reuseport strategies, not a replacement for network-level balancing across hosts.

Recommended Baseline Configuration

  1. Per-CPU listener sockets: Use SO_REUSEPORT (one service port; one socket per core/worker) to reduce accept contention and improve CPU locality.
  2. Exporter connection fan-out: Begin with num_streams ~= max(1, CPUs/2) (current default); monitor and adjust upward if skew persists and resources permit.
  3. Stream recycling: Start with max_stream_lifetime set between 30 seconds and 2 minutes; lengthen this interval for improved compression if memory allows, or shorten it to rebalance faster or respect proxy keepalive constraints. Align with server-side settings like keepalive and max_connection_age.
  4. Observability: Monitor per-listener connection counts, QPS, latency, and compression efficiency; alert when deviation exceeds acceptable thresholds (e.g. >20% from median) aligned with your Service Level Objectives (SLOs).
  5. Low-connection or no-reuseport fallback: If listener-level skew persists because connection entropy is low (or SO_REUSEPORT is not available), consider the internal topic-based split (ingest -> topic -> processing) to rebalance work at batch granularity.

Future Work

  • Adaptive autotuning of stream reset intervals and connection fan-out based on live operational metrics.
  • Prototype & benchmark eBPF hash strategies.
  • Investigate hot-stream migration without a full TCP reconnect (protocol extension): explore methods for shifting active OTAP gRPC streams to new listeners/cores, allowing load rebalancing without forcing exporters to reconnect or resend Arrow dictionaries.

Appendix - Why Not a Single Accept Listener?

  • CPU & lock contention: A shared accept queue forces synchronization among workers, potentially skewing load distribution particularly when combined with epoll's LIFO behavior, which preferentially feeds the busiest worker.
  • Locality & scalability: Per-socket queues improve packet locality and reduce cross-CPU bouncing, aiding multicore scaling (NUMA).
  • Extra syscalls: Passing accepted sockets between workers introduces unnecessary syscall overhead.
  • Global back-pressure: one slow worker stalls all new connections.
  • Redundant engineering: Duplicates functionality already efficiently provided by SO_REUSEPORT.
  • Failure risk: single point of failure. If the acceptor crashes, all new connections stall.

References & Further Reading