Load Balancing: Challenges & Solutions

Context

OpenTelemetry Protocol with Apache Arrow (OTAP) maintains per-stream state (schemas, dictionaries, other compression metadata) to reduce wire size; longer-lived streams generally compress better but accumulate memory and require coordination when recycled.

In thread-per-core architectures using SO_REUSEPORT, the kernel creates an independent accept queue per listening socket and selects a socket based on a hash of the connection 4-tuple (source/destination IP, source/destination port). This approach reduces accept contention but makes load distribution dependent on connection diversity.

Since gRPC multiplexes multiple logical calls (including bidirectional streaming RPCs) over a single HTTP/2 (and thus single TCP) connection, too few client connections can cause traffic to concentrate on a single reuseport bucket (one core) behind an L4 balancer. Fine-grained distribution requires either L7 (HTTP/2-aware) load balancing or deliberate client fan-out.

Addressing these interactions is crucial for reliable scalability. This document outlines the associated challenges, trade-offs, and practical mitigations for both exporters and servers.

Key Challenges

#	Challenge	Symptoms	Root Cause
1	Skewed listener utilization	All (or most) exporter TCP connections land on the same reuseport listener => one core saturated, others idle	Too few distinct TCP 4-tuples (e.g. single long-lived gRPC channel) means the kernel's reuseport hash has little entropy; distribution collapses statistically.
2	In-stream vs connection imbalance	Recycling an OTAP stream clears state but stays on the same TCP connection, so work remains pinned to the same listener/core	gRPC stream lifetime != TCP connection lifetime; reuseport selection happens once per connection handshake. To move work you must rotate the connection or rebalance at L7.
3	Dictionary / stream-state growth	High-cardinality attributes inflate Arrow dictionary & per-stream memory	Stateful OTAP encoding accumulates per-stream dictionaries until recycled or bounded by receiver memory/admission limits.
4	Single accept listener anti-pattern	One thread does all `accept()` => CPU hotspot, lock contention, cache/NUMA misses, global back-pressure	Shared accept queues can become contended and distribute unevenly; `SO_REUSEPORT` (per-socket queues) improves scalability but has fairness/latency trade-offs.

Table notes:

Reuseport hashing distributes connections, not individual gRPC streams; ensure enough connections for statistical balance. See The SO_REUSEPORT socket option - LWN.net and Performance best practices with gRPC - Microsoft Learn
Consider tail-latency implications when moving from a shared accept queue to per-socket queues. See Why does one NGINX worker take all the load? and Perfect locality and three epic SystemTap scripts

Solution Space

1. Client-Side Techniques (Exporter)

Technique	Purpose	Considerations
Increase connection fan-out (`num_streams` / multiple gRPC channels)	Create multiple concurrent TCP connections so reuseport has entropy; improves distribution and concurrency headroom. Start near `max(1, CPUs/2)` (current `otelarrowexporter` default) and tune; cap for TLS / resource overhead.	More channels consume sockets, TLS handshakes, and exporter CPU; too many can hurt compression efficiency because state is sharded.
Client-side gRPC load-balancing policy (`round_robin`, etc.)	Ensure channels spread across backend endpoints (or front-end L7 proxies) instead of `pick_first` pinning. OTEL-Arrow exporter defaults to `round_robin`.	Requires name resolution / endpoint list; risk of thundering herd on endpoint change if misconfigured.
Stream lifetime control (`max_stream_lifetime`)	Periodically recycle OTAP streams to bound dictionary growth and help downstream rebalancers shed load; default 30s in current Go exporter (tunable).	Recycling forces resending schemas/dictionaries; shorter lifetimes reduce compression efficiency; coordinate with server `keepalive`.
Back-pressure aware retries	Honor exporter/receiver back-pressure signals so clients can open additional channels or shed load when a stream saturates.	Requires instrumentation and retry logic; careless retries can amplify load.

Important note: Do not trust uncooperative or malicious clients to recycle streams or limit dictionary growth => enforce server-side caps.

2. Server-Side Techniques

2.a. Custom `SO_REUSEPORT` BPF Selection

An eBPF program (SO_ATTACH_REUSEPORT_EBPF / BPF_PROG_TYPE_SK_REUSEPORT) can be attached to influence kernel listener selection within a reuseport group. Practical uses include adding randomization, weighting sockets, or leveraging additional header information to mitigate distribution skew.

2.b. Front-End L7 (HTTP/2-aware) Load Balancer

Deploying an HTTP/2-aware proxy (e.g. Envoy, NGINX) that terminates HTTP/2/TLS and creates new backend connections enables finer distribution of individual gRPC calls/streams, effectively addressing L4 connection pinning. This is the recommended approach for achieving balanced load distribution of long-lived streaming RPCs.

Tip: If you already run a thread-per-core backend behind the proxy, you typically expose one service port and rely on reuseport behind the proxy; per-core port sharding is rarely needed.

2.c. eBPF SK_MSG / SOCKMAP Stream-Level Experiments (Advanced)

Sockmap/SK_MSG programs can redirect application payloads between established sockets in the kernel and are used in advanced in-kernel proxies and service meshes. In theory we could parse (plaintext) HTTP/2 frames and steer them across worker sockets, but doing so (especially through TLS) is complex and typically unnecessary if an L7 proxy is available. Treat as research / last resort.

2.d. Internal Topic-Based Split (Pipeline Decoupling)

When SO_REUSEPORT is unavailable (or ineffective) and/or incoming connection count is too low to distribute load well across cores, a complementary approach is to split one pipeline into two pipelines connected by an internal in-memory topic.

Ingress pipeline (n cores)                  Processing/Export pipeline (m cores)
receiver -> (minimal processing) -> topic   topic -> processors -> exporters

Recommended shape:

Ingress pipeline (n cores): optimize for receiver-side work only (accept/decode/admission), then publish each received batch to a local in-memory topic.
Processing/export pipeline (m cores): consume that topic with balanced subscriptions, then run heavier processing/export logic.

Why this helps:

Distribution happens at the batch level (not per signal), so scheduling overhead remains low.
Receiver-side connection skew is absorbed by the topic boundary, while downstream worker cores still get balanced batches.
n and m can be tuned independently (I/O-heavy ingest vs CPU-heavy processing).
Best efficiency is typically achieved when both pipelines are pinned to the same NUMA node, minimizing cross-node memory traffic.

Trade-offs to plan for:

Extra queueing boundary (capacity sizing and queue_on_full policy matter).
Potential additional latency under sustained backpressure.
This is complementary to L7/reuseport strategies, not a replacement for network-level balancing across hosts.

Recommended Baseline Configuration

Per-CPU listener sockets: Use SO_REUSEPORT (one service port; one socket per core/worker) to reduce accept contention and improve CPU locality.
Exporter connection fan-out: Begin with num_streams ~= max(1, CPUs/2) (current default); monitor and adjust upward if skew persists and resources permit.
Stream recycling: Start with max_stream_lifetime set between 30 seconds and 2 minutes; lengthen this interval for improved compression if memory allows, or shorten it to rebalance faster or respect proxy keepalive constraints. Align with server-side settings like keepalive and max_connection_age.
Observability: Monitor per-listener connection counts, QPS, latency, and compression efficiency; alert when deviation exceeds acceptable thresholds (e.g. >20% from median) aligned with your Service Level Objectives (SLOs).
Low-connection or no-reuseport fallback: If listener-level skew persists because connection entropy is low (or SO_REUSEPORT is not available), consider the internal topic-based split (ingest -> topic -> processing) to rebalance work at batch granularity.

Future Work

Adaptive autotuning of stream reset intervals and connection fan-out based on live operational metrics.
Prototype & benchmark eBPF hash strategies.
Investigate hot-stream migration without a full TCP reconnect (protocol extension): explore methods for shifting active OTAP gRPC streams to new listeners/cores, allowing load rebalancing without forcing exporters to reconnect or resend Arrow dictionaries.

Appendix - Why Not a Single Accept Listener?

CPU & lock contention: A shared accept queue forces synchronization among workers, potentially skewing load distribution particularly when combined with epoll's LIFO behavior, which preferentially feeds the busiest worker.
Locality & scalability: Per-socket queues improve packet locality and reduce cross-CPU bouncing, aiding multicore scaling (NUMA).
Extra syscalls: Passing accepted sockets between workers introduces unnecessary syscall overhead.
Global back-pressure: one slow worker stalls all new connections.
Redundant engineering: Duplicates functionality already efficiently provided by SO_REUSEPORT.
Failure risk: single point of failure. If the acceptor crashes, all new connections stall.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Load Balancing: Challenges & Solutions

Context

Key Challenges

Solution Space

1. Client-Side Techniques (Exporter)

2. Server-Side Techniques

2.a. Custom `SO_REUSEPORT` BPF Selection

2.b. Front-End L7 (HTTP/2-aware) Load Balancer

2.c. eBPF SK_MSG / SOCKMAP Stream-Level Experiments (Advanced)

2.d. Internal Topic-Based Split (Pipeline Decoupling)

Recommended Baseline Configuration

Future Work

Appendix - Why Not a Single Accept Listener?

References & Further Reading

FilesExpand file tree

load-balancing.md

Latest commit

History

load-balancing.md

File metadata and controls

Load Balancing: Challenges & Solutions

Context

Key Challenges

Solution Space

1. Client-Side Techniques (Exporter)

2. Server-Side Techniques

2.a. Custom SO_REUSEPORT BPF Selection

2.b. Front-End L7 (HTTP/2-aware) Load Balancer

2.c. eBPF SK_MSG / SOCKMAP Stream-Level Experiments (Advanced)

2.d. Internal Topic-Based Split (Pipeline Decoupling)

Recommended Baseline Configuration

Future Work

Appendix - Why Not a Single Accept Listener?

References & Further Reading

2.a. Custom `SO_REUSEPORT` BPF Selection