OpenTelemetry Protocol with Apache Arrow (OTAP) maintains per-stream state (schemas, dictionaries, other compression metadata) to reduce wire size; longer-lived streams generally compress better but accumulate memory and require coordination when recycled.
In thread-per-core architectures using SO_REUSEPORT, the kernel creates an
independent accept queue per listening socket and selects a socket based on a
hash of the connection 4-tuple (source/destination IP, source/destination port).
This approach reduces accept contention but makes load distribution dependent on
connection diversity.
Since gRPC multiplexes multiple logical calls (including bidirectional streaming RPCs) over a single HTTP/2 (and thus single TCP) connection, too few client connections can cause traffic to concentrate on a single reuseport bucket (one core) behind an L4 balancer. Fine-grained distribution requires either L7 (HTTP/2-aware) load balancing or deliberate client fan-out.
Addressing these interactions is crucial for reliable scalability. This document outlines the associated challenges, trade-offs, and practical mitigations for both exporters and servers.
| # | Challenge | Symptoms | Root Cause |
|---|---|---|---|
| 1 | Skewed listener utilization | All (or most) exporter TCP connections land on the same reuseport listener => one core saturated, others idle | Too few distinct TCP 4-tuples (e.g. single long-lived gRPC channel) means the kernel's reuseport hash has little entropy; distribution collapses statistically. |
| 2 | In-stream vs connection imbalance | Recycling an OTAP stream clears state but stays on the same TCP connection, so work remains pinned to the same listener/core | gRPC stream lifetime != TCP connection lifetime; reuseport selection happens once per connection handshake. To move work you must rotate the connection or rebalance at L7. |
| 3 | Dictionary / stream-state growth | High-cardinality attributes inflate Arrow dictionary & per-stream memory | Stateful OTAP encoding accumulates per-stream dictionaries until recycled or bounded by receiver memory/admission limits. |
| 4 | Single accept listener anti-pattern | One thread does all accept() => CPU hotspot, lock contention, cache/NUMA misses, global back-pressure |
Shared accept queues can become contended and distribute unevenly; SO_REUSEPORT (per-socket queues) improves scalability but has fairness/latency trade-offs. |
Table notes:
- Reuseport hashing distributes connections, not individual gRPC streams; ensure enough connections for statistical balance. See The SO_REUSEPORT socket option - LWN.net and Performance best practices with gRPC - Microsoft Learn
- Consider tail-latency implications when moving from a shared accept queue to per-socket queues. See Why does one NGINX worker take all the load? and Perfect locality and three epic SystemTap scripts
| Technique | Purpose | Considerations |
|---|---|---|
Increase connection fan-out (num_streams / multiple gRPC channels) |
Create multiple concurrent TCP connections so reuseport has entropy; improves distribution and concurrency headroom. Start near max(1, CPUs/2) (current otelarrowexporter default) and tune; cap for TLS / resource overhead. |
More channels consume sockets, TLS handshakes, and exporter CPU; too many can hurt compression efficiency because state is sharded. |
Client-side gRPC load-balancing policy (round_robin, etc.) |
Ensure channels spread across backend endpoints (or front-end L7 proxies) instead of pick_first pinning. OTEL-Arrow exporter defaults to round_robin. |
Requires name resolution / endpoint list; risk of thundering herd on endpoint change if misconfigured. |
Stream lifetime control (max_stream_lifetime) |
Periodically recycle OTAP streams to bound dictionary growth and help downstream rebalancers shed load; default 30s in current Go exporter (tunable). | Recycling forces resending schemas/dictionaries; shorter lifetimes reduce compression efficiency; coordinate with server keepalive. |
| Back-pressure aware retries | Honor exporter/receiver back-pressure signals so clients can open additional channels or shed load when a stream saturates. | Requires instrumentation and retry logic; careless retries can amplify load. |
Important note: Do not trust uncooperative or malicious clients to recycle streams or limit dictionary growth => enforce server-side caps.
An eBPF program (SO_ATTACH_REUSEPORT_EBPF / BPF_PROG_TYPE_SK_REUSEPORT) can
be attached to influence kernel listener selection within a reuseport group.
Practical uses include adding randomization, weighting sockets, or leveraging
additional header information to mitigate distribution skew.
Deploying an HTTP/2-aware proxy (e.g. Envoy, NGINX) that terminates HTTP/2/TLS and creates new backend connections enables finer distribution of individual gRPC calls/streams, effectively addressing L4 connection pinning. This is the recommended approach for achieving balanced load distribution of long-lived streaming RPCs.
Tip: If you already run a thread-per-core backend behind the proxy, you typically expose one service port and rely on reuseport behind the proxy; per-core port sharding is rarely needed.
Sockmap/SK_MSG programs can redirect application payloads between established sockets in the kernel and are used in advanced in-kernel proxies and service meshes. In theory we could parse (plaintext) HTTP/2 frames and steer them across worker sockets, but doing so (especially through TLS) is complex and typically unnecessary if an L7 proxy is available. Treat as research / last resort.
When SO_REUSEPORT is unavailable (or ineffective) and/or incoming connection
count is too low to distribute load well across cores, a complementary approach
is to split one pipeline into two pipelines connected by an internal in-memory
topic.
Ingress pipeline (n cores) Processing/Export pipeline (m cores)
receiver -> (minimal processing) -> topic topic -> processors -> exporters
Recommended shape:
- Ingress pipeline (
ncores): optimize for receiver-side work only (accept/decode/admission), then publish each received batch to a local in-memory topic. - Processing/export pipeline (
mcores): consume that topic with balanced subscriptions, then run heavier processing/export logic.
Why this helps:
- Distribution happens at the batch level (not per signal), so scheduling overhead remains low.
- Receiver-side connection skew is absorbed by the topic boundary, while downstream worker cores still get balanced batches.
nandmcan be tuned independently (I/O-heavy ingest vs CPU-heavy processing).- Best efficiency is typically achieved when both pipelines are pinned to the same NUMA node, minimizing cross-node memory traffic.
Trade-offs to plan for:
- Extra queueing boundary (capacity sizing and
queue_on_fullpolicy matter). - Potential additional latency under sustained backpressure.
- This is complementary to L7/reuseport strategies, not a replacement for network-level balancing across hosts.
- Per-CPU listener sockets: Use
SO_REUSEPORT(one service port; one socket per core/worker) to reduce accept contention and improve CPU locality. - Exporter connection fan-out: Begin with
num_streams ~= max(1, CPUs/2)(current default); monitor and adjust upward if skew persists and resources permit. - Stream recycling: Start with
max_stream_lifetimeset between 30 seconds and 2 minutes; lengthen this interval for improved compression if memory allows, or shorten it to rebalance faster or respect proxy keepalive constraints. Align with server-side settings likekeepaliveandmax_connection_age. - Observability: Monitor per-listener connection counts, QPS, latency, and compression efficiency; alert when deviation exceeds acceptable thresholds (e.g. >20% from median) aligned with your Service Level Objectives (SLOs).
- Low-connection or no-reuseport fallback: If listener-level skew persists
because connection entropy is low (or
SO_REUSEPORTis not available), consider the internal topic-based split (ingest -> topic -> processing) to rebalance work at batch granularity.
- Adaptive autotuning of stream reset intervals and connection fan-out based on live operational metrics.
- Prototype & benchmark eBPF hash strategies.
- Investigate hot-stream migration without a full TCP reconnect (protocol extension): explore methods for shifting active OTAP gRPC streams to new listeners/cores, allowing load rebalancing without forcing exporters to reconnect or resend Arrow dictionaries.
- CPU & lock contention: A shared accept queue forces synchronization among workers, potentially skewing load distribution particularly when combined with epoll's LIFO behavior, which preferentially feeds the busiest worker.
- Locality & scalability: Per-socket queues improve packet locality and reduce cross-CPU bouncing, aiding multicore scaling (NUMA).
- Extra syscalls: Passing accepted sockets between workers introduces unnecessary syscall overhead.
- Global back-pressure: one slow worker stalls all new connections.
- Redundant engineering: Duplicates functionality already efficiently
provided by
SO_REUSEPORT. - Failure risk: single point of failure. If the acceptor crashes, all new connections stall.
- Linux
socket(7)manual -SO_REUSEPORT,SO_ATTACH_REUSEPORT_EBPFoptions - The SO_REUSEPORT socket option* (LWN.net, Kerrisk)
- NGINX
listen ... reuseportdirective (socket sharding) - Why does one NGINX worker take all the load? (Cloudflare) - accept queue models & balancing
- Perfect locality and three SystemTap scripts (Cloudflare) - reuseport & locality effects
- Performance Optimisation using SO_REUSEPORT* (Marten Gartner)
- eBPF-Powered Load Balancing for SO_REUSEPORT
- eBPF Program Type
SK_MSG& Sockmap redirection docs, BPF_MAP_TYPE_SOCKMAP and BPF_MAP_TYPE_SOCKHASH - Linux Kernel Docs - gRPC Performance Best Practices - L4 vs L7 load balancing; streaming pinning - Microsoft Learn
- gRPC Custom Load Balancing Policies
- Envoy gRPC proxying & load-balancing overview
- Introducing gRPC Support with NGINX
- OTel-Arrow in Production (memory limits, tuning, back-pressure)
- Apache Arrow Columnar Format & Dictionary Encoding spec
- OTAP Phase-2 Announcement (Rust / thread-per-core direction)