-
Notifications
You must be signed in to change notification settings - Fork 105
Network performance statistics collection
This document summarizes an early exploration of an idea of how to obtain interesting per-peer statistics for dashboard reporting as well as exposing some of it to the miniprotocols themselves to facilitate their logic in a more sophisticated ways. A new package window-stats is currently in development to support this approach, and the rest of the document discusses how to instrument/probe the code in specific places to collect the necessary data in an online fashion, what are the various tradeoffs and/or alternatives.
window-stats is a new package for efficient sliding-window statistics, backed by a finger tree, with a cached user-supplied measure that updates incrementally. This package can be leveraged in ouroboros-network to compute such metrics of interest like delays, round trip times, thruput, etc. for monitoring purposes as well as exposing a subset of this data to the miniprotocols for use in their decision logic.
A working document for instrumenting network-mux connection-level
metrics with the window-stats sliding-window machinery, intended to
feed dashboards (Grafana etc.).
Looking at Network.Mux.Trace and Network.Mux.Types:
BearerTrace (high-frequency, low-level):
-
TraceRecvRaw Int,TraceRecvStart Int,TraceRecvEnd Int— bytes per SDU read. -
TraceSendStart SDUHeader,TraceSendEnd— SDU sends (header carries protocol num, direction, length, remote timestamp). -
TraceRecvDeltaQObservation,TraceRecvDeltaQSample— DeltaQ samples are already computed in the muxer: sample duration, packet count, mean/variance of the delta-Q distribution, and an estimated rate. -
TraceTCPInfo StructTCPInfo(Linux): RTT, RTT variance, cwnd, lost, retrans, MSS, etc., straight from the kernel. -
TraceSDUReadTimeoutException,TraceSDUWriteTimeoutException.
ChannelTrace (per mini-protocol):
-
TraceChannelRecvStart MiniProtocolNum,TraceChannelRecvEnd MiniProtocolNum Int. -
TraceChannelSendStart MiniProtocolNum Int,TraceChannelSendEnd MiniProtocolNum.
Trace (high-level lifecycle): mini-protocol start/exit, mux state transitions.
That's enough source data to derive most useful per-connection
metrics by post-processing in the trace consumer. Timestamps come
either from values embedded in the event itself (e.g., the SDU
header's remoteTimestamp) or by sampling getMonotonicTime at the
consumer when the event arrives — with the caveat that consumer-side
timestamping is fine for throughput but adds jitter to latency
measurements (see Cost considerations).
Rough priority order for a cardano node:
-
Per-protocol throughput (bytes/sec, SDUs/sec, in and out). Derivable today from
TraceChannelSendStart/TraceChannelRecvEndand theirIntlength fields. For "on-the-wire" throughput, cross-reference the SDU sizes inBearerTrace'sTraceSendStartheader. Recipe in Per-protocol throughput below. -
Per-protocol latency distribution, split by direction:
-
Inbound, channel-level: time between
TraceChannelRecvStartandTraceChannelRecvEnd. The muxer's portion of receive latency — useful for spotting slow deserialisers. -
Outbound, bearer-level: time between
BearerTrace'sTraceSendStartandTraceSendEnd. The kernel-handoff time for one SDU; the SDU header carries theMiniProtocolNumso this is naturally per-protocol. See Bearer-occupancy distribution.
The channel-level send pair (
TraceChannelSendStart/TraceChannelSendEnd) is not the right measurement for outbound latency: it brackets a Wanton STM write and doesn't reflect the bearer.Both distributions are good candidates for a
TimedWindow Time (TDigest 100) (DigestSample 100)per protocol. -
Inbound, channel-level: time between
-
Bearer share / fairness: bytes shipped per protocol over a rolling window, normalised by total. Reveals starvation.
-
DeltaQ summary.
TraceRecvDeltaQSampleis already a periodic summary; just window the latest few. The muxer pre-computes the heavy lifting. -
TCP-layer signals (Linux): rolling mean/std of
tcpi_rttand rates oftcpi_retrans/tcpi_lost. Cheap to compute, often the earliest indicator of network trouble. -
Error rates: count of
TraceSDU{Read,Write}TimeoutException,TraceExceptionExitper protocol, over the window. -
Mini-protocol uptime / restart rate: derivable from
TraceCleanExit/TraceExceptionExit. A sliding window doesn't add much here; a plain counter does the job. -
Aggregate bearer throughput for the whole connection — useful as a sanity check against the sum of per-protocol byte counts.
The following can't be derived from the current traces because the relevant events aren't emitted:
-
Egress queue depth per mini-protocol. The muxer's egress side peels from per-protocol Wantons; queue length / bytes pending is the headline backpressure indicator. Suggested trace:
TraceEgressQueueBytes MiniProtocolNum Int, sampled at each dispatch or periodically. -
Oldest-byte age in the egress queue. How long bytes have been sitting in the per-protocol queue before being dispatched — the user-visible head-of-line latency. Suggested trace:
TraceEgressOldestByteAge MiniProtocolNum DiffTimeper dispatch. See Time-in-queue and oldest-byte-age. -
Ingress dispatch latency. Symmetric to (2) but on the receive side: time the demuxer placed an SDU in the destination protocol's ingress queue → time the protocol picked it up.
-
Per-protocol fairness counter (egress slots awarded). Today you can infer this from
TraceSendStart(header carries the protocol num), so this is almost there — usable as-is. -
Bearer wait time — how long the egress thread blocked waiting for any protocol to have data. Distinguishes "bearer idle because no traffic" from "bearer saturated".
(1) and (2) are the most valuable additions; (3) is symmetric. (4) and (5) are nice-to-haves. The recipe for (2) in particular hinges on the Wanton model — see the dedicated section below before considering an implementation.
For a node with ~80 inbound + ~50 outbound peers × 5 mini-protocols
× ~4 metrics = ~2600 windows. At ~32 KB per 1024-sample window with
a Welford measure, that's ~80 MB resident — significant but
tractable. TDigest-backed windows are smaller (the digest is
O(δ), not O(w)) but each insert costs O(δ log w).
-
Pick
Time, notUTCTimefor the timestamp type. Smaller per-sample footprint, no clock-jump correctness concerns, and consistent with the rest of ouroboros-network. -
Avoid Welford for high-cardinality samples that don't need variance. If you only want mean and rate, a
Sum-measure window is markedly cheaper per insert than aWelfordMeasureone (Welford's<>does several mul/div per node along the spine). -
Sample, don't track every event for low-importance metrics. If you care about p99 RTT, you don't need every TCPInfo trace — a 1 Hz sample feeds a 5-minute window of 300 entries and is still informative.
At the trace consumer, never on the trace-emit path. Adding new
state to mux for stamping purposes (e.g., the oldestEnqueued :: TVar (Maybe Time) for the oldest-byte-age recipe below) is fine
and sometimes necessary — but the sliding-window machinery itself
(the FT inserts, the rolling-measure recomputations) belongs at the
consumer. Putting it in the trace path lengthens every
traceWith call.
- Capture-at-consumer is fine for throughput, rate counters, and anything that's a count or ratio over a window much larger than scheduler jitter.
- Capture-at-emit is needed for latency distributions: latency is a difference of two timestamps; a small constant consumer delay cancels, but its variance doesn't. For sub-millisecond latencies the consumer-side jitter is the same order as the signal. Embed timestamps in the trace event or wrap the tracer with a stamping combinator.
The snippets below are in IO for brevity. In real mux code, use
the io-classes style — MonadSTM m, MonadMonotonicTime m —
consistent with the surrounding codebase. getMonotonicTime is
deliberately called outside atomically to avoid blocking IO
inside an STM transaction.
The intuitive recipe goes:
rate = windowSum w `div` windowDuration wThe trap: windowDuration w is the span from the oldest to the
newest sample currently in the window — not the configured duration.
Three failure modes:
-
Sparse traffic. Two samples 100 ms apart in a 60 s window:
windowDurationreturns 100 ms, the rate looks hundreds of times higher than the actual sustained rate. -
Single-sample window.
windowDurationis 0 /Nothing; division by zero. -
Cold start. No samples,
windowDurationisNothing.
The metric you actually want is "bytes per second averaged over the last N seconds, treating quiet periods as zero". The denominator should be the configured duration, not the observed span.
import Control.Monad.Class.MonadTime.SI (Time, DiffTime, getMonotonicTime)
import Data.Window.Timed
import Data.Monoid (Sum (..))
-- For very long windows / very fast links, switch to Int64 / Integer
-- to avoid Int overflow (a 10 Gbit link over 60 s carries ~75 GB,
-- which still fits in 64-bit Int).
type ThroughputWin = TimedWindow Time (Sum Int) (SumSample Int)
newThroughputWin :: DiffTime -> ThroughputWin
newThroughputWin = empty
recordBytes :: Time -> Int -> ThroughputWin -> ThroughputWin
recordBytes t bytes = insert (t, bytes)
-- | Bytes per second averaged over the configured window. Returns
-- 'Nothing' until the window has actually spanned its configured
-- duration (i.e. it's warmed up).
bytesPerSecond :: ThroughputWin -> Maybe Double
bytesPerSecond w = do
observed <- windowDuration w
let configured :: DiffTime
configured = windowMaxDuration w
if observed < configured
then Nothing
else do
total <- windowSum w
return (fromIntegral total / diffTimeToSeconds configured)
where
diffTimeToSeconds = fromRational . toRational- Denominator is
windowMaxDuration w, notwindowDuration w. -
Warm-up handling. Returning
Nothinguntilobserved >= configuredkeeps the first N seconds out of the dashboard rather than producing a misleading low spike.
sendThroughputTracer
:: TVar IO (Map MiniProtocolNum ThroughputWin)
-> ChannelTrace
-> IO ()
sendThroughputTracer ref (TraceChannelSendStart mpn len) = do
!now <- getMonotonicTime
atomically $ modifyTVar' ref $
Map.adjust (recordBytes now len) mpn
sendThroughputTracer _ _ = return ()For receive, use TraceChannelRecvEnd mpn len. When publishing to
the dashboard, snapshot the TVar and run bytesPerSecond on each
window.
-
Don't divide by the observed span. Tempting when warm-up
Nothings look awkward on the dashboard — but the alternative is wildly wrong rates during low-traffic periods. -
TraceChannelSendStartlengths are payload-only. They exclude mux SDU headers. For wire throughput, cross-referenceBearerTrace.TraceSendStartSDU header lengths. - Choose the configured duration thoughtfully. Too short (1 s) gives noisy readings and short warm-up; too long (60 s) smooths over interesting transients. For most dashboards 5–15 s is a sensible default. Run multiple parallel windows (short for spike detection, long for trend).
-
Aggregating across protocols. Don't naively sum per-protocol
bytesPerSecondvalues — they may be in different warm-up states. Either maintain a separate aggregate window, or sum only warmed-up readings.
import Data.Window.Fold.Timed (windowFoldRollingFull, windowMeasureFinal)
import Data.Monoid (Sum (..))
import qualified Control.Foldl as L
diffTimeToSeconds :: DiffTime -> Double
diffTimeToSeconds = fromRational . toRational
-- final total bytes at end of input (not a rate)
totalBytesFinal :: DiffTime -> L.Fold (Time, Int) (Sum Int)
totalBytesFinal = windowMeasureFinal
-- streamed rolling rates (Bytes/sec), warm-up gated
rollingRate :: DiffTime -> L.Fold (Time, Int) [Double]
rollingRate d =
let perSecond (Sum total) =
fromIntegral total / diffTimeToSeconds d
in fmap perSecond <$> windowFoldRollingFull d L.listwindowFoldRollingFull embeds the warm-up gate, matching the
bytesPerSecond semantics above.
The egress side is not a queue of messages: it's a Wanton,
i.e. a TVar ByteString (wrapped in a newtype). send appends
bytes to the buffer; processSingleWanton later peels off chunks of
up to ~12 kB (one SDU) and round-robins between protocols. A single
send may be fragmented across several SDU dispatches; multiple
sends may coalesce in the buffer with no gap. The mux has no
notion of "message boundary" — to it, each protocol is a continuous
byte stream:
app calls Channel.send
↓
bytes appended to Wanton ← head-of-line ageing starts here
↓
egress thread peels off an SDU
↓
TraceSendStart ← bearer-occupancy starts ticking
↓
kernel send() syscall
↓
TraceSendEnd ← bearer-occupancy stops
This rules out the naive "wrap each enqueued element with a timestamp" approach. We need to be clearer about what we want to measure.
-
(A) First-byte latency per
sendcall — time fromsendreturning to the first byte of that call's bytes hitting the wire. -
(B) Last-byte latency per
sendcall — time fromsendreturning to the last byte hitting the wire. The most user-visible: it's when the receiver can start parsing. - (C) Oldest-byte age signal — one timestamp per Wanton: the arrival time of the oldest currently-unsent byte. Reported per dispatch.
You don't need to thread the timestamp through the buffer. Keep a
sidecar Seq of (cumulativeEndOffset, sendTime) plus running
counters:
data Wanton = Wanton
{ bytes :: TVar ByteString
, sendStamps :: TVar (Seq (Int, Time)) -- (cumulative end offset, send time)
, bytesIn :: TVar Int -- cumulative bytes ever enqueued
, bytesOut :: TVar Int -- cumulative bytes ever dispatched
}
send :: Wanton -> ByteString -> IO ()
send w bs = do
!now <- getMonotonicTime -- IO outside atomically by design
atomically $ do
n <- readTVar (bytesIn w)
let n' = n + BS.length bs
modifyTVar' (bytes w) (<> bs)
writeTVar (bytesIn w) n'
modifyTVar' (sendStamps w) (Seq.|> (n', now))
-- inside processSingleWanton, after deciding to dispatch m bytes
onDispatch :: Tracer IO BearerTrace -> MiniProtocolNum -> Wanton -> Int -> IO ()
onDispatch tr mpn w m = do
!now <- getMonotonicTime
fired <- atomically $ do
outBefore <- readTVar (bytesOut w)
let outAfter = outBefore + m
writeTVar (bytesOut w) outAfter
stamps <- readTVar (sendStamps w)
let (done, kept) = Seq.spanl (\(end, _) -> end <= outAfter) stamps
writeTVar (sendStamps w) kept
return done
mapM_ (\(_, t0) -> traceWith tr (TraceEgressSendFinished mpn (now `diffTime` t0)))
(toList fired)The Seq is bounded by the number of in-flight sends — small in
practice (mini-protocols pipeline a handful of outstanding messages
and the buffer holds at most a few SDUs' worth).
Same shape as (B), but record the cumulative start offset and pop
when bytesOut crosses it.
A single timestamp per Wanton, updated on transitions empty↔non-empty:
data Wanton = Wanton
{ bytes :: TVar ByteString
, oldestEnqueued :: TVar (Maybe Time)
}
send :: Wanton -> ByteString -> IO ()
send w bs = do
!now <- getMonotonicTime
atomically $ do
isEmpty <- BS.null <$> readTVar (bytes w)
when isEmpty (writeTVar (oldestEnqueued w) (Just now))
modifyTVar' (bytes w) (<> bs)
onDispatch :: Tracer IO BearerTrace -> MiniProtocolNum -> Wanton -> IO ()
onDispatch tr mpn w = do
!now <- getMonotonicTime
age <- atomically $ do
bs <- readTVar (bytes w)
case bs of
_ | BS.null bs -> do
writeTVar (oldestEnqueued w) Nothing
return Nothing
_ -> readTVar (oldestEnqueued w)
forM_ age $ \t0 ->
traceWith tr (TraceEgressOldestByteAge mpn (now `diffTime` t0))After a partial dispatch leaves bytes behind, the "oldest" timestamp stays put — those leftover bytes are still old. That's exactly the head-of-line story the metric is trying to surface.
Start with (C): one timestamp per Wanton, two cheap STM ops per
send and per dispatch, no Seq bookkeeping, and the metric it
produces is directly actionable for dashboards. Graduate to (B)
only if you need per-send resolution later. Skip (A) unless
there's a specific reason — first-byte latency is rarely what you
want for backpressure.
For each SDU shipped, time between BearerTrace's TraceSendStart
and TraceSendEnd — the kernel-handoff time for that SDU. The SDU
header carries MiniProtocolNum, giving a per-protocol
distribution.
The two metrics sound similar but probe adjacent layers; both are worth tracking, neither subsumes the other.
- Oldest-byte age (queueing latency): how long bytes have been sitting in the per-protocol queue before being dispatched. One sample per dispatch.
-
Bearer occupancy (transmission latency): how long the kernel
sendsyscall took to accept one SDU. One sample per SDU.
Either can be high without the other:
- Bearer-occupancy high, oldest-byte age low. Kernel send is slow but the protocol is producing data sparsely; each SDU is handled before the next arrives. The bearer is having trouble but data is not queueing.
- Bearer-occupancy low, oldest-byte age high. Kernel send is fast per SDU, but the application is producing data faster than the bearer can ship it, or the fairness scheduler is starving this protocol. Data piles up in the Wanton.
- Both low — healthy.
- Both high — classic congestion.
Roughly:
oldest_byte_age(P) ≈
(# SDUs ahead of head in P's Wanton)
× (avg bearer occupancy)
+ (fairness wait while bearer serves other protocols)
Bearer-occupancy is a rate-like local measurement; oldest-byte age is the accumulated latency that integrates the bearer's behaviour over the queue's contents.
Pair them per protocol:
- A bearer-occupancy p95 / p99 chart catches kernel / network trouble per SDU.
- An oldest-byte-age chart catches the resulting queueing.
- Together they distinguish:
- "bearer is slow but we're keeping up" (bearer p99 high, age low) → intermittent network jitter;
- "bearer is fine but starvation" (bearer p99 low, age high) → fairness / producer-burst story;
- "everything congested" (both high) → upstream pressure, alert-worthy.
-
Local kernel handoff, not network transit. For end-to-end
transit, use the receiver's
TraceRecvDeltaQObservation(which uses the SDU header'sremoteTimestamp). - SDU-size confound. Larger SDUs take longer on the same socket. When cross-comparing protocols with different typical SDU sizes, normalise by length (record ns/byte distributions) before drawing conclusions.
Both metrics are useful, but they answer different questions:
-
Per-
send-call latency (at the mux / channel boundary) — "from when a layer above mux handed bytes off, how long until those bytes were on the wire?" Captures transport health: bearer congestion, head-of-line blocking, per-protocol fairness. - Per-CBOR-message latency (at the codec / typed-protocols driver) — "from when the protocol decided to send message M, how long until M was fully encoded and shipped?" Captures application-visible latency: includes encoding cost, includes protocol-level synchronisation, then includes everything mux-level latency would.
The mux number is a strict component of the codec number.
-
Semantically meaningful unit. "ChainSync
MsgRollForwardtook 8 ms" is more directly actionable than "send-call latency 8 ms". -
Per-message-type breakdown. Once at message granularity, you
can keep one rolling window per
(protocol, message-tag)pair and see e.g. "MsgRollForwardis fine butMsgRollBackwardshot up". - Stable across implementation changes. Codec or buffering changes shift per-send numbers; per-message numbers don't.
- Encoding cost is visible. A codec regression shows up.
-
Wiring is more invasive. Either instrument each codec or hook
the typed-protocols
Driver(a shared layer currently agnostic of timing). The latter is one wiring change instead of N. - No direct lever on mux. If per-message latency is bad, you need the mux-level numbers anyway to localise.
- Loses byte-level granularity. For backpressure analysis you want bytes.
- Doesn't measure the bearer specifically. Includes protocol-level time the bearer can't influence.
- Duplicates some of what KeepAlive already gives at a higher level.
Layer by question:
- "Is the bearer keeping up?" → mux-level (oldest-byte age, per-protocol throughput, bearer share). Cheap, uniform across protocols.
- "Is a specific protocol healthy?" → per-message latency at the typed-protocols driver, broken down by message tag. Worth the wiring cost for protocols with per-message SLOs (ChainSync, BlockFetch).
- "Are peers responsive end-to-end?" → existing KeepAlive RTT. Don't reinvent.
Per-message adds to per-send; it doesn't replace it. If you go
per-message, do it at the typed-protocols Driver — one wiring
change instead of N.
A consumer should:
-
Build the window map once on connection setup — a
Map MiniProtocolNum ThroughputWin(and similar per metric), pre-populated with one empty window per mini-protocol active on the connection. -
Hold it in a
StrictTVar(io-classes' strict variant) so the trace handler can update without thunk accumulation. -
Install the tracer as part of mux's tracer composition.
contramapto extract the events of interest (e.g.,ChannelTracefor throughput,BearerTracefor bearer occupancy) and route into the consumer's update function. -
Snapshot for publishing on a separate timer (typically 1 Hz or 0.1 Hz for dashboards): read the
TVaroutside any STM transaction, run the read-side functions (bytesPerSecond,windowQuantileetc.), push to the metrics exporter (EKG, Prometheus, etc.). -
Tear down on connection close — drop the map. There's nothing to flush; the windows are pure values.
The asymmetry between writers (the trace path, lots of fast updates)
and readers (the publisher, occasional snapshots) makes a single
StrictTVar per connection adequate up to the window counts in the
"Cost considerations" section. If contention becomes visible,
either switch to one TVar per (peer, metric) or move to a
StrictMVar-per-bucket scheme.
A reasonable first round:
-
Implement Per-protocol throughput. This needs no new traces, has the most direct dashboard value, and exercises the io-classes integration end-to-end.
-
Benchmark in isolation. Build a synthetic trace-event generator firing at target rates into the consumer. Measure CPU, RSS, and GC pressure with and without the windows wired up.
-
Add the bearer-occupancy distribution. Also no new traces; reuses the throughput consumer skeleton.
-
Decide on the queue-depth and oldest-byte-age traces. These are the most valuable additions to mux itself (items 1 and 2 here), but they require library changes. Cleanest if done as a focused follow-up PR.