Skip to content

Network performance statistics collection

Marcin Wójtowicz edited this page Jul 2, 2026 · 1 revision

Scope

This document summarizes an early exploration of an idea of how to obtain interesting per-peer statistics for dashboard reporting as well as exposing some of it to the miniprotocols themselves to facilitate their logic in a more sophisticated ways. A new package window-stats is currently in development to support this approach, and the rest of the document discusses how to instrument/probe the code in specific places to collect the necessary data in an online fashion, what are the various tradeoffs and/or alternatives.

window-stats

window-stats is a new package for efficient sliding-window statistics, backed by a finger tree, with a cached user-supplied measure that updates incrementally. This package can be leveraged in ouroboros-network to compute such metrics of interest like delays, round trip times, thruput, etc. for monitoring purposes as well as exposing a subset of this data to the miniprotocols for use in their decision logic.

Tracking Mux connection metrics with window-stats

A working document for instrumenting network-mux connection-level metrics with the window-stats sliding-window machinery, intended to feed dashboards (Grafana etc.).

What network-mux already gives you

Looking at Network.Mux.Trace and Network.Mux.Types:

BearerTrace (high-frequency, low-level):

  • TraceRecvRaw Int, TraceRecvStart Int, TraceRecvEnd Int — bytes per SDU read.
  • TraceSendStart SDUHeader, TraceSendEnd — SDU sends (header carries protocol num, direction, length, remote timestamp).
  • TraceRecvDeltaQObservation, TraceRecvDeltaQSample — DeltaQ samples are already computed in the muxer: sample duration, packet count, mean/variance of the delta-Q distribution, and an estimated rate.
  • TraceTCPInfo StructTCPInfo (Linux): RTT, RTT variance, cwnd, lost, retrans, MSS, etc., straight from the kernel.
  • TraceSDUReadTimeoutException, TraceSDUWriteTimeoutException.

ChannelTrace (per mini-protocol):

  • TraceChannelRecvStart MiniProtocolNum, TraceChannelRecvEnd MiniProtocolNum Int.
  • TraceChannelSendStart MiniProtocolNum Int, TraceChannelSendEnd MiniProtocolNum.

Trace (high-level lifecycle): mini-protocol start/exit, mux state transitions.

That's enough source data to derive most useful per-connection metrics by post-processing in the trace consumer. Timestamps come either from values embedded in the event itself (e.g., the SDU header's remoteTimestamp) or by sampling getMonotonicTime at the consumer when the event arrives — with the caveat that consumer-side timestamping is fine for throughput but adds jitter to latency measurements (see Cost considerations).

Useful per-connection metrics

Rough priority order for a cardano node:

  1. Per-protocol throughput (bytes/sec, SDUs/sec, in and out). Derivable today from TraceChannelSendStart / TraceChannelRecvEnd and their Int length fields. For "on-the-wire" throughput, cross-reference the SDU sizes in BearerTrace's TraceSendStart header. Recipe in Per-protocol throughput below.

  2. Per-protocol latency distribution, split by direction:

    • Inbound, channel-level: time between TraceChannelRecvStart and TraceChannelRecvEnd. The muxer's portion of receive latency — useful for spotting slow deserialisers.
    • Outbound, bearer-level: time between BearerTrace's TraceSendStart and TraceSendEnd. The kernel-handoff time for one SDU; the SDU header carries the MiniProtocolNum so this is naturally per-protocol. See Bearer-occupancy distribution.

    The channel-level send pair (TraceChannelSendStart / TraceChannelSendEnd) is not the right measurement for outbound latency: it brackets a Wanton STM write and doesn't reflect the bearer.

    Both distributions are good candidates for a TimedWindow Time (TDigest 100) (DigestSample 100) per protocol.

  3. Bearer share / fairness: bytes shipped per protocol over a rolling window, normalised by total. Reveals starvation.

  4. DeltaQ summary. TraceRecvDeltaQSample is already a periodic summary; just window the latest few. The muxer pre-computes the heavy lifting.

  5. TCP-layer signals (Linux): rolling mean/std of tcpi_rtt and rates of tcpi_retrans / tcpi_lost. Cheap to compute, often the earliest indicator of network trouble.

  6. Error rates: count of TraceSDU{Read,Write}TimeoutException, TraceExceptionExit per protocol, over the window.

  7. Mini-protocol uptime / restart rate: derivable from TraceCleanExit / TraceExceptionExit. A sliding window doesn't add much here; a plain counter does the job.

  8. Aggregate bearer throughput for the whole connection — useful as a sanity check against the sum of per-protocol byte counts.

Measurements blocked on missing traces

The following can't be derived from the current traces because the relevant events aren't emitted:

  1. Egress queue depth per mini-protocol. The muxer's egress side peels from per-protocol Wantons; queue length / bytes pending is the headline backpressure indicator. Suggested trace: TraceEgressQueueBytes MiniProtocolNum Int, sampled at each dispatch or periodically.

  2. Oldest-byte age in the egress queue. How long bytes have been sitting in the per-protocol queue before being dispatched — the user-visible head-of-line latency. Suggested trace: TraceEgressOldestByteAge MiniProtocolNum DiffTime per dispatch. See Time-in-queue and oldest-byte-age.

  3. Ingress dispatch latency. Symmetric to (2) but on the receive side: time the demuxer placed an SDU in the destination protocol's ingress queue → time the protocol picked it up.

  4. Per-protocol fairness counter (egress slots awarded). Today you can infer this from TraceSendStart (header carries the protocol num), so this is almost there — usable as-is.

  5. Bearer wait time — how long the egress thread blocked waiting for any protocol to have data. Distinguishes "bearer idle because no traffic" from "bearer saturated".

(1) and (2) are the most valuable additions; (3) is symmetric. (4) and (5) are nice-to-haves. The recipe for (2) in particular hinges on the Wanton model — see the dedicated section below before considering an implementation.

Cost considerations

Window-count budget

For a node with ~80 inbound + ~50 outbound peers × 5 mini-protocols × ~4 metrics = ~2600 windows. At ~32 KB per 1024-sample window with a Welford measure, that's ~80 MB resident — significant but tractable. TDigest-backed windows are smaller (the digest is O(δ), not O(w)) but each insert costs O(δ log w).

Per-event cost

  • Pick Time, not UTCTime for the timestamp type. Smaller per-sample footprint, no clock-jump correctness concerns, and consistent with the rest of ouroboros-network.

  • Avoid Welford for high-cardinality samples that don't need variance. If you only want mean and rate, a Sum-measure window is markedly cheaper per insert than a WelfordMeasure one (Welford's <> does several mul/div per node along the spine).

  • Sample, don't track every event for low-importance metrics. If you care about p99 RTT, you don't need every TCPInfo trace — a 1 Hz sample feeds a 5-minute window of 300 entries and is still informative.

Where the windowing logic should live

At the trace consumer, never on the trace-emit path. Adding new state to mux for stamping purposes (e.g., the oldestEnqueued :: TVar (Maybe Time) for the oldest-byte-age recipe below) is fine and sometimes necessary — but the sliding-window machinery itself (the FT inserts, the rolling-measure recomputations) belongs at the consumer. Putting it in the trace path lengthens every traceWith call.

Consumer-side vs emit-site timestamping

  • Capture-at-consumer is fine for throughput, rate counters, and anything that's a count or ratio over a window much larger than scheduler jitter.
  • Capture-at-emit is needed for latency distributions: latency is a difference of two timestamps; a small constant consumer delay cancels, but its variance doesn't. For sub-millisecond latencies the consumer-side jitter is the same order as the signal. Embed timestamps in the trace event or wrap the tracer with a stamping combinator.

Recipes

The snippets below are in IO for brevity. In real mux code, use the io-classes style — MonadSTM m, MonadMonotonicTime m — consistent with the surrounding codebase. getMonotonicTime is deliberately called outside atomically to avoid blocking IO inside an STM transaction.

Per-protocol throughput

The pitfall to avoid

The intuitive recipe goes:

rate = windowSum w `div` windowDuration w

The trap: windowDuration w is the span from the oldest to the newest sample currently in the window — not the configured duration. Three failure modes:

  1. Sparse traffic. Two samples 100 ms apart in a 60 s window: windowDuration returns 100 ms, the rate looks hundreds of times higher than the actual sustained rate.
  2. Single-sample window. windowDuration is 0 / Nothing; division by zero.
  3. Cold start. No samples, windowDuration is Nothing.

The metric you actually want is "bytes per second averaged over the last N seconds, treating quiet periods as zero". The denominator should be the configured duration, not the observed span.

The proper recipe

import Control.Monad.Class.MonadTime.SI (Time, DiffTime, getMonotonicTime)
import Data.Window.Timed
import Data.Monoid (Sum (..))

-- For very long windows / very fast links, switch to Int64 / Integer
-- to avoid Int overflow (a 10 Gbit link over 60 s carries ~75 GB,
-- which still fits in 64-bit Int).
type ThroughputWin = TimedWindow Time (Sum Int) (SumSample Int)

newThroughputWin :: DiffTime -> ThroughputWin
newThroughputWin = empty

recordBytes :: Time -> Int -> ThroughputWin -> ThroughputWin
recordBytes t bytes = insert (t, bytes)

-- | Bytes per second averaged over the configured window. Returns
-- 'Nothing' until the window has actually spanned its configured
-- duration (i.e. it's warmed up).
bytesPerSecond :: ThroughputWin -> Maybe Double
bytesPerSecond w = do
  observed <- windowDuration w
  let configured :: DiffTime
      configured = windowMaxDuration w
  if observed < configured
    then Nothing
    else do
      total <- windowSum w
      return (fromIntegral total / diffTimeToSeconds configured)
  where
    diffTimeToSeconds = fromRational . toRational
  • Denominator is windowMaxDuration w, not windowDuration w.
  • Warm-up handling. Returning Nothing until observed >= configured keeps the first N seconds out of the dashboard rather than producing a misleading low spike.

Wiring it to mux traces

sendThroughputTracer
  :: TVar IO (Map MiniProtocolNum ThroughputWin)
  -> ChannelTrace
  -> IO ()
sendThroughputTracer ref (TraceChannelSendStart mpn len) = do
  !now <- getMonotonicTime
  atomically $ modifyTVar' ref $
    Map.adjust (recordBytes now len) mpn
sendThroughputTracer _ _ = return ()

For receive, use TraceChannelRecvEnd mpn len. When publishing to the dashboard, snapshot the TVar and run bytesPerSecond on each window.

Pitfalls

  • Don't divide by the observed span. Tempting when warm-up Nothings look awkward on the dashboard — but the alternative is wildly wrong rates during low-traffic periods.
  • TraceChannelSendStart lengths are payload-only. They exclude mux SDU headers. For wire throughput, cross-reference BearerTrace.TraceSendStart SDU header lengths.
  • Choose the configured duration thoughtfully. Too short (1 s) gives noisy readings and short warm-up; too long (60 s) smooths over interesting transients. For most dashboards 5–15 s is a sensible default. Run multiple parallel windows (short for spike detection, long for trend).
  • Aggregating across protocols. Don't naively sum per-protocol bytesPerSecond values — they may be in different warm-up states. Either maintain a separate aggregate window, or sum only warmed-up readings.

A with-foldl refinement

import Data.Window.Fold.Timed (windowFoldRollingFull, windowMeasureFinal)
import Data.Monoid (Sum (..))
import qualified Control.Foldl as L

diffTimeToSeconds :: DiffTime -> Double
diffTimeToSeconds = fromRational . toRational

-- final total bytes at end of input (not a rate)
totalBytesFinal :: DiffTime -> L.Fold (Time, Int) (Sum Int)
totalBytesFinal = windowMeasureFinal

-- streamed rolling rates (Bytes/sec), warm-up gated
rollingRate :: DiffTime -> L.Fold (Time, Int) [Double]
rollingRate d =
  let perSecond (Sum total) =
        fromIntegral total / diffTimeToSeconds d
  in  fmap perSecond <$> windowFoldRollingFull d L.list

windowFoldRollingFull embeds the warm-up gate, matching the bytesPerSecond semantics above.

Time-in-queue and oldest-byte-age

The egress side is not a queue of messages: it's a Wanton, i.e. a TVar ByteString (wrapped in a newtype). send appends bytes to the buffer; processSingleWanton later peels off chunks of up to ~12 kB (one SDU) and round-robins between protocols. A single send may be fragmented across several SDU dispatches; multiple sends may coalesce in the buffer with no gap. The mux has no notion of "message boundary" — to it, each protocol is a continuous byte stream:

  app calls Channel.send
   ↓
  bytes appended to Wanton           ← head-of-line ageing starts here
   ↓
  egress thread peels off an SDU
   ↓
  TraceSendStart                       ← bearer-occupancy starts ticking
   ↓
  kernel send() syscall
   ↓
  TraceSendEnd                         ← bearer-occupancy stops

This rules out the naive "wrap each enqueued element with a timestamp" approach. We need to be clearer about what we want to measure.

Candidate metrics

  • (A) First-byte latency per send call — time from send returning to the first byte of that call's bytes hitting the wire.
  • (B) Last-byte latency per send call — time from send returning to the last byte hitting the wire. The most user-visible: it's when the receiver can start parsing.
  • (C) Oldest-byte age signal — one timestamp per Wanton: the arrival time of the oldest currently-unsent byte. Reported per dispatch.

Implementing (B): last-byte latency per send

You don't need to thread the timestamp through the buffer. Keep a sidecar Seq of (cumulativeEndOffset, sendTime) plus running counters:

data Wanton = Wanton
  { bytes      :: TVar ByteString
  , sendStamps :: TVar (Seq (Int, Time))  -- (cumulative end offset, send time)
  , bytesIn    :: TVar Int                -- cumulative bytes ever enqueued
  , bytesOut   :: TVar Int                -- cumulative bytes ever dispatched
  }

send :: Wanton -> ByteString -> IO ()
send w bs = do
  !now <- getMonotonicTime   -- IO outside atomically by design
  atomically $ do
    n <- readTVar (bytesIn w)
    let n' = n + BS.length bs
    modifyTVar' (bytes      w) (<> bs)
    writeTVar    (bytesIn    w) n'
    modifyTVar' (sendStamps w) (Seq.|> (n', now))

-- inside processSingleWanton, after deciding to dispatch m bytes
onDispatch :: Tracer IO BearerTrace -> MiniProtocolNum -> Wanton -> Int -> IO ()
onDispatch tr mpn w m = do
  !now <- getMonotonicTime
  fired <- atomically $ do
    outBefore <- readTVar (bytesOut w)
    let outAfter = outBefore + m
    writeTVar (bytesOut w) outAfter
    stamps <- readTVar (sendStamps w)
    let (done, kept) = Seq.spanl (\(end, _) -> end <= outAfter) stamps
    writeTVar (sendStamps w) kept
    return done
  mapM_ (\(_, t0) -> traceWith tr (TraceEgressSendFinished mpn (now `diffTime` t0)))
        (toList fired)

The Seq is bounded by the number of in-flight sends — small in practice (mini-protocols pipeline a handful of outstanding messages and the buffer holds at most a few SDUs' worth).

Implementing (A): first-byte latency per send

Same shape as (B), but record the cumulative start offset and pop when bytesOut crosses it.

Implementing (C): cheap oldest-byte age

A single timestamp per Wanton, updated on transitions empty↔non-empty:

data Wanton = Wanton
  { bytes          :: TVar ByteString
  , oldestEnqueued :: TVar (Maybe Time)
  }

send :: Wanton -> ByteString -> IO ()
send w bs = do
  !now <- getMonotonicTime
  atomically $ do
    isEmpty <- BS.null <$> readTVar (bytes w)
    when isEmpty (writeTVar (oldestEnqueued w) (Just now))
    modifyTVar' (bytes w) (<> bs)

onDispatch :: Tracer IO BearerTrace -> MiniProtocolNum -> Wanton -> IO ()
onDispatch tr mpn w = do
  !now <- getMonotonicTime
  age  <- atomically $ do
    bs <- readTVar (bytes w)
    case bs of
      _ | BS.null bs -> do
        writeTVar (oldestEnqueued w) Nothing
        return Nothing
      _ -> readTVar (oldestEnqueued w)
  forM_ age $ \t0 ->
    traceWith tr (TraceEgressOldestByteAge mpn (now `diffTime` t0))

After a partial dispatch leaves bytes behind, the "oldest" timestamp stays put — those leftover bytes are still old. That's exactly the head-of-line story the metric is trying to surface.

Recommendation

Start with (C): one timestamp per Wanton, two cheap STM ops per send and per dispatch, no Seq bookkeeping, and the metric it produces is directly actionable for dashboards. Graduate to (B) only if you need per-send resolution later. Skip (A) unless there's a specific reason — first-byte latency is rarely what you want for backpressure.

Bearer-occupancy distribution

For each SDU shipped, time between BearerTrace's TraceSendStart and TraceSendEnd — the kernel-handoff time for that SDU. The SDU header carries MiniProtocolNum, giving a per-protocol distribution.

How it relates to oldest-byte age

The two metrics sound similar but probe adjacent layers; both are worth tracking, neither subsumes the other.

  • Oldest-byte age (queueing latency): how long bytes have been sitting in the per-protocol queue before being dispatched. One sample per dispatch.
  • Bearer occupancy (transmission latency): how long the kernel send syscall took to accept one SDU. One sample per SDU.

Either can be high without the other:

  • Bearer-occupancy high, oldest-byte age low. Kernel send is slow but the protocol is producing data sparsely; each SDU is handled before the next arrives. The bearer is having trouble but data is not queueing.
  • Bearer-occupancy low, oldest-byte age high. Kernel send is fast per SDU, but the application is producing data faster than the bearer can ship it, or the fairness scheduler is starving this protocol. Data piles up in the Wanton.
  • Both low — healthy.
  • Both high — classic congestion.

Roughly:

oldest_byte_age(P) ≈
    (# SDUs ahead of head in P's Wanton)
      × (avg bearer occupancy)
    + (fairness wait while bearer serves other protocols)

Bearer-occupancy is a rate-like local measurement; oldest-byte age is the accumulated latency that integrates the bearer's behaviour over the queue's contents.

Dashboard layout

Pair them per protocol:

  • A bearer-occupancy p95 / p99 chart catches kernel / network trouble per SDU.
  • An oldest-byte-age chart catches the resulting queueing.
  • Together they distinguish:
    • "bearer is slow but we're keeping up" (bearer p99 high, age low) → intermittent network jitter;
    • "bearer is fine but starvation" (bearer p99 low, age high) → fairness / producer-burst story;
    • "everything congested" (both high) → upstream pressure, alert-worthy.

Caveats

  • Local kernel handoff, not network transit. For end-to-end transit, use the receiver's TraceRecvDeltaQObservation (which uses the SDU header's remoteTimestamp).
  • SDU-size confound. Larger SDUs take longer on the same socket. When cross-comparing protocols with different typical SDU sizes, normalise by length (record ns/byte distributions) before drawing conclusions.

Per-CBOR-message vs per-send latency

Both metrics are useful, but they answer different questions:

  • Per-send-call latency (at the mux / channel boundary) — "from when a layer above mux handed bytes off, how long until those bytes were on the wire?" Captures transport health: bearer congestion, head-of-line blocking, per-protocol fairness.
  • Per-CBOR-message latency (at the codec / typed-protocols driver) — "from when the protocol decided to send message M, how long until M was fully encoded and shipped?" Captures application-visible latency: includes encoding cost, includes protocol-level synchronisation, then includes everything mux-level latency would.

The mux number is a strict component of the codec number.

When per-CBOR-message earns its keep

  • Semantically meaningful unit. "ChainSync MsgRollForward took 8 ms" is more directly actionable than "send-call latency 8 ms".
  • Per-message-type breakdown. Once at message granularity, you can keep one rolling window per (protocol, message-tag) pair and see e.g. "MsgRollForward is fine but MsgRollBackward shot up".
  • Stable across implementation changes. Codec or buffering changes shift per-send numbers; per-message numbers don't.
  • Encoding cost is visible. A codec regression shows up.

Trade-offs

  • Wiring is more invasive. Either instrument each codec or hook the typed-protocols Driver (a shared layer currently agnostic of timing). The latter is one wiring change instead of N.
  • No direct lever on mux. If per-message latency is bad, you need the mux-level numbers anyway to localise.
  • Loses byte-level granularity. For backpressure analysis you want bytes.
  • Doesn't measure the bearer specifically. Includes protocol-level time the bearer can't influence.
  • Duplicates some of what KeepAlive already gives at a higher level.

Practical recommendation

Layer by question:

  • "Is the bearer keeping up?" → mux-level (oldest-byte age, per-protocol throughput, bearer share). Cheap, uniform across protocols.
  • "Is a specific protocol healthy?" → per-message latency at the typed-protocols driver, broken down by message tag. Worth the wiring cost for protocols with per-message SLOs (ChainSync, BlockFetch).
  • "Are peers responsive end-to-end?" → existing KeepAlive RTT. Don't reinvent.

Per-message adds to per-send; it doesn't replace it. If you go per-message, do it at the typed-protocols Driver — one wiring change instead of N.

Wiring it up at process start

A consumer should:

  1. Build the window map once on connection setup — a Map MiniProtocolNum ThroughputWin (and similar per metric), pre-populated with one empty window per mini-protocol active on the connection.

  2. Hold it in a StrictTVar (io-classes' strict variant) so the trace handler can update without thunk accumulation.

  3. Install the tracer as part of mux's tracer composition. contramap to extract the events of interest (e.g., ChannelTrace for throughput, BearerTrace for bearer occupancy) and route into the consumer's update function.

  4. Snapshot for publishing on a separate timer (typically 1 Hz or 0.1 Hz for dashboards): read the TVar outside any STM transaction, run the read-side functions (bytesPerSecond, windowQuantile etc.), push to the metrics exporter (EKG, Prometheus, etc.).

  5. Tear down on connection close — drop the map. There's nothing to flush; the windows are pure values.

The asymmetry between writers (the trace path, lots of fast updates) and readers (the publisher, occasional snapshots) makes a single StrictTVar per connection adequate up to the window counts in the "Cost considerations" section. If contention becomes visible, either switch to one TVar per (peer, metric) or move to a StrictMVar-per-bucket scheme.

Next steps

A reasonable first round:

  1. Implement Per-protocol throughput. This needs no new traces, has the most direct dashboard value, and exercises the io-classes integration end-to-end.

  2. Benchmark in isolation. Build a synthetic trace-event generator firing at target rates into the consumer. Measure CPU, RSS, and GC pressure with and without the windows wired up.

  3. Add the bearer-occupancy distribution. Also no new traces; reuses the throughput consumer skeleton.

  4. Decide on the queue-depth and oldest-byte-age traces. These are the most valuable additions to mux itself (items 1 and 2 here), but they require library changes. Cleanest if done as a focused follow-up PR.

Clone this wiki locally