## Summary
Two complementary RTT signals on a single chart immediately below the
bitrate chart, sharing the time axis with bitrate / buffer / FPS via
`getChartsForSession` so spikes line up visually.
**TCP_INFO RTT (#401)** — 100 ms ticker reads `getsockopt(TCP_INFO)` on
each session's most-recent connection, folds into a 1 s window drained
on every snapshot tick. Six metrics: `client_rtt_ms` (smoothed avg),
`_max` / `_min` (per-window peak/trough), `_min_lifetime` (kernel's
per-connection floor), `_var` (jitter), `_rto` (retransmit timeout). RTT
family on the left axis; RTO on the right axis (kernel default ≥200 ms,
can spike to seconds during a wedge — sharing one axis would flatten
RTT). Charted with a min/max envelope, dashed lifetime-min reference,
hidden-by-default RTO line. Wedge detection is the gap between `rtt` and
`rto`: RTO climbs while smoothed RTT flatlines because no fresh ACKs arrive.
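The fold-and-drain step described above can be sketched roughly as follows. This is a minimal illustration, not the go-proxy code: the `rttWindow` type, its field names, and the choice to skip zero readings are assumptions made for the example (the last one mirrors the "fold() zero-overwrite" fix noted in the test plan).

```go
package main

import (
	"fmt"
	"sync"
)

// rttWindow is a hypothetical accumulator for one snapshot window.
// All values are in milliseconds.
type rttWindow struct {
	mu      sync.Mutex
	n       int
	sum     float64
	lo, hi  float64
	lastRTO float64 // latest non-zero RTO reading
}

// fold merges one sample into the window. Zero readings are skipped so a
// failed or empty read cannot overwrite the latest good value.
func (w *rttWindow) fold(rttMs, rtoMs float64) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if rttMs > 0 {
		if w.n == 0 || rttMs < w.lo {
			w.lo = rttMs
		}
		if w.n == 0 || rttMs > w.hi {
			w.hi = rttMs
		}
		w.sum += rttMs
		w.n++
	}
	if rtoMs > 0 {
		w.lastRTO = rtoMs
	}
}

// drain returns the window average, min, max, and latest RTO, then resets
// the per-window counters. ok is false when no RTT sample was folded.
func (w *rttWindow) drain() (avg, lo, hi, rto float64, ok bool) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.n == 0 {
		return 0, 0, 0, w.lastRTO, false
	}
	avg, lo, hi, rto = w.sum/float64(w.n), w.lo, w.hi, w.lastRTO
	w.n, w.sum, w.lo, w.hi = 0, 0, 0, 0
	return avg, lo, hi, rto, true
}

func main() {
	var w rttWindow
	w.fold(2, 200)
	w.fold(4, 200)
	w.fold(0, 0) // failed read: ignored
	w.fold(6, 400)
	fmt.Println(w.drain())
}
```

Keeping the latest RTO outside the reset means a window with no fresh samples still reports the last known timeout instead of zero.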
**Out-of-band ICMP path ping (#404)** — 1 Hz ICMP echo from go-proxy →
`player_ip` via a shared `*icmp.PacketConn` with a receiver goroutine
demuxing by (id, seq). Independent of streaming throughput because ICMP
packets bypass the application's send queue. New
`client_path_ping_rtt_ms` field, cyan line on the same chart's left
axis. This is the line that **stays put when throttling kicks in while
TCP_INFO RTT climbs**; the gap between them is the queueing delay the app
is inducing on itself. When ICMP is filtered, the chart renders a gap
rather than misleading data.
Linux-only kernel reads (`getsockopt(TCP_INFO)`, raw ICMP). `!linux`
build tags compile macOS dev with stubs that emit zeros / disabled
samplers so `go build ./...` stays green.
Closes #401, closes #404.
## Test plan
- [x] `go vet` + `go build` clean for `go-proxy/...` and
`analytics/go-forwarder/...` on darwin and linux/amd64
- [x] `node --check` clean on `session-shell.js` and
`testing-session-ui.js`
- [x] Pre-commit Sonnet review pass — surfaced rttCharts cleanup leak
(fixed) + fold() zero-overwrite of latest-fields (fixed)
- [x] `make test-deploy-dev` clean: `rtt sampler started (100ms
cadence)` + `path ping sampler started (1Hz)` boot logs present, all 9
go-proxy ports listening, no fatals
- [x] `make analytics-migrate` applied seven `ADD COLUMN IF NOT EXISTS`
lines (six TCP_INFO + one path-ping); `DESCRIBE session_snapshots`
confirms all seven columns
- [x] RTT chart parity with bandwidth/buffer/FPS: in
`getChartsForSession`, `ensureChartLiveWheelAnchor`,
`refreshLegendHoverAll`, `chartsToDestroy`, event-marker pool.
Pause/zoom/pan inherit via DOM and shared zoom options.
- [x] Right-edge alignment: dual-Y chart uses y1 `afterFit` width — no
double-counting `layout.padding.right`
- [ ] LAN baseline: drive a session via `testing-session.html`, RTT
chart shows `client_rtt_ms < 1 ms`, `client_path_ping_rtt_ms` ≈ same,
`client_rtt_min_lifetime_ms` ≈ same, RTO at kernel default
- [ ] Throttle to 1 Mbps mid-stream: TCP_INFO lines climb (queueing);
path-ping line **stays at LAN baseline** — that gap is the bufferbloat
the chart was built to surface
- [ ] Add `tc netem delay 100ms` on top of throttle: BOTH lines step up
by ~100 ms (path itself got longer)
- [ ] Wedge: drop packets via fault-injection UI, toggle RTO visible —
RTO rises while RTT flatlines
- [ ] Replay an archived session in `session-viewer.html`: same chart
populated from ClickHouse
- [ ] Block ICMP outbound at host firewall: path-ping line shows gap;
TCP_INFO lines unaffected
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
README.md: 21 additions & 3 deletions
@@ -257,11 +257,11 @@ ATS-style transfer timeouts the proxy enforces against the *client* — useful f
 When a timeout fires the network-log waterfall renders the row with `!⏱` so you can tell it apart from a fault-injection cut (`!✂`) or a player abort (`!↩`).

-### Bitrate chart (with buffer depth and FPS)
+### Bitrate chart (with RTT, buffer depth, and FPS)

-The session card has a collapsible **Bitrate Chart** that stacks an events timeline + up to three time-series charts, all sharing a 10-minute rolling window and unified zoom/pan. Legend entries toggle series, scroll zooms, drag pans, `⏸` pauses live updates. The four panels share an x-axis so a bandwidth dip lines up visually with its buffer / FPS / variant-shift impact.
+The session card has a collapsible **Bitrate Chart** that stacks an events timeline + up to four time-series charts, all sharing a 10-minute rolling window and unified zoom/pan. Legend entries toggle series, scroll zooms, drag pans, `⏸` pauses live updates. The five panels share an x-axis so a bandwidth dip lines up visually with its RTT / buffer / FPS / variant-shift impact.

 - **Events timeline** *(top)* — swim-lane visualization of what the player and server are doing right now:
 - **PLAYER variants** — one lane per ladder rung (e.g. `1920×1080:7.1Mbps`, `1280×720:3.5Mbps`). Coloured blocks show which variant the player was on at each moment. The variant shift on a throughput collapse is visible as a downstep across lanes.
@@ -275,10 +275,14 @@ The session card has a collapsible **Bitrate Chart** that stacks an events timel
 - **Reference lines**: `Limit` (shaping ceiling, stepped when a pattern is active), `Server Rendition` (what the server believes it delivered), one line per ladder `Variant` (hidden by default).
 - **Events**: `STALL` and `RESTART` markers annotate player stalls and restarts.
 - **Y-axis**: `Auto` or fixed `5 / 10 / 20 / 30 / 40 / 50 / 100` Mbps — pin the scale when comparing two sessions side by side.
+- **RTT chart** — two independent round-trip-time signals on the same time axis (issues #401 / #404):
+- **TCP_INFO RTT family** (purple lines, left Y-axis) — what the streaming TCP connection actually experiences. Sampled inside go-proxy via `getsockopt(TCP_INFO)` every 100 ms (10 Hz), drained into 1 s windows: `RTT avg` (smoothed RFC 6298 SRTT), `RTT max` / `min` (per-window peak/trough envelope), `RTT lifetime min` (sticky path floor, dashed reference), `RTO` (right Y-axis, hidden by default — climbs above RTT during a wedge while smoothed RTT flatlines).
+- **Path ping** (cyan line, left Y-axis) — out-of-band ICMP echo from go-proxy → `player_ip` at 1 Hz, routed through a high-priority band inside the per-port shaping class. Sees the configured netem delay but jumps the bulk segment queue. The closest thing to "what the path could deliver if you weren't loading it." Zero / gap when ICMP is filtered.
+- **Why both?** The ping line is the network's contribution; the gap up to the TCP_INFO line is the application stack's contribution (queueing under throttle, delayed ACKs, receiver load). Together they decompose latency under shaping: rising netem moves both lines together; rising throttle inflates only TCP_INFO via bufferbloat. See [`analytics/README.md`](analytics/README.md#reading-the-rtt-chart-issues-401-404) for the deep interpretation guide and per-shaping-knob test recipes.
 - **Buffer depth chart** — player `buffered` TimeRanges (`player_metrics_buffer_depth_s`) on the left axis; **Wall-Clock Offset** (player playhead vs encoder PDT) on the right axis.
-The four panels' plot areas all align on the same right edge so vertical x-axis ticks line up across every chart and the events timeline above.
+The five panels' plot areas all align on the same right edge so vertical x-axis ticks line up across every chart and the events timeline above.
 ### Network log waterfall (HAR view)
@@ -545,6 +549,20 @@ Full API (`/api/content`, `/api/jobs`, `/api/sessions/*`, `/api/nftables/*`, etc
 | `mbps_transfer_rate` | 250 ms | Byte-change-gated rate during segment transfer, aligned to HTB burst edges. Reports at drain/refill boundaries |
 | `mbps_transfer_complete` | per segment | Total bytes / total time for one completed segment transfer (backlog drained to 0) |

+### Server-side RTT metrics
+
+Sampled inside go-proxy via `getsockopt(TCP_INFO)` on each session's most-recent connection. The 100 ms sampler folds reads into a 1 s window that drains on every snapshot tick — same cadence as the player-metrics PATCH heartbeat, so the RTT chart shares a time axis with the bitrate chart above it. Linux-only (the kernel option doesn't exist on macOS); the dev build compiles via a stub that emits zeros. All values in milliseconds.
+
+| Metric | Source field | What it measures |
+|---|---|---|
+| `client_rtt_ms` | `tcpi_rtt` (avg of 1 s window) | Smoothed RTT (RFC 6298 SRTT, kernel EWMA) |
+| `client_rtt_max_ms` | window max of `tcpi_rtt` | Peak smoothed RTT in window — catches sub-second spikes the kernel's EWMA would mask |
+| `client_rtt_min_ms` | window min of `tcpi_rtt` | Trough during the same 1 s window |
+| `client_rtt_min_lifetime_ms` | `tcpi_min_rtt` | Min RTT ever observed on this connection — the path floor |
+| `client_rtt_var_ms` | `tcpi_rttvar` | Smoothed mean deviation (jitter) |
+| `client_rto_ms` | `tcpi_rto` | Current retransmit timeout — rises during a wedge while smoothed RTT flatlines; the gap between `rto` and `rtt` is the canonical "kernel suspects this connection is stalling" signal |
+| `client_path_ping_rtt_ms` | ICMP echo, 1 Hz | **Out-of-band path latency** (issue #404). Independent of the streaming connection's queue contribution — TCP_INFO RTT inflates with throttle, but ICMP packets bypass the application's send queue, so this line stays at the LAN baseline when throttling kicks in. The vertical gap between this line and `client_rtt_ms` is the queueing delay the application is inducing on itself. Zero / gap when ICMP is filtered. |
+
 ### Metric semantics

 - **Limit value** (`nftables` shaping rate): configured ceiling for the session port; a control target, not a measured throughput.
+- **Reverse-path queuing** — the player's egress (ACKs going *back*)
+has its own tiny outbound queue. ICMP replies skip it.
+
+So `TCP_INFO − ping` on healthy unshaped LAN ≈ delayed-ACK + receiver
+load + reverse queueing. This is the network stack's overhead, not
+a fault.
+
+#### Expected behavior under shaping
+
+The two signals respond differently to the two shaping knobs (netem
+delay and HTB rate limit). Useful test recipes:
+
+| Action | TCP_INFO RTT | Path ping | Why |
+|---|---|---|---|
+| **No shaping** | `path + ACK overhead + receiver load` (typically 5–50 ms LAN) | `~path RTT` (sub-ms LAN) | Baseline. The ping line is the floor; TCP_INFO is everything else the stack adds. |
+| **Set netem delay = 25 ms** | rises by ~25 ms (mean) | rises by ~25 ms (mean) | Both packets traverse the same per-band netem inside the HTB class. Matched movement. Per-packet variance is ±5 % of mean (~1 ms stddev at 25 ms — see jitter note below). |
+| **Set throttle = 1 Mbps** (no netem) | climbs into bufferbloat range (often 100s of ms during downloads) | unchanged from baseline | Bulk segment data fills the HTB queue; each MTU waits for a rate token. The probe escapes via prio band 0 — at most one MTU's serialization (~12 ms at 1 Mbps for 1500 B). |
+| **Throttle + netem combined** | `path + netem + bufferbloat + ACK overhead` (compounded) | `~path + netem + at-most-one-MTU` | Effects stack additively on bulk data. The ping line shows you what's *just* the configured delay so you can subtract bufferbloat by eye. |
+| **Toggle shaping mid-stream** | step changes correlate visibly with bitrate / buffer drops on the chart above | flat through bandwidth changes; steps on netem changes only | Whole point of having both signals. Bitrate dropped because shaping was applied → both lines confirm in different ways. |
+| **Drop packets via fault-injection** | smoothed RTT eventually flatlines (no fresh ACKs); `client_rto_ms` climbs as kernel doubles its timeout | `0` / gap (echo replies dropped too) | RTO − RTT divergence is the canonical wedge indicator. |
+RTT (purple) ──flat─────────────────────────────── ← no ACKs, no fresh samples
+RTO (red) ───────⌐──┘─⌐──┘──⌐──┘──⌐────┘──── ← kernel doubling on each retry
+```
+
+The growing gap between the two is the unambiguous "kernel suspects
+this connection is stalling" signal. It appears within seconds of
+the wedge starting — much faster than a stall on the bitrate chart
+above (which only triggers after the player's buffer drains).
+
+What recovery looks like. Once ACKs start flowing again (fault
+cleared, retry succeeds), the kernel resets RTO to `SRTT + 4×RTTVAR`
+on the very next ACK. The red line snaps back down; the purple
+smoothed-RTT line resumes updating. So a recovered wedge is visible
+as a sawtooth-like RTO climb followed by an instant drop, with the
+RTT line resuming its normal track.
+
+Useful pairing. RTO + the path-ping line together disambiguate the
+wedge cause:
+
+| RTT line | RTO line | Path ping | What it means |
+|---|---|---|---|
+| flat | climbing | also gone (ICMP filtered/dropped) | Wedge or transport fault — entire path is dead from proxy's view. |
+| flat | climbing | still arriving normally | Wedge is TCP-specific — ICMP gets through but TCP is stuck. Likely middlebox dropping the connection or a broken player TCP stack, not a network outage. |
+| climbing slowly | tracking RTT (small multiple above) | climbing the same | Genuine path latency increase, not a wedge — RTO is just following healthy RTT growth. |
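The three rows above amount to a small decision table. One way to encode it, purely as an illustration (the `pingState` type, the `classifyWedge` name, and the boolean inputs are assumptions; the project may not do any automated classification):

```go
package main

import "fmt"

// pingState summarises what the path-ping line is doing.
type pingState int

const (
	pingGone pingState = iota
	pingNormal
	pingClimbing
)

// classifyWedge maps the three chart observations onto the table rows.
// rttFlat: smoothed RTT has stopped updating.
// rtoDiverging: RTO is pulling away from RTT rather than tracking it.
func classifyWedge(rttFlat, rtoDiverging bool, ping pingState) string {
	switch {
	case rttFlat && rtoDiverging && ping == pingGone:
		return "wedge or transport fault: path dead from the proxy's view"
	case rttFlat && rtoDiverging && ping == pingNormal:
		return "TCP-specific wedge: ICMP gets through, TCP is stuck"
	case !rttFlat && !rtoDiverging && ping == pingClimbing:
		return "genuine path latency increase, not a wedge"
	default:
		return "unclassified: inspect the chart manually"
	}
}

func main() {
	fmt.Println(classifyWedge(true, true, pingGone))
	fmt.Println(classifyWedge(true, true, pingNormal))
	fmt.Println(classifyWedge(false, false, pingClimbing))
}
```

Any combination outside the three rows falls through to the default, since the table does not assign it a meaning.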
+
 ## Securing a WAN-exposed deployment

 Default docker-compose binds ClickHouse to `127.0.0.1` (host-only) and