Commit 23b8c4e

Add statistical significance narrative to coefficient section
Explains why MWU testing was added (PhD teammate asked "is the difference real?"), how check-significance.sh works (per-window p99, ~30 samples, p < 0.05), and the honest caveat that per-window samples aren't fully uncorrelated. Distinguishes clearly between what MWU covers (latency delta realness) and what the coefficient derivation doesn't (n=4, no significance test, untested across message sizes).

Assisted-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
1 parent 0787ef6 commit 23b8c4e

1 file changed

Lines changed: 6 additions & 0 deletions

_posts/2026-05-28-benchmarking-the-proxy-under-the-hood.md

@@ -150,6 +150,12 @@ Measured: 9.7 mc per MB/s of total proxy traffic (±6.6 stdev, n=4 non-saturated
→ for 1:1 produce:consume at 1 KB: 20 mc per MB/s of produce throughput
```

I was proudly showing off some early numbers (baseline vs proxy, looking good) when one of the computer science PhDs on the team asked, "Is the difference real?" The best answer I could come up with at the time was "Good question." So I went and added statistical significance testing.

`check-significance.sh` runs a Mann-Whitney U (MWU) test with a significance threshold of p < 0.05, comparing per-window p99 latency samples between baseline and candidate at each rate step. OMB slices the test phase into time windows and records a p99 per window, about 30 samples per 5-minute run, so MWU has enough data to distinguish real signal from noise. It's not perfect: those per-window samples aren't entirely uncorrelated, because a GC pause can affect several adjacent windows at once. But it gives a principled answer to "is this overhead real, or am I chasing noise?"
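To make the test concrete, here is a minimal standard-library sketch of a two-sided Mann-Whitney U comparison on two lists of per-window p99 values. It is not the code inside `check-significance.sh`, whose implementation isn't shown here. The function names, the normal approximation with tie and continuity corrections, and the sample values are all assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def mann_whitney_u(baseline, candidate):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (u, p) where u counts the (baseline, candidate) pairs in which
    the baseline value is larger, with ties counted as one half.
    """
    n1, n2 = len(baseline), len(candidate)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    n = n1 + n2

    # Pool the samples, remembering which group each value came from.
    pooled = sorted([(v, 0) for v in baseline] + [(v, 1) for v in candidate])

    rank_sum_baseline = 0.0
    tie_term = 0  # sum of t^3 - t over groups of t tied values
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        # Positions i..j-1 hold ranks i+1..j; tied values share the average.
        avg_rank = (i + 1 + j) / 2
        tied = j - i
        tie_term += tied ** 3 - tied
        in_baseline = sum(1 for k in range(i, j) if pooled[k][1] == 0)
        rank_sum_baseline += avg_rank * in_baseline
        i = j

    u = rank_sum_baseline - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:  # every value identical: no evidence of a difference
        return u, 1.0

    # Continuity-corrected z score, then a two-sided p-value.
    z = max(abs(u - mean_u) - 0.5, 0.0) / math.sqrt(var_u)
    p = min(1.0, 2 * (1 - NormalDist().cdf(z)))
    return u, p


def is_significant(baseline, candidate, alpha=0.05):
    """True when the two samples differ at the given threshold."""
    _, p = mann_whitney_u(baseline, candidate)
    return p < alpha


# Made-up per-window p99 values in ms, for illustration only.
baseline_p99 = [4.1, 4.3, 3.9, 4.0, 4.4, 4.2, 4.1, 4.5]
candidate_p99 = [4.9, 5.2, 4.8, 5.0, 5.3, 4.7, 5.1, 5.4]
print(is_significant(baseline_p99, candidate_p99))
```

MWU compares ranks rather than raw values, so it needs no assumption that latency is normally distributed. It does assume independent samples, which is exactly where the correlated-windows caveat applies.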

The coefficient is a different matter. It's derived from JFR CPU data across n=4 non-saturated probes; the ±6.6 stdev reflects measurement noise, not a tested confidence interval. It holds at 1, 2, and 4 cores, which is consistent with the linear scaling claim, but its validity across message sizes or workload shapes is untested.
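To see why a stdev over four probes is weaker than a confidence interval, here is a rough sketch of what a 95% interval would look like for the reported figures. It assumes the 6.6 is a sample standard deviation over four independent, roughly normal measurements, which the post does not establish. `t_interval` is a hypothetical helper, and 3.182 is the standard two-sided 95% critical value of Student's t with 3 degrees of freedom.

```python
import math


def t_interval(mean, stdev, n, t_crit):
    """Confidence interval for a mean from a sample stdev and a t critical value."""
    half_width = t_crit * stdev / math.sqrt(n)
    return mean - half_width, mean + half_width


# Two-sided 95% critical value of Student's t with n - 1 = 3 degrees of freedom.
T_CRIT_95_DF3 = 3.182

# 9.7 and 6.6 are the reported coefficient and stdev; n=4 probes.
low, high = t_interval(9.7, 6.6, 4, T_CRIT_95_DF3)
print(f"{low:.1f} to {high:.1f} mc per MB/s")
```

Under those assumptions the half-width comes out at about 10.5 mc per MB/s, wider than the coefficient itself. That is one way to read the caution that the number is a working estimate, not a tested result.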

The mechanism: `cpu: 1000m → availableProcessors()=1 → one Netty event loop thread`. At 4000m that's four threads, each handling its share of connections in parallel. If the ceiling scales linearly with thread count, a 4-core pod should handle roughly four times the throughput of a 1-core pod. We ran it.

| CPU limit | Rate | p99 | Verdict |
