You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Three identical sweeps against develop quantify the runner-noise
envelope. Same workload, same code, same workflow → p95 at step 350
ranged 39-122ms on baseline (3.1x spread). At step 300, 1.9x spread.
What this means for prior conclusions in this doc:
- websocket-only-is-worst HOLDS at the cliff: its envelope min (2463)
equals baseline's max (2463), envelopes don't overlap. Single
contradicting run was an outlier.
- lever-3 (#7768) "70% p95 drop at 200" was a single-run outlier
comparison. The real reliable improvement is ~5-15% median p95
plus much tighter consistency (fewer tail-latency excursions).
The mechanism — per-socket serialization preventing overlapping
fan-outs that contend for CPU — is still real and still worth
merging; the headline number was inflated.
- below the cliff, all four levers' noise envelopes overlap. No
clear winner.
Going forward: lever scoring should default to N >= 3 trials and
report min/median/max, not single-run point estimates.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy file name to clipboardExpand all lines: docs/scaling-dive-2026-05.md
+19-3Lines changed: 19 additions & 3 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -166,11 +166,27 @@ The harness-side forward-compat patch ([ether/etherpad-load-test#106](https://gi
166
166
167
167
### Methodology caveat surfaced during lever 8 scoring
168
168
169
-
The same run that confirmed lever 8 didn't help also showed `websocket-only` as the **best** lever — directly contradicting every prior dive in this doc. The cause is that**each matrix entry runs as a separate GitHub Actions job on a potentially different physical runner**. Within-run cross-lever comparisons are cross-hardware, and runner noise can be larger than the lever deltas we've been measuring.
169
+
The same run that confirmed lever 8 didn't help also showed `websocket-only` as the **best** lever — directly contradicting every prior dive in this doc. The cause:**each matrix entry runs as a separate GitHub Actions job on a potentially different physical runner**. Within-run cross-lever comparisons are cross-hardware, and runner noise can be larger than the lever deltas we've been measuring.
170
170
171
-
Strong conclusions in this doc that depend on single dive runs should be **re-validated with N ≥ 3 trials per lever**. The lever-3 (#7768) finding holds up because the histogram-bucket evidence (apply percentile distribution) is consistent across multiple measurements and the mechanism (overlapping fan-outs starving the apply path) was confirmed via histogram data, not just a single p95 row. The lever-4 (websocket-only) "always-worse" conclusion is now suspect — it might be runner-noise dominated.
171
+
To quantify the noise envelope, three identical sweeps were run against `develop` ([25954537767](https://github.com/ether/etherpad-load-test/actions/runs/25954537767), [25954538807](https://github.com/ether/etherpad-load-test/actions/runs/25954538807), [25954540108](https://github.com/ether/etherpad-load-test/actions/runs/25954540108)). p95 across the three runs at each step:
172
172
173
-
Filing this as a sequel investigation: **before strong-recommendation calls on any new lever, run 3× and treat per-lever p95 as a noise envelope, not a point estimate.** A new dive run [25954537767/25954538807/25954540108](https://github.com/ether/etherpad-load-test/actions) is doing exactly that against develop — three identical sweeps — to quantify the noise envelope.
-**Below the cliff, noise dominates.** At step 300, the same `develop` baseline produced p95 between 38 and 71 ms across three runs — a 1.9× spread. At step 350, 3.1× spread. Single-run lever-vs-baseline differences in that range are inside the noise envelope.
183
+
-**At the cliff (step 400), `websocket-only` is reliably the worst.** Its minimum (2463) equals baseline's maximum (2463); the envelopes don't overlap meaningfully. Confirms the original "ws-only is worse under load" conclusion. The single contradicting run was an outlier.
184
+
-**`new-changes-batch` shows the tightest envelope at step 200.** 32/35/38 vs baseline 30/37/51. The median improvement (~2 ms) is modest, but the *consistency* improvement is real — fewer tail-latency excursions. Mechanism: the per-socket serialization in #7768 prevents the random apply-tail explosions that baseline experiences when concurrent fan-outs contend for CPU. **Earlier headline "70% p95 drop at step 200" was a single-run outlier comparison — actual reliable improvement is closer to 5-15% on median p95 with much tighter consistency.**
185
+
-**`new-changes-batch` shows a 607 ms outlier at step 350.** Worth a second look but doesn't repeat across runs — likely a flake.
186
+
187
+
The lever-3 (#7768) finding still stands but **for a different reason than originally claimed**: not a dramatic p95 reduction, but improved consistency + the correctness benefit of preventing overlapping fan-outs on the same socket. The per-socket serialization is a real correctness fix; the NEW_CHANGES_BATCH framing is currently latent (it would fire under server slowness).
188
+
189
+
**Going forward, lever scoring should default to N ≥ 3 trials and report min/median/max, not single-run point estimates.**
0 commit comments