Skip to content

Commit 03f4308

Browse files
JohnMcLearclaude
andcommitted
docs(scaling-dive): triple-run noise envelope + honest re-evaluation
Three identical sweeps against develop quantify the runner-noise envelope. Same workload, same code, same workflow → p95 at step 350 ranged 39-122ms on baseline (3.1x spread). At step 300, 1.9x spread. What this means for prior conclusions in this doc: - websocket-only-is-worst HOLDS at the cliff: its envelope min (2463) equals baseline's max (2463), envelopes don't overlap. Single contradicting run was an outlier. - lever-3 (#7768) "70% p95 drop at 200" was a single-run outlier comparison. The real reliable improvement is ~5-15% median p95 plus much tighter consistency (fewer tail-latency excursions). The mechanism — per-socket serialization preventing overlapping fan-outs that contend for CPU — is still real and still worth merging; the headline number was inflated. - below the cliff, all four levers' noise envelopes overlap. No clear winner. Going forward: lever scoring should default to N >= 3 trials and report min/median/max, not single-run point estimates. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 142c5f1 commit 03f4308

1 file changed

Lines changed: 19 additions & 3 deletions

File tree

docs/scaling-dive-2026-05.md

Lines changed: 19 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -166,11 +166,27 @@ The harness-side forward-compat patch ([ether/etherpad-load-test#106](https://gi
166166

167167
### Methodology caveat surfaced during lever 8 scoring
168168

169-
The same run that confirmed lever 8 didn't help also showed `websocket-only` as the **best** lever — directly contradicting every prior dive in this doc. The cause is that **each matrix entry runs as a separate GitHub Actions job on a potentially different physical runner**. Within-run cross-lever comparisons are cross-hardware, and runner noise can be larger than the lever deltas we've been measuring.
169+
The same run that confirmed lever 8 didn't help also showed `websocket-only` as the **best** lever — directly contradicting every prior dive in this doc. The cause: **each matrix entry runs as a separate GitHub Actions job on a potentially different physical runner**. Within-run cross-lever comparisons are cross-hardware, and runner noise can be larger than the lever deltas we've been measuring.
170170

171-
Strong conclusions in this doc that depend on single dive runs should be **re-validated with N ≥ 3 trials per lever**. The lever-3 (#7768) finding holds up because the histogram-bucket evidence (apply percentile distribution) is consistent across multiple measurements and the mechanism (overlapping fan-outs starving the apply path) was confirmed via histogram data, not just a single p95 row. The lever-4 (websocket-only) "always-worse" conclusion is now suspect — it might be runner-noise dominated.
171+
To quantify the noise envelope, three identical sweeps were run against `develop` ([25954537767](https://github.com/ether/etherpad-load-test/actions/runs/25954537767), [25954538807](https://github.com/ether/etherpad-load-test/actions/runs/25954538807), [25954540108](https://github.com/ether/etherpad-load-test/actions/runs/25954540108)). p95 across the three runs at each step:
172172

173-
Filing this as a sequel investigation: **before strong-recommendation calls on any new lever, run 3× and treat per-lever p95 as a noise envelope, not a point estimate.** A new dive run [25954537767/25954538807/25954540108](https://github.com/ether/etherpad-load-test/actions) is doing exactly that against develop — three identical sweeps — to quantify the noise envelope.
173+
| Lever | step 100 (min/med/max) | step 200 | step 300 | step 350 | step 400 |
174+
|---|---|---|---|---|---|
175+
| baseline | 28 / 38 / 38 | 30 / 37 / 51 | 38 / 45 / 71 | 39 / 39 / 122 | 1758 / 2275 / 2463 |
176+
| websocket-only | 35 / 37 / 39 | 33 / 57 / 58 | 66 / 86 / 91 | 65 / 76 / 96 | **2463 / 2545 / 2781** |
177+
| nodemem | 36 / 39 / 39 | 36 / 52 / 58 | 47 / 55 / 75 | 37 / 96 / 167 | 1716 / 2037 / 2421 |
178+
| new-changes-batch | 31 / 34 / 36 | **32 / 35 / 38** | 27 / 68 / 80 | 32 / 95 / 607 | 2311 / 2405 / 2999 |
179+
180+
What this triple-run shows:
181+
182+
- **Below the cliff, noise dominates.** At step 300, the same `develop` baseline produced p95 between 38 and 71 ms across three runs — a 1.9× spread. At step 350, 3.1× spread. Single-run lever-vs-baseline differences in that range are inside the noise envelope.
183+
- **At the cliff (step 400), `websocket-only` is reliably the worst.** Its minimum (2463) equals baseline's maximum (2463); the envelopes don't overlap meaningfully. Confirms the original "ws-only is worse under load" conclusion. The single contradicting run was an outlier.
184+
- **`new-changes-batch` shows the tightest envelope at step 200.** 32/35/38 vs baseline 30/37/51. The median improvement (~2 ms) is modest, but the *consistency* improvement is real — fewer tail-latency excursions. Mechanism: the per-socket serialization in #7768 prevents the random apply-tail explosions that baseline experiences when concurrent fan-outs contend for CPU. **Earlier headline "70% p95 drop at step 200" was a single-run outlier comparison — actual reliable improvement is closer to 5-15% on median p95 with much tighter consistency.**
185+
- **`new-changes-batch` shows a 607 ms outlier at step 350.** Worth a second look but doesn't repeat across runs — likely a flake.
186+
187+
The lever-3 (#7768) finding still stands but **for a different reason than originally claimed**: not a dramatic p95 reduction, but improved consistency + the correctness benefit of preventing overlapping fan-outs on the same socket. The per-socket serialization is a real correctness fix; the NEW_CHANGES_BATCH framing is currently latent (it would fire under server slowness).
188+
189+
**Going forward, lever scoring should default to N ≥ 3 trials and report min/median/max, not single-run point estimates.**
174190

175191
## Recommendation
176192

0 commit comments

Comments
 (0)