Commit 2645da2

JohnMcLear and claude committed
docs: scaling dive 2026-05 — first numbers-backed answer to #7756
Phase 2 deliverable from the scaling-dive program. Documents:

- Methodology (harness commit, runner shape, sweep specs, decision rules)
- Baseline curve at authors=20..200 against develop HEAD
- Per-lever scoring (perMessageDeflate deferred, nodemem no-effect, websocket-only refuted, raw ws not pursued)
- Recommendation: prototype fan-out batching as the next lever (the data identifies emits scaling O(N^2) as the dominant cost)

Closes Phase 2 of #7756. Phase 3 (batching prototype) is a separate feature branch the dive workflow will score.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 79f525b commit 2645da2

1 file changed

Lines changed: 133 additions & 0 deletions

docs/scaling-dive-2026-05.md

@@ -0,0 +1,133 @@
# Scaling dive — 2026-05
**Closes Phase 2 of #7756.** First numbers-backed answer to "how many editors can be on one pad, and what is the bottleneck when it falls over?"
## TL;DR
Two clean conclusions from three matrix entries (baseline, `nodemem`, `websocket-only`) run on the same GitHub-hosted `ubuntu-latest` runner shape:

1. **Server-side changeset apply is not the bottleneck.** Even at 200 concurrent authors, `etherpad_changeset_apply_duration_seconds` mean is ~3.7–4.4 ms, well under client-perceived p95 (~20–25 ms). The remaining latency lives in *fan-out*, not in *apply*.
2. **Dropping the socket.io polling fallback (`socketTransportProtocols: ["websocket"]`) makes things worse, not better, under high concurrency.** At 200 authors it nearly doubles client p95 (37 ms vs 20 ms baseline). The hypothesis that the polling fallback was costing us is falsified.

Raising the node heap (`--max-old-space-size=4096`) makes no measurable difference: memory is not where the cost lives.

Next step: prototype the **fan-out batching** lever (spec section 9 lever 3). Today `etherpad_socket_emits_total{type=NEW_CHANGES}` scales roughly O(N²): 1 160 emits per 10 s dwell at 20 authors grows to 66 032 emits at 200 authors, about 57× the emits for 10× the authors. Coalescing N changesets within a configurable window before broadcasting should attack that directly.
## Methodology
- **Harness:** [`ether/etherpad-load-test`](https://github.com/ether/etherpad-load-test) at the post-#100 main (sim/ library + `--sweep` mode + `/stats/prometheus` scraping + `apply_mean_ms` / `emits_new_changes` CSV columns).
- **Server-side instruments:** the three Prometheus counters added in #7762, enabled via `settings.scalingDiveMetrics=true`.
- **SUT:** etherpad core `develop` HEAD at the time of run.
- **Runner shape:** GitHub-hosted `ubuntu-latest` (4 vCPU, ~16 GB RAM). Same shape across all three matrix entries, so runner-induced noise is comparable between them.
- **Workflow:** [`.github/workflows/scaling-dive.yml`](https://github.com/ether/etherpad-load-test/blob/main/.github/workflows/scaling-dive.yml), manual `workflow_dispatch`. Two runs analysed:
  - **Run 25936626554**: default sweep `authors=10..80:step=10:dwell=15s:warmup=3s`.
  - **Run 25936813657**: deeper sweep `authors=20..200:step=20:dwell=10s:warmup=2s` (used for the conclusions below).
### Decision rules (per spec section 6)
- p95 latency up *without* event-loop p99 up ⇒ network IO bound.
- p95 latency up *with* event-loop p99 up ⇒ server CPU / event-loop bound.
- p95 latency up *with* RSS climbing across steps ⇒ leak / backpressure.
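The three rules above can be read as a small classifier. The sketch below is illustrative only: the `StepSample` shape, the `UP_RATIO` threshold, and the precedence between rules are assumptions made for the example, not something the spec or the harness defines.

```typescript
// Hypothetical sample shape; field names are illustrative, not the harness's CSV schema.
interface StepSample {
  p95Ms: number;
  eventLoopP99Ms: number;
}

type Verdict = "no-regression" | "network-io" | "cpu-event-loop" | "leak-backpressure";

// Assumed threshold: a metric counts as "up" when it grows by at least 25%.
const UP_RATIO = 1.25;

// Compare the first and last step of a sweep. Whether RSS is still climbing
// is passed in by the caller, since judging a plateau needs the full RSS series.
function classify(first: StepSample, last: StepSample, rssStillClimbing: boolean): Verdict {
  const p95Up = last.p95Ms >= first.p95Ms * UP_RATIO;
  const eventLoopUp = last.eventLoopP99Ms >= first.eventLoopP99Ms * UP_RATIO;
  if (!p95Up) return "no-regression";
  // Assumed precedence: a leak signal wins over the event-loop signal.
  if (rssStillClimbing) return "leak-backpressure";
  if (eventLoopUp) return "cpu-event-loop";
  return "network-io";
}
```

With a p95 of 11 → 20 ms, an event-loop p99 of 11 → 12 ms, and RSS judged to have plateaued, `classify` returns `"network-io"`.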
## Baseline curve
The deep sweep on baseline (no levers, develop HEAD). Step is the number of concurrent authors; EL is the server event loop:

| Step (authors) | p50 (ms) | p95 (ms) | p99 (ms) | EL p99 (ms) | apply_mean | emits_NEW_CHANGES | cpu_user (s) |
|---:|---:|---:|---:|---:|---:|---:|---:|
| 20 | 9 | 11 | 12 | 11 | 4.84 ms | 1 160 | 2.4 |
| 40 | 8 | 11 | 12 | 12 | 4.62 ms | 3 520 | 4.0 |
| 60 | 8 | 11 | 13 | 12 | 4.63 ms | 7 040 | 6.3 |
| 80 | 10 | 17 | 19 | 12 | 5.18 ms | 11 780 | 9.5 |
| 100 | 8 | 16 | 18 | 11 | 5.08 ms | 17 668 | 13.0 |
| 120 | 5 | 12 | 16 | 11 | 4.55 ms | 24 793 | 17.5 |
| 140 | 3 | 8 | 11 | 11 | 3.96 ms | 33 088 | 22.8 |
| 160 | 4 | 9 | 11 | 11 | 3.62 ms | 42 563 | 29.0 |
| 180 | 5 | 16 | 20 | 12 | 3.56 ms | 54 112 | 36.5 |
| 200 | 7 | 20 | 25 | 12 | 3.67 ms | 66 032 | 44.0 |

Reading against the decision rules:

- p95 rises only modestly (11 → 20 ms across the range, with a dip to 8–9 ms at 140–160 authors) and never cliffs.
- Event-loop p99 stays flat at 11–12 ms. **Not event-loop bound.**
- RSS climbs from 393 MB → 651 MB but shows no leak shape (it plateaus around step 100).
- CPU is the headline: 200 authors burns 44 CPU-seconds in 10 s wall-clock, ~4.4 cores. The runner has 4 vCPU. We're saturating the CPU on fan-out work.

So per the decision rules: **network/CPU bound, but the work is fan-out, not apply.** The `apply_mean` stays low while emits grow roughly O(N²) with concurrency.
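One way to sanity-check the quadratic reading is to fit the exponent k in emits ≈ c · N^k from the baseline table with a log-log least-squares fit. The sketch below uses only the `emits_NEW_CHANGES` figures from the table; the function name is made up for the example. As a rough guide, the endpoint ratio alone (66 032 / 1 160 ≈ 57× for 10× the authors) corresponds to k ≈ 1.76, so the growth over this range is clearly super-linear and somewhat below a pure N².

```typescript
// Baseline rows copied from the table above: [authors, emits_NEW_CHANGES].
const baselineEmits: ReadonlyArray<readonly [number, number]> = [
  [20, 1160], [40, 3520], [60, 7040], [80, 11780], [100, 17668],
  [120, 24793], [140, 33088], [160, 42563], [180, 54112], [200, 66032],
];

// Least-squares slope of log(emits) against log(authors),
// i.e. the exponent k in emits ≈ c · N^k.
function scalingExponent(points: ReadonlyArray<readonly [number, number]>): number {
  const logs = points.map(([authors, emits]) => ({ x: Math.log(authors), y: Math.log(emits) }));
  const meanX = logs.reduce((acc, p) => acc + p.x, 0) / logs.length;
  const meanY = logs.reduce((acc, p) => acc + p.y, 0) / logs.length;
  let covariance = 0;
  let variance = 0;
  for (const p of logs) {
    covariance += (p.x - meanX) * (p.y - meanY);
    variance += (p.x - meanX) ** 2;
  }
  return covariance / variance;
}

console.log(scalingExponent(baselineEmits).toFixed(2));
```

Re-running the same fit on a batching branch gives a single number to compare: a lever that works should pull the exponent toward 1.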
## Lever 1 — perMessageDeflate
**Not run.** Verifying that core's socket.io setup plumbs `perMessageDeflate` through settings is itself a small core PR. Folded into the recommendation below.
## Lever 2 — `--max-old-space-size=4096` (NODE_OPTIONS)
Run as the `nodemem` matrix entry. Selected step-by-step diff vs baseline:

| Step (authors) | baseline p95 (ms) | nodemem p95 (ms) | Δ (ms) |
|---:|---:|---:|---:|
| 80 | 17 | 17 | 0 |
| 120 | 12 | 16 | +4 |
| 160 | 9 | 13 | +4 |
| 200 | 20 | 13 | -7 |

The deltas range from -7 to +4 ms with no consistent direction, which reads as run-to-run noise. RSS grows similarly. `apply_mean` and `emits_NEW_CHANGES` are essentially identical.

**Verdict: no measurable effect.** The user's hunch on the issue (memory is not the bottleneck) is confirmed. Don't recommend bumping the heap as a scaling lever.
## Lever 3 — fan-out batching
**Deferred.** Requires a code change in `PadMessageHandler.ts` (specifically the per-socket loop in `updatePadClients` and/or the broadcast emit at line 627). Recommended as the next concrete code change. The harness is ready to score it as soon as a candidate branch exists: point the workflow's `core_ref` input at the branch.

The `emits_NEW_CHANGES` column of the baseline curve table (CSV column `emits_new_changes`) is the direct measurement target. At 200 authors we're producing 66 032 emits per 10 s dwell. Halving the emit rate (by coalescing two changesets per emit on a sub-50 ms window) should directly reduce CPU.
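To make the coalescing idea concrete, here is a minimal sketch of a time-windowed buffer. It is not Etherpad code: `CoalescingEmitter`, its constructor parameters, and the batch-shaped `emit` callback are all hypothetical, and a real change would also need the client to accept several changesets in one message.

```typescript
// Hypothetical coalescing buffer; names and shapes are illustrative, not Etherpad's API.
type EmitBatch<T> = (batch: T[]) => void;

class CoalescingEmitter<T> {
  private readonly emit: EmitBatch<T>;
  private readonly windowMs: number;
  private readonly maxBatch: number;
  private pending: T[] = [];
  private timer: ReturnType<typeof setTimeout> | null = null;

  constructor(emit: EmitBatch<T>, windowMs: number, maxBatch: number) {
    this.emit = emit;
    this.windowMs = windowMs;
    this.maxBatch = maxBatch;
  }

  // Queue one item. The first item in an empty buffer arms the flush timer;
  // reaching maxBatch flushes immediately so latency stays bounded under load.
  push(item: T): void {
    this.pending.push(item);
    if (this.pending.length >= this.maxBatch) {
      this.flush();
    } else if (this.timer === null) {
      this.timer = setTimeout(() => this.flush(), this.windowMs);
    }
  }

  // Send everything buffered as a single emit and disarm the timer.
  flush(): void {
    if (this.timer !== null) {
      clearTimeout(this.timer);
      this.timer = null;
    }
    if (this.pending.length === 0) return;
    const batch = this.pending;
    this.pending = [];
    this.emit(batch);
  }
}
```

One instance per connected socket, with a window under 50 ms and `maxBatch` of 2, would correspond to the "two changesets per emit" case described above. The window trades added latency (at most `windowMs`) for fewer emits, so both p95 and `emits_NEW_CHANGES` need to be scored together.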
## Lever 4 — `socketTransportProtocols: ["websocket"]`
Run as the `websocket-only` matrix entry. Selected step-by-step diff vs baseline:

| Step (authors) | baseline p95 (ms) | websocket-only p95 (ms) | Δ (ms) | baseline apply_mean | ws-only apply_mean |
|---:|---:|---:|---:|---:|---:|
| 20 | 11 | 10 | -1 | 4.84 ms | 3.67 ms |
| 60 | 11 | 9 | -2 | 4.63 ms | 3.28 ms |
| 100 | 16 | 13 | -3 | 5.08 ms | 3.27 ms |
| 140 | 8 | 24 | **+16** | 3.96 ms | 5.13 ms |
| 180 | 16 | 35 | **+19** | 3.56 ms | 8.07 ms |
| 200 | 20 | 37 | **+17** | 3.67 ms | 8.77 ms |

Up to 100 authors, websocket-only is a modest win (-1 to -3 ms p95). Above 120 authors it degrades sharply: p95 rises by 16–19 ms (roughly 2–3× baseline), apply_mean more than doubles at 180 and 200 authors, and event-loop p99 jumps from 12 → 17 ms. The websocket-only path also produced a single 271 ms tail max at step 40, likely a handshake stall, but worth confirming with more runs.

**Verdict: do not recommend dropping the polling fallback.** The cost of forcing all clients onto websocket compounds with concurrency. This was a legitimate hypothesis from issue #7756 (thread #1) that the dive *refutes*.
## Lever 5 — raw `ws` (drop socket.io entirely)
**Not pursued.** Lever 4 showed that the transport choice within socket.io already cuts against expectation: dropping the polling fallback hurts. Ripping socket.io out entirely is high blast radius and the dive gives no signal that it would help. Defer indefinitely.
## Recommendation
In priority order:

1. **Prototype fan-out batching** (lever 3). The dive identifies fan-out as the single dominant cost. Coalescing changesets within a sub-50 ms window inside `updatePadClients` is the highest-leverage code change. Open a feature branch in core; the harness scores it via `workflow_dispatch` with `core_ref` pointing at the branch.
2. **Verify and run lever 1** (`perMessageDeflate`). Even if compression has overhead at low concurrency, at 200 authors the emit *bytes* are likely the second-order cost behind emit *count* (bytes are not yet measured; see item 5). Worth scoring once lever 3 is in.
3. **Do not merge lever 4.** Keep `socketTransportProtocols: ["websocket", "polling"]` as the default.
4. **Do not merge lever 2.** No effect.
5. **Add core counters for fan-out byte size** as a small follow-up to #7762. A histogram of changeset bytes per emit would make lever 1 scorable without instrumenting client-side.
## Reproducing
```
# Trigger a dive run against any core ref.
gh workflow run "Scaling dive" --repo ether/etherpad-load-test \
  -f core_ref=develop \
  -f sweep='authors=20..200:step=20:dwell=10s:warmup=2s'

# Fetch artifacts.
gh run download <RUN_ID> --repo ether/etherpad-load-test
```

Per-lever CSV / JSON / MD artifacts drop in `scaling-dive-{baseline,websocket-only,nodemem}/`. The CSV is plot-ready; the JSON has the full per-step `Snapshot.gauges`.
## Out of scope (sequel issues worth filing)
- The `apply_mean` calculation uses `histogram._sum / histogram._count` for a simple mean. A proper p99 from the bucket distribution would require parsing `_bucket{le=...}` rows in the harness. Worth adding to the Scraper if lever 3 scoring needs it.
- The websocket-only step-40 spike (271 ms max) needs a second run to confirm it isn't a flake.
- The harness sweep stops short of producing a *cliff*: even 200 authors didn't trip the breakage thresholds. A "big cluster" dive (multi-host harness) is the natural sequel but is explicitly out of scope per spec section 9.
- Re-run with the same methodology after every batching-prototype iteration to track progress numerically.
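For the first item, a bucket-based quantile could look roughly like the sketch below. It assumes the `_bucket{le=...}` rows have already been parsed into cumulative `{le, count}` pairs and interpolates linearly within the matching bucket, in the style of PromQL's `histogram_quantile`. The `Bucket` type and function name are made up for the example, and the harness's Scraper may end up structured differently.

```typescript
// One cumulative bucket as exposed in the Prometheus text format:
// `count` is the number of observations less than or equal to `le`.
interface Bucket {
  le: number; // upper bound in seconds; Infinity for the +Inf bucket
  count: number; // cumulative count
}

// Estimate quantile q (0..1) by linear interpolation inside the bucket
// that contains the target rank. Returns NaN for an empty histogram.
function histogramQuantile(q: number, buckets: Bucket[]): number {
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  const last = sorted[sorted.length - 1];
  if (last === undefined || last.count === 0) return NaN;
  const rank = q * last.count;
  let prevLe = 0; // the lowest bucket is assumed to start at 0
  let prevCount = 0;
  for (const bucket of sorted) {
    if (bucket.count >= rank) {
      // Cannot interpolate into +Inf; report the highest finite bound instead.
      if (bucket.le === Infinity) return prevLe;
      const inBucket = bucket.count - prevCount;
      if (inBucket === 0) return bucket.le;
      return prevLe + (bucket.le - prevLe) * ((rank - prevCount) / inBucket);
    }
    prevLe = bucket.le;
    prevCount = bucket.count;
  }
  return prevLe;
}
```

Because the exported counts are cumulative since process start, a per-step p99 would need the difference between the scrape at the end of a step and the scrape at its start, bucket by bucket, before calling the function.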
