| 1 | +# Scaling dive — 2026-05 |
| 2 | + |
| 3 | +**Closes Phase 2 of #7756.** First numbers-backed answer to "how many editors can be on one pad, and what is the bottleneck when it falls over?" |
| 4 | + |
| 5 | +## TL;DR |
| 6 | + |
| 7 | +Two clean conclusions from three matrix entries (`baseline`, `nodemem`, `websocket-only`) on the same GitHub-hosted `ubuntu-latest` runner shape: |
| 8 | + |
| 9 | +1. **Server-side changeset apply is not the bottleneck.** Even at 200 concurrent authors, `etherpad_changeset_apply_duration_seconds` mean is ~3.7–4.4 ms — well under client-perceived p95 (~20–25 ms). The remaining latency lives in *fan-out*, not in *apply*. |
| 10 | +2. **Dropping the socket.io polling fallback (`socketTransportProtocols: ["websocket"]`) makes things worse, not better, under high concurrency.** At 200 authors it nearly doubles client p95 (37 ms vs 20 ms baseline). The hypothesis that the polling fallback was costing us is falsified. |
| 11 | + |
| 12 | +Raising the node heap (`--max-old-space-size=4096`) makes no measurable difference — memory is not where the cost lives. |
| 13 | + |
| 14 | +Next step: prototype the **fan-out batching** lever (spec section 9 lever 3). Today `etherpad_socket_emits_total{type=NEW_CHANGES}` scales roughly O(N²): 1 160 emits per 10 s dwell at 20 authors grows to 66 032 emits at 200 authors, about 57× the emits for 10× the authors. Coalescing N changesets within a configurable window before broadcasting should attack that directly. |
| 15 | + |
| 16 | +## Methodology |
| 17 | + |
| 18 | +- **Harness:** [`ether/etherpad-load-test`](https://github.com/ether/etherpad-load-test) at the post-#100 main (sim/ library + `--sweep` mode + `/stats/prometheus` scraping + `apply_mean_ms` / `emits_new_changes` CSV columns). |
| 19 | +- **Server-side instruments:** the three Prometheus counters added in #7762, enabled via `settings.scalingDiveMetrics=true`. |
| 20 | +- **SUT:** etherpad core `develop` HEAD at the time of run. |
| 21 | +- **Runner shape:** GitHub-hosted `ubuntu-latest` (4 vCPU, ~16 GB RAM). Same shape across all three matrix entries, so noise is constant. |
| 22 | +- **Workflow:** [`.github/workflows/scaling-dive.yml`](https://github.com/ether/etherpad-load-test/blob/main/.github/workflows/scaling-dive.yml), manual `workflow_dispatch`. Two runs analysed: |
| 23 | + - **Run 25936626554** — default sweep `authors=10..80:step=10:dwell=15s:warmup=3s`. |
| 24 | + - **Run 25936813657** — deeper sweep `authors=20..200:step=20:dwell=10s:warmup=2s` (used for the conclusions below). |
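The sweep strings above are colon-separated `key=value` fields. To make the expansion into steps concrete, here is a minimal sketch. `Sweep` and `parse_sweep` are hypothetical names, and the grammar is inferred from the two examples only, not taken from the harness source.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Sweep:
    steps: list[int]   # author counts visited, in order
    dwell_s: float     # measurement time per step, seconds
    warmup_s: float    # settle time before each measurement, seconds


def parse_sweep(spec: str) -> Sweep:
    """Parse e.g. 'authors=20..200:step=20:dwell=10s:warmup=2s' (assumed grammar)."""
    fields = dict(part.split("=", 1) for part in spec.split(":"))
    lo, hi = (int(x) for x in fields["authors"].split(".."))
    step = int(fields["step"])
    if step <= 0 or lo > hi:
        raise ValueError(f"invalid sweep range in {spec!r}")
    return Sweep(
        steps=list(range(lo, hi + 1, step)),        # inclusive upper bound
        dwell_s=float(fields["dwell"].rstrip("s")),
        warmup_s=float(fields["warmup"].rstrip("s")),
    )


deep = parse_sweep("authors=20..200:step=20:dwell=10s:warmup=2s")
default = parse_sweep("authors=10..80:step=10:dwell=15s:warmup=3s")
```

Under this reading the deep sweep visits ten steps and the default sweep eight, which matches the ten rows of the baseline table.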
| 25 | + |
| 26 | +### Decision rules (per spec section 6) |
| 27 | + |
| 28 | +- p95 latency up *without* event-loop p99 up ⇒ network IO bound. |
| 29 | +- p95 latency up *with* event-loop p99 up ⇒ server CPU / event-loop bound. |
| 30 | +- p95 latency up *with* RSS climbing across steps ⇒ leak / backpressure. |
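The three rules above can be written as a small classifier. This is a sketch only: `classify_bottleneck` is a hypothetical helper, and the precedence when several signals fire at once (RSS first, then event loop) is an assumption, since the rules themselves do not rank them.

```python
def classify_bottleneck(p95_up: bool, event_loop_p99_up: bool, rss_climbing: bool) -> str:
    """Map the three observed signals to a bottleneck label."""
    if not p95_up:
        return "no regression"
    if rss_climbing:                 # assumed to take precedence
        return "leak / backpressure"
    if event_loop_p99_up:
        return "server CPU / event-loop bound"
    return "network IO bound"


# Example: p95 rises, event-loop p99 is flat, RSS plateaus.
verdict = classify_bottleneck(p95_up=True, event_loop_p99_up=False, rss_climbing=False)
```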
| 31 | + |
| 32 | +## Baseline curve |
| 33 | + |
| 34 | +The deep sweep on baseline (no levers, develop HEAD): |
| 35 | + |
| 36 | +| Step (authors) | p50 (ms) | p95 (ms) | p99 (ms) | EL p99 (ms) | apply_mean | emits_NEW_CHANGES | cpu_user (s) |
| 37 | +|---:|---:|---:|---:|---:|---:|---:|---:| |
| 38 | +| 20 | 9 | 11 | 12 | 11 | 4.84 ms | 1 160 | 2.4 | |
| 39 | +| 40 | 8 | 11 | 12 | 12 | 4.62 ms | 3 520 | 4.0 | |
| 40 | +| 60 | 8 | 11 | 13 | 12 | 4.63 ms | 7 040 | 6.3 | |
| 41 | +| 80 | 10 | 17 | 19 | 12 | 5.18 ms | 11 780 | 9.5 | |
| 42 | +| 100 | 8 | 16 | 18 | 11 | 5.08 ms | 17 668 | 13.0 | |
| 43 | +| 120 | 5 | 12 | 16 | 11 | 4.55 ms | 24 793 | 17.5 | |
| 44 | +| 140 | 3 | 8 | 11 | 11 | 3.96 ms | 33 088 | 22.8 | |
| 45 | +| 160 | 4 | 9 | 11 | 11 | 3.62 ms | 42 563 | 29.0 | |
| 46 | +| 180 | 5 | 16 | 20 | 12 | 3.56 ms | 54 112 | 36.5 | |
| 47 | +| 200 | 7 | 20 | 25 | 12 | 3.67 ms | 66 032 | 44.0 | |
| 48 | + |
| 49 | +Reading against the decision rules: |
| 50 | + |
| 51 | +- p95 grows slowly (11 → 20 ms across the range), but doesn't cliff. |
| 52 | +- Event-loop p99 stays at 11–12 ms — flat. **Not event-loop bound.** |
| 53 | +- RSS climbs from 393 MB → 651 MB but no leak shape (it plateaus around step 100). |
| 54 | +- CPU is the headline: 200 authors burns 44 CPU-seconds in 10 s wall-clock — ~4.4 cores. The runner has 4 vCPU. We're saturating the CPU on fan-out work. |
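The cores figure in the last bullet is CPU-seconds divided by wall-clock seconds. The sketch below applies that to every row of the table. It assumes `cpu_user` covers only the 10 s dwell window; if the counter also spans warmup, the real figures are somewhat lower, which would explain a value above the runner's 4 vCPU. The 90% saturation threshold is an arbitrary illustration.

```python
DWELL_S = 10.0  # dwell per step in the deep sweep
VCPUS = 4       # runner shape from the Methodology section

# (authors, cpu_user seconds) copied from the baseline table.
CPU_USER = [(20, 2.4), (40, 4.0), (60, 6.3), (80, 9.5), (100, 13.0),
            (120, 17.5), (140, 22.8), (160, 29.0), (180, 36.5), (200, 44.0)]


def busy_cores(cpu_seconds: float, wall_seconds: float = DWELL_S) -> float:
    """Average number of cores kept busy over the window."""
    return cpu_seconds / wall_seconds


cores_by_step = {n: busy_cores(cpu) for n, cpu in CPU_USER}

# Steps that keep at least 90% of the runner's vCPUs busy (illustrative threshold).
saturated = [n for n, cores in cores_by_step.items() if cores >= 0.9 * VCPUS]
```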
| 55 | + |
| 56 | +So per the decision rules: **network/CPU bound, but the work is fan-out, not apply.** The `apply_mean` stays low while emits grow O(N²) with concurrency. |
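The growth claim can be checked directly against the table. The sketch below computes emits per author and the log-log slope between the first and last steps; it uses only the numbers already shown and no harness code.

```python
import math

# (authors, emits_NEW_CHANGES) copied from the baseline table.
EMITS = [(20, 1160), (40, 3520), (60, 7040), (80, 11780), (100, 17668),
         (120, 24793), (140, 33088), (160, 42563), (180, 54112), (200, 66032)]

# Emits per author: if this grows with N, total emits grow faster than linearly.
per_author = [emits / n for n, emits in EMITS]

# Log-log slope between the endpoints: 1.0 is linear, 2.0 is purely quadratic.
(n0, e0), (n1, e1) = EMITS[0], EMITS[-1]
exponent = math.log(e1 / e0) / math.log(n1 / n0)
```

Emits per author rise at every step, from 58 at 20 authors to about 330 at 200, and the endpoint exponent lands between 1.7 and 1.8. That is consistent with a quadratic term plus a smaller linear term.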
| 57 | + |
| 58 | +## Lever 1 — perMessageDeflate |
| 59 | + |
| 60 | +**Not run.** Verifying that core's socket.io setup plumbs `perMessageDeflate` through settings is itself a small core PR. Folded into the recommendation below. |
| 61 | + |
| 62 | +## Lever 2 — `--max-old-space-size=4096` (NODE_OPTIONS) |
| 63 | + |
| 64 | +Run as the `nodemem` matrix entry. Selected step-by-step diff vs baseline: |
| 65 | + |
| 66 | +| Step (authors) | baseline p95 (ms) | nodemem p95 (ms) | Δ (ms) |
| 67 | +|---:|---:|---:|---:| |
| 68 | +| 80 | 17 | 17 | 0 | |
| 69 | +| 120 | 12 | 16 | +4 | |
| 70 | +| 160 | 9 | 13 | +4 | |
| 71 | +| 200 | 20 | 13 | -7 | |
| 72 | + |
| 73 | +The deltas range from -7 to +4 ms with no consistent sign, which reads as run-to-run noise. RSS grows similarly. `apply_mean` and `emits_NEW_CHANGES` are essentially identical. |
| 74 | + |
| 75 | +**Verdict: no measurable effect.** The user's hunch on the issue (memory is not the bottleneck) is confirmed. Don't recommend bumping the heap as a scaling lever. |
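A quick way to see why the table reads as noise is to look at the sign and mean of the deltas. This sketch uses only the four rows shown; with so few samples it is a sanity check, not a statistical test.

```python
from statistics import mean

# (authors, baseline p95 ms, nodemem p95 ms) copied from the table above.
ROWS = [(80, 17, 17), (120, 12, 16), (160, 9, 13), (200, 20, 13)]

deltas = [lever - base for _, base, lever in ROWS]
mean_delta = mean(deltas)

# A real effect would be expected to push every step in the same direction.
same_sign = all(d > 0 for d in deltas) or all(d < 0 for d in deltas)
```

The deltas change sign across steps and average to about a quarter of a millisecond.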
| 76 | + |
| 77 | +## Lever 3 — fan-out batching |
| 78 | + |
| 79 | +**Deferred.** Requires a code change in `PadMessageHandler.ts` (specifically the per-socket loop in `updatePadClients` and/or the broadcast emit at line 627). Recommended as the next concrete code change. The harness is ready to score it as soon as a candidate branch exists — point the workflow's `core_ref` input at the branch. |
| 80 | + |
| 81 | +The `emits_new_changes` column on the curve table above is the direct measurement target. At 200 authors we're producing 66 032 emits per 10 s dwell. Halving the emit rate (by coalescing two changesets per emit on a sub-50 ms window) would directly reduce CPU. |
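To illustrate the coalescing idea, here is a minimal Python model of the batching logic. It is not the TypeScript change to `updatePadClients`: `coalesce` and `emit_count` are hypothetical names, timestamps are passed in explicitly instead of using timers, and composing the batched changesets into one payload is left out.

```python
from __future__ import annotations


def coalesce(events: list[tuple[float, str]], window_ms: float) -> list[list[str]]:
    """Group changesets into batches.

    A batch opens at its first changeset's timestamp and absorbs every later
    changeset arriving less than window_ms after that opening timestamp.
    `events` must be sorted by timestamp (milliseconds).
    """
    batches: list[list[str]] = []
    window_start = float("-inf")
    for t_ms, changeset in events:
        if not batches or t_ms - window_start >= window_ms:
            batches.append([])       # window expired: open a new batch
            window_start = t_ms
        batches[-1].append(changeset)
    return batches


def emit_count(batches: list[list[str]], recipients: int) -> int:
    """One emit per batch per connected recipient."""
    return len(batches) * recipients


events = [(0, "c1"), (10, "c2"), (40, "c3"), (60, "c4"), (65, "c5"), (200, "c6")]
batched = coalesce(events, window_ms=50)
unbatched = coalesce(events, window_ms=0)   # degenerate case: one batch per changeset
```

With three recipients, the six changesets cost 18 emits unbatched and 9 with a 50 ms window. The trade-off is that a changeset can wait up to one window before it is broadcast, so the window size has to stay well under the latency budget.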
| 82 | + |
| 83 | +## Lever 4 — `socketTransportProtocols: ["websocket"]` |
| 84 | + |
| 85 | +Run as the `websocket-only` matrix entry. Selected step-by-step diff vs baseline: |
| 86 | + |
| 87 | +| Step (authors) | baseline p95 (ms) | websocket-only p95 (ms) | Δ (ms) | baseline apply_mean | ws-only apply_mean |
| 88 | +|---:|---:|---:|---:|---:|---:| |
| 89 | +| 20 | 11 | 10 | -1 | 4.84 ms | 3.67 ms | |
| 90 | +| 60 | 11 | 9 | -2 | 4.63 ms | 3.28 ms | |
| 91 | +| 100 | 16 | 13 | -3 | 5.08 ms | 3.27 ms | |
| 92 | +| 140 | 8 | 24 | **+16** | 3.96 ms | 5.13 ms | |
| 93 | +| 180 | 16 | 35 | **+19** | 3.56 ms | 8.07 ms | |
| 94 | +| 200 | 20 | 37 | **+17** | 3.67 ms | 8.77 ms | |
| 95 | + |
| 96 | +Up to ~100 authors, websocket-only is a modest win (-1 to -3 ms p95). Above 120 authors it gets sharply worse: p95 rises by 16–19 ms (roughly 2–3× baseline), apply_mean more than doubles at 180 and 200 authors, and evloop_p99 jumps from 12 to 17 ms. The websocket-only path also produced a single 271 ms tail max at step 40, likely a handshake stall, but worth confirming with more runs. |
| 97 | + |
| 98 | +**Verdict: do not recommend dropping the polling fallback.** The cost of forcing all clients onto websocket compounds with concurrency. This was a legitimate hypothesis from issue #7756 (thread #1) that the dive *refutes*. |
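The crossover point and the size of the regression can be read off the table programmatically. This sketch covers only the six sampled steps shown above, so the true crossover may lie anywhere between 100 and 140 authors.

```python
# (authors, baseline p95 ms, websocket-only p95 ms) copied from the table above.
ROWS = [(20, 11, 10), (60, 11, 9), (100, 16, 13),
        (140, 8, 24), (180, 16, 35), (200, 20, 37)]

# First sampled step at which websocket-only is slower than baseline.
crossover = next(n for n, base, ws in ROWS if ws > base)

# Slowdown factor per step; values above 1.0 mean websocket-only is worse.
ratio = {n: ws / base for n, base, ws in ROWS}
```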
| 99 | + |
| 100 | +## Lever 5 — raw `ws` (drop socket.io entirely) |
| 101 | + |
| 102 | +**Not pursued.** Lever 4 showed that narrowing the transport choice within socket.io already backfires: dropping the polling fallback hurts. Ripping socket.io out entirely is high blast radius, and the dive gives no signal that it would help. Defer indefinitely. |
| 103 | + |
| 104 | +## Recommendation |
| 105 | + |
| 106 | +In priority order: |
| 107 | + |
| 108 | +1. **Prototype fan-out batching** (lever 3). The dive identifies fan-out as the single dominant cost. Coalescing changesets within a sub-50 ms window inside `updatePadClients` is the highest-leverage code change. Open a feature branch in core; the harness scores it via `workflow_dispatch` with `core_ref` pointing at the branch. |
| 109 | +2. **Verify and run lever 1** (`perMessageDeflate`). Even if compression has overhead at low concurrency, at 200 authors the emit *bytes* are plausibly the second-order cost behind emit *count*; the dive did not measure bytes directly. Worth scoring once lever 3 is in. |
| 110 | +3. **Do not merge lever 4.** Keep `socketTransportProtocols: ["websocket", "polling"]` as the default. |
| 111 | +4. **Do not merge lever 2.** No effect. |
| 112 | +5. **Add core counters for fan-out byte size** as a small follow-up to #7762. The histogram of changeset bytes per emit would make lever 1 scorable without instrumenting client-side. |
| 113 | + |
| 114 | +## Reproducing |
| 115 | + |
| 116 | +``` |
| 117 | +# Trigger a dive run against any core ref. |
| 118 | +gh workflow run "Scaling dive" --repo ether/etherpad-load-test \ |
| 119 | + -f core_ref=develop \ |
| 120 | + -f sweep='authors=20..200:step=20:dwell=10s:warmup=2s' |
| 121 | +
|
| 122 | +# Fetch artifacts. |
| 123 | +gh run download <RUN_ID> --repo ether/etherpad-load-test |
| 124 | +``` |
| 125 | + |
| 126 | +Per-lever CSV / JSON / MD artifacts drop in `scaling-dive-{baseline,websocket-only,nodemem}/`. The CSV is plot-ready; the JSON has the full per-step `Snapshot.gauges`. |
| 127 | + |
| 128 | +## Out of scope (sequel issues worth filing) |
| 129 | + |
| 130 | +- The `apply_mean` calculation uses `histogram._sum / histogram._count` for a simple mean. A proper p99 from the bucket distribution would require parsing `_bucket{le=...}` rows in the harness. Worth adding to the Scraper if lever 3 scoring needs it. |
| 131 | +- The websocket-only step-40 spike (271 ms max) needs a second run to confirm it isn't a flake. |
| 132 | +- The harness sweep stops short of producing a *cliff* — even 200 authors didn't trip the breakage thresholds. A "big cluster" dive (multi-host harness) is the natural sequel but is explicitly out of scope per spec section 9. |
| 133 | +- Re-run with the same methodology after every batching-prototype iteration to track progress numerically. |
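For the first item above, the sketch below shows both calculations: the simple mean from `_sum` and `_count`, and a quantile estimated from cumulative `_bucket{le=...}` counts by linear interpolation, the same idea PromQL's `histogram_quantile` uses. The function names and the bucket values are made up for illustration; they are not harness code or measured data.

```python
from __future__ import annotations

import math


def mean_ms(sum_seconds: float, count: int) -> float:
    """Mean latency in ms from a histogram's _sum and _count samples."""
    return 1000.0 * sum_seconds / count if count else math.nan


def quantile_seconds(buckets: list[tuple[float, float]], q: float) -> float:
    """Estimate quantile q from cumulative (le, count) buckets.

    `buckets` must be sorted by `le` and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le):
                return prev_le       # cannot interpolate into +Inf
            in_bucket = count - prev_count
            if in_bucket == 0:
                return le
            return prev_le + (le - prev_le) * (rank - prev_count) / in_bucket
        prev_le, prev_count = le, count
    return prev_le


# Illustrative cumulative buckets (seconds), NOT measured values.
BUCKETS = [(0.005, 60), (0.01, 90), (0.025, 99), (0.05, 100), (math.inf, 100)]
p99_s = quantile_seconds(BUCKETS, 0.99)
p50_s = quantile_seconds(BUCKETS, 0.50)
```

The estimate is only as fine as the bucket boundaries, so a p99 derived this way is bounded by the `le` values the server exports.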