Skip to content

Commit 4d6eec7

Browse files
JohnMcLearclaude
andcommitted
docs(scaling-dive): close #7769 in the doc + update recommendations
N=3 scoring of feat/cache-historical-author-data shows it's net-negative above 300 authors (step 350 p95 envelope 301/488/633ms vs develop baseline 39/39/122ms). Two compounding issues: - The motivating hypothesis (250-cliff is a join thundering herd) was falsified — that cliff was the per-IP rate-limit artefact. - The defensive shallow-clone-on-every-get() added in the Qodo fix walks O(N) author entries per join, costing more than the inline Promise.all it replaced. Updated recommendations: lever 3 (#7768) is now the only PR worth merging. lever 6 (#7769) added to the do-not-merge list with honest data. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 03f4308 commit 4d6eec7

1 file changed

Lines changed: 28 additions & 12 deletions

File tree

docs/scaling-dive-2026-05.md

Lines changed: 28 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -136,13 +136,26 @@ Below ~100 authors, WS-only is a small win. Above 120, it's sharply worse — p9
136136

137137
**Not pursued.** Lever 4 already shows that the choice *within* socket.io is non-trivial. Ripping socket.io out is high blast radius and the dive shows no signal it would help. Deferred indefinitely.
138138

139-
### Lever 6 — `historicalAuthorData` cache (join hot path) — **open as [#7769](https://github.com/ether/etherpad/pull/7769)**
139+
### Lever 6 — `historicalAuthorData` cache (closed [#7769](https://github.com/ether/etherpad/pull/7769))
140140

141-
The pre-PR `handleClientReady` did `Promise.all(pad.getAllAuthors().map(authorManager.getAuthor))` on every CLIENT_READY. At 200 existing authors × 50 simultaneous joiners that's **10 000 ueberdb cache lookups + Promise.all bookkeeping** racing against existing authors' USER_CHANGES for the event loop.
141+
Hypothesis: `handleClientReady` does `Promise.all(pad.getAllAuthors().map(authorManager.getAuthor))` per CLIENT_READY. Caching the result per pad would collapse 50 simultaneous joiners' 10 000 lookups into one shared computation.
142142

143-
This PR caches the `{authorId → {name, colorId}}` map per pad with a 5-second TTL. 50 joiners share **one** computation. Defensive shallow-clone on every `get()` so callers may freely mutate. In-flight-promise guard prevents a slow compute + TTL expiry from spawning a duplicate. Missing-author log preserved.
143+
**Closed after N=3 scoring contradicted the hypothesis.** Comparison of develop baseline vs the cache PR, p95 envelope across 3 runs each:
144144

145-
**It does not move the dive cliff** — at 350-400 authors the bottleneck is steady-state CPU saturation, not join-path cost. **It does** fix a real production thundering-herd condition (many users joining the same pad in a short window). Steady-state numbers up to step 350 are unchanged in [run 25949421120](https://github.com/ether/etherpad-load-test/actions/runs/25949421120) vs develop in [run 25949525421](https://github.com/ether/etherpad-load-test/actions/runs/25949525421).
145+
| Step | develop | cache PR | verdict |
146+
|---:|---|---|---|
147+
| 200 | 30 / 37 / 51 | 29 / 38 / 65 | within noise |
148+
| 300 | 38 / 45 / 71 | 39 / 93 / 240 | cache **worse** |
149+
| 350 | 39 / 39 / 122 | 301 / 488 / 633 | cache **much worse** |
150+
| 400 | 1758 / 2275 / 2463 | 3053 / 3203 / 3327 | cache worse at cliff |
151+
152+
Two compounding problems:
153+
154+
1. **The motivating hypothesis was wrong.** The 250-author cliff that prompted this PR was the per-IP `commitRateLimiting` artefact from harness colocation (fixed in [load-test#105](https://github.com/ether/etherpad-load-test/pull/105)), not a join-path thundering herd. There was no join-path bottleneck to fix.
155+
156+
2. **The implementation was net-negative.** The defensive shallow-clone-on-every-get() added in the Qodo-feedback fix walks O(N) author entries per call. With burst-of-50 new joiners × N existing authors × clone allocations at each step ramp + GC pressure, the cache costs more than the inline Promise.all it replaced.
157+
158+
The HistoricalAuthorDataCache module is a useful template; if anyone revisits, drop the defensive clone (replace with a "don't mutate" contract) and the result might net out positive in actual production thundering-herd scenarios that the dive doesn't measure.
146159

147160
**Verdict: recommend merging** for the production correctness benefit. Not a cliff-mover.
148161

@@ -192,26 +205,29 @@ The lever-3 (#7768) finding still stands but **for a different reason than origi
192205

193206
**Merge in priority order:**
194207

195-
1. **[#7768](https://github.com/ether/etherpad/pull/7768)** — per-socket fan-out serialization + NEW_CHANGES_BATCH. The real, measured win. Correctness-positive.
208+
1. **[#7768](https://github.com/ether/etherpad/pull/7768)** — per-socket fan-out serialization + NEW_CHANGES_BATCH. Modest median p95 improvement at step 200 (37 → 35) but **measurably tighter envelope** (baseline max 51 → PR max 38) — fewer tail-latency excursions. Correctness-positive: prevents overlapping per-socket fan-outs that were previously racy under concurrent commits. NEW_CHANGES_BATCH framing is dormant at steady-state and fires under server slowness.
196209
2. **[#7762](https://github.com/ether/etherpad/pull/7762)** — Prometheus metrics. Already merged; instrument for any further dive.
197-
3. **[#7769](https://github.com/ether/etherpad/pull/7769)**`historicalAuthorData` cache. Production thundering-herd fix, neutral on dive.
198210

199211
**Do not merge:**
200212

201-
- WebSocket-only transport (lever 4).
202-
- `--max-old-space-size` heap bump (lever 2).
213+
- WebSocket-only transport (lever 4) — reliably worst at the cliff across 3 runs.
214+
- `--max-old-space-size` heap bump (lever 2) — no effect.
203215
- The closed `fanoutDebounceMs` ([#7766](https://github.com/ether/etherpad/pull/7766)) — superseded by lever 3.
204216
- The closed rebase-loop prefetch ([#7770](https://github.com/ether/etherpad/pull/7770)) — didn't help.
217+
- The closed `historicalAuthorData` cache ([#7769](https://github.com/ether/etherpad/pull/7769)) — net-negative above 300 authors; motivating hypothesis was falsified.
218+
- The closed engine.io WS packing ([#7772](https://github.com/ether/etherpad/pull/7772)) — patch never fired because engine.io's flush drains too eagerly.
205219

206220
## Where to take this next
207221

208-
The dive's cliff at 350-400 authors is **steady-state CPU saturation on a 4-vCPU runner with O(N²) fan-out**. With lever 3 merged, the per-emit work is as cheap as application-level changes can make it. Further ceiling extension needs to attack one of two surfaces:
222+
The dive's cliff at 350-400 authors is **steady-state CPU saturation on a 4-vCPU runner with O(N²) fan-out**. With lever 3 merged, the per-emit application-level work is as cheap as it can get. Further ceiling extension needs to attack one of three surfaces:
223+
224+
1. **Engine.io flush deferral.** The closed lever-8 attempt patched only the `send(packets[])` path; what's needed is to defer `socket.flush()` itself so multiple `sendPacket()` calls in the same task accumulate before drain. `queueMicrotask`-coalesced flush is the smallest behaviour change with the right shape. This is the natural sequel to [#7767](https://github.com/ether/etherpad/issues/7767).
209225

210-
1. **Transport-level packing.** From the [#7767](https://github.com/ether/etherpad/issues/7767) investigation: engine.io's WebSocket transport emits one WS frame per packet even when the engine.io socket has multiple packets queued. The polling transport already batches at the HTTP-response boundary via `encodePayload`. Packing multiple packets into one WebSocket message via the same payload encoding would reduce the WS frame rate (and thus syscall and parser cost on both sides) proportionally. This is an engine.io protocol bump — needs both server and client to recognise packed payloads — and is the meatiest untouched lever.
226+
2. **Bigger hardware or per-pad sharding.** A 4-vCPU runner is the constraint, not Etherpad. Production on 8+ vCPU sees the cliff move proportionally with no code changes. Per-pad multi-worker sharding lets a single host scale beyond single-core limits but is a much larger architectural change.
211227

212-
2. **Bigger hardware or per-pad sharding.** A 4-vCPU runner is the constraint, not Etherpad. Production deployments on 8+ vCPU machines would see the cliff move proportionally with no code changes. Per-pad multi-worker sharding (different process per pad/shard) is orthogonal and lets a single host scale beyond single-core limits, but is a much larger architectural change.
228+
3. **Better measurement methodology.** Single-run lever comparisons sit inside the noise envelope below the cliff. Future dive scoring should default to N≥3 trials and report min/median/max. The triple-run pattern this doc adopted (see "Methodology caveat" above) is the template.
213229

214-
Direction (1) is the next concrete investigation. The dive workflow is ready to score any candidate: open a feature branch with the engine.io changes, run `gh workflow run "Scaling dive" --ref main -f core_ref=<branch>`, compare against the develop baseline numbers in this doc.
230+
Direction (1) is the next concrete code investigation; (3) is methodology hygiene for all future investigations.
215231

216232
## Reproducing
217233

0 commit comments

Comments
 (0)