docs(scaling-dive): scope worker-thread first cut for applyToText

JohnMcLear · claude · JohnMcLear · commit 661e829aef39 · 2026-05-16T12:12:17.000+01:00
Post-#7775/#7776 profile shows applyToAText splits cleanly:

- applyToText (Changeset.ts:404) is pure (cs, text) -> text; trivially offloadable to a worker via worker_threads structured-clone postMessage.
- applyToAttribution (Changeset.ts:684) mutates AttributePool; not trivially offloadable.

Document the obvious first-pass design (run them in parallel via Promise.all inside applyToAText) and the realistic estimate (~6-8% CPU moved off the main event loop). putAttrib is only 0.26% in the post-fix profile, confirming the bulk of applyToAText's cost is in the string-manipulation half.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/docs/scaling-dive-2026-05.md b/docs/scaling-dive-2026-05.md
@@ -367,6 +367,12 @@ The dive's cliff at 350-400 authors is **single-event-loop saturation on one cor
 
 1. **Worker-thread offload of OT.** ~25% of CPU is in `Changeset.applyToAText` and friends: pure computation that could run in a worker thread or worker pool. The main event loop becomes a coordinator; the heavy lift parallelises. Verified necessary by the local vCPU experiment above: bigger boxes do *not* move the cliff because Etherpad uses one core regardless. Worker-thread offload is the smallest architectural change that lifts the single-event-loop ceiling.
 
+   **Concrete first-pass design.** `applyToAText(cs, atext, pool)` (`Changeset.ts:1060`) returns `{text: applyToText(cs, atext.text), attribs: applyToAttribution(cs, atext.attribs, pool)}`. The two halves are independent:
+   - `applyToText` (`Changeset.ts:404`) is a **pure function** of `(cs, text)`. Trivially offloadable to a worker pool via `node:worker_threads`. No shared state to negotiate; strings copy via `postMessage` structured clone.
+   - `applyToAttribution` (`Changeset.ts:684`) mutates `AttributePool` via `putAttrib`. Not trivially offloadable.
+   
+   The simplest first cut: dispatch `applyToText` to a worker while `applyToAttribution` runs on the main thread; `await Promise.all([workerText, mainAttrib])` inside `applyToAText`. The post-#7775/#7776 profile shows `putAttrib` is only 0.26% of CPU, so the bulk of the ~13% appendRevision share is in `applyToText` (string ops + `StringIterator` + `StringAssembler`). Plausible offload: ~6-8% of process CPU moved off the main event loop, directly recovering cliff headroom on a single Node process. Worth a focused PoC against one worker thread before deciding pool size.
+
 2. **Better measurement methodology.** Single-run lever comparisons sit inside the noise envelope below the cliff. Future dive scoring should default to N≥3 trials and report min/median/max. The triple-run pattern this doc adopted is the template; N=5+ would tighten conclusions further.
 
 The application-level surface has been explored end-to-end. Most non-trivial code levers that were thought to be wins turned out to be either inside the noise envelope (#7766 closed, #7770 closed, #7768 perf claim wrong) or net-negative (#7769 closed). The CPU-profile-identified levers are the exception: #7775 + #7776 stacked deliver -12% to -20% CPU% with the cliff effectively shifting from ~400 to ~500 authors — the biggest single-direction perf improvement in this program, and the first set of changes that move the cliff position itself rather than just thinning the tail. #7774 layers a modest additional tail-latency improvement on top. **Past this point the cliff is no longer hardware-bound; it's single-event-loop-bound** — verified by the local taskset experiment showing the cliff doesn't move when you give Etherpad more cores. Worker-thread offload of OT is the smallest architectural change that lifts the ceiling further — a separate program of work.
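
The first-pass design added in this hunk (send `applyToText` to a worker, keep `applyToAttribution` on the main thread, join with `Promise.all`) could be sketched roughly as below. This is a minimal illustration under stated assumptions, not Etherpad's code: `TextWorker`, `applyToATextSketch`, `applyToAttributionStandIn`, the `Pool` type and the inline worker source are all hypothetical, and the stand-in "changeset" is just a string to append. Only `Worker`, `postMessage`, `terminate` and the `eval` option come from `node:worker_threads`.

```typescript
import { Worker } from 'node:worker_threads';

// Hypothetical, simplified shapes; the real AText and AttributePool differ.
type AText = { text: string; attribs: string };
type Pool = Map<string, number>;

// Worker body as a JS string so the sketch stays in one file (eval: true).
// Stand-in for applyToText: pure (cs, text) -> text, here "append cs".
const workerSource = `
  const { parentPort } = require('node:worker_threads');
  parentPort.on('message', ({ id, cs, text }) => {
    parentPort.postMessage({ id, text: text + cs });
  });
`;

// Thin request/response wrapper around one worker thread.
// Error and exit handling are omitted to keep the sketch short.
class TextWorker {
  private worker = new Worker(workerSource, { eval: true });
  private nextId = 0;
  private pending = new Map<number, (text: string) => void>();

  constructor() {
    this.worker.on('message', (msg: { id: number; text: string }) => {
      const resolve = this.pending.get(msg.id);
      this.pending.delete(msg.id);
      resolve?.(msg.text);
    });
  }

  // Strings are copied to the worker by structured clone.
  applyToText(cs: string, text: string): Promise<string> {
    return new Promise((resolve) => {
      const id = this.nextId++;
      this.pending.set(id, resolve);
      this.worker.postMessage({ id, cs, text });
    });
  }

  async close(): Promise<void> {
    await this.worker.terminate();
  }
}

// Stand-in for applyToAttribution: it mutates the pool, which is why this
// half stays on the main thread in the first cut.
const applyToAttributionStandIn = (cs: string, attribs: string, pool: Pool): string => {
  if (!pool.has(cs)) pool.set(cs, pool.size);
  return `${attribs}*${pool.get(cs)}`;
};

// The worker request is posted first, then the attribution half runs
// synchronously on the main thread while the worker is busy.
async function applyToATextSketch(
  cs: string, atext: AText, pool: Pool, textWorker: TextWorker,
): Promise<AText> {
  const [text, attribs] = await Promise.all([
    textWorker.applyToText(cs, atext.text),
    Promise.resolve(applyToAttributionStandIn(cs, atext.attribs, pool)),
  ]);
  return { text, attribs };
}
```

A caller would create one `TextWorker`, `await applyToATextSketch(cs, atext, pool, worker)` per changeset, and call `worker.close()` on shutdown. Two things the sketch makes visible and a PoC would probably want to measure: the function now returns a promise, so every caller of `applyToAText` has to become async, and each call copies the full pad text to the worker and back, which may eat into the estimated offload for large pads.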