| date | 2026-03-15 |
|---|---|
| topic | e2e-testing-strategy |
A comprehensive testing strategy for TinyDocy's collaborative document editor, covering four concerns: schema correctness (enforced document model with fuzzing), feature E2E (drag, fold, filter in a real browser), performance (typing latency in long/complex documents), and concurrency (100 simultaneous Yjs users editing the same document).
The strategy uses three specialized layers — Vitest for fast headless schema tests, Playwright for browser-based E2E and performance baselines, and a custom Yjs load harness for scaled concurrency simulation. CI/CD integration is deferred to a later phase.
Considered:
- Approach A (chosen): Vitest + Playwright + Custom Yjs Harness — each tool excels at its layer; concurrency scales without 100 browsers; schema fuzzing runs in milliseconds.
- Approach B: Playwright + k6 — k6 is battle-tested for load testing, but encoding Yjs sync protocol in k6 is non-trivial.
- Approach C: Playwright-only — simplest setup, but can't reach 100 concurrent users and schema fuzzing in a browser is slow.
Approach A was chosen because:
- Vitest gives sub-second feedback on schema invariants and paste handling without browser overhead
- Playwright is the gold standard for ProseMirror/Tiptap browser testing — it can measure real keystroke-to-paint timing via Performance API
- The custom Yjs harness gives precise control over CRDT simulation at scale (connect 100
Y.Docinstances to Hocuspocus via WebSocket) without the cost of 100 real browsers - The hybrid model (a few Playwright browsers + many headless Yjs clients) satisfies both UX verification and load correctness
- What it tests: TitleDocument enforcement (first node always H1), paste handling (promotion to H1, inline marks, non-textblock flattening), document structure invariants after complex operations
- Invariant scope (Vitest layer): first node is H1, valid schema (
heading block*), heading levels 1-6 only, unique heading IDs (data-toc-id), heading hierarchy consistency, section boundaries compute correctly viacomputeSection, fold state survives edits, marks survive JSON serialization roundtrip, no orphaned decorations inDecorationSetafter transactions - Fuzzing: Two modes — (a) structured randomization generating valid documents with random heading depths/counts/content lengths, verifying invariants hold; (b) true fuzzing injecting malformed HTML pastes, broken markdown, rapid random operations to find crashes
- Fuzz scale: Medium — 5,000-10,000 random operations per run (~30-60 seconds), run as a dedicated test suite
- Why headless: Schema invariants don't need a browser — ProseMirror's
EditorStateandTransactionAPIs work in Node.js. Tests run in <1s (excluding fuzz suite). - Pattern: Create editor state with extensions → apply transactions/pastes → assert schema invariants
- What it tests: All heading features (drag sections, fold/unfold with crinkle animation, filter with hierarchy visibility, TOC sidebar sync), plus performance baselines
- Feature E2E: Navigate to editor, create document with known structure, exercise each feature, assert DOM state. Use real-world document structures that mimic actual usage patterns (complex heading trees, mixed content types). Functional assertions only (DOM state, CSS classes, computed styles) — no screenshot/pixel comparison.
- Browser-only invariants (promoted from exhaustive scope): all node types render correctly in the DOM, no memory leaks in plugins (measure via
performance.measureUserAgentSpecificMemory()or heap snapshots), plugin cleanup on editor destroy - Performance: Establish baselines first — measure keystroke-to-paint latency via
PerformanceObserverin documents of increasing size (10, 50, 100, 500 headings). Record baselines, then set regression thresholds based on what we learn. - Concurrency UX: 2-3 Playwright browser contexts connected to the same document, verifying Yjs sync produces correct merged state and typing remains responsive
- What it tests: 100 simultaneous users editing the same document — CRDT convergence, no data loss, Hocuspocus stability, WebSocket throughput
- How: Node.js script that spawns N
Y.Doc+HocuspocusProviderinstances, each performing random edits (insert text, add headings, delete sections) at configurable rates. After a test duration, all docs must converge to identical state. - Scenarios: Two modes — (a) distributed editing (users edit different sections, realistic production pattern); (b) conflict-heavy stress (multiple users deliberately edit the same paragraph/heading to stress CRDT merge)
- Form factor: Standalone Node.js script (not a Vitest test) — invoked via
bun run test:loador similar. Outputs a report (convergence pass/fail, timing, memory) to stdout. - Why custom: No existing tool natively speaks the Yjs sync protocol. Building a thin harness (~200-300 lines) using
yjs+@hocuspocus/provider(already in our deps) is simpler than adapting a generic load tool. - Metrics: Convergence time, message throughput, memory usage, document size growth, any lost updates
- Fixture documents: Curated JSON documents exercising edge cases — empty document, single H1 only, deeply nested (H1>H2>H3>H4>H5>H6), wide (50 H1 sections), mixed (headings + lists + images + blockquotes + code blocks)
- Generated documents: Randomized generators producing valid documents of configurable complexity for stress testing
- Fuzz inputs: Malformed HTML clipboard data, broken markdown, rapid concurrent conflicting edits
- No hard targets yet — first run establishes baselines
- Measure: keystroke-to-paint latency, decoration rebuild time, Yjs sync roundtrip, fold/unfold animation frame drops
- After baselines exist, set regression thresholds (e.g., "must not regress more than 20% from baseline")
- Test server: Tests require the dev server (
make dev) to be running — no ephemeral Hocuspocus spin-up for now - Visual regression: Functional assertions only — no screenshot/pixel comparison testing
- Concurrency scenarios: Both distributed editing (realistic) and conflict-heavy (stress) modes
- Heading edge cases: Derive from real-world document structures, not synthetic edge case lists
- Fuzz scale: Medium — 5,000-10,000 random operations (~30-60 seconds per run)
- Invariant depth: Exhaustive — schema, hierarchy, marks roundtrip, decoration cleanup, memory leaks
- CI/CD: Deferred to a separate brainstorm/plan
→ /workflows:plan for implementation details — file structure, test utilities, specific test cases, dependency setup