A real-time collaborative music environment where human musicians and AI agents are equal peers, connected via WebSocket to a shared jam room. All communication is lightweight JSON note events ("Cloud MIDI") — no audio transmitted.
Important
v5 Upgrade: This version incorporates expert-level audio engineering critiques (Phase 10) that address hidden physical constraints in distributed real-time audio: SNTP clock sync, React thread starvation, DO I/O thrashing, live input delay, hanging notes, and LLM agent ergonomics. These mitigations are folded into every relevant section below.
Goal: Enable humans and AI agents to jam together in real-time by exchanging symbolic note events over WebSocket, with each client synthesizing audio locally.
Hard Constraints:
- Solo founder, no team — must be buildable in <2 weeks
- Cloudflare Workers ecosystem (existing muscle memory)
- Zero infrastructure cost at idle (Hibernation API)
- Browser-only for humans (no desktop app)
- No audio transmission — JSON events only
Primary Read Path: A peer connects to a room and receives (1) current room state (BPM, key), (2) peer list, (3) real-time stream of note events from all other peers.
| # | Criticism | Severity | Mitigation |
|---|---|---|---|
| 1 | Clock Skew (SNTP + Audio Anchor) — beat_time depends on synchronized time, but naive clockOffset = serverTime - localTime ignores Network Round-Trip Time (RTT). A 60ms one-way packet delay makes the offset permanently wrong by 60ms — audible as drum flams. Furthermore, Date.now() (OS system clock) and Tone.now() (hardware audio clock) tick at different rates and will drift apart over a 20-minute jam session. |
🔴 Critical | True SNTP Ping-Pong Handshake: (1) Client sends ping with t0 = performance.now(). (2) DO responds with t1 (receive time) and t2 (send time). (3) Client receives at t3. (4) TrueOffset = ((t1 - t0) + (t2 - t3)) / 2. Repeat every 30s to track drift. Audio Clock Anchor: Map the synchronized server beat_time directly to Tone.now(). Never use Date.now() for audio scheduling. All scheduling math uses Tone.Transport.seconds as the single source of truth. Encapsulated in src/shared/utils/clock-sync.ts. |
| 2 | DO is SPOF per room — if DO crashes, all peers disconnect and in-memory state is lost. | 🟡 Medium | Clients auto-reconnect with exponential backoff. DO re-instantiates from persisted storage. sequences survive in ctx.storage. Transport state persisted on every mutation. |
| 3 | No auth in MVP — any peer can impersonate any peer_id, send garbage, or spam notes. |
🟡 Medium | Acceptable for MVP (prototype). Phase 2 adds Bearer token for agents + rate limiting (max 100 note events/sec per peer). Server assigns peer_id — clients cannot choose their own. |
| 4 | Broadcast O(N) — at 20 peers × 10 notes/sec = 200 msgs/sec, DO broadcasts 200 × 19 = 3,800 sends/sec. Single-thread blocks inbound processing during broadcast. | 🟡 Medium | DO soft limit is 1000 inbound req/s. Outbound ws.send() is not rate-limited. At 20 peers this is fine. At 50+, shard rooms into sub-rooms or use fan-out Worker. |
| 5 | Patch consistency — peers use different Tone.js versions or configurations, causing different sounds for same note events. | 🟢 Low | MVP uses one hardcoded PolySynth patch. Server broadcasts patch_config in room_state. Future: hash-verified patch distribution. |
| 6 | React Main Thread Starvation — piping 50–100 incoming WebSocket note events per second into React state (setNotes([...notes, newNote])) triggers constant reconciliation cycles. This causes heavy Garbage Collection (GC) spikes, blocking the browser's main thread. A blocked main thread causes Tone.js to instantly glitch, crackle, and stutter. |
🔴 Critical | Decouple hot-path from React entirely. (1) Audio Path: Route WS onmessage directly to a vanilla TypeScript AudioEngine class that triggers Tone.js — zero React involvement. (2) Visual Path: Push note data into a mutable useRef. Have Visualizer.tsx read from the ref using a native requestAnimationFrame canvas loop. React remains entirely ignorant of the real-time event loop. Encapsulated in src/client/features/synth/model/audio-engine.ts. |
| 7 | DO Serialization & I/O Thrashing — if 10 peers send 10 notes per second = 100 SQLite inserts/sec. Synchronous disk writes or JSON.parse() 100 times/sec inside the DO's single event loop introduces micro-stutters to the WS relay, ruining the tightness of the master clock. |
🟡 Medium | Zero-Cost Router + Batch Persist. (1) Hot Path: Treat the DO as a "dumb router" for note relays. String-match "type":"note_event" and instantly ws.send(rawString) to other peers — no JSON.parse(). (2) Cold Path: Push raw strings into an in-memory array. Use ctx.waitUntil() or a setInterval to parse and bulk-insert the array to SQLite every 5 seconds via flushToSQLite(). |
| 8 | Live Input Playout Delay Paradox — Lookahead scheduling (beat_time: 4.5) is perfect for AI agents, but humans play live. If a human hits a key on Beat 4.0 and broadcasts beat_time: 4.0, it takes ~40ms to reach remote peers. Remote peers receive the note in the past (at beat 4.0 + 40ms). Tone.js will either drop it or play it out of phase. |
🟡 Medium | Hybrid Playout Delay. Local playback is instant (zero latency for the performer). For broadcast: schedule the note at Tone.now() + playoutDelay (configurable, default 40ms) and send that exact future timestamp to the DO. Remote peers schedule at the received timestamp. This sacrifices perfect global sync by ~40ms but preserves instant local feedback — the standard approach used by every professional DAW for monitoring latency. |
| 9 | Hanging Notes / MIDI Panic — if a human or AI sends note_on but their Wi-Fi drops or script crashes before note_off, that note drones indefinitely in every other peer's speakers, ruining the jam. |
🟡 Medium | DO Connection Lifecycle Cleanup. (1) DO maintains activeNotes: Map<WebSocket, Set<string>> tracking which pitches each peer has active. (2) On native Cloudflare webSocketClose and webSocketError events, DO automatically broadcasts fabricated note_off events for all of that peer's active pitches. (3) Client-side safety: auto-release any note held longer than 30 seconds. |
| Criteria | WebSocket (via DO) | WebRTC Data Channel | Server-Sent Events |
|---|---|---|---|
| Latency | 20-80ms (via CF edge) | 5-30ms (P2P direct) | 50-200ms (unidirectional) |
| Topology | Star (DO is hub) | Mesh (N² connections) | Star (server push only) |
| Bidirectional | ✅ | ✅ | ❌ (server→client only) |
| Agent compatibility | ✅ (any WS client) | ❌ (needs browser/WebRTC stack) | |
| State management | ✅ (DO is state authority) | ❌ (no central state) | |
| Complexity | Low | High (signaling, NAT traversal, STUN/TURN) | Low |
| Max peers per room | ~50 (DO limit) | ~6-8 (mesh degrades) | Unlimited (read-only) |
| Cost at idle | $0 (Hibernation API) | $0 (P2P) | N/A |
Recommendation: WebSocket via DO. We're sending ~150-byte JSON events, not audio. The 20-80ms latency is absorbed by lookahead scheduling. P2P mesh breaks at >8 peers and can't accommodate headless agents. The DO gives us free state authority.
| Criteria | Durable Objects | Upstash Redis | Supabase Realtime |
|---|---|---|---|
| Latency | <10ms (same-origin) | 20-50ms (external) | 50-100ms (Postgres) |
| WebSocket native | ✅ (Hibernation API) | ❌ (need WS proxy) | ✅ (channels) |
| Strong consistency | ✅ (single-thread) | ||
| Cost at idle | $0 | $0 (free tier) | $0 (free tier) |
| Cost at 10K MAU | ~$5/mo | ~$10/mo | ~$25/mo |
| Vendor lock-in | Medium (CF only) | Low | Medium |
| Agent-first API | ✅ (WS upgrade) | ❌ (need custom) |
Recommendation: Durable Objects. Native WebSocket + state in one primitive. Single-thread = no race conditions. Hibernation = zero cost idle. No external dependency.
| Criteria | Tone.js v15 | Elementary Audio | Raw Web Audio API |
|---|---|---|---|
| Bundle size | ~150KB gzipped | ~80KB | 0KB (native) |
| PolySynth built-in | ✅ | ❌ (build from primitives) | ❌ |
| Scheduling (Transport) | ✅ (Tone.Transport) | ❌ (manual) | ❌ (manual) |
| GitHub stars | 13K+ | 1.5K | N/A |
| Learning curve | Low | Medium | High |
| AudioWorklet DSP | ✅ (core design) | ✅ (native) |
Recommendation: Tone.js for MVP. Built-in PolySynth, Transport, and scheduling. Swap to Elementary Audio for custom DSP in Phase 2 if needed.
| Criteria | Naive Offset (serverTime - localTime) |
SNTP Ping-Pong + Audio Anchor | NTP Library (e.g., ntp-client) |
|---|---|---|---|
| RTT compensation | ❌ (ignores network delay entirely) | ✅ (full round-trip calculation) | ✅ (full NTP algorithm) |
| Audio clock alignment | ❌ (uses Date.now(), drifts from Tone.now()) |
✅ (anchored to AudioContext.currentTime) |
❌ (system clock only) |
| Implementation complexity | Trivial (~5 LOC) | Moderate (~80 LOC) | High (external dependency) |
| Works in browser | ✅ | ✅ | ❌ (needs UDP, blocked in browser) |
| Drift correction over time | ❌ (one-shot) | ✅ (periodic re-sync every 30s) | ✅ |
| Precision for music | ❌ (~60ms error possible) | ✅ (<5ms error after calibration) | N/A (not browser-compatible) |
Recommendation: SNTP Ping-Pong + Audio Anchor. The only approach that compensates for RTT AND anchors to the hardware audio clock. Critical for maintaining rhythmic integrity across a 20-minute session.
| Criteria | React State-Driven (useState per note) |
Vanilla TS AudioEngine (decoupled) |
|---|---|---|
| GC pressure at 100 events/sec | 🔴 Severe (constant re-renders, array allocations) | 🟢 None (no React involvement) |
| Tone.js scheduling reliability | 🔴 Glitches when main thread blocked by React reconciliation | 🟢 Runs independently of React lifecycle |
| Visualizer performance | 🟡 React re-renders throttled by event volume | 🟢 requestAnimationFrame reads from mutable ref |
| Code complexity | Low (familiar React patterns) | Medium (two-layer architecture: AudioEngine + React wrapper) |
| Testability | Medium | High (AudioEngine is pure TS, unit-testable without DOM) |
Recommendation: Vanilla TS AudioEngine. The performance characteristics of real-time audio are fundamentally incompatible with React's reconciliation model at high event rates. The AudioEngine class handles the hot path; React handles the cold path (UI controls, peer list, transport bar).
| Service | Hard Limit | Escape Hatch |
|---|---|---|
| DO: SQLite storage | 10 GB per DO (paid), unlimited per account | More than enough — 10K events ≈ 1.5MB |
| DO: KV key+value | 2 MB combined (SQLite-backed) | Use SQL API instead for sequences |
| DO: SQL row/blob | 2 MB max per row | Each note event JSON ≈ 150 bytes |
| DO: Inbound requests | ~1000 req/s (20 WS msgs = 1 req) | Shard rooms at 50+ peers |
| DO: WS message size | 32 MiB max | Note events are ~150 bytes — no risk |
| DO: Single-thread I/O | One event loop — synchronous disk writes or JSON.parse() at high frequency will block WS relay |
Hot-path router: string-match note events and relay raw strings without parsing. Batch persist to SQLite every 5 seconds via flushToSQLite(). |
| Worker: CPU time | 30s paid / 10ms free | Note relay is sub-1ms |
| Worker: Memory | 128 MB | DO state is <1MB even with 10K events |
| Tone.js: AudioContext | 6 per page (browser) | We use 1 |
| Web Audio: Latency | ~128 samples minimum (~3ms at 44.1kHz) | Acceptable for music |
| Web Audio: Clock drift | AudioContext.currentTime drifts from Date.now() at ~1ms/min |
Never use Date.now() for audio scheduling — anchor all beat math to Tone.now() via clock-sync.ts |
| React: Main thread budget | ~16ms per frame at 60fps — React reconciliation on high-frequency state updates consumes this budget | Route WS audio events directly to AudioEngine class — React never sees note events |
| WebSocket: Free tier | 100K requests/day | MVP usage << 100K |
Note
Storage clarified from official docs: SQLite-backed DOs (new default) have 2 MB key+value combined and 10 GB SQLite storage per DO. The 128 KiB limit only applies to legacy KV-backed DOs. The 32 KiB figure previously cited was incorrect. We use ctx.storage.sql — no chunking needed.
Warning
v5 Audio Physics Constraints: Three constraints are invisible in standard web architectures but critical for real-time audio: (1) OS clock vs audio clock drift, (2) React GC spikes blocking the audio thread, (3) DO event loop stalls from synchronous I/O. All three are mitigated in this architecture — see Phase 2 items 1, 6, and 7.
| Page | Data Needed | Source | Query Pattern |
|---|---|---|---|
| Landing | None (room ID input) | N/A | N/A |
| Jam Room | Room state, peer list, note stream | WebSocket | Single persistent connection — all data pushed |
Ideal query in plain English: "Connect to room X and continuously receive BPM, key, who's connected, and every note anyone plays — in real-time, ordered by beat position."
Schema derivation: No traditional database. DO in-memory state IS the schema:
// This is the "database" — lives in DO memory + ctx.storage
roomState: { bpm, key, transportStartTime, serverTime }
peers: Map<peerId, { name, kind, instrument, color }>
sequences: NoteEvent[] // ring buffer, persisted in SQLite table (batch-flushed every 5s)
activeNotes: Map<WebSocket, Set<string>> // v5: tracks active pitches per peer for MIDI panic
pendingFlush: string[] // v5: raw event strings awaiting batch SQLite insertN+1 risk: None. All data arrives via single WebSocket. No sub-queries.
Data flow architecture (v5):
WebSocket onmessage
├── Hot Path (note_event) ──→ string-match, relay raw to peers (no JSON.parse)
│ └──→ push raw string to pendingFlush[]
└── Cold Path (all other) ──→ JSON.parse ──→ state mutation ──→ broadcast
Client-side:
WebSocket onmessage
├── Audio Path ──→ AudioEngine.playNote() ──→ Tone.js (bypasses React entirely)
└── Visual Path ──→ noteBufferRef.current.push() ──→ rAF canvas loop reads ref
| Component | Failure Mode | User Impact | Mitigation | Recovery |
|---|---|---|---|---|
| DO instance | Crash / eviction | All peers disconnect | Auto-reconnect with backoff | DO re-instantiates, loads state from storage |
| WebSocket | Network drop | Single peer loses connection | Client reconnect + re-hydrate from DO | Peer rejoins, gets room_state + sequences |
| Tone.js | AudioContext blocked | No sound | "Click to start" button (browser requirement) | User gesture triggers Tone.start() |
| Agent | Script crash | Agent stops playing, others unaffected | Agent-side reconnect logic | Agent rejoins room |
| BPM change mid-playback | Timeline drift | Notes play at wrong time | Recalculate transportStartTime |
All peers receive corrected room_state |
| Clock skew (SNTP) | beat_time drift between peers |
Notes slightly out of sync | SNTP ping-pong re-sync every 30s, anchored to Tone.now() |
Client recalculates TrueOffset on each sync cycle |
| Clock drift (Audio vs OS) | Date.now() and Tone.now() diverge over 20-min session |
Gradually worsening timing errors | Never use Date.now() for scheduling — all beat math anchored to AudioContext.currentTime via clock-sync.ts |
Periodic re-anchor on each SNTP sync |
| Malicious peer | Spam 1000 notes/sec | Room flooded | Server-side rate limit (100 events/sec/peer) | Excess events dropped, peer warned |
| React GC spike | Main thread blocked >16ms during high note volume | Tone.js glitches, audio crackle, visual stutter | Audio path fully decoupled from React — AudioEngine class handles Tone.js directly; visualizer reads from useRef via requestAnimationFrame |
No recovery needed — architecture prevents the failure mode |
| Hanging note (MIDI panic) | Peer sends note_on then disconnects before note_off (Wi-Fi drop, script crash, tab close) |
Note drones indefinitely in all other peers' speakers | DO maintains activeNotes: Map<WebSocket, Set<string>>. On webSocketClose/webSocketError, DO broadcasts fabricated note_off for all of that peer's active pitches. Client-side safety: auto-release notes held >30s. |
Immediate cleanup — other peers hear the note stop within one WS broadcast cycle |
| DO I/O thrashing | 100+ SQLite inserts/sec from 10 peers × 10 notes/sec blocks WS relay | Micro-stutters in master clock, notes arrive late | Hot-path router: note events relayed as raw strings without JSON.parse(). pendingFlush[] batch-inserts to SQLite every 5 seconds. |
No recovery needed — architecture prevents the failure mode |
Migration: Greenfield project — no existing state to migrate. Deploy creates new DO namespace.
Rollback:
- Branch:
feat/web-daw-jam-room→ PR tostaging - Revert:
git revertmerge commit. No persistent state affected (rooms are ephemeral). - DO storage: rooms die when empty — no orphaned data.
Canary Criteria:
- WebSocket connection success rate > 99%
- Note event round-trip < 150ms (p95)
- SNTP clock sync accuracy < 10ms drift (p95)
- Zero audio glitches during 5-minute sustained 4-peer jam test
- Zero 5xx errors in Worker logs for 30 minutes post-deploy
Building an agent-first collaborative music environment. Need real-time bidirectional communication between humans (browsers) and AI agents (headless scripts) with consistent state management and zero idle cost.
Use Cloudflare Durable Objects with Hibernation WebSocket API as the central hub. Star topology with DO as state authority. JSON "Cloud MIDI" protocol with beat_time scheduling anchored to AudioContext.currentTime via SNTP ping-pong sync. Tone.js for client-side audio synthesis routed through a vanilla TypeScript AudioEngine class decoupled from React's render cycle. No audio transmission.
- Higher latency than WebRTC (20-80ms vs 5-30ms) — acceptable because we're sending symbolic events, not audio. Lookahead scheduling absorbs jitter.
- Single-thread bottleneck at scale — DO caps around 50 active peers. Acceptable for MVP. Escape: shard into sub-rooms.
- No preset ecosystem — single hardcoded synth patch. Acceptable for prototype.
- Hybrid playout delay (v5) — local playback is instant, but broadcast is delayed by ~40ms to ensure remote peers can schedule notes in the future rather than the past. Trades perfect global sync for natural local feedback. This is the standard approach used by professional DAWs and networked game engines.
- Two-layer client architecture (v5) — vanilla TS
AudioEnginehandles the real-time hot path (Tone.js scheduling, note triggering) while React handles the cold path (UI controls, peer list, transport bar). Adds architectural complexity but eliminates the fundamental incompatibility between React's reconciliation model and real-time audio at high event rates. - DO hot-path routing (v5) — note events are relayed as raw strings without
JSON.parse()to avoid blocking the event loop. SQLite persistence is batch-flushed every 5 seconds. Trades strict per-event durability for relay performance. Acceptable because note events are ephemeral — if the DO crashes, the room re-initializes and peers reconnect.
- DO SQLite storage: 10GB per DO, 2MB per row — more than sufficient. No chunking needed.
- SNTP clock sync → periodic ping-pong handshake every 30s, TrueOffset anchored to
Tone.now() - Audio vs OS clock drift →
Date.now()banned from audio scheduling; all beat math usesAudioContext.currentTime - No auth in MVP → server-assigned
peer_id, rate limiting in Phase 2
- WebRTC Data Channels: Lower latency but mesh topology degrades at >8 peers and headless agents need full WebRTC stack. Rejected.
- Upstash Redis: External dependency, no native WebSocket, eventual consistency. Rejected.
- Supabase Realtime: Higher latency, Postgres overhead for ephemeral event relay. Rejected.
- React state for audio events: Familiar pattern but fundamentally incompatible with real-time audio at >50 events/sec due to GC pressure and main thread starvation. Rejected.
- Naive
Date.now()clock sync: Ignores RTT and drifts from audio clock. Rejected in favor of SNTP + Audio Anchor.
- No conflicts with existing systems (empty workspace).
- Synergy: User's shader physics skills directly transferable to WebGL audio visualization.
- Enables: Future Agent Hub integration — agents connect to jam rooms via the same API they'd use for any other tool.
| Rule | Status | Notes |
|---|---|---|
| FSD layer hierarchy | ✅ Pass | shared/ → features/ → App.tsx |
| Barrel file enforcement | ✅ Pass | Every slice has index.ts |
| 300 LOC limit | ✅ Pass | Largest file ~200 LOC (jam-room.ts with hot-path router) |
TypeScript strict (no any) |
✅ Pass | All types defined in protocol |
| Zod at network boundary | ✅ Pass | All inbound WS messages validated (cold path only — hot path uses string-match) |
| Conventional commits | ✅ Pass | Branch + PR workflow |
| PascalCase components, kebab-case files | ✅ Pass | Per file inventory |
| Unified Worker pattern | ✅ Pass | wrangler.jsonc with assets + DO |
| React decoupled from audio hot-path | ✅ Pass | AudioEngine is vanilla TS — no React imports |
| Clock sync anchored to AudioContext | ✅ Pass | clock-sync.ts uses Tone.now(), never Date.now() |
| DO I/O batched | ✅ Pass | flushToSQLite() runs every 5s via setInterval |
| MIDI panic on peer disconnect | ✅ Pass | webSocketClose handler broadcasts note_off for all active pitches |
| # | Action | Path | LOC | Notes |
|---|---|---|---|---|
| 1 | NEW | package.json |
35 | |
| 2 | NEW | tsconfig.json |
20 | |
| 3 | NEW | wrangler.jsonc |
20 | |
| 4 | NEW | vite.config.ts |
25 | |
| 5 | NEW | src/shared/protocol/types.ts |
100 | v5: added ClockSyncPing, ClockSyncPong, NoteOff (MIDI panic), MeasureComplete types |
| 6 | NEW | src/shared/protocol/constants.ts |
45 | v5: added PLAYOUT_DELAY_MS, CLOCK_SYNC_INTERVAL_MS, FLUSH_INTERVAL_MS, MAX_NOTE_HOLD_S |
| 7 | NEW | src/shared/protocol/index.ts |
5 | |
| 8 | NEW | src/shared/utils/clock-sync.ts |
80 | v5 NEW — SNTP ping-pong RTT calculation, TrueOffset computation, periodic re-sync, getTrueNetworkTime() function, anchored to Tone.now() |
| 9 | NEW | src/server/index.ts |
25 | |
| 10 | NEW | src/server/jam-room.ts |
250 | v5: expanded with hot-path string-match router, pendingFlush[] array, flushToSQLite() debounced batch insert, activeNotes map, webSocketClose/webSocketError MIDI panic handler |
| 11 | NEW | src/server/env.ts |
10 | |
| 12 | NEW | src/client/main.tsx |
15 | |
| 13 | NEW | src/client/App.tsx |
60 | |
| 14 | NEW | src/client/features/room/model/use-room-socket.ts |
130 | v5: routes note events directly to AudioEngine (not React state). Only cold-path events (peer list, room state, transport) update React state. |
| 15 | NEW | src/client/features/room/ui/JamRoom.tsx |
100 | |
| 16 | NEW | src/client/features/room/ui/PeerList.tsx |
50 | |
| 17 | NEW | src/client/features/room/ui/TransportBar.tsx |
80 | |
| 18 | NEW | src/client/features/room/index.ts |
5 | |
| 19 | NEW | src/client/features/synth/model/audio-engine.ts |
150 | v5 NEW — Pure vanilla TS class (no React imports). Manages Tone.js PolySynth, handles playout delay scheduling (Tone.now() + PLAYOUT_DELAY_MS), note-on/note-off tracking, AudioContext lifecycle. Exposes playNote(), stopNote(), setTransport(), dispose(). |
| 20 | NEW | src/client/features/synth/model/use-synth-engine.ts |
60 | v5: reduced from 100 LOC — now a thin React wrapper around AudioEngine. Creates/disposes the engine on mount/unmount. Exposes engine ref for direct WS routing. |
| 21 | NEW | src/client/features/synth/ui/Keyboard.tsx |
120 | |
| 22 | NEW | src/client/features/synth/index.ts |
5 | |
| 23 | NEW | src/client/features/visualizer/ui/Visualizer.tsx |
80 | v5: reads from noteBufferRef.current via requestAnimationFrame loop — not React state. Canvas-based rendering. |
| 24 | NEW | src/client/features/visualizer/index.ts |
5 | |
| Total | ~1,340 | +120 LOC from v4 (new clock-sync.ts + audio-engine.ts + expanded jam-room.ts) |
Agent SDK (src/agent-sdk/index.ts, ~80 LOC) deferred to Phase 1b after core room works.
- Create
feat/web-daw-jam-roombranch - Scaffold Vite + React project with
npx create-vite - Install:
tone,zod,hono(for future API routes) - Configure
wrangler.jsoncwith DO binding - Verify
npx wrangler devworks with DO locally
What you CAN do after Phase 1:
- Open
https://jam-room.example.com/room/my-roomin two browser tabs - Play notes on a virtual keyboard (or MIDI controller) — hear them in both tabs
- See a peer list showing who's connected with color-coded note attribution
- Change BPM/key — all peers sync instantly
- See a real-time waveform visualizer driven by the audio output
- Experience tight rhythmic sync thanks to SNTP clock anchoring
- No audio glitches even at high note density thanks to React-decoupled AudioEngine
- Notes properly cleaned up when a peer disconnects (no hanging drones)
What's explicitly DEFERRED:
- Agent SDK (Phase 1b — after core room proven)
- LLM Agent Measure Aggregation (Phase 1b —
measure_completeevent bundling for Foundation Models. The DO emits a JSON bundle of all notes at the end of each measure, allowing LLMs to receive bounded context windows instead of a real-time firehose. LLM agents can sit dormant, receive the measure summary, take 2 seconds to "think", and schedule their response to drop at the start of Measure N+2.) - Multi-instrument selection
- Step sequencer / pattern recording
- AI composition endpoint
- Room persistence / lobby
- Authentication
What DOESN'T change: No existing projects affected. Greenfield workspace.
Note
These mitigations were identified via expert-level audio engineering review and address physical constraints in distributed real-time audio that are invisible in standard web architectures. Each has been folded into the relevant phase above. This section serves as a cross-reference summary.
| # | Dragon | Root Cause | Mitigation | Integrated In |
|---|---|---|---|---|
| 1 | Clock Skew Math Ignores RTT | Naive serverTime - localTime offset ignores 20-80ms network delay. Also, Date.now() and Tone.now() drift apart at ~1ms/min. |
SNTP ping-pong handshake with TrueOffset = ((t1 - t0) + (t2 - t3)) / 2. All audio scheduling anchored to Tone.now(), never Date.now(). |
Phase 2 #1, Phase 4, Phase 6, new file clock-sync.ts |
| 2 | React Main Thread Starvation | 50-100 WS events/sec in React state → constant reconciliation → GC spikes → Tone.js glitches | Vanilla TS AudioEngine class handles audio hot path. React only handles cold-path UI. Visualizer reads from useRef via requestAnimationFrame. |
Phase 2 #6, Phase 4, Phase 6, new file audio-engine.ts |
| # | Dragon | Root Cause | Mitigation | Integrated In |
|---|---|---|---|---|
| 3 | DO Serialization Thrashing | 100+ JSON.parse() + SQLite inserts/sec blocks DO event loop → WS relay stutters |
String-match note events for raw relay (no parse). Batch persist to SQLite every 5s. | Phase 2 #7, Phase 4, Phase 5, jam-room.ts |
| 4 | Live Input Playout Delay | Human note_on at Beat 4.0 arrives 40ms late at remote peers → played in the past |
Hybrid delay: instant local playback + broadcast at Tone.now() + playoutDelay. |
Phase 2 #8, audio-engine.ts |
| # | Dragon | Root Cause | Mitigation | Integrated In |
|---|---|---|---|---|
| 5 | Hanging Notes / MIDI Panic | note_on without note_off due to disconnect → infinite drone |
DO tracks activeNotes per peer. webSocketClose broadcasts fabricated note_off. |
Phase 2 #9, Phase 6, jam-room.ts |
| 6 | LLM Agent Firehose | Foundation Models can't process continuous 150-byte frames — need bounded context windows | measure_complete event bundles all notes at measure end. LLMs receive, think, respond 2 measures later. |
Phase 9e (deferred to Phase 1b) |
Automated: Vitest for protocol Zod schemas + DO unit tests (@cloudflare/vitest-pool-workers). Cypress for two-client WebSocket relay + frontend audio playback.
Clock Sync Verification (v5): Unit test clock-sync.ts with mocked latencies (10ms, 50ms, 100ms) — verify TrueOffset calculation accuracy within 5ms.
Audio Decoupling Verification (v5): Stress test with 100 synthetic WS events/sec — verify zero React re-renders on note events, zero Tone.js glitches over 60 seconds.
MIDI Panic Verification (v5): Simulate peer disconnect with active notes — verify note_off broadcast within 100ms and no hanging audio.
DO Hot-Path Verification (v5): Load test with 10 synthetic peers × 10 notes/sec — verify WS relay latency stays <10ms p95, SQLite batch flush occurs every ~5s.
Manual: Two-tab jam test, latency feel test, BPM change propagation.
Canary: WS success rate >99%, note round-trip <150ms p95, SNTP sync accuracy <10ms, zero 5xx for 30 min.
- Phase 2: 9 criticisms found and addressed (v5: expanded from 5 to 9)
- Phase 3: Trade-off matrices for transport, state manager, synth engine, clock sync, audio architecture (v5: added 2 new matrices)
- Phase 4: Hard limits for DO, Workers, Tone.js, Web Audio, WebSocket, React main thread, audio clock drift (v5: expanded)
- Phase 5: UI data access defined (single WebSocket, no DB queries) + data flow architecture (v5: hot-path/cold-path diagram)
- Phase 6: Failure mode table for 11 components (v5: expanded from 7 to 11)
- Phase 7: Rollback plan (git revert, no DB migrations) + expanded canary criteria (v5)
- Phase 8: ADR written with v5 trade-offs (playout delay, two-layer client, hot-path routing)
- Phase 9a: Cross-system validation — no conflicts
- Phase 9b: Compliance audit — all rules pass (v5: added 4 new audit items)
- Phase 9c: File inventory — 24 files tracked (v5: +2 from v4)
- Phase 9d: Prerequisites identified
- Phase 9e: End-result vision documented with v5 capabilities + deferred LLM measure aggregation
- Phase 10: Production-audio grade mitigations — 6 hidden dragons identified, all integrated