Skip to content

Fix race: client receives broadcast updates before sync handshake completes#138

Open
xrl wants to merge 2 commits into
y-crdt:mainfrom
xrl:fix-sync-handshake-race
Open

Fix race: client receives broadcast updates before sync handshake completes#138
xrl wants to merge 2 commits into
y-crdt:mainfrom
xrl:fix-sync-handshake-race

Conversation

@xrl
Copy link
Copy Markdown

@xrl xrl commented Apr 9, 2026

Problem

YRoom.serve() adds the client to self.clients immediately upon connection — before the SYNC_STEP1/SYNC_STEP2 handshake finishes. If the server-side document is mutated concurrently, _broadcast_updates sends a SYNC_UPDATE to the un-synced client. The client-side yjs then throws "Unexpected case" in findIndexSS because the update references a struct at a clock the client doesn't have yet.

How to trigger

This happens reliably in JupyterLab with jupyter-collaboration when an AI agent (via MCP tools) rapidly adds cells to a notebook while a new browser tab opens the same notebook. The new tab's WebSocket connects and starts the sync handshake, but before SYNC_STEP2 is received, a broadcast update arrives referencing a struct one clock tick past what the client has.

Debug data from instrumented yjs findIndexSS

[yjs findIndexSS] FAILED — clock not found in structs
{ searchClock: 1223, structsLength: 75, allRanges: "0..0, 1..8, ..., 1200..1222" }

The client has structs through clock 1222. The server sent an update referencing clock 1223. Always exactly one tick past the end — one broadcast update leaked through before the sync completed.

The race

1. Client connects → self.clients.add(channel)     ← TOO EARLY
2. Server sends SYNC_STEP1 (state vector) to client
3. MCP add_cell mutates server doc → _broadcast_updates fires
4. _broadcast_updates sends SYNC_UPDATE to ALL self.clients
   (including the new, un-synced client)
5. Client receives SYNC_UPDATE referencing clock 1223
6. Client hasn't processed SYNC_STEP2 yet — only has through 1222
7. findIndexSS throws "Unexpected case"

Fix

Defer self.clients.add(channel) until after the first sync message exchange completes. The client only joins the broadcast list once it has a complete initial state.

Also changes self.clients.remove() to self.clients.discard() in the finally block, since a client that disconnects before syncing was never added to the set.

Test

Added test_broadcast_not_sent_before_sync_completes which:

  1. Creates a YRoom with existing data
  2. Connects a new client
  3. Mutates the server-side doc 20 times during the handshake
  4. Asserts the client receives all data without errors

Without the fix, the client would crash on "Unexpected case" during step 3.

…completes

YRoom.serve() added the client to self.clients before the initial sync
exchange (SYNC_STEP1/SYNC_STEP2) finished. If the server-side document
was mutated concurrently (e.g. by an MCP tool adding notebook cells),
_broadcast_updates would send a SYNC_UPDATE to the un-synced client.

The client-side yjs then throws "Unexpected case" in findIndexSS because
the update references a struct at a clock the client doesn't have yet
(always exactly one tick past the client's last known struct).

The fix defers self.clients.add(channel) until after the first sync
message exchange completes, so broadcast updates only reach clients that
have a complete initial state.

Also changes self.clients.remove() to self.clients.discard() in the
finally block, since a client that disconnected before syncing was never
added to the set.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
channel.path,
)
tg.start_soon(channel.send, reply)
if not synced:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we are synchronized when we receive a SYNC_STEP2 message:

Suggested change
if not synced:
if not synced and message[1] == YSyncMessageType.SYNC_STEP2:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, that's not correct. The client is synchronized when the server sends a SYNC_STEP2 after receiving the client's SYNC_STEP1, so we don't need the synced variable:

Suggested change
if not synced:

@davidbrochart davidbrochart added the bug Something isn't working label Apr 10, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants