Coalesce burst negotiations and avoid full-reconnect on single NegotiationError under concurrent signaling #1915

@U-OK

Description

Describe the problem

In rooms with many participants, a burst of server-initiated renegotiations (e.g. several participants leaving within a few seconds — each leave triggers onMediaSectionsRequirement) causes severe client-side degradation:

  1. RTCEngine.negotiate() is invoked repeatedly in rapid succession. Each invocation registers Closing and Restarting listeners on the engine emitter (this.on(EngineEvent.Closing, handleClosed) / this.on(EngineEvent.Restarting, handleClosed)). With ~11 concurrent calls, Node's EventEmitter raises MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 restarting/closing listeners added.

  2. When one of the concurrent negotiations fails, fullReconnectOnNext = true is set unconditionally, triggering a full engine restart (cleanupPeerConnections + cleanupClient + rejoin signal + rejoin engine).

  3. Any track publish in flight at that moment (publishOrRepublishTrack) awaits reconnectFuture.promise and stalls for the full duration of the restart.
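The listener pile-up in (1) can be reproduced with a minimal stand-in for the engine emitter. This is a hypothetical sketch, not library code: `negotiate` here only mimics the shape of `RTCEngine.negotiate()` (two listeners added per call, removed in `finally`-style cleanup once the pending work settles):

```typescript
// Minimal repro of the listener accumulation (illustrative stand-in for
// RTCEngine.negotiate — names and timings are hypothetical).
import { EventEmitter } from "node:events";

const engine = new EventEmitter(); // default maxListeners = 10

function negotiate(): Promise<void> {
  return new Promise((resolve) => {
    const handleClosed = () => resolve();
    // Each in-flight call stacks two listeners, mirroring
    // this.on(EngineEvent.Closing/Restarting, handleClosed).
    engine.on("closing", handleClosed);
    engine.on("restarting", handleClosed);
    // The negotiation stays pending until signaling answers; only then
    // does the cleanup path remove the listeners again.
    setTimeout(() => {
      engine.off("closing", handleClosed);
      engine.off("restarting", handleClosed);
      resolve();
    }, 50);
  });
}

// ~11 leaves in quick succession → ~11 concurrent negotiate() calls.
for (let i = 0; i < 11; i++) void negotiate();

// 11 listeners on one event crosses Node's default threshold of 10,
// which is exactly when MaxListenersExceededWarning fires.
console.log(engine.listenerCount("closing")); // → 11
```

Running this under Node prints the same `MaxListenersExceededWarning: 11 closing listeners added` seen in the Sentry breadcrumbs, confirming the warning is a concurrency artifact rather than a true leak (cleanup does run, just later).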

Observed impact in production (12-participant room, 7 leaves within 4s, Chrome 146 / macOS, livekit-client@2.18.6, stable network, signal WS never disconnected):

  • setMicrophoneEnabled(true) stalled for 5.4 seconds when the user first enabled their mic during the mass-leave moment.
  • Noise-suppression processor initialization (AudioWorklet.addModule) failed immediately after the forced restart with AbortError: Unable to load a worklet's module, forcing passthrough fallback.
  • Two MaxListenersExceededWarning entries appeared in Sentry (one for restarting, one for closing).

Relevant breadcrumbs:

07:23:17.395Z publishing track {kind:"audio", source:"microphone"}
07:23:17-19 7 × video_room.member.status_updated user_state:"left"
07:23:19.730Z MaxListenersExceededWarning: 11 restarting listeners added
(stack: RTCEngine.negotiate → client.onMediaSectionsRequirement → SignalClient.handleSignalResponse)
07:23:19.939Z MaxListenersExceededWarning: 11 closing listeners added (same stack)
07:23:21.421Z [mic] slow toggle — setMicMs=5399 totalMs=5433 trackExists=false

The root cause is not a network failure (HTTP and WS were fine throughout). It is the combination of (a) no listener-accumulation guard on the engine emitter, and (b) overly aggressive escalation to full-reconnect on the first NegotiationError within a concurrent burst.

Describe the proposed solution

  1. Stop accumulating listeners per negotiate() call. In RTCEngine.negotiate(), instead of adding a fresh Closing / Restarting listener pair on every invocation, either:

    • Register the abort-on-close/restart logic once at engine construction, routing the abort to all in-flight negotiations through a shared set of abort controllers, or
    • Call this.setMaxListeners(0) (or a reasonable bound like 64) on the engine emitter so that expected concurrency does not trip Node's leak-detection warning.
  2. Do not force full-reconnect on a single transient NegotiationError when other negotiations are concurrently in flight. Under signaling bursts, one negotiation failing (e.g. because the PC state was mutated by a parallel negotiation) does not mean the peer connection is genuinely broken. Retry the failing negotiation locally first; only set fullReconnectOnNext = true if a retry confirms the PC is unrecoverable.

  3. Coalesce bursty onMediaSectionsRequirement into a single renegotiation. Debounce the handler (e.g. by microtask or a short timer) so that N consecutive media-section-requirement signals within a short window collapse into one negotiate() call. This is the highest-leverage fix — it eliminates the burst at the source rather than hardening downstream layers.
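The coalescing in (3) could be as small as a microtask debounce. The wrapper below is a hypothetical sketch; `onMediaSectionsRequirement` and `negotiate` follow the issue's naming but are simplified stubs:

```typescript
// Sketch of coalescing a burst of media-section-requirement signals into a
// single renegotiation. The debounce flag is hypothetical, not library code.
let negotiationQueued = false;
let negotiateCalls = 0;

async function negotiate(): Promise<void> {
  negotiateCalls++;
  // ... createOffer / setLocalDescription / send offer to signaling ...
}

function onMediaSectionsRequirement(): void {
  if (negotiationQueued) return; // collapse the burst
  negotiationQueued = true;
  // Microtask debounce: every signal arriving in the same tick shares
  // one negotiate() call. A short timer would widen the window further.
  queueMicrotask(() => {
    negotiationQueued = false;
    void negotiate();
  });
}

// 7 leaves arrive back-to-back → a single renegotiation.
for (let i = 0; i < 7; i++) onMediaSectionsRequirement();
queueMicrotask(() => console.log(negotiateCalls)); // → 1
```

A microtask window only coalesces signals decoded in the same tick; if the server spreads the leave notifications across several WS frames, a short timer (tens of milliseconds) would be needed to collapse them, at the cost of slightly delaying each renegotiation.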

Any one of these reduces the impact significantly; together they remove the class of bug.
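The retry-before-escalate behavior in (2) could look like the following. This is a hedged sketch: `negotiateOnce` simulates a transient failure (e.g. PC state mutated by a parallel negotiation), and the retry wrapper and `fullReconnectOnNext` flag mirror the issue's wording but are not the library's implementation:

```typescript
// Hypothetical retry-before-escalate wrapper around a single negotiation.
let attempts = 0;
let fullReconnectOnNext = false;

async function negotiateOnce(): Promise<void> {
  attempts++;
  // Simulate the transient case: first attempt fails because a parallel
  // negotiation mutated the PC state; a clean retry succeeds.
  if (attempts === 1) throw new Error("NegotiationError");
}

async function negotiateWithRetry(maxRetries = 1): Promise<void> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      await negotiateOnce();
      return; // the retry succeeding proves the PC was never broken
    } catch (e) {
      if (attempt === maxRetries) {
        // Only escalate once a retry confirms the PC is unrecoverable.
        fullReconnectOnNext = true;
        throw e;
      }
      await new Promise((r) => setTimeout(r, 100)); // brief backoff
    }
  }
}

negotiateWithRetry().then(() => {
  console.log({ attempts, fullReconnectOnNext }); // → { attempts: 2, fullReconnectOnNext: false }
});
```

In the production trace above, a wrapper like this would have retried the one failed negotiation locally and avoided the full restart entirely, since the network and signal WS were healthy throughout.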

Alternatives considered

  • Pass a custom AudioContext via webAudioMix.audioContext in Room options to avoid livekit closing our context on disconnect. This only mitigates the AudioWorklet.addModule failure (a downstream symptom), not the core 5.4s publish stall or the listener accumulation.
  • Application-level retry of the mic toggle in client code. Does not help: publishOrRepublishTrack is already queued and will wait on reconnectFuture regardless of how many times the user clicks.
  • Raise setMaxListeners from app code. Not possible — the RTCEngine emitter is internal and not exposed on the Room API.
  • Increase signaling timeouts. Does not address the root cause (timeouts were not hit; the 5.4s stall is from the full-reconnect flow completing successfully, not from a timeout).

Importance

nice to have

Additional Information

Code references (livekit-client 2.18.6 bundled as livekit-client.esm.mjs):

  • Listener registration per negotiate: RTCEngine.negotiate() calls this.on(EngineEvent.Closing, handleClosed) and this.on(EngineEvent.Restarting, handleClosed), with matching off() calls in finally. Cleanup is correct, but with 11+ calls pending concurrently, 11+ listeners stack on each event (22+ in total).
  • Full-reconnect escalation: fullReconnectOnNext = true inside the NegotiationError branch of negotiate().
  • Publish stall point: publishOrRepublishTrack — first statement awaits this.reconnectFuture?.promise.
  • Context close on disconnect: Room.disconnect() closes this.audioContext when typeof webAudioMix === 'boolean' (default), which is what breaks downstream AudioWorklet.addModule after forced restart.

Happy to share full Sentry breadcrumb dump or test against a candidate fix if it helps.
