Skip to content

Propagate more n/w errors#987

Open
pblazej wants to merge 8 commits intomainfrom
blaze/completer-error-propagation
Open

Propagate more n/w errors#987
pblazej wants to merge 8 commits intomainfrom
blaze/completer-error-propagation

Conversation

@pblazej
Copy link
Copy Markdown
Contributor

@pblazej pblazej commented May 6, 2026

Small but important updates to AsyncCompleter(Map), allowing custom error types.

Should not collide with (upcoming) typed throws.

pblazej and others added 2 commits May 6, 2026 09:59
When the WebRTC peer connection went to .disconnected/.failed, the
transport delegate called primaryTransportConnectedCompleter.reset(),
which threw LiveKitError(.cancelled) to anyone awaiting the completer.
Consumers couldn't distinguish a network-induced disconnect from a
user-initiated cancel through room.disconnectError or
didDisconnectWithError.

- AsyncCompleter.reset(throwing:) accepts an optional Error and
  normalizes via LiveKitError.from. WaitEntry.cancel(throwing:) takes
  an optional LiveKitError, defaulting to .cancelled, so the
  withTaskCancellationHandler.onCancel path still produces .cancelled
  for real Task cancellation.
- Room+TransportDelegate now resets the transport completers with a
  typed LiveKitError(.network, message:) on PC disconnect/failed.
- Room.cleanUp / SignalClient.cleanUp thread the disconnect cause
  through to in-flight completer waiters instead of masking it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends the previous .network-typing fix to the completers that
the initial change missed:

- CompleterMapActor.reset(throwing:) mirrors AsyncCompleter; SignalClient
  now passes disconnectError to _addTrackCompleters so a sendAddTrack
  waiter caught mid-disconnect sees the underlying cause.
- DataChannelPair.reset(throwing:) + Room.cleanUpRTC(withError:) so the
  publisherDataChannel.openCompleter waiter inside Room.send(dataPacket:)
  sees .network instead of .cancelled when the room is being torn down.

Adds CompleterMapActorTests covering fan-out + default-to-cancelled
behavior, and promotes waitForRegistration to file scope so both suites
share one helper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@pblazej pblazej requested a review from hiroshihorie May 6, 2026 08:18
@pblazej pblazej changed the title Propagate more tranport errors Propagate more transport errors May 6, 2026
pblazej and others added 3 commits May 6, 2026 10:35
Closes the only completer that previously had no error-propagation path
on disconnect. Both halves of this change are needed; either alone
leaves the leak open.

- Room.cleanUp(withError:) now resets activeParticipantCompleters with
  the typed disconnectError, so any RemoteParticipant.waitUntilActive
  in flight is unblocked immediately with .network (or whatever cause)
  instead of waiting up to its full 10s timeout.

- CompleterMapActor.resume(throwing:, for:) is now a no-op when no
  completer exists for the key. cleanUpParticipants mutates each
  participant's state to .unknown, which fires a fire-and-forget
  Task { resume(throwing: .participantRemoved, ...) } from the
  state observer in Participant.swift. Without this guard, those
  Tasks land after the reset and re-populate the just-cleared map
  with stale .failure(.participantRemoved) entries, which then
  greet the next session's same-identity waitUntilActive.

  resume(returning:, for:) keeps its auto-create behavior: success
  is meaningfully sticky ("the participant is active, anyone asking
  later should know"), failure is not.

Tests refactored to a do/catch + Issue.record helper because
#expect(throws:) returns Void (not E?) on the swift-testing bundled
with Xcode 16.2. New CompleterMapActorTests cover the asymmetry
(no-op on missing key for throws, remember-success for returning,
existing-waiter still receives throws).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The compiler's "never mutated" analysis flags `weak var weakRP: ... = remoteParticipant`
because the runtime nil-out isn't user-code mutation — but its
suggested fix (`let`) won't compile, since `weak let` is illegal.

Move `weak` into the closure capture list instead. The closure now
holds the only weak reference and reads it when the leak check fires;
same semantics, no warning, fewer lines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stopwatch is a deprecated typealias for Span and split(label:) was
renamed to record(_:at:). Update PublishDeviceOptimization tests
accordingly and rename the local sw → span to match the type.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@pblazej pblazej changed the title Propagate more transport errors Propagate more n/w errors May 6, 2026
Span is a class, so the local doesn't need to be var — the var
triggers Swift's "never mutated" warning on stricter CI builds
(Xcode 26.4). Switch to let; record(...) calls work either way.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@hiroshihorie
Copy link
Copy Markdown
Member

One small concern: this now turns every .disconnected / .failed peer-connection callback into LiveKitError(.network) before checking the room state.

Could this also happen during a normal room.disconnect()? In that path we move to .disconnecting, then cleanUpRTC() closes the transports, and WebRTC may emit a disconnect/failed state as part of that teardown. If a caller is concurrently waiting on the transport completers, they could now see .network even though the disconnect was user-initiated.

Maybe we should gate this to active connection/reconnect states, or skip the typed network reset when the room is already .disconnecting / .disconnected, so the new distinction between network failures and user cancellation stays reliable.

@pblazej
Copy link
Copy Markdown
Contributor Author

pblazej commented May 7, 2026

@hiroshihorie yes, I'm afraid of wrapping here; will check your case maybe adding more tests around it.

BTW I started working on typed throws #988 to see how much damage it does.

pblazej and others added 2 commits May 7, 2026 11:15
Per @hiroshihorie's review on PR #987: a peer-connection
.disconnected/.failed callback that fires while the room is already
.disconnecting/.disconnected (i.e. as part of Transport.close() during
cleanUpRTC, before _pc.delegate = nil takes effect) would otherwise
reset the transport completers with LiveKitError(.network). A caller
that registered a wait between cleanUp's reset and the late delegate
firing would then see .network — even though the disconnect was
user-initiated.

Gate the typed-network construction on connection state: when the
room is in teardown, fall back to default .cancelled so the
distinction between network-induced disconnects and user cancellation
stays reliable.

The race is timing-dependent and not deterministic to reproduce as a
unit test (would require either a real livekit-server connection or a
Transport mock that doesn't currently exist in the test
infrastructure). Existing CompleterTests cover the underlying
mechanism on both branches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cleans up the gate added in aa99573 by reading the state once via a
small computed property on the enum.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@pblazej pblazej requested a review from xianshijing-lk as a code owner May 7, 2026 09:33
@pblazej
Copy link
Copy Markdown
Contributor Author

pblazej commented May 7, 2026

@hiroshihorie fixed in 56d4a8b

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants