@dfed are you able to validate this hypothesis? Looks like
Trying to isolate the changes between m137 and m144:

Crash 1 (ICE transport use-after-free):

```cpp
// m137 — allocator stays alive
network_thread()->BlockingCall([this] {
  TeardownDataChannelTransport_n({});
  transport_controller_.reset();
  port_allocator_->DiscardCandidatePool();  // pool discarded, allocator lives
  if (network_thread_safety_) {
    network_thread_safety_->SetNotAlive();
  }
});

// m144 — allocator destroyed
void PeerConnection::CloseOnNetworkThread() {
  network_thread()->BlockingCall([this] {
    TeardownDataChannelTransport_n(RTCError::OK());
    port_allocator_->DiscardCandidatePool();
    transport_controller_.reset();
    port_allocator_.reset();  // NEW: allocator destroyed
    network_thread_safety_->SetNotAlive();
  });
}
```

Crash 2 (AVAudioEngine deallocation assertion):

```cpp
// m137 DeleteChannel — no media_engine_ref_
context()->worker_thread()->BlockingCall([&]() {
  auto channel_to_delete = std::move(channel_);
  for (const auto& sender : senders_)
    sender->internal()->SetMediaChannel(nullptr);
  for (const auto& receiver : receivers_)
    receiver->internal()->SetMediaChannel(nullptr);
  channel_to_delete.reset();
});

// m144 DeleteChannel — releases media engine reference
context()->worker_thread()->BlockingCall([&]() {
  RTC_DCHECK_RUN_ON(context()->worker_thread());
  auto channel_to_delete = std::move(channel_);
  for (const auto& sender : senders_)
    sender->internal()->SetMediaChannel(nullptr);
  for (const auto& receiver : receivers_)
    receiver->internal()->SetMediaChannel(nullptr);
  channel_to_delete.reset();
  media_engine_ref_.reset();  // NEW: can trigger audio engine teardown
});
```

Both are fixed in m146 by commit
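For intuition on why the new port_allocator_.reset() is only safe when everything referencing the allocator has already been detached, here is a toy sketch of the hazard class. This is not WebRTC code; the types and names are made up purely to illustrate a non-owning pointer outliving its owner.

```cpp
#include <iostream>
#include <memory>

// Toy stand-ins for allocator-owned resources and a transport that refers to them.
struct ToyAllocator {
  int pooled_port = 3478;
};

struct ToyTransport {
  ToyAllocator* allocator = nullptr;  // non-owning

  void UseDuringTeardown() {
    // Undefined behavior if the allocator was already destroyed.
    std::cout << allocator->pooled_port << "\n";
  }
};

int main() {
  auto allocator = std::make_unique<ToyAllocator>();
  ToyTransport transport{allocator.get()};

  // m137-style close: the allocator stays alive, so a transport that was never
  // fully detached still points at valid memory.
  // m144-style close: the allocator is destroyed during Close()...
  allocator.reset();

  // ...and any not-yet-detached transport now reads freed memory (crash 1's pattern).
  transport.UseDuringTeardown();
  return 0;
}
```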
@pblazej not sure how I would validate this. Seems plausible to me. But I'm not manually calling any of these APIs. What specifically can I help validate?

@dfed I mean - are you able to test/deploy the fix in production?

got it @pblazej. the crash isn't frequent enough for me to reproduce locally, and shipping a release build from a PR branch isn't something we usually do.

@dfed what's the repro rate from your metrics? Is it on the order of 0.1%-0.001%?

@hiroshihorie I tried to re-create the decision chain behind removing tracks, and I think it's just an unfortunate side effect of WebRTC internal changes (similar to the previous small regressions). Wdyt about removing that tentatively?

These two new issues combined affect 0.015% of sessions. They are our second and fourth highest crashers.
removeTrack nulls sender tracks and changes transceiver directions, causing PeerConnection::Close() to skip ClearSend/DetachTrack in its StopTransceiverProcedure and hit edge cases in the worker-thread teardown (ICE use-after-free, AVAudioEngine deallocation assertion). Close() handles full cleanup on its own. The loop was originally commented out as "not required?" and was accidentally uncommented during a threading refactor (101e09d).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed from 82a786c to 2e0b7f4
SymbolicatedCrashes.zip
Calling removeTrack for all senders before close() puts WebRTC into a partially-cleaned state that causes Close() to follow a different teardown path, crashing on the worker thread.
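At the call site, the change amounts to dropping the removeTrack loop and letting Close() do the teardown. The sketch below expresses that against the WebRTC native C++ API for illustration only; the function names CloseBefore/CloseAfter and the pc parameter are made up, and the real SDK code differs.

```cpp
#include "api/peer_connection_interface.h"

// Old shape: detach every sender's track, then close.
void CloseBefore(webrtc::PeerConnectionInterface* pc) {
  for (const auto& sender : pc->GetSenders()) {
    pc->RemoveTrackOrError(sender);  // nulls the track, flips transceiver direction
  }
  pc->Close();  // now runs against a partially-cleaned sender state
}

// New shape: Close() performs the full cleanup on its own.
void CloseAfter(webrtc::PeerConnectionInterface* pc) {
  pc->Close();
}
```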
RemoveTrackOrError (pc/peer_connection.cc:975) nulls the sender track and changes transceiver direction. SetTrack(nullptr) triggers ClearSend() (pc/rtp_sender.cc:494) because prev_can_send_track was true.
Later, Close() (pc/peer_connection.cc:1876) runs StopTransceiverProcedure → sender->Stop() (pc/rtp_sender.cc:562).

Without removeTrack, Stop() runs the full path: DetachTrack → ClearSend → RemoveTrackFromStats. With it, all three are skipped.
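A self-contained sketch of that gating, covering both the SetTrack(nullptr) step and the later Stop() step. This is not the actual rtp_sender.cc source; the ToySender type, its members, and the can_send_track() guard are simplifications of the behavior described above.

```cpp
#include <iostream>

// Toy model of the sender state relevant to the teardown ordering.
struct ToySender {
  bool has_track = true;   // cleared by SetTrack(nullptr)
  bool sending = true;     // cleared by ClearSend()
  bool in_stats = true;    // cleared by RemoveTrackFromStats()

  bool can_send_track() const { return has_track; }

  // removeTrack path: detach the track and stop sending immediately.
  void SetTrack(bool attach) {
    const bool prev_can_send_track = can_send_track();
    has_track = attach;
    if (prev_can_send_track && !can_send_track()) {
      sending = false;  // ClearSend() runs here, ahead of Close()
    }
  }

  // Close() path: the full DetachTrack/ClearSend/RemoveTrackFromStats sequence
  // only runs if the track is still attached.
  void Stop() {
    if (can_send_track()) {
      has_track = false;  // DetachTrack()
      sending = false;    // ClearSend()
      in_stats = false;   // RemoveTrackFromStats()
    }
    // With removeTrack already called, all three are skipped and the sender
    // reaches DeleteChannel with this state never cleaned up.
  }
};

int main() {
  ToySender with_remove_track;
  with_remove_track.SetTrack(false);  // app calls removeTrack before close()
  with_remove_track.Stop();           // full cleanup skipped
  std::cout << "still in stats after Stop(): " << with_remove_track.in_stats << "\n";  // 1

  ToySender close_only;
  close_only.Stop();                  // full path runs
  std::cout << "still in stats after Stop(): " << close_only.in_stats << "\n";         // 0
  return 0;
}
```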
The sender enters DeleteChannel in a partially-cleaned state. DeleteChannel (pc/rtp_transceiver.cc:385) then destroys the channel on the worker thread.
The channel destructor encountering the modified state leads to crash 1 (freed Port/socket resources accessed during ICE transport teardown) and crash 2 (inconsistent audio engine state → assertion in buffer deallocation during [AVAudioEngine stop]).
Both removeTrack and close() go through PeerConnectionProxy (pc/proxy.h), which marshals them synchronously to the signaling thread — no threading race between them. The issue is the modified internal state, not concurrency.
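To make the "marshals synchronously" point concrete, here is a simplified sketch of what the proxy layer does. pc/proxy.h generates this forwarding via macros; the class below is a hand-written approximation, not the real implementation, and SignalingThreadProxySketch is an invented name.

```cpp
#include "api/peer_connection_interface.h"
#include "rtc_base/thread.h"

// Hand-written approximation of the proxy forwarding described above.
class SignalingThreadProxySketch {
 public:
  SignalingThreadProxySketch(rtc::Thread* signaling_thread,
                             webrtc::PeerConnectionInterface* pc)
      : signaling_thread_(signaling_thread), pc_(pc) {}

  // Both RemoveTrackOrError and Close are forwarded like this: the caller
  // blocks until the signaling thread has run the call, so two consecutive
  // calls from the app can never overlap. The ordering of the calls, not a
  // data race, is what leaves the sender in the partially-cleaned state.
  void Close() {
    signaling_thread_->BlockingCall([this] { pc_->Close(); });
  }

 private:
  rtc::Thread* signaling_thread_;
  webrtc::PeerConnectionInterface* pc_;
};
```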