Skip to content

Fix m144 teardown crashes#967

Open
pblazej wants to merge 1 commit intomainfrom
blaze/fix-remove-track-crashes
Open

Fix m144 teardown crashes#967
pblazej wants to merge 1 commit intomainfrom
blaze/fix-remove-track-crashes

Conversation

@pblazej
Copy link
Copy Markdown
Contributor

@pblazej pblazej commented Apr 14, 2026

SymbolicatedCrashes.zip

Calling removeTrack for all senders before close() puts WebRTC into a
partially-cleaned state that causes Close() to follow a different teardown
path, crashing on the worker thread.

RemoveTrackOrError (pc/peer_connection.cc:975) nulls the sender track and
changes transceiver direction:

sender->SetTrack(nullptr);                      // nulls track
transceiver->internal()->set_direction(...);     // kSendRecv → kRecvOnly
sdp_handler_->UpdateNegotiationNeeded();

SetTrack(nullptr) triggers ClearSend() (pc/rtp_sender.cc:494) because
prev_can_send_track was true:

track_ = nullptr;
if (can_send_track()) {         // false — track_ is null
    SetSend();
} else if (prev_can_send_track) {  // true
    ClearSend();                // disables audio sending on worker thread
}
void AudioRtpSender::ClearSend() {
    bool success = worker_thread_->BlockingCall([&] {
        return voice_media_channel()->SetAudioSend(ssrc_, false, &options, nullptr);
    });
}

Later, Close() (pc/peer_connection.cc:1876) runs StopTransceiverProcedure
sender->Stop() (pc/rtp_sender.cc:562):

void RtpSenderBase::Stop() {
    if (stopped_) return;
    if (track_) {               // NULL — removeTrack already set it
        DetachTrack();          // SKIPPED
        track_->UnregisterObserver(this);  // SKIPPED
    }
    if (can_send_track()) {     // FALSE — track_ is null
        ClearSend();            // SKIPPED
        RemoveTrackFromStats(); // SKIPPED
    }
    media_channel_ = nullptr;
    stopped_ = true;
}

Without removeTrack, Stop() runs the full path: DetachTrack
ClearSendRemoveTrackFromStats. With it, all three are skipped. The
sender enters DeleteChannel in a partially-cleaned state.

DeleteChannel (pc/rtp_transceiver.cc:385) then destroys the channel on the
worker thread:

void RtpTransceiver::DeleteChannel() {
    context()->worker_thread()->BlockingCall([&]() {
        auto channel_to_delete = std::move(channel_);
        for (auto& sender : senders_)
            sender->internal()->SetMediaChannel(nullptr);
        for (auto& receiver : receivers_)
            receiver->internal()->SetMediaChannel(nullptr);
        channel_to_delete.reset();  // channel destroyed → [AVAudioEngine stop]
        media_engine_ref_.reset();
    });
}

The channel destructor encountering the modified state leads to crash 1 (freed
Port/socket resources accessed during ICE transport teardown) and crash 2
(inconsistent audio engine state → assertion in buffer deallocation during
[AVAudioEngine stop]).

Both removeTrack and close() go through PeerConnectionProxy
(pc/proxy.h), which marshals them synchronously to the signaling thread — no
threading race between them. The issue is the modified internal state, not
concurrency.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 14, 2026

⚠️ This PR does not contain any files in the .changes directory.

@pblazej
Copy link
Copy Markdown
Contributor Author

pblazej commented Apr 14, 2026

@dfed are you able to validate this hypothesis?

Looks like removeTrack() was just tentative.

@pblazej pblazej marked this pull request as draft April 14, 2026 08:50
@pblazej
Copy link
Copy Markdown
Contributor Author

pblazej commented Apr 14, 2026

Trying to isolate the changes between m137 vs m144:

Crash 1 (ICE transport use-after-free): CloseOnNetworkThread() in m144 added
port_allocator_.reset() — the port allocator is now destroyed during
Close(). In m137, the allocator survived Close() and was only freed when
the PeerConnection was deallocated. If the worker thread has any in-flight
messages referencing Port objects (owned by the allocator), they now access
freed memory.

// m137 — allocator stays alive
network_thread()->BlockingCall([this] {
    TeardownDataChannelTransport_n({});
    transport_controller_.reset();
    port_allocator_->DiscardCandidatePool();  // pool discarded, allocator lives
    if (network_thread_safety_) {
        network_thread_safety_->SetNotAlive();
    }
});

// m144 — allocator destroyed
void PeerConnection::CloseOnNetworkThread() {
    network_thread()->BlockingCall([this] {
        TeardownDataChannelTransport_n(RTCError::OK());
        port_allocator_->DiscardCandidatePool();
        transport_controller_.reset();
        port_allocator_.reset();              // NEW: allocator destroyed
        network_thread_safety_->SetNotAlive();
    });
}

Crash 2 (AVAudioEngine deallocation assertion): DeleteChannel() in m144
added media_engine_ref_.reset() after destroying the channel. This
MediaEngineReference is new in m144 (commit 3d7137f54f, "Introduce
MediaEngineReference for managing access on the worker thread"). Releasing it
on the worker thread during channel destruction can drop the last reference to
the media engine, triggering audio engine cleanup that cascades into
[AVAudioEngine stop] → buffer deallocation assertion. In m137 this reference
didn't exist — media engine lifetime was managed at a higher level.

// m137 DeleteChannel — no media_engine_ref_
context()->worker_thread()->BlockingCall([&]() {
    auto channel_to_delete = std::move(channel_);
    for (const auto& sender : senders_)
        sender->internal()->SetMediaChannel(nullptr);
    for (const auto& receiver : receivers_)
        receiver->internal()->SetMediaChannel(nullptr);
    channel_to_delete.reset();
});

// m144 DeleteChannel — releases media engine reference
context()->worker_thread()->BlockingCall([&]() {
    RTC_DCHECK_RUN_ON(context()->worker_thread());
    auto channel_to_delete = std::move(channel_);
    for (const auto& sender : senders_)
        sender->internal()->SetMediaChannel(nullptr);
    for (const auto& receiver : receivers_)
        receiver->internal()->SetMediaChannel(nullptr);
    channel_to_delete.reset();
    media_engine_ref_.reset();  // NEW: can trigger audio engine teardown
});

Both are fixed in m146 by commit 660fb3fc4d ("Clear media channel pointer
for RtpSender in Stop()"), which moves ClearSend and SetMediaChannel(nullptr)
into a single worker_thread_->BlockingCall in RtpSenderBase::Stop(),
ensuring proper synchronization between sender teardown and channel destruction.

@pblazej pblazej changed the title Fix teardown crashes Fix m144 teardown crashes Apr 14, 2026
@dfed
Copy link
Copy Markdown
Contributor

dfed commented Apr 14, 2026

@pblazej not sure how I would validate this. Seems plausible to me. But I'm not manually calling any of these APIs. What specifically can I help validate?

@pblazej
Copy link
Copy Markdown
Contributor Author

pblazej commented Apr 15, 2026

@dfed I mean - are you able to test/deploy the fix in production?

@dfed
Copy link
Copy Markdown
Contributor

dfed commented Apr 16, 2026

got it @pblazej. the crash isn't frequent enough for me to reproduce locally, and shipping a release build from a PR branch isn't something we usually do.

@pblazej
Copy link
Copy Markdown
Contributor Author

pblazej commented Apr 16, 2026

@dfed what's the repro rate from your metrics? Is it in the order of 0.1%-0.001%?

@pblazej
Copy link
Copy Markdown
Contributor Author

pblazej commented Apr 16, 2026

@hiroshihorie I tried to re-create the decision chain behind removing tracks, I think that's just an unfortunate side effect of WebRTC internal changes (similarly to the previous little regressions).

Wdyt about removing that tentatively?

@dfed
Copy link
Copy Markdown
Contributor

dfed commented Apr 16, 2026

what's the repro rate from your metrics? Is it in the order of 0.1%-0.001%?

These two new issues combined affect 0.015% of sessions. They are our second and fourth highest crasher.

@pblazej pblazej marked this pull request as ready for review April 21, 2026 09:40
@pblazej pblazej requested a review from hiroshihorie April 21, 2026 09:40
removeTrack nulls sender tracks and changes transceiver directions,
causing PeerConnection::Close() to skip ClearSend/DetachTrack in its
StopTransceiverProcedure and hit edge cases in the worker-thread
teardown (ICE use-after-free, AVAudioEngine deallocation assertion).
Close() handles full cleanup on its own. The loop was originally
commented out as "not required?" and was accidentally uncommented
during a threading refactor (101e09d).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@pblazej pblazej force-pushed the blaze/fix-remove-track-crashes branch from 82a786c to 2e0b7f4 Compare April 21, 2026 09:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants