Skip to content

RUST-2394 Fix a race condition deadlock in Client::shutdown#1658

Merged
abr-egn merged 2 commits intomongodb:mainfrom
abr-egn:RUST-2394/shutdown-deadlock
Apr 15, 2026
Merged

RUST-2394 Fix a race condition deadlock in Client::shutdown#1658
abr-egn merged 2 commits intomongodb:mainfrom
abr-egn:RUST-2394/shutdown-deadlock

Conversation

@abr-egn
Copy link
Copy Markdown
Contributor

@abr-egn abr-egn commented Apr 8, 2026

RUST-2394

Credit for both the find and the fix to @jyemin 🙂

@abr-egn
Copy link
Copy Markdown
Contributor Author

abr-egn commented Apr 9, 2026

cargo-deny failure is unrelated and fixed in #1659.

@abr-egn abr-egn requested review from BorisDog and jyemin April 9, 2026 08:59
@abr-egn abr-egn marked this pull request as ready for review April 9, 2026 08:59
@abr-egn abr-egn requested a review from a team as a code owner April 9, 2026 08:59
@BorisDog BorisDog requested a review from Copilot April 9, 2026 19:19
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a shutdown-time deadlock in SDAM topology shutdown by ensuring monitors can’t block indefinitely waiting for topology update acknowledgements after the topology worker has exited its main loop.

Changes:

  • Drop the topology worker’s update_receiver during shutdown so monitor topology_updater.update() calls fail fast instead of waiting forever for an ack.
  • Add a regression test that widens the race window and asserts Client::shutdown() completes within a timeout across repeated iterations.
  • Introduce a test-only ClientOptions::TestOptions knob to inject a small shutdown delay to reliably reproduce the previous deadlock window.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File Description
driver/src/sdam/topology.rs Drops update_receiver during shutdown to prevent monitor update/ack deadlock; adds optional test-only delay to widen race window.
driver/src/sdam/test.rs Adds regression test ensuring Client::shutdown() does not deadlock under streaming hello + handler overhead.
driver/src/client/options.rs Adds test-only option topology_worker_shutdown_delay used by the regression test.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Contributor

@BorisDog BorisDog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, question about the new test.

Comment thread driver/src/sdam/test.rs
/// The fix is to drop the topology worker's update_receiver before closing monitors,
/// so that the monitor's update() call fails immediately.
#[tokio::test(flavor = "multi_thread")]
async fn shutdown_does_not_deadlock() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to reproduce this situation deterministically?
Something like:

  1. Wait for update to be called, and block it
  2. Call shutdown during a blocked update
  3. Unblock the update

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not simply or cleanly, I think. The best way I can think of to implement a version that doesn't rely on sleep timing would be to thread a testing Barrier into both the topology monitor and worker, but it would need to be conditionally enabled by the test right before the shutdown call to avoid blocking on pre-test hearbeats, and that sounds iffy to me.

In practice, the timing test here is effectively deterministic; it fails on the first iteration of the loop.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for explaining that update can't be blocked by the event listener (driver user).

@abr-egn abr-egn requested a review from BorisDog April 13, 2026 18:21
Copy link
Copy Markdown
Contributor

@BorisDog BorisDog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment thread driver/src/sdam/test.rs
/// The fix is to drop the topology worker's update_receiver before closing monitors,
/// so that the monitor's update() call fails immediately.
#[tokio::test(flavor = "multi_thread")]
async fn shutdown_does_not_deadlock() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for explaining that update can't be blocked by the event listener (driver user).

@abr-egn abr-egn merged commit 81c72e5 into mongodb:main Apr 15, 2026
22 of 24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants