RUST-2394 Fix a race condition deadlock in Client::shutdown#1658
RUST-2394 Fix a race condition deadlock in Client::shutdown#1658abr-egn merged 2 commits intomongodb:mainfrom
Conversation
|
cargo-deny failure is unrelated and fixed in #1659. |
There was a problem hiding this comment.
Pull request overview
Fixes a shutdown-time deadlock in SDAM topology shutdown by ensuring monitors can’t block indefinitely waiting for topology update acknowledgements after the topology worker has exited its main loop.
Changes:
- Drop the topology worker’s
update_receiverduring shutdown so monitortopology_updater.update()calls fail fast instead of waiting forever for an ack. - Add a regression test that widens the race window and asserts
Client::shutdown()completes within a timeout across repeated iterations. - Introduce a test-only
ClientOptions::TestOptionsknob to inject a small shutdown delay to reliably reproduce the previous deadlock window.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
driver/src/sdam/topology.rs |
Drops update_receiver during shutdown to prevent monitor update/ack deadlock; adds optional test-only delay to widen race window. |
driver/src/sdam/test.rs |
Adds regression test ensuring Client::shutdown() does not deadlock under streaming hello + handler overhead. |
driver/src/client/options.rs |
Adds test-only option topology_worker_shutdown_delay used by the regression test. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
BorisDog
left a comment
There was a problem hiding this comment.
Looks good, question about the new test.
| /// The fix is to drop the topology worker's update_receiver before closing monitors, | ||
| /// so that the monitor's update() call fails immediately. | ||
| #[tokio::test(flavor = "multi_thread")] | ||
| async fn shutdown_does_not_deadlock() { |
There was a problem hiding this comment.
Is there a way to reproduce this situation deterministically?
Something like:
- Wait for
updateto be called, and block it - Call
shutdownduring a blockedupdate - Unblock the
update
There was a problem hiding this comment.
Not simply or cleanly, I think. The best way I can think of to implement a version that doesn't rely on sleep timing would be to thread a testing Barrier into both the topology monitor and worker, but it would need to be conditionally enabled by the test right before the shutdown call to avoid blocking on pre-test hearbeats, and that sounds iffy to me.
In practice, the timing test here is effectively deterministic; it fails on the first iteration of the loop.
There was a problem hiding this comment.
Thanks for explaining that update can't be blocked by the event listener (driver user).
| /// The fix is to drop the topology worker's update_receiver before closing monitors, | ||
| /// so that the monitor's update() call fails immediately. | ||
| #[tokio::test(flavor = "multi_thread")] | ||
| async fn shutdown_does_not_deadlock() { |
There was a problem hiding this comment.
Thanks for explaining that update can't be blocked by the event listener (driver user).
RUST-2394
Credit for both the find and the fix to @jyemin 🙂