Avoid blocking peer connector termination in routing paths#2586
Closed
BOURBONCASK wants to merge 1 commit into
Closed
Avoid blocking peer connector termination in routing paths#2586BOURBONCASK wants to merge 1 commit into
BOURBONCASK wants to merge 1 commit into
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #2581.
This PR avoids synchronously blocking routing/control paths while marking peer connector start conditions as terminated.
In the reproducer from #2581, a peer-to-peer churn workload can eventually stall publisher
put()calls. The captured backtraces seem to show routing/control paths blocked around peer connector termination while other threads are waiting on routing table/control locks. In particular,route_declare_final()andGossip::link_states()could call:from paths that may still be handling routing state updates.
terminate_peer_connector_zid()only updatesRuntime::start_conditions()and notifies startup waiters. From what I can tell, it does not need to complete synchronously while the routing path is still active. This PR changes those calls to spawn the termination update on the runtime instead of blocking in place.This keeps the routing path from waiting on the async
StartConditionsmutex while routing/control locks may be held.Validation:
cargo check -p zenohput()calls could stall, withslowest_putreaching the 5s stall threshold after roughly one hour. This matches the issue report, where a longer run eventually entered a sustained stall after about 4 hours.slowest_putstayed around 400ms, well below the 5s stall threshold.Discussion:
This is intended as a small and focused fix. It preserves the existing behavior of terminating the peer connector start condition, but performs that update asynchronously instead of blocking the routing path.
A possibly cleaner but more invasive alternative would be to restructure the routing/gossip paths so they collect side effects such as "terminate this peer connector zid" while holding routing state, then execute those side effects only after the routing table/control locks have been released. That would preserve synchronous completion without blocking inside the locked section, but it would require broader changes across the routing trait/call structure.
I'm opening this minimal version first because it directly addresses the observed lock-chain risk with a small diff. I'd be very happy to adjust the approach if you guys prefer the more explicit "defer side effects until after locks are dropped" design.
🏷️ Label-Based Checklist
No specific label requirements detected.
Current labels: No labels
Add one of these labels to this PR to see relevant checklist items:
api-sync,breaking-change,bug,ci,dependencies,documentation,enhancement,new feature,internalThis section updates automatically when labels change.