Skip to content

Avoid blocking peer connector termination in routing paths#2586

Closed
BOURBONCASK wants to merge 1 commit into
eclipse-zenoh:mainfrom
BOURBONCASK:fix/avoid-routing-block-in-peer-connector-termination
Closed

Avoid blocking peer connector termination in routing paths#2586
BOURBONCASK wants to merge 1 commit into
eclipse-zenoh:mainfrom
BOURBONCASK:fix/avoid-routing-block-in-peer-connector-termination

Conversation

@BOURBONCASK
Copy link
Copy Markdown

@BOURBONCASK BOURBONCASK commented Apr 24, 2026

Fixes #2581.

This PR avoids synchronously blocking routing/control paths while marking peer connector start conditions as terminated.

In the reproducer from #2581, a peer-to-peer churn workload can eventually stall publisher put() calls. The captured backtraces seem to show routing/control paths blocked around peer connector termination while other threads are waiting on routing table/control locks. In particular, route_declare_final() and Gossip::link_states() could call:

ZRuntime::Net.block_in_place(...terminate_peer_connector_zid(...))

from paths that may still be handling routing state updates.

terminate_peer_connector_zid() only updates Runtime::start_conditions() and notifies startup waiters. From what I can tell, it does not need to complete synchronously while the routing path is still active. This PR changes those calls to spawn the termination update on the runtime instead of blocking in place.

This keeps the routing path from waiting on the async StartConditions mutex while routing/control locks may be held.

Validation:

  • cargo check -p zenoh
  • Reproducer from Persistent routing lock stall/deadlock under peer churn #2581:
    • Before this fix: publisher put() calls could stall, with slowest_put reaching the 5s stall threshold after roughly one hour. This matches the issue report, where a longer run eventually entered a sustained stall after about 4 hours.
    • With this fix: the same stress workload ran for about 10 hours without stalling. The observed slowest_put stayed around 400ms, well below the 5s stall threshold.

Discussion:

This is intended as a small and focused fix. It preserves the existing behavior of terminating the peer connector start condition, but performs that update asynchronously instead of blocking the routing path.

A possibly cleaner but more invasive alternative would be to restructure the routing/gossip paths so they collect side effects such as "terminate this peer connector zid" while holding routing state, then execute those side effects only after the routing table/control locks have been released. That would preserve synchronous completion without blocking inside the locked section, but it would require broader changes across the routing trait/call structure.

I'm opening this minimal version first because it directly addresses the observed lock-chain risk with a small diff. I'd be very happy to adjust the approach if you guys prefer the more explicit "defer side effects until after locks are dropped" design.


🏷️ Label-Based Checklist

No specific label requirements detected.

Current labels: No labels

Add one of these labels to this PR to see relevant checklist items: api-sync, breaking-change, bug, ci, dependencies, documentation, enhancement, new feature, internal

This section updates automatically when labels change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Persistent routing lock stall/deadlock under peer churn

1 participant