Skip to content

Commit 02da941

Browse files
committed
fix(gateway): tighten ~GatewayNode graph-event teardown for TSan
TSan flagged a Write/Read race on the rclcpp::Event control block between the per-context graph listener thread (calling NodeGraph::notify_graph_change) and our automatic member destruction in ~GatewayNode. Reproduced today in test_gateway_node where many GatewayNode instances are created and destroyed in sequence while the per-process GraphListener keeps running. Two-pronged fix: - Explicitly cancel and reset graph_check_timer_, backstop_timer_ and graph_event_ at the end of the ~GatewayNode body, before any other member runs its destructor. This drains the timer (its [this] lambda reads graph_event_) and drops our shared_ptr while we still control the ordering, instead of relying on declaration-order destruction. - Add a TSan suppression for the residual library-side window inside rclcpp::node_interfaces::NodeGraph::notify_graph_change, matching the existing approach for other rclcpp-internal races (signal handler, logging, Context shutdown). We do not own that code path; the fix above closes our half of the race.
1 parent e845583 commit 02da941

2 files changed

Lines changed: 30 additions & 1 deletion

File tree

src/ros2_medkit_gateway/src/gateway_node.cpp

Lines changed: 20 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1537,7 +1537,26 @@ GatewayNode::~GatewayNode() {
15371537
if (config_mgr_) {
15381538
config_mgr_->shutdown();
15391539
}
1540-
// 9. Normal member destruction (managers safe - all transports stopped)
1540+
// 9. Cancel the graph timers and drop our graph_event_ reference before
1541+
// any other member destruction begins. rclcpp's graph listener thread
1542+
// holds shared_ptrs to all registered graph events (graph_events_ in
1543+
// NodeGraph) and may concurrently iterate them via notify_graph_change()
1544+
// while we're tearing down. TSan flagged a Write/Read race on the
1545+
// Event's control block between that thread and our shared_ptr release
1546+
// during automatic member destruction (see TestGatewayNode TSan run).
1547+
// Doing the reset here gives the timer a chance to drain before its
1548+
// [this]-captured graph_event_ disappears, and shrinks the window where
1549+
// the graph listener can still see a live shared_ptr from us.
1550+
if (graph_check_timer_) {
1551+
graph_check_timer_->cancel();
1552+
graph_check_timer_.reset();
1553+
}
1554+
if (backstop_timer_) {
1555+
backstop_timer_->cancel();
1556+
backstop_timer_.reset();
1557+
}
1558+
graph_event_.reset();
1559+
// 10. Normal member destruction (managers safe - all transports stopped)
15411560
}
15421561

15431562
const ThreadSafeEntityCache & GatewayNode::get_thread_safe_cache() const {

tsan_suppressions.txt

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -41,6 +41,16 @@ called_from_lib:libcpp-httplib.so
4141
# test teardown when executor threads race with node destruction.
4242
mutex:__gthread_mutex_unlock
4343

44+
# rclcpp: NodeGraph::notify_graph_change() iterates the per-context vector
45+
# of graph events (held as Event::WeakPtr) from the graph listener thread
46+
# while a node is being destroyed in another thread, racing on the Event's
47+
# shared_ptr control block. Triggered by TestGatewayNode where many nodes
48+
# are created/destroyed in sequence while the per-process GraphListener
49+
# keeps running. We already reset graph_event_ before automatic member
50+
# destruction (see ~GatewayNode); this suppression covers the remaining
51+
# library-side window we cannot tighten without changing rclcpp.
52+
race:rclcpp::node_interfaces::NodeGraph::notify_graph_change
53+
4454

4555
# rclcpp: lock-order-inversion in Context::add_on_shutdown_callback
4656
# during test TearDownTestSuite when rclcpp::shutdown() races with

0 commit comments

Comments
 (0)