Commit 72ae9b8
fix(tests,ci): stabilize shutdown-race flakes across distros
Integration-test CI has been flaky on Humble and Jazzy with a rotating
victim (demo_calibration_service, demo_brake_actuator,
demo_brake_pressure_sensor, demo_engine_temp_sensor, demo_lidar_sensor,
fault_manager_node) dying during shutdown and failing test_exit_codes.
Prior commits in this area (3adf71b, 1dc0a7e, 2f23c09 and others)
added defensive cleanup to individual demo nodes; the crashes kept
moving, which was a sign that the root cause was not in the nodes. This
change addresses the six distinct bugs that were actually responsible.
1. SIGINT-during-rclcpp-init race in all demo nodes
-----------------------------------------------------
If SIGINT fires between rclcpp::init() and the executor's guard
condition being allocated, rclcpp's default signal handler
invalidates the shared context mid-call; the in-flight rcl_service_init /
rcl_guard_condition_init call throws RCLError, the exception escapes the
bare demo main(), and std::terminate aborts the process with SIGABRT.
Fix: block SIGINT/SIGTERM with pthread_sigmask, let rclcpp::init()
install its handler, construct the node, add it to an explicit
SingleThreadedExecutor (so the guard condition is allocated while
signals are still blocked), then restore the mask. A signal queued in
that window is delivered to a fully-assembled executor and shuts the
node down normally. Applied uniformly to all 10 demo nodes. The
pre-existing std::set_terminate in long_calibration_action.cpp
documents a separate rclcpp_action mid-goal-shutdown race and is
kept.
Local repro on Humble: 3 crashes / 180 iters unpatched vs 0 / 180
patched, in the 50-200 ms post-launch SIGINT window.
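The ordering used by the fix can be sketched without rclcpp. In this minimal sketch ScopedSignalBlock, on_signal and g_shutdown_requested are hypothetical names, signal() stands in for the handler that rclcpp::init() installs, and raise(SIGINT) simulates a Ctrl-C arriving during startup. It illustrates the masking pattern only and is not the code in the demo nodes.

```cpp
#include <signal.h>

// Hypothetical stand-in for the state the real signal handler would set.
volatile sig_atomic_t g_shutdown_requested = 0;

extern "C" void on_signal(int) { g_shutdown_requested = 1; }

// Blocks SIGINT/SIGTERM for the calling thread and restores the
// previous mask when it goes out of scope.
class ScopedSignalBlock {
 public:
  ScopedSignalBlock() {
    sigset_t blocked;
    sigemptyset(&blocked);
    sigaddset(&blocked, SIGINT);
    sigaddset(&blocked, SIGTERM);
    pthread_sigmask(SIG_BLOCK, &blocked, &previous_);
  }
  ~ScopedSignalBlock() { pthread_sigmask(SIG_SETMASK, &previous_, nullptr); }
  ScopedSignalBlock(const ScopedSignalBlock&) = delete;
  ScopedSignalBlock& operator=(const ScopedSignalBlock&) = delete;

 private:
  sigset_t previous_;
};

// Returns true if a SIGINT raised during "startup" stayed pending while
// the mask was in place and was delivered once the mask was restored.
bool signal_is_deferred_until_startup_completes() {
  g_shutdown_requested = 0;
  signal(SIGINT, on_signal);  // stand-in for rclcpp::init() installing its handler
  bool seen_during_startup = false;
  {
    ScopedSignalBlock guard;
    raise(SIGINT);  // simulate Ctrl-C arriving mid-startup
    // Node construction and executor.add_node() would happen here.
    seen_during_startup = (g_shutdown_requested != 0);
  }  // mask restored here: the pending SIGINT is delivered now
  return !seen_during_startup && g_shutdown_requested != 0;
}
```

The RAII guard is a design choice for the sketch: it restores the mask on every exit path, including an exception thrown during node construction.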
2. FastRTPS discovery-teardown use-after-free
-----------------------------------------------
Humble ships FastRTPS 2.6, which has a use-after-free in its
endpoint-discovery-protocol proxy tables on peer disappearance.
When the gateway shuts down cleanly, its "writer leaving" messages
are processed by the peer nodes' FastRTPS listener thread, which segfaults
in EDP::unpairWriterProxy / PDP::removeWriterProxyData. A backtrace
captured under stress via LD_PRELOAD=libSegFault.so confirms the
crash is in libfastrtps.so, not in our code.
Fix: set RMW_IMPLEMENTATION=rmw_cyclonedds_cpp in the CI test run
environment (both the build-and-test matrix covering humble/rolling and
the jazzy-test job) and install ros-${distro}-rmw-cyclonedds-cpp in the
deps step. CycloneDDS has a more robust discovery teardown path.
Local repro: 1 crash / 6 iters with FastRTPS, 0 / 30 with CycloneDDS.
3. Unhandled context-invalid throws in fault_manager
------------------------------------------------------
RosbagCapture::get_topic_type, RosbagCapture::resolve_topics, and
SnapshotCapture::get_topic_type invoke node_->get_topic_names_and_types,
which throws std::runtime_error when the rcl context is invalidated
mid-call. In RosbagCapture the call is reachable from a discovery
retry timer that fires every 500 ms, and the race window widens under
sanitizer load to the point where the TSan CI job hits it reliably
(SIGABRT / exit -6).
The gateway's cache refresh already catches and logs this
gracefully. Apply the same approach at the three fault_manager call
sites: wrap the rcl call in try/catch and treat invalid-context as
"no topic info available right now" - callers already handle the
empty return (skip subscribe, skip snapshot, retry next tick).
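The call-site pattern described above might look roughly like the following. This is a sketch under assumptions: get_topic_type_or_empty and the query parameter are hypothetical, with query standing in for node_->get_topic_names_and_types() so that the example compiles without rclcpp. The real call sites may differ in what they catch and log.

```cpp
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Same shape as the topic-name -> type-names map returned by the graph query.
using TopicMap = std::map<std::string, std::vector<std::string>>;

// Hypothetical helper. Returns the first type of `topic`, or an empty
// string when the topic is unknown or the graph query throws.
std::string get_topic_type_or_empty(const std::function<TopicMap()>& query,
                                    const std::string& topic) {
  try {
    const TopicMap topics = query();
    const auto it = topics.find(topic);
    if (it == topics.end() || it->second.empty()) {
      return "";
    }
    return it->second.front();
  } catch (const std::runtime_error&) {
    // Context invalidated mid-call (shutdown in progress): report
    // "no topic info right now" and let the caller skip or retry.
    return "";
  }
}
```

Because the empty string is already the "unknown topic" result, the exception path needs no new handling in the callers.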
4. TSan-flagged race on RosbagCapture::post_fault_timer_
----------------------------------------------------------
TSan flags a data race on the post_fault_timer_ member
(rclcpp::TimerBase::SharedPtr):
  WRITE  on_fault_confirmed() at rosbag_capture.cpp:208
         (service callback thread)
  READ   post_fault_timer_callback() at rosbag_capture.cpp:656/658
         (main executor thread)
Both paths touch the SharedPtr without synchronisation. Even outside
TSan this is a real use-after-free waiting for the right scheduling.
Fix: introduce std::mutex post_fault_timer_mutex_ and guard every
access (on_fault_confirmed, post_fault_timer_callback, stop). Scope
kept minimal so the executor is not blocked while the service callback
creates the timer.
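A minimal sketch of the locking pattern, assuming a hypothetical Timer struct in place of rclcpp::TimerBase and a hypothetical TimerOwner class in place of RosbagCapture; only the member names post_fault_timer_ and post_fault_timer_mutex_ come from the change itself.

```cpp
#include <memory>
#include <mutex>
#include <utility>

// Hypothetical stand-in for rclcpp::TimerBase.
struct Timer {
  void cancel() { cancelled = true; }
  bool cancelled = false;
};

class TimerOwner {
 public:
  // Service-callback side: build the timer outside the lock, then
  // publish it under the lock so the critical section stays short.
  void on_fault_confirmed() {
    auto timer = std::make_shared<Timer>();
    std::lock_guard<std::mutex> lock(post_fault_timer_mutex_);
    post_fault_timer_ = std::move(timer);
  }

  // Executor side: take the pointer under the lock, then act on the
  // local copy. Returns false if no timer was pending.
  bool post_fault_timer_callback() {
    std::shared_ptr<Timer> timer;
    {
      std::lock_guard<std::mutex> lock(post_fault_timer_mutex_);
      timer.swap(post_fault_timer_);  // member is now empty
    }
    if (!timer) {
      return false;
    }
    timer->cancel();
    return true;
  }

 private:
  std::mutex post_fault_timer_mutex_;
  std::shared_ptr<Timer> post_fault_timer_;
};
```

Swapping into a local keeps the object alive for the rest of the callback even if the other thread replaces the member in the meantime.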
5. REST server start/stop race in GatewayNode
-----------------------------------------------
GatewayNode::start_rest_server() signalled "server running" on the
worker thread before calling rest_server_->start() (which is where
cpp-httplib's listen() actually binds and enters the accept loop).
For short-lived test fixtures that construct and immediately destroy
a GatewayNode (e.g. the per-test SetUp/TearDown in
test_plugin_notify_integration), the main thread can race ahead and
call stop_rest_server() before the worker has reached listen(). In
that order the stop() request is dropped, listen() subsequently
blocks forever, and the destructor's condition-variable wait hangs
until the 60s test timeout fires.
Fix: drop the server_running_/server_cv_/server_mutex_ machinery and
poll rest_server_->is_running() (which reflects cpp-httplib's actual
accept-loop state). start_rest_server now returns only once the
server is observably ready; stop_rest_server stops and joins directly.
Added RESTServer::is_running() as a passthrough to HttpServerManager.
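The start/stop ordering can be illustrated with a small model. Everything here is hypothetical: FakeRestServer imitates a server whose stop() is dropped when it arrives before listen(), and GatewayLike mirrors only the shape of start_rest_server / stop_rest_server. The sketch also assumes listen() always reaches the running state; a real implementation would need a timeout or failure path for a bind that fails.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical server model: stop() only takes effect once listen()
// has started, which reproduces the "dropped stop" behaviour.
class FakeRestServer {
 public:
  void listen() {
    running_.store(true);
    while (!stop_requested_.load()) {
      std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    running_.store(false);
  }
  void stop() {
    if (running_.load()) {
      stop_requested_.store(true);
    }
  }
  bool is_running() const { return running_.load(); }

 private:
  std::atomic<bool> running_{false};
  std::atomic<bool> stop_requested_{false};
};

class GatewayLike {
 public:
  ~GatewayLike() { stop_rest_server(); }

  // Returns only once the server is observably running, so a later
  // stop_rest_server() can never arrive before listen().
  void start_rest_server() {
    worker_ = std::thread([this] { server_.listen(); });
    while (!server_.is_running()) {
      std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
  }

  // Stops the server and joins the worker directly.
  void stop_rest_server() {
    server_.stop();
    if (worker_.joinable()) {
      worker_.join();
    }
  }

  bool is_running() const { return server_.is_running(); }

 private:
  FakeRestServer server_;
  std::thread worker_;
};
```

Polling the server's own state removes the separate flag that could disagree with it, at the cost of a short busy-wait during startup.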
6. test_cross_ecu_fanout poll accepts pre-discovery responses
---------------------------------------------------------------
test_data_include_peer_topics used `lambda d: d if d.get('items')`
as its readiness predicate. The gateway's /data response contains a
/rosout item as soon as LogManager subscribes, well before demo-node
discovery completes; the poll therefore returned immediately with only
{topic: '/rosout'} and the subsequent assertTrue(has_local) failed.
Tighten the predicate to require both a /powertrain/engine/ item and
a /chassis/brakes/ or /perception/lidar/ item before returning.
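The tightened readiness condition can be expressed as a small predicate. The actual test is Python; the version below is a C++ sketch of the same logic, with hypothetical function names and an assumed prefix match on topic names (the real predicate may match differently).

```cpp
#include <algorithm>
#include <string>
#include <vector>

// True if any topic name starts with the given prefix.
bool any_topic_with_prefix(const std::vector<std::string>& topics,
                           const std::string& prefix) {
  return std::any_of(topics.begin(), topics.end(), [&](const std::string& topic) {
    return topic.compare(0, prefix.size(), prefix) == 0;
  });
}

// Ready only when an engine topic and at least one brakes or lidar
// topic are present; a lone /rosout item no longer satisfies the poll.
bool peer_topics_ready(const std::vector<std::string>& topics) {
  const bool has_engine = any_topic_with_prefix(topics, "/powertrain/engine/");
  const bool has_brakes_or_lidar =
      any_topic_with_prefix(topics, "/chassis/brakes/") ||
      any_topic_with_prefix(topics, "/perception/lidar/");
  return has_engine && has_brakes_or_lidar;
}
```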
Verification
------------
Local stress (Humble container under 16-CPU stress-ng pressure,
LD_PRELOAD=libSegFault.so, RMW_IMPLEMENTATION=rmw_cyclonedds_cpp):
- 200 iters of test_entity_listing: 0 fails
- 30 iters of test_triggers_updates: 0 fails
- fault_manager unit + integration tests: pass
- test_plugin_notify_integration (Jazzy): 5/5 subtests pass
- test_cross_ecu_fanout (Jazzy): pass
1 parent ccbddd7 commit 72ae9b8
19 files changed
Lines changed: 332 additions & 79 deletions
File tree
- .github/workflows
- src
- ros2_medkit_fault_manager
- include/ros2_medkit_fault_manager
- src
- ros2_medkit_gateway
- include/ros2_medkit_gateway
- http
- src
- ros2_medkit_integration_tests
- demo_nodes
- test/features