perf(mavlink): replace forward_message mutex with atomic reader counter #27076
dakejahl wants to merge 1 commit into
Conversation
Use a lightweight atomic reader counter instead of the global `mavlink_module_mutex` in `forward_message()`. Deletion paths null the array entry, then spin-wait for the counter to drain before freeing. Fixes a use-after-free race in `stop_command()` from #26927, where an instance could be freed while another receiver thread held a pointer to it.

Signed-off-by: Jacob Dahl <dahl.jakejacob@gmail.com>
Pull request overview
Reworks MAVLink instance forwarding to remove `mavlink_module_mutex` from the `forward_message()` hot path by introducing an atomic "in-flight forwarders" counter, and updates instance teardown paths to wait for in-flight forwards before freeing instances (aiming to fix the stop/delete use-after-free race from #26927).
Changes:
- Add a global atomic `mavlink_forward_msg_active` counter for tracking active `forward_message()` iterations.
- Remove the global mutex from `Mavlink::forward_message()` and use the counter instead.
- Update `start_helper()` and `stop_command()` teardown to null out the instance slot under `mavlink_module_mutex`, then wait for the counter to drain before `delete`.
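For concreteness, here is a minimal sketch of the scheme described above, with `std::atomic` standing in for PX4's atomic type; the names follow the PR description, but the real code differs in detail:

```cpp
#include <atomic>
#include <chrono>
#include <thread>

struct Mavlink { /* stand-in for the real class */ };

static std::atomic<int> mavlink_forward_msg_active{0}; // in-flight forward_message() calls
static Mavlink *mavlink_module_instances[4] {};        // simplified instance array

// Hot path: count this reader in, iterate lock-free, count it out.
void forward_message_sketch(Mavlink *self)
{
	mavlink_forward_msg_active.fetch_add(1);

	for (Mavlink *inst : mavlink_module_instances) {
		if (inst && inst != self) {
			// ... forward the message to inst ...
		}
	}

	mavlink_forward_msg_active.fetch_sub(1);
}

// Teardown: null the slot (under mavlink_module_mutex in the real code),
// then spin-wait for in-flight forwarders to drain before freeing.
void teardown_sketch(unsigned idx)
{
	Mavlink *inst = mavlink_module_instances[idx];
	mavlink_module_instances[idx] = nullptr;

	while (mavlink_forward_msg_active.load() > 0) {
		std::this_thread::sleep_for(std::chrono::microseconds(100)); // px4_usleep(100) in PX4
	}

	delete inst;
}
```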
Comments suppressed due to low confidence (1)
src/modules/mavlink/mavlink_main.cpp:524
`mavlink_module_instances` is a plain `Mavlink*` array, but `forward_message()` now iterates over it without `mavlink_module_mutex` while other threads can concurrently write `nullptr` into the array under the mutex (e.g. `stop_command()` / `start_helper()`). This is a C++ data race/UB regardless of pointer-size atomicity on the CPU. To make this safe, make the array entries atomic (e.g. `px4::atomic<Mavlink*>` with `.load()`/`.store()`), or otherwise ensure readers and writers synchronize via the same mechanism.
```cpp
mavlink_forward_msg_active.fetch_add(1);

for (Mavlink *inst : mavlink_module_instances) {
	if (inst && (inst != self) && (inst->get_forwarding_on())) {
		// Pass message only if target component was seen before
```
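A rough sketch of the suggested fix (hypothetical names, `std::atomic` standing in for `px4::atomic`); note that atomic slots only remove the data race on the pointer read, the drain-before-delete step is still needed before the instance can be freed:

```cpp
#include <atomic>

struct Mavlink { /* stand-in */ };

// Atomic slots: lock-free reader loads and mutex-holding writer stores
// now synchronize on the same object, so the read is no longer UB.
static std::atomic<Mavlink *> instances_atomic[4] {};

void reader_sketch(Mavlink *self)
{
	for (auto &slot : instances_atomic) {
		Mavlink *inst = slot.load(std::memory_order_acquire);

		if (inst && inst != self) {
			// ... non-racy pointer read; forwarding logic goes here ...
		}
	}
}

void remove_sketch(unsigned idx)
{
	// Publish removal; pairs with the acquire load above.
	instances_atomic[idx].store(nullptr, std::memory_order_release);
	// ... then wait for in-flight readers before delete, as elsewhere in the PR ...
}
```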
```cpp
// Wait for any in-flight forward_message() to finish before freeing
while (mavlink_forward_msg_active.load() > 0) {
	px4_usleep(100);
}
```
The global reader counter is waited on in deletion paths, but nothing prevents new `forward_message()` calls from other still-running MAVLink instances from incrementing `mavlink_forward_msg_active` while this loop is waiting. With continuous traffic this can starve and potentially block thread cleanup indefinitely. Consider adding a writer gate (e.g. a global "forwarding blocked" flag checked before incrementing the counter), or switch to a per-instance refcount so you only wait for readers that could still hold a pointer to the instance being deleted; also consider adding a bounded timeout/error path to avoid infinite waits.
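One possible shape for the suggested writer gate (a hypothetical sketch, not code from the PR):

```cpp
#include <atomic>
#include <chrono>
#include <thread>

static std::atomic<bool> forwarding_blocked{false}; // writer gate
static std::atomic<int>  forward_msg_active{0};     // reader counter

// Reader entry: refuse to start a new forward while a deletion is draining.
bool try_enter_forward()
{
	if (forwarding_blocked.load()) {
		return false;
	}

	forward_msg_active.fetch_add(1);

	// Re-check after incrementing so a gate set between the first
	// check and the increment cannot be missed.
	if (forwarding_blocked.load()) {
		forward_msg_active.fetch_sub(1);
		return false;
	}

	return true;
}

// Writer: close the gate, drain existing readers, then freeing is safe.
void gated_delete_sketch()
{
	forwarding_blocked.store(true);

	while (forward_msg_active.load() > 0) {
		std::this_thread::sleep_for(std::chrono::microseconds(100));
	}

	// ... delete the instance ...
	forwarding_blocked.store(false);
}
```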
```cpp
// Wait for any in-flight forward_message() to finish before freeing
while (mavlink_forward_msg_active.load() > 0) {
	px4_usleep(100);
}
```
Same starvation risk as in `start_helper()`: this waits for the global `mavlink_forward_msg_active` to reach 0 while other MAVLink receiver threads can continue calling `forward_message()` and keep the counter non-zero. This can make `mavlink stop` hang under sustained traffic. Use a mechanism that prevents new readers from entering while waiting (writer gate), or a per-instance reader count that drains once the instance is removed from `mavlink_module_instances`, and consider a timeout to avoid blocking indefinitely.
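A per-instance refcount could look roughly like the sketch below (hypothetical names, not from the PR). The gap between loading a slot pointer and pinning the instance is exactly what hazard-pointer schemes exist to close, so a real design would still need to cover that window:

```cpp
#include <atomic>
#include <chrono>
#include <thread>

struct Mavlink {
	std::atomic<int> in_use{0}; // readers currently forwarding to this instance
	// ...
};

// Reader: pin only the instance actually being written to.
// NOTE: the window between loading the slot pointer and this fetch_add
// must be covered by some other mechanism (hazard pointers, or keeping
// the global counter just for that short window).
void forward_to(Mavlink *inst)
{
	inst->in_use.fetch_add(1);
	// ... write the message to inst ...
	inst->in_use.fetch_sub(1);
}

// Teardown: the slot was already nulled under mavlink_module_mutex, so no
// new pins start; only this instance's readers are waited on, not all traffic.
void teardown_sketch(Mavlink *inst)
{
	while (inst->in_use.load() > 0) {
		std::this_thread::sleep_for(std::chrono::microseconds(100));
	}

	delete inst;
}
```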
```diff
@@ -523,6 +527,8 @@ Mavlink::forward_message(const mavlink_message_t *msg, Mavlink *self)
 			}
 		}
 	}
+
+	mavlink_forward_msg_active.fetch_sub(1);
```
`mavlink_forward_msg_active.fetch_add(1)` / `fetch_sub(1)` is manually paired; if future edits add early returns or error paths inside the loop, it's easy to miss the decrement and wedge the stop/start wait loops. Consider using an RAII scope guard so the counter is decremented on all exits from `forward_message()`.
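A minimal sketch of such a guard (hypothetical, not in the PR):

```cpp
#include <atomic>

// RAII guard: increments the counter on construction, decrements on every
// exit path (normal return, early return, exception), so the pairing can
// no longer be broken by future edits.
class ForwardScopeGuard
{
public:
	explicit ForwardScopeGuard(std::atomic<int> &counter) : _counter(counter)
	{
		_counter.fetch_add(1);
	}

	~ForwardScopeGuard()
	{
		_counter.fetch_sub(1);
	}

	// Non-copyable: one increment must pair with exactly one decrement.
	ForwardScopeGuard(const ForwardScopeGuard &) = delete;
	ForwardScopeGuard &operator=(const ForwardScopeGuard &) = delete;

private:
	std::atomic<int> &_counter;
};

// Usage inside forward_message():
//   ForwardScopeGuard guard{mavlink_forward_msg_active};
//   ... any early return now still decrements the counter ...
```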
🔎 FLASH Analysis
px4_fmu-v5x [Total VM Diff: 152 byte (0.01 %)]
px4_fmu-v6x [Total VM Diff: 88 byte (0 %)]
Updated: 2026-04-13T17:19:12
```cpp
// Wait for any in-flight forward_message() to finish before freeing
while (mavlink_forward_msg_active.load() > 0) {
	px4_usleep(100);
}
```
This wait loop has no timeout and can block `mavlink stop` indefinitely if `mavlink_forward_msg_active` never reaches 0 (e.g. continuous forwarding load across other instances). Please add a bounded wait with a clear error path, and consider waiting only for references to this specific `inst` (per-instance refcount / hazard-pointer style) rather than a global counter.
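A bounded wait might look roughly like this (hypothetical sketch; `std::this_thread::sleep_for` stands in for `px4_usleep`):

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

static std::atomic<int> mavlink_forward_msg_active{0};

// Returns true once the counter drained, false after timeout_ms of waiting,
// so the stop path can report an error instead of hanging forever.
bool wait_for_forwarders(unsigned timeout_ms)
{
	unsigned waited_us = 0;

	while (mavlink_forward_msg_active.load() > 0) {
		if (waited_us >= timeout_ms * 1000u) {
			fprintf(stderr, "mavlink stop: timed out waiting for in-flight forwards\n");
			return false; // caller decides: retry, abort the stop, or leak rather than free
		}

		std::this_thread::sleep_for(std::chrono::microseconds(100));
		waited_us += 100;
	}

	return true;
}
```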
```diff
@@ -523,6 +527,8 @@ Mavlink::forward_message(const mavlink_message_t *msg, Mavlink *self)
 			}
 		}
 	}
```
`forward_message()` now iterates over `mavlink_module_instances` without holding `mavlink_module_mutex`, while other code paths (e.g. `stop_command()` / `start_helper()`) still write `mavlink_module_instances[...] = nullptr` under that mutex. Because the array elements are plain `Mavlink*`, this introduces a C++ data race (UB) between unsynchronized reads and writes even if the pointer write is naturally atomic on the CPU. Consider switching `mavlink_module_instances` to an atomic pointer array (e.g. `px4::atomic<Mavlink*>`) and using atomic load/store (or atomic exchange on removal), or otherwise ensure readers and writers use the same synchronization primitive.
```cpp
while (mavlink_forward_msg_active.load() > 0) {
	px4_usleep(100);
}
```
It seems like we are replacing the mutex with a spinlock which in my experience isn't a great performance trade-off. Do you have any benchmark results?
Also the complexity has increased significantly and one can easily cause critical bugs when working in this module. The syncing has to be done carefully and it's all tied to manual checks and loops.
I haven't benchmarked it, and yeah, I'm not thrilled with the read counter and spin lock... maybe we can resolve this without needing that or the mutex. Need to think about it more.
The issue is that the mutex is in the hot path, whereas the spin lock is only on `stop()`, which is rare.
In my opinion, this is an architectural issue and needs a broader redesign. Do we need to fix that urgently, or can it wait for the mavlink redesign work?
@onur-ozkan It can wait 👍
Summary
- Replaces the global `mavlink_module_mutex` lock in `forward_message()` with a lightweight atomic reader counter (`mavlink_forward_msg_active`), removing per-message lock overhead from the forwarding hot path
- Teardown paths (`stop_command`, `start_helper`) null the array entry under the existing mutex, then spin-wait for the counter to drain before freeing the instance
- Fixes the use-after-free race in `stop_command()` identified in "perf(mavlink): remove mutex from forward_message hot path" #26927, where an instance could be freed while another receiver thread's `forward_message()` still held a pointer to it