Skip to content

perf(mavlink): replace forward_message mutex with atomic reader counter#27076

Closed
dakejahl wants to merge 1 commit into
mainfrom
dakejahl/mavlink-remove-forward-mutex
Closed

perf(mavlink): replace forward_message mutex with atomic reader counter#27076
dakejahl wants to merge 1 commit into
mainfrom
dakejahl/mavlink-remove-forward-mutex

Conversation

@dakejahl
Copy link
Copy Markdown
Contributor

Summary

  • Replace the global mavlink_module_mutex lock in forward_message() with a lightweight atomic reader counter (mavlink_forward_msg_active), removing per-message lock overhead from the forwarding hot path
  • Deletion paths (stop_command, start_helper) null the array entry under the existing mutex, then spin-wait for the counter to drain before freeing the instance
  • Fixes the use-after-free race in stop_command() identified in perf(mavlink): remove mutex from forward_message hot path #26927, where an instance could be freed while another receiver thread's forward_message() still held a pointer to it

Use a lightweight atomic reader counter instead of the global
mavlink_module_mutex in forward_message(). Deletion paths null the array
entry then spin-wait for the counter to drain before freeing.

Fixes a use-after-free race in stop_command() from #26927 where an
instance could be freed while another receiver thread held a pointer.

Signed-off-by: Jacob Dahl <dahl.jakejacob@gmail.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Reworks MAVLink instance forwarding to remove mavlink_module_mutex from the forward_message() hot path by introducing an atomic “in-flight forwarders” counter, and updates instance teardown paths to wait for in-flight forwards before freeing instances (aiming to fix the stop/delete use-after-free race from #26927).

Changes:

  • Add global atomic mavlink_forward_msg_active counter for tracking active forward_message() iterations.
  • Remove global mutex from Mavlink::forward_message() and use the counter instead.
  • Update start_helper() and stop_command() teardown to null out the instance slot under mavlink_module_mutex, then wait for the counter to drain before delete.
Comments suppressed due to low confidence (1)

src/modules/mavlink/mavlink_main.cpp:524

  • mavlink_module_instances is a plain Mavlink* array, but forward_message() now iterates over it without mavlink_module_mutex while other threads can concurrently write nullptr into the array under the mutex (eg stop_command() / start_helper()). This is a C++ data race/UB regardless of pointer-size atomicity on the CPU. To make this safe, make the array entries atomic (eg px4::atomic<Mavlink*> with .load()/.store()), or otherwise ensure readers and writers synchronize via the same mechanism.
	mavlink_forward_msg_active.fetch_add(1);

	for (Mavlink *inst : mavlink_module_instances) {
		if (inst && (inst != self) && (inst->get_forwarding_on())) {
			// Pass message only if target component was seen before

Comment on lines +2989 to 2992
// Wait for any in-flight forward_message() to finish before freeing
while (mavlink_forward_msg_active.load() > 0) {
px4_usleep(100);
}
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The global reader counter is waited on in deletion paths, but nothing prevents new forward_message() calls from other still-running MAVLink instances from incrementing mavlink_forward_msg_active while this loop is waiting. With continuous traffic this can starve and potentially block thread cleanup indefinitely. Consider adding a writer gate (eg a global "forwarding blocked" flag checked before incrementing the counter), or switch to a per-instance refcount so you only wait for readers that could still hold a pointer to the instance being deleted; also consider adding a bounded timeout/error path to avoid infinite waits.

Copilot uses AI. Check for mistakes.
Comment on lines +3317 to +3320
// Wait for any in-flight forward_message() to finish before freeing
while (mavlink_forward_msg_active.load() > 0) {
px4_usleep(100);
}
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same starvation risk as in start_helper(): this waits for the global mavlink_forward_msg_active to reach 0 while other MAVLink receiver threads can continue calling forward_message() and keep the counter non-zero. This can make mavlink stop hang under sustained traffic. Use a mechanism that prevents new readers from entering while waiting (writer gate), or a per-instance reader count that drains once the instance is removed from mavlink_module_instances, and consider a timeout to avoid blocking indefinitely.

Copilot uses AI. Check for mistakes.
Comment on lines 517 to +531
@@ -523,6 +527,8 @@ Mavlink::forward_message(const mavlink_message_t *msg, Mavlink *self)
}
}
}

mavlink_forward_msg_active.fetch_sub(1);
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mavlink_forward_msg_active.fetch_add(1) / fetch_sub(1) is manually paired; if future edits add early returns or error paths inside the loop, it's easy to miss the decrement and wedge the stop/start wait loops. Consider using an RAII scope guard so the counter is decremented on all exits from forward_message().

Copilot uses AI. Check for mistakes.
@github-actions
Copy link
Copy Markdown

🔎 FLASH Analysis

px4_fmu-v5x [Total VM Diff: 152 byte (0.01 %)]
    FILE SIZE        VM SIZE    
 --------------  -------------- 
  +0.0%     +88  +0.0%     +88    .text
    [NEW]     +60  [NEW]     +60    CSWTCH.1972
    [NEW]     +44  [NEW]     +44    CSWTCH.2769
     +15%     +32   +15%     +32    Mavlink::forward_message()
     +21%     +28   +21%     +28    Mavlink::start_helper()
    +4.9%     +20  +4.9%     +20    Mavlink::stop_command()
    [NEW]     +14  [NEW]     +14    CSWTCH.3532
    [NEW]     +14  [NEW]     +14    CSWTCH.3533
    +4.8%      +4  +4.8%      +4    McAutotuneAttitudeControl::custom_command()
    +0.2%      +4  +0.2%      +4    _GLOBAL__sub_I_mavlink_system
    [DEL]     -14  [DEL]     -14    CSWTCH.3530
    [DEL]     -14  [DEL]     -14    CSWTCH.3531
    [DEL]     -44  [DEL]     -44    CSWTCH.2767
    [DEL]     -60  [DEL]     -60    CSWTCH.1970
  [ = ]       0  +0.1%     +64    .bss
    [ = ]       0  [NEW]     +68    [section .bss]
    [ = ]       0  [NEW]      +4    mavlink_forward_msg_active
    [ = ]       0   +11%      +4    read_ok
    [ = ]       0 -75.0%     -12    g_dma_perf
  +0.0%     +55  [ = ]       0    .debug_abbrev
  -0.0%     -12  [ = ]       0    .debug_frame
  +0.0%     +90  [ = ]       0    .debug_info
  +0.0%     +71  [ = ]       0    .debug_line
  -0.0%     -14  [ = ]       0    .debug_loclists
  +0.0%     +43  [ = ]       0    .debug_rnglists
  +0.0%     +27  [ = ]       0    .debug_str
  +0.0%     +32  [ = ]       0    .strtab
    [DEL]     -12  [ = ]       0    CSWTCH.1970
    [NEW]     +12  [ = ]       0    CSWTCH.1972
    [DEL]     -12  [ = ]       0    CSWTCH.2767
    [NEW]     +12  [ = ]       0    CSWTCH.2769
    [DEL]     -24  [ = ]       0    CSWTCH.3530
    [DEL]     -24  [ = ]       0    CSWTCH.3531
    [NEW]     +24  [ = ]       0    CSWTCH.3532
    [NEW]     +24  [ = ]       0    CSWTCH.3533
     +36%     +16  [ = ]       0    ___ZL19param_get_cplusplustPf.isra.0_veneer
   -18.0%     -16  [ = ]       0    ___ZN7sensors22VehicleAngularVelocity21FilterAngularVelocityEiPfi_veneer
    [NEW]     +32  [ = ]       0    mavlink_forward_msg_active
  +0.0%     +48  [ = ]       0    .symtab
    [DEL]     -32  [ = ]       0    CSWTCH.1970
    [NEW]     +32  [ = ]       0    CSWTCH.1972
    [DEL]     -32  [ = ]       0    CSWTCH.2767
    [NEW]     +32  [ = ]       0    CSWTCH.2769
    [DEL]     -48  [ = ]       0    CSWTCH.3530
    [DEL]     -48  [ = ]       0    CSWTCH.3531
    [NEW]     +48  [ = ]       0    CSWTCH.3532
    [NEW]     +48  [ = ]       0    CSWTCH.3533
     +20%     +16  [ = ]       0    ManualControlSelector::isInputValid()
   -50.0%     -32  [ = ]       0    McAutotuneAttitudeControl::custom_command()
     +33%     +32  [ = ]       0    _GLOBAL__sub_I_mavlink_system
     +17%     +32  [ = ]       0    ___ZL19param_get_cplusplustPf.isra.0_veneer
   -40.0%     -32  [ = ]       0    ___ZN7sensors22VehicleAngularVelocity21FilterAngularVelocityEiPfi_veneer
    [NEW]     +32  [ = ]       0    mavlink_forward_msg_active
  -1.1%     -88  [ = ]       0    [Unmapped]
  +0.0%    +340  +0.0%    +152    TOTAL

px4_fmu-v6x [Total VM Diff: 88 byte (0 %)]
    FILE SIZE        VM SIZE    
 --------------  -------------- 
  +0.0%     +88  +0.0%     +88    .text
    [NEW]     +60  [NEW]     +60    CSWTCH.1972
    [NEW]     +44  [NEW]     +44    CSWTCH.2769
     +15%     +32   +15%     +32    Mavlink::forward_message()
     +21%     +28   +21%     +28    Mavlink::start_helper()
    +4.9%     +20  +4.9%     +20    Mavlink::stop_command()
    [NEW]     +14  [NEW]     +14    CSWTCH.3532
    [NEW]     +14  [NEW]     +14    CSWTCH.3533
    +4.8%      +4  +4.8%      +4    McAutotuneAttitudeControl::custom_command()
    +0.2%      +4  +0.2%      +4    _GLOBAL__sub_I_mavlink_system
    [DEL]     -14  [DEL]     -14    CSWTCH.3530
    [DEL]     -14  [DEL]     -14    CSWTCH.3531
    [DEL]     -44  [DEL]     -44    CSWTCH.2767
    [DEL]     -60  [DEL]     -60    CSWTCH.1970
  +0.0%     +55  [ = ]       0    .debug_abbrev
  -0.0%     -12  [ = ]       0    .debug_frame
  +0.0%     +90  [ = ]       0    .debug_info
  +0.0%     +71  [ = ]       0    .debug_line
  -0.0%     -14  [ = ]       0    .debug_loclists
  +0.0%     +43  [ = ]       0    .debug_rnglists
  +0.0%     +27  [ = ]       0    .debug_str
  +0.0%     +32  [ = ]       0    .strtab
    [DEL]     -12  [ = ]       0    CSWTCH.1970
    [NEW]     +12  [ = ]       0    CSWTCH.1972
    [DEL]     -12  [ = ]       0    CSWTCH.2767
    [NEW]     +12  [ = ]       0    CSWTCH.2769
    [DEL]     -24  [ = ]       0    CSWTCH.3530
    [DEL]     -24  [ = ]       0    CSWTCH.3531
    [NEW]     +24  [ = ]       0    CSWTCH.3532
    [NEW]     +24  [ = ]       0    CSWTCH.3533
    [NEW]     +32  [ = ]       0    mavlink_forward_msg_active
  +0.0%     +48  [ = ]       0    .symtab
    [DEL]     -32  [ = ]       0    CSWTCH.1970
    [NEW]     +32  [ = ]       0    CSWTCH.1972
    [DEL]     -32  [ = ]       0    CSWTCH.2767
    [NEW]     +32  [ = ]       0    CSWTCH.2769
    [DEL]     -48  [ = ]       0    CSWTCH.3530
    [DEL]     -48  [ = ]       0    CSWTCH.3531
    [NEW]     +48  [ = ]       0    CSWTCH.3532
    [NEW]     +48  [ = ]       0    CSWTCH.3533
     +20%     +16  [ = ]       0    MavlinkReceiver::handle_messages_in_gimbal_mode()
   -50.0%     -32  [ = ]       0    McAutotuneAttitudeControl::custom_command()
     +33%     +32  [ = ]       0    _GLOBAL__sub_I_mavlink_system
    [NEW]     +32  [ = ]       0    mavlink_forward_msg_active
  -2.1%     -88  [ = ]       0    [Unmapped]
  +0.0%    +340  +0.0%     +88    TOTAL

Updated: 2026-04-13T17:19:12

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

Comment on lines +3317 to +3320
// Wait for any in-flight forward_message() to finish before freeing
while (mavlink_forward_msg_active.load() > 0) {
px4_usleep(100);
}
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This wait loop has no timeout and can block mavlink stop indefinitely if mavlink_forward_msg_active never reaches 0 (eg continuous forwarding load across other instances). Please add a bounded wait with a clear error path, and consider waiting only for references to this specific inst (per-instance refcount / hazard-pointer style) rather than a global counter.

Copilot uses AI. Check for mistakes.
Comment on lines 517 to 529
@@ -523,6 +527,8 @@ Mavlink::forward_message(const mavlink_message_t *msg, Mavlink *self)
}
}
}
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

forward_message() now iterates over mavlink_module_instances without holding mavlink_module_mutex, while other code paths (eg stop_command() / start_helper()) still write mavlink_module_instances[...]=nullptr under that mutex. Because the array elements are plain Mavlink*, this introduces a C++ data race (UB) between unsynchronized reads and writes even if the pointer write is naturally atomic on the CPU. Consider switching mavlink_module_instances to an atomic pointer array (eg px4::atomic<Mavlink*>) and using atomic load/store (or atomic exchange on removal), or otherwise ensure readers and writers use the same synchronization primitive.

Copilot uses AI. Check for mistakes.
Comment on lines +2990 to +2991
while (mavlink_forward_msg_active.load() > 0) {
px4_usleep(100);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like we are replacing the mutex with a spinlock which in my experience isn't a great performance trade-off. Do you have any benchmark results?

Also the complexity has increased significantly and one can easily cause critical bugs when working in this module. The syncing has to be done carefully and it's all tied to manual checks and loops.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't bench marked it, and yeah I'm not thrilled with the read counter and spin lock... maybe we can resolve this without needing that or the mutex. Need to think about it more.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is the mutex is in the hot path, whereas the spin lock is only on stop() which is rare

@onur-ozkan
Copy link
Copy Markdown
Member

Fixes the use-after-free race in stop_command() identified in #26927, where an instance could be freed while another receiver thread's forward_message() still held a pointer to it

In my opinion, this is an architectural issue and needs broader re-design. Do we need to fix that urgently or can it wait for the mavlink re-design work?

@dakejahl
Copy link
Copy Markdown
Contributor Author

@onur-ozkan It can wait 👍

@dakejahl dakejahl closed this Apr 14, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants