Skip to content

fix: prevent deadlocks when instrumenting with clippy-tracing#5852

Open
snigenigmatic wants to merge 5 commits intofirecracker-microvm:mainfrom
snigenigmatic:fix-tracing-instrumentation-deadlocks
Open

fix: prevent deadlocks when instrumenting with clippy-tracing#5852
snigenigmatic wants to merge 5 commits intofirecracker-microvm:mainfrom
snigenigmatic:fix-tracing-instrumentation-deadlocks

Conversation

@snigenigmatic
Copy link
Copy Markdown

Changes

  • src/vmm/src/logger/logging.rs: In Logger::update(), moved the open_file_nonblock() call to before the LOGGER mutex is acquired. Previously, the file was opened while holding the lock, so when open_file_nonblock was instrumented by clippy-tracing, the resulting log::trace!() call would re-enter Logger::log() and attempt to acquire the same non-reentrant std::sync::Mutex, deadlocking the process on startup.
  • src/clippy-tracing/src/main.rs: Added an i.sig.abi.is_none() guard to both the create_check_visitor_function! and create_fix_visitor_function! macros. This prevents clippy-tracing from instrumenting extern "C" functions (signal handlers, FFI callbacks), which would introduce mutex acquisitions in async-signal-unsafe contexts. When a signal interrupted a thread already holding the LOGGER mutex (e.g. during warn!() in the vCPU paused() state), the instrumented signal handler would deadlock trying to acquire it again.
  • src/clippy-tracing/tests/integration_tests.rs: Added three integration tests (fix_skips_extern_c_fn, check_skips_extern_c_fn, fix_skips_extern_c_fn_in_impl) verifying that extern "C" functions are correctly skipped during both fix and check actions, including inside impl blocks.
    ...

Reason

When running clippy-tracing --action fix with default settings, at least two deadlock scenarios arise (#5807):

  1. Startup hang: main_exec() calls LOGGER.update(config) which acquires the LOGGER mutex, then calls open_file_nonblock(). When instrumented, that function calls log::trace!() which tries to re-acquire the same mutex on the same thread -- deadlock.
  2. Snapshot resume hang: The vCPU thread calls warn!() (acquiring the LOGGER mutex) after receiving a Resume event with immediate_exit == 1. The RT signal sent by send_event() can arrive mid-warn!(), and if the handle_signal function was instrumented, it tries to acquire the LOGGER mutex from the signal handler context -- deadlock.
    These fixes eliminate both deadlock sources: (1) by narrowing the critical section in Logger::update(), and (2) by ensuring signal handlers are never instrumented.
    ...

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • I have read and understand CONTRIBUTING.md.
  • I have run tools/devtool checkbuild --all to verify that the PR passes
    build checks on all supported architectures.
  • I have run tools/devtool checkstyle to verify that the PR passes the
    automated style checks.
  • I have described what is done in these changes, why they are needed, and
    how they are solving the problem in a clear and encompassing way.
  • I have updated any relevant documentation (both in code and in the docs)
    in the PR.
  • I have mentioned all user-facing changes in CHANGELOG.md.
  • If a specific issue led to this PR, this PR closes the issue.
  • When making API changes, I have followed the
    Runbook for Firecracker API changes.
  • I have tested all new and changed functionalities in unit tests and/or
    integration tests.
  • I have linked an issue to every new TODO.

  • This functionality cannot be added in rust-vmm.

Copilot AI review requested due to automatic review settings April 22, 2026 16:09
@snigenigmatic snigenigmatic force-pushed the fix-tracing-instrumentation-deadlocks branch from f80a97c to 1f061b5 Compare April 22, 2026 16:12
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR mitigates deadlock scenarios introduced when clippy-tracing instruments functions that may log while the global LOGGER mutex is held, and avoids instrumenting extern ABI functions (e.g., signal handlers / FFI callbacks) to prevent async-signal-unsafe logging paths.

Changes:

  • Narrow Logger::update()’s critical section by opening the log target file before taking the LOGGER mutex.
  • Update clippy-tracing’s check/fix visitors to skip functions with an explicit ABI (extern ...).
  • Add integration tests validating extern "C" functions are skipped in both check and fix modes (including inside impl blocks).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
src/vmm/src/logger/logging.rs Reorders file opening to avoid logging re-entrancy while holding the logger mutex.
src/clippy-tracing/src/main.rs Adds an ABI guard to prevent instrumentation of extern functions.
src/clippy-tracing/tests/integration_tests.rs Adds integration tests covering the new “skip extern” behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

// re-entering the locked Logger.
let file = config
.log_path
.map(|p| open_file_nonblock(&p))
Copy link

Copilot AI Apr 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

config.log_path.map(|p| ...) moves log_path out of config, but config is used again below (level/show_level/show_log_origin/module). This will not compile due to a partial move. Consider borrowing the path (e.g., config.log_path.as_ref() / as_deref()) or destructuring config into locals before opening the file.

Suggested change
.map(|p| open_file_nonblock(&p))
.as_ref()
.map(|p| open_file_nonblock(p))

Copilot uses AI. Check for mistakes.
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 29, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.83%. Comparing base (ae57b7a) to head (c2816b2).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5852      +/-   ##
==========================================
+ Coverage   82.79%   82.83%   +0.03%     
==========================================
  Files         276      276              
  Lines       29764    29770       +6     
==========================================
+ Hits        24643    24659      +16     
+ Misses       5121     5111      -10     
Flag Coverage Δ
5.10-m5n.metal 83.13% <100.00%> (+0.04%) ⬆️
5.10-m6a.metal 82.46% <100.00%> (+0.05%) ⬆️
5.10-m6g.metal 79.74% <100.00%> (+0.04%) ⬆️
5.10-m6i.metal 83.13% <100.00%> (+0.04%) ⬆️
5.10-m7a.metal-48xl 82.45% <100.00%> (+0.05%) ⬆️
5.10-m7g.metal 79.74% <100.00%> (+0.04%) ⬆️
5.10-m7i.metal-24xl 83.09% <100.00%> (+0.04%) ⬆️
5.10-m7i.metal-48xl 83.10% <100.00%> (+0.05%) ⬆️
5.10-m8g.metal-24xl 79.74% <100.00%> (+0.04%) ⬆️
5.10-m8g.metal-48xl 79.74% <100.00%> (+0.04%) ⬆️
5.10-m8i.metal-48xl 83.10% <100.00%> (+0.04%) ⬆️
5.10-m8i.metal-96xl 83.10% <100.00%> (+0.04%) ⬆️
6.1-m5n.metal 83.15% <100.00%> (+0.04%) ⬆️
6.1-m6a.metal 82.48% <100.00%> (+0.03%) ⬆️
6.1-m6g.metal 79.74% <100.00%> (+0.04%) ⬆️
6.1-m6i.metal 83.15% <100.00%> (+0.04%) ⬆️
6.1-m7a.metal-48xl 82.47% <100.00%> (+0.03%) ⬆️
6.1-m7g.metal 79.74% <100.00%> (+0.03%) ⬆️
6.1-m7i.metal-24xl 83.17% <100.00%> (+0.05%) ⬆️
6.1-m7i.metal-48xl 83.17% <100.00%> (+0.04%) ⬆️
6.1-m8g.metal-24xl 79.74% <100.00%> (+0.04%) ⬆️
6.1-m8g.metal-48xl 79.74% <100.00%> (+0.04%) ⬆️
6.1-m8i.metal-48xl 83.16% <100.00%> (+0.04%) ⬆️
6.1-m8i.metal-96xl 83.16% <100.00%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

snigenigmatic and others added 4 commits April 29, 2026 17:41
Two deadlock sources exist when clippy-tracing instruments the codebase:
1. Logger::update() called open_file_nonblock() while holding the LOGGER
   mutex. When instrumented, that function calls log::trace!() which
   re-enters Logger::log() and attempts to acquire the same
   non-reentrant mutex, causing a deadlock on startup.
2. Signal handlers (extern C fns) were instrumented by clippy-tracing,
   adding mutex acquisitions in async-signal-unsafe contexts. When a
   signal interrupted a thread already holding the LOGGER mutex, the
   handler would deadlock trying to acquire it again.
Fix (1) by moving the open_file_nonblock() call before the mutex is
acquired in Logger::update(). Fix (2) by teaching clippy-tracing to
skip functions with an explicit ABI (e.g. extern C), since these are
typically signal handlers or FFI callbacks that must not acquire locks.
Closes firecracker-microvm#5807

Signed-off-by: snigenigmatic <kaustubhc.1105@gmail.com>
Two deadlock sources exist when clippy-tracing instruments the codebase:
1. Logger::update() called open_file_nonblock() while holding the LOGGER
   mutex. When instrumented, that function calls log::trace!() which
   re-enters Logger::log() and attempts to acquire the same
   non-reentrant mutex, causing a deadlock on startup.
2. Signal handlers (extern C fns) were instrumented by clippy-tracing,
   adding mutex acquisitions in async-signal-unsafe contexts. When a
   signal interrupted a thread already holding the LOGGER mutex, the
   handler would deadlock trying to acquire it again.
Fix (1) by moving the open_file_nonblock() call before the mutex is
acquired in Logger::update(). Fix (2) by teaching clippy-tracing to
skip functions with an explicit ABI (e.g. extern C), since these are
typically signal handlers or FFI callbacks that must not acquire locks.
Closes firecracker-microvm#5807

Signed-off-by: snigenigmatic <kaustubhc.1105@gmail.com>
@snigenigmatic snigenigmatic force-pushed the fix-tracing-instrumentation-deadlocks branch from 4a10d4b to 7a25fc6 Compare April 29, 2026 17:42
The refactored update() path that opens the file before acquiring the
lock was not exercised by any unit test. Add a test that calls update()
with a valid log_path and asserts the target is set on the logger.

Signed-off-by: snigenigmatic <kaustubhc.1105@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants