Skip to content

fix(tests): avoid soft lockup in test_serial_active_tx_snapshot#5866

Closed
JackThomson2 wants to merge 1 commit into
firecracker-microvm:mainfrom
JackThomson2:fix/serial-active-tx-snapshot-test
Closed

fix(tests): avoid soft lockup in test_serial_active_tx_snapshot#5866
JackThomson2 wants to merge 1 commit into
firecracker-microvm:mainfrom
JackThomson2:fix/serial-active-tx-snapshot-test

Conversation

@JackThomson2

Copy link
Copy Markdown
Contributor

The test was hanging because cat /dev/zero > /dev/ttyS0 on a multi-vCPU guest triggers a soft lockup in the 8250 IRQ handler. The root cause is that FC's serial emulation re-asserts IIR (THRE pending) on every THR write, causing the guest's serial8250_interrupt() loop to run up to PASS_LIMIT=512 iterations per interrupt with IRQs disabled (4-12ms). When writer and IRQ handler are on different CPUs, the writer continuously re-asserts IIR between the handler's reads, so the loop never exits naturally.

Fix by:

  • Using SSH as the control channel instead of serial (which is unusable during the flood)
  • Using 4 vCPUs with explicit IRQ affinity pinning: all IRQs to CPU0 (keeping SSH/virtio-net responsive), serial IRQ to CPU1 (isolating the lockup), writer pinned to CPU3 (cross-CPU to exercise the worst case)
  • After snapshot/restore, killing the writer via SSH then verifying serial console functionality

This preserves the original test intent (snapshot with active serial TX on a multi-vCPU guest) while working around the known device emulation bug.

Changes

...

Reason

...

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • I have read and understand CONTRIBUTING.md.
  • I have run tools/devtool checkbuild --all to verify that the PR passes
    build checks on all supported architectures.
  • I have run tools/devtool checkstyle to verify that the PR passes the
    automated style checks.
  • I have described what is done in these changes, why they are needed, and
    how they are solving the problem in a clear and encompassing way.
  • I have updated any relevant documentation (both in code and in the docs)
    in the PR.
  • I have mentioned all user-facing changes in CHANGELOG.md.
  • If a specific issue led to this PR, this PR closes the issue.
  • When making API changes, I have followed the
    Runbook for Firecracker API changes.
  • I have tested all new and changed functionalities in unit tests and/or
    integration tests.
  • I have linked an issue to every new TODO.

  • This functionality cannot be added in rust-vmm.

@JackThomson2 JackThomson2 marked this pull request as ready for review May 7, 2026 10:19
@JackThomson2 JackThomson2 force-pushed the fix/serial-active-tx-snapshot-test branch from 8bf5d27 to 83e99fa Compare May 7, 2026 10:20
@codecov

codecov Bot commented May 7, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.80%. Comparing base (e7e0efe) to head (efb7ca1).
⚠️ Report is 26 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5866      +/-   ##
==========================================
+ Coverage   82.62%   82.80%   +0.18%     
==========================================
  Files         275      277       +2     
  Lines       29750    29895     +145     
==========================================
+ Hits        24581    24756     +175     
+ Misses       5169     5139      -30     
Flag Coverage Δ
5.10-m5n.metal 83.09% <ø> (?)
5.10-m6a.metal 82.43% <ø> (?)
5.10-m6g.metal 79.73% <ø> (ø)
5.10-m6i.metal 83.09% <ø> (?)
5.10-m7a.metal-48xl 82.42% <ø> (?)
5.10-m7g.metal 79.73% <ø> (ø)
5.10-m7i.metal-24xl 83.07% <ø> (?)
5.10-m7i.metal-48xl 83.07% <ø> (?)
5.10-m8g.metal-24xl 79.73% <ø> (ø)
5.10-m8g.metal-48xl 79.73% <ø> (ø)
5.10-m8i.metal-48xl 83.07% <ø> (?)
5.10-m8i.metal-96xl 83.07% <ø> (?)
6.1-m5n.metal 83.12% <ø> (?)
6.1-m6a.metal 82.46% <ø> (?)
6.1-m6g.metal 79.73% <ø> (ø)
6.1-m6i.metal 83.12% <ø> (+<0.01%) ⬆️
6.1-m7a.metal-48xl 82.45% <ø> (?)
6.1-m7g.metal 79.73% <ø> (ø)
6.1-m7i.metal-24xl 83.13% <ø> (?)
6.1-m7i.metal-48xl 83.14% <ø> (?)
6.1-m8g.metal-24xl 79.73% <ø> (-0.01%) ⬇️
6.1-m8g.metal-48xl 79.73% <ø> (ø)
6.1-m8i.metal-48xl 83.14% <ø> (?)
6.1-m8i.metal-96xl 83.14% <ø> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@JackThomson2 JackThomson2 force-pushed the fix/serial-active-tx-snapshot-test branch from 83e99fa to e5bc71f Compare May 7, 2026 10:27
@JackThomson2 JackThomson2 requested a review from ilstam May 7, 2026 11:27
# looking for the # prompt at the end
serial.rx(microvm.distro.shell_prompt)

# Start an unbounded serial transmission from inside the guest such that

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why remove this comment?

microvm.start()

# looking for the # prompt at the end
serial.rx(microvm.distro.shell_prompt)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason for removing this?

" nohup taskset -c 3 cat /dev/zero > /dev/ttyS0 2>/dev/null &"
)
# Give the guest time to saturate the TX buffer
time.sleep(2)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason the sleep duration is increased from 1 second to 2?

time.sleep(2)

# Create snapshot.
# Create snapshot — FC pause works even during the serial flood because

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this comment necessary?

Comment thread tests/integration_tests/functional/test_serial_io.py
# affinity persists from the snapshot so only CPU1 is affected and SSH on
# CPU0 remains reachable.
vm.ssh.check_output("pkill -9 cat || true")
time.sleep(0.5)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this sleep here? sleep.rx() will sleep while waiting for characters

# Give the guest time to start the transmission
time.sleep(1)
# Pin all IRQs to CPU0 first, then move serial to CPU1. This ensures the
# serial flood lockup (which starves the IRQ-handling CPU) is isolated to

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment assumes the reader knows about "the lockup", but it should rather first explain what might be the problem without doing any pinning and then explain how the pinning helps.

# Pin all IRQs to CPU0 first, then move serial to CPU1. This ensures the
# serial flood lockup (which starves the IRQ-handling CPU) is isolated to
# CPU1, while SSH (virtio-net) stays on CPU0 and remains reachable.
# Pin the writer (cat) to CPU3 so it's on a different CPU than the serial

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also explain that we want them to be in separate CPUs in order to make it likely that the guest has TX interrupts enabled at the moment we take the snapshot.

serial.rx(vm.distro.shell_prompt)
serial.tx("pwd")
res = serial.rx("#")
assert "/root" in res

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the commit message, let's not call it a device emulation bug. The first paragraph goes unnecessarily deep into the guest driver IMHO (talking about PASS_LIMIT=512 and whatnot). I think it's simpler to just say that this unbounded transmission causes a TX interrupt storm in the guest and leads to a soft lockup for that CPU.

assert "/root" in res


def test_serial_active_tx_snapshot(uvm_plain, microvm_factory):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we run the updated test with commit 25d7066a8dc37 ("devices: serial: Save/restore UART state to/from snapshots") reverted and make sure it fails?

The test was hanging because the unbounded `cat /dev/zero > /dev/ttyS0`
transmission causes a TX interrupt storm in the guest, which leads to
a soft lockup on the CPU that handles the serial IRQ. While that CPU
is locked up, the serial console becomes unusable, so the test could
no longer regain a shell after restore.

Fix by:
- Using SSH as the control channel instead of serial (which is
  unusable during the flood).
- Using 4 vCPUs with explicit IRQ affinity pinning: all IRQs to CPU0
  (keeping SSH/virtio-net responsive), the serial IRQ to CPU1
  (containing the soft lockup), and the writer pinned to CPU3 so it
  runs on a different CPU than the serial IRQ handler. Keeping the
  writer and the IRQ handler on separate CPUs is also what makes it
  likely that the guest still has TX interrupts enabled at the moment
  we take the snapshot, which is the scenario this test is meant to
  exercise.
- After snapshot/restore, killing the writer via SSH and then
  verifying the serial console is functional again.

Signed-off-by: Jack Thomson <jackabt@amazon.com>
@JackThomson2 JackThomson2 force-pushed the fix/serial-active-tx-snapshot-test branch from e5bc71f to efb7ca1 Compare May 11, 2026 13:11
@JackThomson2

Copy link
Copy Markdown
Contributor Author

We're closing this, unfortunately there doesnt seem to be a reliable way to fix this test failure + catch the regression. So we have decided to just mark the test as flaky

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants