fix(tests): avoid soft lockup in test_serial_active_tx_snapshot by JackThomson2 · Pull Request #5866 · firecracker-microvm/firecracker

JackThomson2 · 2026-05-07T10:18:56Z

The test was hanging because cat /dev/zero > /dev/ttyS0 on a multi-vCPU guest triggers a soft lockup in the 8250 IRQ handler. The root cause is that FC's serial emulation re-asserts IIR (THRE pending) on every THR write, causing the guest's serial8250_interrupt() loop to run up to PASS_LIMIT=512 iterations per interrupt with IRQs disabled (4-12ms). When writer and IRQ handler are on different CPUs, the writer continuously re-asserts IIR between the handler's reads, so the loop never exits naturally.

Fix by:

Using SSH as the control channel instead of serial (which is unusable during the flood)
Using 4 vCPUs with explicit IRQ affinity pinning: all IRQs to CPU0 (keeping SSH/virtio-net responsive), serial IRQ to CPU1 (isolating the lockup), writer pinned to CPU3 (cross-CPU to exercise the worst case)
After snapshot/restore, killing the writer via SSH then verifying serial console functionality

This preserves the original test intent (snapshot with active serial TX on a multi-vCPU guest) while working around the known device emulation bug.

Changes

...

Reason

...

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

This functionality cannot be added in rust-vmm.

codecov · 2026-05-07T10:26:29Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.80%. Comparing base (e7e0efe) to head (efb7ca1).
⚠️ Report is 26 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #5866      +/-   ##
==========================================
+ Coverage   82.62%   82.80%   +0.18%     
==========================================
  Files         275      277       +2     
  Lines       29750    29895     +145     
==========================================
+ Hits        24581    24756     +175     
+ Misses       5169     5139      -30

Flag	Coverage Δ
5.10-m5n.metal	`83.09% <ø> (?)`
5.10-m6a.metal	`82.43% <ø> (?)`
5.10-m6g.metal	`79.73% <ø> (ø)`
5.10-m6i.metal	`83.09% <ø> (?)`
5.10-m7a.metal-48xl	`82.42% <ø> (?)`
5.10-m7g.metal	`79.73% <ø> (ø)`
5.10-m7i.metal-24xl	`83.07% <ø> (?)`
5.10-m7i.metal-48xl	`83.07% <ø> (?)`
5.10-m8g.metal-24xl	`79.73% <ø> (ø)`
5.10-m8g.metal-48xl	`79.73% <ø> (ø)`
5.10-m8i.metal-48xl	`83.07% <ø> (?)`
5.10-m8i.metal-96xl	`83.07% <ø> (?)`
6.1-m5n.metal	`83.12% <ø> (?)`
6.1-m6a.metal	`82.46% <ø> (?)`
6.1-m6g.metal	`79.73% <ø> (ø)`
6.1-m6i.metal	`83.12% <ø> (+<0.01%)`	⬆️
6.1-m7a.metal-48xl	`82.45% <ø> (?)`
6.1-m7g.metal	`79.73% <ø> (ø)`
6.1-m7i.metal-24xl	`83.13% <ø> (?)`
6.1-m7i.metal-48xl	`83.14% <ø> (?)`
6.1-m8g.metal-24xl	`79.73% <ø> (-0.01%)`	⬇️
6.1-m8g.metal-48xl	`79.73% <ø> (ø)`
6.1-m8i.metal-48xl	`83.14% <ø> (?)`
6.1-m8i.metal-96xl	`83.14% <ø> (?)`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

ilstam · 2026-05-07T13:01:06Z

-    # looking for the # prompt at the end
-    serial.rx(microvm.distro.shell_prompt)
-
-    # Start an unbounded serial transmission from inside the guest such that


Why remove this comment?

ilstam · 2026-05-07T13:01:45Z

    microvm.start()

-    # looking for the # prompt at the end
-    serial.rx(microvm.distro.shell_prompt)


Is there a reason for removing this?

ilstam · 2026-05-07T13:02:10Z

+        " nohup taskset -c 3 cat /dev/zero > /dev/ttyS0 2>/dev/null &"
+    )
+    # Give the guest time to saturate the TX buffer
+    time.sleep(2)


Is there a reason the sleep duration is increased from 1 second to 2?

ilstam · 2026-05-07T13:02:52Z

+    time.sleep(2)

-    # Create snapshot.
+    # Create snapshot — FC pause works even during the serial flood because


Is this comment necessary?

ilstam · 2026-05-07T13:06:46Z

+    # affinity persists from the snapshot so only CPU1 is affected and SSH on
+    # CPU0 remains reachable.
+    vm.ssh.check_output("pkill -9 cat || true")
+    time.sleep(0.5)


Do we need this sleep here? sleep.rx() will sleep while waiting for characters

ilstam · 2026-05-07T13:09:45Z

-    # Give the guest time to start the transmission
-    time.sleep(1)
+    # Pin all IRQs to CPU0 first, then move serial to CPU1. This ensures the
+    # serial flood lockup (which starves the IRQ-handling CPU) is isolated to


This comment assumes the reader knows about "the lockup", but it should rather first explain what might be the problem without doing any pinning and then explain how the pinning helps.

ilstam · 2026-05-07T13:12:40Z

+    # Pin all IRQs to CPU0 first, then move serial to CPU1. This ensures the
+    # serial flood lockup (which starves the IRQ-handling CPU) is isolated to
+    # CPU1, while SSH (virtio-net) stays on CPU0 and remains reachable.
+    # Pin the writer (cat) to CPU3 so it's on a different CPU than the serial


We should also explain that we want them to be in separate CPUs in order to make it likely that the guest has TX interrupts enabled at the moment we take the snapshot.

ilstam · 2026-05-07T13:19:01Z

    serial.rx(vm.distro.shell_prompt)
    serial.tx("pwd")
    res = serial.rx("#")
    assert "/root" in res


In the commit message, let's not call it a device emulation bug. The first paragraph goes unnecessarily deep into the guest driver IMHO (talking about PASS_LIMIT=512 and whatnot). I think it's simpler to just say that this unbounded transmission causes a TX interrupt storm in the guest and leads to a soft lockup for that CPU.

ilstam · 2026-05-07T13:21:22Z

    assert "/root" in res


 def test_serial_active_tx_snapshot(uvm_plain, microvm_factory):


Can we run the updated test with commit 25d7066a8dc37 ("devices: serial: Save/restore UART state to/from snapshots") reverted and make sure it fails?

The test was hanging because the unbounded `cat /dev/zero > /dev/ttyS0` transmission causes a TX interrupt storm in the guest, which leads to a soft lockup on the CPU that handles the serial IRQ. While that CPU is locked up, the serial console becomes unusable, so the test could no longer regain a shell after restore. Fix by: - Using SSH as the control channel instead of serial (which is unusable during the flood). - Using 4 vCPUs with explicit IRQ affinity pinning: all IRQs to CPU0 (keeping SSH/virtio-net responsive), the serial IRQ to CPU1 (containing the soft lockup), and the writer pinned to CPU3 so it runs on a different CPU than the serial IRQ handler. Keeping the writer and the IRQ handler on separate CPUs is also what makes it likely that the guest still has TX interrupts enabled at the moment we take the snapshot, which is the scenario this test is meant to exercise. - After snapshot/restore, killing the writer via SSH and then verifying the serial console is functional again. Signed-off-by: Jack Thomson <jackabt@amazon.com>

JackThomson2 · 2026-05-21T09:18:14Z

We're closing this, unfortunately there doesnt seem to be a reliable way to fix this test failure + catch the regression. So we have decided to just mark the test as flaky

JackThomson2 marked this pull request as ready for review May 7, 2026 10:19

JackThomson2 force-pushed the fix/serial-active-tx-snapshot-test branch from 8bf5d27 to 83e99fa Compare May 7, 2026 10:20

JackThomson2 force-pushed the fix/serial-active-tx-snapshot-test branch from 83e99fa to e5bc71f Compare May 7, 2026 10:27

JackThomson2 requested a review from ilstam May 7, 2026 11:27

ilstam reviewed May 7, 2026

View reviewed changes

JackThomson2 force-pushed the fix/serial-active-tx-snapshot-test branch from e5bc71f to efb7ca1 Compare May 11, 2026 13:11

JackThomson2 closed this May 21, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(tests): avoid soft lockup in test_serial_active_tx_snapshot#5866

fix(tests): avoid soft lockup in test_serial_active_tx_snapshot#5866
JackThomson2 wants to merge 1 commit into
firecracker-microvm:mainfrom
JackThomson2:fix/serial-active-tx-snapshot-test

JackThomson2 commented May 7, 2026

Uh oh!

codecov Bot commented May 7, 2026 •

edited

Loading

Uh oh!

ilstam May 7, 2026

Uh oh!

ilstam May 7, 2026

Uh oh!

ilstam May 7, 2026

Uh oh!

ilstam May 7, 2026

Uh oh!

Uh oh!

ilstam May 7, 2026

Uh oh!

ilstam May 7, 2026

Uh oh!

ilstam May 7, 2026

Uh oh!

ilstam May 7, 2026

Uh oh!

ilstam May 7, 2026

Uh oh!

JackThomson2 commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		assert "/root" in res


		def test_serial_active_tx_snapshot(uvm_plain, microvm_factory):

Conversation

JackThomson2 commented May 7, 2026

Changes

Reason

License Acceptance

PR Checklist

Uh oh!

codecov Bot commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

JackThomson2 commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov Bot commented May 7, 2026 •

edited

Loading