fix(tests): avoid soft lockup in test_serial_active_tx_snapshot#5866
fix(tests): avoid soft lockup in test_serial_active_tx_snapshot#5866JackThomson2 wants to merge 1 commit into
Conversation
8bf5d27 to
83e99fa
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #5866 +/- ##
==========================================
+ Coverage 82.62% 82.80% +0.18%
==========================================
Files 275 277 +2
Lines 29750 29895 +145
==========================================
+ Hits 24581 24756 +175
+ Misses 5169 5139 -30
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
83e99fa to
e5bc71f
Compare
| # looking for the # prompt at the end | ||
| serial.rx(microvm.distro.shell_prompt) | ||
|
|
||
| # Start an unbounded serial transmission from inside the guest such that |
| microvm.start() | ||
|
|
||
| # looking for the # prompt at the end | ||
| serial.rx(microvm.distro.shell_prompt) |
There was a problem hiding this comment.
Is there a reason for removing this?
| " nohup taskset -c 3 cat /dev/zero > /dev/ttyS0 2>/dev/null &" | ||
| ) | ||
| # Give the guest time to saturate the TX buffer | ||
| time.sleep(2) |
There was a problem hiding this comment.
Is there a reason the sleep duration is increased from 1 second to 2?
| time.sleep(2) | ||
|
|
||
| # Create snapshot. | ||
| # Create snapshot — FC pause works even during the serial flood because |
| # affinity persists from the snapshot so only CPU1 is affected and SSH on | ||
| # CPU0 remains reachable. | ||
| vm.ssh.check_output("pkill -9 cat || true") | ||
| time.sleep(0.5) |
There was a problem hiding this comment.
Do we need this sleep here? sleep.rx() will sleep while waiting for characters
| # Give the guest time to start the transmission | ||
| time.sleep(1) | ||
| # Pin all IRQs to CPU0 first, then move serial to CPU1. This ensures the | ||
| # serial flood lockup (which starves the IRQ-handling CPU) is isolated to |
There was a problem hiding this comment.
This comment assumes the reader knows about "the lockup", but it should rather first explain what might be the problem without doing any pinning and then explain how the pinning helps.
| # Pin all IRQs to CPU0 first, then move serial to CPU1. This ensures the | ||
| # serial flood lockup (which starves the IRQ-handling CPU) is isolated to | ||
| # CPU1, while SSH (virtio-net) stays on CPU0 and remains reachable. | ||
| # Pin the writer (cat) to CPU3 so it's on a different CPU than the serial |
There was a problem hiding this comment.
We should also explain that we want them to be in separate CPUs in order to make it likely that the guest has TX interrupts enabled at the moment we take the snapshot.
| serial.rx(vm.distro.shell_prompt) | ||
| serial.tx("pwd") | ||
| res = serial.rx("#") | ||
| assert "/root" in res |
There was a problem hiding this comment.
In the commit message, let's not call it a device emulation bug. The first paragraph goes unnecessarily deep into the guest driver IMHO (talking about PASS_LIMIT=512 and whatnot). I think it's simpler to just say that this unbounded transmission causes a TX interrupt storm in the guest and leads to a soft lockup for that CPU.
| assert "/root" in res | ||
|
|
||
|
|
||
| def test_serial_active_tx_snapshot(uvm_plain, microvm_factory): |
There was a problem hiding this comment.
Can we run the updated test with commit 25d7066a8dc37 ("devices: serial: Save/restore UART state to/from snapshots") reverted and make sure it fails?
The test was hanging because the unbounded `cat /dev/zero > /dev/ttyS0` transmission causes a TX interrupt storm in the guest, which leads to a soft lockup on the CPU that handles the serial IRQ. While that CPU is locked up, the serial console becomes unusable, so the test could no longer regain a shell after restore. Fix by: - Using SSH as the control channel instead of serial (which is unusable during the flood). - Using 4 vCPUs with explicit IRQ affinity pinning: all IRQs to CPU0 (keeping SSH/virtio-net responsive), the serial IRQ to CPU1 (containing the soft lockup), and the writer pinned to CPU3 so it runs on a different CPU than the serial IRQ handler. Keeping the writer and the IRQ handler on separate CPUs is also what makes it likely that the guest still has TX interrupts enabled at the moment we take the snapshot, which is the scenario this test is meant to exercise. - After snapshot/restore, killing the writer via SSH and then verifying the serial console is functional again. Signed-off-by: Jack Thomson <jackabt@amazon.com>
e5bc71f to
efb7ca1
Compare
|
We're closing this, unfortunately there doesnt seem to be a reliable way to fix this test failure + catch the regression. So we have decided to just mark the test as flaky |
The test was hanging because
cat /dev/zero > /dev/ttyS0on a multi-vCPU guest triggers a soft lockup in the 8250 IRQ handler. The root cause is that FC's serial emulation re-asserts IIR (THRE pending) on every THR write, causing the guest's serial8250_interrupt() loop to run up to PASS_LIMIT=512 iterations per interrupt with IRQs disabled (4-12ms). When writer and IRQ handler are on different CPUs, the writer continuously re-asserts IIR between the handler's reads, so the loop never exits naturally.Fix by:
This preserves the original test intent (snapshot with active serial TX on a multi-vCPU guest) while working around the known device emulation bug.
Changes
...
Reason
...
License Acceptance
By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.PR Checklist
tools/devtool checkbuild --allto verify that the PR passesbuild checks on all supported architectures.
tools/devtool checkstyleto verify that the PR passes theautomated style checks.
how they are solving the problem in a clear and encompassing way.
in the PR.
CHANGELOG.md.Runbook for Firecracker API changes.
integration tests.
TODO.rust-vmm.