
serial: Add TX backpressure to prevent guest soft lockup #5865

Closed

JackThomson2 wants to merge 2 commits into firecracker-microvm:main
from JackThomson2:fix/serial_flooding


Conversation

@JackThomson2
Contributor

Changes

...

Reason

...

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • I have read and understand CONTRIBUTING.md.
  • I have run tools/devtool checkbuild --all to verify that the PR passes
    build checks on all supported architectures.
  • I have run tools/devtool checkstyle to verify that the PR passes the
    automated style checks.
  • I have described what is done in these changes, why they are needed, and
    how they are solving the problem in a clear and encompassing way.
  • I have updated any relevant documentation (both in code and in the docs)
    in the PR.
  • I have mentioned all user-facing changes in CHANGELOG.md.
  • If a specific issue led to this PR, this PR closes the issue.
  • When making API changes, I have followed the
    Runbook for Firecracker API changes.
  • I have tested all new and changed functionalities in unit tests and/or
    integration tests.
  • I have linked an issue to every new TODO.

  • This functionality cannot be added in rust-vmm.

A guest tight loop writing to /dev/ttyS0 (e.g. `cat /dev/zero >
/dev/ttyS0`) saturates the vCPU thread in MMIO-exit handling and starves
the other vCPU, producing soft-lockup, workqueue-lockup and RCU-stall
warnings such as:

    BUG: workqueue lockup - pool cpus=0-1 ... stuck for 38s!
    watchdog: BUG: soft lockup - CPU#1 stuck for 86s! [swapper/1:0]
    rcu: INFO: rcu_preempt detected stalls on CPUs/tasks

The underlying vm-superio Serial keeps LSR_THR_EMPTY|LSR_IDLE set
unconditionally (the upstream comment is "we should always be ready to
receive more data"), so the guest 8250 driver never throttles and writes
each byte through MMIO. The vCPU thread then performs a 1-byte
write(2)+flush+eventfd-write per byte; with Firecracker's O_NONBLOCK
stdout the host-side pipe buffer no longer paces it either.

Wrap a soft TX FIFO in front of vm-superio:

  - Guest data writes are pushed onto a bounded VecDeque (no syscall on
    the vCPU thread) and a TimerFd is armed for one drain interval.
  - LSR reads mask THR_EMPTY|IDLE while the queue is at the modelled
    FIFO depth, so the guest sees a busy port and waits.
  - A drain runs on the event-manager thread (event subscriber on the
    timerfd), pops up to one FIFO per tick, and feeds each byte through
    the regular `Serial::write(DATA, b)` path (write+flush+IRQ raise).
  - Loopback writes bypass the queue so vm-superio's synchronous
    RX-FIFO routing and RDA interrupts stay in sync with the guest.
  - Overrun beyond `TX_QUEUE_CAPACITY` drops bytes via the existing
    `tx_lost_byte` event, matching real-hardware FIFO overrun.
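
The queueing and LSR-masking behaviour described above can be sketched as
follows. This is a minimal illustration, not the PR's actual code: the
names `TxFifo`, `FIFO_DEPTH`, `mask_lsr` and `drain_one_fifo` are
hypothetical, and the capacity, FIFO depth and LSR bit values are
assumptions based on the description (LSR bit positions follow the
standard 16550 layout).

```rust
// Hypothetical sketch of the bounded soft TX queue; not the actual
// Firecracker implementation.
use std::collections::VecDeque;

const TX_QUEUE_CAPACITY: usize = 64 * 1024; // assumed soft bound (64 KiB)
const FIFO_DEPTH: usize = 64;               // assumed modelled FIFO depth
const LSR_THR_EMPTY: u8 = 0x20;             // 16550 THR-empty bit
const LSR_IDLE: u8 = 0x40;                  // 16550 transmitter-idle bit

struct TxFifo {
    queue: VecDeque<u8>,
    lost_bytes: u64, // counterpart of the `tx_lost_byte` event
}

impl TxFifo {
    fn new() -> Self {
        TxFifo { queue: VecDeque::new(), lost_bytes: 0 }
    }

    /// Guest data-register write: enqueue only, no syscall on the
    /// vCPU thread. Overrun beyond the capacity drops the byte,
    /// matching real-hardware FIFO overrun.
    fn push(&mut self, byte: u8) {
        if self.queue.len() >= TX_QUEUE_CAPACITY {
            self.lost_bytes += 1;
        } else {
            self.queue.push_back(byte);
        }
    }

    /// Mask THR_EMPTY|IDLE while the queue is at the modelled FIFO
    /// depth, so the guest 8250 driver sees a busy port and throttles.
    fn mask_lsr(&self, lsr: u8) -> u8 {
        if self.queue.len() >= FIFO_DEPTH {
            lsr & !(LSR_THR_EMPTY | LSR_IDLE)
        } else {
            lsr
        }
    }

    /// Drain tick: pop up to one FIFO's worth of bytes, to be fed
    /// through the regular `Serial::write(DATA, b)` path on the
    /// event-manager thread.
    fn drain_one_fifo(&mut self) -> Vec<u8> {
        let n = self.queue.len().min(FIFO_DEPTH);
        self.queue.drain(..n).collect()
    }
}
```

The key property is that the vCPU thread only touches the in-memory
queue; the write/flush/IRQ work happens later, on the drain tick.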

Empirically a 60-second `cat /dev/zero > /dev/ttyS0` produces zero
soft-lockup messages; the flooding vCPU's host-side stime drops from
~76% to ~5%.

Signed-off-by: Jack Thomson <jackabt@amazon.com>

The soft TX FIFO added in the previous commit holds bytes the guest has
written to the data register but the host hasn't yet drained. With the
queue not part of `SerialState`, snapshot/restore (and live migration)
silently drops up to 64 KiB of pending console output — observable as
missing bytes in the destination's serial.log when a snapshot is taken
mid-burst.

Persist the queue:

  - Add `SerialState::tx_queue: Vec<u8>` with `#[serde(default)]`, so
    older snapshots restore as an empty queue and round-trip cleanly.
  - Bump `SNAPSHOT_VERSION` from 10.0.0 to 10.1.0. The format is a
    strict superset of v10.0.0; older Firecrackers reject the new
    snapshot via the existing minor-version gate in `Snapshot::load`.
  - Add `SerialWrapper::tx_queue_snapshot` and `restore_tx_queue` so
    `DeviceManager::serial_state` and the x86 / aarch64 restore paths
    can move the bytes through the persistence layer.
  - Truncate on restore to the live `TX_QUEUE_CAPACITY` to keep the
    runtime invariant `tx_queue.len() <= TX_QUEUE_CAPACITY` even if a
    snapshot was taken with a larger cap. Re-arm the drain timer if
    the restored queue is non-empty so bytes start flowing immediately
    on the destination side.
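
The restore-side handling can be sketched as below. This is an
illustrative reduction under stated assumptions: `SerialState` here
carries only the new field (the real struct has more), `restore_tx_queue`
is a hypothetical free function standing in for the PR's
`SerialWrapper::restore_tx_queue`, and the capacity value is assumed.

```rust
// Hypothetical sketch of truncate-on-restore plus timer re-arm; not
// the actual Firecracker persistence code.
use std::collections::VecDeque;

const TX_QUEUE_CAPACITY: usize = 64 * 1024; // assumed live capacity

/// Serialized form. In the real change this field carries
/// `#[serde(default)]`, so v10.0.0 snapshots (which lack it)
/// restore as an empty queue.
struct SerialState {
    tx_queue: Vec<u8>,
}

/// Rebuild the runtime queue from a snapshot and report whether the
/// drain timer must be re-armed so bytes start flowing immediately
/// on the destination side.
fn restore_tx_queue(state: &SerialState) -> (VecDeque<u8>, bool) {
    let mut bytes = state.tx_queue.clone();
    // Keep the runtime invariant len <= TX_QUEUE_CAPACITY even if
    // the snapshot was taken with a larger cap.
    bytes.truncate(TX_QUEUE_CAPACITY);
    let rearm = !bytes.is_empty();
    (bytes.into(), rearm)
}
```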

Signed-off-by: Jack Thomson <jackabt@amazon.com>
@codecov

codecov Bot commented May 5, 2026

Codecov Report

❌ Patch coverage is 78.94737% with 24 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.80%. Comparing base (e7e0efe) to head (c9bf18c).

Files with missing lines                   Patch %   Lines
src/vmm/src/devices/legacy/serial.rs       83.14%    15 Missing ⚠️
src/vmm/src/device_manager/persist.rs      14.28%    6 Missing ⚠️
src/vmm/src/device_manager/mod.rs          83.33%    3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5865      +/-   ##
==========================================
+ Coverage   82.62%   82.80%   +0.17%     
==========================================
  Files         275      277       +2     
  Lines       29750    29989     +239     
==========================================
+ Hits        24581    24831     +250     
+ Misses       5169     5158      -11     
Flag Coverage Δ
5.10-m5n.metal 83.12% <85.71%> (?)
5.10-m6a.metal 82.45% <85.71%> (?)
5.10-m6g.metal 79.78% <74.74%> (+0.04%) ⬆️
5.10-m6i.metal 83.11% <85.71%> (?)
5.10-m7a.metal-48xl 82.44% <85.71%> (?)
5.10-m7g.metal 79.78% <74.74%> (+0.04%) ⬆️
5.10-m7i.metal-24xl 83.08% <85.71%> (?)
5.10-m7i.metal-48xl 83.08% <85.71%> (?)
5.10-m8g.metal-24xl 79.78% <74.74%> (+0.04%) ⬆️
5.10-m8g.metal-48xl 79.78% <74.74%> (+0.04%) ⬆️
5.10-m8i.metal-48xl 83.09% <85.71%> (?)
5.10-m8i.metal-96xl 83.09% <85.71%> (?)
6.1-m5n.metal 83.14% <85.71%> (?)
6.1-m6a.metal 82.47% <85.71%> (?)
6.1-m6g.metal 79.78% <74.74%> (+0.04%) ⬆️
6.1-m6i.metal 83.13% <85.71%> (+0.01%) ⬆️
6.1-m7a.metal-48xl 82.46% <85.71%> (?)
6.1-m7g.metal 79.78% <74.74%> (+0.04%) ⬆️
6.1-m7i.metal-24xl 83.15% <85.71%> (?)
6.1-m7i.metal-48xl 83.15% <85.71%> (?)
6.1-m8g.metal-24xl 79.78% <74.74%> (+0.05%) ⬆️
6.1-m8g.metal-48xl 79.78% <74.74%> (+0.04%) ⬆️
6.1-m8i.metal-48xl 83.16% <85.71%> (?)
6.1-m8i.metal-96xl 83.15% <85.71%> (?)

Flags with carried forward coverage won't be shown.
