forget less NVMe state across migration by iximeow · Pull Request #1156 · oxidecomputer/propolis

iximeow · 2026-06-18T05:39:24Z

add an additional test for propolis-to-itself migration and run the smoke tests there. extend the smoke test to try making sure disks are at least kind of basically still functional.

this should fix #981, but I have to admit I couldn't really reproduce the second failure mode that I alluded to there or @lgfa29 saw last week. I think that returning incorrect submission queue heads could have an outcome where the guest OS thinks it's submitting I/Os in a way we wouldn't get to, but beyond the status quo obviously being wrong, I'm not sure I've actually tickled that condition much at all.

I mostly really want to extend #966 to actually drive I/Os now, because either of the improvements here should fail quickly through a fuzzer...

add an additional test for propolis-to-itself migration and run the smoke tests there. extend the smoke test to try making sure disks are at least kind of basically still functional.

iximeow · 2026-06-18T05:40:35Z

+                // we have a small conundrum: it's possible that the home
+                // directory we wrote into is a tmpfs anyway. so by succeeding
+                // this much, we may have really just tested memory is intact.


@hawkw i would like for you to look at this change just because i can imagine the eye roll and swear when you read this comment thanks in advance. it's so much more annoying than i thought it was.

iximeow · 2026-06-18T06:20:49Z

hm. the failures are fascinating!

phd_tests::migrate::smoke_test yields:

localhost:~# dmesg | grep 'reset controller'
[   68.965446] nvme nvme0: I/O 82 QID 2 timeout, reset controller
localhost:~# echo $?
0

so.. the test detected a problem! also what in the world.

migration_smoke_test on the other hand:

localhost:~# ls /media/nvme0n1/boot
System.map-virt  config-virt      dtbs-virt        grub             initramfs-virt   modloop-virt     syslinux         vmlinuz-virt
localhost:~# echo $?
0
localhost:~# dmesg | grep 'reset controller'
localhost:~# echo $?
1

... I presumably should be ignoring the exit code of the grep, oops.

lgfa29 · 2026-06-18T19:30:01Z

I ran a quick test using the same setup I did previously and things have certainly improved, the migration actually finished now 😄

But I still ran into issues after the migration, but each time it was a different.

The first time I ran a migration the disk seemed to be "stuck" for about 2min, but then I saw these kernel messages in the guest and the disk seemed to be fine:

[   89.895139] nvme nvme0: I/O 900 (I/O Cmd) QID 2 timeout, aborting
[   89.896745] nvme nvme0: Abort status: 0x0
[  120.103133] nvme nvme0: I/O 900 QID 2 timeout, reset controller
[  120.135440] nvme0n1: I/O Cmd(0x2) @ LBA 1336320, 32 blocks, I/O Error (sct 0x3 / sc 0x71)
[  120.137336] I/O error, dev nvme0n1, sector 1336320 op 0x0:(READ) flags 0x80700 phys_seg 4 prio class 2
[  120.151848] nvme nvme0: 2/0/0 default/read/poll queues

After the second test the disk seemed to be fine, but then I left the guest running while I was in a meeting and when I came it was unresponsive and I had several messages like these:

[   99.371306] nvme nvme0: I/O 263 QID 2 timeout, reset controller
[   99.418681] nvme nvme0: 2/0/0 default/read/poll queues
[  130.091326] nvme nvme0: I/O 263 (I/O Cmd) QID 2 timeout, aborting
[  130.092870] nvme nvme0: Abort status: 0x0
[  160.811324] nvme nvme0: I/O 263 QID 2 timeout, reset controller
[  160.858176] nvme nvme0: 2/0/0 default/read/poll queues
[  191.531317] nvme nvme0: I/O 263 QID 2 timeout, disable controller
[  191.563602] I/O error, dev nvme0n1, sector 2884112 op 0x1:(WRITE) flags 0x29800 phys_seg 1 prio class 2
[  191.565764] I/O error, dev nvme0n1, sector 2884112 op 0x1:(WRITE) flags 0x29800 phys_seg 1 prio class 2
[  191.567984] Aborting journal on device nvme0n1p1-8.
[  191.568026] nvme nvme0: unable to set dbbuf
[  191.570141] nvme nvme0: failed to mark controller live state
[  191.570198] I/O error, dev nvme0n1, sector 358320 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
[  191.570972] nvme nvme0: Removing after probe failure status: -19
[  191.572323] EXT4-fs warning (device nvme0n1p1): ext4_end_bio:343: I/O error 10 writing to inode 110 starting block 44790)
[  191.574436] Buffer I/O error on device nvme0n1p1, logical block 12022
[  191.575283] I/O error, dev nvme0n1, sector 358408 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
[  191.575302] EXT4-fs error (device nvme0n1p1): ext4_journal_check_start:83: comm systemd-journal: Detected aborted journal
[  191.576423] EXT4-fs warning (device nvme0n1p1): ext4_end_bio:343: I/O error 10 writing to inode 110 starting block 44801)
[  191.579136] Buffer I/O error on device nvme0n1p1, logical block 12033
[  191.579981] I/O error, dev nvme0n1, sector 527624 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
[  191.581113] EXT4-fs warning (device nvme0n1p1): ext4_end_bio:343: I/O error 10 writing to inode 110 starting block 65953)
[  191.582495] Buffer I/O error on device nvme0n1p1, logical block 33185
[  191.583445] I/O error, dev nvme0n1, sector 355296 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
[  191.584784] EXT4-fs warning (device nvme0n1p1): ext4_end_bio:343: I/O error 10 writing to inode 110 starting block 44412)
[  191.586373] Buffer I/O error on device nvme0n1p1, logical block 11644
[  191.587339] I/O error, dev nvme0n1, sector 355584 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
[  191.588686] EXT4-fs warning (device nvme0n1p1): ext4_end_bio:343: I/O error 10 writing to inode 110 starting block 44448)

In my last test, the disk started OK for read, but writes would hang and then I started to get:

Last login: Thu Jun 18 18:17:43 UTC 2026 on ttyS0
root@localhost:~#
root@localhost:~# ls
mnt  test
root@localhost:~# [   37.163334] nvme nvme0: I/O 201 (I/O Cmd) QID 2 timeout, aborting
[   37.164890] nvme nvme0: Abort status: 0x0
[   37.215307] nvme nvme0: I/O 202 (I/O Cmd) QID 2 timeout, aborting
[   37.216783] nvme nvme0: Abort status: 0x0

root@localhost:~# ls
mnt  test
root@localhost:~# ls test
test
root@localhost:~# cat test
1
root@localhost:~# [   67.371330] nvme nvme0: I/O 201 QID 2 timeout, reset controller
[   67.416449] nvme nvme0: 2/0/0 default/read/poll queues

root@localhost:~# ls
mnt  test
root@localhost:~# cat ./test
1
root@localhost:~# [   99.371308] nvme nvme0: I/O 201 (I/O Cmd) QID 2 timeout, aborting
[   99.372610] nvme nvme0: Abort status: 0x0

root@localhost:~#
root@localhost:~# cat ./test [  130.091303] nvme nvme0: I/O 201 QID 2 timeout, reset controller
[  130.136615] nvme nvme0: 2/0/0 default/read/poll queues

1
root@localhost:~# echo 2 > ./test

[  160.811324] nvme nvme0: I/O 201 QID 2 timeout, disable controller
[  160.843610] I/O error, dev nvme0n1, sector 2884112 op 0x1:(WRITE) flags 0x29800 phys_seg 1 prio class 2
[  160.845169] I/O error, dev nvme0n1, sector 2884112 op 0x1:(WRITE) flags 0x29800 phys_seg 1 prio class 2
[  160.846325] Aborting journal on device nvme0n1p1-8.
[  160.846396] nvme nvme0: unable to set dbbuf
[  160.848017] nvme nvme0: failed to mark controller live state
[  160.848047] I/O error, dev nvme0n1, sector 532504 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
[  160.849102] nvme nvme0: Removing after probe failure status: -19
[  160.850177] EXT4-fs warning (device nvme0n1p1): ext4_end_bio:343: I/O error 10 writing to inode 94 starting block 66563)
[  160.850961] EXT4-fs error (device nvme0n1p1): ext4_journal_check_start:83: comm kworker/u4:1: Detected aborted journal
[  160.852306] EXT4-fs error (device nvme0n1p1): ext4_journal_check_start:83: comm systemd-journal: Detected aborted journal
[  160.853598] EXT4-fs error (device nvme0n1p1): ext4_journal_check_start:83: comm kworker/u4:4: Detected aborted journal
[  160.854895] I/O error, dev nvme0n1, sector 2883584 op 0x1:(WRITE) flags 0x29800 phys_seg 1 prio class 2
[  160.857402] Buffer I/O error on dev nvme0n1p1, logical block 327680, lost sync page write
[  160.858386] JBD2: I/O error when updating journal superblock for nvme0n1p1-8.
[  160.859265] I/O error, dev nvme0n1, sector 262144 op 0x1:(WRITE) flags 0x23800 phys_seg 1 prio class 2
root@localhost:~#
[  160.861812] Buffer I/O error on dev nvme0n1p1, logical block 0, lost sync page write
root@localhost:~# [  160.863545] EXT4-fs (nvme0n1p1): I/O error while writing superblock
[  160.863557] EXT4-fs (nvme0n1p1): previous I/O error to superblock detected
[  160.864327] EXT4-fs (nvme0n1p1): Remounting filesystem read-only
[  160.866214] EXT4-fs (nvme0n1p1): ext4_writepages: jbd2_start: 1024 pages, ino 95; err -30
[  160.871292] nvme0n1: detected capacity change from 6291456 to 0
[  160.871327] EXT4-fs warning (device nvme0n1p1): ext4_end_bio:343: I/O error 10 writing to inode 95 starting block 67414)
[  160.872211] Buffer I/O error on dev nvme0n1p1, logical block 0, lost sync page write
[  160.873556] Buffer I/O error on device nvme0n1p1, logical block 34646
[  160.874499] EXT4-fs (nvme0n1p1): I/O error while writing superblock
[  160.875297] EXT4-fs warning (device nvme0n1p1): ext4_end_bio:343: I/O error 10 writing to inode 95 starting block 67369)
[  160.876050] EXT4-fs (nvme0n1p1): Remounting filesystem read-only

For reference, these are my test steps:

Start first propolis-server

pfexec propolis-server run ./OVMF_CODE.fd 0.0.0.0:3000 0.0.0.0:4000

Start second propolis-server (same host in my case)

pfexec propolis-server run ./OVMF_CODE.fd 0.0.0.0:3001 0.0.0.0:4001

Create a debian.toml file

[main]
name = "testvm-debian"
cpus = 2
memory = 4096
bootrom = "./OVMF_CODE.fd"

[block_dev.os]
type = "file"
path = "./debian-12-nocloud-amd64.raw"

[dev.block0]
driver = "pci-nvme"
block_dev = "os"
pci-path = "0.3.0"

Start the VM

propolis-cli --server 127.0.0.1 --port 3000 new -t ./debian.toml -c 2 -m 2048 test-vm
propolis-cli --server 127.0.0.1 --port 3000 state run
propolis-cli --server 127.0.0.1 --port 3000 serial

Trigger a migration

propolis-cli --server 127.0.0.1 --port 3000 migrate -p 3001 127.0.0.1

Attempt to read and write files in the guest

iximeow · 2026-06-19T02:07:25Z

after staring at this some more: it looks like there's a condition where we can import a VM, have a submission enqueued, resume devices, but incorrectly decide the queue is empty. the guest doesn't submit another I/O because the one it's waiting for is outstanding, so it discovers the disk is stuck. then all the normal "your disk stopped" happens. this happens with and without doorbell buffers, I've been looking with them disabled for simplicity.

as we saw earlier, migrating while the I/O is submitted but unprocessed does "unstick" it, but I don't quite follow how. where it's stuck, nvme_block_notify!() says queue 2 got a notify for one item, then block queue 0 (??? this should be 1!) polls, returns None, and that's it. then you migrate the VM, resume nudges all the queues, the I/O gets done, and we're back.

so... why does nvme_block_notify!() for queue 2 lead into a block_poll!() for I/O queue 0? it's clearly a fixed assignment strategy because when that worker goes idle it only checks the first queue, so it's in next_req() not next_req_any() where None would have it check other queues until there's a request. .. huh? ..

the comment included here gets at the problem, but I spotted this by attaching to every propolis probe and pouring over all the output, touching up each relevant probe a bit as-needed. the tell there is that when waking a worker collection, eventually if you have a device with two queues (such as in a VM with two vCPUs) you'll see one scroll by that has a notify_workers of 0x00ab, while the other queue has a notify_workers of 0x0055. this is clearly wrong: worker 0 has a fixed assignment, but both minders believe that worker 0 will handle their I/Os. the wrong queue wakes thread 0, which looks at its assigned queue and sees no I/O, idles, and we never discover I/O on the queue that we'd intended to wake a worker for. the problem here is that when a worker idles, it sets a bit in the minder's notify_workers, but then when the worker is reassigned we don't clear that bit from the notify_workers bitmask. we *should* just flush all the bits when reassigning queues, but there's not a handle to the queue collection to do that. so instead, when attempting to wake queues, accept we might see bits for workers we really shouldn't wake, and if that worker has a fixed assignment that is not our queue, do not count it towards waking the requested number of workers. this involves more locking in the wake path than i want to add, but is good enough to prove out that the thing works.

iximeow · 2026-06-24T01:30:11Z

67c2524 looks like it covers the issue from here. clippy hates it, i think the change itself is not the right way to spell it, but it describes the issue i see and a fix in the interim until i figure out plumbing in a QueueCollection or .... something?

lgfa29 · 2026-06-24T16:43:08Z

I ran a few tests using a similar setup as my previous test and the migrated disks seem to work fine now 🎉

iximeow · 2026-06-24T22:38:27Z

+                    continue;
+                }
+            };
+            }


so, at this point we'll drop slot.state.lock(). we could reconfigure the worker at this point, so that slot.wake() just below will wake the worker which then goes and checks a different fixed queue anyway. an example here is if a guest has gone and enabled (or disabled) an I/O queue and moved assignments around.

this is an improvement only as long as queues are not being reassigned concurrently with I/O, and if there is I/O then we have the same issue of queues poking workers that have no hope of serving them.

I might want to revert this and land 1156 as a marginal improvement for now, and have an issue about the shape of this problem and notify_workers in the face of device reconfiguration...

iximeow · 2026-06-24T22:39:03Z

+                    continue;
+                }
+            };
+            }


so, at this point we'll drop slot.state.lock(). we could reconfigure the worker at this point, so that slot.wake() just below will wake the worker which then goes and checks a different fixed queue anyway. an example here is if a guest has gone and enabled (or disabled) an I/O queue and moved assignments around.

this is an improvement only as long as queues are not being reassigned concurrently with I/O, and if there is I/O then we have the same issue of queues poking workers that have no hope of serving them.

I might want to revert this and land 1156 as a marginal improvement for now, and have an issue about the shape of this problem and notify_workers in the face of device reconfiguration...

hawkw · 2026-06-25T20:57:53Z


-    fn update_assignment(&self, assign: &Assignment) {
+    /// Reconfigure this worker for a polling strategy matching the current
+    /// assignment. Returns if the worker has actually been modified; `false` if


this is a super ultra nitpick, but perhaps

Suggested change

/// assignment. Returns if the worker has actually been modified; `false` if

/// assignment. Returns whether the worker has actually been modified; `false` if

since i feel "returns if" is misinterpretable as "does not return until"? even though it's probably obvious.

hawkw · 2026-06-25T20:59:18Z

+            if slot.update_assignment(&assign) {
+                any_updated = true;
+            }


could also be:

Suggested change

if slot.update_assignment(&assign) {

any_updated = true;

}

any_updated |= slot.update_assignment(&assign);

but it's up to you whether this is better or worse

hawkw · 2026-06-25T21:29:14Z

        let mut ctrl = self.state.lock().unwrap();
        ctrl.import(input, self)?;
+        // Now that the controller is imported, update our mirror of its
+        // enablement bit.


mmmmmmaybe worth explicitly calling out that this is inside the lock?

hawkw · 2026-06-25T21:33:48Z

+                    // the guest is *probably* an Alpine. `/` is a figment of
+                    // its imagination; a tmpfs with the actual disk somewhere
+                    // under /media, like /media/nvme0n1. try looking there, and
+                    // if that fails we'll need to improve the test, phd, or
+                    // both.


lmao this rocks alpine is such a distro

hawkw · 2026-06-25T21:35:02Z

+                // ignore the status from `grep` because it returns 1 when the
+                // regex matches no line. we don't want to `check_err()`
+                // because it's useful for debugging to see what *did* match.


isn't there a grep flag that makes it not do that, or did i make that up?

forget less NVMe state across migration

9b10a83

add an additional test for propolis-to-itself migration and run the smoke tests there. extend the smoke test to try making sure disks are at least kind of basically still functional.

iximeow added bug Something that isn't working. storage Related to storage devices/backends. migration Issues related to live migration. labels Jun 18, 2026

iximeow commented Jun 18, 2026

View reviewed changes

hawkw self-requested a review June 18, 2026 19:58

grep returns matchiness in status

64f769e

iximeow commented Jun 24, 2026

View reviewed changes

actually content with this worker reassignment/notify bits handling

72bcc9d

hawkw reviewed Jun 25, 2026

View reviewed changes

	/// assignment. Returns if the worker has actually been modified; `false` if
	/// assignment. Returns whether the worker has actually been modified; `false` if

Uh oh!

Conversation

iximeow commented Jun 18, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

iximeow commented Jun 18, 2026

Uh oh!

lgfa29 commented Jun 18, 2026

Uh oh!

iximeow commented Jun 19, 2026

Uh oh!

iximeow commented Jun 24, 2026

Uh oh!

lgfa29 commented Jun 24, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants