Skip to content

forget less NVMe state across migration#1156

Open
iximeow wants to merge 4 commits into
masterfrom
ixi/nvme-migration
Open

forget less NVMe state across migration#1156
iximeow wants to merge 4 commits into
masterfrom
ixi/nvme-migration

Conversation

@iximeow

@iximeow iximeow commented Jun 18, 2026

Copy link
Copy Markdown
Member

add an additional test for propolis-to-itself migration and run the smoke tests there. extend the smoke test to try making sure disks are at least kind of basically still functional.

this should fix #981, but I have to admit I couldn't really reproduce the second failure mode that I alluded to there or @lgfa29 saw last week. I think that returning incorrect submission queue heads could have an outcome where the guest OS thinks it's submitting I/Os in a way we wouldn't get to, but beyond the status quo obviously being wrong, I'm not sure I've actually tickled that condition much at all.

I mostly really want to extend #966 to actually drive I/Os now, because either of the improvements here should fail quickly through a fuzzer...

add an additional test for propolis-to-itself migration and run the
smoke tests there. extend the smoke test to try making sure disks are
at least kind of basically still functional.
@iximeow iximeow added bug Something that isn't working. storage Related to storage devices/backends. migration Issues related to live migration. labels Jun 18, 2026
Comment on lines +361 to +363
// we have a small conundrum: it's possible that the home
// directory we wrote into is a tmpfs anyway. so by succeeding
// this much, we may have really just tested memory is intact.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hawkw i would like for you to look at this change just because i can imagine the eye roll and swear when you read this comment thanks in advance. it's so much more annoying than i thought it was.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙄🤬

@iximeow

iximeow commented Jun 18, 2026

Copy link
Copy Markdown
Member Author

hm. the failures are fascinating!

phd_tests::migrate::smoke_test yields:

localhost:~# dmesg | grep 'reset controller'
[   68.965446] nvme nvme0: I/O 82 QID 2 timeout, reset controller
localhost:~# echo $?
0

so.. the test detected a problem! also what in the world.

migration_smoke_test on the other hand:

localhost:~# ls /media/nvme0n1/boot
System.map-virt  config-virt      dtbs-virt        grub             initramfs-virt   modloop-virt     syslinux         vmlinuz-virt
localhost:~# echo $?
0
localhost:~# dmesg | grep 'reset controller'
localhost:~# echo $?
1

... I presumably should be ignoring the exit code of the grep, oops.

@lgfa29

lgfa29 commented Jun 18, 2026

Copy link
Copy Markdown
Member

I ran a quick test using the same setup I did previously and things have certainly improved, the migration actually finished now 😄

But I still ran into issues after the migration, but each time it was a different.

The first time I ran a migration the disk seemed to be "stuck" for about 2min, but then I saw these kernel messages in the guest and the disk seemed to be fine:

[   89.895139] nvme nvme0: I/O 900 (I/O Cmd) QID 2 timeout, aborting
[   89.896745] nvme nvme0: Abort status: 0x0
[  120.103133] nvme nvme0: I/O 900 QID 2 timeout, reset controller
[  120.135440] nvme0n1: I/O Cmd(0x2) @ LBA 1336320, 32 blocks, I/O Error (sct 0x3 / sc 0x71)
[  120.137336] I/O error, dev nvme0n1, sector 1336320 op 0x0:(READ) flags 0x80700 phys_seg 4 prio class 2
[  120.151848] nvme nvme0: 2/0/0 default/read/poll queues

After the second test the disk seemed to be fine, but then I left the guest running while I was in a meeting and when I came it was unresponsive and I had several messages like these:

[   99.371306] nvme nvme0: I/O 263 QID 2 timeout, reset controller
[   99.418681] nvme nvme0: 2/0/0 default/read/poll queues
[  130.091326] nvme nvme0: I/O 263 (I/O Cmd) QID 2 timeout, aborting
[  130.092870] nvme nvme0: Abort status: 0x0
[  160.811324] nvme nvme0: I/O 263 QID 2 timeout, reset controller
[  160.858176] nvme nvme0: 2/0/0 default/read/poll queues
[  191.531317] nvme nvme0: I/O 263 QID 2 timeout, disable controller
[  191.563602] I/O error, dev nvme0n1, sector 2884112 op 0x1:(WRITE) flags 0x29800 phys_seg 1 prio class 2
[  191.565764] I/O error, dev nvme0n1, sector 2884112 op 0x1:(WRITE) flags 0x29800 phys_seg 1 prio class 2
[  191.567984] Aborting journal on device nvme0n1p1-8.
[  191.568026] nvme nvme0: unable to set dbbuf
[  191.570141] nvme nvme0: failed to mark controller live state
[  191.570198] I/O error, dev nvme0n1, sector 358320 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
[  191.570972] nvme nvme0: Removing after probe failure status: -19
[  191.572323] EXT4-fs warning (device nvme0n1p1): ext4_end_bio:343: I/O error 10 writing to inode 110 starting block 44790)
[  191.574436] Buffer I/O error on device nvme0n1p1, logical block 12022
[  191.575283] I/O error, dev nvme0n1, sector 358408 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
[  191.575302] EXT4-fs error (device nvme0n1p1): ext4_journal_check_start:83: comm systemd-journal: Detected aborted journal
[  191.576423] EXT4-fs warning (device nvme0n1p1): ext4_end_bio:343: I/O error 10 writing to inode 110 starting block 44801)
[  191.579136] Buffer I/O error on device nvme0n1p1, logical block 12033
[  191.579981] I/O error, dev nvme0n1, sector 527624 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
[  191.581113] EXT4-fs warning (device nvme0n1p1): ext4_end_bio:343: I/O error 10 writing to inode 110 starting block 65953)
[  191.582495] Buffer I/O error on device nvme0n1p1, logical block 33185
[  191.583445] I/O error, dev nvme0n1, sector 355296 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
[  191.584784] EXT4-fs warning (device nvme0n1p1): ext4_end_bio:343: I/O error 10 writing to inode 110 starting block 44412)
[  191.586373] Buffer I/O error on device nvme0n1p1, logical block 11644
[  191.587339] I/O error, dev nvme0n1, sector 355584 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
[  191.588686] EXT4-fs warning (device nvme0n1p1): ext4_end_bio:343: I/O error 10 writing to inode 110 starting block 44448)

In my last test, the disk started OK for read, but writes would hang and then I started to get:

Last login: Thu Jun 18 18:17:43 UTC 2026 on ttyS0
root@localhost:~#
root@localhost:~# ls
mnt  test
root@localhost:~# [   37.163334] nvme nvme0: I/O 201 (I/O Cmd) QID 2 timeout, aborting
[   37.164890] nvme nvme0: Abort status: 0x0
[   37.215307] nvme nvme0: I/O 202 (I/O Cmd) QID 2 timeout, aborting
[   37.216783] nvme nvme0: Abort status: 0x0

root@localhost:~# ls
mnt  test
root@localhost:~# ls test
test
root@localhost:~# cat test
1
root@localhost:~# [   67.371330] nvme nvme0: I/O 201 QID 2 timeout, reset controller
[   67.416449] nvme nvme0: 2/0/0 default/read/poll queues

root@localhost:~# ls
mnt  test
root@localhost:~# cat ./test
1
root@localhost:~# [   99.371308] nvme nvme0: I/O 201 (I/O Cmd) QID 2 timeout, aborting
[   99.372610] nvme nvme0: Abort status: 0x0

root@localhost:~#
root@localhost:~# cat ./test [  130.091303] nvme nvme0: I/O 201 QID 2 timeout, reset controller
[  130.136615] nvme nvme0: 2/0/0 default/read/poll queues

1
root@localhost:~# echo 2 > ./test

[  160.811324] nvme nvme0: I/O 201 QID 2 timeout, disable controller
[  160.843610] I/O error, dev nvme0n1, sector 2884112 op 0x1:(WRITE) flags 0x29800 phys_seg 1 prio class 2
[  160.845169] I/O error, dev nvme0n1, sector 2884112 op 0x1:(WRITE) flags 0x29800 phys_seg 1 prio class 2
[  160.846325] Aborting journal on device nvme0n1p1-8.
[  160.846396] nvme nvme0: unable to set dbbuf
[  160.848017] nvme nvme0: failed to mark controller live state
[  160.848047] I/O error, dev nvme0n1, sector 532504 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
[  160.849102] nvme nvme0: Removing after probe failure status: -19
[  160.850177] EXT4-fs warning (device nvme0n1p1): ext4_end_bio:343: I/O error 10 writing to inode 94 starting block 66563)
[  160.850961] EXT4-fs error (device nvme0n1p1): ext4_journal_check_start:83: comm kworker/u4:1: Detected aborted journal
[  160.852306] EXT4-fs error (device nvme0n1p1): ext4_journal_check_start:83: comm systemd-journal: Detected aborted journal
[  160.853598] EXT4-fs error (device nvme0n1p1): ext4_journal_check_start:83: comm kworker/u4:4: Detected aborted journal
[  160.854895] I/O error, dev nvme0n1, sector 2883584 op 0x1:(WRITE) flags 0x29800 phys_seg 1 prio class 2
[  160.857402] Buffer I/O error on dev nvme0n1p1, logical block 327680, lost sync page write
[  160.858386] JBD2: I/O error when updating journal superblock for nvme0n1p1-8.
[  160.859265] I/O error, dev nvme0n1, sector 262144 op 0x1:(WRITE) flags 0x23800 phys_seg 1 prio class 2
root@localhost:~#
[  160.861812] Buffer I/O error on dev nvme0n1p1, logical block 0, lost sync page write
root@localhost:~# [  160.863545] EXT4-fs (nvme0n1p1): I/O error while writing superblock
[  160.863557] EXT4-fs (nvme0n1p1): previous I/O error to superblock detected
[  160.864327] EXT4-fs (nvme0n1p1): Remounting filesystem read-only
[  160.866214] EXT4-fs (nvme0n1p1): ext4_writepages: jbd2_start: 1024 pages, ino 95; err -30
[  160.871292] nvme0n1: detected capacity change from 6291456 to 0
[  160.871327] EXT4-fs warning (device nvme0n1p1): ext4_end_bio:343: I/O error 10 writing to inode 95 starting block 67414)
[  160.872211] Buffer I/O error on dev nvme0n1p1, logical block 0, lost sync page write
[  160.873556] Buffer I/O error on device nvme0n1p1, logical block 34646
[  160.874499] EXT4-fs (nvme0n1p1): I/O error while writing superblock
[  160.875297] EXT4-fs warning (device nvme0n1p1): ext4_end_bio:343: I/O error 10 writing to inode 95 starting block 67369)
[  160.876050] EXT4-fs (nvme0n1p1): Remounting filesystem read-only

For reference, these are my test steps:

  1. Start first propolis-server
    pfexec propolis-server run ./OVMF_CODE.fd 0.0.0.0:3000 0.0.0.0:4000
    
  2. Start second propolis-server (same host in my case)
    pfexec propolis-server run ./OVMF_CODE.fd 0.0.0.0:3001 0.0.0.0:4001
    
  3. Create a debian.toml file
    [main]
    name = "testvm-debian"
    cpus = 2
    memory = 4096
    bootrom = "./OVMF_CODE.fd"
    
    [block_dev.os]
    type = "file"
    path = "./debian-12-nocloud-amd64.raw"
    
    [dev.block0]
    driver = "pci-nvme"
    block_dev = "os"
    pci-path = "0.3.0"
  4. Start the VM
    propolis-cli --server 127.0.0.1 --port 3000 new -t ./debian.toml -c 2 -m 2048 test-vm
    propolis-cli --server 127.0.0.1 --port 3000 state run
    propolis-cli --server 127.0.0.1 --port 3000 serial
    
  5. Trigger a migration
    propolis-cli --server 127.0.0.1 --port 3000 migrate -p 3001 127.0.0.1
    
  6. Attempt to read and write files in the guest

@hawkw hawkw self-requested a review June 18, 2026 19:58
@iximeow

iximeow commented Jun 19, 2026

Copy link
Copy Markdown
Member Author

after staring at this some more: it looks like there's a condition where we can import a VM, have a submission enqueued, resume devices, but incorrectly decide the queue is empty. the guest doesn't submit another I/O because the one it's waiting for is outstanding, so it discovers the disk is stuck. then all the normal "your disk stopped" happens. this happens with and without doorbell buffers, I've been looking with them disabled for simplicity.

as we saw earlier, migrating while the I/O is submitted but unprocessed does "unstick" it, but I don't quite follow how. where it's stuck, nvme_block_notify!() says queue 2 got a notify for one item, then block queue 0 (??? this should be 1!) polls, returns None, and that's it. then you migrate the VM, resume nudges all the queues, the I/O gets done, and we're back.

so... why does nvme_block_notify!() for queue 2 lead into a block_poll!() for I/O queue 0? it's clearly a fixed assignment strategy because when that worker goes idle it only checks the first queue, so it's in next_req() not next_req_any() where None would have it check other queues until there's a request. .. huh? ..

the comment included here gets at the problem, but I spotted this by
attaching to every propolis probe and pouring over all the output,
touching up each relevant probe a bit as-needed. the tell there is that
when waking a worker collection, eventually if you have a device with
two queues (such as in a VM with two vCPUs) you'll see one scroll by
that has a notify_workers of 0x00ab, while the other queue has a
notify_workers of 0x0055. this is clearly wrong: worker 0 has a fixed
assignment, but both minders believe that worker 0 will handle their
I/Os. the wrong queue wakes thread 0, which looks at its assigned queue
and sees no I/O, idles, and we never discover I/O on the queue that we'd
intended to wake a worker for.

the problem here is that when a worker idles, it sets a bit in the
minder's notify_workers, but then when the worker is reassigned we don't
clear that bit from the notify_workers bitmask. we *should* just flush
all the bits when reassigning queues, but there's not a handle to the
queue collection to do that. so instead, when attempting to wake queues,
accept we might see bits for workers we really shouldn't wake, and if
that worker has a fixed assignment that is not our queue, do not count
it towards waking the requested number of workers.

this involves more locking in the wake path than i want to add, but is
good enough to prove out that the thing works.
@iximeow

iximeow commented Jun 24, 2026

Copy link
Copy Markdown
Member Author

67c2524 looks like it covers the issue from here. clippy hates it, i think the change itself is not the right way to spell it, but it describes the issue i see and a fix in the interim until i figure out plumbing in a QueueCollection or .... something?

@lgfa29

lgfa29 commented Jun 24, 2026

Copy link
Copy Markdown
Member

I ran a few tests using a similar setup as my previous test and the migrated disks seem to work fine now 🎉

Comment thread lib/propolis/src/block/attachment.rs Outdated
continue;
}
};
}

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so, at this point we'll drop slot.state.lock(). we could reconfigure the worker at this point, so that slot.wake() just below will wake the worker which then goes and checks a different fixed queue anyway. an example here is if a guest has gone and enabled (or disabled) an I/O queue and moved assignments around.

this is an improvement only as long as queues are not being reassigned concurrently with I/O, and if there is I/O then we have the same issue of queues poking workers that have no hope of serving them.

I might want to revert this and land 1156 as a marginal improvement for now, and have an issue about the shape of this problem and notify_workers in the face of device reconfiguration...

Comment thread lib/propolis/src/block/attachment.rs Outdated
continue;
}
};
}

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so, at this point we'll drop slot.state.lock(). we could reconfigure the worker at this point, so that slot.wake() just below will wake the worker which then goes and checks a different fixed queue anyway. an example here is if a guest has gone and enabled (or disabled) an I/O queue and moved assignments around.

this is an improvement only as long as queues are not being reassigned concurrently with I/O, and if there is I/O then we have the same issue of queues poking workers that have no hope of serving them.

I might want to revert this and land 1156 as a marginal improvement for now, and have an issue about the shape of this problem and notify_workers in the face of device reconfiguration...


fn update_assignment(&self, assign: &Assignment) {
/// Reconfigure this worker for a polling strategy matching the current
/// assignment. Returns if the worker has actually been modified; `false` if

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a super ultra nitpick, but perhaps

Suggested change
/// assignment. Returns if the worker has actually been modified; `false` if
/// assignment. Returns whether the worker has actually been modified; `false` if

since i feel "returns if" is misinterpretable as "does not return until"? even though it's probably obvious.

Comment on lines +998 to +1000
if slot.update_assignment(&assign) {
any_updated = true;
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could also be:

Suggested change
if slot.update_assignment(&assign) {
any_updated = true;
}
any_updated |= slot.update_assignment(&assign);

but it's up to you whether this is better or worse

let mut ctrl = self.state.lock().unwrap();
ctrl.import(input, self)?;
// Now that the controller is imported, update our mirror of its
// enablement bit.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mmmmmmaybe worth explicitly calling out that this is inside the lock?

Comment on lines +379 to +383
// the guest is *probably* an Alpine. `/` is a figment of
// its imagination; a tmpfs with the actual disk somewhere
// under /media, like /media/nvme0n1. try looking there, and
// if that fails we'll need to improve the test, phd, or
// both.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lmao this rocks alpine is such a distro

Comment on lines +404 to +406
// ignore the status from `grep` because it returns 1 when the
// regex matches no line. we don't want to `check_err()`
// because it's useful for debugging to see what *did* match.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't there a grep flag that makes it not do that, or did i make that up?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something that isn't working. migration Issues related to live migration. storage Related to storage devices/backends.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

NVMe devices don't tolerate migration

3 participants