-
Notifications
You must be signed in to change notification settings - Fork 39
forget less NVMe state across migration #1156
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from all commits
9b10a83
67c2524
64f769e
72bcc9d
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -232,6 +232,10 @@ impl QueueCollection { | |||||||||
| } | ||||||||||
| } | ||||||||||
|
|
||||||||||
| // Callers must hold the worker's lock while polling the next request for | ||||||||||
| // the selected device queue. This ensures that if the worker is to have its | ||||||||||
| // queue assignment changed, this is mutually exclusive with the worker | ||||||||||
| // indicating idleness to a QueueMinder. | ||||||||||
| fn next_req( | ||||||||||
| &self, | ||||||||||
| queue_select: QueueId, | ||||||||||
|
|
@@ -742,11 +746,15 @@ impl WorkerSlot { | |||||||||
| WaitForReq::new(self) | ||||||||||
| } | ||||||||||
|
|
||||||||||
| fn update_assignment(&self, assign: &Assignment) { | ||||||||||
| /// Reconfigure this worker for a polling strategy matching the current | ||||||||||
| /// assignment. Returns if the worker has actually been modified; `false` if | ||||||||||
| /// the current strategy is newer than provided (as in the case of racing | ||||||||||
| /// worker assignments). | ||||||||||
| fn update_assignment(&self, assign: &Assignment) -> bool { | ||||||||||
| let mut state = self.state.lock().unwrap(); | ||||||||||
| if state.assign_strat.newer_than(&assign.strategy) { | ||||||||||
| // We already have a newer assignment | ||||||||||
| return; | ||||||||||
| // We already have a newer assignment; nothing has been updated. | ||||||||||
| return false; | ||||||||||
| } | ||||||||||
| state.assign_strat = assign.strategy; | ||||||||||
| if assign.should_halt { | ||||||||||
|
|
@@ -759,7 +767,8 @@ impl WorkerSlot { | |||||||||
| PollAssignment::Idle | ||||||||||
| }; | ||||||||||
| } | ||||||||||
| self.wake(Some(state), None); | ||||||||||
|
|
||||||||||
| true | ||||||||||
| } | ||||||||||
|
|
||||||||||
| fn wake( | ||||||||||
|
|
@@ -983,8 +992,67 @@ impl WorkerCollection { | |||||||||
| let generation = assign.strategy.generation() as u64; | ||||||||||
| (devid.0, assign_name, generation) | ||||||||||
| }); | ||||||||||
|
|
||||||||||
| let mut any_updated = false; | ||||||||||
| for slot in self.workers.iter() { | ||||||||||
| if slot.update_assignment(&assign) { | ||||||||||
| any_updated = true; | ||||||||||
| } | ||||||||||
|
Comment on lines
+998
to
+1000
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. could also be:
Suggested change
but it's up to you whether this is better or worse |
||||||||||
| } | ||||||||||
|
|
||||||||||
| if !any_updated { | ||||||||||
| return; | ||||||||||
| } | ||||||||||
|
|
||||||||||
| // At this point we have updated at least one worker. The worker may | ||||||||||
| // already have been active, its assignment may have been Fixed or | ||||||||||
| // FreeForAll. It may have become idle, setting `notify_workers` on some | ||||||||||
| // queue, and if the strategy was changed from Fixed(A) to Fixed(B) the | ||||||||||
| // queue with a notify bit for this worker may actually not be the queue | ||||||||||
| // this worker will try pulling I/Os from! In short, we've made a mess | ||||||||||
| // of things. Cleaning this up is not all bad, but subtle. | ||||||||||
| // | ||||||||||
| // First, we'll clear all notification bits. At this point, if any | ||||||||||
| // workers idle again, they will indicate idleness on the correct | ||||||||||
| // queues, at least. We may blow away some worker idleness bits that are | ||||||||||
| // correctly placed, along with any bits that might be incorrectly | ||||||||||
| // placed. | ||||||||||
| // | ||||||||||
| // Then, kick all the workers. We probably don't actually have I/O for | ||||||||||
| // all of them. But, when workers idle this time, they'll indicate | ||||||||||
| // *correct* idle bits and notifications will engage the correct | ||||||||||
| // workers. | ||||||||||
| // | ||||||||||
| // The strangest case here is probably if a queue is moved from Fixed to | ||||||||||
| // FreeForAll; the worker should have set a bit for its ID in all | ||||||||||
| // minders. This procedure will result in bits for all idle queues as | ||||||||||
| // appropriate once the worker is kicked and deposits idleness across | ||||||||||
| // the minders. | ||||||||||
| for slot in self.workers.iter() { | ||||||||||
| // The mechanics of this are a little paranoid: only worker slots | ||||||||||
| // themselves have a queue collection, and it *should* be the case | ||||||||||
| // that all queue collections are the same. But even if it wasn't, | ||||||||||
| // we need to walk all the queues that all the reassigned workers | ||||||||||
| // could access to clear all the notify_workers bits. | ||||||||||
| // | ||||||||||
| // Currently this means we walk all the queue minders for the device | ||||||||||
| // NUM_WORKERS-many times. Queue reassignment is already terribly | ||||||||||
| // expensive what with having to kick all the workers. Please do not | ||||||||||
| // reconfigure device queues while sensitive to I/O latency or | ||||||||||
| // throughput.. | ||||||||||
| let state = slot.state.lock().unwrap(); | ||||||||||
| if let Some(qcol) = state.queues.as_ref() { | ||||||||||
| for queue in qcol.queues.iter() { | ||||||||||
| let qs = queue.state.lock().unwrap(); | ||||||||||
| if let Some(minder) = qs.minder.as_ref() { | ||||||||||
| let _ = minder.take_notifications(); | ||||||||||
| } | ||||||||||
| } | ||||||||||
| } | ||||||||||
| } | ||||||||||
|
|
||||||||||
| for slot in self.workers.iter() { | ||||||||||
| slot.update_assignment(&assign); | ||||||||||
| slot.wake(None, None); | ||||||||||
| } | ||||||||||
| } | ||||||||||
| fn slot(&self, id: WorkerId) -> &WorkerSlot { | ||||||||||
|
|
||||||||||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -1418,6 +1418,14 @@ impl MigrateMulti for PciNvme { | |
| output.push(ctrl.export().into())?; | ||
| drop(ctrl); | ||
|
|
||
| // We leave `is_enabled` and `device_id` behind here: | ||
| // * is_enabled is just a mirror of the controller's enabled bit. We'll | ||
| // recover it when importing controller state on the other side. | ||
| // * `device_id` is, like other `define_id` items, tied to non-migrated | ||
| // statics in the Propolis process. Migrating it is, absent other | ||
| // work, a bug waiting to happen (imagine ID gaps where devices with | ||
| // ID 0, 1, and 4 are migrated) | ||
|
|
||
| MigrateMulti::export(&self.pci_state, output, ctx)?; | ||
|
|
||
| Ok(()) | ||
|
|
@@ -1432,6 +1440,9 @@ impl MigrateMulti for PciNvme { | |
|
|
||
| let mut ctrl = self.state.lock().unwrap(); | ||
| ctrl.import(input, self)?; | ||
| // Now that the controller is imported, update our mirror of its | ||
| // enablement bit. | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. mmmmmmaybe worth explicitly calling out that this is inside the lock? |
||
| self.is_enabled.store(ctrl.ctrl.cc.enabled(), Ordering::Release); | ||
| drop(ctrl); | ||
|
|
||
| MigrateMulti::import(&self.pci_state, offer, ctx)?; | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -318,14 +318,22 @@ async fn multiple_migrations(ctx: &TestCtx) { | |
| ); | ||
| } | ||
|
|
||
| #[phd_testcase] | ||
| async fn migration_smoke_test(ctx: &TestCtx) { | ||
| let vm = ctx.spawn_default_vm("migration_smoke_test_0").await?; | ||
| run_smoke_test(ctx, vm).await?; | ||
| } | ||
|
|
||
| async fn run_smoke_test(ctx: &TestCtx, mut source: TestVm) -> Result<()> { | ||
| source.launch().await?; | ||
| source.wait_to_boot().await?; | ||
| let lsout = | ||
| source.run_shell_command("ls foo.bar 2> /dev/null").check_err().await?; | ||
| assert_eq!(lsout, ""); | ||
|
|
||
| // create an empty file on the source VM. | ||
| // create an empty file on the source VM. `/` might be a tmpfs, and the | ||
| // guest may have no writable filesystem even. validation post-migration is | ||
| // a bit tricky, below. | ||
| source.run_shell_command("touch ./foo.bar").await?; | ||
| source.run_shell_command("sync ./foo.bar").await?; | ||
|
|
||
|
|
@@ -334,13 +342,77 @@ async fn run_smoke_test(ctx: &TestCtx, mut source: TestVm) -> Result<()> { | |
| &[Action::MigrateToPropolis(artifacts::DEFAULT_PROPOLIS_ARTIFACT)], | ||
| |target: &TestVm| { | ||
| Box::pin(async { | ||
| // forget about any pagecache data, dentries, inodes; we want to | ||
| // read these from the real backing storage, because we want to | ||
| // make sure it still works! | ||
| target | ||
| .run_shell_command("echo 3 > /proc/sys/vm/drop_caches") | ||
| .await | ||
| .expect("can drop caches"); | ||
|
|
||
| // the file should still exist on the target VM after migration. | ||
| let lsout = target | ||
| .run_shell_command("ls foo.bar") | ||
| .ignore_status() | ||
| .await | ||
| .expect("can try to run `ls foo.bar`"); | ||
| assert_eq!(lsout, "foo.bar"); | ||
|
|
||
| // we have a small conundrum: it's possible that the home | ||
| // directory we wrote into is a tmpfs anyway. so by succeeding | ||
| // this much, we may have really just tested memory is intact. | ||
|
Comment on lines
+361
to
+363
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @hawkw i would like for you to look at this change just because i can imagine the eye roll and swear when you read this comment thanks in advance. it's so much more annoying than i thought it was.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🙄🤬 |
||
| // | ||
| // ensure we test *some* device emulation by at least trying to | ||
| // `ls` the thing we booted out. it *should* be a disk, so we'll | ||
| // exercise some device emulation machinery.. unfortunately | ||
| // finding a productive place to `ls` is a bit ridiculous: | ||
|
|
||
| let mount_grep = target | ||
| .run_shell_command("mount | grep ' / type tmpfs '") | ||
| .ignore_status() | ||
| .await | ||
| .expect("can check mounted filesystems and grep"); | ||
|
|
||
| let root_is_tmpfs = !mount_grep.is_empty(); | ||
|
|
||
| if root_is_tmpfs { | ||
| // the guest is *probably* an Alpine. `/` is a figment of | ||
| // its imagination; a tmpfs with the actual disk somewhere | ||
| // under /media, like /media/nvme0n1. try looking there, and | ||
| // if that fails we'll need to improve the test, phd, or | ||
| // both. | ||
|
Comment on lines
+379
to
+383
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. lmao this rocks alpine is such a distro |
||
| target | ||
| .run_shell_command("ls /media/nvme0n1/boot") | ||
| .await | ||
| .expect("can ls the boot data"); | ||
| } else { | ||
| // the guest probably has a read-write root. the above, it | ||
| // turns out, probably also exercised the disk. check /boot | ||
| // anyway for good measure. | ||
| target | ||
| .run_shell_command("ls /boot") | ||
| .await | ||
| .expect("can ls the boot data"); | ||
| } | ||
|
|
||
| // the `ls` succeeded ... eventually. it's possible the disk | ||
| // *was* broken, the guest realized this, reset the disk, and | ||
| // that brought it back. if so, there's a line about resetting | ||
| // the controller in dmesg. so if we see that, we've actually | ||
| // failed to migrate reasonably. | ||
| // | ||
| // ignore the status from `grep` because it returns 1 when the | ||
| // regex matches no line. we don't want to `check_err()` | ||
| // because it's useful for debugging to see what *did* match. | ||
|
Comment on lines
+404
to
+406
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. isn't there a grep flag that makes it not do that, or did i make that up? |
||
| let disk_reset = target | ||
| .run_shell_command("dmesg | grep 'reset controller'") | ||
| .ignore_status() | ||
| .await | ||
| .expect("can grep dmesg"); | ||
|
|
||
| if !disk_reset.is_empty() { | ||
| panic!("controller reset during the test?! {}", disk_reset); | ||
| } | ||
| }) | ||
| }, | ||
| ) | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a super ultra nitpick, but perhaps
since i feel "returns if" is misinterpretable as "does not return until"? even though it's probably obvious.