
support NVMe Deallocate #1105

Open

ahrens wants to merge 3 commits into master from mahrens/trim

Conversation

ahrens commented Apr 8, 2026

This implements #990. It makes propolis advertise support for the "Dataset Management" NVMe command. It uses ioctl(DKIOCFREE) to pass these requests through to local disks (FileBackend). The requests are ignored on distributed disks (Crucible).

Note that our devices are NVME 1.0e, which specifies that the disk "may" deallocate all provided ranges, and "shall return all zeros, all ones, or the last data written to the associated LBA". The 1.0e spec has no mechanism for telling the guest which of these semantics is actually happening. Future work may include migrating to NVME 1.1 or later, which can use the DRB (Deallocation Read Behavior) field to tell the guest whether the blocks are actually zeroed or not.

This requires the changes for https://github.com/oxidecomputer/stlouis/issues/940 which are under review here.

A few details to be aware of or provide input on:

  1. I am always advertising support for the Dataset Management command (with a bit in the oncs field). I think this is OK since there is no live migration currently (even if the VM has no local disks). So each time a VM boots, it will be on a specific version of Propolis which either advertises Dataset Management, or not.
  2. block::Operation::Discard now means to discard multiple ranges from the client-provided list, not just one range. probes::block_begin_discard used to take an offset and length, but I have changed it to take the number of ranges instead. Is this OK? Are there consumers that need to change? We could fire a probe for each range, but then there would be multiple “begin” probes for one devqid, which could be confusing because begin/complete probes would not match up. Similar for probes::nvme_discard_enqueue.
  3. In VirtualDiskStats, should we add stats for Discard? How do we monitor these? Would we need to add support in consumers?
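To illustrate item 1: per the NVMe spec, Dataset Management support is advertised via bit 2 of the ONCS (Optional NVM Command Support) field in the Identify Controller data structure (bytes 521:520). A minimal sketch of setting that bit; the constant and function names here are illustrative, not Propolis's actual identifiers:

```rust
// Bit 2 of ONCS (Identify Controller, bytes 521:520) advertises support
// for the Dataset Management command, per the NVMe spec.
// Names are illustrative, not Propolis's actual identifiers.
const ONCS_DATASET_MGMT: u16 = 1 << 2;

fn advertise_dsm(oncs: u16) -> u16 {
    oncs | ONCS_DATASET_MGMT
}

fn main() {
    let oncs = advertise_dsm(0);
    assert_ne!(oncs & ONCS_DATASET_MGMT, 0);
}
```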

ahrens requested review from iximeow and rmustacc April 8, 2026 17:34
ahrens (Author) commented Apr 8, 2026

Testing: installed the ZFS changes and this Propolis on mb-1. In the Linux guest VM, I created a zpool on a local disk and used zpool trim and zpool set autotrim=on to cause it to issue Deallocate commands. I used dtrace to observe that vdev_disk_issue_trim() is called on the underlying physical disk.

iximeow (Member) left a comment


nice, this will be nice to have, thanks for working on this. hopefully none of the comments are too surprising!

Comment thread lib/propolis/src/hw/nvme/cmds.rs Outdated
Comment thread lib/propolis/src/hw/nvme/cmds.rs Outdated
Comment thread lib/propolis/src/hw/nvme/cmds.rs Outdated
Comment thread lib/propolis/src/hw/nvme/cmds.rs Outdated
Comment thread lib/propolis/src/hw/nvme/cmds.rs Outdated
Comment thread lib/propolis/src/hw/nvme/requests.rs Outdated
Comment thread lib/propolis/src/hw/nvme/requests.rs Outdated
Comment thread lib/propolis/src/block/file.rs
Comment thread bin/propolis-server/src/lib/stats/virtual_disk.rs Outdated
Comment thread lib/propolis/src/block/crucible.rs Outdated
ahrens force-pushed the mahrens/trim branch 2 times, most recently from e370d49 to eed15c6 on April 13, 2026 19:47
iximeow (Member) left a comment


FWIW I'm a solid +1 on this now, thanks! if you're going to add a test about DatasetManagementCmd with bogus prp1/prp2 then I'll be happy to take a look at that here too. if not (or you have surprise issues), this is a nice improvement as-is.

ahrens force-pushed the mahrens/trim branch 2 times, most recently from 5fdea7d to 05eafbf on April 15, 2026 17:31
rmustacc left a comment


Thanks for the work on this.

Comment on lines +206 to +207
(self.starting_lba * lba_data_size) as ByteOffset,
(self.number_logical_blocks as u64 * lba_data_size) as ByteLen,


Can we add a comment about why it's okay to not check for overflow on user controlled data here and simply cast with as? I would have probably expected this to result in a failure at some point. Maybe this would fail with Invalid Field in Command or maybe NVMe controllers are defined to do wrap around at disk size or something similar?

If we think it should fail, we should probably check what other NVMe devices do.

Member


I imagine we'd see LBA Out of Range from an actual device? just from the NVMe spec that seems like the most relevant error for the offset/length being out of range. rather than the OS seeing the invalid range, returning an error, and that eventually being a Data Transfer Error? but I'm not sure what other devices do either.

/// Returns an Iterator that yields [`GuestRegion`]'s which contain the array of LBA ranges.
pub fn data<'a>(&self, mem: &'a MemCtx) -> PrpIter<'a> {
PrpIter::new(
u64::from(self.nr)


Maybe it's worth a comment that the truncating as here is safe because self.nr is required to be a u8 and a u8 * 16 will generally not overflow a u64. It may be clearer to users if we don't have the ::from, though that may be required.

Author


I think what's happening here is:

  1. self.nr (type u16, value 256 or less) is converted to a u64 with u64::from(), which is lossless
  2. size_of::<DatasetManagementRangeDefinition>() (type usize, value 16) is converted to a u64 with as u64. In the most general case, a usize could be larger than u64; is that the case in any compilation environment we care about?
  3. the two above u64's are multiplied together and passed to PrpIter::new()

The multiplication of two u64's can in general overflow, but in this case the maximum value is 256*16=4,096 which doesn't overflow. I'll add a comment explaining that.

Side note: In a project I worked on previously, we made a helper function that would convert a usize to a u64, with a compile time check that this is lossless. (We also used the cast_possible_truncation lint to disallow using as this way.) Is there something similar in Propolis, or would it be worth adding that?
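A sketch of the kind of helper the side note describes, assuming only 64-bit targets; the function name is hypothetical, not an existing Propolis API:

```rust
// Hypothetical helper: lossless usize -> u64 conversion, guaranteed at
// compile time rather than relying on an `as` cast being safe.
// The const assertion fails the build on any target where usize is
// wider than u64 (no such target is supported by Propolis today).
const _: () = assert!(std::mem::size_of::<usize>() <= std::mem::size_of::<u64>());

fn usize_to_u64(v: usize) -> u64 {
    // Cannot truncate: the const assertion above proves usize fits in u64.
    v as u64
}

fn main() {
    // Mirrors the analysis above: at most 256 ranges of 16 bytes each.
    let nr: u64 = 256;
    let range_size = usize_to_u64(16);
    assert_eq!(nr * range_size, 4096); // far from u64 overflow
}
```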



size_of::<DatasetManagementRangeDefinition>() (type usize, value 16) is converted to a u64 with as u64. In the most general case, a usize could be larger than u64; is that the case in any compilation environment we care about?

No, I don't think so. I don't see us ever being on a non-LP64 platform. This basically relies on bhyve, and the other practical archs (ARMv8+, RV64) are going to have usize and u64 be the same size in the ABI. I think @iximeow had something that basically forced that to be the case.

Side note: In a project I worked on previously, we made a helper function that would convert a usize to a u64, with a compile time check that this is lossless. (We also used the cast_possible_truncation lint to disallow using as this way.) Is there something similar in Propolis, or would it be worth adding that?

Yeah, probably? I think @iximeow has something similar here that'd probably be worth considering.

Member


yeah, I'd mentioned to Robert elsewhere that I kind of want to pull these helpers I'd written elsewhere into Propolis since we know we're only going to care about building this on 64-bit targets. then we can go shoo out all the as u64/as usize where it's really not as surprising to the reader as it could seem. I've started writing up a patch that moves a bunch of the storage-related stuff from usize to u64 first though (it will conflict with these changes. that's fine. you should go first :) )

for here a comment would be great, I'll have a separate patch for those in a bit (and I can tag you for review if you want @ahrens 😁 )

{
defs.into_iter().map(Ok).collect::<Vec<_>>().into_iter()
} else {
vec![Err("Failed to read LBA range")].into_iter()


For my own understanding, how do these and backend errors get translated into NVMe errors?

Member


they end up being slightly different, since the backend has the more generic block types in case the storage medium is behind, say, a virtio-block device.

for the NVMe command here, the caller collects these iterators into a Result<Vec<DatasetManagementRangeDefinition>, _> when translating to a Request in hw/nvme/requests.rs. that iteration stops at the first returned Err, returning that for the whole try_collect(). so an invalid range here turns into the NVMe Completion::generic_err(bits::STS_DATA_XFER_ERR) when the caller handles the error.

backend errors are more distant and just an opaque Success or Failure which end up at DeviceRequest::complete -> QueueMinder::complete -> self.complete_req_fn there -> nvme's DeviceQueue::complete. at that last step we go through impl From<block::Result> for Completion to get an NVMe completion to actually write back to the guest.
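The "iteration stops at the first Err" behavior comes from collecting an iterator of Results into a Result<Vec<_>, _> (what try_collect() does); a standalone illustration of that short-circuiting:

```rust
// Collecting Iterator<Item = Result<T, E>> into Result<Vec<T>, E>
// short-circuits: iteration stops at the first Err, and that Err
// becomes the result for the whole collection.
fn main() {
    let items = vec![Ok(1u64), Err("Failed to read LBA range"), Ok(3)];
    let collected: Result<Vec<u64>, &str> = items.into_iter().collect();
    assert_eq!(collected, Err("Failed to read LBA range"));

    let ok_items = vec![Ok(1u64), Ok(2), Ok(3)];
    let collected: Result<Vec<u64>, &str> = ok_items.into_iter().collect();
    assert_eq!(collected, Ok(vec![1, 2, 3]));
}
```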

regions: Vec<GuestRegion>,
) -> Self {
Self { op: Operation::Read(off, len), regions }
Self { op: Operation::Read(off, len), regions, ranges: Vec::new() }


Do we want to be adding a new allocation to every read/write/flush here? Seems like this is a better use for an Option maybe to indicate that there's nothing there. Maybe not really appropriate for this change. @iximeow thoughts?

Member


just a Vec::new() (and Vec::with_capacity(0), iirc) won't allocate on their own, so yeah this is fine IMO. we're growing the size of struct Request by three usize to hold the Vec itself, but that's all in the typical read/write/flush case.

Member


Yeah, an empty Vec will only allocate if it's pushed to.

I do somewhat wonder if we ought to be replacing the vecs of ranges with a bitmap or some such but that seems out of scope.
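The claim about Vec is standard library behavior worth making concrete: an empty Vec holds no heap allocation, and the handle itself is three words (pointer, length, capacity) regardless of contents:

```rust
fn main() {
    // Vec::new() performs no heap allocation; capacity stays 0 until a push.
    let v: Vec<(u64, u64)> = Vec::new();
    assert_eq!(v.capacity(), 0);

    // The Vec handle itself is three usize-sized words (ptr, len, cap),
    // which is the only per-Request cost in the read/write/flush case.
    assert_eq!(
        std::mem::size_of::<Vec<(u64, u64)>>(),
        3 * std::mem::size_of::<usize>()
    );
}
```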

if let Some(mech) = self.discard_mech {
dkioc::do_discard(&self.fp, mech, off as u64, len as u64)
for &(off, len) in &req.ranges {
dkioc::do_discard(


Can we get a comment about why we're not using the ioctl's ability to handle multiple ranges in a single go? I'm guessing because ZFS is going to chunk it up, but if this is a bunch of basically 4k ranges from an FS, that means a lot of ioctls when the drive would prefer to get more commands.

Author


Yeah it's true that this will result in more ioctls, but zfs will split it into one operation per range. I think we could/should pass all the ranges in a single ioctl if that would substantially improve performance. I can measure that if you'd like, is there any particular use case that I should look at?



If ZFS is going to split it into one underlying disk operation per range then I'm not sure there's much value in doing something different here. A comment to future us is probably worthwhile. I don't think we have enough to know about the different general guest FS patterns to say we'd be getting enough deallocates in the same request to an underlying device to allow it to better optimize internal FTL rewrites.

block::Operation::Discard => {
if let Some(mech) = self.discard_mech {
dkioc::do_discard(&self.fp, mech, off as u64, len as u64)
for &(off, len) in &req.ranges {


Who is checking the offset/length for overflow and that it's actually within the virtual device? Is that being left to the kernel?

Author


off and len are type usize; can as u64 truncate it on any systems we run on?

As for checking that it's actually within the virtual device, zfs will check and ignore any (part of the) range that's outside the zvol.



off and len don't inherently. However, off + len certainly can overflow and represent things outside of the device. I don't think we should silently ignore invalid block ranges unless we have evidence that normal devices do. I'm surprised that ZFS would just ignore that and not error on that to be honest.



To be a bit more specific here, I think if someone gives us an invalid range that should cause an error and we shouldn't just acknowledge it. I'm not sure how hardware handles it and what the expected semantics should be, but this should maybe be all validated up front so we can distinguish it from a generic errno value. Overloaded non-semantic errnos for these aren't great, but that's long been the pattern of the dkio interfaces.
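The up-front validation being suggested could look like the following sketch, which rejects both integer overflow and ranges past the end of the device before anything reaches dkio (function and error names are illustrative only):

```rust
// Hypothetical up-front validation of a discard range, per the review
// discussion: reject overflow and out-of-device ranges before issuing
// any ioctl, so the guest gets a semantically meaningful error rather
// than an overloaded errno from the dkio layer.
fn validate_range(off: u64, len: u64, dev_size: u64) -> Result<(), &'static str> {
    // checked_add catches off + len wrapping around u64.
    let end = off.checked_add(len).ok_or("range overflows u64")?;
    if end > dev_size {
        return Err("range extends past end of device");
    }
    Ok(())
}

fn main() {
    let dev_size = 1u64 << 30; // 1 GiB device, for illustration
    assert!(validate_range(0, 4096, dev_size).is_ok());
    assert!(validate_range(u64::MAX, 1, dev_size).is_err()); // wraps
    assert!(validate_range(dev_size, 1, dev_size).is_err()); // past end
}
```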

Comment on lines +80 to +82
Operation::Discard => {
// Discard is only partially wired up in backends we care about, and it's not
// clear what stats (if any) to report yet.


Since these are real now, I think we should probably consider adding the following to match https://github.com/oxidecomputer/omicron/blob/main/oximeter/oximeter/schema/virtual-disk.toml:

  • discards
  • failed-discards
  • add this to io-latency and io-size and make sure this is a new io-kind of operation

Member


since the schema lives in Omicron, I think we'd want to land this, update the schema in Omicron, update the Omicron ref here in Propolis, then add the metrics. sound right to you too?

Contributor


@iximeow That's right! LMK if you need any help getting that sorted.
