Conversation
testing: installed the ZFS changes and this Propolis on mb-1. In the Linux guest VM, created a zpool on a local disk and used …
iximeow left a comment:
nice, this will be nice to have, thanks for working on this. hopefully none of the comments are too surprising!
e370d49 to eed15c6 (compare)
iximeow left a comment:
FWIW I'm a solid +1 on this now, thanks! if you're going to add a test about `DatasetManagementCmd` with bogus `prp1`/`prp2` then I'll be happy to take a look at that here too. if not (or you have surprise issues), this is a nice improvement as-is.
5fdea7d to 05eafbf (compare)
```rust
(self.starting_lba * lba_data_size) as ByteOffset,
(self.number_logical_blocks as u64 * lba_data_size) as ByteLen,
```
Can we add a comment about why it's okay to not check for overflow on user-controlled data here and simply cast with `as`? I would have probably expected this to result in a failure at some point. Maybe this would fail with Invalid Field in Command, or maybe NVMe controllers are defined to wrap around at disk size or something similar?
If we think it should fail, we should probably check what other NVMe devices do.
I imagine we'd see LBA Out of Range from an actual device? just from the NVMe spec that seems like the most relevant error for the offset/length being out of range. rather than the OS seeing the invalid range, returning an error, and that eventually being a Data Transfer Error? but I'm not sure what other devices do either.
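The overflow concern in this thread can be sketched with checked arithmetic. This is an illustrative sketch, not the Propolis implementation: `RangeError`, `checked_byte_range`, and the parameter names are hypothetical, and mapping a failure onto an NVMe status (e.g. LBA Out of Range) would live in the caller.

```rust
/// Hypothetical error type for an invalid guest-supplied range; real code
/// would map these onto NVMe status codes in the command handler.
#[derive(Debug, PartialEq)]
enum RangeError {
    Overflow,
    OutOfRange,
}

/// Compute the byte offset/length for a guest-supplied LBA range using
/// checked arithmetic, so wraparound and out-of-range requests surface as
/// errors instead of silently truncating via `as` casts.
fn checked_byte_range(
    starting_lba: u64,
    number_logical_blocks: u32,
    lba_data_size: u64,
    disk_size: u64,
) -> Result<(u64, u64), RangeError> {
    let off = starting_lba
        .checked_mul(lba_data_size)
        .ok_or(RangeError::Overflow)?;
    let len = u64::from(number_logical_blocks)
        .checked_mul(lba_data_size)
        .ok_or(RangeError::Overflow)?;
    // The end of the range must neither overflow nor run past the disk.
    let end = off.checked_add(len).ok_or(RangeError::Overflow)?;
    if end > disk_size {
        return Err(RangeError::OutOfRange);
    }
    Ok((off, len))
}
```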
```rust
{
    defs.into_iter().map(Ok).collect::<Vec<_>>().into_iter()
} else {
    vec![Err("Failed to read LBA range")].into_iter()
```
For my own understanding, how do these and backend errors get translated into NVMe errors?
they end up being slightly different, since the backend has the more generic block types in case the storage medium is behind say a virtio-block device.
for the NVMe command here, the caller collects these iterators into a `Result<Vec<DatasetManagementRangeDefinition>, _>` when translating to a `Request` in `hw/nvme/requests.rs`. that iteration stops at the first returned `Err`, returning that for the whole `try_collect()`. so an invalid range here turns into the NVMe `Completion::generic_err(bits::STS_DATA_XFER_ERR)` when the caller handles the error.
backend errors are more distant and just an opaque `Success` or `Failure` which end up at `DeviceRequest::complete` -> `QueueMinder::complete` -> `self.complete_req_fn` there -> nvme's `DeviceQueue::complete`. at that last step we go through `impl From<block::Result> for Completion` to get an NVMe completion to actually write back to the guest.
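The short-circuiting behavior described in this thread is standard `Iterator::collect` into a `Result`: collecting an iterator of `Result`s stops at the first `Err` and returns it for the whole collection. A minimal standalone sketch, with a `(u64, u64)` tuple standing in for `DatasetManagementRangeDefinition`:

```rust
/// Illustrative stand-in for collecting parsed range definitions: `collect`
/// into `Result<Vec<_>, _>` short-circuits at the first `Err`, which the
/// NVMe layer would then turn into a single error completion.
fn collect_ranges(
    defs: impl Iterator<Item = Result<(u64, u64), &'static str>>,
) -> Result<Vec<(u64, u64)>, &'static str> {
    defs.collect()
}
```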
```rust
block::Operation::Discard => {
    if let Some(mech) = self.discard_mech {
        for &(off, len) in &req.ranges {
            dkioc::do_discard(&self.fp, mech, off as u64, len as u64)
```
Who is checking the offset/length for overflow and that it's actually within the virtual device? Is that being left to the kernel?
`off` and `len` are type `usize`; can `as u64` truncate them on any systems we run on?
As for checking that it's actually within the virtual device, zfs will check and ignore any (part of the) range that's outside the zvol.
`off` and `len` don't inherently. However, `off + len` certainly can overflow and represent things outside of the device. I don't think we should silently ignore invalid block ranges unless we have evidence that normal devices do. I'm surprised that ZFS would just ignore that and not error on it, to be honest.
To be a bit more specific here, I think if someone gives us an invalid range that should cause an error and we shouldn't just acknowledge it. I'm not sure how hardware handles it and what the expected semantics should be, but this should maybe be all validated up front so we can distinguish it from a generic errno value. Overloaded non-semantic errnos for these aren't great, but that's long been the pattern of the dkio interfaces.
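One way to realize the up-front validation suggested here is to walk the whole range list with checked arithmetic before issuing any ioctl, so an invalid range is distinguishable from a generic errno. This is a hypothetical sketch; `validate_discard_ranges` and its types are not the actual Propolis code:

```rust
/// Validate every (offset, length) pair in a discard request before any
/// work is done: reject ranges whose end overflows u64 or runs past the
/// virtual device, rather than letting the kernel or ZFS silently clip.
fn validate_discard_ranges(
    ranges: &[(u64, u64)],
    device_size: u64,
) -> Result<(), String> {
    for &(off, len) in ranges {
        let end = off
            .checked_add(len)
            .ok_or_else(|| format!("range ({off}, {len}) overflows"))?;
        if end > device_size {
            return Err(format!(
                "range ({off}, {len}) exceeds device size {device_size}"
            ));
        }
    }
    Ok(())
}
```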
```rust
Operation::Discard => {
    // Discard is only partially wired up in backends we care about, and it's not
    // clear what stats (if any) to report yet.
```
Since these are real now, I think we should probably consider adding the following to match https://github.com/oxidecomputer/omicron/blob/main/oximeter/oximeter/schema/virtual-disk.toml:
- discards
- failed-discards
- add this to io-latency and io-size and make sure this is a new io-kind of operation
since the schema lives in Omicron, I think we'd want to land this, update the schema in Omicron, update the Omicron ref here in Propolis, then add the metrics. sound right to you too?
@iximeow That's right! LMK if you need any help getting that sorted.
This implements #990. It makes Propolis advertise support for the "Dataset Management" NVMe command. It uses `ioctl(DKIOCFREE)` to pass these requests through to local disks (`FileBackend`). The requests are ignored on distributed disks (Crucible). Note that our devices are NVMe 1.0e, which specifies that the disk "may" deallocate all provided ranges, and "shall return all zeros, all ones, or the last data written to the associated LBA". The 1.0e spec has no mechanism for telling the guest which of these semantics is actually happening. Future work may include migrating to NVMe 1.1 or later, which can use the DRB (Deallocation Read Behavior) field to tell the guest whether the blocks are actually zeroed or not.
This requires the changes for https://github.com/oxidecomputer/stlouis/issues/940 which are under review here.
A few details to be aware of or provide input on:
- (`oncs` field). I think this is OK since there is no live migration currently (even if the VM has no local disks). So each time a VM boots, it will be on a specific version of Propolis which either advertises Dataset Management, or not.
- `block::Operation::Discard` now means to discard multiple ranges from the client-provided list, not just one range.
- `probes::block_begin_discard` used to take an offset and length, but I have changed it to take the number of ranges instead. Is this OK? Are there consumers that need to change? We could fire a probe for each range, but then there would be multiple "begin" probes for one `devqid`, which could be confusing because begin/complete probes would not match up. Similar for `probes::nvme_discard_enqueue`.
- `VirtualDiskStats`: should we add stats for Discard? How do we monitor these? Would we need to add support in consumers?