Using the Firecracker virtio-pmem device

What is a persistent memory device

Persistent memory is a type of non-volatile, CPU-accessible (via the usual load/store instructions) memory that does not lose its content on power loss. In other words, all writes to the memory persist across power cycles. In hardware this is known as NVDIMM (Non-Volatile Dual In-line Memory Module) memory.

What is a virtio-pmem device

virtio-pmem is a device that emulates a persistent memory device without requiring a physical NVDIMM to be present on the host system. It is backed by a memory-mapped file on the host side and is exposed to the guest kernel as a region of guest physical memory. This allows the guest to access the host memory pages directly, without going through a guest block driver or interacting with the VMM for every request. From the guest user-space perspective, virtio-pmem devices are presented as normal block devices like /dev/pmem0. This allows a virtio-pmem device to be used as the rootfs device and the VM to boot from it.

Note

Since virtio-pmem resides entirely in memory, there is no need to use the guest page cache for its operations when it is used as a block device. This behaviour can be enabled through the kernel's DAX feature.

  • To mount a device with DAX, add --options=dax to the mount command.
  • To configure a root device with DAX, append rootflags=dax to the kernel arguments (both options are shown in the example below).

DAX support is not uniform for all file systems. Check the kernel documentation for more information.
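
For example, assuming the device is exposed as /dev/pmem0 and a mount point /mnt/pmem already exists (both names are placeholders):

# Mount an attached virtio-pmem device with DAX enabled
mount --options=dax /dev/pmem0 /mnt/pmem

# Or, when booting from the device, enable DAX on the root file system
# by extending the kernel command line, e.g.:
#   root=/dev/pmem0 rw rootflags=dax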

Prerequisites

In order to use a virtio-pmem device, the guest kernel needs to be built with support for it. The full list of configuration options needed for virtio-pmem and DAX is:

# Needed for DAX on aarch64. Will be ignored on x86_64
CONFIG_ARM64_PMEM=y

CONFIG_DEVICE_MIGRATION=y
CONFIG_ZONE_DEVICE=y
CONFIG_VIRTIO_PMEM=y
CONFIG_LIBNVDIMM=y
CONFIG_BLK_DEV_PMEM=y
CONFIG_ND_CLAIM=y
CONFIG_ND_BTT=y
CONFIG_BTT=y
CONFIG_ND_PFN=y
CONFIG_NVDIMM_PFN=y
CONFIG_NVDIMM_DAX=y
CONFIG_OF_PMEM=y
CONFIG_NVDIMM_KEYS=y
CONFIG_DAX=y
CONFIG_DEV_DAX=y
CONFIG_DEV_DAX_PMEM=y
CONFIG_DEV_DAX_KMEM=y
CONFIG_FS_DAX=y
CONFIG_FS_DAX_PMD=y
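
A quick way to confirm that a built kernel includes these options is to grep its generated config (a sketch, assuming it is run from the kernel build directory):

# Check a few of the required options in the generated kernel config
grep -E 'CONFIG_VIRTIO_PMEM|CONFIG_FS_DAX|CONFIG_ZONE_DEVICE|CONFIG_DEV_DAX_PMEM' .config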

Configuration

The Firecracker implementation exposes these config options for the virtio-pmem device:

  • id - id of the device for internal use
  • path_on_host - path to the backing file
  • root_device - toggle to use this device as the root device. The device will be marked as rw in the kernel arguments
  • read_only - tells Firecracker to mmap the backing file in read-only mode. If this device is also configured as root_device, it will be marked as ro in the kernel arguments

Note

Devices are exposed to the guest in the order in which they are configured, with sequential names of the form /dev/pmem{N}: /dev/pmem0, /dev/pmem1, ...

Warning

Setting a virtio-pmem device to read-only mode can lead to the VM shutting down on any attempt to write to the device. This is because, from the guest kernel's perspective, virtio-pmem is always read-write capable. Use read-only mode only if you want to ensure the underlying file is never written to.

To mount the pmem device read-only, add -o ro to the mount command (see the example below).

The exact behaviour differs per platform:

  • x86_64 - if KVM is able to decode the write instruction used by the guest, it will return an MMIO_WRITE exit to Firecracker, where the write will be discarded and a warning log will be printed.
  • aarch64 - the instruction emulation is much stricter. Writes will result in an internal KVM error, which is returned to Firecracker in the form of an ENOSYS error. This makes Firecracker stop the VM with an appropriate log message.
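
For example, using the same placeholder device and mount point as above, a read-only mount looks like:

# Mount the pmem device read-only so the guest never attempts writes to it
mount -o ro /dev/pmem0 /mnt/pmem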

Warning

virtio-pmem requires the memory region exposed to the guest to be 2MB-aligned. This requirement carries over to the backing file of the virtio-pmem device. Firecracker allows users to configure virtio-pmem with a backing file of any size and fills the gap between the end of the file and the next 2MB boundary with empty PRIVATE | ANONYMOUS memory pages. Users must be careful not to write to this memory gap, since it will not be synchronized with the backing file. This is not an issue if virtio-pmem is configured in read-only mode.
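
One way to avoid the gap altogether is to size the backing file to a multiple of 2MB before attaching it. A sketch using GNU truncate and a hypothetical file name:

# Round the backing file size up to the next 2MiB multiple
truncate -s %2M ./some_file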

Config file

Configuration of the virtio-pmem device from a config file follows a similar pattern to the virtio-block section. Here is an example configuration for a single virtio-pmem device:

"pmem": [
    {
      "id": "pmem0",
      "path_on_host": "./some_file",
      "root_device": true,
      "read_only": false
    }
]

API

Similar to other devices, virtio-pmem can be configured with API calls. An example configuration request:

curl --unix-socket $socket_location -i \
    -X PUT 'http://localhost/pmem/pmem0' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d "{
         \"id\": \"pmem0\",
         \"path_on_host\": \"./some_file\",
         \"root_device\": true,
         \"read_only\": false
    }"

Security

It is not recommended to use the same backing file for virtio-pmem across different VMs, as this causes the same physical pages to be mapped to different VMs, which could be exploited as a side channel by an attacker inside the microVM. Users that want to use virtio-pmem to share memory are encouraged to carefully evaluate the security risk according to their threat model.

Limiting msync write bandwidth

When a guest issues a flush request to the virtio-pmem device (via the VIRTIO_PMEM_REQ_TYPE_FLUSH request type), Firecracker calls msync(MS_SYNC) on the backing file to persist dirty pages to disk. A malicious guest can issue a high volume of flush requests, leading to excessive host I/O usage.

There are two ways to mitigate this:

Firecracker rate limiter

The virtio-pmem device supports a built-in rate limiter, identical to the one available for block devices. It throttles flush requests using two token buckets:

  • bandwidth - limits the total number of bytes passed to msync per refill interval. Each flush consumes tokens equal to the full backing file size, because msync is called over the entire mapped region. For example, with a 256 MiB backing file and size set to 268435456 (256 MiB), at most one flush is allowed per refill_time milliseconds.
  • ops - limits the number of msync calls per refill interval (after coalescing multiple flush requests within a single queue notification into one call).

The rate limiter can be configured at device creation time. The following example allows at most 1 flush per second for a 256 MiB backing file (bandwidth.size = 256 MiB = 268435456 bytes), and at most 10 msync operations per second:

"pmem": [
    {
      "id": "pmem0",
      "path_on_host": "./backing_file_256m",
      "rate_limiter": {
        "bandwidth": { "size": 268435456, "refill_time": 1000 },
        "ops": { "size": 10, "refill_time": 1000 }
      }
    }
]

It can also be updated at runtime via the API:

curl --unix-socket $socket_location -i \
    -X PATCH 'http://localhost/pmem/pmem0' \
    -H 'Content-Type: application/json' \
    -d '{
         "id": "pmem0",
         "rate_limiter": {
           "bandwidth": { "size": 268435456, "refill_time": 1000 },
           "ops": { "size": 10, "refill_time": 1000 }
         }
    }'

Note

Since each flush always costs exactly one op and exactly file_size bytes, the bandwidth and ops buckets are correlated: setting bandwidth.size to file_size with a given refill_time is equivalent to setting ops.size to 1 with the same refill_time — both allow one flush per interval. In practice, configuring only one of the two buckets is sufficient. Use ops for a simple "N flushes per interval" limit, or bandwidth if you want to express the limit in terms of I/O throughput.
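
For instance, a configuration that relies only on the ops bucket and allows at most 2 flushes per second could look like this (a sketch following the same schema as the examples above):

"pmem": [
    {
      "id": "pmem0",
      "path_on_host": "./backing_file_256m",
      "rate_limiter": {
        "ops": { "size": 2, "refill_time": 1000 }
      }
    }
]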

Cgroup v2 IO controller

Alternatively, the cgroup v2 IO controller can throttle write bandwidth on the block device that hosts the virtio-pmem backing file:

# Identify the block device that hosts the backing file and its MAJOR:MINOR
dev=$(df --output=source /path/to/backing_file | tail -n 1)
lsblk -no MAJ:MIN "$dev"

# Enable the io controller for <vm_cgroup> by writing to its parent's
# cgroup.subtree_control (shown here for a cgroup directly under the root)
echo "+io" | sudo tee /sys/fs/cgroup/cgroup.subtree_control

# Limit write bandwidth (e.g. 10 MB/s) on device MAJOR:MINOR
echo "MAJOR:MINOR wbps=10485760" | sudo tee /sys/fs/cgroup/<vm_cgroup>/io.max

Note

  • This requires cgroup v2 with a filesystem that supports cgroup-aware writeback (e.g. ext4, btrfs).
  • The limit applies to all I/O from the cgroup to that device, not only msync flushes.
  • When using the Jailer, the Firecracker process is already placed in a cgroup. You can configure io.max on that cgroup before starting the microVM.
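
To check that the limit is taking effect, the cgroup's io.stat file reports per-device byte and operation counters (using the same <vm_cgroup> placeholder as above):

# Inspect per-device I/O counters for the VM's cgroup
cat /sys/fs/cgroup/<vm_cgroup>/io.stat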

Snapshot support

virtio-pmem works with Firecracker's snapshot functionality. The snapshot will contain the configuration options provided by the user. During the restore process, Firecracker will attempt to restore the virtio-pmem device by opening the same backing file it was originally configured with. This means all virtio-pmem backing files must be present at the same locations during restore as they were during the initial virtio-pmem configuration.

Performance

Even though virtio-pmem allows the guest to access host pages directly, the first access to each page will suffer from an internal KVM page fault, which has to set up the guest physical address to host virtual address translation. Subsequent accesses do not need to go through this process again.

Since the number of page faults correlates with the size of the pages used to back virtio-pmem memory, huge pages can be used to reduce the number of required page faults. This can be done by using tmpfs with transparent huge pages enabled, or by using hugetlbfs if virtio-pmem is used for memory sharing.
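
As a sketch, assuming a hypothetical mount point /mnt/pmem_backing, a tmpfs mount with transparent huge pages enabled can be created like this and the backing file then placed on it:

# Mount a tmpfs that allocates transparent huge pages (requires CONFIG_TRANSPARENT_HUGEPAGE)
sudo mount -t tmpfs -o size=1G,huge=always tmpfs /mnt/pmem_backing

# Place the virtio-pmem backing file on it and point path_on_host at the copy
cp ./some_file /mnt/pmem_backing/some_file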

Memory usage

Note

virtio-pmem memory can be paged out by the host, because it is backed by a file mapped with MAP_SHARED. To prevent this, you can use vmtouch or a similar tool to lock the file's pages in memory so they are not evicted.
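
For example, vmtouch can lock the backing file's pages into physical memory (a sketch; -l locks the pages with mlock and -d daemonizes vmtouch so the lock is held for as long as the daemon runs):

# Lock the backing file's pages in memory and daemonize
vmtouch -l -d /path/to/backing_file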

virtio-pmem resides in host memory and increases the maximum possible memory usage of a VM, since the VM can now use all of its RAM and also access all of the virtio-pmem memory. To minimize the overhead, it is highly recommended to use DAX mode to avoid unnecessary duplication of data in the guest page cache.

As an example, a single VM with 128MB of memory booted from a virtio-pmem device without DAX has an RSS value of ~120MB, while with DAX it is ~96MB. The ~96MB is similar to the memory usage of a VM booted using virtio-block as the root device.

In the case where multiple VMs have virtio-pmem devices that point to the same underlying file, the memory overhead is amortized, since the total maximum memory usage will only include a single instance of the virtio-pmem memory.

As an example, 2 VMs configured with 128MB of RAM and no virtio-pmem devices can consume a maximum of 128 + 128 = 256MB of host memory. If each VM has a 100MB virtio-pmem device attached with a shared backing file, the maximum memory consumption will be 128 + 128 + 100 = 356MB, because the 100MB of virtio-pmem memory is shared between the VMs.