Commit 91f2a7f

doc: pmem: add a note about limiting msync with cgroupsv2
Add a section in the pmem.md file describing a way of limiting I/O usage of
`msync` calls from a virtio-pmem device.

Signed-off-by: Egor Lazarchuk <yegorlz@amazon.co.uk>
1 parent 5738786 commit 91f2a7f

1 file changed

Lines changed: 94 additions & 0 deletions

File tree

docs/pmem.md

@@ -157,6 +157,100 @@ VMs, which could be exploited as a side channel by an attacker inside the
microVM. Users that want to use `virtio-pmem` to share memory are encouraged to
carefully evaluate the security risk according to their threat model.

### Limiting `msync` write bandwidth

When a guest issues a flush request to the `virtio-pmem` device (via a
`VIRTIO_PMEM_REQ_TYPE_FLUSH` request), Firecracker calls `msync(MS_SYNC)` on
the backing file to persist dirty pages to disk. A malicious guest can issue a
high volume of flush requests, leading to excessive host I/O usage.

There are two ways to mitigate this:

#### Firecracker rate limiter

The `virtio-pmem` device supports a built-in rate limiter, identical to the one
available for block devices. It throttles flush requests using two token
buckets:

- `bandwidth` — limits the total number of bytes passed to `msync` per refill
  interval. Each flush consumes tokens equal to the **full backing file size**,
  because `msync` is called over the entire mapped region. For example, with a
  256 MiB backing file and `size` set to `268435456` (256 MiB), at most one
  flush is allowed per `refill_time` milliseconds.
- `ops` — limits the number of `msync` calls per refill interval (after
  coalescing multiple flush requests within a single queue notification into
  one call).

The rate limiter can be configured at device creation time. The following
example allows at most 1 flush per second for a 256 MiB backing file
(`bandwidth.size` = 256 MiB = 268435456 bytes), and at most 10 `msync`
operations per second:

```json
"pmem": [
  {
    "id": "pmem0",
    "path_on_host": "./backing_file_256m",
    "rate_limiter": {
      "bandwidth": { "size": 268435456, "refill_time": 1000 },
      "ops": { "size": 10, "refill_time": 1000 }
    }
  }
]
```

It can also be updated at runtime via the API:

```console
curl --unix-socket $socket_location -i \
  -X PATCH 'http://localhost/pmem/pmem0' \
  -H 'Content-Type: application/json' \
  -d '{
    "id": "pmem0",
    "rate_limiter": {
      "bandwidth": { "size": 268435456, "refill_time": 1000 },
      "ops": { "size": 10, "refill_time": 1000 }
    }
  }'
```

> [!NOTE]
>
> Since each flush always costs exactly one op and exactly `file_size` bytes,
> the `bandwidth` and `ops` buckets are correlated: setting `bandwidth.size` to
> `file_size` with a given `refill_time` is equivalent to setting `ops.size` to
> `1` with the same `refill_time` — both allow one flush per interval. In
> practice, configuring only one of the two buckets is sufficient. Use `ops` for
> a simple "N flushes per interval" limit, or `bandwidth` if you want to express
> the limit in terms of I/O throughput.

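The equivalence can be sanity-checked with a toy token-bucket model. This is
only a sketch (Firecracker's actual limiter also has a `one_time_burst` field
and differs in implementation detail), but it shows why `bandwidth.size =
file_size` and `ops.size = 1` admit the same number of flushes per interval:

```python
FILE_SIZE = 256 * 1024 * 1024  # 256 MiB backing file
REFILL_MS = 1000


class TokenBucket:
    """Toy token bucket that refills continuously over refill_ms."""

    def __init__(self, size, refill_ms):
        self.size = size
        self.refill_ms = refill_ms
        self.tokens = float(size)  # bucket starts full
        self.last_ms = 0

    def try_consume(self, cost, now_ms):
        # Replenish proportionally to elapsed time, capped at bucket size.
        elapsed = now_ms - self.last_ms
        self.tokens = min(self.size,
                          self.tokens + self.size * elapsed / self.refill_ms)
        self.last_ms = now_ms
        if self.tokens >= cost:
            self.tokens -= cost
            return True   # flush allowed
        return False      # flush throttled


def allowed_flushes(bucket, cost_per_flush, attempt_times_ms):
    return sum(bucket.try_consume(cost_per_flush, t) for t in attempt_times_ms)


# A guest hammering the device with 10 flush attempts within one second:
attempts = list(range(0, 1000, 100))

bandwidth = TokenBucket(size=FILE_SIZE, refill_ms=REFILL_MS)  # bandwidth.size = file size
ops = TokenBucket(size=1, refill_ms=REFILL_MS)                # ops.size = 1

# Each flush costs FILE_SIZE bandwidth tokens, or 1 ops token.
print(allowed_flushes(bandwidth, FILE_SIZE, attempts))  # 1
print(allowed_flushes(ops, 1, attempts))                # 1
```

Both configurations throttle 9 of the 10 attempts, since a full refill takes
the whole interval either way.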
#### Cgroup v2 IO controller

Alternatively, the **cgroup v2 IO controller** can throttle write bandwidth on
the block device that hosts the `virtio-pmem` backing file:

```bash
# Identify the block device MAJOR:MINOR for the backing file
dev=$(stat -c '%d' /path/to/backing_file)
# Decode with the glibc device-number layout (a plain `dev >> 8` /
# `dev & 0xff` split breaks for majors/minors wider than 8 bits)
echo "$(( (dev >> 8) & 0xfff )):$(( (dev & 0xff) | ((dev >> 12) & 0xfff00) ))"

# Enable the io controller for <vm_cgroup>. Controllers are enabled through
# the *parent's* cgroup.subtree_control; shown here for a child of the root.
echo "+io" | sudo tee /sys/fs/cgroup/cgroup.subtree_control

# Limit write bandwidth (e.g. 10 MB/s) on device MAJOR:MINOR
echo "MAJOR:MINOR wbps=10485760" | sudo tee /sys/fs/cgroup/<vm_cgroup>/io.max
```
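If Python is available on the host, the MAJOR:MINOR pair can be cross-checked
with `os.major`/`os.minor`, which apply the full glibc device-number decoding.
A small helper (works on any path you pass in):

```python
import os


def backing_dev(path: str) -> str:
    """Return the MAJOR:MINOR of the device holding `path`."""
    st = os.stat(path)
    return f"{os.major(st.st_dev)}:{os.minor(st.st_dev)}"


# Device numbers vary per system, so no fixed output is shown here.
print(backing_dev("."))  # device holding the current directory
```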

> [!NOTE]
>
> - This requires **cgroup v2** with a filesystem that supports cgroup-aware
>   writeback (e.g. ext4, btrfs).
> - The limit applies to all I/O from the cgroup to that device, not only
>   `msync` flushes.
> - When using the [Jailer](jailer.md), the Firecracker process is already
>   placed in a cgroup. You can configure `io.max` on that cgroup before
>   starting the microVM.

## Snapshot support

`virtio-pmem` works with snapshot functionality of Firecracker. Snapshot will
