
runtime: Add annotation-based block device mounting #40

Draft
antoine-gaillard wants to merge 36 commits into datadog from agaillard/block-mount-annotation

@antoine-gaillard antoine-gaillard commented Jan 21, 2026

Problem: CSI block mode PVCs are passed as raw device nodes. Containers must mount them manually, requiring privileged: true or CAP_SYS_ADMIN. Additionally, new volumes without a snapshot (e.g. fresh EBS) arrive unformatted, and the guest rootfs doesn't ship mkfs/blkid.

Solution: New annotation instructs kata-agent to mount the device:

annotations:
  io.katacontainers.volume.block-mounts: '{"/dev/xvda": {"mount": "/data", "fstype": "ext4"}}'

How it works:

1. Block device hotplugged via existing volumeDevices path
2. Runtime checks host-side device with blkid — if unformatted, runs mkfs.<fstype> on the host before passing to the agent
3. Runtime parses annotation → creates grpc.Storage objects
4. Device removed from OCI spec (no raw device in container)
5. Bind mount added → container sees mounted filesystem
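
Step 2 above (format on the host only when the device is empty) could be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the names `runner`, `execRunner`, `ensureFilesystem`, and `allowedFSTypes` are hypothetical, and the `blkid -p` invocation is an assumption about how the check might be run. It relies on `blkid` exiting with status 2 when it cannot identify anything on the device.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// runner executes a command and returns its stdout and exit code.
// Injecting it keeps the logic testable without touching a real device.
type runner func(name string, args ...string) (stdout string, exitCode int, err error)

// execRunner is a runner backed by os/exec.
func execRunner(name string, args ...string) (string, int, error) {
	out, err := exec.Command(name, args...).Output()
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		// The command ran but exited non-zero: hand the code to the caller.
		return string(out), exitErr.ExitCode(), nil
	}
	if err != nil {
		return "", -1, err // for example, the binary was not found
	}
	return string(out), 0, nil
}

// allowedFSTypes guards the mkfs.<fstype> command name (assumed allowlist).
var allowedFSTypes = map[string]bool{"ext4": true, "xfs": true}

// ensureFilesystem formats hostDevice only when blkid reports no signature.
// It returns true when it ran mkfs.
func ensureFilesystem(run runner, hostDevice, fstype string) (bool, error) {
	if !allowedFSTypes[fstype] {
		return false, fmt.Errorf("unsupported fstype %q", fstype)
	}
	// Assumed invocation: low-level probe of the host device.
	_, code, err := run("blkid", "-p", hostDevice)
	if err != nil {
		return false, fmt.Errorf("running blkid on %s: %w", hostDevice, err)
	}
	switch code {
	case 0:
		// Something was identified on the device: never reformat it.
		return false, nil
	case 2:
		// Nothing identified: treat the device as unformatted.
	default:
		return false, fmt.Errorf("blkid on %s exited with code %d", hostDevice, code)
	}
	if _, code, err = run("mkfs."+fstype, hostDevice); err != nil {
		return false, fmt.Errorf("running mkfs.%s: %w", fstype, err)
	}
	if code != 0 {
		return false, fmt.Errorf("mkfs.%s on %s exited with code %d", fstype, hostDevice, code)
	}
	return true, nil
}
```

In real use the call would look like `ensureFilesystem(execRunner, hostPath, "ext4")`. Treating every exit code other than 0 and 2 as an error is a deliberately conservative choice, so an ambiguous probe never triggers a format.
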

Options:

- mount: destination path (required)
- fstype: ext4, xfs (default: ext4)
- options: mount options (default: ["rw"])
- fsGroup: optional gid for ownership
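
As an illustration of the options above, the annotation value could be decoded and defaulted along these lines. The JSON keys and defaults come from this description; the Go names (`blockMountConfig`, `parseBlockMounts`, `supportedFSTypes`) and the requirement that paths be absolute are assumptions made for the sketch, not the PR's actual code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"path/filepath"
)

// Annotation key taken from the PR description.
const blockMountsAnnotation = "io.katacontainers.volume.block-mounts"

// blockMountConfig mirrors one entry of the annotation's JSON object.
type blockMountConfig struct {
	Mount   string   `json:"mount"`
	FSType  string   `json:"fstype,omitempty"`
	Options []string `json:"options,omitempty"`
	FSGroup *int64   `json:"fsGroup,omitempty"` // nil when not set
}

// supportedFSTypes follows the list in the PR description (assumed allowlist).
var supportedFSTypes = map[string]bool{"ext4": true, "xfs": true}

// parseBlockMounts returns the per-device mount configs, keyed by device path,
// with defaults applied. It returns nil when the annotation is absent.
func parseBlockMounts(annotations map[string]string) (map[string]blockMountConfig, error) {
	raw, ok := annotations[blockMountsAnnotation]
	if !ok || raw == "" {
		return nil, nil
	}
	var mounts map[string]blockMountConfig
	if err := json.Unmarshal([]byte(raw), &mounts); err != nil {
		return nil, fmt.Errorf("invalid %s annotation: %w", blockMountsAnnotation, err)
	}
	for dev, cfg := range mounts {
		if !filepath.IsAbs(dev) {
			return nil, fmt.Errorf("device path %q must be absolute", dev)
		}
		// An empty mount is not absolute, so this also enforces "required".
		if !filepath.IsAbs(cfg.Mount) {
			return nil, fmt.Errorf("%s: mount %q must be an absolute path", dev, cfg.Mount)
		}
		if cfg.FSType == "" {
			cfg.FSType = "ext4"
		}
		if !supportedFSTypes[cfg.FSType] {
			return nil, fmt.Errorf("%s: unsupported fstype %q", dev, cfg.FSType)
		}
		if len(cfg.Options) == 0 {
			cfg.Options = []string{"rw"}
		}
		mounts[dev] = cfg // store the defaulted copy back
	}
	return mounts, nil
}
```

With the example annotation from this description, `parseBlockMounts` yields a single entry for `/dev/xvda` with mount `/data`, fstype `ext4`, and options `["rw"]`.
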

Testing:

- 27 unit tests (parsing, validation, storage creation, host-side formatting)
- Live tested on cluster with CSI block PVC + fresh unformatted EBS and EBS with existing data

@antoine-gaillard antoine-gaillard force-pushed the agaillard/block-mount-annotation branch from 01bc551 to f6b741c Compare January 22, 2026 07:15
Comment thread src/runtime/virtcontainers/kata_agent.go Outdated

@zaymat zaymat left a comment

Overall LGTM. The logic that checks we're not mounting devices outside of what's defined in the pod spec seems OK.

Small nit on optimization.

Now let's do the Rust part 😁

var storages []*grpc.Storage
devicesToRemove := make(map[string]bool)

for devicePath, mountConfig := range blockMounts {

Small optimization: if you instead iterate over the container's devices and look up dev.ContainerPath in the blockMounts map, you visit each device only once.

With the current approach, the complexity is (# devices) × (# auto mounts).
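
A rough sketch of the single-pass variant suggested here, with duplicate and unmatched-key checks added for illustration. The types `containerDevice` and `mountConfig` and the function `matchBlockMounts` are hypothetical stand-ins, not the runtime's real types.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// containerDevice is a stand-in for the runtime's per-container device entry.
type containerDevice struct {
	ContainerPath string
	ID            string
}

// mountConfig is a stand-in for one parsed annotation entry.
type mountConfig struct {
	Mount string
}

// matchBlockMounts walks the device list once and uses map lookups to pair
// devices with annotation entries: O(devices + mounts) instead of a product.
func matchBlockMounts(devices []containerDevice, blockMounts map[string]mountConfig) (map[string]containerDevice, error) {
	matched := make(map[string]containerDevice, len(blockMounts))
	for _, dev := range devices {
		if _, wanted := blockMounts[dev.ContainerPath]; !wanted {
			continue // device is not named by the annotation
		}
		if _, dup := matched[dev.ContainerPath]; dup {
			return nil, fmt.Errorf("duplicate container device %q", dev.ContainerPath)
		}
		matched[dev.ContainerPath] = dev
	}
	// Any annotation key without a matching device is an error.
	var missing []string
	for path := range blockMounts {
		if _, ok := matched[path]; !ok {
			missing = append(missing, path)
		}
	}
	if len(missing) > 0 {
		sort.Strings(missing) // stable error message
		return nil, fmt.Errorf("annotation refers to devices not in the container: %s",
			strings.Join(missing, ", "))
	}
	return matched, nil
}
```

Rejecting unmatched keys also preserves the property noted in the review: the annotation can only refer to devices that the pod spec already attaches.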

zaymat and others added 24 commits March 17, 2026 10:04
microVM sandbox resources are computed from pod sandbox annotations.
In particular, the number of vCPUs is calculated as the CPU quota
divided by the CPU period. However, on clusters where CFS quotas are disabled,
or if the pod doesn't specify any limit, the computed size is 0.
When using resource hotplugging, the value will be the size of the
CPU set, which doesn't impact the performance of the microVM pod. But when
using static sandbox management, the computed value will be 0 and the
microVM will be dramatically undersized.

This change takes CPU shares into account when computing the number of vCPUs,
and defaults to CPU shares / 1024 when the CPU quota and/or period are zero.
Co-authored-by: Maxime VISONNEAU <maxime.visonneau@gmail.com>
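
The fallback described in the commit message above could look roughly like this. It is only a sketch: the function name `vcpusFromCgroup`, the parameter types, and rounding up to a whole vCPU are assumptions, not taken from the commit.

```go
package main

import "math"

// sharesPerCPU is the number of CPU shares that conventionally maps to one CPU.
const sharesPerCPU = 1024

// vcpusFromCgroup derives a vCPU count from CFS quota/period, and falls back
// to shares/1024 when either the quota or the period is unset (zero or negative).
// Rounding up is an assumption of this sketch.
func vcpusFromCgroup(quota int64, period uint64, shares uint64) uint32 {
	if quota > 0 && period > 0 {
		return uint32(math.Ceil(float64(quota) / float64(period)))
	}
	return uint32(math.Ceil(float64(shares) / sharesPerCPU))
}
```

For example, a quota of 200000 with a period of 100000 gives 2 vCPUs, and a pod with no quota but 2048 shares also gives 2.
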
- Add scratch-based Dockerfile for kata data volume
- Move Dockerfile to docker/ subdir and fix config file handling
- Fix Dockerfile to extract only essential kata files
- Add containerd runtime dropin configuration files

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
An early call to close the stdin channel also closed stdout and stderr.
This change waits for stdout and stderr to finish properly by reading the whole buffer before closing everything.
In addition, this fixes a race condition where it was impossible to run multiple execs until the previous one was over.
The lock is now held only where it is necessary, without locking exec processes.

Fixes kata-containers#10387

Signed-off-by: Maxime Bertin <mbertin@luccasoftware.com>
Co-authored-by: Maxime Bertin <mbertin@luccasoftware.com>
The WORKFLOW_TOKEN no longer exists, so artefact uploads fail. Use
the built-in token instead.
safchain and others added 7 commits March 17, 2026 10:04
Signed-off-by: Hadrien Patte <hadrien.patte@datadoghq.com>
Add support for [`netkit`](https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=22360fad5889cbefe1eca695b0cc0273ab280b56) network devices similarly to how `veth` devices are currently handled.

Signed-off-by: Hadrien Patte <hadrien.patte@datadoghq.com>
Netkit devices in L3 mode have no MAC address and require IP routing
instead of L2 bridging. Since L3 routing is not currently implemented,
reject these devices early with a clear error message directing users
to use netkit L2 mode or veth devices instead.

Signed-off-by: Hadrien Patte <hadrien.patte@datadoghq.com>
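
The early rejection described in the commit message above might be shaped like the following sketch. The `netkitMode` type, its constants, and `validateNetkitMode` are hypothetical and defined locally for illustration; the real code would read the mode from the network library it uses.

```go
package main

import "fmt"

// netkitMode is a local stand-in for the netkit device mode.
type netkitMode int

const (
	netkitModeL2 netkitMode = iota // has a MAC address, can be bridged at L2
	netkitModeL3                   // no MAC address, would need IP routing
)

// validateNetkitMode fails fast for modes the runtime cannot wire up.
func validateNetkitMode(ifName string, mode netkitMode) error {
	switch mode {
	case netkitModeL2:
		return nil
	case netkitModeL3:
		return fmt.Errorf("netkit device %q is in L3 mode, which is not supported: "+
			"use netkit L2 mode or a veth device instead", ifName)
	default:
		return fmt.Errorf("netkit device %q has unknown mode %d", ifName, mode)
	}
}
```
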
antoine-gaillard and others added 5 commits March 30, 2026 09:37
Add support for mounting block devices (volumeDevices) as filesystems
inside the guest VM via annotation. This allows CSI block mode PVCs to
be automatically mounted by kata-agent, eliminating the need for
privileged containers.

Annotation format:
  io.katacontainers.volume.block-mounts: '{"<devicePath>": {"mount": "<path>", "fstype": "<fs>"}}'

Example:
  io.katacontainers.volume.block-mounts: '{"/dev/xvda": {"mount": "/data", "fstype": "ext4"}}'

Supported options:
- mount: destination path in container (required)
- fstype: filesystem type - ext4, xfs, btrfs (default: ext4)
- options: mount options array (default: ["rw"])
- fsGroup: optional gid for filesystem group ownership

Implementation:
- Delegates driver/source selection to handleBlockVolume() for DeviceBlock
  to ensure proper struct-based detection (e.g., blockDrive.Pmem takes
  precedence over config)
- Extracts PCIPath directly for VhostUserBlk devices
- removeDevicesFromOCISpec is a plain function (no receiver needed)
- Comprehensive test coverage including pmem device test proving struct-based
  detection works (nvdimm driver despite VirtioBlock config)

Fixes netkit_endpoint_test.go compilation by using proper constructors:
- Use PciPathFromString() instead of struct literal
- Use CcwDeviceFrom() instead of struct literal

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…ount

Fresh ephemeral EBS volumes arrive without a filesystem. The kata-agent
fails with EINVAL when trying to mount an unformatted device, and the
guest rootfs does not ship mkfs or blkid. Format on the host side in
createAnnotationBlockStorages using the host device path from BlockDrive.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Iterate container devices once with map lookup instead of nested loops.
Detect duplicate container devices and report unmatched annotation keys
with device paths.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
6 participants