| 1 | +# Annotation-Based Block Device Mounting |
| 2 | + |
| 3 | +## Motivation |
| 4 | + |
| 5 | +Kubernetes `volumeDevices` exposes block devices as raw `/dev/` paths inside containers. |
| 6 | +In a Kata Containers VM this means the block device is hotplugged and presented to the |
| 7 | +guest as a block device node, but no filesystem is mounted. The application must |
| 8 | +format and mount the device itself. |
| 9 | + |
| 10 | +Many workloads (databases, caches, ML training) need a *mounted filesystem*, not a raw |
| 11 | +device. Today, achieving this requires either: |
| 12 | + |
| 13 | +1. Using `volumeMounts` with Filesystem-mode PVCs, which routes through virtio-fs and |
| 14 | + loses the performance benefits of direct block access. |
| 15 | +2. Using the [direct-assigned volume](direct-blk-device-assignment.md) mechanism, which |
| 16 | + requires a CSI driver integration and the `kata-ctl direct-volume` CLI. |
| 17 | + |
| 18 | +Neither option works well when: |
| 19 | +- The CSI driver cannot be modified (e.g. upstream AWS EBS CSI). |
| 20 | +- The volume must remain in Block mode for other consumers in the same cluster. |
| 21 | +- A lightweight, annotation-only solution is preferred over CSI plugin changes. |
| 22 | + |
| 23 | +## Proposed Solution |
| 24 | + |
| 25 | +A new pod annotation `io.katacontainers.volume.block-mounts` lets users declare that |
| 26 | +specific `volumeDevices` should be mounted as filesystems inside the guest VM. The Kata |
| 27 | +runtime intercepts these devices during container creation, creates `Storage` gRPC objects |
| 28 | +for the kata-agent, and adds bind mounts to the OCI spec. |
| 29 | + |
| 30 | +The block devices are still hotplugged through the standard Kubernetes `volumeDevices` |
| 31 | +path. This annotation only changes what happens *after* hotplug: instead of passing the |
| 32 | +device as a raw `/dev/` node, the runtime instructs the agent to mount it. |
| 33 | + |
| 34 | +### Annotation Format |
| 35 | + |
| 36 | +The annotation value is a JSON object. Keys are container device paths (matching |
| 37 | +`volumeDevices[].devicePath`), values are mount configuration objects: |
| 38 | + |
| 39 | +```json |
| 40 | +{ |
| 41 | + "/dev/vdb": { |
| 42 | + "mount": "/data", |
| 43 | + "fstype": "ext4", |
| 44 | + "options": ["rw", "noatime"], |
| 45 | + "fsGroup": 1000 |
| 46 | + }, |
| 47 | + "/dev/vdc": { |
| 48 | + "mount": "/cache", |
| 49 | + "fstype": "xfs" |
| 50 | + } |
| 51 | +} |
| 52 | +``` |
| 53 | + |
| 54 | +| Field | Type | Required | Default | Description | |
| 55 | +|-----------|----------|----------|----------|-------------| |
| 56 | +| `mount` | string | yes | - | Absolute path where the filesystem is mounted inside the container | |
| 57 | +| `fstype` | string | no | `ext4` | Filesystem type. Must be `ext4` or `xfs` | |
| 58 | +| `options` | []string | no | `["rw"]` | Mount options passed to the agent | |
| 59 | +| `fsGroup` | int64 | no | - | If set, ownership of the mount is changed to this GID | |
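As a minimal sketch of how the annotation value could be decoded, the Go snippet below maps each field of the table onto a struct and applies the documented defaults. The names `blockMountConfig` and `parseBlockMounts` are hypothetical and are not taken from the runtime source.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// blockMountConfig mirrors one value of the annotation object.
// Hypothetical type: the runtime's own struct may be named and shaped differently.
type blockMountConfig struct {
	Mount   string   `json:"mount"`
	FsType  string   `json:"fstype,omitempty"`
	Options []string `json:"options,omitempty"`
	FsGroup *int64   `json:"fsGroup,omitempty"` // nil means "not set"
}

// parseBlockMounts decodes the annotation value (device path -> config)
// and fills in the defaults listed in the table: ext4 and ["rw"].
func parseBlockMounts(value string) (map[string]blockMountConfig, error) {
	mounts := map[string]blockMountConfig{}
	if err := json.Unmarshal([]byte(value), &mounts); err != nil {
		return nil, fmt.Errorf("invalid block-mounts annotation: %w", err)
	}
	for dev, cfg := range mounts {
		if cfg.FsType == "" {
			cfg.FsType = "ext4"
		}
		if len(cfg.Options) == 0 {
			cfg.Options = []string{"rw"}
		}
		mounts[dev] = cfg
	}
	return mounts, nil
}
```

Using a pointer for `FsGroup` keeps "not set" distinguishable from GID 0, which matters because the field has no default.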
| 60 | + |
| 61 | +### Validation Rules |
| 62 | + |
| 63 | +- Device paths must start with `/dev/`. |
| 64 | +- Mount destinations must be absolute paths. |
| 65 | +- Filesystem type must be `ext4`, `xfs`, or empty (defaults to `ext4`). |
| 66 | +- Every annotated device must match a `volumeDevices` entry in the container spec. |
| 67 | +- Duplicate container device paths are rejected. |
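The rules above could be checked roughly as follows. This is a sketch under assumptions: `blockMount` and `validateBlockMounts` are hypothetical names, and the duplicate-path rule is not shown because a Go map keyed by device path cannot hold duplicates (that check would have to happen while decoding).

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// blockMount holds the two fields the validation rules look at.
// Hypothetical type used only for this sketch.
type blockMount struct {
	Mount  string
	FsType string
}

// validateBlockMounts applies the documented rules to every annotated device.
// devicePaths is the list of volumeDevices[].devicePath values of the container.
func validateBlockMounts(mounts map[string]blockMount, devicePaths []string) error {
	declared := make(map[string]bool, len(devicePaths))
	for _, p := range devicePaths {
		declared[p] = true
	}
	for dev, cfg := range mounts {
		switch {
		case !strings.HasPrefix(dev, "/dev/"):
			return fmt.Errorf("device path %q must start with /dev/", dev)
		case !path.IsAbs(cfg.Mount):
			return fmt.Errorf("mount destination %q must be an absolute path", cfg.Mount)
		case cfg.FsType != "" && cfg.FsType != "ext4" && cfg.FsType != "xfs":
			return fmt.Errorf("unsupported filesystem type %q", cfg.FsType)
		case !declared[dev]:
			return fmt.Errorf("device %q has no matching volumeDevices entry", dev)
		}
	}
	return nil
}
```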
| 68 | + |
| 69 | +## Implementation Details |
| 70 | + |
| 71 | +### Runtime (Go) - `kata_agent.go` |
| 72 | + |
| 73 | +The implementation adds three stages to container creation: |
| 74 | + |
| 75 | +#### 1. Device Filtering (`appendDevices`) |
| 76 | + |
| 77 | +When building the gRPC device list, the runtime parses the block mount annotation and |
| 78 | +skips any device whose `ContainerPath` appears in the annotation. This prevents the |
| 79 | +device from being passed as a raw `/dev/` node to the guest. |
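The filtering step can be pictured with the sketch below. `deviceInfo` is a simplified stand-in for the runtime's per-container device record, and `filterAnnotatedDevices` is a hypothetical helper, not the body of `appendDevices`.

```go
package main

// deviceInfo is a simplified stand-in for the runtime's device record;
// only ContainerPath matters for the filter.
type deviceInfo struct {
	ID            string
	ContainerPath string
}

// filterAnnotatedDevices returns the devices that should still be passed to
// the guest as raw nodes, skipping every device named in the annotation.
func filterAnnotatedDevices(devices []deviceInfo, annotated map[string]bool) []deviceInfo {
	kept := make([]deviceInfo, 0, len(devices))
	for _, d := range devices {
		if annotated[d.ContainerPath] {
			continue // will be mounted as a filesystem instead
		}
		kept = append(kept, d)
	}
	return kept
}
```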
| 80 | + |
| 81 | +#### 2. Storage Creation (`createAnnotationBlockStorages`) |
| 82 | + |
| 83 | +For each annotated device, the runtime: |
| 84 | + |
| 85 | +1. Looks up the device in the device manager. |
| 86 | +2. Delegates driver selection to `handleBlockVolume()`, which inspects the `BlockDrive` |
| 87 | + struct fields (e.g. `Pmem`, `PCIPath`, `DevNo`) to determine the correct storage |
| 88 | + driver (`blk`, `nvdimm`, `scsi`, etc.). This avoids duplicating the |
| 89 | + driver-selection logic. |
| 90 | +3. Formats the device on the host with `mkfs.<fstype>` if `blkid` finds no existing |
| 91 | + filesystem. This covers fresh ephemeral volumes (e.g. unformatted EBS volumes), |
| 92 | + because the guest rootfs typically does not ship filesystem tools. |
| 93 | +4. Constructs a `Storage` gRPC object with the filesystem type, mount options, and a |
| 94 | + base64-encoded guest mount point under the sandbox storage directory. |
| 95 | +5. Adds a bind mount to the OCI spec pointing from the guest mount point to the |
| 96 | + user-specified container path. |
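Steps 4 and 5 can be sketched as below. The types `storage` and `bindMount` are simplified stand-ins for the agent's `Storage` message and an OCI mount entry, `attachMount` is a hypothetical helper, and the choice to base64-encode the container device path (rather than some other identifier) is an assumption made for illustration.

```go
package main

import (
	"encoding/base64"
	"path"
)

// storage is a simplified stand-in for the agent's Storage gRPC message.
type storage struct {
	Driver     string
	Source     string
	Fstype     string
	Options    []string
	MountPoint string
}

// bindMount is a simplified stand-in for an OCI spec mount entry.
type bindMount struct {
	Source      string
	Destination string
	Type        string
	Options     []string
}

// attachMount derives a guest mount point under storageDir from the
// base64-encoded device path (assumed input) and returns the completed
// storage plus the bind mount that exposes it at containerMount.
func attachMount(vol storage, storageDir, devicePath, containerMount string) (storage, bindMount) {
	name := base64.URLEncoding.EncodeToString([]byte(devicePath))
	vol.MountPoint = path.Join(storageDir, name)
	return vol, bindMount{
		Source:      vol.MountPoint,
		Destination: containerMount,
		Type:        "bind",
		Options:     []string{"bind"}, // illustrative option set
	}
}
```

Encoding the name keeps the guest directory free of path separators, so two devices cannot collide or nest inside each other.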
| 97 | + |
| 98 | +#### 3. OCI Spec Cleanup (`removeDevicesFromOCISpec`) |
| 99 | + |
| 100 | +After processing, the annotated devices are removed from `spec.Linux.Devices` since they |
| 101 | +are no longer raw device nodes. |
| 102 | + |
| 103 | +### Driver Selection |
| 104 | + |
| 105 | +The runtime delegates to `handleBlockVolume()` rather than reading |
| 106 | +`HypervisorConfig.BlockDeviceDriver` directly. This function checks the `BlockDrive.Pmem` flag first and only then falls back to the configured block driver: |
| 107 | + |
| 108 | +``` |
| 109 | +BlockDrive.Pmem == true -> nvdimm driver |
| 110 | +BlockDeviceDriver == VirtioCCW -> blk-ccw driver |
| 111 | +BlockDeviceDriver == VirtioBlk -> blk driver |
| 112 | +BlockDeviceDriver == VirtioMmio -> mmio-blk driver |
| 113 | +BlockDeviceDriver == VirtioSCSI -> scsi driver |
| 114 | +``` |
| 115 | + |
| 116 | +This ensures correct driver selection for all device types, including pmem devices that |
| 117 | +may be configured alongside a different default block driver. |
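The precedence described above can be expressed as an ordered `switch`. In this sketch the constant names follow the listing, but their string values, the `blockDrive` type, and `selectStorageDriver` are illustrative assumptions; the returned driver labels are the ones used in this document.

```go
package main

import "fmt"

// Stand-ins for the runtime's block driver constants. The names follow the
// listing in this section; the string values are illustrative only.
const (
	VirtioCCW  = "virtio-blk-ccw"
	VirtioBlk  = "virtio-blk"
	VirtioMmio = "virtio-mmio"
	VirtioSCSI = "virtio-scsi"
)

// blockDrive carries only the field needed for this sketch.
type blockDrive struct {
	Pmem bool
}

// selectStorageDriver checks the pmem flag before the configured driver,
// so a pmem device is handled correctly whatever the default driver is.
func selectStorageDriver(drive blockDrive, configured string) (string, error) {
	switch {
	case drive.Pmem:
		return "nvdimm", nil
	case configured == VirtioCCW:
		return "blk-ccw", nil
	case configured == VirtioBlk:
		return "blk", nil
	case configured == VirtioMmio:
		return "mmio-blk", nil
	case configured == VirtioSCSI:
		return "scsi", nil
	default:
		return "", fmt.Errorf("unknown block device driver %q", configured)
	}
}
```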
| 118 | + |
| 119 | +### Host-Side Formatting |
| 120 | + |
| 121 | +`formatBlockDeviceIfNeeded()` runs on the host before the device reaches the guest: |
| 122 | + |
| 123 | +1. Runs `blkid -p <device>` to check for an existing filesystem. |
| 124 | +2. If no filesystem is found, runs `mkfs.<fstype> <device>`. |
| 125 | +3. Only `ext4` and `xfs` are allowed (enforced by annotation validation). |
| 126 | + |
| 127 | +This is necessary because fresh block volumes (e.g. newly provisioned EBS, local SSDs) |
| 128 | +arrive unformatted, and the guest rootfs typically does not include `mkfs` or `blkid`. |
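A hedged sketch of the probe-then-format logic follows. It assumes that `blkid -p` exits with status 2 when it cannot identify anything on the device (as documented in the util-linux man page) and treats every other non-zero status as an error instead of formatting. The `runner` type and both function names are hypothetical; the runner is injectable so the logic can be exercised without root or a real device.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// runner executes a command and reports its exit code.
// Hypothetical abstraction so the logic can be tested with a fake.
type runner func(name string, args ...string) (int, error)

// execRunner runs the command on the host and converts a non-zero exit
// status into an exit code instead of an error.
func execRunner(name string, args ...string) (int, error) {
	err := exec.Command(name, args...).Run()
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		return exitErr.ExitCode(), nil
	}
	return 0, err
}

// formatIfNeeded formats device with mkfs.<fstype> only when blkid finds
// no existing signature. It reports whether the device was formatted.
func formatIfNeeded(run runner, device, fstype string) (bool, error) {
	if fstype != "ext4" && fstype != "xfs" {
		return false, fmt.Errorf("unsupported filesystem type %q", fstype)
	}
	code, err := run("blkid", "-p", device)
	if err != nil {
		return false, fmt.Errorf("probing %s: %w", device, err)
	}
	switch code {
	case 0:
		return false, nil // an existing signature was found, leave it alone
	case 2:
		// assumed meaning: nothing recognisable on the device
	default:
		return false, fmt.Errorf("blkid exited with status %d for %s", code, device)
	}
	code, err = run("mkfs."+fstype, device)
	if err != nil {
		return false, fmt.Errorf("formatting %s: %w", device, err)
	}
	if code != 0 {
		return false, fmt.Errorf("mkfs.%s exited with status %d", fstype, code)
	}
	return true, nil
}
```

On a real host the call would be `formatIfNeeded(execRunner, devicePath, "ext4")`. Failing closed on unexpected `blkid` results avoids formatting a device that may already hold data.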
| 129 | + |
| 130 | +### Annotation Lookup |
| 131 | + |
| 132 | +The runtime checks container-level annotations first, falling back to sandbox-level |
| 133 | +annotations. This allows the annotation to be set at either the pod or container level. |
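The lookup order amounts to a two-level fallback, sketched here with plain maps. `lookupBlockMounts` is a hypothetical helper, and treating an empty value as "not set" is an assumption of this sketch.

```go
package main

// blockMountsAnnotation is the annotation key introduced by this proposal.
const blockMountsAnnotation = "io.katacontainers.volume.block-mounts"

// lookupBlockMounts prefers the container-level annotation and falls back
// to the sandbox-level one. Empty values are treated as absent (assumption).
func lookupBlockMounts(containerAnnotations, sandboxAnnotations map[string]string) (string, bool) {
	if v, ok := containerAnnotations[blockMountsAnnotation]; ok && v != "" {
		return v, true
	}
	v, ok := sandboxAnnotations[blockMountsAnnotation]
	return v, ok && v != ""
}
```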
| 134 | + |
| 135 | +## End User Interface |
| 136 | + |
| 137 | +### Kubernetes Pod Spec |
| 138 | + |
| 139 | +```yaml |
| 140 | +apiVersion: v1 |
| 141 | +kind: Pod |
| 142 | +metadata: |
| 143 | + name: block-mount-example |
| 144 | + annotations: |
| 145 | + io.katacontainers.volume.block-mounts: | |
| 146 | + {"/dev/xvda": {"mount": "/data", "fstype": "ext4", "options": ["rw", "noatime"]}} |
| 147 | +spec: |
| 148 | + runtimeClassName: kata |
| 149 | + containers: |
| 150 | + - name: app |
| 151 | + image: myapp:latest |
| 152 | + volumeDevices: |
| 153 | + - name: data-vol |
| 154 | + devicePath: /dev/xvda |
| 155 | + volumes: |
| 156 | + - name: data-vol |
| 157 | + persistentVolumeClaim: |
| 158 | + claimName: my-block-pvc |
| 159 | +--- |
| 160 | +apiVersion: v1 |
| 161 | +kind: PersistentVolumeClaim |
| 162 | +metadata: |
| 163 | + name: my-block-pvc |
| 164 | +spec: |
| 165 | + accessModes: |
| 166 | + - ReadWriteOncePod |
| 167 | + volumeMode: Block |
| 168 | + storageClassName: ebs-sc |
| 169 | + resources: |
| 170 | + requests: |
| 171 | + storage: 100Gi |
| 172 | +``` |
| 173 | + |
| 174 | +### containerd Configuration |
| 175 | + |
| 176 | +The containerd config must allow Kata annotations to pass through (this is typically |
| 177 | +already configured for Kata deployments): |
| 178 | + |
| 179 | +```toml |
| 180 | +[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.kata] |
| 181 | + runtime_type = "io.containerd.kata.v2" |
| 182 | + pod_annotations = ["io.katacontainers.*"] |
| 183 | +``` |
| 184 | + |
| 185 | +## Comparison with Direct-Assigned Volumes |
| 186 | + |
| 187 | +| Aspect | Block Mount Annotation | Direct-Assigned Volume | |
| 188 | +|--------|----------------------|----------------------| |
| 189 | +| CSI driver changes | None | Required (`kata-ctl direct-volume add`) | |
| 190 | +| Volume mode | Block (`volumeDevices`) | Filesystem (`volumeMounts`) | |
| 191 | +| Resize support | No | Yes (via `kata-ctl direct-volume resize`) | |
| 192 | +| Stats collection | No | Yes (via `kata-ctl direct-volume stats`) | |
| 193 | +| Configuration | Pod annotation | `mountInfo.json` on host filesystem | |
| 194 | +| Use case | Simple block-to-mount conversion | Full CSI integration with lifecycle management | |
| 195 | + |
| 196 | +## Limitations |
| 197 | + |
| 198 | +1. Only `ext4` and `xfs` filesystem types are supported. |
| 199 | +2. Volume resize and stats collection are not supported (use direct-assigned volumes for |
| 200 | + these features). |
| 201 | +3. The annotation applies to the Go runtime (`runtime-go`) only. Runtime-rs support is |
| 202 | + planned. |
| 203 | +4. Host-side formatting requires `blkid` and `mkfs.<fstype>` to be available on the host. |