Commit 4861c33

Add doc
1 parent 177ed55 commit 4861c33

3 files changed

Lines changed: 209 additions & 0 deletions

docs/design/README.md

Lines changed: 1 addition & 0 deletions
@@ -17,6 +17,7 @@ Kata Containers design documents:
1717
- [Metrics in Rust Runtime(runtime-rs)](kata-metrics-in-runtime-rs.md)
1818
- [Design for Kata Containers `Lazyload` ability with `nydus`](kata-nydus-design.md)
1919
- [Design for direct-assigned volume](direct-blk-device-assignment.md)
20+
- [Design for annotation-based block device mounting](annotation-block-device-mounts.md)
2021
- [Design for core-scheduling](core-scheduling.md)
2122
- [Virtualization Reference Architecture](kata-vra.md)
2223
---
Lines changed: 203 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,203 @@
1+
# Annotation-Based Block Device Mounting
2+
3+
## Motivation
4+
5+
Kubernetes `volumeDevices` exposes block devices as raw `/dev/` paths inside containers.
6+
In a Kata Containers VM this means the block device is hotplugged and presented to the
7+
guest as a block device node, but no filesystem is mounted. The application must
8+
format and mount the device itself.
9+
10+
Many workloads (databases, caches, ML training) need a *mounted filesystem*, not a raw
11+
device. Today, achieving this requires either:
12+
13+
1. Using `volumeMounts` with Filesystem-mode PVCs, which routes through virtio-fs and
14+
loses the performance benefits of direct block access.
15+
2. Using the [direct-assigned volume](direct-blk-device-assignment.md) mechanism, which
16+
requires a CSI driver integration and the `kata-ctl direct-volume` CLI.
17+
18+
Neither option works well when:
19+
- The CSI driver cannot be modified (e.g. upstream AWS EBS CSI).
20+
- The volume must remain in Block mode for other consumers in the same cluster.
21+
- A lightweight, annotation-only solution is preferred over CSI plugin changes.
22+
23+
## Proposed Solution
24+
25+
A new pod annotation `io.katacontainers.volume.block-mounts` lets users declare that
26+
specific `volumeDevices` should be mounted as filesystems inside the guest VM. The Kata
27+
runtime intercepts these devices during container creation, creates `Storage` gRPC objects
28+
for the kata-agent, and adds bind mounts to the OCI spec.
29+
30+
The block devices are still hotplugged through the standard Kubernetes `volumeDevices`
31+
path. This annotation only changes what happens *after* hotplug: instead of passing the
32+
device as a raw `/dev/` node, the runtime instructs the agent to mount it.
33+
34+
### Annotation Format
35+
36+
The annotation value is a JSON object. Keys are container device paths (matching
37+
`volumeDevices[].devicePath`), values are mount configuration objects:
38+
39+
```json
40+
{
41+
"/dev/vdb": {
42+
"mount": "/data",
43+
"fstype": "ext4",
44+
"options": ["rw", "noatime"],
45+
"fsGroup": 1000
46+
},
47+
"/dev/vdc": {
48+
"mount": "/cache",
49+
"fstype": "xfs"
50+
}
51+
}
52+
```
53+
54+
| Field | Type | Required | Default | Description |
55+
|-----------|----------|----------|----------|-------------|
56+
| `mount` | string | yes | - | Absolute path where the filesystem is mounted inside the container |
57+
| `fstype` | string | no | `ext4` | Filesystem type. Must be `ext4` or `xfs` |
58+
| `options` | []string | no | `["rw"]` | Mount options passed to the agent |
59+
| `fsGroup` | int64 | no | - | If set, ownership of the mount is changed to this GID |
60+
61+
### Validation Rules
62+
63+
- Device paths must start with `/dev/`.
64+
- Mount destinations must be absolute paths.
65+
- Filesystem type must be `ext4`, `xfs`, or empty (defaults to `ext4`).
66+
- Every annotated device must match a `volumeDevices` entry in the container spec.
67+
- Duplicate container device paths are rejected.
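The rules above might be checked along these lines. This is a sketch under assumptions: `mountConfig` and `validateBlockMounts` are hypothetical names, and `devicePaths` stands for the `volumeDevices[].devicePath` values of the container. Duplicate-key detection is not shown, because decoding a JSON object into a Go map silently collapses duplicate keys and catching them would need token-level decoding.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// mountConfig is a hypothetical, reduced view of one annotation entry.
type mountConfig struct {
	Mount  string
	FSType string
}

// validateBlockMounts applies the documented rules to every annotated device
// and returns the first violation it finds.
func validateBlockMounts(mounts map[string]mountConfig, devicePaths []string) error {
	declared := make(map[string]bool, len(devicePaths))
	for _, p := range devicePaths {
		declared[p] = true
	}
	for dev, m := range mounts {
		if !strings.HasPrefix(dev, "/dev/") {
			return fmt.Errorf("device path %q must start with /dev/", dev)
		}
		if !path.IsAbs(m.Mount) {
			return fmt.Errorf("mount destination %q for %s must be absolute", m.Mount, dev)
		}
		switch m.FSType {
		case "", "ext4", "xfs":
			// An empty value is allowed and later defaults to ext4.
		default:
			return fmt.Errorf("unsupported filesystem type %q for %s", m.FSType, dev)
		}
		if !declared[dev] {
			return fmt.Errorf("device %s has no matching volumeDevices entry", dev)
		}
	}
	return nil
}
```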
68+
69+
## Implementation Details
70+
71+
### Runtime (Go) - `kata_agent.go`
72+
73+
The implementation adds three stages to container creation:
74+
75+
#### 1. Device Filtering (`appendDevices`)
76+
77+
When building the gRPC device list, the runtime parses the block mount annotation and
78+
skips any device whose `ContainerPath` appears in the annotation. This prevents the
79+
device from being passed as a raw `/dev/` node to the guest.
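The filtering step can be pictured as a simple pass over the device list. The sketch below is illustrative only: `device` is a hypothetical stand-in for the runtime's device description (only `ContainerPath` is named in this document), and `filterAnnotatedDevices` is not the real `appendDevices` code.

```go
package main

// device is a hypothetical, reduced stand-in for the runtime's per-container
// device description.
type device struct {
	ContainerPath string
}

// filterAnnotatedDevices returns the devices that should still be passed to
// the guest as raw nodes, skipping every device named in the annotation.
func filterAnnotatedDevices(devices []device, annotated map[string]struct{}) []device {
	kept := make([]device, 0, len(devices))
	for _, d := range devices {
		if _, skip := annotated[d.ContainerPath]; skip {
			continue
		}
		kept = append(kept, d)
	}
	return kept
}
```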
80+
81+
#### 2. Storage Creation (`createAnnotationBlockStorages`)
82+
83+
For each annotated device, the runtime:
84+
85+
1. Looks up the device in the device manager.
86+
2. Delegates driver selection to `handleBlockVolume()`, which inspects the `BlockDrive`
87+
struct fields (e.g. `Pmem`, `PCIPath`, `DevNo`) to determine the correct storage
88+
driver (`blk`, `nvdimm`, `scsi`, etc.). This avoids duplicating the
89+
driver-selection logic.
90+
3. If the host-side block device has no filesystem (detected via `blkid`), it formats the
91+
device using `mkfs.<fstype>`. This handles fresh ephemeral volumes (e.g. unformatted
92+
EBS volumes) where the guest rootfs does not ship filesystem tools.
93+
4. Constructs a `Storage` gRPC object with the filesystem type, mount options, and a
94+
base64-encoded guest mount point under the sandbox storage directory.
95+
5. Adds a bind mount to the OCI spec pointing from the guest mount point to the
96+
user-specified container path.
97+
98+
#### 3. OCI Spec Cleanup (`removeDevicesFromOCISpec`)
99+
100+
After processing, the annotated devices are removed from `spec.Linux.Devices` since they
101+
are no longer raw device nodes.
102+
103+
### Driver Selection
104+
105+
The runtime delegates to `handleBlockVolume()` rather than reading
106+
`HypervisorConfig.BlockDeviceDriver` directly. This function checks the `BlockDrive` struct first and otherwise maps the configured block driver:
107+
108+
```
109+
BlockDrive.Pmem == true -> nvdimm driver
110+
BlockDeviceDriver == VirtioCCW -> blk-ccw driver
111+
BlockDeviceDriver == VirtioBlk -> blk driver
112+
BlockDeviceDriver == VirtioMmio -> mmio-blk driver
113+
BlockDeviceDriver == VirtioSCSI -> scsi driver
114+
```
115+
116+
This ensures correct driver selection for all device types, including pmem devices that
117+
may be configured alongside a different default block driver.
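Expressed in Go, the mapping in the listing above might read as follows. This is a sketch, not the body of `handleBlockVolume()`: the `blockDrive` type, the `storageDriver` function, and the configuration string values are assumptions, while the returned driver names are copied from the listing.

```go
package main

import "fmt"

// blockDrive is a hypothetical, reduced form of the BlockDrive struct.
type blockDrive struct {
	Pmem bool
}

// Assumed configuration values for the block device driver setting.
const (
	virtioBlk  = "virtio-blk"
	virtioCCW  = "virtio-blk-ccw"
	virtioMmio = "virtio-mmio"
	virtioSCSI = "virtio-scsi"
)

// storageDriver picks the agent storage driver. The pmem check comes first so
// that a pmem device is handled correctly whatever the default driver is.
func storageDriver(drive blockDrive, blockDeviceDriver string) (string, error) {
	if drive.Pmem {
		return "nvdimm", nil
	}
	switch blockDeviceDriver {
	case virtioCCW:
		return "blk-ccw", nil
	case virtioBlk:
		return "blk", nil
	case virtioMmio:
		return "mmio-blk", nil
	case virtioSCSI:
		return "scsi", nil
	default:
		return "", fmt.Errorf("unknown block device driver %q", blockDeviceDriver)
	}
}
```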
118+
119+
### Host-Side Formatting
120+
121+
`formatBlockDeviceIfNeeded()` runs on the host before the device reaches the guest:
122+
123+
1. Runs `blkid -p <device>` to check for an existing filesystem.
124+
2. If no filesystem is found, runs `mkfs.<fstype> <device>`.
125+
3. Only `ext4` and `xfs` are allowed (enforced by annotation validation).
126+
127+
This is necessary because fresh block volumes (e.g. newly provisioned EBS, local SSDs)
128+
arrive unformatted, and the guest rootfs typically does not include `mkfs` or `blkid`.
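The probe-then-format sequence can be sketched as below. This is not the actual `formatBlockDeviceIfNeeded()` code: the `runner` indirection and all function names are hypothetical, added so the logic can be exercised without touching a real device. The sketch also assumes, based on the `blkid(8)` manual, that exit status 2 means nothing could be identified on the device; any other non-zero status is treated as an error rather than as permission to format.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// runner executes a command and reports its exit status. It is a hypothetical
// seam that lets callers substitute a fake in tests.
type runner func(name string, args ...string) (exitCode int, err error)

// execRunner runs the command on the host using os/exec.
func execRunner(name string, args ...string) (int, error) {
	err := exec.Command(name, args...).Run()
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		return exitErr.ExitCode(), nil
	}
	if err != nil {
		return -1, err
	}
	return 0, nil
}

// formatIfNeeded formats the device only when blkid finds nothing on it and
// reports whether mkfs was run.
func formatIfNeeded(run runner, device, fstype string) (bool, error) {
	if fstype != "ext4" && fstype != "xfs" {
		return false, fmt.Errorf("unsupported filesystem type %q", fstype)
	}
	code, err := run("blkid", "-p", device)
	if err != nil {
		return false, fmt.Errorf("running blkid: %w", err)
	}
	switch code {
	case 0:
		return false, nil // something was detected; leave the device alone
	case 2:
		// Nothing identified: treat the device as unformatted.
	default:
		return false, fmt.Errorf("blkid exited with status %d", code)
	}
	code, err = run("mkfs."+fstype, device)
	if err != nil {
		return false, fmt.Errorf("running mkfs.%s: %w", fstype, err)
	}
	if code != 0 {
		return false, fmt.Errorf("mkfs.%s exited with status %d", fstype, code)
	}
	return true, nil
}
```

In production use the caller would pass `execRunner`. Refusing to format on any unexpected `blkid` status is a deliberately conservative choice, since formatting a device that already holds data is not recoverable.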
129+
130+
### Annotation Lookup
131+
132+
The runtime checks container-level annotations first, falling back to sandbox-level
133+
annotations. This allows the annotation to be set at either the pod or container level.
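The lookup order can be illustrated with a small helper. This is a sketch with hypothetical names (`lookupBlockMounts` and its parameters); only the annotation key is taken from this document.

```go
package main

// blockMountsAnnotation is the annotation key introduced by this design.
const blockMountsAnnotation = "io.katacontainers.volume.block-mounts"

// lookupBlockMounts prefers the container-level annotation and falls back to
// the sandbox-level one. The boolean reports whether either level set it.
func lookupBlockMounts(containerAnnotations, sandboxAnnotations map[string]string) (string, bool) {
	if v, ok := containerAnnotations[blockMountsAnnotation]; ok {
		return v, true
	}
	v, ok := sandboxAnnotations[blockMountsAnnotation]
	return v, ok
}
```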
134+
135+
## End User Interface
136+
137+
### Kubernetes Pod Spec
138+
139+
```yaml
140+
apiVersion: v1
141+
kind: Pod
142+
metadata:
143+
name: block-mount-example
144+
annotations:
145+
io.katacontainers.volume.block-mounts: |
146+
{"/dev/xvda": {"mount": "/data", "fstype": "ext4", "options": ["rw", "noatime"]}}
147+
spec:
148+
runtimeClassName: kata
149+
containers:
150+
- name: app
151+
image: myapp:latest
152+
volumeDevices:
153+
- name: data-vol
154+
devicePath: /dev/xvda
155+
volumes:
156+
- name: data-vol
157+
persistentVolumeClaim:
158+
claimName: my-block-pvc
159+
---
160+
apiVersion: v1
161+
kind: PersistentVolumeClaim
162+
metadata:
163+
name: my-block-pvc
164+
spec:
165+
accessModes:
166+
- ReadWriteOncePod
167+
volumeMode: Block
168+
storageClassName: ebs-sc
169+
resources:
170+
requests:
171+
storage: 100Gi
172+
```
173+
174+
### containerd Configuration
175+
176+
The containerd config must allow Kata annotations to pass through (this is typically
177+
already configured for Kata deployments):
178+
179+
```toml
180+
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.kata]
181+
runtime_type = "io.containerd.kata.v2"
182+
pod_annotations = ["io.katacontainers.*"]
183+
```
184+
185+
## Comparison with Direct-Assigned Volumes
186+
187+
| Aspect | Block Mount Annotation | Direct-Assigned Volume |
188+
|--------|----------------------|----------------------|
189+
| CSI driver changes | None | Required (`kata-ctl direct-volume add`) |
190+
| Volume mode | Block (`volumeDevices`) | Filesystem (`volumeMounts`) |
191+
| Resize support | No | Yes (via `kata-ctl direct-volume resize`) |
192+
| Stats collection | No | Yes (via `kata-ctl direct-volume stats`) |
193+
| Configuration | Pod annotation | `mountInfo.json` on host filesystem |
194+
| Use case | Simple block-to-mount conversion | Full CSI integration with lifecycle management |
195+
196+
## Limitations
197+
198+
1. Only `ext4` and `xfs` filesystem types are supported.
199+
2. Volume resize and stats collection are not supported (use direct-assigned volumes for
200+
these features).
201+
3. The annotation applies to the Go runtime (`runtime-go`) only. Runtime-rs support is
202+
planned.
203+
4. Host-side formatting requires `blkid` and `mkfs.<fstype>` to be available on the host.

docs/how-to/how-to-set-sandbox-config-kata.md

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -100,6 +100,11 @@ There are several kinds of Kata configurations and they are listed below.
100100
| `io.katacontainers.config.hypervisor.block_device_num_queues` | `usize` | The number of queues to use for block devices (runtime-rs only) |
101101
| `io.katacontainers.config.hypervisor.block_device_queue_size` | uint32 | The size of the queue to use for block devices (runtime-rs only) |
102102

103+
## Volume Options
104+
| Key | Value Type | Comments |
105+
|-------| ----- | ----- |
106+
| `io.katacontainers.volume.block-mounts` | string (JSON) | Specifies block devices (from `volumeDevices`) that should be mounted as filesystems inside the guest VM instead of being passed as raw devices. Value is a JSON object mapping device paths to mount configurations. See [design doc](../design/annotation-block-device-mounts.md) for details |
107+
103108
## Container Options
104109
| Key | Value Type | Comments |
105110
|-------| ----- | ----- |
