
CDI mode (cdi-cri) mounts libcuda.so at sysroot path inside containers on aarch64 aws-k8s-1.34-nvidia #4811

@dborysenko

Description


Image I'm using:

Bottlerocket v1.56.0, variant aws-k8s-1.34-nvidia, architecture aarch64
EC2 instance family: g5g (NVIDIA T4G GPU, aarch64)
NVIDIA driver: 580.126.09

Selected via Karpenter EC2NodeClass:

amiSelectorTerms:
  - alias: bottlerocket@v1.56.0

What I expected to happen:

GPU containers should have libcuda.so mounted at the standard path:

/usr/lib/aarch64-linux-gnu/libcuda.so.580.126.09

This is the behavior observed on the same Bottlerocket version (v1.56.0) with the aws-k8s-1.33-nvidia variant, which defaults to volume-mounts mode. In that mode, nvidia-container-cli correctly mounts the host's CUDA driver as a real file at the standard path inside the container. The sysroot path does not appear inside the container at all.

What actually happened:

On aws-k8s-1.34-nvidia, which defaults to cdi-cri mode (introduced by bottlerocket-os/bottlerocket PR #4475), the CDI spec mounts libcuda.so at the raw Bottlerocket sysroot path inside containers:

/aarch64-bottlerocket-linux-gnu/sys-root/usr/lib/nvidia/tesla/libcuda.so.580.126.09

The host's ldcache on K8s 1.34 shows:

libcuda.so.1 (libc6,AArch64) => /aarch64-bottlerocket-linux-gnu/sys-root/usr/lib/nvidia/tesla/libcuda.so.1

/usr/lib/aarch64-linux-gnu/ does not exist on the host. The CDI spec generation uses this ldcache path without remapping it to a standard container path, so the sysroot path is injected as-is into GPU containers.
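In CDI terms, the resulting mount entry presumably looks like the following. This is a reconstructed excerpt, not a dump of an actual generated spec: the paths come from the observations above, and the hostPath/containerPath structure from the CDI specification's mount format.

```yaml
# Reconstructed illustration only: hostPath is mapped into the container
# unchanged, so the Bottlerocket sysroot path leaks into the workload.
mounts:
  - hostPath: /aarch64-bottlerocket-linux-gnu/sys-root/usr/lib/nvidia/tesla/libcuda.so.580.126.09
    containerPath: /aarch64-bottlerocket-linux-gnu/sys-root/usr/lib/nvidia/tesla/libcuda.so.580.126.09
```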

Applications that locate libcuda.so via ldconfig, standard dlopen search paths, or CUDA toolkit conventions fail to find it. The same container image (confirmed identical SHA digest) works correctly on aws-k8s-1.33-nvidia.

How to reproduce the problem:

1. Create a Karpenter NodePool targeting g5g instances on a Kubernetes 1.34 EKS cluster with bottlerocket@v1.56.0.
2. Do not set device-list-strategy in userData; allow the variant default (cdi-cri) to apply.
3. Schedule a GPU pod (nvidia.com/gpu: "1") on one of these nodes.
4. Inside the container, run:

# Inside the container: the library appears only at the sysroot path, not the standard path
find / -name "libcuda.so*" 2>/dev/null
# On the host (SSM session, then `sheltie`): the ldcache points at the sysroot path
ldconfig -p | grep libcuda
# => libcuda.so.1 (libc6,AArch64) => /aarch64-bottlerocket-linux-gnu/sys-root/usr/lib/nvidia/tesla/libcuda.so.1
# The standard multiarch directory does not exist on the host
ls /usr/lib/aarch64-linux-gnu/
# => No such file or directory

Workaround:

Explicitly setting device-list-strategy = 'volume-mounts' in the EC2NodeClass userData restores correct behavior. In this mode, nvidia-container-cli mounts the host's CUDA driver as a real 96MB file at the standard path inside the container, with no sysroot path present inside the container:

[settings.kubelet-device-plugins.nvidia]
'device-list-strategy' = 'volume-mounts'
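For completeness, one way to wire this in via Karpenter is shown below. This is a sketch: the EC2NodeClass name and surrounding fields are placeholders; only the userData TOML comes from this report.

```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-aarch64        # placeholder name
spec:
  amiSelectorTerms:
    - alias: bottlerocket@v1.56.0
  userData: |
    [settings.kubelet-device-plugins.nvidia]
    'device-list-strategy' = 'volume-mounts'
```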

Confirmed inside container after applying workaround:

ls -la /usr/lib/aarch64-linux-gnu/libcuda.so*
# lrwxrwxrwx libcuda.so -> libcuda.so.1
# lrwxrwxrwx libcuda.so.1 -> libcuda.so.580.126.09
# -rwxr-xr-x libcuda.so.580.126.09  (96MB real file, not a symlink)
ls /aarch64-bottlerocket-linux-gnu/
# => no such directory (sysroot not present in container)

Root cause hypothesis:

The CDI spec generated by nvidia-ctk on Bottlerocket aarch64 uses the paths reported by the host's ldcache (/aarch64-bottlerocket-linux-gnu/sys-root/usr/lib/nvidia/tesla/), and maps them into containers at the same path without remapping to standard library locations. By contrast, nvidia-container-cli (legacy/volume-mounts mode) correctly mounts the host's driver library at the standard /usr/lib/aarch64-linux-gnu/ path inside the container.

The path transform needed to remap sysroot paths to standard container paths already exists in nvidia-container-toolkit (pkg/nvcdi/transform/root, referenced in NVIDIA/nvidia-container-toolkit#1116) but is not applied during Bottlerocket's CDI spec generation for aarch64 sysroot layouts.
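As a rough illustration of the remap such a transform would perform, the sketch below strips the sysroot prefix and re-roots the library under the multiarch path the container's loader searches. This is a simplified illustration of the idea only, not the toolkit's actual code; the variable names are invented for this example.

```shell
# Sketch: strip the Bottlerocket sysroot prefix and re-root the library under
# the standard multiarch path the container's dynamic loader actually searches.
sysroot="/aarch64-bottlerocket-linux-gnu/sys-root"
host_path="${sysroot}/usr/lib/nvidia/tesla/libcuda.so.580.126.09"
container_path="/usr/lib/aarch64-linux-gnu/$(basename "${host_path}")"
echo "${container_path}"
# => /usr/lib/aarch64-linux-gnu/libcuda.so.580.126.09
```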

Related:

#4475 — introduced cdi-cri as default for aws-k8s-1.34+ variants
bottlerocket-os/bottlerocket-core-kit#501 — fixed ldcache parsing for aarch64 (CDI spec generation now succeeds, but path remapping is a separate issue)
bottlerocket-os/bottlerocket-core-kit#459 — CDI + legacy mode support implementation
NVIDIA/nvidia-container-toolkit#1045 — upstream aarch64 ldcache fix
NVIDIA/nvidia-container-toolkit#1116 — path consistency discussion for container OS deployments
