Describe the bug
`nvidia-container-toolkit` generates CDI specs with the version set to `0.5.0` that contain only a single `management.nvidia.com/gpu=all` device entry; no UUID-based or index-based entries are generated.
This causes failures when the device plugin operates in CDI mode, because it annotates pods with specific GPU UUIDs (`management.nvidia.com/gpu=GPU-xxxx`) that cannot be resolved from a spec that defines only `gpu=all`.
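The lookup failure can be illustrated with a small shell sketch. This is not the toolkit's or the runtime's actual resolution code. It assumes only that a fully qualified CDI device name has the form `<kind>=<device>` and that the spec is a YAML file with a top-level `kind:` line and one `- name:` line per device, as in the snippets in this issue. The function names `cdi_spec_kind`, `cdi_spec_devices`, and `cdi_resolves` are hypothetical.

```shell
#!/bin/sh
# Sketch only: mimics CDI name resolution with text matching, not a YAML parser.

# Print the "kind" declared by a spec file, e.g. management.nvidia.com/gpu.
cdi_spec_kind() {
    sed -n -e 's/[[:space:]]*$//' -e 's/^kind:[[:space:]]*//p' "$1" | tr -d '"' | head -n 1
}

# Print one device name per line, taken from the "- name:" entries.
cdi_spec_devices() {
    sed -n -e 's/[[:space:]]*$//' -e 's/^[[:space:]]*-[[:space:]]*name:[[:space:]]*//p' "$1" | tr -d '"'
}

# Succeed only if the qualified name <kind>=<device> is declared by the spec.
cdi_resolves() {
    spec=$1
    qualified=$2
    kind=${qualified%%=*}     # text before the first "="
    device=${qualified#*=}    # text after the first "="
    [ "$(cdi_spec_kind "$spec")" = "$kind" ] || return 1
    cdi_spec_devices "$spec" | grep -Fxq -- "$device"
}
```

For a spec that declares only `all`, `cdi_resolves <spec-file> management.nvidia.com/gpu=GPU-xxxx` returns non-zero, while `management.nvidia.com/gpu=all` succeeds. That matches the "unresolvable CDI devices" error reported here.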
Additional notes: I came across a related bug, #1239, but removing `runtimeClassName` doesn't resolve the issue. I also tried setting `runtimeClassName` to `nvidia` or `nvidia-cdi` (although that issue mentions this shouldn't be required with the native generation of the CDI spec).
To Reproduce
Install a Kubernetes cluster with the component versions noted below, and deploy a test pod.
apiVersion: v1
kind: Pod
metadata:
  name: test-cdi-device-plugin
  namespace: cdi-test
spec:
  tolerations:
    - key: nvidia.com/gpu
      value: "true"
      effect: NoSchedule
  restartPolicy: Never
  containers:
    - name: gpu-test
      image: nvcr.io/nvidia/cuda:13.0.0-base-ubuntu24.04
      command: ["sh", "-c", "nvidia-smi -L"]
      resources:
        limits:
          nvidia.com/gpu: 1
The container never starts and fails with the following error:
Normal Pulling 12s kubelet spec.containers{gpu-test}: Pulling image "nvcr.io/nvidia/cuda:13.0.0-base-ubuntu24.04"
Normal Pulled 8s kubelet spec.containers{gpu-test}: Successfully pulled image "nvcr.io/nvidia/cuda:13.0.0-base-ubuntu24.04" in 4.301s (4.301s including waiting). Image size: 139357825 bytes.
Normal Created 8s kubelet spec.containers{gpu-test}: Container created
Warning Failed 8s kubelet spec.containers{gpu-test}: Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: could not apply required modification to OCI specification: error modifying OCI spec: failed to inject CDI devices: unresolvable CDI devices management.nvidia.com/gpu=GPU-67fa9265-2b87-d8ce-e708-13e93416062c
Observations
On certain platforms, the `nvidia-container-toolkit` pods generate CDI specs with `hostPath` fields in device nodes, which bumps the CDI spec version to `0.5.0`. These `0.5.0` specs contain only a single `management.nvidia.com/gpu=all` device entry; no UUID-based or index-based entries are generated.
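To check which nodes are affected, something like the following sketch could summarize the specs in a CDI directory. It relies on text matching rather than a YAML parser and assumes the same layout as the snippets in this issue (a top-level `cdiVersion:` line and one `- name:` line per device). The function name `cdi_inventory` is hypothetical, and the directory to pass (for example `/var/run/cdi` or `/etc/cdi`, the default CDI locations) depends on how the toolkit is configured on the node.

```shell
#!/bin/sh
# Sketch only: summarizes YAML CDI specs by text matching, not a YAML parser.

# Print "<file> cdiVersion=<version> devices=<comma-separated names>" per spec.
cdi_inventory() {
    for spec in "$1"/*.yaml "$1"/*.yml; do
        [ -f "$spec" ] || continue   # skip unmatched glob patterns
        version=$(sed -n -e 's/[[:space:]]*$//' -e 's/^cdiVersion:[[:space:]]*//p' "$spec" \
            | tr -d '"' | head -n 1)
        devices=$(sed -n -e 's/[[:space:]]*$//' -e 's/^[[:space:]]*-[[:space:]]*name:[[:space:]]*//p' "$spec" \
            | tr -d '"' | paste -sd, -)
        printf '%s cdiVersion=%s devices=%s\n' "$spec" "$version" "$devices"
    done
}
```

A spec reported with `devices=all` and nothing else is one that cannot satisfy a request for a specific GPU UUID.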
Snippet of the (failing case) `0.5.0` CDI spec
cdiVersion: 0.5.0
kind: management.nvidia.com/gpu
devices:
  - name: all
    containerEdits:
      deviceNodes:
        - path: /dev/nvidia-modeset
          hostPath: /dev/nvidia-modeset
        - path: /dev/nvidia-uvm
          hostPath: /dev/nvidia-uvm
        - path: /dev/nvidia-uvm-tools
          hostPath: /dev/nvidia-uvm-tools
        - path: /dev/nvidia0
          hostPath: /dev/nvidia0
        - path: /dev/nvidia1
          hostPath: /dev/nvidia1
        - path: /dev/nvidia2
          hostPath: /dev/nvidia2
        - path: /dev/nvidia3
          hostPath: /dev/nvidia3
        - path: /dev/nvidia4
          hostPath: /dev/nvidia4
        - path: /dev/nvidia5
          hostPath: /dev/nvidia5
        - path: /dev/nvidia6
          hostPath: /dev/nvidia6
        - path: /dev/nvidia7
          hostPath: /dev/nvidia7
        - path: /dev/nvidiactl
          hostPath: /dev/nvidiactl
        - path: /dev/nvidia-caps/nvidia-cap0
          hostPath: /dev/nvidia-caps/nvidia-cap0
        - path: /dev/nvidia-caps/nvidia-cap1
          hostPath: /dev/nvidia-caps/nvidia-cap1
        - path: /dev/nvidia-caps/nvidia-cap2
          hostPath: /dev/nvidia-caps/nvidia-cap2
Snippet of (working case) `0.3.0` CDI spec
cdiVersion: 0.3.0
kind: management.nvidia.com/gpu
devices:
  - name: GPU-1da763bb-efa9-5bb1-0740-7ad1ffe7e78e
    containerEdits:
      deviceNodes:
        - path: /dev/nvidia7
          major: 195
          minor: 7
          fileMode: 438
          permissions: rwm
        - path: /dev/dri/card8
          major: 226
          minor: 8
          fileMode: 432
          permissions: rwm
          gid: 44
        - path: /dev/dri/renderD136
          major: 226
          minor: 136
          fileMode: 432
          permissions: rwm
          gid: 992
      hooks:
        - hookName: createContainer
          path: /usr/local/nvidia/toolkit/nvidia-cdi-hook
          args:
            - nvidia-cdi-hook
            - create-symlinks
            - --link
            - ../card8::/dev/dri/by-path/pci-0000:63:00.0-card
            - --link
            - ../renderD136::/dev/dri/by-path/pci-0000:63:00.0-render
          env:
            - NVIDIA_CTK_DEBUG=false
  - name: GPU-39b958b0-8067-e3d6-aced-7e130721f922
    containerEdits:
      deviceNodes:
        - path: /dev/nvidia0
          major: 195
          fileMode: 438
          permissions: rwm
        - path: /dev/dri/card0
          major: 226
          fileMode: 432
          permissions: rwm
          gid: 44
        - path: /dev/dri/renderD129
          major: 226
Expected behavior
The test pod requesting GPU devices completes successfully with the following logs:
$ k logs test-cdi-device-plugin -n cdi-test
GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-06714eaa-5321-72be-2b23-479ee6b6221a)
Environment (please provide the following information):
- nvidia-container-toolkit version: v1.18.1
- NVIDIA Driver Version: 580.126.20
- Host OS: Ubuntu 24.04.4 LTS
- Kernel Version: 6.8.0-1046-nvidia
- Container Runtime Version: bundled with v1.35.2+rke2r1
- CPU Architecture: x86_64 and arm64
- GPU Model(s): GH100, GB300
If applicable, also provide:
Warning Failed 3s kubelet spec.containers{gpu-test}: Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: could not apply required modification to OCI specification: error modifying OCI spec: failed to inject CDI devices: unresolvable CDI devices management.nvidia.com/gpu=GPU-fd24a321-803d-5ba4-0747-2f56f62bb66a
Kubernetes Distro and Version: K8s v1.35.2+rke2r1
NVIDIA GPU Operator version: v25.10.1
Container logs