[Bug]: CDI v0.5.0 specs without UUID device names break device plugin in CDI mode #1895

Description

@aditighag

Describe the bug
nvidia-container-toolkit generates CDI specs with cdiVersion 0.5.0 that contain only a single management.nvidia.com/gpu=all device entry; no UUID or index-based entries are generated.

This causes failures when the device plugin operates in CDI mode, as it annotates pods with specific GPU UUIDs (management.nvidia.com/gpu=GPU-xxxx) that cannot be resolved from the gpu=all-only spec.
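
To illustrate why the lookup fails, here is a minimal Python sketch of resolving a fully qualified CDI device name (kind=name) against an already-parsed spec. It is not the resolver the runtime actually uses; the function name `resolve`, the in-memory dict layout, and the `GPU-xxxx` placeholder are assumptions made for illustration only.

```python
def resolve(spec: dict, qualified_name: str):
    """Return the device entry matching 'kind=name', or None if unresolvable."""
    kind, sep, name = qualified_name.partition("=")
    # The kind (vendor/class) must match the spec before names are compared.
    if not sep or kind != spec.get("kind"):
        return None
    for device in spec.get("devices", []):
        if device.get("name") == name:
            return device
    return None


# Shape of the failing spec: a single 'all' entry (containerEdits elided).
all_only_spec = {
    "cdiVersion": "0.5.0",
    "kind": "management.nvidia.com/gpu",
    "devices": [{"name": "all", "containerEdits": {}}],
}

# 'all' resolves, but a per-GPU name (placeholder UUID) does not.
print(resolve(all_only_spec, "management.nvidia.com/gpu=all") is not None)
print(resolve(all_only_spec, "management.nvidia.com/gpu=GPU-xxxx") is not None)
```

Under these assumptions, any name the device plugin requests that is not literally listed under `devices` comes back as unresolvable, which matches the error reported below.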

Additional notes: I came across a related bug, #1239, but removing runtimeClassName doesn't resolve the issue. I also tried setting runtimeClassName to nvidia or nvidia-cdi (although that issue mentions that this shouldn't be required with the native generation of the CDI spec).

To Reproduce
Install a Kubernetes cluster with the component versions noted below, and deploy a test pod.

  apiVersion: v1
  kind: Pod
  metadata:
    name: test-cdi-device-plugin
    namespace: cdi-test
  spec:
    tolerations:
    - key: nvidia.com/gpu
      value: "true"
      effect: NoSchedule
    restartPolicy: Never
    containers:
    - name: gpu-test
      image: nvcr.io/nvidia/cuda:13.0.0-base-ubuntu24.04
      command: ["sh", "-c", "nvidia-smi -L"]
      resources:
        limits:
          nvidia.com/gpu: 1

The container never starts and fails with the following error:

  Normal   Pulling    12s   kubelet            spec.containers{gpu-test}: Pulling image "nvcr.io/nvidia/cuda:13.0.0-base-ubuntu24.04"
  Normal   Pulled     8s    kubelet            spec.containers{gpu-test}: Successfully pulled image "nvcr.io/nvidia/cuda:13.0.0-base-ubuntu24.04" in 4.301s (4.301s including waiting). Image size: 139357825 bytes.
  Normal   Created    8s    kubelet            spec.containers{gpu-test}: Container created
  Warning  Failed     8s    kubelet            spec.containers{gpu-test}: Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: could not apply required modification to OCI specification: error modifying OCI spec: failed to inject CDI devices: unresolvable CDI devices management.nvidia.com/gpu=GPU-67fa9265-2b87-d8ce-e708-13e93416062c
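
When triaging several nodes, it can help to pull the requested device names out of such events. The following is a rough Python sketch that does so; the helper name `unresolvable_devices` and the regular expression are illustrative assumptions, not a complete grammar for CDI device names.

```python
import re

# Matches fully qualified names such as 'management.nvidia.com/gpu=GPU-...'.
# This pattern is an assumption for illustration, not the official name syntax.
CDI_NAME = re.compile(r"([A-Za-z0-9.\-_]+/[A-Za-z0-9.\-_]+)=([A-Za-z0-9.\-_:]+)")


def unresolvable_devices(message: str):
    """Return (kind, name) pairs listed after 'unresolvable CDI devices'."""
    _, sep, tail = message.partition("unresolvable CDI devices")
    if not sep:
        return []
    return CDI_NAME.findall(tail)


# Tail of the event message from this report.
event = (
    "failed to inject CDI devices: unresolvable CDI devices "
    "management.nvidia.com/gpu=GPU-67fa9265-2b87-d8ce-e708-13e93416062c"
)
for kind, name in unresolvable_devices(event):
    print(kind, name)
```

The extracted names can then be compared against the `name` entries in the generated spec on the affected node.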

Observations
On certain platforms, the nvidia-container-toolkit pods generate CDI specs with hostPath fields in device nodes, which bumps the CDI spec version to 0.5.0. These 0.5.0 specs contain only a single management.nvidia.com/gpu=all device entry; no UUID or index-based entries are generated.
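
As a rough illustration of the version bump described above, the Python sketch below checks whether any device node in a parsed spec sets hostPath and, if so, reports 0.5.0. It is deliberately simplified: it looks only at hostPath, the helper names are hypothetical, and the 0.3.0 baseline is an assumption taken from the working spec in this report. Other fields may also raise the required version.

```python
def uses_host_path(spec: dict) -> bool:
    """True if any device node in the spec sets a hostPath field."""
    # Collect spec-level edits (if any) and the edits of every device.
    edits = [spec.get("containerEdits") or {}]
    edits += [d.get("containerEdits") or {} for d in spec.get("devices", [])]
    return any(
        "hostPath" in node
        for e in edits
        for node in e.get("deviceNodes") or []
    )


def min_version_for_host_path(spec: dict, baseline: str = "0.3.0") -> str:
    """Simplified rule: hostPath on a device node implies 0.5.0."""
    return "0.5.0" if uses_host_path(spec) else baseline


# Reduced versions of the two specs in this report.
failing = {"devices": [{"name": "all", "containerEdits": {"deviceNodes": [
    {"path": "/dev/nvidia0", "hostPath": "/dev/nvidia0"}]}}]}
working = {"devices": [{"name": "GPU-1da763bb-efa9-5bb1-0740-7ad1ffe7e78e",
    "containerEdits": {"deviceNodes": [
        {"path": "/dev/nvidia7", "major": 195, "minor": 7}]}}]}

print(min_version_for_host_path(failing))
print(min_version_for_host_path(working))
```

Note that this only models the version number; it says nothing about why the per-GPU entries are missing from the 0.5.0 spec.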

Snippet of the (failing case) `0.5.0` CDI spec

  cdiVersion: 0.5.0
  kind: management.nvidia.com/gpu
  devices:
  - name: all
    containerEdits:
      deviceNodes:
      - path: /dev/nvidia-modeset
        hostPath: /dev/nvidia-modeset
      - path: /dev/nvidia-uvm
        hostPath: /dev/nvidia-uvm
      - path: /dev/nvidia-uvm-tools
        hostPath: /dev/nvidia-uvm-tools
      - path: /dev/nvidia0
        hostPath: /dev/nvidia0
      - path: /dev/nvidia1
        hostPath: /dev/nvidia1
      - path: /dev/nvidia2
        hostPath: /dev/nvidia2
      - path: /dev/nvidia3
        hostPath: /dev/nvidia3
      - path: /dev/nvidia4
        hostPath: /dev/nvidia4
      - path: /dev/nvidia5
        hostPath: /dev/nvidia5
      - path: /dev/nvidia6
        hostPath: /dev/nvidia6
      - path: /dev/nvidia7
        hostPath: /dev/nvidia7
      - path: /dev/nvidiactl
        hostPath: /dev/nvidiactl
      - path: /dev/nvidia-caps/nvidia-cap0
        hostPath: /dev/nvidia-caps/nvidia-cap0
      - path: /dev/nvidia-caps/nvidia-cap1
        hostPath: /dev/nvidia-caps/nvidia-cap1
      - path: /dev/nvidia-caps/nvidia-cap2
        hostPath: /dev/nvidia-caps/nvidia-cap2

Snippet of (working case) `0.3.0` CDI spec

  cdiVersion: 0.3.0
  kind: management.nvidia.com/gpu
  devices:
  - name: GPU-1da763bb-efa9-5bb1-0740-7ad1ffe7e78e
    containerEdits:
      deviceNodes:
      - path: /dev/nvidia7
        major: 195
        minor: 7
        fileMode: 438
        permissions: rwm
      - path: /dev/dri/card8
        major: 226
        minor: 8
        fileMode: 432
        permissions: rwm
        gid: 44
      - path: /dev/dri/renderD136
        major: 226
        minor: 136
        fileMode: 432
        permissions: rwm
        gid: 992
      hooks:
      - hookName: createContainer
        path: /usr/local/nvidia/toolkit/nvidia-cdi-hook
        args:
        - nvidia-cdi-hook
        - create-symlinks
        - --link
        - ../card8::/dev/dri/by-path/pci-0000:63:00.0-card
        - --link
        - ../renderD136::/dev/dri/by-path/pci-0000:63:00.0-render
        env:
        - NVIDIA_CTK_DEBUG=false
  - name: GPU-39b958b0-8067-e3d6-aced-7e130721f922
    containerEdits:
      deviceNodes:
      - path: /dev/nvidia0
        major: 195
        fileMode: 438
        permissions: rwm
      - path: /dev/dri/card0
        major: 226
        fileMode: 432
        permissions: rwm
        gid: 44
      - path: /dev/dri/renderD129
        major: 226

Expected behavior
The test pod requesting GPU devices completes successfully, with logs:

$ k logs test-cdi-device-plugin -n cdi-test
GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-06714eaa-5321-72be-2b23-479ee6b6221a)

Environment (please provide the following information):

  • nvidia-container-toolkit version: v1.18.1
  • NVIDIA Driver Version: 580.126.20
  • Host OS: Ubuntu 24.04.4 LTS
  • Kernel Version: 6.8.0-1046-nvidia
  • Container Runtime Version: bundled with v1.35.2+rke2r1
  • CPU Architecture: x86_64 and arm64
  • GPU Model(s): GH100, GB300

If applicable, also provide:

  • Kubernetes Distro and Version: K8s v1.35.2+rke2r1

  • NVIDIA GPU Operator version: v25.10.1

  • Container logs

  Warning  Failed   3s    kubelet  spec.containers{gpu-test}: Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: could not apply required modification to OCI specification: error modifying OCI spec: failed to inject CDI devices: unresolvable CDI devices management.nvidia.com/gpu=GPU-fd24a321-803d-5ba4-0747-2f56f62bb66a

Metadata

Labels

  • bug: Issue/PR to expose/discuss/fix a bug
  • needs-triage: issue or PR has not been assigned a priority-px label
