1. Overview
The GPU module is a Kubernetes operator that runs inside each SKR cluster and installs the NVIDIA GPU Operator with cluster-appropriate Helm values. It is built with controller-runtime, exposes a single API group, and is deployed by KLM.
User flow: enable the module from BTP Cockpit, KLM creates a Gpu CR, the operator detects GPU nodes, installs the NVIDIA stack via Helm, and reports status. A pod requesting nvidia.com/gpu then schedules and runs.
2. Scope
- Install the NVIDIA GPU Operator via embedded Helm chart (Helm Go SDK).
- Detect GPU nodes from machine type and OS labels (AWS, GCP, Azure).
- Configure the Garden Linux driver path with usePrecompiled: true and the gardenlinux-nvidia-installer image.
- Lifecycle: install, upgrade on values or chart bump, uninstall via finalizer, drift correction.
- Status aggregation: component health rolled up into Gpu.status.
- DCGM exporter enabled by default.
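The node-detection item in the list above could look roughly like the following sketch. It is not the module's actual implementation: the prefix registry is illustrative and incomplete, the OS label key `example.kyma-project.io/os-name` is a hypothetical placeholder, and only `node.kubernetes.io/instance-type` is a well-known Kubernetes label.

```go
package main

import (
	"fmt"
	"strings"
)

// Well-known Kubernetes label that carries the cloud machine type.
const instanceTypeLabel = "node.kubernetes.io/instance-type"

// Hypothetical label key for the node OS; the real module may read a different key.
const osLabel = "example.kyma-project.io/os-name"

// gpuMachinePrefixes is an illustrative, incomplete registry of GPU
// machine-type prefixes per provider. The prefixes do not overlap across providers.
var gpuMachinePrefixes = map[string][]string{
	"aws":   {"g4dn.", "g5.", "p3.", "p4d."},
	"gcp":   {"a2-", "g2-"},
	"azure": {"Standard_NC", "Standard_ND"},
}

// Detection is the result of inspecting one node's labels.
type Detection struct {
	GPU         bool
	Provider    string
	GardenLinux bool
}

// detect classifies a node from its labels: GPU or not, which provider,
// and whether the (assumed) OS label marks it as Garden Linux.
func detect(labels map[string]string) Detection {
	d := Detection{GardenLinux: strings.EqualFold(labels[osLabel], "gardenlinux")}
	machineType := labels[instanceTypeLabel]
	if machineType == "" {
		return d
	}
	for provider, prefixes := range gpuMachinePrefixes {
		for _, p := range prefixes {
			if strings.HasPrefix(machineType, p) {
				d.GPU = true
				d.Provider = provider
				return d
			}
		}
	}
	return d
}

func main() {
	node := map[string]string{
		instanceTypeLabel: "g4dn.xlarge",
		osLabel:           "gardenlinux",
	}
	fmt.Printf("%+v\n", detect(node))
}
```

Keeping detection a pure function over a label map makes it testable without a cluster. A controller would call it once per Node object.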
3. Architecture
| Decision | Choice |
| --- | --- |
| Operator location | Inside SKR, kyma-system namespace. Deployed by KLM. |
| Pattern | Controller-runtime + kubebuilder, single API group gpu.kyma-project.io/v1beta1. |
| Install mechanism | Helm Go SDK. |
| CR shape | One Gpu per cluster, name default in kyma-system. |
| Driver loading | NVIDIA GPU Operator + gardenlinux-nvidia-installer precompiled image. |
| Status model | state enum plus metav1.Condition rollup. |
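The status model in the table (a state enum plus a condition rollup) can be sketched as below. This is a minimal illustration under assumptions: the state names `Ready`, `Processing`, and `Error`, the component condition types, and the rollup rule are all hypothetical, and `Condition` is a trimmed local stand-in for `metav1.Condition` so the example compiles with the standard library only.

```go
package main

import "fmt"

// State is the top-level enum reported in Gpu.status (values are assumed).
type State string

const (
	StateReady      State = "Ready"
	StateProcessing State = "Processing"
	StateError      State = "Error"
)

// Condition is a trimmed local stand-in for metav1.Condition.
// Status uses the same string values: "True", "False", "Unknown".
type Condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// rollup derives one State from per-component conditions.
// Assumed rule: any "False" wins as Error, otherwise any non-"True"
// status means Processing, and only all-"True" yields Ready.
func rollup(conds []Condition) State {
	if len(conds) == 0 {
		return StateProcessing // nothing observed yet
	}
	state := StateReady
	for _, c := range conds {
		switch c.Status {
		case "False":
			return StateError
		case "True":
			// healthy component, keep the current state
		default:
			state = StateProcessing
		}
	}
	return state
}

func main() {
	conds := []Condition{
		{Type: "DriverReady", Status: "True"},
		{Type: "ToolkitReady", Status: "Unknown", Reason: "Pending"},
	}
	fmt.Println(rollup(conds))
}
```

In a real controller the conditions would be `metav1.Condition` values stored on the status, with the enum recomputed from them on every reconcile.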
4. Driver model
Garden Linux ships no build tools on running nodes. NVIDIA's default driver container compiles modules at runtime, which fails on Garden Linux. The fix is the upstream gardenlinux-nvidia-installer project, which pre-compiles modules at GL image-build time and publishes them as container images. The module configures NVIDIA GPU Operator to load those images.
Helm values the module emits on a Garden-Linux-only cluster:
cdi:
  enabled: true
  default: true
toolkit:
  enabled: true
  installDir: /opt/nvidia
driver:
  enabled: true
  usePrecompiled: true
  version: "590"
  repository: ghcr.io/gardenlinux/gardenlinux-nvidia-installer/1.7.0
  imagePullPolicy: Always
node-feature-discovery:
  worker:
    config:
      sources:
        custom:
          - name: gardenlinux-version
            matchFeatures:
              - feature: system.osrelease
                matchExpressions:
                  GARDENLINUX_VERSION: { op: Exists }
The gardenlinux-nvidia-installer repo publishes a known-good combination per release tag. The module adopts whole tags.
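One way to express "adopts whole tags" in code is to derive the driver values from a single release record, so the installer tag and driver version cannot drift apart. The sketch below is an assumption about how this might be structured: the `InstallerRelease` type and `gardenLinuxValues` function are hypothetical names, and the node-feature-discovery section is omitted for brevity. The nested `map[string]any` shape is the kind of values map the Helm Go SDK accepts.

```go
package main

import "fmt"

// Base repository taken from the Helm values in this document.
const installerRepo = "ghcr.io/gardenlinux/gardenlinux-nvidia-installer"

// InstallerRelease (hypothetical) pins one gardenlinux-nvidia-installer tag
// together with the driver version that tag was published with.
type InstallerRelease struct {
	Tag           string // e.g. "1.7.0"
	DriverVersion string // e.g. "590"
}

// gardenLinuxValues builds the Helm values for a Garden-Linux-only cluster.
// The node-feature-discovery section is left out of this sketch.
func gardenLinuxValues(r InstallerRelease) map[string]any {
	return map[string]any{
		"cdi": map[string]any{
			"enabled": true,
			"default": true,
		},
		"toolkit": map[string]any{
			"enabled":    true,
			"installDir": "/opt/nvidia",
		},
		"driver": map[string]any{
			"enabled":         true,
			"usePrecompiled":  true,
			"version":         r.DriverVersion,
			"repository":      installerRepo + "/" + r.Tag,
			"imagePullPolicy": "Always",
		},
	}
}

func main() {
	vals := gardenLinuxValues(InstallerRelease{Tag: "1.7.0", DriverVersion: "590"})
	driver := vals["driver"].(map[string]any)
	fmt.Println(driver["repository"], driver["version"])
}
```

Bumping to a new installer release then means changing one `InstallerRelease` value rather than editing the version and repository fields separately.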
5. Sovereign cloud - TBD
6. Decisions
- Image mirror coverage for sovereign regions.
7. Delivery
Work proceeds in phases. Each phase must complete before the next starts.
- Project scaffold: repo, CI, leader election, health probes.
- API and CRD: Gpu types, validation markers, generated manifests.
- Node detection: machine type registry, OS detection, provider detection.
- Helm install (MVP gate): enable the module on a real Shoot, NVIDIA stack runs, cuda-vectoradd smoke pod completes.
- Status aggregation: component health rolled up into Gpu.status.
- Configuration and lifecycle: spec changes, drift correction, deletion flow, mixed-OS.
- Module packaging: Deployment, RBAC, ModuleTemplate, Dockerfile, KLM integration.
- Testing: unit, integration on Kind, nightly E2E on real GPU cluster.
- Documentation: user guide, CRD reference, troubleshooting, ops guide.