
Architecture Concept Review: Kyma GPU Module #12

@vrdc-sap

1. Overview

The GPU module is a Kubernetes operator that runs inside each SKR cluster and installs the NVIDIA GPU Operator with cluster-appropriate Helm values. It is built with controller-runtime, exposes a single API group, and is deployed by KLM.

User flow: the user enables the module in BTP Cockpit, KLM creates a Gpu CR, and the operator detects GPU nodes, installs the NVIDIA stack via Helm, and reports status. A pod requesting nvidia.com/gpu can then be scheduled and run.
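The flow above, together with the lifecycle items listed under Scope, can be reduced to one decision per reconcile pass. The following is a minimal sketch of that decision, not the module's actual reconciler: the `Observed` struct, its fields, the digest comparison, and the rule that installation waits for at least one GPU node are all assumptions made for illustration.

```go
package main

import "fmt"

// Action is the step a reconcile pass would take.
type Action string

const (
	ActionInstall   Action = "install"
	ActionUpgrade   Action = "upgrade"
	ActionUninstall Action = "uninstall"
	ActionNone      Action = "none"
)

// Observed is a hypothetical snapshot of cluster state for one pass.
type Observed struct {
	Deleting         bool   // the Gpu CR carries a deletion timestamp
	ReleaseInstalled bool   // a Helm release for the NVIDIA stack exists
	GPUNodes         int    // number of detected GPU nodes
	DesiredDigest    string // digest of chart version plus rendered values
	AppliedDigest    string // digest recorded at the last install or upgrade
}

// decide maps the observed state to a single action.
func decide(o Observed) Action {
	switch {
	case o.Deleting && o.ReleaseInstalled:
		// Finalizer path: remove the release before the CR goes away.
		return ActionUninstall
	case o.Deleting:
		// Nothing left to clean up; the finalizer can be dropped.
		return ActionNone
	case !o.ReleaseInstalled && o.GPUNodes > 0:
		// Assumption: install only once GPU nodes are present.
		return ActionInstall
	case o.ReleaseInstalled && o.DesiredDigest != o.AppliedDigest:
		// Covers spec changes, chart bumps, and drift correction.
		return ActionUpgrade
	default:
		return ActionNone
	}
}

func main() {
	fmt.Println(decide(Observed{GPUNodes: 2}))
}
```

Keeping the decision a pure function of observed state makes it straightforward to unit test without a cluster; the Helm calls would sit behind each returned action.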

2. Scope

  • Install the NVIDIA GPU Operator via embedded Helm chart (Helm Go SDK).
  • Detect GPU nodes from machine type and OS labels (AWS, GCP, Azure).
  • Configure the Garden Linux driver path with usePrecompiled: true and the gardenlinux-nvidia-installer image.
  • Lifecycle: install, upgrade on values or chart bump, uninstall via finalizer, drift correction.
  • Status aggregation: component health rolled up into Gpu.status.
  • DCGM exporter enabled by default.
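As an illustration of the node detection item above, here is a rough sketch that classifies a node from its labels. `node.kubernetes.io/instance-type` is the well-known Kubernetes label; the OS label key, its value, and the prefix table are hypothetical and deliberately incomplete. Inferring the provider from the machine type is also an assumption, since the module may detect the provider separately.

```go
package main

import (
	"fmt"
	"strings"
)

// Well-known Kubernetes label carrying the machine type.
const instanceTypeLabel = "node.kubernetes.io/instance-type"

// Hypothetical label key and value; the real OS signal may differ.
const (
	osLabel          = "example.kyma-project.io/os-name"
	osGardenLinuxVal = "gardenlinux"
)

// Illustrative, non-exhaustive GPU machine type prefixes per provider.
var gpuPrefixes = map[string][]string{
	"aws":   {"g4dn.", "g5.", "p3.", "p4d."},
	"gcp":   {"a2-", "g2-"},
	"azure": {"Standard_NC", "Standard_ND"},
}

// NodeInfo is the classification result for one node.
type NodeInfo struct {
	Provider    string // empty when the machine type is not recognized
	GPU         bool
	GardenLinux bool
}

// classify inspects node labels and reports GPU and OS facts.
func classify(labels map[string]string) NodeInfo {
	info := NodeInfo{GardenLinux: labels[osLabel] == osGardenLinuxVal}
	machineType := labels[instanceTypeLabel]
	for provider, prefixes := range gpuPrefixes {
		for _, p := range prefixes {
			if strings.HasPrefix(machineType, p) {
				info.Provider = provider
				info.GPU = true
				return info
			}
		}
	}
	return info
}

func main() {
	node := map[string]string{
		instanceTypeLabel: "g4dn.xlarge",
		osLabel:           osGardenLinuxVal,
	}
	fmt.Printf("%+v\n", classify(node))
}
```

A prefix table keeps the registry small, but an explicit allow-list of full machine type names would be safer if a provider ever reuses a prefix for non-GPU instances.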

3. Architecture

Decision          | Choice
Operator location | Inside SKR, kyma-system namespace. Deployed by KLM.
Pattern           | Controller-runtime + kubebuilder, single API group gpu.kyma-project.io/v1beta1.
Install mechanism | Helm Go SDK.
CR shape          | One Gpu per cluster, name default in kyma-system.
Driver loading    | NVIDIA GPU Operator + gardenlinux-nvidia-installer precompiled image.
Status model      | state enum plus metav1.Condition rollup.
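To make the status model row more concrete, the sketch below rolls per-component conditions up into a single state. It is illustrative only: the `Condition` struct is a local stand-in for the fields of `metav1.Condition` that matter here, the state names and component condition types are assumptions, and treating any `False` condition as an error is a simplification.

```go
package main

import "fmt"

// State is the top-level enum reported in Gpu.status (names assumed).
type State string

const (
	StateReady      State = "Ready"
	StateProcessing State = "Processing"
	StateError      State = "Error"
)

// Condition is a local stand-in for metav1.Condition, reduced to
// the fields this sketch needs.
type Condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
	Reason string
}

// rollup derives one State from the component conditions.
// Any "False" yields Error, any other non-"True" yields Processing,
// and an empty list is treated as still Processing.
func rollup(conds []Condition) State {
	if len(conds) == 0 {
		return StateProcessing
	}
	state := StateReady
	for _, c := range conds {
		switch c.Status {
		case "False":
			return StateError
		case "True":
			// Component healthy; keep the current state.
		default:
			state = StateProcessing
		}
	}
	return state
}

func main() {
	conds := []Condition{
		{Type: "DriverReady", Status: "True"},         // hypothetical type
		{Type: "DCGMExporterReady", Status: "Unknown"}, // hypothetical type
	}
	fmt.Println(rollup(conds))
}
```

A real rollup would likely distinguish a component that is still starting from one that has failed, for example by inspecting `Reason`, rather than mapping every `False` to an error.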

4. Driver model

Garden Linux ships no build tools on running nodes. NVIDIA's default driver container compiles kernel modules at runtime, which fails on Garden Linux. The fix is the upstream gardenlinux-nvidia-installer project, which pre-compiles the modules at Garden Linux image-build time and publishes them as container images. The module configures the NVIDIA GPU Operator to load those images.

Helm values the module emits on a Garden-Linux-only cluster:

cdi:
  enabled: true
  default: true
toolkit:
  enabled: true
  installDir: /opt/nvidia
driver:
  enabled: true
  usePrecompiled: true
  version: "590"
  repository: ghcr.io/gardenlinux/gardenlinux-nvidia-installer/1.7.0
  imagePullPolicy: Always
node-feature-discovery:
  worker:
    config:
      sources:
        custom:
          - name: gardenlinux-version
            matchFeatures:
              - feature: system.osrelease
                matchExpressions:
                  GARDENLINUX_VERSION: { op: Exists }

The gardenlinux-nvidia-installer repo publishes a known-good combination per release tag. The module adopts each release tag as a whole rather than mixing versions across tags.
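One way the operator could build these values before handing them to the Helm Go SDK is sketched below. The function name and parameters are hypothetical, and the node-feature-discovery section is omitted for brevity; the keys and literal values are taken from the YAML above.

```go
package main

import "fmt"

// Base repository path, taken from the values shown above.
const installerRepo = "ghcr.io/gardenlinux/gardenlinux-nvidia-installer"

// gardenLinuxValues returns Helm values for a Garden-Linux-only cluster.
// installerTag is a gardenlinux-nvidia-installer release tag and
// driverBranch is the driver version paired with that tag.
func gardenLinuxValues(installerTag, driverBranch string) map[string]any {
	return map[string]any{
		"cdi": map[string]any{
			"enabled": true,
			"default": true,
		},
		"toolkit": map[string]any{
			"enabled":    true,
			"installDir": "/opt/nvidia",
		},
		"driver": map[string]any{
			"enabled":         true,
			"usePrecompiled":  true,
			"version":         driverBranch,
			"repository":      installerRepo + "/" + installerTag,
			"imagePullPolicy": "Always",
		},
	}
}

func main() {
	vals := gardenLinuxValues("1.7.0", "590")
	driver := vals["driver"].(map[string]any)
	fmt.Println(driver["repository"], driver["version"])
}
```

Deriving both the repository path and the driver version from one tagged pairing keeps the two in step, which matches the intent of adopting whole release tags.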

5. Sovereign cloud - TBD

6. Decisions

  • Image mirror coverage for sovereign regions.

7. Delivery

Work proceeds in phases. Each phase must complete before the next starts.

  1. Project scaffold: repo, CI, leader election, health probes.
  2. API and CRD: Gpu types, validation markers, generated manifests.
  3. Node detection: machine type registry, OS detection, provider detection.
  4. Helm install (MVP gate): enable the module on a real Shoot, NVIDIA stack runs, cuda-vectoradd smoke pod completes.
  5. Status aggregation: component health rolled up into Gpu.status.
  6. Configuration and lifecycle: spec changes, drift correction, deletion flow, mixed-OS.
  7. Module packaging: Deployment, RBAC, ModuleTemplate, Dockerfile, KLM integration.
  8. Testing: unit, integration on Kind, nightly E2E on real GPU cluster.
  9. Documentation: user guide, CRD reference, troubleshooting, ops guide.
