Architecture Concept Review: Kyma GPU Module

## 1. Overview

The GPU module is a Kubernetes operator that runs inside each SKR cluster and installs the NVIDIA GPU Operator with cluster-appropriate Helm values. It is built with controller-runtime, exposes a single API group, and is deployed by KLM.

User flow: enable the module from BTP Cockpit, KLM creates a `Gpu` CR, the operator detects GPU nodes, installs the NVIDIA stack via Helm, and reports status. A pod requesting `nvidia.com/gpu` then schedules and runs.

## 2. Scope

- Install the NVIDIA GPU Operator via embedded Helm chart (Helm Go SDK).
- Detect GPU nodes from machine type and OS labels (AWS, GCP, Azure).
- Configure the Garden Linux driver path with `usePrecompiled: true` and the `gardenlinux-nvidia-installer` image.
- Lifecycle: install, upgrade on values or chart bump, uninstall via finalizer, drift correction.
- Status aggregation: component health rolled up into `Gpu.status`.
- DCGM exporter enabled by default.

## 3. Architecture

| Decision | Choice |
|---|---|
| Operator location | Inside SKR, `kyma-system` namespace. Deployed by KLM. |
| Pattern | Controller-runtime + kubebuilder, single API group `gpu.kyma-project.io/v1beta1`. |
| Install mechanism | Helm Go SDK. |
| CR shape | One `Gpu` per cluster, name `default` in `kyma-system`. |
| Driver loading | NVIDIA GPU Operator + `gardenlinux-nvidia-installer` precompiled image. |
| Status model | `state` enum plus `metav1.Condition` rollup. |

## 4. Driver model

Garden Linux ships no build tools on running nodes. NVIDIA's default driver container compiles modules at runtime, which fails on Garden Linux. The fix is the upstream `gardenlinux-nvidia-installer` project, which pre-compiles modules at GL image-build time and publishes them as container images. The module configures NVIDIA GPU Operator to load those images.

Helm values the module emits on a Garden-Linux-only cluster:

```yaml
cdi:
  enabled: true
  default: true
toolkit:
  enabled: true
  installDir: /opt/nvidia
driver:
  enabled: true
  usePrecompiled: true
  version: "590"
  repository: ghcr.io/gardenlinux/gardenlinux-nvidia-installer/1.7.0
  imagePullPolicy: Always
node-feature-discovery:
  worker:
    config:
      sources:
        custom:
          - name: gardenlinux-version
            matchFeatures:
              - feature: system.osrelease
                matchExpressions:
                  GARDENLINUX_VERSION: { op: Exists }
```

The `gardenlinux-nvidia-installer` repo publishes a known-good combination per release tag. The module adopts whole tags.

## 5. Sovereign cloud - TBD

## 6. Decisions

- Image mirror coverage for sovereign regions.


## 7. Delivery

Work proceeds in phases. Each phase must complete before the next starts.

1. Project scaffold: repo, CI, leader election, health probes.
2. API and CRD: `Gpu` types, validation markers, generated manifests.
3. Node detection: machine type registry, OS detection, provider detection.
4. Helm install (MVP gate): enable the module on a real Shoot, NVIDIA stack runs, `cuda-vectoradd` smoke pod completes.
5. Status aggregation: component health rolled up into `Gpu.status`.
6. Configuration and lifecycle: spec changes, drift correction, deletion flow, mixed-OS.
7. Module packaging: Deployment, RBAC, ModuleTemplate, Dockerfile, KLM integration.
8. Testing: unit, integration on Kind, nightly E2E on real GPU cluster.
9. Documentation: user guide, CRD reference, troubleshooting, ops guide.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Architecture Concept Review: Kyma GPU Module #12

1. Overview

2. Scope

3. Architecture

4. Driver model

5. Sovereign cloud - TBD

6. Decisions

7. Delivery

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Decision	Choice
Operator location	Inside SKR, `kyma-system` namespace. Deployed by KLM.
Pattern	Controller-runtime + kubebuilder, single API group `gpu.kyma-project.io/v1beta1`.
Install mechanism	Helm Go SDK.
CR shape	One `Gpu` per cluster, name `default` in `kyma-system`.
Driver loading	NVIDIA GPU Operator + `gardenlinux-nvidia-installer` precompiled image.
Status model	`state` enum plus `metav1.Condition` rollup.

Architecture Concept Review: Kyma GPU Module #12

Description

1. Overview

2. Scope

3. Architecture

4. Driver model

5. Sovereign cloud - TBD

6. Decisions

7. Delivery

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions