@@ -466,7 +466,9 @@ oc patch ingresscontroller/default -n openshift-ingress-operator --type=merge \
466466
467467Until the secret is populated, the router keeps serving the installer default; after issuance, HAProxy reload picks up the Let’s Encrypt chain.
468468
469- ## Day-2: Add GPU Node
469+ ## Day-2: Add GPU node
470+
471+ The GPU node is an extra worker: its Butane file merges the same **`conf/worker.ign`** as the other workers and differs only in the hostname. Pick a GPU **flavor** and **AZ** that exist in your project.
470472
471473=== "Download"
472474
@@ -476,86 +478,83 @@ Until the secret is populated, the router keeps serving the installer default; a
476478
477479=== "ign-gpu-worker-0.rcc"
478480
479- ` ` ` json
481+ ` ` ` yaml
480482 --8<-- "content/cluster-installation/stackit/ign-gpu-worker-0.rcc"
481483 ` ` `
482484
483485` ` ` shell
484- stackit server create \
485- --assume-yes \
486- --availability-zone eu01-1 \
487- --machine-type n2.14d.g1 \
488- --name " cluster-a-gpu-worker-0" \
489- --boot-volume-source-type image \
490- --boot-volume-source-id 6055861d-6641-4a45-b00e-fcfb250d65e6 \
491- --boot-volume-delete-on-termination \
492- --boot-volume-size 120 \
493- --network-id 459afb3e-54fa-45d4-a972-ae39ec370761 \
494- --user-data @<(butane -d . -r " ign-gpu-worker-0.rcc" )
486+ stackit server create \
487+ --assume-yes \
488+ --availability-zone eu01-1 \
489+ --machine-type n2.14d.g1 \
490+ --name cluster-a-gpu-worker-0 \
491+ --boot-volume-source-type image \
492+ --boot-volume-source-id <RHCOS_IMAGE_ID> \
493+ --boot-volume-delete-on-termination \
494+ --boot-volume-size 120 \
495+ --network-id <NETWORK_ID> \
496+ --user-data @<(butane -d . -r ign-gpu-worker-0.rcc)
495497` ` `
496498
497- Wait and approve CSR
499+ When the node registers, approve any lingering **Pending** CSRs:
498500
499501` ` ` shell
500502export KUBECONFIG="$PWD/conf/auth/kubeconfig"
501503oc get csr | awk '/Pending/{print $1}' | xargs oc adm certificate approve
502504` ` `
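A new worker usually raises two CSRs in sequence (a client request, then a serving request), so a single pass can leave one Pending. The sketch below factors the filter into a helper. `pending_csrs` is a hypothetical name, and matching on the last column assumes the default `oc get csr` table layout, where `CONDITION` comes last.

```shell
# pending_csrs: read `oc get csr` table output on stdin and print the
# names of requests whose last column (CONDITION) is exactly "Pending".
# Hypothetical helper for illustration; it is not part of oc.
pending_csrs() {
  awk '$NF == "Pending" { print $1 }'
}
```

Possible usage, repeated until nothing is left Pending: `oc get csr --no-headers | pending_csrs | xargs -r oc adm certificate approve`. The `-r` flag (GNU xargs) skips the approve call when the list is empty.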
503505
504- Install Nvidia GPU Operator : [NVIDIA GPU Operator on Red Hat OpenShift Container Platform](https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/index.html)
506+ Install the **NVIDIA GPU Operator** (catalog channel + `ClusterPolicy` per your OCP version): [NVIDIA GPU Operator on OpenShift](https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/index.html).
507+
508+ Sanity-check after the operator has prepared the node runtime:
505509
506510` ` ` shell
507- oc create -f - <<EOF
511+ oc new-project gpu-test
512+ oc apply -f - <<'EOF'
508513apiVersion: v1
509514kind: Pod
510515metadata:
511516 name: nvidia-smi
517+ namespace: gpu-test
512518spec:
519+ restartPolicy: Never
513520 containers:
514- - image: registry.redhat.io/rhai/base-image-cuda-13.0-rhel9:3.3.1-1775076057
515- name: nvidia-smi
516- command: [ nvidia-smi ]
517- resources:
518- limits:
519- nvidia.com/gpu: 1
520- requests:
521- nvidia.com/gpu: 1
521+ - name: nvidia-smi
522+ image: registry.redhat.io/rhai/base-image-cuda-13.0-rhel9:3.3.1-1775076057
523+ command: [nvidia-smi]
524+ resources:
525+ limits:
526+ nvidia.com/gpu: "1"
527+ requests:
528+ nvidia.com/gpu: "1"
522529EOF
530+ oc wait -n gpu-test --for=jsonpath='{.status.phase}'=Succeeded pod/nvidia-smi --timeout=120s  # nvidia-smi exits at once, so wait for Succeeded rather than Ready
531+ oc logs -n gpu-test nvidia-smi
532+ ` ` `
533+
534+ Example output (abridged, hardware-dependent):
523535
524- $ oc logs nvidia-smi
536+ ` ` ` text
525537Tue May 5 13:27:54 2026
526538+-----------------------------------------------------------------------------------------+
527539| NVIDIA-SMI 580.126.20 Driver Version: 580.126.20 CUDA Version: 13.0 |
528540+-----------------------------------------+------------------------+----------------------+
529- | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
530- | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
531- | | | MIG M. |
532- |=========================================+========================+======================|
533541| 0 NVIDIA L40S On | 00000000:05:00.0 Off | 0 |
534542| N/A 29C P8 36W / 350W | 0MiB / 46068MiB | 0% Default |
535- | | | N/A |
536- +-----------------------------------------+------------------------+----------------------+
537-
538- +-----------------------------------------------------------------------------------------+
539- | Processes: |
540- | GPU GI CI PID Type Process name GPU Memory |
541- | ID ID Usage |
542- |=========================================================================================|
543- | No running processes found |
544543+-----------------------------------------------------------------------------------------+
545544` ` `
546545
547- ## Day-2: Deploy Cloud Controller Manager (CCM)
546+ Pin the CUDA **image digest** in production; the tag above is illustrative.
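As a minimal sketch of what pinning looks like, the helper below turns a tagged reference into a digest-pinned one. `pin_image` is a hypothetical name, and the digest is something you look up yourself for the tag you validated, for example with `skopeo inspect --format '{{.Digest}}' docker://IMAGE` (this assumes skopeo is installed and logged in to the registry).

```shell
# pin_image REF DIGEST: print REF with its tag (if any) replaced by @DIGEST.
# Hypothetical helper; DIGEST must be looked up separately.
pin_image() {
  ref=$1
  digest=$2
  name=${ref##*/}            # last path component, e.g. repo:tag
  case $name in
    *:*) ref=${ref%:*} ;;    # drop the tag only when the last component has one,
  esac                       # so a registry port such as host:5000 is left intact
  printf '%s@%s\n' "$ref" "$digest"
}
```

The resulting `name@sha256:...` string goes into the pod's `image:` field in place of the tag.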
548547
549- [Upstream documentation](https://github.com/stackitcloud/cloud-provider-stackit/blob/main/docs/deployment.md )
548+ ## Day-2: Cloud Controller Manager (CCM)
550549
551- * Create an Service Account at STACKIT called `ccm-and-csi`
552- * Create Service account keys and download the json file
553- * Assign editor role for the entire project.
550+ **Prereq** — STACKIT service account (reuse for CSI below):
554551
555- Deployment steps :
552+ * Create a service account (e.g. `ccm-and-csi`), download the key JSON.
553+ * Grant **Editor** on the project that owns `projectId` / `networkId` in `cloud.yaml`.
556554
557555` ` ` shell
558- oc create secret generic -n kube-system stackit-cloud-secret --from-file=sa_key.json=<service account json>
556+ oc create secret generic -n kube-system stackit-cloud-secret \
557+ --from-file=sa_key.json=./stackit-ccm-sa.json
559558` ` `
560559
561560=== "Download"
@@ -566,108 +565,92 @@ oc create secret generic -n kube-system stackit-cloud-secret --from-file=sa_key.
566565
567566=== "cloud.yaml"
568567
569- ` ` ` json
568+ ` ` ` yaml
570569 --8<-- "content/cluster-installation/stackit/cloud.yaml"
571570 ` ` `
572571
573- Adjust cloud.yaml and put into configmap :
574-
575572` ` ` shell
576- oc create configmap -n kube-system stackit-cloud-config --from-file=cloud.yaml
573+ oc create configmap -n kube-system stackit-cloud-config \
574+ --from-file=cloud.yaml=./cloud.yaml
577575` ` `
578576
579- Deploy cloud controller manager :
577+ Apply the upstream RBAC and `Service` manifests (pin a **commit** instead of `main` if you need reproducible installs):
580578
581579` ` ` shell
582- oc apply -f https://raw.githubusercontent.com/stackitcloud/cloud-provider-stackit/refs/heads/main/deploy/cloud-controller-manager/rbac.yaml
583- oc apply -f https://github.com/stackitcloud/cloud-provider-stackit/raw/refs/heads/main/deploy/cloud-controller-manager/service.yaml
580+ CCM_BASE=https://raw.githubusercontent.com/stackitcloud/cloud-provider-stackit/main/deploy/cloud-controller-manager
581+ oc apply -f "${CCM_BASE}/rbac.yaml"
582+ oc apply -f "${CCM_BASE}/service.yaml"
584583` ` `
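One way to pin the manifests is to build the raw URL from an explicit Git ref. This is a sketch: `ccm_manifest_url` is a hypothetical helper, and the commit you pass in is one you have reviewed yourself; no specific upstream commit is recommended here.

```shell
# ccm_manifest_url REF FILE: print the raw GitHub URL of a CCM manifest
# at a given Git ref (branch name or full commit SHA).
# Hypothetical helper; the repository path mirrors the URLs used above.
ccm_manifest_url() {
  ref=$1
  file=$2
  printf 'https://raw.githubusercontent.com/stackitcloud/cloud-provider-stackit/%s/deploy/cloud-controller-manager/%s\n' \
    "$ref" "$file"
}
```

Possible usage: `oc apply -f "$(ccm_manifest_url <COMMIT_SHA> rbac.yaml)"`, and likewise for `service.yaml`.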
585584
586585=== "Apply"
587586
588587 ` ` ` shell
589- oc apply -f{{ page.canonical_url }}ccm-and-csi-deployment.yaml
588+ oc apply -f {{ page.canonical_url }}ccm-and-csi-deployment.yaml
590589 ` ` `
591590
592591=== "ccm-and-csi-deployment.yaml"
593592
594- ` ` ` json
593+ ` ` ` yaml
595594 --8<-- "content/cluster-installation/stackit/ccm-and-csi-deployment.yaml"
596595 ` ` `
597596
598- ???+ failure
597+ ???+ warning "CCM panic observed"
598+
599+ With **`cloud-controller-manager:v1.36.0`** this deployment **segfaulted** right after startup (nil pointer dereference); the root cause was not investigated. **CSI dynamic provisioning still worked** without a healthy CCM, since volume provisioning does not depend on cloud `Service` load balancers, but confirm this for your image and region before relying on that split.
599600
600601 ```
601602 starting Controller
602603 I0507 12:55:45.723462 1 serving.go:411] Generated self-signed cert in-memory
603- W0507 12:55:45.723531 1 client_config.go:683] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
604+ W0507 12:55:45.723531 1 client_config.go:683] Neither --kubeconfig nor --master was
605+ specified. Using the inClusterConfig. This might not work.
604606 panic : runtime error: invalid memory address or nil pointer dereference
605607 [signal SIGSEGV : segmentation violation code=0x1 addr=0x18 pc=0x4fe685]
606608 ` ` `
607609
608610 **Not investigated further, because CSI appears to work without CCM.**
609611
610- ## Day-2: Container Storage Interface ( CSI) Driver
612+ ## Day-2: STACKIT CSI driver
611613
612- [Upstream documentation ](https://github.com/stackitcloud/cloud-provider-stackit/blob/main/docs/csi-driver.md)
614+ [Upstream CSI doc](https://github.com/stackitcloud/cloud-provider-stackit/blob/main/docs/csi-driver.md). Reuse the **same** STACKIT SA and `cloud.yaml` as in the CCM section.
613615
614- * Create an Service Account at STACKIT called ` ccm-and-csi` (already done in `[Day-2: Deploy Cloud Controller Manager (CCM)` )
615- * Create Service account keys and download the json file (already done in `[Day-2: Deploy Cloud Controller Manager (CCM)` )
616- * Assign editor role for the entire project. (already done in `[Day-2: Deploy Cloud Controller Manager (CCM)` )
617-
618- Let's deploy the csi-driver into namespace/project `stackit-csi-driver`
616+ The overlay sets **`namespace: stackit-csi-driver`** on all resources from the remote base. Put **`stackit-cloud-secret`** and **`stackit-cloud-config`** in that namespace, not `kube-system`, or the controller cannot mount the credentials.
619617
620618` ` ` shell
621619oc new-project stackit-csi-driver
620+ oc create secret generic -n stackit-csi-driver stackit-cloud-secret \
621+ --from-file=sa_key.json=./stackit-ccm-sa.json
622+ oc create configmap -n stackit-csi-driver stackit-cloud-config \
623+ --from-file=cloud.yaml=./cloud.yaml
622624` ` `
623625
624- ` ` ` shell
625- oc create secret generic -n kube-system stackit-cloud-secret --from-file=sa_key.json=<service account json>
626- ` ` `
627-
628- === "Download"
629-
630- ` ` ` shell
631- curl -L -O {{ page.canonical_url }}cloud.yaml
632- ` ` `
633-
634- === "cloud.yaml"
635-
636- ` ` ` json
637- --8<-- "content/cluster-installation/stackit/cloud.yaml"
638- ` ` `
639-
640- Adjust cloud.yaml and put into configmap :
641-
642- ` ` ` shell
643- oc create configmap stackit-cloud-config --from-file=cloud.yaml
644- ` ` `
645-
646- Allow the CSI node componentes to run privileged, looks like it only need hostpath and hostnetwork. It's recommended to pick and/or create a more precise security context constraint.
626+ The node plugin needs host paths and devices; grant the **`privileged`** SCC to the node service account (narrow it with a custom SCC later if you need to):
647627
648628` ` ` shell
649- oc adm policy add-scc-to-user privileged -z csi-stackit-node-sa
629+ oc -n stackit-csi-driver adm policy add-scc-to-user privileged -z csi-stackit-node-sa
650630` ` `
651631
652- Download and apply `kustomization.yaml` to deploy csi driver into specific namespace and propper image url
632+ Create a working directory for `kustomize build` / `oc apply -k`. The remote base is fetched over **HTTPS**, so no `git@` SSH key is needed; the image tag override stays in the local overlay.
653633
654634=== "Download"
655635
656636 ` ` ` shell
657- curl -L -O {{ page.canonical_url }}kustomization.yaml
637+ mkdir -p stackit-csi && cd stackit-csi
638+ curl -L -O {{ page.canonical_url }}csi/kustomization.yaml
658639 ` ` `
659640
660641=== "kustomization.yaml"
661642
662- ` ` ` json
663- --8<-- "content/cluster-installation/stackit/kustomization.yaml"
643+ ` ` ` yaml
644+ --8<-- "content/cluster-installation/stackit/csi/kustomization.yaml"
664645 ` ` `
665646
666647` ` ` shell
667648oc apply -k .
668649` ` `
669650
670- ### Let's try to storage
651+ ### Validate provisioning
652+
653+ The `StorageClass` **`premium-perf4-stackit`** ships with the upstream CSI bundle (name matches [their examples](https://github.com/stackitcloud/cloud-provider-stackit/blob/main/docs/csi-driver.md)).
671654
672655` ` ` shell
673656oc new-project storage-test
@@ -681,6 +664,6 @@ oc new-project storage-test
681664
682665=== "lets-try-storage.yaml"
683666
684- ` ` ` json
667+ ` ` ` yaml
685668 --8<-- "content/cluster-installation/stackit/lets-try-storage.yaml"
686669 ` ` `