Commit d66d347

Update
1 parent 30b5a42 commit d66d347

4 files changed

Lines changed: 133 additions & 94 deletions

content/cluster-installation/stackit/csi/kustomization.yaml

Lines changed: 2 additions & 1 deletion
@@ -3,7 +3,8 @@ kind: Kustomization
 
 namespace: stackit-csi-driver
 resources:
-- git@github.com:stackitcloud/cloud-provider-stackit/deploy/csi-plugin/
+# HTTPS so `oc apply -k` works without a GitHub SSH key (pin ref for reproducibility).
+- https://github.com/stackitcloud/cloud-provider-stackit.git//deploy/csi-plugin?ref=main
 
 images:
 - name: ghcr.io/stackitcloud/cloud-provider-stackit/stackit-csi-plugin:release-v1.34
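
The comment in the hunk above recommends pinning `ref` for reproducibility. A minimal shell sketch of that idea follows. The helper name `csi_base_url` is hypothetical, and the ref you pass in is an assumption: it must be a tag or commit SHA that you have confirmed exists in the upstream repository.

```shell
# Hypothetical helper: print the kustomize remote base URL for a pinned ref.
# The repository URL and sub-path are copied from the kustomization.yaml diff;
# the ref argument is whatever tag or commit SHA you have verified upstream.
csi_base_url() {
  if [ -z "${1:-}" ]; then
    echo "usage: csi_base_url <tag-or-commit-sha>" >&2
    return 1
  fi
  printf 'https://github.com/stackitcloud/cloud-provider-stackit.git//deploy/csi-plugin?ref=%s\n' "$1"
}
```

The printed URL can replace the `?ref=main` entry under `resources:` so that a later `oc apply -k .` does not follow a moving branch.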
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+variant: fcos
+version: 1.5.0
+ignition:
+  config:
+    merge:
+      - local: "conf/worker.ign"
+storage:
+  files:
+    - path: /etc/hostname
+      overwrite: true
+      contents:
+        source: data:,gpu-worker-0
+      mode: 420
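
The `mode: 420` value in the Butane file above is a decimal integer; it corresponds to the familiar octal permission `0644`. A small sketch for converting between the two forms, with hypothetical helper names:

```shell
# Hypothetical helpers for reading file modes in Butane/Ignition configs.

# Decimal (as written in the config) to octal (as shown by ls/chmod).
mode_to_octal() {
  printf '%o\n' "$1"
}

# Octal digits (for example 644) to the decimal value used in the config.
# The leading 0 makes printf parse the argument as an octal constant.
mode_to_decimal() {
  printf '%d\n' "0$1"
}
```

For example, `mode_to_octal 420` prints `644`, and `mode_to_decimal 644` prints `420`.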

content/cluster-installation/stackit/index.md

Lines changed: 76 additions & 93 deletions
@@ -466,7 +466,9 @@ oc patch ingresscontroller/default -n openshift-ingress-operator --type=merge \
 
 Until the secret is populated, the router keeps serving the installer default; after issuance, HAProxy reload picks up the Let’s Encrypt chain.
 
-## Day-2: Add GPU Node
+## Day-2: Add GPU node
+
+Extra worker that merges the same **`conf/worker.ign`** as the other workers (hostname-only delta in Butane). Pick a GPU **flavor** and **AZ** that exist in your project.
 
 === "Download"
 
@@ -476,86 +478,83 @@ Until the secret is populated, the router keeps serving the installer default; a
 
 === "ign-gpu-worker-0.rcc"
 
-    ```json
+    ```yaml
     --8<-- "content/cluster-installation/stackit/ign-gpu-worker-0.rcc"
     ```
 
 ```shell
-stackit server create \
-  --assume-yes \
-  --availability-zone eu01-1 \
-  --machine-type n2.14d.g1 \
-  --name "cluster-a-gpu-worker-0" \
-  --boot-volume-source-type image \
-  --boot-volume-source-id 6055861d-6641-4a45-b00e-fcfb250d65e6 \
-  --boot-volume-delete-on-termination \
-  --boot-volume-size 120 \
-  --network-id 459afb3e-54fa-45d4-a972-ae39ec370761 \
-  --user-data @<(butane -d . -r "ign-gpu-worker-0.rcc")
+stackit server create \
+  --assume-yes \
+  --availability-zone eu01-1 \
+  --machine-type n2.14d.g1 \
+  --name cluster-a-gpu-worker-0 \
+  --boot-volume-source-type image \
+  --boot-volume-source-id <RHCOS_IMAGE_ID> \
+  --boot-volume-delete-on-termination \
+  --boot-volume-size 120 \
+  --network-id <NETWORK_ID> \
+  --user-data @<(butane -d . -r ign-gpu-worker-0.rcc)
 ```
 
-Wait and approve CSR
+When the node registers, approve any lingering **Pending** CSRs:
 
 ```shell
 export KUBECONFIG="$PWD/conf/auth/kubeconfig"
 oc get csr | awk '/Pending/{print $1}' | xargs oc adm certificate approve
 ```
 
-Install Nvidia GPU Operator: [NVIDIA GPU Operator on Red Hat OpenShift Container Platform](https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/index.html)
+Install the **NVIDIA GPU Operator** (catalog channel + `ClusterPolicy` per your OCP version): [NVIDIA GPU Operator on OpenShift](https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/index.html).
+
+Sanity-check after the operator has prepared the node runtime:
 
 ```shell
-oc create -f - <<EOF
+oc new-project gpu-test
+oc apply -f - <<'EOF'
 apiVersion: v1
 kind: Pod
 metadata:
   name: nvidia-smi
+  namespace: gpu-test
 spec:
+  restartPolicy: Never
   containers:
-  - image: registry.redhat.io/rhai/base-image-cuda-13.0-rhel9:3.3.1-1775076057
-    name: nvidia-smi
-    command: [ nvidia-smi ]
-    resources:
-      limits:
-        nvidia.com/gpu: 1
-      requests:
-        nvidia.com/gpu: 1
+    - name: nvidia-smi
+      image: registry.redhat.io/rhai/base-image-cuda-13.0-rhel9:3.3.1-1775076057
+      command: [nvidia-smi]
+      resources:
+        limits:
+          nvidia.com/gpu: "1"
+        requests:
+          nvidia.com/gpu: "1"
 EOF
+oc wait -n gpu-test --for=condition=Ready pod/nvidia-smi --timeout=120s
+oc logs -n gpu-test nvidia-smi
+```
+
+Example output (hardware-dependent):
 
-$ oc logs nvidia-smi
+```text
 Tue May 5 13:27:54 2026
 +-----------------------------------------------------------------------------------------+
 | NVIDIA-SMI 580.126.20 Driver Version: 580.126.20 CUDA Version: 13.0 |
 +-----------------------------------------+------------------------+----------------------+
-| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
-| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
-| | | MIG M. |
-|=========================================+========================+======================|
 | 0 NVIDIA L40S On | 00000000:05:00.0 Off | 0 |
 | N/A 29C P8 36W / 350W | 0MiB / 46068MiB | 0% Default |
-| | | N/A |
-+-----------------------------------------+------------------------+----------------------+
-
-+-----------------------------------------------------------------------------------------+
-| Processes: |
-| GPU GI CI PID Type Process name GPU Memory |
-| ID ID Usage |
-|=========================================================================================|
-| No running processes found |
 +-----------------------------------------------------------------------------------------+
 ```
 
-## Day-2: Deploy Cloud Controller Manager (CCM)
+Pin the CUDA **image digest** in production; the tag above is illustrative.
 
-[Upstream documentation](https://github.com/stackitcloud/cloud-provider-stackit/blob/main/docs/deployment.md)
+## Day-2: Cloud Controller Manager (CCM)
 
-* Create an Service Account at STACKIT called `ccm-and-csi`
-* Create Service account keys and download the json file
-* Assign editor role for the entire project.
+**Prereq** — STACKIT service account (reuse for CSI below):
 
-Deployment steps:
+* Create a service account (e.g. `ccm-and-csi`), download the key JSON.
+* Grant **Editor** on the project that owns `projectId` / `networkId` in `cloud.yaml`.
 
 ```shell
-oc create secret generic -n kube-system stackit-cloud-secret --from-file=sa_key.json=<service account json>
+oc create secret generic -n kube-system stackit-cloud-secret \
+  --from-file=sa_key.json=./stackit-ccm-sa.json
 ```
 
 === "Download"
@@ -566,108 +565,92 @@ oc create secret generic -n kube-system stackit-cloud-secret --from-file=sa_key.
 
 === "cloud.yaml"
 
-    ```json
+    ```yaml
     --8<-- "content/cluster-installation/stackit/cloud.yaml"
     ```
 
-Adjust cloud.yaml and put into configmap:
-
 ```shell
-oc create configmap -n kube-system stackit-cloud-config --from-file=cloud.yaml
+oc create configmap -n kube-system stackit-cloud-config \
+  --from-file=cloud.yaml=./cloud.yaml
 ```
 
-Deploy cloud controller manager:
+Upstream RBAC + `Service` (pin a **commit** if you do not want `main` drifting):
 
 ```shell
-oc apply -f https://raw.githubusercontent.com/stackitcloud/cloud-provider-stackit/refs/heads/main/deploy/cloud-controller-manager/rbac.yaml
-oc apply -f https://github.com/stackitcloud/cloud-provider-stackit/raw/refs/heads/main/deploy/cloud-controller-manager/service.yaml
+CCM_BASE=https://raw.githubusercontent.com/stackitcloud/cloud-provider-stackit/main/deploy/cloud-controller-manager
+oc apply -f "${CCM_BASE}/rbac.yaml"
+oc apply -f "${CCM_BASE}/service.yaml"
 ```
 
 === "Apply"
 
     ```shell
-    oc apply -f{{ page.canonical_url }}ccm-and-csi-deployment.yaml
+    oc apply -f {{ page.canonical_url }}ccm-and-csi-deployment.yaml
     ```
 
 === "ccm-and-csi-deployment.yaml"
 
-    ```json
+    ```yaml
     --8<-- "content/cluster-installation/stackit/ccm-and-csi-deployment.yaml"
     ```
 
-???+ failure
+???+ warning "CCM panic observed"
+
+    With **`cloud-controller-manager:v1.36.0`** this deployment **segfaulted** right after startup (nil deref). Root cause not chased here. **CSI dynamic provisioning still worked** without a healthy CCM — storage does not depend on cloud `Service` LBs — but confirm for your image / region before assuming that split is always safe.
 
     ```
     starting Controller
     I0507 12:55:45.723462 1 serving.go:411] Generated self-signed cert in-memory
-    W0507 12:55:45.723531 1 client_config.go:683] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
+    W0507 12:55:45.723531 1 client_config.go:683] Neither --kubeconfig nor --master was
+    specified. Using the inClusterConfig. This might not work.
     panic: runtime error: invalid memory address or nil pointer dereference
     [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x4fe685]
     ```
 
     **No further investigation, because it looks CSI works without CCM**
 
-## Day-2: Container Storage Interface (CSI) Driver
+## Day-2: STACKIT CSI driver
 
-[Upstream documentation](https://github.com/stackitcloud/cloud-provider-stackit/blob/main/docs/csi-driver.md)
+[Upstream CSI doc](https://github.com/stackitcloud/cloud-provider-stackit/blob/main/docs/csi-driver.md). Reuse the **same** STACKIT SA and `cloud.yaml` as in the CCM section.
 
-* Create an Service Account at STACKIT called `ccm-and-csi` (already done in `[Day-2: Deploy Cloud Controller Manager (CCM)` )
-* Create Service account keys and download the json file (already done in `[Day-2: Deploy Cloud Controller Manager (CCM)` )
-* Assign editor role for the entire project. (already done in `[Day-2: Deploy Cloud Controller Manager (CCM)` )
-
-Let's deploy the csi-driver into namespace/project `stackit-csi-driver`
+The overlay sets **`namespace: stackit-csi-driver`** on all resources from the remote base. Put **`stackit-cloud-secret`** and **`stackit-cloud-config` in that namespace** — not `kube-system` — or the controller never mounts credentials.
 
 ```shell
 oc new-project stackit-csi-driver
+oc create secret generic -n stackit-csi-driver stackit-cloud-secret \
+  --from-file=sa_key.json=./stackit-ccm-sa.json
+oc create configmap -n stackit-csi-driver stackit-cloud-config \
+  --from-file=cloud.yaml=./cloud.yaml
 ```
 
-```shell
-oc create secret generic -n kube-system stackit-cloud-secret --from-file=sa_key.json=<service account json>
-```
-
-=== "Download"
-
-    ```shell
-    curl -L -O {{ page.canonical_url }}cloud.yaml
-    ```
-
-=== "cloud.yaml"
-
-    ```json
-    --8<-- "content/cluster-installation/stackit/cloud.yaml"
-    ```
-
-Adjust cloud.yaml and put into configmap:
-
-```shell
-oc create configmap stackit-cloud-config --from-file=cloud.yaml
-```
-
-Allow the CSI node componentes to run privileged, looks like it only need hostpath and hostnetwork. It's recommended to pick and/or create a more precise security context constraint.
+Node plugin needs host paths / devices; grant **`privileged`** to the node SA (narrow with a custom SCC later if you need to):
 
 ```shell
-oc adm policy add-scc-to-user privileged -z csi-stackit-node-sa
+oc -n stackit-csi-driver adm policy add-scc-to-user privileged -z csi-stackit-node-sa
 ```
 
-Download and apply `kustomization.yaml` to deploy csi driver into specific namespace and propper image url
+Working directory for `kustomize build` / `oc apply -k`: remote base is **HTTPS** (no `git@` SSH key). Image tag override stays in the local overlay.
 
 === "Download"
 
     ```shell
-    curl -L -O {{ page.canonical_url }}kustomization.yaml
+    mkdir -p stackit-csi && cd stackit-csi
+    curl -L -O {{ page.canonical_url }}csi/kustomization.yaml
     ```
 
 === "kustomization.yaml"
 
-    ```json
-    --8<-- "content/cluster-installation/stackit/kustomization.yaml"
+    ```yaml
+    --8<-- "content/cluster-installation/stackit/csi/kustomization.yaml"
     ```
 
 ```shell
 oc apply -k .
 ```
 
-### Let's try to storage
+### Validate provisioning
+
+`StorageClass` **`premium-perf4-stackit`** ships with the upstream CSI bundle (name matches [their examples](https://github.com/stackitcloud/cloud-provider-stackit/blob/main/docs/csi-driver.md)).
 
 ```shell
 oc new-project storage-test
@@ -681,6 +664,6 @@ oc new-project storage-test
 
 === "lets-try-storage.yaml"
 
-    ```json
+    ```yaml
     --8<-- "content/cluster-installation/stackit/lets-try-storage.yaml"
     ```
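
Two steps in the `index.md` diff above may deserve a more defensive form. With GNU `xargs`, the CSR one-liner still runs `oc adm certificate approve` once with no arguments when nothing is pending. Separately, the `nvidia-smi` pod runs to completion, so waiting on `condition=Ready` may time out if the container exits before readiness is observed. The sketch below shows one possible shape. The function names are hypothetical, and `--for=jsonpath=...` is assumed to be supported by your `oc` client version.

```shell
# Print the names of CSRs whose last column (CONDITION) is exactly "Pending".
# Expects rows in the shape printed by `oc get csr --no-headers` on stdin.
pending_csrs() {
  awk '$NF == "Pending" { print $1 }'
}

# Approve all pending CSRs; do nothing (and succeed) when there are none.
approve_pending_csrs() {
  names=$(oc get csr --no-headers | pending_csrs)
  [ -n "$names" ] || return 0
  # Word splitting is intended here: one argument per CSR name.
  # shellcheck disable=SC2086
  oc adm certificate approve $names
}

# Wait until a run-to-completion pod reports phase Succeeded.
# Arguments: namespace, pod name, optional timeout (default 120s).
wait_pod_succeeded() {
  oc wait -n "$1" --for=jsonpath='{.status.phase}'=Succeeded \
    "pod/$2" --timeout="${3:-120s}"
}
```

A possible use for the GPU check is `wait_pod_succeeded gpu-test nvidia-smi && oc logs -n gpu-test nvidia-smi`.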
Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
+kind: PersistentVolumeClaim
+apiVersion: v1
+metadata:
+  name: pvc
+spec:
+  accessModes:
+    - ReadWriteOnce
+  resources:
+    requests:
+      storage: 1Gi
+  storageClassName: premium-perf4-stackit
+---
+kind: Deployment
+apiVersion: apps/v1
+metadata:
+  name: ubi9
+spec:
+  replicas: 1
+  selector:
+    matchLabels:
+      app: ubi9
+  template:
+    metadata:
+      creationTimestamp: null
+      labels:
+        app: ubi9
+    spec:
+      volumes:
+        - name: pvc
+          persistentVolumeClaim:
+            claimName: pvc
+      containers:
+        - name: ubi
+          image: 'registry.access.redhat.com/ubi9/ubi-micro:latest'
+          volumeMounts:
+            - name: pvc
+              mountPath: /pvc
+          command:
+            - /bin/sh
+            - '-c'
+            - |
+              sleep infinity
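
After applying the manifest above, one way to confirm that dynamic provisioning worked is to check that the claim reaches phase `Bound`. This is a minimal sketch with hypothetical helper names. It assumes the manifest was applied in the `storage-test` project created in the `index.md` steps, and that the claim is named `pvc` as in the file.

```shell
# Print the phase of a PersistentVolumeClaim (for example Pending or Bound).
# Arguments: namespace, claim name.
pvc_phase() {
  oc get pvc "$2" -n "$1" -o jsonpath='{.status.phase}'
}

# Succeed only when the claim is Bound.
pvc_is_bound() {
  [ "$(pvc_phase "$1" "$2")" = "Bound" ]
}
```

For example: `pvc_is_bound storage-test pvc && echo "volume provisioned"`. If the storage class delays binding until a pod consumes the claim, the phase may stay `Pending` until the `ubi9` pod is scheduled.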
