Commit 30b5a42

Add more Day-2
1 parent 28c6dc6 commit 30b5a42

5 files changed

Lines changed: 295 additions & 0 deletions

Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stackit-cloud-controller-manager
  namespace: kube-system
spec:
  replicas: 2
  selector:
    matchLabels:
      app: stackit-cloud-controller-manager
  template:
    metadata:
      labels:
        app: stackit-cloud-controller-manager
    spec:
      serviceAccountName: stackit-cloud-controller-manager
      containers:
        - name: stackit-cloud-controller-manager
          image: ghcr.io/stackitcloud/cloud-provider-stackit/cloud-controller-manager:v1.36.0
          args:
            # CCM flags
            - --cloud-provider=stackit
            - --webhook-secure-port=0
            - --concurrent-service-syncs=3
            - --controllers=service-lb-controller
            - --authorization-always-allow-paths=/metrics
            - --leader-elect=true
            - --leader-elect-resource-name=stackit-cloud-controller-manager
          env:
            - name: STACKIT_SERVICE_ACCOUNT_KEY_PATH
              value: /etc/serviceaccount/sa_key.json
          ports:
            - containerPort: 10258
              hostPort: 10258
              name: https
              protocol: TCP
            - containerPort: 9090
              hostPort: 9090
              name: metrics
              protocol: TCP
          resources:
            limits:
              cpu: "0.5"
              memory: 500Mi
            requests:
              cpu: "0.1"
              memory: 100Mi
          volumeMounts:
            - mountPath: /etc/config
              name: cloud-config
            - mountPath: /etc/serviceaccount
              name: cloud-secret
      volumes:
        - name: cloud-config
          configMap:
            name: stackit-cloud-config
        - name: cloud-secret
          secret:
            secretName: stackit-cloud-secret
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
# cloud.yaml
global:
  projectId: <project id>
  region: eu01
loadBalancer:
  networkId: <network id>
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: stackit-csi-driver
resources:
  - git@github.com:stackitcloud/cloud-provider-stackit/deploy/csi-plugin/

images:
  - name: ghcr.io/stackitcloud/cloud-provider-stackit/stackit-csi-plugin:release-v1.34
    newName: ghcr.io/stackitcloud/cloud-provider-stackit/stackit-csi-plugin
    newTag: v1.36.0

content/cluster-installation/stackit/index.md

Lines changed: 219 additions & 0 deletions
@@ -465,3 +465,222 @@ oc patch ingresscontroller/default -n openshift-ingress-operator --type=merge \
```

Until the secret is populated, the router keeps serving the installer default; after issuance, an HAProxy reload picks up the Let’s Encrypt chain.

## Day-2: Add GPU Node

=== "Download"

    ```shell
    curl -L -O {{ page.canonical_url }}ign-gpu-worker-0.rcc
    ```

=== "ign-gpu-worker-0.rcc"

    ```json
    --8<-- "content/cluster-installation/stackit/ign-gpu-worker-0.rcc"
    ```

```shell
stackit server create \
    --assume-yes \
    --availability-zone eu01-1 \
    --machine-type n2.14d.g1 \
    --name "cluster-a-gpu-worker-0" \
    --boot-volume-source-type image \
    --boot-volume-source-id 6055861d-6641-4a45-b00e-fcfb250d65e6 \
    --boot-volume-delete-on-termination \
    --boot-volume-size 120 \
    --network-id 459afb3e-54fa-45d4-a972-ae39ec370761 \
    --user-data @<(butane -d . -r "ign-gpu-worker-0.rcc")
```

Wait for the new node's certificate signing requests (CSRs) to appear, then approve them:

```shell
export KUBECONFIG="$PWD/conf/auth/kubeconfig"
oc get csr | awk '/Pending/{print $1}' | xargs oc adm certificate approve
```

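A new node usually files more than one CSR in sequence (a client request first, then a serving request), so the approval step often has to be repeated until the node reports `Ready`. The following is a minimal sketch of a stricter filter, assuming the condition is the last column of `oc get csr --no-headers` output; the function name `pending_csrs` is hypothetical and not part of `oc`.

```shell
# Hypothetical helper: print the names of CSRs whose last column is exactly "Pending".
# Expects `oc get csr --no-headers` output on stdin.
pending_csrs() {
  awk '$NF == "Pending" { print $1 }'
}

# Usage against a live cluster (not executed here; `xargs -r` is a GNU extension
# that skips the command when there is nothing to approve):
#   oc get csr --no-headers | pending_csrs | xargs -r oc adm certificate approve
```

Matching on the last column instead of the whole line avoids approving a request only because the word "Pending" appears somewhere else in the row.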
Install the NVIDIA GPU Operator: [NVIDIA GPU Operator on Red Hat OpenShift Container Platform](https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/index.html)

```shell
oc create -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
spec:
  containers:
    - image: registry.redhat.io/rhai/base-image-cuda-13.0-rhel9:3.3.1-1775076057
      name: nvidia-smi
      command: [ nvidia-smi ]
      resources:
        limits:
          nvidia.com/gpu: 1
        requests:
          nvidia.com/gpu: 1
EOF

$ oc logs nvidia-smi
Tue May 5 13:27:54 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.20             Driver Version: 580.126.20     CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S                    On  |   00000000:05:00.0 Off |                    0 |
| N/A   29C    P8             36W /  350W |       0MiB /  46068MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
```

## Day-2: Deploy Cloud Controller Manager (CCM)

[Upstream documentation](https://github.com/stackitcloud/cloud-provider-stackit/blob/main/docs/deployment.md)

* Create a service account at STACKIT called `ccm-and-csi`.
* Create service account keys and download the JSON file.
* Assign the service account the editor role for the entire project.

Deployment steps:

```shell
oc create secret generic -n kube-system stackit-cloud-secret --from-file=sa_key.json=<service account json>
```

=== "Download"

    ```shell
    curl -L -O {{ page.canonical_url }}cloud.yaml
    ```

=== "cloud.yaml"

    ```yaml
    --8<-- "content/cluster-installation/stackit/cloud.yaml"
    ```

Adjust `cloud.yaml` and store it in a ConfigMap:

```shell
oc create configmap -n kube-system stackit-cloud-config --from-file=cloud.yaml
```

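Instead of editing the placeholders by hand, `cloud.yaml` can be generated from the two IDs. This is a minimal sketch that only reproduces the three keys shown in the template (`global.projectId`, `global.region`, `loadBalancer.networkId`); the function name `render_cloud_yaml` and its argument order are hypothetical.

```shell
# Hypothetical helper: print a cloud.yaml for the given project ID and network ID.
# The optional third argument is the region; it defaults to eu01 as in the template.
render_cloud_yaml() {
  project_id="$1"
  network_id="$2"
  region="${3:-eu01}"
  cat <<EOF
# cloud.yaml
global:
  projectId: ${project_id}
  region: ${region}
loadBalancer:
  networkId: ${network_id}
EOF
}

# Usage (the arguments are placeholders, not real IDs):
#   render_cloud_yaml "<project id>" "<network id>" > cloud.yaml
```

Generating the file keeps the ConfigMap reproducible when the same configuration is needed again for the CSI driver.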
Deploy the cloud controller manager:

```shell
oc apply -f https://raw.githubusercontent.com/stackitcloud/cloud-provider-stackit/refs/heads/main/deploy/cloud-controller-manager/rbac.yaml
oc apply -f https://github.com/stackitcloud/cloud-provider-stackit/raw/refs/heads/main/deploy/cloud-controller-manager/service.yaml
```

=== "Apply"

    ```shell
    oc apply -f {{ page.canonical_url }}ccm-and-csi-deployment.yaml
    ```

=== "ccm-and-csi-deployment.yaml"

    ```yaml
    --8<-- "content/cluster-installation/stackit/ccm-and-csi-deployment.yaml"
    ```

???+ failure

    ```
    starting Controller
    I0507 12:55:45.723462 1 serving.go:411] Generated self-signed cert in-memory
    W0507 12:55:45.723531 1 client_config.go:683] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
    panic: runtime error: invalid memory address or nil pointer dereference
    [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x4fe685]
    ```

**Not investigated further, because the CSI driver appears to work without the CCM.**

## Day-2: Container Storage Interface (CSI) Driver

[Upstream documentation](https://github.com/stackitcloud/cloud-provider-stackit/blob/main/docs/csi-driver.md)

* Create a service account at STACKIT called `ccm-and-csi` (already done in "Day-2: Deploy Cloud Controller Manager (CCM)").
* Create service account keys and download the JSON file (already done in "Day-2: Deploy Cloud Controller Manager (CCM)").
* Assign the service account the editor role for the entire project (already done in "Day-2: Deploy Cloud Controller Manager (CCM)").

Deploy the CSI driver into the namespace/project `stackit-csi-driver`:

```shell
oc new-project stackit-csi-driver
```

Create the secret in the driver's namespace, because pods can only mount secrets from their own namespace:

```shell
oc create secret generic -n stackit-csi-driver stackit-cloud-secret --from-file=sa_key.json=<service account json>
```

=== "Download"

    ```shell
    curl -L -O {{ page.canonical_url }}cloud.yaml
    ```

=== "cloud.yaml"

    ```yaml
    --8<-- "content/cluster-installation/stackit/cloud.yaml"
    ```

Adjust `cloud.yaml` and store it in a ConfigMap:

```shell
oc create configmap stackit-cloud-config --from-file=cloud.yaml
```

Allow the CSI node components to run privileged. They appear to need only hostPath volumes and host networking, so it is recommended to pick or create a more restrictive security context constraint (SCC).

```shell
oc adm policy add-scc-to-user privileged -z csi-stackit-node-sa
```

Download `kustomization.yaml` and apply it to deploy the CSI driver into the chosen namespace with the proper image URL:

=== "Download"

    ```shell
    curl -L -O {{ page.canonical_url }}kustomization.yaml
    ```

=== "kustomization.yaml"

    ```yaml
    --8<-- "content/cluster-installation/stackit/kustomization.yaml"
    ```

```shell
oc apply -k .
```

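To confirm that the `images` override in `kustomization.yaml` took effect, the rendered manifests can be inspected with `oc kustomize .`, which prints them without applying anything. The following is a minimal sketch that lists the distinct image references in such output; the function name `rendered_images` is hypothetical, and the parsing assumes each `image:` key and its value sit on one line.

```shell
# Hypothetical helper: list the distinct container images found in rendered manifests.
# Expects multi-document YAML on stdin, e.g. the output of `oc kustomize .`.
rendered_images() {
  awk '
    $1 == "image:"              { print $2 }
    $1 == "-" && $2 == "image:" { print $3 }
  ' | tr -d "\"'" | sort -u
}

# Usage (not executed here):
#   oc kustomize . | rendered_images
```

If the override matched, the CSI plugin image should appear with the tag set in `newTag` rather than the upstream `release-v1.34` tag.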
### Let's try the storage

```shell
oc new-project storage-test
```

=== "Apply"

    ```shell
    oc apply -f {{ page.canonical_url }}lets-try-storage.yaml
    ```

=== "lets-try-storage.yaml"

    ```yaml
    --8<-- "content/cluster-installation/stackit/lets-try-storage.yaml"
    ```
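
Once the test manifest is applied, a quick check is whether every PersistentVolumeClaim in the project reaches `Bound`. The contents of `lets-try-storage.yaml` are not shown here, so this sketch makes no assumption about resource names; it only filters `oc get pvc --no-headers` output, where the name is the first column and the status the second. The function name `unbound_pvcs` is hypothetical.

```shell
# Hypothetical helper: print "name status" for every PVC that is not Bound.
# Expects `oc get pvc --no-headers` output on stdin.
unbound_pvcs() {
  awk '$2 != "Bound" { print $1, $2 }'
}

# Usage (not executed here):
#   oc get pvc -n storage-test --no-headers | unbound_pvcs
```

Empty output means all claims are bound. A claim may legitimately stay `Pending` until a pod uses it if its storage class delays binding until first use.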