doc/gaudi/USAGE.md
43 additions & 36 deletions
@@ -1,6 +1,6 @@
 ## Requirements
 
-- Kubernetes v1.32+, and optionally [some cluster parameters](../../hack/clusterconfig.yaml) for advanced features, see [Cluster Setup](../CLUSTER_SETUP.md)
+- Kubernetes v1.34+, and optionally [some cluster parameters](../../hack/clusterconfig.yaml) for advanced features, see [Cluster Setup](../CLUSTER_SETUP.md)
 - Container runtime needs to support CDI:
   - CRI-O v1.23.0 or newer
   - Containerd v1.7 or newer with CDI enabled
@@ -40,19 +40,22 @@ To restrict the deployment to Gaudi-enabled nodes, follow these steps:
 Follow [Node Feature Discovery](https://github.com/kubernetes-sigs/node-feature-discovery) documentation to install and configure NFD in your cluster.
-After NFD is installed and running, make sure the target node is labeled with:
+
+After NFD is installed and running, the nodes with Gaudi accelerators will be labeled with:
 ```bash
 intel.feature.node.kubernetes.io/gaudi: "true"
 ```
 
+The Gaudi DRA driver will be deployed to nodes that have such labels.
+
 When deploying custom-built resource driver image, change `image:` lines in
 [resource-driver](../../deployments/gaudi/base/resource-driver.yaml) to match its location.
 
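The label requirement in the hunk above can be checked from the command line. The following is a minimal sketch, assuming `kubectl` is already configured for the target cluster; the function name `gaudi_nodes` is hypothetical and is not part of the driver or of NFD.

```bash
# List nodes carrying the NFD label that the Gaudi DRA driver deployment relies on.
# gaudi_nodes is a hypothetical helper name; extra arguments are passed to kubectl.
gaudi_nodes() {
    kubectl get nodes -l 'intel.feature.node.kubernetes.io/gaudi=true' "$@"
}
```

For example, `gaudi_nodes -o name` should print one `node/<name>` line per labeled node. Empty output suggests that NFD has not labeled any node yet.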
@@ -79,11 +82,11 @@ Example contents of the ResourceSlice object:
 <details>
 
 ```bash
-$ kubectl get resourceSlices/rpl-s-gaudi.intel.com-x8m4h -o yaml
+$ kubectl get resourceslices.resource.k8s.io rpl-s-gaudi.intel.com-x8m4h -o yaml
 apiVersion: resource.k8s.io/v1
 kind: ResourceSlice
 metadata:
-  creationTimestamp: "2024-09-23T13:03:21Z"
+  creationTimestamp: "2026-04-15T07:17:43Z"
   generateName: rpl-s-gaudi.intel.com-
   generation: 1
   name: rpl-s-gaudi.intel.com-x8m4h
@@ -92,38 +95,42 @@ metadata:
     controller: true
     kind: Node
     name: rpl-s
-    uid: 0894e000-e7a3-49ad-8749-04b27be61c03
-  resourceVersion: "2047239"
-  uid: 92fb64c7-219e-4cef-9be9-5233b589d7bd
+    uid: 3a243a6b-e6db-4613-94f2-169f938c87ae
+  resourceVersion: "6266269"
+  uid: d093f601-9ad5-4234-91d8-410733e32784
 spec:
   devices:
-  - basic:
-      attributes:
-        healthy:
-          bool: true
-        model:
-          string: Gaudi2
-        pciRoot:
-          string: "40"
-        serial:
-          string: AN01234567
-    name: 0000-43-00-0-0x1020
-  - basic:
-      attributes:
-        healthy:
-          bool: true
-        model:
-          string: Gaudi2
-        pciRoot:
-          string: "40"
-        serial:
-          string: AN12345678
+  - attributes:
+      healthy:
+        bool: true
+      model:
+        string: Gaudi2
+      pciRoot:
+        string: "01"
+      resource.kubernetes.io/pcieRoot:
+        string: pci0000:01
+      serial:
+        string: ""
+    name: 0000-a0-00-0-0x1020
+  - attributes:
+      healthy:
+        bool: true
+      model:
+        string: Gaudi2
+      pciRoot:
+        string: "02"
+      resource.kubernetes.io/pcieRoot:
+        string: pci0000:02
+      serial:
+        string: ""
+    name: 0000-b0-00-0-0x1020
   driver: gaudi.intel.com
   nodeName: rpl-s
   pool:
-    generation: 0
+    generation: 1
     name: rpl-s
     resourceSliceCount: 1
+
 ```
 
 </details>
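To read only the per-device health from a ResourceSlice like the one above, a JSONPath query can replace the full YAML dump. This is a sketch under the assumption that the slice uses the `resource.k8s.io/v1` layout shown in the added lines, where `attributes` sits directly under each device; the function name `gaudi_slice_health` is hypothetical.

```bash
# Print "<device name><TAB><healthy>" for each device in one ResourceSlice.
# gaudi_slice_health is a hypothetical helper; $1 is the ResourceSlice name.
gaudi_slice_health() {
    kubectl get resourceslices.resource.k8s.io "$1" \
        -o jsonpath='{range .spec.devices[*]}{.name}{"\t"}{.attributes.healthy.bool}{"\n"}{end}'
}
```

It would be called with the slice name from the example, as in `gaudi_slice_health rpl-s-gaudi.intel.com-x8m4h`.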
@@ -320,8 +327,8 @@ Unlike with normal Gaudi ResourceClaims:
 
 ## Health monitoring support
 
-Starting from v0.7.0 Gaudi DRA driver supports health monitoring with `-m` command-line parameter (enabled
-in default deployment configuration) through HLML library. When Gaudi accelerator becomes unhealthy, the
+Starting from v0.7.0 Gaudi DRA driver supports health monitoring with `-m` command-line parameter (enabled
+in default deployment configuration) through HLML library. When Gaudi accelerator becomes unhealthy, the
 DeviceTaintRule is created (to evict current workloads and prevent this device from being allocated), and
 respective device's `healthy` field in ResourceSlice is changed to false.
 
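The DeviceTaintRules that health monitoring creates can be inspected with `kubectl`. A minimal sketch, assuming the cluster serves the DeviceTaintRule API; the function name `gaudi_taint_rules` is hypothetical.

```bash
# Show the DeviceTaintRules currently present in the cluster.
# gaudi_taint_rules is a hypothetical helper name; extra arguments go to kubectl.
# The resource is only available when the DeviceTaintRule API is enabled.
gaudi_taint_rules() {
    kubectl get devicetaintrules.resource.k8s.io "$@"
}
```

Running `gaudi_taint_rules -o yaml` shows the full rules, including which device each one targets.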
@@ -342,9 +349,9 @@ This approach, however, does not allow influencing the Pods that are / were usin
 
 In K8s v1.33 [DeviceTaintRule](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/#device-taints-and-tolerations) concept was introduced that allows scheduler to handle ResourceSlice devices similarly to how K8s Node Taints and Tolerations allow.
 
-Starting from v0.7.0 Gaudi DRA driver leverages DeviceTaintRules if Gaudi accelerator health degrades.
-DeviceTaintRule created as a result of degraded health has "NoSchedule" effect, which implies "NoExecute"
-and results in Pod eviction for devices with degraded health. Workloads that need to access tainted devices
+Starting from v0.7.0 Gaudi DRA driver leverages DeviceTaintRules if Gaudi accelerator health degrades.
+DeviceTaintRule created as a result of degraded health has "NoExecute" effect, which implies "NoSchedule"
+and results in Pod eviction for devices with degraded health. Workloads that need to access tainted devices
 need to have [taint toleration in ResourceClaim](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#device-taints-and-tolerations).
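The taint toleration mentioned above could look roughly like the ResourceClaim printed by the sketch below. Several names are assumptions that the source does not confirm: the claim name `gaudi-tolerant`, the request name `gaudi`, the DeviceClass name `gaudi.intel.com`, and the helper name `gaudi_tolerant_claim`. The toleration omits both key and effect, so it matches every device taint, which is broader than most workloads should need.

```bash
# Print a ResourceClaim manifest whose single request tolerates device taints.
# Assumptions (not confirmed by the source): claim name "gaudi-tolerant",
# request name "gaudi", and a DeviceClass named "gaudi.intel.com".
gaudi_tolerant_claim() {
    cat <<'EOF'
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: gaudi-tolerant
spec:
  devices:
    requests:
    - name: gaudi
      exactly:
        deviceClassName: gaudi.intel.com
        tolerations:
        # No key and no effect: tolerate every device taint.
        - operator: Exists
EOF
}
```

The manifest can be reviewed first and then applied with `gaudi_tolerant_claim | kubectl apply -f -`.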