Commit 2e6b83b

committed
Update docs
Signed-off-by: Alexey Fomenko <alexey.fomenko@intel.com>
1 parent 1eacf77 commit 2e6b83b

14 files changed

Lines changed: 172 additions & 272 deletions

doc/cdi-spec-generator/BUILD.md

Lines changed: 5 additions & 6 deletions
@@ -1,24 +1,23 @@
11
# How to build Intel CDI Spec Generator
2-
A pre-compiled binary is already available for download, eliminating the need for manual building. See documentation [README.md](README.md#Releases)
32

43
## Prerequisites
5-
- Go 1.22
4+
- Go 1.25
65

76
## Building
87
1. Clone the repository
98
```bash
109
git clone https://github.com/intel/intel-resource-drivers-for-kubernetes.git
11-
cd intel-resource-drivers-for-kubernetes/cmd/cdi-specs-generator
10+
cd intel-resource-drivers-for-kubernetes
1211
```
1312

1413
2. Build the executable
1514
```bash
16-
go build -o intel-cdi-specs-generator main.go
15+
make bin/intel-cdi-specs-generator
1716
```
18-
This command will generate an executable named intel-cdi-specs-generator in the current directory.
17+
This command will generate an executable named intel-cdi-specs-generator in the `bin` directory.
1918

2019
## Verification
2120
To verify that the build was successful, you can check the version of the tool by running:
2221
```bash
23-
intel-cdi-specs-generator --version
22+
bin/intel-cdi-specs-generator --version
2423
```
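
As a quick pre-build check for the Go 1.25 prerequisite listed above, the following sketch compares the installed Go version against the minimum. The helper name `version_at_least` is hypothetical and not part of the repository. The comparison relies on `sort -V` (version sort), which is assumed to be available; GNU coreutils and recent BSD `sort` provide it.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository):
# succeeds when version $1 is greater than or equal to version $2.
version_at_least() {
    # sort -V orders version strings, so the smaller one comes first.
    lowest="$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)"
    [ "$lowest" = "$2" ]
}

# Only inspect the toolchain when Go is actually installed.
if command -v go >/dev/null 2>&1; then
    # "go version" prints e.g. "go version go1.25.1 linux/amd64".
    have="$(go version | awk '{print $3}' | sed 's/^go//')"
    if version_at_least "$have" "1.25"; then
        echo "Go $have satisfies the 1.25 minimum"
    else
        echo "Go $have is older than the required 1.25" >&2
    fi
else
    echo "go not found in PATH" >&2
fi
```

Running this before `make bin/intel-cdi-specs-generator` may save a confusing build failure on hosts with an older toolchain.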

doc/cdi-spec-generator/README.md

Lines changed: 0 additions & 6 deletions
@@ -29,11 +29,5 @@ intel-cdi-specs-generator gpu
2929
```
3030
This command will detect supported GPUs on the system, and ensure that there is a CDI device record for each of them.
3131

32-
3332
## Building
3433
- [How to build CDI Spec Generator](BUILD.md)
35-
36-
## Releases
37-
The binary is available for download in the releases section:
38-
- [Intel Resource Drivers for Kubernetes releases](https://github.com/intel/intel-resource-drivers-for-kubernetes/releases)
39-
- [CDI Spec Generator v0.1.0](https://github.com/intel/intel-resource-drivers-for-kubernetes/releases/tag/specs-generator-v0.1.0)

doc/device-faker/README.md

Lines changed: 42 additions & 27 deletions
@@ -85,6 +85,7 @@ $ cat /tmp/gpu-template-3524438793.json
8585
"model": "0x56c0",
8686
"modelname": "",
8787
"familyname": "",
88+
"meiname": "mei0",
8889
"cardidx": 0,
8990
"renderdidx": 128,
9091
"memorymib": 1024,
@@ -96,8 +97,9 @@ $ cat /tmp/gpu-template-3524438793.json
9697
"vfindex": 0,
9798
"provisioned": false,
9899
"driver": "i915",
100+
"currentdriver": "",
99101
"pciroot": "pci0000:01",
100-
"healthy": false,
102+
"health": "",
101103
"healthstatus": null
102104
},
103105
"card1": {
@@ -106,6 +108,7 @@ $ cat /tmp/gpu-template-3524438793.json
106108
"model": "0xe20b",
107109
"modelname": "",
108110
"familyname": "",
111+
"meiname": "mei1",
109112
"cardidx": 1,
110113
"renderdidx": 129,
111114
"memorymib": 2048,
@@ -117,8 +120,9 @@ $ cat /tmp/gpu-template-3524438793.json
117120
"vfindex": 0,
118121
"provisioned": false,
119122
"driver": "xe",
123+
"currentdriver": "",
120124
"pciroot": "pci0000:02",
121-
"healthy": false,
125+
"health": "",
122126
"healthstatus": null
123127
}
124128
}
@@ -138,25 +142,27 @@ Sample output and fake file-system contents
138142

139143
```shell
140144
$ device-faker -t /tmp/gpu-template-3524438793.json gpu
141-
fake file system: /tmp/test-3985488568
142-
fake sysfs: /tmp/test-3985488568/sysfs
143-
fake devfs: /tmp/test-3985488568/dev
144-
fake CDI: /tmp/test-3985488568/cdi
145+
fake file system: /tmp/test-2503111759/
146+
fake sysfs: /tmp/test-2503111759/sysfs
147+
fake devfs: /tmp/test-2503111759/dev
148+
fake CDI: /tmp/test-2503111759/cdi
145149

146-
$ sudo tree /tmp/test-3985488568
147-
/tmp/test-3985488568
150+
$ sudo tree /tmp/test-2503111759/
151+
/tmp/test-2503111759/
148152
├── cdi
149153
├── dev
150-
│   └── dri
151-
│   ├── by-path
152-
│   │   ├── pci-0000:03:00.0-card -> ../card0
153-
│   │   ├── pci-0000:03:00.0-render -> ../renderD128
154-
│   │   ├── pci-0000:04:00.1-card -> ../card1
155-
│   │   └── pci-0000:04:00.1-render -> ../renderD129
156-
│   ├── card0
157-
│   ├── card1
158-
│   ├── renderD128
159-
│   └── renderD129
154+
│   ├── dri
155+
│   │   ├── by-path
156+
│   │   │   ├── pci-0000:03:00.0-card -> ../card0
157+
│   │   │   ├── pci-0000:03:00.0-render -> ../renderD128
158+
│   │   │   ├── pci-0000:04:00.1-card -> ../card1
159+
│   │   │   └── pci-0000:04:00.1-render -> ../renderD129
160+
│   │   ├── card0
161+
│   │   ├── card1
162+
│   │   ├── renderD128
163+
│   │   └── renderD129
164+
│   ├── mei0
165+
│   └── mei1
160166
├── kubelet-plugin
161167
│   ├── plugins
162168
│   │   └── gpu.intel.com
@@ -175,9 +181,12 @@ $ sudo tree /tmp/test-3985488568
175181
│   ├── 0000:04:00.1 -> ../../../../devices/pci0000:02/0000:04:00.1
176182
│   └── bind
177183
├── class
178-
│   └── drm
179-
│   ├── card0 -> /tmp/test-1963481256/sysfs/bus/pci/drivers/i915/0000:03:00.0/drm/card0
180-
│   └── card1 -> /tmp/test-1963481256/sysfs/bus/pci/drivers/xe/0000:04:00.1/drm/card1
184+
│   ├── drm
185+
│   │   ├── card0 -> /tmp/test-2503111759/sysfs/bus/pci/drivers/i915/0000:03:00.0/drm/card0
186+
│   │   └── card1 -> /tmp/test-2503111759/sysfs/bus/pci/drivers/xe/0000:04:00.1/drm/card1
187+
│   └── mei
188+
│   ├── mei0 -> ../../devices/pci0000:01/0000:03:00.0/i915.mei-gscfi.2304/mei/mei0
189+
│   └── mei1 -> ../../devices/pci0000:02/0000:04:00.1/xe.mei-gscfi.768/mei/mei1
181190
└── devices
182191
├── pci0000:01
183192
│   └── 0000:03:00.0
@@ -253,18 +262,24 @@ $ sudo tree /tmp/test-3985488568
253262
│   │   │   ├── lmem_quota
254263
│   │   │   └── preempt_timeout_us
255264
│   │   └── renderD128
265+
│   ├── i915.mei-gscfi.2304
266+
│   │   └── mei
267+
│   │   └── mei0
256268
│   ├── sriov_drivers_autoprobe
257269
│   ├── sriov_numvfs
258270
│   └── sriov_totalvfs
259271
└── pci0000:02
260272
└── 0000:04:00.1
261273
├── device
262-
└── drm
263-
├── card1
264-
│   └── lmem_total_bytes
265-
└── renderD129
266-
267-
53 directories, 66 files
274+
├── drm
275+
│   ├── card1
276+
│   │   └── lmem_total_bytes
277+
│   └── renderD129
278+
└── xe.mei-gscfi.768
279+
└── mei
280+
└── mei1
281+
282+
62 directories, 68 files
268283
```
269284

270285
</details>

doc/gaudi/README.md

Lines changed: 2 additions & 1 deletion
@@ -22,7 +22,8 @@ Supported Kubernetes versions are listed below:
2222
| v0.3.0 | Kubernetes v1.32+ | unsupported | Structured Parameters |
2323
| v0.4.0 | Kubernetes v1.32+ | unsupported | Structured Parameters |
2424
| v0.5.0 | Kubernetes v1.33-v1.34 | unsupported | Structured Parameters |
25-
| v0.6.0 | Kubernetes v1.32+ | supported | Structured Parameters |
25+
| v0.6.0 | Kubernetes v1.32+ | unsupported | Structured Parameters |
26+
| v0.7.0 | Kubernetes v1.34+ | supported | Structured Parameters |
2627

2728
## Documentation
2829

doc/gaudi/USAGE.md

Lines changed: 43 additions & 36 deletions
@@ -1,6 +1,6 @@
11
## Requirements
22

3-
- Kubernetes v1.32+, and optionally [some cluster parameters](../../hack/clusterconfig.yaml) for advanced features, see [Cluster Setup](../CLUSTER_SETUP.md)
3+
- Kubernetes v1.34+, and optionally [some cluster parameters](../../hack/clusterconfig.yaml) for advanced features, see [Cluster Setup](../CLUSTER_SETUP.md)
44
- Container runtime needs to support CDI:
55
- CRI-O v1.23.0 or newer
66
- Containerd v1.7 or newer with CDI enabled
@@ -40,19 +40,22 @@ To restrict the deployment to Gaudi-enabled nodes, follow these steps:
4040
Follow [Node Feature Discovery](https://github.com/kubernetes-sigs/node-feature-discovery) documentation to install and configure NFD in your cluster.
4141

4242
```bash
43-
kubectl apply -k "https://github.com/kubernetes-sigs/node-feature-discovery/deployment/overlays/default?ref=v0.17.3"
43+
kubectl apply -k "https://github.com/kubernetes-sigs/node-feature-discovery/deployment/overlays/default?ref=v0.18.3"
4444
```
4545

46-
2. Apply NFD Rules:
46+
2. Deploy the Gaudi DRA driver and NFD Rules together:
4747

4848
```bash
4949
kubectl apply -k deployments/gaudi/overlays/nfd_labeled_nodes/
5050
```
51-
After NFD is installed and running, make sure the target node is labeled with:
51+
52+
After NFD is installed and running, the nodes with Gaudi accelerators will be labeled with:
5253
```bash
5354
intel.feature.node.kubernetes.io/gaudi: "true"
5455
```
5556

57+
The Gaudi DRA driver will be deployed to nodes that have such labels.
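
To confirm which nodes NFD has labeled, and therefore where the driver is expected to be scheduled, a label selector query along these lines can help. The variable and function names are illustrative only; the label key and value are taken from the snippet above.

```shell
#!/bin/sh
# Label applied by NFD to nodes with Gaudi accelerators (from the docs above),
# written in kubectl's key=value selector form.
GAUDI_NODE_SELECTOR='intel.feature.node.kubernetes.io/gaudi=true'

# Hypothetical helper: print the names of all nodes carrying the Gaudi label.
list_gaudi_nodes() {
    kubectl get nodes -l "$GAUDI_NODE_SELECTOR" -o name
}
```

Calling `list_gaudi_nodes` against a cluster prints one `node/<name>` line per labeled node; an empty result suggests that NFD has not yet labeled any node.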
58+
5659
When deploying a custom-built resource driver image, change the `image:` lines in
5760
[resource-driver](../../deployments/gaudi/base/resource-driver.yaml) to match its location.
5861

@@ -79,11 +82,11 @@ Example contents of the ResourceSlice object:
7982
<details>
8083

8184
```bash
82-
$ kubectl get resourceSlices/rpl-s-gaudi.intel.com-x8m4h -o yaml
85+
$ kubectl get resourceslices.resource.k8s.io rpl-s-gaudi.intel.com-x8m4h -o yaml
8386
apiVersion: resource.k8s.io/v1
8487
kind: ResourceSlice
8588
metadata:
86-
creationTimestamp: "2024-09-23T13:03:21Z"
89+
creationTimestamp: "2026-04-15T07:17:43Z"
8790
generateName: rpl-s-gaudi.intel.com-
8891
generation: 1
8992
name: rpl-s-gaudi.intel.com-x8m4h
@@ -92,38 +95,42 @@ metadata:
9295
controller: true
9396
kind: Node
9497
name: rpl-s
95-
uid: 0894e000-e7a3-49ad-8749-04b27be61c03
96-
resourceVersion: "2047239"
97-
uid: 92fb64c7-219e-4cef-9be9-5233b589d7bd
98+
uid: 3a243a6b-e6db-4613-94f2-169f938c87ae
99+
resourceVersion: "6266269"
100+
uid: d093f601-9ad5-4234-91d8-410733e32784
98101
spec:
99102
devices:
100-
- basic:
101-
attributes:
102-
healthy:
103-
bool: true
104-
model:
105-
string: Gaudi2
106-
pciRoot:
107-
string: "40"
108-
serial:
109-
string: AN01234567
110-
name: 0000-43-00-0-0x1020
111-
- basic:
112-
attributes:
113-
healthy:
114-
bool: true
115-
model:
116-
string: Gaudi2
117-
pciRoot:
118-
string: "40"
119-
serial:
120-
string: AN12345678
103+
- attributes:
104+
healthy:
105+
bool: true
106+
model:
107+
string: Gaudi2
108+
pciRoot:
109+
string: "01"
110+
resource.kubernetes.io/pcieRoot:
111+
string: pci0000:01
112+
serial:
113+
string: ""
114+
name: 0000-a0-00-0-0x1020
115+
- attributes:
116+
healthy:
117+
bool: true
118+
model:
119+
string: Gaudi2
120+
pciRoot:
121+
string: "02"
122+
resource.kubernetes.io/pcieRoot:
123+
string: pci0000:02
124+
serial:
125+
string: ""
126+
name: 0000-b0-00-0-0x1020
121127
driver: gaudi.intel.com
122128
nodeName: rpl-s
123129
pool:
124-
generation: 0
130+
generation: 1
125131
name: rpl-s
126132
resourceSliceCount: 1
133+
127134
```
128135

129136
</details>
@@ -320,8 +327,8 @@ Unlike with normal Gaudi ResourceClaims:
320327

321328
## Health monitoring support
322329

323-
Starting from v0.7.0 Gaudi DRA driver supports health monitoring with `-m` command-line parameter (enabled
324-
in default deployment configuration) through HLML library. When Gaudi accelerator becomes unhealthy, the
330+
Starting from v0.7.0 Gaudi DRA driver supports health monitoring with `-m` command-line parameter (enabled
331+
in default deployment configuration) through HLML library. When Gaudi accelerator becomes unhealthy, the
325332
DeviceTaintRule is created (to evict current workloads and prevent this device from being allocated), and
326333
respective device's `healthy` field in ResourceSlice is changed to false.
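
A minimal sketch for inspecting that `healthy` attribute across ResourceSlices, assuming the `resource.k8s.io/v1` layout shown in the example ResourceSlice earlier in this document (`spec.devices[].attributes.healthy.bool`). The function and variable names are hypothetical.

```shell
#!/bin/sh
# JSONPath template: one line per device with its name and healthy flag.
# Assumes the v1 ResourceSlice layout from the example above.
HEALTH_JSONPATH='{range .items[*].spec.devices[*]}{.name}{"\t"}{.attributes.healthy.bool}{"\n"}{end}'

# Hypothetical helper: list every published device and its health value.
list_device_health() {
    kubectl get resourceslices.resource.k8s.io -o jsonpath="$HEALTH_JSONPATH"
}
```

Calling `list_device_health` lists devices from every DRA driver in the cluster, so devices that do not publish a `healthy` attribute show an empty second column.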
327334

@@ -342,9 +349,9 @@ This approach, however, does not allow influencing the Pods that are / were usin
342349

343350
K8s v1.33 introduced the [DeviceTaintRule](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/#device-taints-and-tolerations) concept, which lets the scheduler handle ResourceSlice devices similarly to how it handles K8s Node Taints and Tolerations.
344351

345-
Starting from v0.7.0 Gaudi DRA driver leverages DeviceTaintRules if Gaudi accelerator health degrades.
346-
DeviceTaintRule created as a result of degraded health has "NoSchedule" effect, which implies "NoExecute"
347-
and results in Pod eviction for devices with degraded health. Workloads that need to access tainted devices
352+
Starting from v0.7.0 Gaudi DRA driver leverages DeviceTaintRules if Gaudi accelerator health degrades.
353+
DeviceTaintRule created as a result of degraded health has "NoSchedule" effect, which implies "NoExecute"
354+
and results in Pod eviction for devices with degraded health. Workloads that need to access tainted devices
348355
need to have [taint toleration in ResourceClaim](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#device-taints-and-tolerations).
349356

350357
## Known issues

doc/gpu/README.md

Lines changed: 2 additions & 1 deletion
@@ -34,7 +34,8 @@ Supported Kubernetes versions are listed below:
3434
| v0.6.0 | Kubernetes v1.31 | unsupported | Structured Parameters |
3535
| v0.7.0 | Kubernetes v1.32+ | unsupported | Structured Parameters |
3636
| v0.8.0 | Kubernetes v1.33-v1.34 | unsupported | Structured Parameters |
37-
| v0.9.0 | Kubernetes v1.32+ | supported | Structured Parameters |
37+
| v0.9.0 | Kubernetes v1.32+ | unsupported | Structured Parameters |
38+
| v0.10.0 | Kubernetes v1.34+ | supported | Structured Parameters |
3839

3940
## Documentation
4041
