
Commit db6ebeb

address review

Signed-off-by: William Yao <william2000yao@gmail.com>

1 parent: 4ceb0ec
13 files changed: 101 additions & 149 deletions

README.md
Lines changed: 35 additions & 58 deletions

````diff
@@ -108,86 +108,56 @@ And show the initial state of available GPU devices on the worker node:
 kubectl get resourceslice -o yaml
 ```
 
-You should see 2 GPUs (gpu-0, gpu-1) on the worker node, each with model
+You should see 8 GPUs (gpu-0 through gpu-7) on the worker node, each with model
 `LATEST-GPU-MODEL` and 80Gi of memory.
 
-Next, deploy some example apps to see DRA in action. The default configuration
-provides 2 GPUs per node, which is enough to run each example individually.
-Each example file has detailed comments at the top explaining what it
-demonstrates and how to verify the results.
-
-**Example 1: Exclusive GPU access**
-
-Two pods each requesting their own distinct GPU:
+Next, deploy some example apps to see DRA in action. Each example file has
+detailed comments at the top explaining what it demonstrates and how to verify
+the results:
 ```bash
-kubectl apply -f demo/two-pods-one-gpu-each.yaml
-kubectl wait --for=condition=Ready pod/pod0 pod/pod1 -n two-pods-one-gpu-each --timeout=60s
-```
-
-Check that each pod got a different GPU:
-```bash
-for pod in pod0 pod1; do
-  echo "${pod}:"
-  kubectl logs -n two-pods-one-gpu-each ${pod} -c ctr0 | grep -E "GPU_DEVICE_[0-9]+=" | grep -v "RESOURCE_CLAIM"
-done
+kubectl apply -f demo/basic-two-pods-one-gpu-each.yaml
+kubectl apply -f demo/basic-shared-gpu-across-containers.yaml
+kubectl apply -f demo/basic-gpu-sharing-strategies.yaml
 ```
 
-Clean up before the next example:
+Wait for all pods to be ready:
 ```bash
-kubectl delete -f demo/two-pods-one-gpu-each.yaml
+kubectl wait --for=condition=Ready pod/pod0 pod/pod1 -n basic-two-pods-one-gpu-each --timeout=60s
+kubectl wait --for=condition=Ready pod/pod0 -n basic-shared-gpu-across-containers --timeout=60s
+kubectl wait --for=condition=Ready pod/pod0 -n basic-gpu-sharing-strategies --timeout=60s
 ```
 
-**Example 2: Shared GPU across containers**
-
-Two containers in one pod sharing a single GPU:
+Then check the pod logs to see which GPUs were allocated:
 ```bash
-kubectl apply -f demo/shared-gpu-across-containers.yaml
-kubectl wait --for=condition=Ready pod/pod0 -n shared-gpu-across-containers --timeout=60s
-```
+# basic-two-pods-one-gpu-each: each pod should have 1 GPU with a distinct ID
+echo "basic-two-pods-one-gpu-each:"
+for pod in pod0 pod1; do
+  echo "  ${pod}:"
+  kubectl logs -n basic-two-pods-one-gpu-each ${pod} -c ctr0 | grep -E "GPU_DEVICE_[0-9]+=" | grep -v "RESOURCE_CLAIM"
+done
 
-Check that both containers see the same GPU with TimeSlicing:
-```bash
+# basic-shared-gpu-across-containers: both containers should show the same GPU ID
+echo "basic-shared-gpu-across-containers:"
 for ctr in ctr0 ctr1; do
-  echo "pod0 ${ctr}:"
-  kubectl logs -n shared-gpu-across-containers pod0 -c ${ctr} | grep -E "GPU_DEVICE_[0-9]+" | grep -v "RESOURCE_CLAIM"
+  echo "  pod0 ${ctr}:"
+  kubectl logs -n basic-shared-gpu-across-containers pod0 -c ${ctr} | grep -E "GPU_DEVICE_[0-9]+" | grep -v "RESOURCE_CLAIM"
 done
-```
 
-Clean up before the next example:
-```bash
-kubectl delete -f demo/shared-gpu-across-containers.yaml
-```
-
-**Example 3: GPU sharing strategies**
-
-Two GPUs configured with different sharing modes (TimeSlicing and SpacePartitioning):
-```bash
-kubectl apply -f demo/gpu-sharing-strategies.yaml
-kubectl wait --for=condition=Ready pod/pod0 -n gpu-sharing-strategies --timeout=60s
-```
-
-Check that ts-ctr0/ts-ctr1 share one GPU with TimeSlicing and sp-ctr0/sp-ctr1
-share another with SpacePartitioning:
-```bash
+# basic-gpu-sharing-strategies: ts-ctr0/ts-ctr1 share one GPU (TimeSlicing),
+# sp-ctr0/sp-ctr1 share another (SpacePartitioning)
+echo "basic-gpu-sharing-strategies:"
 for ctr in ts-ctr0 ts-ctr1 sp-ctr0 sp-ctr1; do
-  echo "pod0 ${ctr}:"
-  kubectl logs -n gpu-sharing-strategies pod0 -c ${ctr} | grep -E "GPU_DEVICE_[0-9]+" | grep -v "RESOURCE_CLAIM"
+  echo "  pod0 ${ctr}:"
+  kubectl logs -n basic-gpu-sharing-strategies pod0 -c ${ctr} | grep -E "GPU_DEVICE_[0-9]+" | grep -v "RESOURCE_CLAIM"
 done
 ```
 
-Clean up:
-```bash
-kubectl delete -f demo/gpu-sharing-strategies.yaml
-```
-
 In this example resource driver, no "actual" GPUs are made available to any
 containers. Instead, a set of environment variables are set in each container
 to indicate which GPUs *would* have been injected into them by a real resource
 driver and how they *would* have been configured.
 
 For the full list of all 8 available examples, see [`demo/README.md`](demo/README.md).
-To run multiple examples at the same time, increase `kubeletPlugin.numDevices`
-when installing the Helm chart.
 
 ### Demo DRA Admin Access Feature
 This example driver includes support for the [DRA AdminAccess feature](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#admin-access), which allows administrators to gain privileged access to devices already in use by other users. This example demonstrates the end-to-end flow by setting the `DRA_ADMIN_ACCESS` environment variable. A driver managing real devices could use this to expose host hardware information.
@@ -205,7 +175,14 @@ To run this demo:
 
 ### Clean Up
 
-Once you are done, delete the `kind` cluster:
+Once you have verified everything is running correctly, delete the example apps:
+```bash
+kubectl delete -f demo/basic-two-pods-one-gpu-each.yaml
+kubectl delete -f demo/basic-shared-gpu-across-containers.yaml
+kubectl delete -f demo/basic-gpu-sharing-strategies.yaml
+```
+
+Finally, delete the `kind` cluster:
 ```bash
 ./demo/delete-cluster.sh
 ```
````
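As context for the `kubectl get resourceslice -o yaml` step above: that command lists the simulated devices the kubelet plugin advertises. Below is a minimal sketch of roughly what one device entry looks like under the `resource.k8s.io/v1` API. The driver name `gpu.example.com`, the node and pool names, and the attribute keys are assumptions for illustration, not output captured from this commit:

```yaml
# Sketch only: abbreviated shape of a published ResourceSlice.
# Names marked "assumed" are illustrative, not taken from this repo.
apiVersion: resource.k8s.io/v1
kind: ResourceSlice
metadata:
  name: kind-worker-gpu.example.com-xxxxx   # assumed generated name
spec:
  nodeName: kind-worker                     # assumed kind worker node name
  driver: gpu.example.com                   # assumed driver name
  pool:
    name: kind-worker
    generation: 1
    resourceSliceCount: 1
  devices:
  - name: gpu-0
    attributes:
      model:
        string: LATEST-GPU-MODEL
    capacity:
      memory:
        value: 80Gi
  # ...gpu-1 through gpu-7 follow the same shape
```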

demo/README.md
Lines changed: 8 additions & 27 deletions

````diff
@@ -1,31 +1,14 @@
 # Demo Examples
 
 This directory contains example workloads that demonstrate different ways to
-request and configure GPU devices using Dynamic Resource Allocation (DRA).
+request and configure devices using Dynamic Resource Allocation (DRA).
 
-## Quick Start
+Examples prefixed with `basic-` are featured in the
+[main README walkthrough](../README.md) and are a good starting point for
+learning about DRA.
 
-The following three examples are featured in the [main README walkthrough](../README.md)
-and are designed to run together with the default cluster configuration (2 GPUs):
-
-| Example | Description | GPUs |
-|---|---|---|
-| [two-pods-one-gpu-each.yaml](two-pods-one-gpu-each.yaml) | Two pods each get their own exclusive GPU | 2 |
-| [shared-gpu-across-containers.yaml](shared-gpu-across-containers.yaml) | Two containers in one pod share a single GPU | 1 |
-| [gpu-sharing-strategies.yaml](gpu-sharing-strategies.yaml) | TimeSlicing and SpacePartitioning on two GPUs | 2 |
-
-## All Examples
-
-| Example | Description | GPUs | Key Concept |
-|---|---|---|---|
-| [two-pods-one-gpu-each.yaml](two-pods-one-gpu-each.yaml) | Two pods, each requesting one exclusive GPU | 2 | ResourceClaimTemplate basics |
-| [one-pod-two-gpus.yaml](one-pod-two-gpus.yaml) | One container requesting two distinct GPUs | 2 | Multiple requests in a claim |
-| [shared-gpu-across-containers.yaml](shared-gpu-across-containers.yaml) | Two containers sharing one GPU within a pod | 1 | Intra-pod GPU sharing |
-| [shared-global-claim.yaml](shared-global-claim.yaml) | Two pods sharing a GPU via a pre-created ResourceClaim | 1 | ResourceClaim vs ResourceClaimTemplate |
-| [gpu-sharing-strategies.yaml](gpu-sharing-strategies.yaml) | TimeSlicing and SpacePartitioning configuration | 2 | Opaque driver config (GpuConfig) |
-| [initcontainer-shared-gpu.yaml](initcontainer-shared-gpu.yaml) | initContainer and container sharing a GPU | 1 | initContainer support |
-| [admin-access.yaml](admin-access.yaml) | Admin access to all GPUs with elevated privileges | All | DRA AdminAccess feature |
-| [cel-selector.yaml](cel-selector.yaml) | Selecting a GPU by model and memory using CEL | 1 | CEL expression selectors |
+Each example file has detailed comments at the top explaining what it
+demonstrates, what output to expect, and the driver and cluster requirements.
 
 ## Running Examples
 
@@ -43,9 +26,7 @@ kubectl delete -f demo/<example-name>.yaml
 
 ## Notes
 
-- The default Helm chart configures **2 GPUs** per node, which is enough to run
-  any single example (except `admin-access.yaml` which uses all available GPUs).
-- To run multiple examples simultaneously, increase `kubeletPlugin.numDevices`
-  in the Helm values.
+- The default Helm chart configures **8 GPUs** per node, which is enough to run
+  several examples simultaneously.
 - Each example creates its own namespace, so examples don't interfere with
   each other's resource names.
````

demo/admin-access.yaml
Lines changed: 7 additions & 6 deletions

````diff
@@ -9,19 +9,20 @@
 # - The namespace must have the label:
 #     resource.kubernetes.io/admin-access: "true"
 # - The request must set adminAccess: true
-# - allocationMode: All is used here to access all available GPUs
+# - allocationMode: All is used here to access all available GPUs on a Node.
+#   Admins typically require access to all devices on a node to perform
+#   maintenance or monitoring.
 #
 # Expected: The container has DRA_ADMIN_ACCESS=true and GPU_DEVICE env vars
 # for all available GPUs. Check with:
 #   kubectl logs -n admin-access pod0 -c ctr0 | grep DRA_ADMIN_ACCESS
 #   kubectl logs -n admin-access pod0 -c ctr0 | grep GPU_DEVICE
 #
-# Resources created:
-# - 1 Namespace (with admin-access label)
-# - 1 ResourceClaimTemplate (multiple-gpus-admin)
-# - 1 Pod (pod0) with 1 container
+# Cluster requirements:
+#   Kubernetes 1.34+
+#   Feature gate: DRAAdminAccess
 #
-# GPUs required: all available (uses allocationMode: All)
+# GPUs required: all available on a Node (uses allocationMode: All)
 
 ---
 apiVersion: v1
````
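Pulling the pieces referenced in the comments above into one place: a minimal sketch of how the namespace label, `allocationMode: All`, and `adminAccess: true` fit together in a `resource.k8s.io/v1` claim template. The `gpu.example.com` device class is an assumption; `demo/admin-access.yaml` remains the authoritative manifest:

```yaml
# Sketch only: admin-access wiring as described in the file's comments.
apiVersion: v1
kind: Namespace
metadata:
  name: admin-access
  labels:
    resource.kubernetes.io/admin-access: "true"  # required for adminAccess requests
---
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  namespace: admin-access
  name: multiple-gpus-admin
spec:
  spec:
    devices:
      requests:
      - name: gpus
        exactly:
          deviceClassName: gpu.example.com  # assumed device class name
          allocationMode: All               # all GPUs on the node, even ones in use
          adminAccess: true                 # needs the DRAAdminAccess feature gate
```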
demo/gpu-sharing-strategies.yaml renamed to demo/basic-gpu-sharing-strategies.yaml
Lines changed: 7 additions & 8 deletions

````diff
@@ -11,26 +11,25 @@
 # Expected: ts-ctr0 and ts-ctr1 share one GPU with SHARING_STRATEGY=TimeSlicing
 # and TIMESLICE_INTERVAL=Long. sp-ctr0 and sp-ctr1 share a different GPU with
 # SHARING_STRATEGY=SpacePartitioning and PARTITION_COUNT=10. Check with:
-#   kubectl logs -n gpu-sharing-strategies pod0 -c ts-ctr0 | grep GPU_DEVICE
-#   kubectl logs -n gpu-sharing-strategies pod0 -c sp-ctr0 | grep GPU_DEVICE
+#   kubectl logs -n basic-gpu-sharing-strategies pod0 -c ts-ctr0 | grep GPU_DEVICE
+#   kubectl logs -n basic-gpu-sharing-strategies pod0 -c sp-ctr0 | grep GPU_DEVICE
 #
-# Resources created:
-# - 1 ResourceClaimTemplate (multiple-gpus) with 2 requests + config
-# - 1 Pod (pod0) with 4 containers
+# Cluster requirements:
+#   Kubernetes 1.34+
 #
 # GPUs required: 2
 
 ---
 apiVersion: v1
 kind: Namespace
 metadata:
-  name: gpu-sharing-strategies
+  name: basic-gpu-sharing-strategies
 
 ---
 apiVersion: resource.k8s.io/v1
 kind: ResourceClaimTemplate
 metadata:
-  namespace: gpu-sharing-strategies
+  namespace: basic-gpu-sharing-strategies
   name: multiple-gpus
 spec:
   spec:
@@ -68,7 +67,7 @@ spec:
 apiVersion: v1
 kind: Pod
 metadata:
-  namespace: gpu-sharing-strategies
+  namespace: basic-gpu-sharing-strategies
   name: pod0
 spec:
   containers:
````
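For readers following the rename: the sharing behavior in this file comes from pairing each device request with an opaque per-driver config. Below is a sketch of that pairing; the `gpu.example.com` driver name and the `GpuConfig` group/version are assumptions inferred from the env vars in the comments (`TimeSlicing`/`Long`, `SpacePartitioning`/`10`), and the real template lives in `demo/basic-gpu-sharing-strategies.yaml`:

```yaml
# Sketch only: request/config pairing inferred from the comments above.
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  namespace: basic-gpu-sharing-strategies
  name: multiple-gpus
spec:
  spec:
    devices:
      requests:
      - name: ts-gpu
        exactly:
          deviceClassName: gpu.example.com    # assumed device class name
      - name: sp-gpu
        exactly:
          deviceClassName: gpu.example.com
      config:
      - requests: ["ts-gpu"]
        opaque:
          driver: gpu.example.com
          parameters:                          # opaque to Kubernetes, read by the driver
            apiVersion: gpu.resource.example.com/v1alpha1   # assumed group/version
            kind: GpuConfig
            sharing:
              strategy: TimeSlicing
              timeSlicingConfig:
                interval: Long
      - requests: ["sp-gpu"]
        opaque:
          driver: gpu.example.com
          parameters:
            apiVersion: gpu.resource.example.com/v1alpha1
            kind: GpuConfig
            sharing:
              strategy: SpacePartitioning
              spacePartitioningConfig:
                partitionCount: 10
```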

demo/shared-gpu-across-containers.yaml renamed to demo/basic-shared-gpu-across-containers.yaml
Lines changed: 9 additions & 10 deletions

````diff
@@ -3,28 +3,27 @@
 # One pod, two containers.
 # Each asking for shared access to a single GPU.
 #
-# Expected: Both containers see the same GPU with TimeSlicing. Check with:
-#   kubectl logs -n shared-gpu-across-containers pod0 -c ctr0 | grep GPU_DEVICE
-#   kubectl logs -n shared-gpu-across-containers pod0 -c ctr1 | grep GPU_DEVICE
-# Both containers should show the same GPU ID with SHARING_STRATEGY=TimeSlicing.
+# Expected: Both containers see the same GPU. Check with:
+#   kubectl logs -n basic-shared-gpu-across-containers pod0 -c ctr0 | grep GPU_DEVICE
+#   kubectl logs -n basic-shared-gpu-across-containers pod0 -c ctr1 | grep GPU_DEVICE
+# Both containers should show the same GPU ID.
 #
-# Resources created:
-# - 1 ResourceClaimTemplate (single-gpu)
-# - 1 Pod (pod0) with 2 containers (ctr0, ctr1)
+# Cluster requirements:
+#   Kubernetes 1.34+
 #
 # GPUs required: 1
 
 ---
 apiVersion: v1
 kind: Namespace
 metadata:
-  name: shared-gpu-across-containers
+  name: basic-shared-gpu-across-containers
 
 ---
 apiVersion: resource.k8s.io/v1
 kind: ResourceClaimTemplate
 metadata:
-  namespace: shared-gpu-across-containers
+  namespace: basic-shared-gpu-across-containers
   name: single-gpu
 spec:
   spec:
@@ -38,7 +37,7 @@ spec:
 apiVersion: v1
 kind: Pod
 metadata:
-  namespace: shared-gpu-across-containers
+  namespace: basic-shared-gpu-across-containers
   name: pod0
 spec:
   containers:
````
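The `containers:` list this hunk ends on is where the sharing happens: both containers name the same pod-level claim, instantiated once from the `single-gpu` template, so the driver hands them the same device. A sketch of that wiring (the claim name, image, and command are illustrative, not copied from the file):

```yaml
# Sketch only: two containers referencing one pod-level claim.
apiVersion: v1
kind: Pod
metadata:
  namespace: basic-shared-gpu-across-containers
  name: pod0
spec:
  containers:
  - name: ctr0
    image: ubuntu:24.04                        # illustrative image
    command: ["bash", "-c", "env | grep GPU_DEVICE && sleep 9999"]
    resources:
      claims:
      - name: shared-gpu                       # same claim as ctr1
  - name: ctr1
    image: ubuntu:24.04
    command: ["bash", "-c", "env | grep GPU_DEVICE && sleep 9999"]
    resources:
      claims:
      - name: shared-gpu                       # same claim as ctr0
  resourceClaims:
  - name: shared-gpu
    resourceClaimTemplateName: single-gpu      # one claim instance for the whole pod
```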
demo/two-pods-one-gpu-each.yaml renamed to demo/basic-two-pods-one-gpu-each.yaml
Lines changed: 8 additions & 9 deletions

````diff
@@ -4,27 +4,26 @@
 # Each container asking for 1 distinct GPU.
 #
 # Expected: Each pod gets a different GPU. Check with:
-#   kubectl logs -n two-pods-one-gpu-each pod0 -c ctr0 | grep GPU_DEVICE
-#   kubectl logs -n two-pods-one-gpu-each pod1 -c ctr0 | grep GPU_DEVICE
+#   kubectl logs -n basic-two-pods-one-gpu-each pod0 -c ctr0 | grep GPU_DEVICE
+#   kubectl logs -n basic-two-pods-one-gpu-each pod1 -c ctr0 | grep GPU_DEVICE
 # Each container should have 1 GPU_DEVICE env var with a distinct GPU ID.
 #
-# Resources created:
-# - 1 ResourceClaimTemplate (single-gpu)
-# - 2 Pods (pod0, pod1), each with 1 container
+# Cluster requirements:
+#   Kubernetes 1.34+
 #
 # GPUs required: 2
 
 ---
 apiVersion: v1
 kind: Namespace
 metadata:
-  name: two-pods-one-gpu-each
+  name: basic-two-pods-one-gpu-each
 
 ---
 apiVersion: resource.k8s.io/v1
 kind: ResourceClaimTemplate
 metadata:
-  namespace: two-pods-one-gpu-each
+  namespace: basic-two-pods-one-gpu-each
   name: single-gpu
 spec:
   spec:
@@ -38,7 +37,7 @@ spec:
 apiVersion: v1
 kind: Pod
 metadata:
-  namespace: two-pods-one-gpu-each
+  namespace: basic-two-pods-one-gpu-each
   name: pod0
   labels:
     app: pod
@@ -59,7 +58,7 @@ spec:
 apiVersion: v1
 kind: Pod
 metadata:
-  namespace: two-pods-one-gpu-each
+  namespace: basic-two-pods-one-gpu-each
   name: pod1
   labels:
     app: pod
````
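In contrast to the in-pod sharing above, each of the two pods here mints its own claim from the same `single-gpu` template, which is why `pod0` and `pod1` end up on distinct GPUs. A sketch of one pod's wiring (claim name, image, and command are illustrative):

```yaml
# Sketch only: a ResourceClaimTemplate yields a fresh claim per pod,
# so pod0 and pod1 are allocated different devices.
apiVersion: v1
kind: Pod
metadata:
  namespace: basic-two-pods-one-gpu-each
  name: pod0
  labels:
    app: pod
spec:
  containers:
  - name: ctr0
    image: ubuntu:24.04                        # illustrative image
    command: ["bash", "-c", "env | grep GPU_DEVICE && sleep 9999"]
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu      # per-pod claim instance
```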

demo/cel-selector.yaml
Lines changed: 2 additions & 3 deletions

````diff
@@ -8,9 +8,8 @@
 #   kubectl logs -n cel-selector pod0 -c ctr0 | grep GPU_DEVICE
 # The container should have 1 GPU_DEVICE env var.
 #
-# Resources created:
-# - 1 ResourceClaimTemplate (single-gpu-cel) with CEL selectors
-# - 1 Pod (pod0) with 1 container
+# Cluster requirements:
+#   Kubernetes 1.34+
 #
 # GPUs required: 1
 
````
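For orientation, this file selects a GPU by model and memory with a CEL expression. A sketch of roughly what such a selector looks like under `resource.k8s.io/v1`; the `gpu.example.com` attribute prefix and the `model`/`memory` names are assumptions based on the README's ResourceSlice description, and `demo/cel-selector.yaml` is the real template:

```yaml
# Sketch only: CEL device selector with assumed attribute names.
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  namespace: cel-selector
  name: single-gpu-cel
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.example.com     # assumed device class name
          selectors:
          - cel:
              expression: |-
                device.attributes["gpu.example.com"].model == "LATEST-GPU-MODEL" &&
                device.capacity["gpu.example.com"].memory.compareTo(quantity("80Gi")) >= 0
```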

demo/initcontainer-shared-gpu.yaml
Lines changed: 2 additions & 3 deletions

````diff
@@ -8,9 +8,8 @@
 #   kubectl logs -n initcontainer-shared-gpu pod0 -c ctr0 | grep GPU_DEVICE
 # Both should show the same GPU ID.
 #
-# Resources created:
-# - 1 ResourceClaimTemplate (single-gpu)
-# - 1 Pod (pod0) with 1 initContainer + 1 container
+# Cluster requirements:
+#   Kubernetes 1.34+
 #
 # GPUs required: 1
 
````
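Because a pod-level claim is allocated once for the whole pod, an initContainer can reference the same claim entry as a regular container and observe the same GPU, even though it runs and exits first. A sketch of that shape (claim name, image, and command are illustrative):

```yaml
# Sketch only: initContainer and container sharing one pod-level claim.
apiVersion: v1
kind: Pod
metadata:
  namespace: initcontainer-shared-gpu
  name: pod0
spec:
  initContainers:
  - name: init0
    image: ubuntu:24.04                        # illustrative image
    command: ["bash", "-c", "env | grep GPU_DEVICE"]
    resources:
      claims:
      - name: gpu
  containers:
  - name: ctr0
    image: ubuntu:24.04
    command: ["bash", "-c", "env | grep GPU_DEVICE && sleep 9999"]
    resources:
      claims:
      - name: gpu                              # same claim as init0
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu
```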

demo/one-pod-two-gpus.yaml
Lines changed: 2 additions & 3 deletions

````diff
@@ -7,9 +7,8 @@
 #   kubectl logs -n one-pod-two-gpus pod0 -c ctr0 | grep GPU_DEVICE
 # The container should have 2 GPU_DEVICE env vars with distinct GPU IDs.
 #
-# Resources created:
-# - 1 ResourceClaimTemplate (multiple-gpus) with 2 requests
-# - 1 Pod (pod0) with 1 container
+# Cluster requirements:
+#   Kubernetes 1.34+
 #
 # GPUs required: 2
 
````
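The deleted comment block mentioned one claim template carrying two requests; each named request resolves to its own device, which is why the single container sees two distinct `GPU_DEVICE` env vars. A sketch of such a template (request names and device class are assumptions):

```yaml
# Sketch only: two named requests in one claim; both must be satisfied,
# so the container is handed two distinct GPUs.
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  namespace: one-pod-two-gpus
  name: multiple-gpus
spec:
  spec:
    devices:
      requests:
      - name: gpu0
        exactly:
          deviceClassName: gpu.example.com     # assumed device class name
      - name: gpu1
        exactly:
          deviceClassName: gpu.example.com
```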
