82 changes: 42 additions & 40 deletions README.md
Next, deploy five example apps that demonstrate how `ResourceClaim`s,
`ResourceClaimTemplate`s, and custom `GpuConfig` objects can be used to
select and configure resources in various ways:
```bash
kubectl apply --filename=demo/basic-resourceclaimtemplate.yaml \
--filename=demo/basic-multiple-requests.yaml \
--filename=demo/basic-shared-claim-across-containers.yaml \
--filename=demo/basic-shared-claim-across-pods.yaml \
--filename=demo/basic-resourceclaim-opaque-config.yaml
```

And verify that they are coming up successfully:
```console
$ kubectl get pod -A
NAMESPACE NAME READY STATUS RESTARTS AGE
...
basic-resourceclaimtemplate pod0 0/1 Pending 0 2s
basic-resourceclaimtemplate pod1 0/1 Pending 0 2s
basic-multiple-requests pod0 0/2 Pending 0 2s
basic-shared-claim-across-containers pod0 0/1 ContainerCreating 0 2s
basic-shared-claim-across-containers pod1 0/1 ContainerCreating 0 2s
basic-shared-claim-across-pods pod0 0/1 Pending 0 2s
basic-resourceclaim-opaque-config pod0 0/4 Pending 0 2s
...
```

Use your favorite editor to look through each of the `basic-*.yaml`
files and see what they are doing.

Then dump the logs of each app to verify that GPUs were allocated to them
as expected:
```bash
for ns in basic-resourceclaimtemplate basic-multiple-requests basic-shared-claim-across-containers basic-shared-claim-across-pods basic-resourceclaim-opaque-config; do \
echo "${ns}:"
for pod in $(kubectl get pod -n ${ns} --output=jsonpath='{.items[*].metadata.name}'); do \
for ctr in $(kubectl get pod -n ${ns} ${pod} -o jsonpath='{.spec.containers[*].name}'); do \
echo "${pod} ${ctr}:"
kubectl logs -n ${ns} ${pod} -c ${ctr}| grep -E "GPU_DEVICE_[0-9]+" | grep -v "RESOURCE_CLAIM"
done
done
echo ""
done
```
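The nested loops above rely on `kubectl`'s jsonpath output being a single space-separated line of names, which the unquoted `$(...)` expansion then word-splits. The same pattern is shown here with stub data so it can be tried without a cluster (the pod names are illustrative):

```shell
# Stand-in for `kubectl get pod -n <ns> -o jsonpath='{.items[*].metadata.name}'`,
# which prints pod names separated by single spaces on one line.
pods="pod0 pod1"

# The unquoted expansion word-splits the list, giving one iteration per pod.
for pod in ${pods}; do
  echo "inspecting ${pod}"
done
# prints:
# inspecting pod0
# inspecting pod1
```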

This should produce output similar to the following:
```bash
basic-resourceclaimtemplate:
pod0 ctr0:
declare -x GPU_DEVICE_6="gpu-6"
pod1 ctr0:
declare -x GPU_DEVICE_7="gpu-7"

basic-multiple-requests:
pod0 ctr0:
declare -x GPU_DEVICE_0="gpu-0"
declare -x GPU_DEVICE_1="gpu-1"

basic-shared-claim-across-containers:
pod0 ctr0:
declare -x GPU_DEVICE_2="gpu-2"
declare -x GPU_DEVICE_2_SHARING_STRATEGY="TimeSlicing"
declare -x GPU_DEVICE_2="gpu-2"
declare -x GPU_DEVICE_2_SHARING_STRATEGY="TimeSlicing"
declare -x GPU_DEVICE_2_TIMESLICE_INTERVAL="Default"

basic-shared-claim-across-pods:
pod0 ctr0:
declare -x GPU_DEVICE_3="gpu-3"
declare -x GPU_DEVICE_3_SHARING_STRATEGY="TimeSlicing"
declare -x GPU_DEVICE_3="gpu-3"
declare -x GPU_DEVICE_3_SHARING_STRATEGY="TimeSlicing"
declare -x GPU_DEVICE_3_TIMESLICE_INTERVAL="Default"

basic-resourceclaim-opaque-config:
pod0 ts-ctr0:
declare -x GPU_DEVICE_4="gpu-4"
declare -x GPU_DEVICE_4_SHARING_STRATEGY="TimeSlicing"
```

This example driver includes support for the DRA AdminAccess feature.

#### Usage Example

See `demo/admin-access.yaml` for a complete example. Key points:

1. **Namespace**: Must have the `resource.kubernetes.io/admin-access` label set to create ResourceClaimTemplate and ResourceClaim with `adminAccess: true` for Kubernetes v1.34+.
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: admin-access
  labels:
    resource.kubernetes.io/admin-access: "true"
```
This demonstration shows the end-to-end flow of the DRA AdminAccess feature.
Once you have verified everything is running correctly, delete all of the
example apps:
```bash
kubectl delete --wait=false --filename=demo/basic-resourceclaimtemplate.yaml \
--filename=demo/basic-multiple-requests.yaml \
--filename=demo/basic-shared-claim-across-containers.yaml \
--filename=demo/basic-shared-claim-across-pods.yaml \
--filename=demo/basic-resourceclaim-opaque-config.yaml \
--filename=demo/admin-access.yaml
```

And wait for them to terminate:
```console
$ kubectl get pod -A
NAMESPACE NAME READY STATUS RESTARTS AGE
...
basic-resourceclaimtemplate pod0 1/1 Terminating 0 31m
basic-resourceclaimtemplate pod1 1/1 Terminating 0 31m
basic-multiple-requests pod0 2/2 Terminating 0 31m
basic-shared-claim-across-containers pod0 1/1 Terminating 0 31m
basic-shared-claim-across-containers pod1 1/1 Terminating 0 31m
basic-shared-claim-across-pods pod0 1/1 Terminating 0 31m
basic-resourceclaim-opaque-config pod0 4/4 Terminating 0 31m
admin-access pod0 1/1 Terminating 0 31m
...
```

31 changes: 31 additions & 0 deletions demo/README.md
# Demo Examples

This directory contains example workloads that demonstrate different ways to
request and configure devices using Dynamic Resource Allocation (DRA).

Examples prefixed with `basic-` are a good starting point for
learning about DRA.

Each example file has detailed comments at the top explaining what it
demonstrates, what output to expect, and the driver and cluster requirements.

## Running Examples

Each example can be run individually:

```bash
kubectl apply -f demo/<example-name>.yaml
```

To clean up:

```bash
kubectl delete -f demo/<example-name>.yaml
```

## Notes

- The default Helm chart configures **8 GPUs** per node, which is enough to run
several examples simultaneously.
- Each example creates its own namespace, so examples don't interfere with
each other's resource names.
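Since each example creates a namespace named after its manifest file (the convention noted above), the namespace for log inspection can be derived mechanically. This is a small sketch assuming that naming convention holds; `example_ns` is a hypothetical helper, not part of the repo:

```shell
# Derive the namespace an example uses from its manifest path,
# assuming demo/<name>.yaml always creates namespace <name>.
example_ns() {
  basename "$1" .yaml
}

ns="$(example_ns demo/basic-multiple-requests.yaml)"
echo "${ns}"   # prints: basic-multiple-requests
```

With that, `kubectl logs -n "$(example_ns demo/<example-name>.yaml)" ...` works uniformly across examples.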
38 changes: 31 additions & 7 deletions demo/gpu-test7.yaml → demo/admin-access.yaml
# Example: DRA Admin Access
#
# One namespace with admin access label.
# One pod with one container requesting all GPUs with admin access.
# This demo shows the DRA admin access feature with DRA_ADMIN_ACCESS
# environment variable.
#
# Key requirements:
> **Reviewer comment (Contributor):** I'm wondering if it makes sense to organize these examples by "profile"? i.e. `demo/admin-access.yaml` -> `demo/gpu/admin-access.yaml`
>
> As we add more profiles, that would help users know how to install the driver to run this particular example. On the other hand, more generic features like admin access could work with any profile, so it might become harder to find than in a flat list like this.
>
> Let's not change that right now, but something to consider later.

# - The namespace must have the label:
# resource.kubernetes.io/admin-access: "true"
# - The request must set adminAccess: true
# - "allocationMode: All" is used here to access all available GPUs on a Node.
# Admins typically require access to all devices on a node to perform
# maintenance or monitoring.
#
# Expected: The container has DRA_ADMIN_ACCESS=true and GPU_DEVICE env vars
# for all available GPUs. Check with:
# kubectl logs -n admin-access pod0 -c ctr0 | grep DRA_ADMIN_ACCESS
# kubectl logs -n admin-access pod0 -c ctr0 | grep GPU_DEVICE
> **Reviewer comment (Contributor), on lines +16 to +19:** Definitely not something we need to do now, but it would be awesome if we could make comments like this drive the e2e tests as well as serve as human-readable documentation. Then we wouldn't need to update the test logic at all when adding an example.

#
# Driver requirements:
# Profile: gpu
# GPUs: all available on a Node (uses allocationMode: All)
#
# Cluster requirements:
# Kubernetes 1.34+
# Feature gate: DRAAdminAccess

---
apiVersion: v1
kind: Namespace
metadata:
  name: admin-access
  labels:
    resource.kubernetes.io/admin-access: "true"
---
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  namespace: admin-access
  name: multiple-gpus-admin
spec:
  spec:
    devices:
      requests:
      - name: admin-gpu
---
apiVersion: v1
kind: Pod
metadata:
  namespace: admin-access
  name: pod0
spec:
  containers:
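The expected-output comment above can be checked with a short grep. Here, sample `declare -x` lines stand in for the real `kubectl logs -n admin-access pod0 -c ctr0` output; this is a sketch assuming the log format shown earlier in the README:

```shell
# Sample container log lines in the `declare -x` format this driver emits;
# in practice, pipe `kubectl logs -n admin-access pod0 -c ctr0` instead.
printf '%s\n' \
  'declare -x DRA_ADMIN_ACCESS="true"' \
  'declare -x GPU_DEVICE_0="gpu-0"' \
  'declare -x GPU_DEVICE_1="gpu-1"' |
  grep -c '^declare -x GPU_DEVICE_[0-9]*='   # prints: 2
```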
23 changes: 18 additions & 5 deletions demo/gpu-test2.yaml → demo/basic-multiple-requests.yaml
# Example: One Pod, Two GPUs
#
# One pod, one container.
# Asking for 2 distinct GPUs.
#
# Expected: The container gets 2 different GPUs. Check with:
# kubectl logs -n basic-multiple-requests pod0 -c ctr0 | grep GPU_DEVICE
# The container should have 2 GPU_DEVICE env vars with distinct GPU IDs.
#
# Driver requirements:
# Profile: gpu
# GPUs: 2
#
# Cluster requirements:
# Kubernetes 1.34+

---
apiVersion: v1
kind: Namespace
metadata:
  name: basic-multiple-requests

---
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  namespace: basic-multiple-requests
  name: multiple-gpus
spec:
  spec:
---
apiVersion: v1
kind: Pod
metadata:
  namespace: basic-multiple-requests
  name: pod0
  labels:
    app: pod
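The "2 distinct GPUs" expectation above can be verified by counting unique device ids in the env dump. This sketch uses sample log lines; in practice, pipe the real `kubectl logs -n basic-multiple-requests pod0 -c ctr0` output instead:

```shell
# Count distinct gpu-N ids among the GPU_DEVICE_* assignments
# (grep is case-sensitive, so the variable names themselves don't match).
printf '%s\n' \
  'declare -x GPU_DEVICE_0="gpu-0"' \
  'declare -x GPU_DEVICE_1="gpu-1"' |
  grep -Eo 'gpu-[0-9]+' | sort -u | wc -l   # prints: 2
```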
demo/gpu-test5.yaml → demo/basic-resourceclaim-opaque-config.yaml
# Example: GPU Sharing Strategies (TimeSlicing + SpacePartitioning)
#
# One pod, four containers, two GPUs with custom GpuConfig:
#
# - ts-gpu: Configured with TimeSlicing (interval: Long). Two containers
# (ts-ctr0, ts-ctr1) share this GPU by taking turns.
#
# - sp-gpu: Configured with SpacePartitioning (partitionCount: 10). Two
# containers (sp-ctr0, sp-ctr1) each get a partition of this GPU.
#
# Expected: ts-ctr0 and ts-ctr1 share one GPU with SHARING_STRATEGY=TimeSlicing
# and TIMESLICE_INTERVAL=Long. sp-ctr0 and sp-ctr1 share a different GPU with
# SHARING_STRATEGY=SpacePartitioning and PARTITION_COUNT=10. Check with:
# kubectl logs -n basic-resourceclaim-opaque-config pod0 -c ts-ctr0 | grep GPU_DEVICE
# kubectl logs -n basic-resourceclaim-opaque-config pod0 -c sp-ctr0 | grep GPU_DEVICE
#
# Driver requirements:
# Profile: gpu
# GPUs: 2
#
# Cluster requirements:
# Kubernetes 1.34+

---
apiVersion: v1
kind: Namespace
metadata:
  name: basic-resourceclaim-opaque-config

---
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  namespace: basic-resourceclaim-opaque-config
  name: multiple-gpus
spec:
  spec:
---
apiVersion: v1
kind: Pod
metadata:
  namespace: basic-resourceclaim-opaque-config
  name: pod0
spec:
  containers:
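For reference, the opaque configuration carried by the `basic-resourceclaim-opaque-config` claim takes roughly this shape. This is a sketch from memory of this driver's `GpuConfig` API, so treat the group/version and field names as assumptions and check the actual file:

```yaml
# Illustrative only: embedded opaque parameters selecting TimeSlicing.
# Verify apiVersion, kind, and field names against
# demo/basic-resourceclaim-opaque-config.yaml before relying on them.
config:
- opaque:
    driver: gpu.example.com
    parameters:
      apiVersion: gpu.example.com/v1alpha1
      kind: GpuConfig
      sharing:
        strategy: TimeSlicing
        timeSlicingConfig:
          interval: Long
```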