Rename demo examples and add per-example documentation #170

k8s-ci-robot merged 1 commit into kubernetes-sigs:main
Conversation
71df894 to 4ceb0ec
/assign @nojnhuh

@alaypatel07 Do the DRA scalability tests use any of these demo manifests? Or just the driver itself, with the tests defining their own Pods, ResourceClaims, etc.?

The tests define their own pods.
| # This demo shows the DRA admin access feature with the DRA_ADMIN_ACCESS
| # environment variable.
| #
| # Key requirements:
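For context, a minimal sketch of what a manifest described by this header might look like. This is a hypothetical illustration only; the API version, namespace, and `gpu.example.com` device class name are assumptions, not copied from the actual file:

```yaml
# Hypothetical sketch only; apiVersion, namespace, and deviceClassName
# are assumptions about this example, not taken from the real manifest.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  namespace: admin-access
  name: multiple-gpus-admin
spec:
  spec:
    devices:
      requests:
      - name: all-gpus
        deviceClassName: gpu.example.com
        allocationMode: All
        adminAccess: true   # requires the DRAAdminAccess feature gate
```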
I'm wondering if it makes sense to organize these examples by "profile"? i.e. demo/admin-access.yaml -> demo/gpu/admin-access.yaml
As we add more profiles, that would help users know how to install the driver to run this particular example. On the other hand, more generic features like admin access could work with any profile, so it might become harder to find than in a flat list like this.
Let's not change that right now, but something to consider later.
| # Expected: The container has DRA_ADMIN_ACCESS=true and GPU_DEVICE env vars
| # for all available GPUs. Check with:
| #   kubectl logs -n admin-access pod0 -c ctr0 | grep DRA_ADMIN_ACCESS
| #   kubectl logs -n admin-access pod0 -c ctr0 | grep GPU_DEVICE
Definitely not something we need to do now, but it would be awesome if we could make comments like this drive the e2e tests as well as serve as human-readable documentation. Then we wouldn't need to update the test logic at all when adding an example.
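As a rough illustration of that idea (purely hypothetical, not existing project code): a tiny parser that pulls the `Check with:` commands out of an example's comment header, which an e2e script could then execute verbatim. The header format assumed here mirrors the snippet quoted in this thread.

```python
# Hypothetical sketch of comment-driven e2e checks; the "Check with:" header
# convention is taken from the example quoted above, everything else is made up.
def extract_checks(manifest_text: str) -> list[str]:
    """Return the shell commands listed after a '# ... Check with:' line."""
    checks = []
    collecting = False
    for line in manifest_text.splitlines():
        stripped = line.lstrip("# ").rstrip()
        if stripped.endswith("Check with:"):
            collecting = True
            continue
        if collecting:
            if line.startswith("#") and stripped:
                checks.append(stripped)
            else:
                collecting = False  # comment header ended
    return checks

header = """\
# Expected: The container has DRA_ADMIN_ACCESS=true and GPU_DEVICE env vars
# for all available GPUs. Check with:
#   kubectl logs -n admin-access pod0 -c ctr0 | grep DRA_ADMIN_ACCESS
#   kubectl logs -n admin-access pod0 -c ctr0 | grep GPU_DEVICE
apiVersion: v1
"""

for cmd in extract_checks(header):
    print(cmd)
```

An e2e runner could feed each extracted command to a shell and fail the test on a non-zero exit code, so adding an example needs no new test logic.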
| # - 1 ResourceClaimTemplate (multiple-gpus-admin)
| # - 1 Pod (pod0) with 1 container
| #
| # GPUs required: all available (uses allocationMode: All)
Could we organize this in a new section titled something like "driver requirements" and also include the profile required? Right now only the "gpu" profile is implemented but we'll be adding more.
Whoops I thought it was redundant and forgot that you mentioned it in the comment! I'll add it back
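One possible shape for such a header, shown only as an illustration; the section titles and the `profile: gpu` value follow the suggestions in this thread rather than any agreed format:

```yaml
# Illustrative header layout only; section names and values are not an
# agreed format, just a sketch of the suggestion above.
#
# Driver requirements:
# - profile: gpu (the only profile implemented today)
#
# Resources created:
# - 1 ResourceClaimTemplate (multiple-gpus-admin)
# - 1 Pod (pod0) with 1 container
#
# GPUs required: all available (uses allocationMode: All)
```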
| # - 1 ResourceClaimTemplate (multiple-gpus-admin)
| # - 1 Pod (pod0) with 1 container
| #
| # GPUs required: all available (uses allocationMode: All)
Could we add a new section for cluster requirements? For example, Kubernetes version and feature gates. I'll go through and comment on each example. This one needs Kubernetes 1.34+ and the DRAAdminAccess feature.
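As a hedged sketch, such a cluster requirement might translate into a kind configuration like the following (assuming the demo cluster is created with kind; the exact gate name should be verified against the target release):

```yaml
# Illustrative kind config only; verify the gate name against the
# Kubernetes release being used.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  DRAAdminAccess: true
```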
| # One pod, one container
| # Uses CEL expression selectors instead of matching on a DeviceClass
| # Selects a GPU with a specific model and at least 4Gi of memory
| # Example: CEL Expression Selectors
This one requires Kubernetes 1.34+, no additional features.
Actually I don't think this was ever part of the e2e script and is only referenced in the README, so nothing to do here for #171.
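For illustration, a CEL-based request along the lines this header describes might look like the sketch below. The `gpu.example.com` attribute and capacity names, and the model string, are assumptions; the actual example may differ:

```yaml
# Hypothetical sketch; the attribute/capacity domain names and the model
# value are assumptions, not taken from the real example.
devices:
  requests:
  - name: gpu
    selectors:
    - cel:
        expression: >-
          device.attributes["gpu.example.com"].model == "LATEST-GPU-MODEL" &&
          device.capacity["gpu.example.com"].memory.compareTo(quantity("4Gi")) >= 0
```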
| gpu-test4 pod0 0/1 Pending 0 2s
| gpu-test5 pod0 0/4 Pending 0 2s
| ...
| Clean up before the next example:
If we bump the default to 8 devices, then we wouldn't need to clean up in between each example. I've found it valuable before to have multiple examples running side-by-side at the same time to understand what's different between them. ResourceClaim status in particular is one of the more interesting parts that only exists at runtime and not in any of the manifests.
1b800b1 to db6ebeb
| kind: List
| metadata:
|   resourceVersion: ""
| ```bash
Can we try to keep the README changes scoped to only renaming the examples? I'm happy to incorporate drive-by improvements, but let's at least keep those in separate commits.
nojnhuh left a comment
I think this is good to go aside from renaming one more example.
I think it's time to remove this since it uses the old names. It would be nice if each example had a README with its own Mermaid diagram or something. Not worth blocking this PR though IMO. I opened #176 to track that separately.
| kubectl apply --filename=demo/basic-resourceclaimtemplate.yaml \
|     --filename=demo/basic-multiple-requests.yaml \
|     --filename=demo/basic-shared-claim-across-containers.yaml \
|     --filename=demo/shared-claim-across-pods.yaml \
I think it makes sense to make this a basic- one too.
Actually I don't think this was ever part of the e2e script and is only referenced in the README, so nothing to do here for #171.
…entation

Signed-off-by: William Yao <william2000yao@gmail.com>
8b3282e to 95ed732
LGTM label has been added. Details: Git tree hash: 3639fe2f601d766acf1db61d38e49cbc4768cf35

[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: nojnhuh, willie-yao. The full list of commands accepted by this bot can be found here. The pull request process is described here. Details: Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Fixes #144
This PR renames the demos and keeps the documentation for each demo in the YAML file itself. The main README is modified to run through the first 3 examples, and a separate README under `/demo` gives an overview of the rest of the demos. The default number of GPUs was also changed to 2 to more accurately reflect real-world scenarios, with the trade-off of having to clean up after running each demo.