add device data in resourceclaim by guptaNswati · Pull Request #128 · kubernetes-sigs/dra-example-driver

guptaNswati · 2025-11-13T22:40:40Z

Addressing #101 to add device attributes to demonstrate how to do resourceclaim status update.

Test run

    devices:
    - conditions:
      - lastTransitionTime: "2025-11-13T22:22:56Z"
        message: ""
        reason: GPUDeviceReady
        status: "True"
        type: Ready
      data:
        driverVersion:
          version: 1.0.0
        uuid:
          string: gpu-18db0e85-99e9-c746-8531-ffeb86328b39
      device: gpu-0

  devices:
    - conditions:
      - lastTransitionTime: "2025-11-13T22:22:56Z"
        message: ""
        reason: GPUDeviceReady
        status: "True"
        type: Ready
      data:
        driverVersion:
          version: 1.0.0
        uuid:
          string: gpu-93d37703-997c-c46f-a531-755e3e0dc2ac
      device: gpu-1

k8s-ci-robot · 2025-11-13T22:40:49Z

Welcome @guptaNswati!

It looks like this is your first PR to kubernetes-sigs/dra-example-driver 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/dra-example-driver has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

guptaNswati · 2025-11-13T22:42:01Z

cc @nojnhuh for review.

nojnhuh · 2025-12-03T22:15:40Z

FYI @guptaNswati I was hoping to get this PR merged before #129 to keep you from having to deal with those changes here. There are other changes in the works that are based on that PR though, so I may elect to merge that one first. If #129 merges first, I'm happy to help resolve conflicts with your PR here!

guptaNswati · 2025-12-03T23:50:54Z

FYI @guptaNswati I was hoping to get this PR merged before #129 to keep you from having to deal with those changes here. There are other changes in the works that are based on that PR though, so I may elect to merge that one first. If #129 merges first, I'm happy to help resolve conflicts with your PR here!

Oh i see. thank you for the headsup. i will see if i can update this tomorrow or else you can go ahead and merge #129. Main blocker is the e2e, right? we can't merge this without the tests?

nojnhuh · 2025-12-04T17:40:59Z

@guptaNswati Yes, I would like to include something in the tests for this. #128 (comment) is the other comment I would like to resolve before merging.

nojnhuh · 2025-12-05T20:39:13Z

@guptaNswati Have you pushed your recent changes? The most recent commit I see here is from when the PR was first opened.

guptaNswati · 2025-12-05T21:06:45Z

@guptaNswati Have you pushed your recent changes? The most recent commit I see here is from when the PR was first opened.

sorry pushing them rn.

guptaNswati · 2025-12-05T21:42:39Z

this is the quick test with updated changes. fixing e2e test

I1205 21:38:59.352921       1 state.go:229] Adding device attribute to claim gpu-test1/pod0-gpu-zfncn
  devices:
  - conditions:
    - lastTransitionTime: "2025-12-05T21:38:59Z"
      message: GPU Device Allocated for this request
      reason: GPUDeviceAllocated
      status: "True"
      type: Allocated
    data:
      driverVersion:
        version: 1.0.0
      model:
        string: LATEST-GPU-MODEL
      uuid:
        string: gpu-657bd2e7-f5c2-a7f2-fbaa-0d1cdc32f81b

guptaNswati · 2025-12-05T22:10:56Z

e2e test

 ./test/e2e/e2e.sh 
./test/e2e/e2e.sh: line 1: !/usr/bin/env: No such file or directory
dra-example-driver-cluster
kind-dra-1
NAME                                       STATUS   ROLES           AGE    VERSION
dra-example-driver-cluster-control-plane   Ready    control-plane   104s   v1.34.0
dra-example-driver-cluster-worker          Ready    <none>          89s    v1.34.0
node/dra-example-driver-cluster-worker condition met
Waiting for webhook to be available
resourceclaim.resource.k8s.io/webhook-test created (server dry run)
Webhook is available
namespace/gpu-test1 created
resourceclaimtemplate.resource.k8s.io/single-gpu created
pod/pod0 created
pod/pod1 created
namespace/gpu-test2 created
resourceclaimtemplate.resource.k8s.io/multiple-gpus created
pod/pod0 created
namespace/gpu-test3 created
resourceclaimtemplate.resource.k8s.io/single-gpu created
pod/pod0 created
namespace/gpu-test4 created
resourceclaim.resource.k8s.io/single-gpu created
pod/pod0 created
pod/pod1 created
namespace/gpu-test5 created
resourceclaimtemplate.resource.k8s.io/multiple-gpus created
pod/pod0 created
namespace/gpu-test6 created
resourceclaimtemplate.resource.k8s.io/single-gpu created
pod/pod0 created
pod/pod0 condition met
pod/pod1 condition met
=== Verifying ResourceClaim device data in namespace gpu-test1 ===
Found ResourceClaim gpu-test1/pod0-gpu-pqv5n, checking status.devices[0].data ...
OK: ResourceClaim gpu-test1/pod0-gpu-pqv5n has device data (uuid=gpu-e7b42cb1-4fd8-91b2-bc77-352a0c1f5747, driverVersion=1.0.0)
Pod gpu-test1/pod0, container ctr0 claimed gpu-4
Pod gpu-test1/pod1, container ctr0 claimed gpu-5
pod/pod0 condition met
Pod gpu-test2/pod0, container ctr0 claimed gpu-0
Pod gpu-test2/pod0, container ctr0 claimed gpu-6
pod/pod0 condition met
Pod gpu-test3/pod0, container ctr0 claimed gpu-1
Pod gpu-test3/pod0, container ctr1 claimed gpu-1
pod/pod0 condition met
pod/pod1 condition met
Pod gpu-test4/pod0, container ctr0 claimed gpu-2
Pod gpu-test4/pod1, container ctr0 claimed gpu-2
pod/pod0 condition met
Pod gpu-test5/pod0, container ts-ctr0 claimed gpu-3
Pod gpu-test5/pod0, container ts-ctr1 claimed gpu-3
Pod gpu-test5/pod0, container sp-ctr0 claimed gpu-7
Pod gpu-test5/pod0, container sp-ctr1 claimed gpu-7
pod/pod0 condition met
Pod gpu-test6/pod0, container init0 claimed gpu-8
Pod gpu-test6/pod0, container ctr0 claimed gpu-8
namespace "gpu-test1" deleted
resourceclaimtemplate.resource.k8s.io "single-gpu" deleted
pod "pod0" deleted
pod "pod1" deleted
namespace "gpu-test2" deleted
resourceclaimtemplate.resource.k8s.io "multiple-gpus" deleted
pod "pod0" deleted
namespace "gpu-test3" deleted
resourceclaimtemplate.resource.k8s.io "single-gpu" deleted
pod "pod0" deleted
namespace "gpu-test4" deleted
resourceclaim.resource.k8s.io "single-gpu" deleted
pod "pod0" deleted
pod "pod1" deleted
namespace "gpu-test5" deleted
resourceclaimtemplate.resource.k8s.io "multiple-gpus" deleted
pod "pod0" deleted
namespace "gpu-test6" deleted
resourceclaimtemplate.resource.k8s.io "single-gpu" deleted
pod "pod0" deleted
test ran successfully

nojnhuh · 2025-12-08T18:07:13Z


+	klog.Infof("Adding device attribute to claim %s/%s", claim.Namespace, claim.Name)
+	if err := s.applyDeviceStatus(ctx, claim.Namespace, claim.Name, devicesStatus...); err != nil {
+		klog.Warningf("Failed to update device attributes for claim %s/%s: %v", claim.Namespace, claim.Name, err)


I don't think we need to address this now, but we should think about how to handle errors here besides only logging them. Retries will help in some cases (HTTP 409), but we should prevent blocking subsequent NodePrepareResources calls on retries that may never succeed and I'm not sure what an appropriate amount of time would be to spend on that. I wonder if a workqueue running async would be the most appropriate way to handle status updates?

@guptaNswati Could you please open an issue to track improvements to error handling here?

You also need this to de-couple the allocation times from apiserver update latency. May be as a hacky first implementation you can do this in a goroutine?

I guess the main question to ask is, is this update required for the driver to declare allocation as complete or is the information in this update auxiliary to allocation?

guptaNswati · 2026-01-14T19:47:11Z

sorry was out for holidays. looking at it now.

guptaNswati · 2026-01-14T21:22:02Z

Quick test after the rebase:

$ kubectl get resourceclaim pod1-gpu-ksml7 -n gpu-test1 -o yaml
  devices:
  - conditions: null
    data:
      driverVersion:
        version: 1.0.0
      model:
        string: LATEST-GPU-MODEL
      uuid:
        string: gpu-93d37703-997c-c46f-a531-755e3e0dc2ac
    device: gpu-1
    driver: gpu.example.com

guptaNswati · 2026-01-15T23:52:34Z

@nojnhuh i made the suggested changes. Its ready to review again.

rebased on main
added cli flag: add device data in resourceclaim #128 (comment)
updated to converting wrapper and update from SSA: add device data in resourceclaim #128 (comment)

pohly · 2026-01-22T12:17:28Z

/assign @nojnhuh

nojnhuh · 2026-02-04T20:43:45Z

 			EnvVars:     []string{"DRIVER_NAME"},
 		},
+		&cli.BoolFlag{
+			Name:        "device-attribute",


Could we call this something like gpu-device-status instead to avoid confusing with #125?

nojnhuh · 2026-02-04T20:51:44Z

 	return resultConfigs, nil
 }
+
+func (s *DeviceState) buildDeviceStatus(res resourceapi.DeviceRequestAllocationResult) resourceapi.AllocatedDeviceStatus {


Let's move this logic into the gpu profile so we can implement this differently for different kinds of devices. See #129 for more about the "profile" concept.

Ideally we shouldn't have to implement this for every kind of device.

okay. looking to it. need sometime to understand the refractor. will be back at it.

nojnhuh · 2026-02-04T20:58:03Z

 		checkpointManager: checkpointManager,
 		configDecoder:     decoder,
 		configHandler:     configHandler,
+		config:            config,


Let's keep NewDeviceState where we pull everything out of the Config that we need vs. storing the whole Config on the DeviceState. Config contains user-facing config like command line flags, so let's not plumb that too far down.

guptaNswati · 2026-02-14T00:19:47Z

Ack. I will work on it next week.

nojnhuh · 2026-04-14T18:56:57Z

We'll also need to handle this to support Kubernetes 1.36: kubernetes/kubernetes#138149

Signed-off-by: Swati Gupta <swatig@nvidia.com>

guptaNswati · 2026-05-22T22:41:40Z

@nojnhuh sorry took a long time to address this #128 (comment)

Had to deal with rebase and refractor. New changes are tested and mainly in gpu.go. PTAL. I dont wana loose it for months again.

go test -tags=e2e -v ./test/e2e/... -ginkgo.focus="should allocate 1 distinct GPU per pod"
=== RUN   TestE2e
Running Suite: E2E Suite - /home/nvidia/dra-example-driver/test/e2e
===================================================================
Random Seed: 1779489074

Will run 1 of 14 specs
•SSSSSSSSSSSSS

Ran 1 of 14 Specs in 20.226 seconds
SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 13 Skipped
--- PASS: TestE2e (20.23s)
PASS
ok  	sigs.k8s.io/dra-example-driver/test/e2e	20.269s

nojnhuh · 2026-05-26T13:40:41Z

@willie-yao Could you please take a look at the e2e changes here?

looks good! comment below was addressed

willie-yao · 2026-05-26T18:02:37Z

+				g.Expect(data).To(HaveKey("driverVersion"),
+					"ResourceClaim %s/%s device %s status.data is missing driverVersion: %v", namespace, *rcs.ResourceClaimName, d.Device, data)
+				g.Expect(data["driverVersion"].VersionValue).NotTo(BeNil())
+				g.Expect(*data["driverVersion"].VersionValue).NotTo(BeEmpty())


Earlier there is a comment saying GPUDeviceStatus includes uuid, model, and driverVersion, but it seems like model isn't being asserted here.

good catch. fixing it

@willie-yao done. PTAL

Signed-off-by: Swati Gupta <swatig@nvidia.com> add model in test assertion Signed-off-by: Swati Gupta <swatig@nvidia.com>

nojnhuh

/lgtm
/approve
/label tide/merge-method-squash

k8s-ci-robot · 2026-05-27T20:04:00Z

LGTM label has been added.

Details

Git tree hash: 8dcdbbba4c44927b3ac35dfb44f8a39f56e48883

k8s-ci-robot · 2026-05-27T20:04:01Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: guptaNswati, nojnhuh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [nojnhuh]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

github-project-automation Bot added this to Dynamic Resource Allocation Nov 13, 2025

github-project-automation Bot moved this to 🆕 New in Dynamic Resource Allocation Nov 13, 2025

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Nov 13, 2025

k8s-ci-robot requested review from klueska and pohly November 13, 2025 22:40

k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Nov 13, 2025

nojnhuh moved this from 🆕 New to 👀 In review in Dynamic Resource Allocation Nov 14, 2025

nojnhuh mentioned this pull request Nov 21, 2025

Introduce device profiles #129

Merged

nojnhuh reviewed Nov 21, 2025

View reviewed changes

nojnhuh reviewed Dec 1, 2025

View reviewed changes

Comment thread cmd/dra-example-kubeletplugin/state.go Outdated

Comment thread cmd/dra-example-kubeletplugin/state.go Outdated

k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 5, 2025

guptaNswati force-pushed the resourceclaim-status branch 2 times, most recently from a03108c to b77e105 Compare December 5, 2025 22:18

guptaNswati requested a review from nojnhuh December 5, 2025 22:19

guptaNswati mentioned this pull request Dec 5, 2025

Add asynchronous device health conditions to ResourceClaim status #131

Open

nojnhuh reviewed Dec 8, 2025

View reviewed changes

guptaNswati mentioned this pull request Dec 10, 2025

Improve error handling for device status updates failures #133

Open

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 10, 2026

guptaNswati force-pushed the resourceclaim-status branch from 5c158c0 to cc7f77e Compare January 14, 2026 21:19

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 14, 2026

guptaNswati requested a review from nojnhuh January 20, 2026 22:38

k8s-ci-robot assigned nojnhuh Jan 22, 2026

nojnhuh reviewed Feb 4, 2026

View reviewed changes

nojnhuh mentioned this pull request Feb 10, 2026

Demonstrate how to update ResourceClaim status #101

Open

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 14, 2026

guptaNswati force-pushed the resourceclaim-status branch from c52ea98 to 9f5e69d Compare February 17, 2026 19:16

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 17, 2026

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 14, 2026

add device data in resourceclaim

affa378

Signed-off-by: Swati Gupta <swatig@nvidia.com>

guptaNswati force-pushed the resourceclaim-status branch from f0b910d to 5f8bc0d Compare May 22, 2026 22:36

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 22, 2026

nojnhuh reviewed May 26, 2026

View reviewed changes

willie-yao reviewed May 26, 2026

View reviewed changes

move build device status to gpu profile

3be2ba5

Signed-off-by: Swati Gupta <swatig@nvidia.com> add model in test assertion Signed-off-by: Swati Gupta <swatig@nvidia.com>

guptaNswati force-pushed the resourceclaim-status branch from 48d995b to 3be2ba5 Compare May 27, 2026 19:12

nojnhuh approved these changes May 27, 2026

View reviewed changes

k8s-ci-robot added tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels May 27, 2026

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 27, 2026

k8s-ci-robot merged commit 04cf504 into kubernetes-sigs:main May 27, 2026
6 checks passed

pohly moved this from 👀 In review to ✅ Done in Dynamic Resource Allocation May 28, 2026

Uh oh!

Conversation

guptaNswati commented Nov 13, 2025

Uh oh!

k8s-ci-robot commented Nov 13, 2025

Uh oh!

guptaNswati commented Nov 13, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

nojnhuh commented Dec 3, 2025

Uh oh!

guptaNswati commented Dec 3, 2025

Uh oh!

nojnhuh commented Dec 4, 2025

Uh oh!

nojnhuh commented Dec 5, 2025

Uh oh!

guptaNswati commented Dec 5, 2025

Uh oh!

guptaNswati commented Dec 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

guptaNswati commented Dec 5, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

guptaNswati commented Jan 14, 2026

Uh oh!

guptaNswati commented Jan 14, 2026

Uh oh!

guptaNswati commented Jan 15, 2026

Uh oh!

pohly commented Jan 22, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

guptaNswati commented Feb 14, 2026

Uh oh!

nojnhuh commented Apr 14, 2026

Uh oh!

guptaNswati commented May 22, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

guptaNswati May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nojnhuh left a comment

Choose a reason for hiding this comment

Uh oh!

guptaNswati commented Dec 5, 2025 •

edited

Loading

guptaNswati May 27, 2026 •

edited

Loading