Add vfio-gpu profile for testing KubeVirt DRA#218

Open
Sreeja1725 wants to merge 3 commits into
kubernetes-sigs:mainfrom
Sreeja1725:kubevirt-dra-profile

Conversation

@Sreeja1725

Sreeja1725 commented Jun 4, 2026

Updates #200
Updates kubevirt/kubevirt#16481
References - kubevirt/enhancements#10

Adds a vfio-gpu device profile for KubeVirt PCI passthrough via DRA, kept separate from the existing simulated gpu profile (containers/pods).

@k8s-ci-robot

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Sreeja1725
Once this PR has been reviewed and has the lgtm label, please assign bart0sh for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot

Contributor

Welcome @Sreeja1725!

It looks like this is your first PR to kubernetes-sigs/dra-example-driver 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/dra-example-driver has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes and size/XXL labels Jun 4, 2026

@Sreeja1725

Author

/cc @alaypatel07

@k8s-ci-robot k8s-ci-robot requested a review from alaypatel07 June 4, 2026 10:33
@Sreeja1725 Sreeja1725 force-pushed the kubevirt-dra-profile branch from 5082c16 to 5cece08 Compare June 4, 2026 14:43
@nojnhuh

nojnhuh commented Jun 4, 2026

Contributor

Could you please add an e2e test?

@alaypatel07 alaypatel07 left a comment

I was able to run a POC with this. I ran into issues due to a kernel version mismatch, but was able to launch the example VMI provided here.

Comment thread api/example.com/resource/vfio-gpu/v1alpha1/api.go Outdated
Comment thread demo/clusters/kind-vfio-gpu/README.md Outdated
Comment thread cmd/dra-example-kubeletplugin/state.go Outdated
Comment thread demo/clusters/kind-vfio-gpu/create-cluster.sh Outdated
Comment thread demo/clusters/kind-vfio-gpu/create-cluster.sh Outdated
Comment thread internal/profiles/vfio-gpu/sysfs.go Outdated
Comment thread internal/profiles/vfio-gpu/sysfs.go Outdated
Comment thread internal/profiles/vfio-gpu/sysfs.go Outdated
Comment thread internal/profiles/vfio-gpu/sysfs.go Outdated
Comment thread internal/profiles/vfio-gpu/sysfs.go Outdated
@alaypatel07

/wg-device-management

@alaypatel07

@nojnhuh the e2e test for this will be in kubevirt, is that okay or do we need one here?

> Could you please add an e2e test?

@nojnhuh

nojnhuh commented Jun 4, 2026

Contributor

> @nojnhuh the e2e test for this will be in kubevirt, is that okay or do we need one here?
>
> > Could you please add an e2e test?

Something here would be good. If a change here will definitely break KubeVirt tests, we want to know before we even merge. It doesn't need to be super involved, just enough to make sure the feature is wired up and making some visible impact.

@alaypatel07

To add an e2e test, the example driver will need to maintain logic for installing KubeVirt. See this: #218 (comment)

To begin with, I wanted to avoid depending on KubeVirt here and strictly have KubeVirt depend on dra-example-driver.

> Something here would be good. If a change here will definitely break KubeVirt tests, we want to know before we even merge. It doesn't need to be super involved, just enough to make sure the feature is wired up and making some visible impact.

I am wondering if we can hold off on the e2e test until a little later and do it as a follow-up.

@nojnhuh

nojnhuh commented Jun 4, 2026

Contributor

> To add an e2e test, the example driver will need to maintain logic for installing KubeVirt. See this: #218 (comment)
>
> To begin with, I wanted to avoid depending on KubeVirt here and strictly have KubeVirt depend on dra-example-driver.
>
> > Something here would be good. If a change here will definitely break KubeVirt tests, we want to know before we even merge. It doesn't need to be super involved, just enough to make sure the feature is wired up and making some visible impact.
>
> I am wondering if we can hold off on the e2e test until a little later and do it as a follow-up.

Any functionality or examples we add here should not depend on KubeVirt at all. Can we demonstrate the most basic functionality of the feature without KubeVirt?

@alaypatel07

alaypatel07 commented Jun 4, 2026

> Any functionality or examples we add here should not depend on KubeVirt at all. Can we demonstrate the most basic functionality of the feature without KubeVirt?

The DRA/Kubernetes allocation side can be tested here, and that is the only part we can demonstrate without a consumer like KubeVirt or Kata. We can show that the driver discovers the devices, publishes ResourceSlices, allocates a claim, and exposes the expected CDI/metadata artifacts.

The full feature cannot be demonstrated in this repo without KubeVirt/Kata, because what happens after allocation is consumer behavior: something has to consume the device metadata and turn the allocated PCI device into a VM/sandbox attachment.

@nojnhuh

nojnhuh commented Jun 4, 2026

Contributor

> The full feature cannot be demonstrated in this repo without KubeVirt/Kata, because what happens after allocation is consumer behavior: something has to consume the device metadata and turn the allocated PCI device into a VM/sandbox attachment.

We don't have to do anything interesting with the actual metadata. Could a simple example define a Pod that requests a device where metadata is known to be written, and then the Pod's container simply echoes the file?

@alaypatel07

alaypatel07 commented Jun 4, 2026

> We don't have to do anything interesting with the actual metadata. Could a simple example define a Pod that requests a device where metadata is known to be written, and then the Pod's container simply echoes the file?

Yes that could be done, but it's already covered by the metadata feature: kubernetes/kubernetes#137699

@nojnhuh

nojnhuh commented Jun 4, 2026

Contributor

> Yes that could be done, but it's already covered by the metadata feature: kubernetes/kubernetes#137699

We shouldn't be testing here whether Kubernetes implements the feature correctly, but we should have something that would indicate if the wiring in the example driver is missing or outright broken.

@nirdothan

Why do we need to add specific code to a project that is supposed to serve as a generic example for developers or a base project for forking?
Have you considered forking the project? We can probably host the fork in the kubevirt org if its purpose is kubevirt testing.

@alaypatel07

> Why do we need to add specific code to a project that is supposed to serve as a generic example for developers or a base project for forking?
> Have you considered forking the project? We can probably host the fork in the kubevirt org if its purpose is kubevirt testing.

@nirdothan This project is not used only as an example; it is also used for e2e testing by multiple projects:

  1. https://github.com/kubernetes-sigs/kueue/blob/fbc717fcbeb7d434a774e4be5a55012fd5df6ab9/hack/testing/e2e-common.sh#L243
  2. https://github.com/kubernetes/perf-tests/tree/master/clusterloader2/pkg/dependency/dra/manifests/dra-example-driver

Using it for KubeVirt e2e tests is a natural next step for this project.

Secondly, because this is an example for developers to fork, having the integration here will make it easier for developers to learn how to support their devices in KubeVirt.

Given these two benefits, we chose to contribute to this project rather than fork it.

Comment thread cmd/dra-example-kubeletplugin/driver.go

@alaypatel07 alaypatel07 left a comment

I did a second pass, PTAL

Comment thread demo/clusters/kind-vfio-gpu/create-cluster.sh Outdated
Comment thread demo/clusters/kind-vfio-gpu/delete-cluster.sh Outdated
Comment thread demo/clusters/kind-vfio-gpu/README.md Outdated
Comment thread demo/clusters/kind-vfio-gpu/vfio-gpu-test.yaml Outdated
Comment thread deployments/helm/dra-example-driver/templates/kubeletplugin.yaml
Comment thread deployments/helm/dra-example-driver/values.yaml
Comment thread internal/profiles/vfio-gpu/sysfs.go
Comment thread internal/profiles/vfio-gpu/vfio-gpu.go
Comment thread README.md Outdated
Comment thread README.md Outdated
@Sreeja1725 Sreeja1725 force-pushed the kubevirt-dra-profile branch from e553c71 to 617d478 Compare June 11, 2026 09:54
@k8s-ci-robot k8s-ci-robot added the needs-rebase label Jun 11, 2026
@Sreeja1725 Sreeja1725 force-pushed the kubevirt-dra-profile branch from 617d478 to 72779fe Compare June 11, 2026 09:56
@k8s-ci-robot k8s-ci-robot added and removed the needs-rebase label Jun 11, 2026
@Sreeja1725 Sreeja1725 force-pushed the kubevirt-dra-profile branch from 72779fe to 88673f0 Compare June 16, 2026 12:31
@k8s-ci-robot k8s-ci-robot removed the needs-rebase label Jun 16, 2026
@Sreeja1725

Author

@nojnhuh @nirdothan Can you please take a look?

@pohly pohly moved this from 🏗 In progress to 👀 In review in Dynamic Resource Allocation Jun 17, 2026
@Sreeja1725 Sreeja1725 force-pushed the kubevirt-dra-profile branch 2 times, most recently from 7c99a9e to bdbfb08 Compare June 19, 2026 08:01
@kubernetes-prow

Contributor

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@kubernetes-prow kubernetes-prow Bot added the needs-rebase label Jun 25, 2026
Signed-off-by: svarnam <svarnam@nvidia.com>
@Sreeja1725 Sreeja1725 force-pushed the kubevirt-dra-profile branch from bdbfb08 to df20712 Compare June 29, 2026 15:10
@kubernetes-prow

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Sreeja1725
Once this PR has been reviewed and has the lgtm label, please assign bart0sh for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Labels

cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
needs-rebase (Indicates a PR cannot be merged because it has merge conflicts with HEAD.)
size/XXL (Denotes a PR that changes 1000+ lines, ignoring generated files.)

Projects

None yet

8 participants