Skip to content

fix: apply filterdevices configuration during device registration#117

Open
chendave wants to merge 1 commit intoProject-HAMi:mainfrom
chendave:fix_filter
Open

fix: apply filterdevices configuration during device registration#117
chendave wants to merge 1 commit intoProject-HAMi:mainfrom
chendave:fix_filter

Conversation

@chendave
Copy link
Copy Markdown

The filterdevice configuration defined in the configmap: volcano-vgpu-node-config could be read correctly but not applied during device registration.

The feature was introduced by commit e7d4347 which add the support to filter GPU.

The issue was introduced during the v0.19.0 sync (commit 621c67b) when the old vgpu framework was removed.

This PR is tested by myself in my local environment.

The filterdevice configuration defined in the configmap: volcano-vgpu-node-config
could be read correctly but not applied during device registration.

The feature was introduced by commit e7d4347 which add the support to filter GPU.

The issue was introduced during the v0.19.0 sync (commit 621c67b) when the old vgpu
framework was removed.

Signed-off-by: Dave Chen <dave.jungler@gmail.com>
@hami-robot
Copy link
Copy Markdown
Contributor

hami-robot Bot commented Apr 24, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: chendave
Once this PR has been reviewed and has the lgtm label, please assign archlitchi for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hami-robot
Copy link
Copy Markdown
Contributor

hami-robot Bot commented Apr 24, 2026

Welcome @chendave! It looks like this is your first PR to Project-HAMi/volcano-vgpu-device-plugin 🎉

@hami-robot hami-robot Bot added the size/S label Apr 24, 2026
@chendave
Copy link
Copy Markdown
Author

cc @DSFans2014

@chendave
Copy link
Copy Markdown
Author

cc 1090646861@qq.com

Comment thread pkg/rm/device_map.go
devices := make(DeviceMap)

err := b.VisitDevices(func(i int, gpu device.Device) error {
// Get UUID for filtering
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chendave thanks for your contribution.
And I have a question that why do we need to add filtering here?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, this is where the change is needed, I think it's where the device was detected by NVML and I will remove the change from "pkg/plugin/register.go".

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not add filter during the registration phase? My concern is that if too many modifications are made in other places, it is very likely that some of them will be missed during the next upgrade.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, I will revisit this later, might not able to push the update today, must have a way to detect the real physical index if go to the register phase.

Comment thread pkg/plugin/register.go
continue
}
klog.V(3).Infoln("device ", idx, "id=", dev.ID, "health=", dev.Health)
}
Copy link
Copy Markdown
Author

@chendave chendave Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the code is problematic, I was intending to filter the device again (unnecessary), and the code was meant to move inside the loop below.

and the check here is unnecessary because, it will filter the device that is not meant to be filtered,
e.g. I have 3 GPU, and I want to filter the device with index = 0, the code in buildGPUDeviceMap already filter the device with index = 0, but the code here will loop the remaining devices, and filter the GPU with index = 1, since the idx here will still start from 0. This is wrong! I will remove those piece of changes.

Comment thread pkg/rm/device_map.go
devices := make(DeviceMap)

err := b.VisitDevices(func(i int, gpu device.Device) error {
// Get UUID for filtering
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, this is where the change is needed, I think it's where the device was detected by NVML and I will remove the change from "pkg/plugin/register.go".

@chendave
Copy link
Copy Markdown
Author

I will push an update by the end of the day.

@chendave
Copy link
Copy Markdown
Author

I will push an update by the end of the day.

no bandwidth left today, should reach an agreement on where the fix should go first, will find time to have more investigation in the coming days.

@DSFans2014
Copy link
Copy Markdown
Contributor

hi @chendave After discuss with @archlitchi , we felt it would be better to filter at pkg/rm/device_map.go to avoid issues such as health reporting after problematic cards are filtered out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants