vgpu-monitor panics with invalid UTF-8 when collecting per-container metrics during container initialization by charford · Pull Request #110 · Project-HAMi/volcano-vgpu-device-plugin

charford · 2026-03-11T21:40:05Z

Problem

The vgpu-monitor panics on every scrape cycle when a HAMi-managed container is initializing, causing no per-container vGPU metrics to be emitted for any workload.

DeviceUUID() reads the GPU UUID from shared memory written by libvgpu.so inside the container. Before CUDA has fully initialized, this memory is uninitialized/zeroed. The raw bytes are sliced to 40 characters and passed directly as a Prometheus label value to MustNewConstMetric, which panics:

panic: label value "\x80\x00\x00\x00..." is not valid UTF-8

goroutine 426 [running]:
github.com/prometheus/client_golang/prometheus.MustNewConstMetric(...)
main.ClusterManagerCollector.Collect(...)
    /go/src/volcano.sh/devices/cmd/vgpu-monitor/metrics.go:238

Fix

Validate the UUID is valid UTF-8 before calling MustNewConstMetric. If not, log a warning and skip the device for this scrape cycle. On the next cycle, once libvgpu.so has written the UUID to shared memory, the metric will be collected normally.
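The check described above can be sketched with the standard library alone. This is a minimal illustration, not the merged diff: `deviceUUIDLabel`, `uuidLen`, and the sample UUID are hypothetical, and the real collector would log a warning and `continue` rather than print.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// uuidLen mirrors the 40-byte truncation described in the PR.
const uuidLen = 40

// deviceUUIDLabel is a hypothetical helper (not taken from the PR diff).
// It truncates the raw shared-memory bytes to uuidLen and reports whether
// the result is safe to use as a Prometheus label value.
func deviceUUIDLabel(raw []byte) (string, bool) {
	if len(raw) > uuidLen {
		raw = raw[:uuidLen]
	}
	// Validate after truncation, since cutting at a fixed byte offset
	// can itself split a multi-byte sequence.
	if !utf8.Valid(raw) {
		return "", false
	}
	return string(raw), true
}

func main() {
	devices := [][]byte{
		[]byte("GPU-00000000-1111-2222-3333-444444444444"), // placeholder UUID, as if written by libvgpu.so
		{0x80, 0x00, 0x00, 0x00},                           // bytes from the panic message: not valid UTF-8
	}
	for i, raw := range devices {
		uuid, ok := deviceUUIDLabel(raw)
		if !ok {
			// In the monitor this would be a warning log followed by `continue`.
			fmt.Printf("device %d: UUID is not valid UTF-8 yet, skipping this scrape\n", i)
			continue
		}
		fmt.Printf("device %d: emitting metrics for %s\n", i, uuid)
	}
}
```

Note that a buffer consisting only of zero bytes is valid UTF-8 and would pass this check; the guard only rejects byte sequences such as `\x80` that cannot be decoded.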

Testing

Verified panic occurs on prod-ltx1-k8s-2 node ltx1-app157276 (pool gpu-nvidia-h100-ssd-hami-kjp) with a PyTorch job requesting only volcano.sh/vgpu-* resources
Confirmed libvgpu.so and ld.so.preload are correctly injected by the device plugin at container start
Fix prevents the panic; device will be retried on next scrape once shared memory is initialized

…UTF-8

DeviceUUID reads from shared memory written by libvgpu.so inside the container. Before CUDA has fully initialized, this memory is uninitialized, causing MustNewConstMetric to panic with an invalid UTF-8 label value.

Validate the UUID string before emitting metrics and skip the device with a warning log if the shared memory is not yet ready.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: charford <casey@caseyharford.com>

archlitchi

/lgtm

hami-robot · 2026-03-17T10:29:42Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: archlitchi, charford

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [archlitchi]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

archlitchi · 2026-03-17T10:30:13Z

@charford Thanks for this PR. Does this issue occur in HAMi-device-plugin as well?

charford · 2026-03-19T17:51:08Z

> @charford Thanks for this PR. Does this issue occur in HAMi-device-plugin as well?

@archlitchi Yes, the same root cause exists in HAMi-device-plugin. DeviceUUID() is still sliced to 40 chars without UTF-8 validation before being passed to Prometheus as a label value (cmd/vGPUmonitor/metrics.go, collectContainerMetrics).

The impact is less severe though — HAMi uses prometheus.NewConstMetric via a sendMetric wrapper rather than MustNewConstMetric, so it returns an error instead of panicking. The monitor stays up but logs errors and skips per-container metrics for any device whose shared memory hasn't been initialised by libvgpu.so yet.
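The difference in failure mode can be illustrated with a small stdlib-only sketch. `newMetric` and `mustNewMetric` are stand-ins written for this example; they model only the label-value UTF-8 check and are not the Prometheus client API, where `MustNewConstMetric` panics on the error that `NewConstMetric` returns.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// newMetric is a stand-in for an error-returning constructor in the style
// of prometheus.NewConstMetric. It models only the UTF-8 label check.
func newMetric(label string) (string, error) {
	if !utf8.ValidString(label) {
		return "", errors.New("label value is not valid UTF-8")
	}
	return fmt.Sprintf("metric{deviceuuid=%q}", label), nil
}

// mustNewMetric is a stand-in for a Must-style wrapper: it performs the
// same check but panics instead of returning the error.
func mustNewMetric(label string) string {
	m, err := newMetric(label)
	if err != nil {
		panic(err)
	}
	return m
}

// collectWithMust reports whether the Must-style path panicked for label.
func collectWithMust(label string) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	mustNewMetric(label)
	return false
}

func main() {
	bad := "\x80\x00\x00\x00" // bytes taken from the panic message in #110

	// Error-returning path: the caller can log and skip the device.
	if _, err := newMetric(bad); err != nil {
		fmt.Println("error path: skipping device:", err)
	}

	// Must-style path: without a recover, this would take the process down.
	fmt.Println("must path panicked:", collectWithMust(bad))
}
```

With the error-returning form the caller decides what to do per device; validating the UUID before construction avoids both the logged error and the panic.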

I've opened a PR with the same fix:

Project-HAMi/HAMi#1703

DeviceUUID() reads the GPU UUID from shared memory written by libvgpu.so inside the container. Before CUDA has fully initialised, this memory is uninitialized/zeroed. The 40-byte slice is passed directly as a Prometheus label value; if it contains non-UTF-8 bytes, prometheus.NewConstMetric returns an error and per-container metrics are silently dropped for the entire scrape cycle.

Validate the UUID is valid UTF-8 after truncation. If not, log a warning and skip that device for this scrape cycle. On the next cycle, once libvgpu.so has written the real UUID to shared memory, the metric will be collected normally.

Related: Project-HAMi/volcano-vgpu-device-plugin#110

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Casey Harford <charford@linkedin.com>

…it (#1703)

DeviceUUID() reads the GPU UUID from shared memory written by libvgpu.so inside the container. Before CUDA has fully initialised, this memory is uninitialized/zeroed. The 40-byte slice is passed directly as a Prometheus label value; if it contains non-UTF-8 bytes, prometheus.NewConstMetric returns an error and per-container metrics are silently dropped for the entire scrape cycle.

Validate the UUID is valid UTF-8 after truncation. If not, log a warning and skip that device for this scrape cycle. On the next cycle, once libvgpu.so has written the real UUID to shared memory, the metric will be collected normally.

Related: Project-HAMi/volcano-vgpu-device-plugin#110

Signed-off-by: Casey Harford <charford@linkedin.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

hami-robot Bot added the do-not-merge/work-in-progress label Mar 11, 2026

hami-robot Bot requested review from SataQiu and archlitchi March 11, 2026 21:40

hami-robot Bot added dco-signoff: no size/XS labels Mar 11, 2026

charford force-pushed the fix/vgpu-monitor-invalid-uuid-panic branch from 1baa1d3 to b2490ce March 11, 2026 21:40

hami-robot Bot added dco-signoff: yes and removed dco-signoff: no labels Mar 11, 2026

charford marked this pull request as ready for review March 12, 2026 15:49

hami-robot Bot removed the do-not-merge/work-in-progress label Mar 12, 2026

archlitchi approved these changes Mar 17, 2026

hami-robot Bot assigned archlitchi Mar 17, 2026

hami-robot Bot added the lgtm label Mar 17, 2026

hami-robot Bot added the approved label Mar 17, 2026

hami-robot Bot merged commit 4ef4b80 into Project-HAMi:main Mar 17, 2026
4 checks passed

charford mentioned this pull request Mar 19, 2026

vGPUmonitor: skip devices with invalid UTF-8 UUID during container init Project-HAMi/HAMi#1703

Merged
