
vgpu-monitor panics with invalid UTF-8 when collecting per-container metrics during container initialization #110

Merged
hami-robot[bot] merged 1 commit into Project-HAMi:main from charford:fix/vgpu-monitor-invalid-uuid-panic
Mar 17, 2026

@charford
Contributor

Problem

The vgpu-monitor panics on every scrape cycle when a HAMi-managed container is initializing, causing no per-container vGPU metrics to be emitted for any workload.

DeviceUUID() reads the GPU UUID from shared memory written by libvgpu.so inside the container. Before CUDA has fully initialized, this memory is uninitialized/zeroed. The raw bytes are sliced to 40 characters and passed directly as a Prometheus label value to MustNewConstMetric, which panics:

panic: label value "\x80\x00\x00\x00..." is not valid UTF-8

goroutine 426 [running]:
github.com/prometheus/client_golang/prometheus.MustNewConstMetric(...)
main.ClusterManagerCollector.Collect(...)
    /go/src/volcano.sh/devices/cmd/vgpu-monitor/metrics.go:238

Fix

Validate the UUID is valid UTF-8 before calling MustNewConstMetric. If not, log a warning and skip the device for this scrape cycle. On the next cycle, once libvgpu.so has written the UUID to shared memory, the metric will be collected normally.
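
The check described above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the names deviceUUIDLabel, collectDevices, maxUUIDLen, and the emit callback are hypothetical, and the real collector would call MustNewConstMetric where this sketch calls emit. It uses only the Go standard library.

```go
package main

import (
	"log"
	"unicode/utf8"
)

// maxUUIDLen mirrors the 40-character truncation described in the PR.
const maxUUIDLen = 40

// deviceUUIDLabel is a hypothetical helper: it truncates the raw shared-memory
// bytes to maxUUIDLen and reports whether the result is valid UTF-8.
// Validation happens after truncation, because cutting a multi-byte rune in
// half would itself produce invalid UTF-8.
func deviceUUIDLabel(raw []byte) (string, bool) {
	if len(raw) > maxUUIDLen {
		raw = raw[:maxUUIDLen]
	}
	if !utf8.Valid(raw) {
		return "", false
	}
	return string(raw), true
}

// collectDevices is a hypothetical stand-in for the per-device loop in the
// collector. Devices whose UUID is not yet valid are logged and skipped, so
// they are simply retried on the next scrape.
func collectDevices(rawUUIDs [][]byte, emit func(uuid string)) {
	for i, raw := range rawUUIDs {
		uuid, ok := deviceUUIDLabel(raw)
		if !ok {
			log.Printf("skipping device %d: UUID in shared memory is not valid UTF-8 yet", i)
			continue
		}
		emit(uuid)
	}
}
```

Note that a buffer consisting only of NUL bytes is valid UTF-8, so this check alone would let a fully zeroed UUID through; whether that case also needs to be rejected is not stated in the PR.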

Testing

  • Verified panic occurs on prod-ltx1-k8s-2 node ltx1-app157276 (pool gpu-nvidia-h100-ssd-hami-kjp) with a PyTorch job requesting only volcano.sh/vgpu-* resources
  • Confirmed libvgpu.so and ld.so.preload are correctly injected by the device plugin at container start
  • Fix prevents the panic; device will be retried on next scrape once shared memory is initialized

…UTF-8

DeviceUUID reads from shared memory written by libvgpu.so inside the
container. Before CUDA has fully initialized, this memory is uninitialized,
causing MustNewConstMetric to panic with an invalid UTF-8 label value.

Validate the UUID string before emitting metrics and skip the device with
a warning log if the shared memory is not yet ready.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: charford <casey@caseyharford.com>
@charford charford force-pushed the fix/vgpu-monitor-invalid-uuid-panic branch from 1baa1d3 to b2490ce Compare March 11, 2026 21:40
@charford charford marked this pull request as ready for review March 12, 2026 15:49
Member

@archlitchi archlitchi left a comment

/lgtm

@hami-robot
Contributor

hami-robot Bot commented Mar 17, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: archlitchi, charford


Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hami-robot hami-robot Bot added the approved label Mar 17, 2026
@hami-robot hami-robot Bot merged commit 4ef4b80 into Project-HAMi:main Mar 17, 2026
4 checks passed
@archlitchi
Member

@charford Thanks for this PR. Does this issue occur in HAMi-device-plugin as well?

charford pushed a commit to charford/HAMi that referenced this pull request Mar 19, 2026
DeviceUUID() reads the GPU UUID from shared memory written by libvgpu.so
inside the container. Before CUDA has fully initialised, this memory is
uninitialized/zeroed. The 40-byte slice is passed directly as a Prometheus
label value; if it contains non-UTF-8 bytes, prometheus.NewConstMetric
returns an error and per-container metrics are silently dropped for the
entire scrape cycle.

Validate the UUID is valid UTF-8 after truncation. If not, log a warning
and skip that device for this scrape cycle. On the next cycle, once
libvgpu.so has written the real UUID to shared memory, the metric will be
collected normally.

Related: Project-HAMi/volcano-vgpu-device-plugin#110

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@charford
Contributor Author

charford commented Mar 19, 2026

@charford Thanks for this PR. Does this issue occur in HAMi-device-plugin as well?

@archlitchi Yes, the same root cause exists in HAMi-device-plugin. DeviceUUID() is still sliced to 40 chars without UTF-8 validation before being passed to Prometheus as a label value (cmd/vGPUmonitor/metrics.go, collectContainerMetrics).

The impact is less severe though — HAMi uses prometheus.NewConstMetric via a sendMetric wrapper rather than MustNewConstMetric, so it returns an error instead of panicking. The monitor stays up but logs errors and skips per-container metrics for any device whose shared memory hasn't been initialised by libvgpu.so yet.

I've opened a PR with the same fix:

Project-HAMi/HAMi#1703

charford added a commit to charford/HAMi that referenced this pull request Mar 19, 2026
DeviceUUID() reads the GPU UUID from shared memory written by libvgpu.so
inside the container. Before CUDA has fully initialised, this memory is
uninitialized/zeroed. The 40-byte slice is passed directly as a Prometheus
label value; if it contains non-UTF-8 bytes, prometheus.NewConstMetric
returns an error and per-container metrics are silently dropped for the
entire scrape cycle.

Validate the UUID is valid UTF-8 after truncation. If not, log a warning
and skip that device for this scrape cycle. On the next cycle, once
libvgpu.so has written the real UUID to shared memory, the metric will be
collected normally.

Related: Project-HAMi/volcano-vgpu-device-plugin#110

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Casey Harford <charford@linkedin.com>
hami-robot Bot pushed a commit to Project-HAMi/HAMi that referenced this pull request Mar 26, 2026
…it (#1703)

DeviceUUID() reads the GPU UUID from shared memory written by libvgpu.so
inside the container. Before CUDA has fully initialised, this memory is
uninitialized/zeroed. The 40-byte slice is passed directly as a Prometheus
label value; if it contains non-UTF-8 bytes, prometheus.NewConstMetric
returns an error and per-container metrics are silently dropped for the
entire scrape cycle.

Validate the UUID is valid UTF-8 after truncation. If not, log a warning
and skip that device for this scrape cycle. On the next cycle, once
libvgpu.so has written the real UUID to shared memory, the metric will be
collected normally.

Related: Project-HAMi/volcano-vgpu-device-plugin#110

Signed-off-by: Casey Harford <charford@linkedin.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>