vgpu-monitor panics with invalid UTF-8 when collecting per-container metrics during container initialization #110
…UTF-8 DeviceUUID reads from shared memory written by libvgpu.so inside the container. Before CUDA has fully initialized, this memory is uninitialized, causing MustNewConstMetric to panic with an invalid UTF-8 label value. Validate the UUID string before emitting metrics and skip the device with a warning log if the shared memory is not yet ready. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: charford <casey@caseyharford.com>
1baa1d3 to b2490ce
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: archlitchi, charford
@charford Thanks for this PR. Does this issue occur in HAMi-device-plugin as well?
DeviceUUID() reads the GPU UUID from shared memory written by libvgpu.so inside the container. Before CUDA has fully initialised, this memory is uninitialized/zeroed. The 40-byte slice is passed directly as a Prometheus label value; if it contains non-UTF-8 bytes, prometheus.NewConstMetric returns an error and per-container metrics are silently dropped for the entire scrape cycle. Validate the UUID is valid UTF-8 after truncation. If not, log a warning and skip that device for this scrape cycle. On the next cycle, once libvgpu.so has written the real UUID to shared memory, the metric will be collected normally. Related: Project-HAMi/volcano-vgpu-device-plugin#110 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
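@archlitchi Yes, the same root cause exists in HAMi-device-plugin. The impact is less severe though: HAMi uses prometheus.NewConstMetric, which returns an error instead of panicking, so per-container metrics are silently dropped for that scrape cycle. I've opened a PR with the same fix: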
…it (#1703) DeviceUUID() reads the GPU UUID from shared memory written by libvgpu.so inside the container. Before CUDA has fully initialised, this memory is uninitialized/zeroed. The 40-byte slice is passed directly as a Prometheus label value; if it contains non-UTF-8 bytes, prometheus.NewConstMetric returns an error and per-container metrics are silently dropped for the entire scrape cycle. Validate the UUID is valid UTF-8 after truncation. If not, log a warning and skip that device for this scrape cycle. On the next cycle, once libvgpu.so has written the real UUID to shared memory, the metric will be collected normally. Related: Project-HAMi/volcano-vgpu-device-plugin#110 Signed-off-by: Casey Harford <charford@linkedin.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Problem
The vgpu-monitor panics on every scrape cycle when a HAMi-managed container is initializing, causing no per-container vGPU metrics to be emitted for any workload.
DeviceUUID() reads the GPU UUID from shared memory written by libvgpu.so inside the container. Before CUDA has fully initialized, this memory is uninitialized/zeroed. The raw bytes are sliced to 40 characters and passed directly as a Prometheus label value to MustNewConstMetric, which panics.
Fix
Validate that the UUID is valid UTF-8 before calling MustNewConstMetric. If it is not, log a warning and skip the device for this scrape cycle. On the next cycle, once libvgpu.so has written the UUID to shared memory, the metric will be collected normally.
Testing
prod-ltx1-k8s-2 node ltx1-app157276 (pool gpu-nvidia-h100-ssd-hami-kjp) with a PyTorch job requesting only volcano.sh/vgpu-* resources
libvgpu.so and ld.so.preload are correctly injected by the device plugin at container start
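For readers who want to see the shape of the check described under Fix, here is a minimal sketch using only the Go standard library. It is not the code from this PR: the helper name validUUIDLabel, the constant uuidLen, and the sample byte slices are all hypothetical, and the point where a metric would be emitted is replaced by a print statement so the example runs without the Prometheus client.

```go
package main

import (
	"fmt"
	"log"
	"unicode/utf8"
)

// uuidLen mirrors the 40-byte truncation mentioned in the description.
// The constant name is an assumption made for this sketch.
const uuidLen = 40

// validUUIDLabel is a hypothetical helper. It truncates the raw bytes read
// from shared memory to uuidLen and reports whether the result is valid
// UTF-8, i.e. whether it could be used as a label value without a panic.
func validUUIDLabel(raw []byte) (string, bool) {
	if len(raw) > uuidLen {
		raw = raw[:uuidLen]
	}
	// Validate after truncation: cutting at a byte offset can also split
	// a multi-byte rune, which utf8.Valid will reject.
	if !utf8.Valid(raw) {
		return "", false
	}
	return string(raw), true
}

func main() {
	// Made-up sample data: one well-formed UUID and one garbage buffer
	// standing in for shared memory that has not been written yet.
	devices := [][]byte{
		[]byte("GPU-00000000-0000-0000-0000-000000000000"),
		{0xff, 0xfe, 0x80, 0xc0},
	}
	for i, raw := range devices {
		uuid, ok := validUUIDLabel(raw)
		if !ok {
			log.Printf("device %d: UUID is not valid UTF-8 yet, skipping for this scrape cycle", i)
			continue
		}
		// In the real collector this is where the metric would be built.
		fmt.Printf("device %d: would emit metric with deviceuuid=%q\n", i, uuid)
	}
}
```

Note that a buffer consisting only of zero bytes is valid UTF-8 and would pass this check, so a collector that also wants to skip zeroed memory would need an additional test (for example, rejecting an empty or NUL-only string). Whether the PR does so is not stated here.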