Summary
We run HAMi-core (libvgpu.so, LD_PRELOAD-injected via the volcano-vgpu-device-plugin postStart hook) to fractionally share NVIDIA RTX 3090 24 GiB cards across pods. A vLLM v0.19.1 deployment consistently OOMs inside profile_cudagraph_memory() on these pods, and the failure pattern strongly suggests a memory-accounting mismatch between what HAMi-core exposes inside the container and what vLLM (and presumably any consumer that pre-allocates against a percentage of "GPU capacity") computes.
The core question for HAMi: how should a downstream consumer query the clamped visible memory budget of a HAMi slice in a way that all pre-eager allocators (vLLM, PyTorch, training frameworks) will pick up correctly? Specifically, is cudaMemGetInfo the only intercepted surface, or are NVML calls (nvmlDeviceGetMemoryInfo / nvmlDeviceGetMemoryInfo_v2) also clamped? If only the CUDA Runtime path is hijacked, then any consumer that reads nvmlDeviceGetMemoryInfo for capacity (which reports physical) will systematically over-allocate by the clamp delta.
We are filing a parallel issue against vllm-project/vllm for vLLM's own accounting; this issue is about the HAMi-side API surface and the recommended interrogation pattern.
Environment
- Kubernetes cluster (Talos Linux), Volcano scheduler.
- HAMi-core LD_PRELOAD (libvgpu.so) seeded into pods by the volcano-vgpu-device-plugin postStart hook (active and verified; slice enforcement works for normal allocations).
- 6 GPU workers each with 1× NVIDIA RTX 3090 24 GiB (Ampere, sm_86), driver 535-class, CUDA 12.x.
- Pod resource claim: volcano.sh/vgpu-number: 1, volcano.sh/vgpu-memory: 24576 (full card), volcano.sh/vgpu-cores: 100.
- Workload: vLLM v0.19.1 serving an AutoRound-INT4 model with FP8 (e5m2) KV cache, FlashInfer attention backend, CUDA graphs enabled.
Observation
Inside the container, cudaMemGetInfo reports a clamped visible total around 23.56 GiB (rather than the physical 24.00 GiB). The clamp delta (~440 MiB) appears to be the runtime overhead reservation HAMi-core takes inside the container — that part is expected and reasonable.
The problem surfaces during vLLM startup:
- PyTorch grows its caching allocator to a process peak of ~21.68 GiB with gpu_memory_utilization set to 0.92. For reference, 0.92 of the physical 24 GiB is ~22.08 GiB, and 0.92 of the clamped visible 23.56 GiB is ~21.68 GiB. The two candidate budgets differ by only ~0.4 GiB, roughly the size of the clamp delta, so the peak alone does not establish which total vLLM is sizing against.
- After PyTorch settles, vLLM enters profile_cudagraph_memory() and tries to allocate ~1.53 GiB of additional working set for graph capture.
- Free memory remaining inside the slice is only ~1.39 GiB.
- OOM, every time, deterministic.
The ~140 MiB shortfall (~1.53 GiB requested vs ~1.39 GiB free) is on the same order as the clamp delta, which is what makes us suspect the accounting mismatch rather than a vLLM-internal bug.
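The figures above can be cross-checked with a few lines of arithmetic. This is a sketch that uses only the rounded numbers quoted in this issue; the variable names are ours, nothing here queries a GPU, and the derived MiB values inherit the rounding of the GiB inputs.

```python
# Figures quoted in the Observation section (GiB, rounded as reported).
PHYSICAL_TOTAL_GIB = 24.00
CLAMPED_TOTAL_GIB = 23.56
GPU_MEMORY_UTILIZATION = 0.92
GRAPH_CAPTURE_REQUEST_GIB = 1.53
FREE_IN_SLICE_GIB = 1.39


def gib_to_mib(gib: float) -> float:
    """Convert GiB to MiB (1 GiB = 1024 MiB)."""
    return gib * 1024.0


def budget_gib(total_gib: float, fraction: float) -> float:
    """Budget a consumer would compute as fraction * total."""
    return total_gib * fraction


clamp_delta_gib = PHYSICAL_TOTAL_GIB - CLAMPED_TOTAL_GIB
shortfall_gib = GRAPH_CAPTURE_REQUEST_GIB - FREE_IN_SLICE_GIB
budget_physical_gib = budget_gib(PHYSICAL_TOTAL_GIB, GPU_MEMORY_UTILIZATION)
budget_clamped_gib = budget_gib(CLAMPED_TOTAL_GIB, GPU_MEMORY_UTILIZATION)

print(f"clamp delta:           {clamp_delta_gib:.2f} GiB ({gib_to_mib(clamp_delta_gib):.0f} MiB)")
print(f"graph-capture deficit: {shortfall_gib:.2f} GiB ({gib_to_mib(shortfall_gib):.0f} MiB)")
print(f"budget vs physical:    {budget_physical_gib:.2f} GiB")
print(f"budget vs clamped:     {budget_clamped_gib:.2f} GiB")
```

Swapping in exact byte counts from the failing pod instead of the rounded GiB figures would make the comparison against the observed allocator peak more precise.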
Hypothesis
vLLM (via PyTorch / its memory profiler) is computing the budget by reading nvmlDeviceGetMemoryInfo for the device's total capacity, which — if HAMi-core does not intercept that NVML call — returns the physical 24 GiB. The PyTorch caching allocator's peak then reflects that physical-based budget, which leaves only enough headroom for cudagraph profiling on a non-clamped GPU. On a HAMi-clamped slice the headroom evaporates because the clamp delta was never accounted for.
If this hypothesis is correct, then any consumer following the documented PyTorch / NVIDIA pattern of "query NVML for total, multiply by utilization fraction, allocate that much" will systematically over-allocate on HAMi slices. CUDA Runtime's cudaMemGetInfo is the right answer (it reports clamped-and-aware free/total), but it is not the canonical capacity-query API for many frameworks.
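One way to test the hypothesis from inside a pod is to read the total from both surfaces and compare them. The sketch below assumes that torch and pynvml are installed in the container and that the CUDA and NVML device indices refer to the same card (this may not hold when CUDA_VISIBLE_DEVICES reorders devices). MemoryView, clamp_delta_bytes, over_allocation_bytes, and query_memory_view are illustrative names of ours, not part of HAMi, vLLM, or PyTorch.

```python
from typing import NamedTuple


class MemoryView(NamedTuple):
    cuda_total: int  # bytes, as reported by cudaMemGetInfo (via torch)
    nvml_total: int  # bytes, as reported by nvmlDeviceGetMemoryInfo (via pynvml)


def clamp_delta_bytes(view: MemoryView) -> int:
    """Bytes by which the NVML total exceeds the CUDA Runtime total."""
    return view.nvml_total - view.cuda_total


def over_allocation_bytes(view: MemoryView, fraction: float) -> int:
    """Extra bytes budgeted when sizing against the NVML total."""
    return int(fraction * view.nvml_total) - int(fraction * view.cuda_total)


def query_memory_view(cuda_index: int = 0, nvml_index: int = 0) -> MemoryView:
    """Read both totals on a live GPU; imports are lazy so the helpers
    above stay usable on machines without torch or pynvml."""
    import pynvml
    import torch

    _free, cuda_total = torch.cuda.mem_get_info(cuda_index)
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(nvml_index)
        nvml_total = pynvml.nvmlDeviceGetMemoryInfo(handle).total
    finally:
        pynvml.nvmlShutdown()
    return MemoryView(cuda_total=int(cuda_total), nvml_total=int(nvml_total))
```

Running `view = query_memory_view()` in the affected pod and printing `clamp_delta_bytes(view)` would show directly whether the two surfaces disagree: a value near zero suggests NVML is clamped as well, while a positive value on the order of the clamp delta would support the hypothesis.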
Why this matters
Sliced-GPU sharing is a primary HAMi use case. The set of stacks that pre-allocate against a percentage of "device capacity" (vLLM, SGLang, certain training frameworks, anything calling torch.cuda.mem_get_info / nvmlDeviceGetMemoryInfo) is large and growing. A 1–2% accounting drift is enough to flip CUDA-graph capture from "fits comfortably" to "deterministic OOM," and the failure mode is opaque to operators because the inside-the-container nvidia-smi looks fine until the moment of allocation failure.
Asks
- Confirm or deny the NVML interception story. Does HAMi-core today intercept nvmlDeviceGetMemoryInfo / nvmlDeviceGetMemoryInfo_v2 and return the clamped total, or only cudaMemGetInfo? A pointer to the relevant source file would be ideal; the README documents cudaMemGetInfo interception explicitly but does not enumerate the NVML surface.
- Document the recommended query pattern for downstream consumers who want to "right-size" their budgets against the actual usable slice. Is the canonical answer "always use cudaMemGetInfo," or is there a preferred env var / sysfs / NVML extension that exposes the configured CUDA_DEVICE_MEMORY_LIMIT directly?
- Consider extending NVML interception if it is not already present. The cost of intercepting nvmlDeviceGetMemoryInfo to return clamped totals is small, and it would silently fix a wide class of pre-allocator misbudgeting bugs in upstream consumers without requiring them to change their code.
We are happy to test patches against our cluster and report back — the reproduction is deterministic and runs on a single 3090.
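To make the second ask concrete, here is the kind of interim pattern we would fall back on without guidance: take the total from cudaMemGetInfo and cap it by CUDA_DEVICE_MEMORY_LIMIT when that variable is set. This is a sketch, not a claim about HAMi's intended interface. The accepted value format (a plain byte count or a k/m/g suffix, binary units) is our assumption, and parse_limit, usable_total_bytes, and budget_bytes are hypothetical helper names.

```python
import os
import re
from typing import Mapping, Optional

# Assumed suffixes for CUDA_DEVICE_MEMORY_LIMIT (binary units); not confirmed.
_UNIT_BYTES = {"": 1, "k": 1024, "m": 1024**2, "g": 1024**3}


def parse_limit(text: str) -> Optional[int]:
    """Parse values like '1g', '1024m', or '1073741824'; None if unparseable."""
    match = re.fullmatch(r"\s*(\d+)\s*([kmg]?)b?\s*", text.lower())
    if match is None:
        return None
    return int(match.group(1)) * _UNIT_BYTES[match.group(2)]


def usable_total_bytes(cuda_total: int, env: Mapping[str, str] = os.environ) -> int:
    """Cap the cudaMemGetInfo total by the configured limit, if one is set."""
    raw = env.get("CUDA_DEVICE_MEMORY_LIMIT")
    limit = parse_limit(raw) if raw else None
    if limit is None or limit <= 0:
        return cuda_total
    return min(cuda_total, limit)


def budget_bytes(cuda_total: int, fraction: float,
                 env: Mapping[str, str] = os.environ) -> int:
    """Budget = utilization fraction of the usable (clamp-aware) total."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return int(usable_total_bytes(cuda_total, env) * fraction)
```

On a GPU node the first argument would come from `torch.cuda.mem_get_info()[1]`. Taking the minimum of the two sources is deliberately conservative; if HAMi recommends a different surface, we would switch to it.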
Cross-reference
A parallel issue has been filed against vllm-project/vllm covering vLLM's own pre-eager budget accounting in profile_cudagraph_memory(). The two issues should land on the same root cause from opposite sides — vLLM should query a clamp-aware surface, and HAMi should ensure that surface is the natural one. Link will be added in a comment once the upstream issue URL is available.