[Question / Possible Bug] visible-memory clamp interaction with vLLM cudagraph profiling on Ampere #181

@bjornmage

Summary

We run HAMi-core (libvgpu.so, LD_PRELOAD-injected via the volcano-vgpu-device-plugin postStart hook) to fractionally share NVIDIA RTX 3090 24 GiB cards across pods. A vLLM v0.19.1 deployment consistently OOMs inside profile_cudagraph_memory() on these pods, and the failure pattern strongly suggests a memory-accounting mismatch between what HAMi-core exposes inside the container and what vLLM (and presumably any consumer that pre-allocates against a percentage of "GPU capacity") computes.

The core question for HAMi: how should a downstream consumer query the clamped visible memory budget of a HAMi slice in a way that all pre-eager allocators (vLLM, PyTorch, training frameworks) will pick up correctly? Specifically, is cudaMemGetInfo the only intercepted surface, or are NVML calls (nvmlDeviceGetMemoryInfo / nvmlDeviceGetMemoryInfo_v2) also clamped? If only the CUDA Runtime path is hijacked, then any consumer that reads nvmlDeviceGetMemoryInfo for capacity (which reports physical) will systematically over-allocate by the clamp delta.
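For anyone who wants to check which surface is clamped on their own slice, here is a minimal diagnostic sketch. It assumes `torch` and `pynvml` (the `nvidia-ml-py` package) are installed in the container and that device index 0 refers to the same GPU in both libraries, which holds for a single-GPU pod but not necessarily elsewhere. The helper names `read_totals` and `clamp_delta_mib` are ours, not part of HAMi-core or vLLM.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def clamp_delta_mib(nvml_total_bytes: int, cuda_total_bytes: int) -> float:
    """Return how much larger the NVML total is than the CUDA Runtime total, in MiB."""
    return (nvml_total_bytes - cuda_total_bytes) / MIB


def read_totals(device_index: int = 0):
    """Return (cuda_total_bytes, nvml_total_bytes) for one device.

    Imports are local so this module can be imported on a machine without a GPU.
    Assumption: the same index names the same GPU in torch and in NVML.
    """
    import pynvml
    import torch

    # torch.cuda.mem_get_info wraps cudaMemGetInfo and returns (free, total).
    _free, cuda_total = torch.cuda.mem_get_info(device_index)

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        nvml_total = pynvml.nvmlDeviceGetMemoryInfo(handle).total
    finally:
        pynvml.nvmlShutdown()
    return cuda_total, nvml_total
```

Calling `clamp_delta_mib(nvml_total, cuda_total)` on the pair returned by `read_totals()` inside a HAMi pod should give a value near zero if both surfaces are clamped, and a value near the clamp delta if only the CUDA path is.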

We are filing a parallel issue against vllm-project/vllm for vLLM's own accounting; this issue is about the HAMi-side API surface and the recommended interrogation pattern.

Environment

  • Kubernetes cluster (Talos Linux), Volcano scheduler.
  • HAMi-core LD_PRELOAD (libvgpu.so) seeded into pods by the volcano-vgpu-device-plugin postStart hook (active and verified — slice enforcement works for normal allocations).
  • 6 GPU workers each with 1× NVIDIA RTX 3090 24 GiB (Ampere, sm_86), driver 535-class, CUDA 12.x.
  • Pod resource claim: volcano.sh/vgpu-number: 1, volcano.sh/vgpu-memory: 24576 (full card), volcano.sh/vgpu-cores: 100.
  • Workload: vLLM v0.19.1 serving an AutoRound-INT4 model with FP8 (e5m2) KV cache, FlashInfer attention backend, CUDA graphs enabled.

Observation

Inside the container, cudaMemGetInfo reports a clamped visible total of about 23.56 GiB rather than the physical 24.00 GiB. The clamp delta (~0.44 GiB, roughly 450 MiB) appears to be the runtime overhead reservation HAMi-core takes inside the container; that part is expected and reasonable.

The problem surfaces during vLLM startup:

  1. PyTorch grows its caching allocator to a process peak of ~21.68 GiB, with vLLM's gpu_memory_utilization set to 0.92. For reference, 0.92 × 24.00 GiB (physical) is ~22.08 GiB, while 0.92 × 23.56 GiB (clamped visible) is ~21.68 GiB, so the peak by itself tracks the clamped figure more closely than the physical one. The peak alone therefore does not confirm that vLLM sizes against the physical total; our suspicion of a physical-capacity read rests on the headroom shortfall in steps 2 to 4.
  2. After PyTorch settles, vLLM enters profile_cudagraph_memory() and tries to allocate ~1.53 GiB of additional working set for graph capture.
  3. Free memory remaining inside the slice is only ~1.39 GiB.
  4. OOM, every time, deterministic.

The ~140 MiB shortfall (~1.53 GiB requested vs ~1.39 GiB free) is on the same order as the clamp delta, which is what makes us suspect the accounting mismatch rather than a vLLM-internal bug.
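The arithmetic behind these figures can be reproduced in a few lines. The sketch below uses only numbers quoted in this issue (the variable names are ours) and prints both candidate budgets so the observed peak can be compared against each.

```python
MIB_PER_GIB = 1024

# Figures as reported in this issue.
physical_gib = 24.00      # physical capacity of the RTX 3090
visible_gib = 23.56       # clamped total reported by cudaMemGetInfo
utilization = 0.92        # vLLM gpu_memory_utilization
peak_gib = 21.68          # observed PyTorch allocator peak
requested_gib = 1.53      # extra working set requested during graph profiling
free_gib = 1.39           # free memory left inside the slice

# Candidate budgets: the same fraction applied to each possible "total".
budget_physical_gib = utilization * physical_gib
budget_visible_gib = utilization * visible_gib

# Deltas expressed in MiB for comparison with each other.
clamp_delta_mib = (physical_gib - visible_gib) * MIB_PER_GIB
shortfall_mib = (requested_gib - free_gib) * MIB_PER_GIB

print(f"budget vs physical total: {budget_physical_gib:.2f} GiB")
print(f"budget vs visible total:  {budget_visible_gib:.2f} GiB")
print(f"observed peak:            {peak_gib:.2f} GiB")
print(f"clamp delta:              {clamp_delta_mib:.0f} MiB")
print(f"profiling shortfall:      {shortfall_mib:.0f} MiB")
```

All inputs are rounded to two decimals in our logs, so the outputs are only indicative to within a few tens of MiB.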

Hypothesis

vLLM (via PyTorch / its memory profiler) is computing the budget by reading nvmlDeviceGetMemoryInfo for the device's total capacity, which — if HAMi-core does not intercept that NVML call — returns the physical 24 GiB. The PyTorch caching allocator's peak then reflects that physical-based budget, which leaves only enough headroom for cudagraph profiling on a non-clamped GPU. On a HAMi-clamped slice the headroom evaporates because the clamp delta was never accounted for.

If this hypothesis is correct, then any consumer following the common pattern of "query NVML for total, multiply by utilization fraction, allocate that much" will systematically over-allocate on HAMi slices. CUDA Runtime's cudaMemGetInfo is the right answer (it reports clamp-aware free/total), but it is not the canonical capacity-query API for many frameworks.
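One way a consumer could defend against the mismatch, sketched under the assumption that it has both totals in hand, is to size against the smaller of the two. The function name `clamp_aware_budget` is hypothetical and does not correspond to anything in vLLM or PyTorch.

```python
from typing import Optional


def clamp_aware_budget(
    utilization: float,
    cuda_total_bytes: int,
    nvml_total_bytes: Optional[int] = None,
) -> int:
    """Return a byte budget sized against the smallest total that was reported.

    cuda_total_bytes is the total from cudaMemGetInfo; nvml_total_bytes is the
    total from nvmlDeviceGetMemoryInfo, if the caller also queried NVML.
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")

    totals = [cuda_total_bytes]
    if nvml_total_bytes is not None:
        totals.append(nvml_total_bytes)

    # Taking the minimum means an unclamped NVML reading can never inflate the budget.
    return int(utilization * min(totals))
```

On an unsliced GPU the two totals should agree and the result is unchanged; on a slice where only one surface is clamped, the clamped value wins.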

Why this matters

Sliced-GPU sharing is a primary HAMi use case. The set of stacks that pre-allocate against a percentage of "device capacity" (vLLM, SGLang, certain training frameworks, and anything that reads capacity from nvmlDeviceGetMemoryInfo instead of cudaMemGetInfo, which is what torch.cuda.mem_get_info wraps) is large and growing. A 1–2% accounting drift is enough to flip CUDA-graph capture from "fits comfortably" to "deterministic OOM," and the failure mode is opaque to operators because the inside-the-container nvidia-smi looks fine until the moment of allocation failure.

Asks

  1. Confirm or deny the NVML interception story. Does HAMi-core today intercept nvmlDeviceGetMemoryInfo / nvmlDeviceGetMemoryInfo_v2 and return the clamped total, or only cudaMemGetInfo? A pointer to the relevant source file would be ideal — the README documents cudaMemGetInfo interception explicitly but does not enumerate the NVML surface.
  2. Document the recommended query pattern for downstream consumers who want to "right-size" their budgets against the actual usable slice. Is the canonical answer "always use cudaMemGetInfo," or is there a preferred env var / sysfs / NVML extension that exposes the configured CUDA_DEVICE_MEMORY_LIMIT directly?
  3. Consider extending NVML interception if it is not already present. The cost of intercepting nvmlDeviceGetMemoryInfo to return clamped totals is small, and it would silently fix a wide class of pre-allocator misbudgeting bugs in upstream consumers without requiring them to change their code.
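To make ask 2 concrete, this is roughly what reading the configured limit from the environment might look like on the consumer side. It is a sketch under assumptions: we assume the value is a non-negative integer with an optional `k`, `m`, or `g` suffix interpreted as binary multiples (for example `24576m`), and that the variable is visible to the process. The accepted grammar and the exact variable name in a given deployment should be confirmed against HAMi-core. `parse_memory_limit` and `configured_limit_bytes` are hypothetical helper names.

```python
import os
import re
from typing import Mapping, Optional

# Assumed grammar: digits plus an optional k/m/g suffix, binary multiples.
_UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
_PATTERN = re.compile(r"^\s*(\d+)\s*([kmg]?)\s*$", re.IGNORECASE)


def parse_memory_limit(value: str) -> int:
    """Convert a limit string such as '24576m' or '1g' into bytes."""
    match = _PATTERN.match(value)
    if match is None:
        raise ValueError(f"unrecognized memory limit: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit.lower()]


def configured_limit_bytes(
    env: Optional[Mapping[str, str]] = None,
    name: str = "CUDA_DEVICE_MEMORY_LIMIT",
) -> Optional[int]:
    """Return the configured limit in bytes, or None if the variable is unset."""
    env = os.environ if env is None else env
    raw = env.get(name)
    return None if raw is None else parse_memory_limit(raw)
```

Even if this works, it reports the configured limit rather than the usable total after HAMi-core's own overhead, which is part of why we are asking for a documented pattern.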

We are happy to test patches against our cluster and report back — the reproduction is deterministic and runs on a single 3090.

Cross-reference

A parallel issue has been filed against vllm-project/vllm covering vLLM's own pre-eager budget accounting in profile_cudagraph_memory(). The two issues should land on the same root cause from opposite sides — vLLM should query a clamp-aware surface, and HAMi should ensure that surface is the natural one. Link will be added in a comment once the upstream issue URL is available.
