repro.c
Bug: cuMemCreate hook reads uninitialized dev when prop->location.type != CU_MEM_LOCATION_TYPE_DEVICE
Summary
In src/cuda/memory.c, the cuMemCreate hook calls add_chunk_only(*handle, size, dev) with dev left uninitialized whenever prop->location.type is not CU_MEM_LOCATION_TYPE_DEVICE. The garbage value flows down into set_current_device_memory_limit, which logs Illegal device id: <random> and writes out-of-bounds into region_info.shared_region->limit[dev].
This blocks any container that uses CUDA Driver API VMM (cuMemCreate + cuMemAddressReserve + cuMemMap) with a non-DEVICE allocation location — which is the path ggml-cuda takes when staging a virtual address pool before binding it to a specific device. So basically llama.cpp, every llama-server build with VMM on, and the Lucebox / DFlash speculative-decoding stack all crash on a HAMi-managed pod.
Steps to reproduce
Hardware: NVIDIA RTX 5090 Laptop GPU (sm_120 Blackwell consumer). The bug isn't hardware-specific — same pattern should hit on any GPU under HAMi as long as the workload uses cuMemCreate with a non-DEVICE location.
Stack: HAMi 2.5.1 (deployed as part of Olares 1.12.5), nvidia/cuda:13.0.0-devel-ubuntu22.04, ggml-cuda built with default GGML_CUDA=ON (i.e. with VMM enabled).
Pod logs at startup:
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 24463 MiB):
  Device 0: NVIDIA GeForce RTX 5090 Laptop GPU, compute capability 12.0, VMM: yes, VRAM: 24463 MiB
[HAMI-core ERROR (pid:1 thread=124663410515968 multiprocess_memory_limit.c:846)]: Illegal device id: -644371744
The -644371744 value changes between runs; it is whatever an earlier call happened to leave on the stack. I confirmed it's not a build, env or driver issue: same pattern across image rebuilds, LD_LIBRARY_PATH permutations, CUDA_DEVICE_MEMORY_LIMIT_0 settings, and on two different driver versions.
Minimal C reproducer:
#include <stdio.h>
#include <cuda.h>

int main(void) {
    cuInit(0);
    CUdevice device; cuDeviceGet(&device, 0);
    CUcontext ctx; cuCtxCreate(&ctx, 0, device);

    CUmemAllocationProp prop = {0};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_HOST_NUMA; // not DEVICE
    prop.location.id = 0;

    size_t granularity;
    cuMemGetAllocationGranularity(&granularity, &prop,
                                  CU_MEM_ALLOC_GRANULARITY_MINIMUM);
    size_t size = ((1<<20) + granularity - 1) / granularity * granularity;

    CUmemGenericAllocationHandle handle;
    CUresult res = cuMemCreate(&handle, size, &prop, 0);
    printf("cuMemCreate returned %d\n", res);

    cuCtxDestroy(ctx);
    return 0;
}
Build it with nvcc repro.c -o repro -lcuda and run it inside a HAMi-managed pod: [HAMI-core ERROR ...]: Illegal device id: <random> shows up immediately.
Root cause
src/cuda/memory.c, cuMemCreate hook around line 1009:
CUresult cuMemCreate(CUmemGenericAllocationHandle* handle, size_t size,
                     const CUmemAllocationProp* prop, unsigned long long flags) {
    LOG_INFO("cuMemCreate:%lld:%d", size, prop->location.id);
    ENSURE_RUNNING();
    CUdevice dev; // (a) not initialised
    int do_oom_check = (prop->location.type == CU_MEM_LOCATION_TYPE_DEVICE);
    if (do_oom_check && cuCtxGetDevice(&dev) != CUDA_SUCCESS) {
        dev = prop->location.id;
    }
    if (do_oom_check && oom_check(dev, size)) {
        return CUDA_ERROR_OUT_OF_MEMORY;
    }
    CUresult res = CUDA_OVERRIDE_CALL(cuda_library_entry,
                                      cuMemCreate, handle, size, prop, flags);
    if (res == CUDA_SUCCESS) {
        add_chunk_only(*handle, size, dev); // (b) uses (a) uninitialised
    }
    return res;
}
When prop->location.type != CU_MEM_LOCATION_TYPE_DEVICE, the entire if (do_oom_check && ...) block is skipped, so dev stays whatever was on the stack. The unconditional add_chunk_only(*handle, size, dev) at the bottom forwards that garbage to set_current_device_memory_limit:
// src/multiprocess/multiprocess_memory_limit.c
int set_current_device_memory_limit(const int dev, size_t newlimit) {
    ensure_initialized();
    if (dev < 0 || dev >= CUDA_DEVICE_MAX_COUNT) {
        LOG_ERROR("Illegal device id: %d", dev);
    }
    LOG_INFO("dev %d new limit set to %ld",dev,newlimit);
    region_info.shared_region->limit[dev]=newlimit; // OOB write
    return 0;
}
Two issues, actually:
1. cuMemCreate reads uninitialised dev (the visible symptom).
2. set_current_device_memory_limit logs the error and then proceeds to region_info.shared_region->limit[dev] = newlimit. That's an out-of-bounds write into shared memory, and latent corruption for any other process attached to the same region_info.
The real-world path that triggers (1): ggml-cuda reserves a virtual address pool via cuMemCreate with CU_MEM_LOCATION_TYPE_HOST_NUMA for staging before binding to a specific device. This is standard CUDA Driver API VMM usage; see NVIDIA's intro post.
Suggested fix
Two-line minimal fix at the cuMemCreate site — don't track non-DEVICE allocations against per-device memory limits since they don't consume device VRAM:
    CUdevice dev = prop->location.id; // initialised up-front
    int do_oom_check = (prop->location.type == CU_MEM_LOCATION_TYPE_DEVICE);
    if (do_oom_check && cuCtxGetDevice(&dev) != CUDA_SUCCESS) {
        dev = prop->location.id;
    }
    if (do_oom_check && oom_check(dev, size)) {
        return CUDA_ERROR_OUT_OF_MEMORY;
    }
    CUresult res = CUDA_OVERRIDE_CALL(cuda_library_entry,
                                      cuMemCreate, handle, size, prop, flags);
    if (res == CUDA_SUCCESS && do_oom_check) { // ← skip non-DEVICE
        add_chunk_only(*handle, size, dev);
    }
    return res;
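To sanity-check the control flow of that change without a GPU or HAMi in the loop, here is a host-only mock. Everything prefixed mock_ is a hypothetical stand-in invented for this sketch (it is not the cuda.h or HAMi-core definition), and the real driver call is assumed to succeed. It only illustrates that a non-DEVICE location never reaches the tracker and that dev always holds a defined value.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the driver types; NOT the real cuda.h values. */
enum mock_location_type { MOCK_LOCATION_DEVICE, MOCK_LOCATION_HOST_NUMA };

struct mock_prop {
    enum mock_location_type type;
    int id;
};

/* Records what the mock tracker saw so the behaviour can be inspected. */
int mock_tracked_calls = 0;
int mock_last_tracked_dev = -1;

/* Stand-in for add_chunk_only(): just remembers the device it was given. */
void mock_add_chunk_only(size_t size, int dev) {
    (void)size;
    mock_tracked_calls++;
    mock_last_tracked_dev = dev;
}

/* Mirrors the control flow of the proposed fix: dev is initialised up-front
 * and tracking only happens for DEVICE-located allocations.
 * ctx_ok / ctx_dev model whether cuCtxGetDevice() succeeded and what it
 * returned. */
int mock_mem_create_hook(const struct mock_prop *prop, size_t size,
                         int ctx_dev, int ctx_ok) {
    int dev = prop->id;
    int do_oom_check = (prop->type == MOCK_LOCATION_DEVICE);
    if (do_oom_check && ctx_ok) {
        dev = ctx_dev;
    }
    /* The real allocation call would go here; the mock assumes success. */
    if (do_oom_check) {
        mock_add_chunk_only(size, dev);
    }
    return 0;
}
```

Calling mock_mem_create_hook with a MOCK_LOCATION_HOST_NUMA prop leaves mock_tracked_calls untouched, while a MOCK_LOCATION_DEVICE prop with ctx_ok set to 0 falls back to prop->id, matching the fallback in the hook.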
While we're at it, set_current_device_memory_limit should bail out instead of writing out of bounds:
int set_current_device_memory_limit(const int dev, size_t newlimit) {
    ensure_initialized();
    if (dev < 0 || dev >= CUDA_DEVICE_MAX_COUNT) {
        LOG_ERROR("Illegal device id: %d", dev);
        return -1; // ← was missing
    }
    LOG_INFO("dev %d new limit set to %ld", dev, newlimit);
    region_info.shared_region->limit[dev] = newlimit;
    return 0;
}
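The guard can be exercised in isolation with a small host-only mock. The array, the function name and the value 16 for the device cap are assumptions made for this sketch; the real CUDA_DEVICE_MAX_COUNT and shared region layout live in HAMi-core and may differ.

```c
#include <stddef.h>
#include <stdio.h>

#define MOCK_DEVICE_MAX_COUNT 16 /* assumed value, for illustration only */

/* Stand-in for region_info.shared_region->limit[]. */
size_t mock_limit[MOCK_DEVICE_MAX_COUNT];

/* Same shape as the patched setter: reject a bad index before the array
 * is touched, and report the failure to the caller. */
int mock_set_device_memory_limit(int dev, size_t newlimit) {
    if (dev < 0 || dev >= MOCK_DEVICE_MAX_COUNT) {
        fprintf(stderr, "Illegal device id: %d\n", dev);
        return -1;
    }
    mock_limit[dev] = newlimit;
    return 0;
}
```

With the early return in place, a garbage index such as -644371744 yields -1 and leaves the array unchanged. Callers would need to check the return value for the error to be visible beyond the log line.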
I'm happy to send a PR if that's the direction you'd take, or to test a different approach if you'd rather track non-DEVICE allocs in a separate accounting path.
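For the alternative direction, a separate accounting path could look roughly like the sketch below. This is purely hypothetical: acct_usage, acct_record and ACCT_MAX_DEVICES do not exist in HAMi-core, and the sketch ignores locking and the shared-memory layout. It only shows non-DEVICE allocations landing in their own counter, so a NUMA node id is never used as a device index.

```c
#include <stddef.h>

#define ACCT_MAX_DEVICES 16 /* assumed cap, for illustration only */

/* Hypothetical usage record: per-device bytes plus one host-side bucket. */
struct acct_usage {
    size_t device_bytes[ACCT_MAX_DEVICES];
    size_t host_bytes;
};

/* Records an allocation. For non-device allocations the dev argument is
 * ignored entirely; for device allocations it is bounds-checked first.
 * Returns 0 on success, -1 if dev is out of range. */
int acct_record(struct acct_usage *u, int is_device, int dev, size_t size) {
    if (!is_device) {
        u->host_bytes += size;
        return 0;
    }
    if (dev < 0 || dev >= ACCT_MAX_DEVICES) {
        return -1;
    }
    u->device_bytes[dev] += size;
    return 0;
}
```

Whether host-side allocations should count against any limit at all is a policy question for the maintainers; the sketch only keeps the two kinds of allocation apart.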
Why this matters
VMM is on by default in ggml-cuda since llama.cpp PR #11446 (Jan 2026), which means every recent llama-server build hits this bug on a HAMi-managed pod. The current workaround is to rebuild with -DGGML_CUDA_NO_VMM=ON, which works but loses the VMM benefits (notably ~3.5× KV cache savings via TurboQuant TQ3_0 paths).
A fix in HAMi-core would unblock upstream images (ggml-org/llama.cpp:server-cuda13-*, vllm/vllm-openai recent builds, the Lucebox lucebox-hub consumer-tuned fork) without users having to rebuild anything.
Environment
- HAMi-core: master, commit ec8979d (the submodule pin in HAMi master)
- HAMi: 2.5.1 (deployed via the hami-2.5.1 Helm chart in kube-system)
- Olares: 1.12.5
- GPU: NVIDIA RTX 5090 Laptop GPU, compute capability 12.0 (sm_120 Blackwell consumer)
- Driver: 580.65.06, CUDA 13.0.48
- Runtime: containerd via k3s 1.31
- Reproducer images on Docker Hub:
  - aamsellem/lucebox-qwen36-blackwell:1.0.0 (VMM enabled, triggers the bug)
  - aamsellem/lucebox-qwen36-blackwell:1.1.0 (same code, -DGGML_CUDA_NO_VMM=ON, no bug)
Happy to send extra logs, the full reproducer pod manifest, or a PR.