Bug: cuMemCreate hook reads uninitialized dev when prop->location.type != CU_MEM_LOCATION_TYPE_DEVICE #187

@aamsellem

repro.c

Summary

In src/cuda/memory.c, the cuMemCreate hook calls add_chunk_only(*handle, size, dev) with dev left uninitialized whenever prop->location.type is not CU_MEM_LOCATION_TYPE_DEVICE. The garbage value flows down into set_current_device_memory_limit, which logs Illegal device id: <random> and writes out-of-bounds into region_info.shared_region->limit[dev].

This breaks any container that uses the CUDA Driver API VMM path (cuMemCreate + cuMemAddressReserve + cuMemMap) with a non-DEVICE allocation location, which is the path ggml-cuda takes when staging a virtual address pool before binding it to a specific device. As a result, llama.cpp, every llama-server build with VMM enabled, and the Lucebox / DFlash speculative-decoding stack all crash on a HAMi-managed pod.

Steps to reproduce

Hardware: NVIDIA RTX 5090 Laptop GPU (sm_120 Blackwell consumer). The bug isn't hardware-specific — same pattern should hit on any GPU under HAMi as long as the workload uses cuMemCreate with a non-DEVICE location.

Stack: HAMi 2.5.1 (deployed as part of Olares 1.12.5), nvidia/cuda:13.0.0-devel-ubuntu22.04, ggml-cuda built with default GGML_CUDA=ON (i.e. with VMM enabled).

Pod logs at startup:

ggml_cuda_init: found 1 CUDA devices (Total VRAM: 24463 MiB):
  Device 0: NVIDIA GeForce RTX 5090 Laptop GPU, compute capability 12.0,
            VMM: yes, VRAM: 24463 MiB
[HAMI-core ERROR (pid:1 thread=124663410515968 multiprocess_memory_limit.c:846)]:
Illegal device id: -644371744

The -644371744 value changes between runs: it is whatever an earlier call frame left on the stack. I confirmed it's not a build, env or driver issue: the same pattern shows up across image rebuilds, LD_LIBRARY_PATH permutations, CUDA_DEVICE_MEMORY_LIMIT_0 settings, and two different driver versions.

Minimal C reproducer:

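#include <stdio.h>
#include <cuda.h>

/* Print the failing call and exit if a driver API call does not succeed. */
#define CHECK(call)                                                  \
    do {                                                             \
        CUresult rc_ = (call);                                       \
        if (rc_ != CUDA_SUCCESS) {                                   \
            fprintf(stderr, "%s failed: %d\n", #call, (int)rc_);     \
            return 1;                                                \
        }                                                            \
    } while (0)

int main(void) {
    CHECK(cuInit(0));
    CUdevice device; CHECK(cuDeviceGet(&device, 0));
    CUcontext ctx;   CHECK(cuCtxCreate(&ctx, 0, device));

    CUmemAllocationProp prop = {0};
    prop.type             = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type    = CU_MEM_LOCATION_TYPE_HOST_NUMA;  // not DEVICE
    prop.location.id      = 0;

    // Initialised and checked so a failed query cannot feed garbage into the rounding below.
    size_t granularity = 0;
    CHECK(cuMemGetAllocationGranularity(&granularity, &prop,
                                        CU_MEM_ALLOC_GRANULARITY_MINIMUM));
    // Round 1 MiB up to a multiple of the allocation granularity.
    size_t size = (((size_t)1 << 20) + granularity - 1) / granularity * granularity;

    CUmemGenericAllocationHandle handle;
    CUresult res = cuMemCreate(&handle, size, &prop, 0);     // goes through the HAMi hook
    printf("cuMemCreate returned %d\n", (int)res);
    cuCtxDestroy(ctx);
    return 0;
}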

Build with nvcc repro.c -o repro -lcuda and run it inside a HAMi-managed pod; [HAMI-core ERROR ...]: Illegal device id: <random> shows up immediately.

Root cause

src/cuda/memory.c, cuMemCreate hook around line 1009:

CUresult cuMemCreate(CUmemGenericAllocationHandle* handle, size_t size,
                     const CUmemAllocationProp* prop, unsigned long long flags) {
    LOG_INFO("cuMemCreate:%lld:%d", size, prop->location.id);
    ENSURE_RUNNING();
    CUdevice dev;                                          // (a) not initialised
    int do_oom_check = (prop->location.type == CU_MEM_LOCATION_TYPE_DEVICE);
    if (do_oom_check && cuCtxGetDevice(&dev) != CUDA_SUCCESS) {
        dev = prop->location.id;
    }
    if (do_oom_check && oom_check(dev, size)) {
        return CUDA_ERROR_OUT_OF_MEMORY;
    }
    CUresult res = CUDA_OVERRIDE_CALL(cuda_library_entry,
        cuMemCreate, handle, size, prop, flags);
    if (res == CUDA_SUCCESS) {
        add_chunk_only(*handle, size, dev);                // (b) uses (a) uninitialised
    }
    return res;
}

When prop->location.type != CU_MEM_LOCATION_TYPE_DEVICE, do_oom_check is 0, so both if (do_oom_check && ...) conditions short-circuit: cuCtxGetDevice is never called, dev is never assigned, and it holds whatever was on the stack. The unconditional add_chunk_only(*handle, size, dev) at the bottom forwards that garbage to set_current_device_memory_limit:

// src/multiprocess/multiprocess_memory_limit.c
int set_current_device_memory_limit(const int dev, size_t newlimit) {
    ensure_initialized();
    if (dev < 0 || dev >= CUDA_DEVICE_MAX_COUNT) {
        LOG_ERROR("Illegal device id: %d", dev);
    }
    LOG_INFO("dev %d new limit set to %ld",dev,newlimit);
    region_info.shared_region->limit[dev]=newlimit;        // OOB write
    return 0;
}

Two issues actually:

  1. cuMemCreate reads uninitialised dev (the visible symptom).
  2. set_current_device_memory_limit logs the error and then proceeds to region_info.shared_region->limit[dev] = newlimit — that's an out-of-bounds write into shared memory, latent corruption for any other process attached to the same region_info.

The real-world path that triggers (1): ggml-cuda reserves a virtual address pool via cuMemCreate with CU_MEM_LOCATION_TYPE_HOST_NUMA for staging before binding to a specific device. Standard CUDA Driver API VMM usage, see NVIDIA's intro post.

Suggested fix

Two-line minimal fix at the cuMemCreate site — don't track non-DEVICE allocations against per-device memory limits since they don't consume device VRAM:

CUdevice dev = prop->location.id;                          // initialised up-front
int do_oom_check = (prop->location.type == CU_MEM_LOCATION_TYPE_DEVICE);
if (do_oom_check && cuCtxGetDevice(&dev) != CUDA_SUCCESS) {
    dev = prop->location.id;
}
if (do_oom_check && oom_check(dev, size)) {
    return CUDA_ERROR_OUT_OF_MEMORY;
}
CUresult res = CUDA_OVERRIDE_CALL(cuda_library_entry,
    cuMemCreate, handle, size, prop, flags);
if (res == CUDA_SUCCESS && do_oom_check) {                 // ← skip non-DEVICE
    add_chunk_only(*handle, size, dev);
}
return res;
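
To make the intended behaviour easy to check without a GPU, here is a minimal sketch of the patched control flow using stand-ins. Everything prefixed mock_ or MOCK_ is hypothetical and only mirrors the shape of the hook: the driver call is assumed to succeed, the context lookup is assumed to return ctx_dev, and oom_check is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the CUDA and HAMi-core symbols used by the hook. */
enum mock_location_type { MOCK_LOCATION_DEVICE, MOCK_LOCATION_HOST_NUMA };

struct mock_prop {
    enum mock_location_type type;
    int id;
};

static int g_tracked_calls = 0;      /* how many times the tracker ran */
static int g_last_tracked_dev = -1;  /* device id passed to the tracker */

/* Stand-in for add_chunk_only(): records which device was charged. */
static void mock_add_chunk_only(size_t size, int dev) {
    (void)size;
    g_tracked_calls++;
    g_last_tracked_dev = dev;
}

/* Control flow of the patched hook. The real driver call and oom_check()
 * are omitted; ctx_dev plays the role of a successful cuCtxGetDevice(). */
static int mock_cuMemCreate(const struct mock_prop *prop, size_t size, int ctx_dev) {
    assert(prop != NULL);
    int dev = prop->id;                                   /* initialised up-front */
    int do_oom_check = (prop->type == MOCK_LOCATION_DEVICE);
    if (do_oom_check) {
        dev = ctx_dev;
    }
    if (do_oom_check) {
        mock_add_chunk_only(size, dev);                   /* DEVICE allocations only */
    }
    return 0;
}
```

Calling mock_cuMemCreate with a MOCK_LOCATION_HOST_NUMA prop leaves g_tracked_calls at 0, while a MOCK_LOCATION_DEVICE prop charges the device reported by the context.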

While we're at it, set_current_device_memory_limit should bail out instead of writing out of bounds:

int set_current_device_memory_limit(const int dev, size_t newlimit) {
    ensure_initialized();
    if (dev < 0 || dev >= CUDA_DEVICE_MAX_COUNT) {
        LOG_ERROR("Illegal device id: %d", dev);
        return -1;                                         // ← was missing
    }
    LOG_INFO("dev %d new limit set to %ld", dev, newlimit);
    region_info.shared_region->limit[dev] = newlimit;
    return 0;
}
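
As a quick sanity check of the guarded setter, the sketch below reproduces its shape against a mock region with a canary placed right after the limit array. MOCK_DEVICE_MAX_COUNT, mock_shared_region and mock_set_limit are hypothetical names for illustration; the real CUDA_DEVICE_MAX_COUNT value and region layout in HAMi-core may differ.

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_DEVICE_MAX_COUNT 16   /* hypothetical; stands in for CUDA_DEVICE_MAX_COUNT */
#define MOCK_CANARY ((size_t)0xC0FFEE)

struct mock_shared_region {
    size_t limit[MOCK_DEVICE_MAX_COUNT];
    size_t canary;                 /* sits right after the array to expose overruns */
};

static struct mock_shared_region g_region = { {0}, MOCK_CANARY };

/* Same shape as the patched setter: reject the id before touching limit[]. */
static int mock_set_limit(int dev, size_t newlimit) {
    if (dev < 0 || dev >= MOCK_DEVICE_MAX_COUNT) {
        return -1;                 /* bail out, nothing is written */
    }
    g_region.limit[dev] = newlimit;
    return 0;
}

/* Aborts if anything has written past the end of limit[]. */
static void mock_check_region(void) {
    assert(g_region.canary == MOCK_CANARY);
}
```

With the early return in place, passing the garbage id from the log (-644371744) yields -1 and leaves the canary untouched; a valid id still updates its slot.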

I'm happy to send a PR if that's the direction you'd take, or to test a different approach if you'd rather track non-DEVICE allocs in a separate accounting path.
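
For discussion, a separate accounting path could look roughly like the sketch below. This is only one possible shape, not something HAMi-core has today: acct_location, acct_state, acct_record and ACCT_DEVICE_MAX are all hypothetical names, and the idea is simply that non-DEVICE allocations are counted in their own bucket and never index the per-device array.

```c
#include <assert.h>
#include <stddef.h>

#define ACCT_DEVICE_MAX 16                 /* hypothetical device cap */

enum acct_location { ACCT_LOCATION_DEVICE, ACCT_LOCATION_OTHER };

struct acct_state {
    size_t device_bytes[ACCT_DEVICE_MAX];  /* counted against per-device limits */
    size_t host_bytes;                     /* tracked, never limited per device */
};

/* Route an allocation to the right counter.
 * Returns 0 on success, -1 for an invalid device id on the DEVICE path. */
static int acct_record(struct acct_state *st, enum acct_location loc,
                       int dev, size_t size) {
    assert(st != NULL);
    if (loc != ACCT_LOCATION_DEVICE) {
        st->host_bytes += size;            /* dev is ignored on this path */
        return 0;
    }
    if (dev < 0 || dev >= ACCT_DEVICE_MAX) {
        return -1;
    }
    st->device_bytes[dev] += size;
    return 0;
}
```

The trade-off versus the two-line fix is that host-side allocations stay visible to the tracker, at the cost of a second counter and a matching release path.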

Why this matters

VMM is on by default in ggml-cuda since llama.cpp PR #11446 (Jan 2026), which means every recent llama-server build hits this bug on a HAMi-managed pod. The current workaround is to rebuild with -DGGML_CUDA_NO_VMM=ON, which works but loses the VMM benefits (notably ~3.5× KV cache savings via TurboQuant TQ3_0 paths).

A fix in HAMi-core would unblock upstream images (ggml-org/llama.cpp:server-cuda13-*, vllm/vllm-openai recent builds, the Lucebox lucebox-hub consumer-tuned fork) without users having to rebuild anything.

Environment

  • HAMi-core: master, commit ec8979d (the submodule pin in HAMi master)
  • HAMi: 2.5.1 (deployed via the hami-2.5.1 Helm chart in kube-system)
  • Olares: 1.12.5
  • GPU: NVIDIA RTX 5090 Laptop GPU, compute capability 12.0 (sm_120 Blackwell consumer)
  • Driver: 580.65.06, CUDA 13.0.48
  • Runtime: containerd via k3s 1.31
  • Reproducer images on Docker Hub:
    • aamsellem/lucebox-qwen36-blackwell:1.0.0 — VMM enabled, triggers the bug
    • aamsellem/lucebox-qwen36-blackwell:1.1.0 — same code, -DGGML_CUDA_NO_VMM=ON, no bug

Happy to send extra logs, the full reproducer pod manifest, or a PR.
