Windows CUDA: cuMemAddressReserve failure in VMM pool causes hard abort (GGML_CUDA_NO_VMM workaround) #580

@enceos

Hey folks. I ran into a hard crash while running some heavy embedding workloads on Windows using the CUDA backend. It looks like it's tied to the VMM allocator.

The Problem
When running a large indexing job (about 32,000 chunks via qmd), the process dies with a CUDA out of memory error.

Digging into the debug logs, the exact failure happens at ggml-cuda.cu:97. It aborts inside ggml_cuda_pool_vmm::alloc (around line 476) when calling:
cuMemAddressReserve(&pool_addr, CUDA_POOL_VMM_MAX_SIZE, 0, 0, 0)

Why it's failing
I'm on an RTX 3090 (24GB). In ggml-cuda.cu, CUDA_POOL_VMM_MAX_SIZE is hardcoded to reserve 32GB of virtual memory. Even with plenty of actual VRAM available, the virtual address space reservation fails. Instead of gracefully falling back to a non-VMM pool, the whole process hard-aborts.
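
To put numbers on that, here is a minimal sketch of the arithmetic. It assumes the constant is defined as `1ull << 35` (I have not pasted the exact definition here), and the names `kVmmMaxSize`, `kVramBytes` and `to_gib` are mine, for illustration only.

```cpp
#include <cstdint>

// One GiB in bytes.
constexpr std::uint64_t kGiB = 1ull << 30;

// Assumed value of CUDA_POOL_VMM_MAX_SIZE (hypothetical mirror of the constant).
constexpr std::uint64_t kVmmMaxSize = 1ull << 35;

// Physical VRAM on a 24 GB card such as the RTX 3090.
constexpr std::uint64_t kVramBytes = 24 * kGiB;

// Convert a byte count to whole GiB.
constexpr std::uint64_t to_gib(std::uint64_t bytes) { return bytes / kGiB; }

// The pool asks for a 32 GiB virtual range up front, regardless of the card.
static_assert(to_gib(kVmmMaxSize) == 32, "reservation is 32 GiB");
// That range is larger than the physical memory on this GPU.
static_assert(kVmmMaxSize > kVramBytes, "reservation exceeds physical VRAM");
```

The reservation is for address space, not physical memory, so being larger than VRAM is not wrong in itself. The point is only that its success depends on the driver and OS being willing to hand out a 32 GiB range, not on how much VRAM is free.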

The Workaround
I managed to bypass this locally by compiling node-llama-cpp from source with the VMM pool disabled, using the CMake option:
GGML_CUDA_NO_VMM=ON
With that flag, the exact same embedding job finishes perfectly and memory usage stays stable.

The Request
Would it be possible to add a runtime fallback here? If cuMemAddressReserve fails (which seems to happen on some Windows/WDDM setups), it would be great if it logged a warning and fell back to the standard allocator instead of crashing. That would make the prebuilt binaries a lot more stable for Windows users hitting this edge case.
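
Roughly what I have in mind, as a sketch and not a patch against the real code: `PoolKind`, `ReserveFn` and `choose_pool` are hypothetical names, and the reservation is passed in as a callable so the logic can be shown without a GPU. In the backend, the callable would presumably wrap the `cuMemAddressReserve` call and return whether it succeeded.

```cpp
#include <cstddef>
#include <cstdio>
#include <functional>

// Hypothetical tag for which allocator ends up being used.
enum class PoolKind { Vmm, Legacy };

// Hypothetical hook: returns true if a virtual address range of `bytes`
// could be reserved. A real backend would call cuMemAddressReserve here
// and compare the result against CUDA_SUCCESS.
using ReserveFn = std::function<bool(std::size_t bytes)>;

// Try the VMM reservation once; on failure, warn and pick the non-VMM
// pool instead of aborting the process.
PoolKind choose_pool(const ReserveFn & try_reserve, std::size_t max_size) {
    if (try_reserve(max_size)) {
        return PoolKind::Vmm;
    }
    std::fprintf(stderr,
                 "warning: reserving %zu bytes of virtual address space failed, "
                 "falling back to the non-VMM pool\n",
                 max_size);
    return PoolKind::Legacy;
}
```

Calling `choose_pool` with a hook that always fails returns `PoolKind::Legacy` and prints the warning, which is the behavior I would expect on my machine instead of the abort. Doing the check once at pool creation, not on every allocation, keeps the hot path unchanged.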

My Environment

  • OS: Windows 11 Pro N (10.0.22631)
  • GPU: RTX 3090 24GB (Driver 591.44)
  • CUDA: 13.1
  • Node: v24.13.0
  • node-llama-cpp: 3.17.1
