Weird host RAM + VRAM usage (CUDA + Unified memory) #14269

kstoykov · 2025-06-18T19:03:58Z

My laptop has a 7700HQ CPU and an NVIDIA GTX 1050 GPU, with 32GB of main RAM and 4GB of VRAM.

I'm loading the Mistral-Nemo-Instruct-2407-Q4_0 model with the following setup and parameters:

Ubuntu 24 with the official NVIDIA driver 570.x
~13B model quantized to Q4
KV cache quantized to Q4
No memory mapping of the model file (--no-mmap)

docker run --gpus all -e GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 -v <path_on_host>:/models -p 8000:8000 ghcr.io/ggml-org/llama.cpp:server-cuda -m /models/Mistral-Nemo-Instruct-2407-Q4_0.gguf --port 8000 --host 0.0.0.0 -n -1 -c 64000 -ngl 0 --parallel 1 --cache-type-k q4_0 --cache-type-v q4_0 -fa --no-mmap

I'm analyzing the total memory usage (host RAM + VRAM) for different combinations of -ngl and --no-warmup:

-ngl = 0: Total RAM usage after warm-up = ~10GB + ~0GB VRAM = ~10GB total RAM usage.
-ngl = 0 and --no-warmup: Total RAM usage = ~10GB + ~0GB VRAM = ~10GB total RAM usage.
-ngl = 128: Total RAM usage after warm-up = ~10GB + ~4GB VRAM = ~14GB total RAM usage.
-ngl = 19: Total RAM usage after warm-up = ~10GB + ~4GB VRAM = ~14GB total RAM usage.
-ngl = 18: Total RAM usage after warm-up = ~6GB + ~4GB VRAM = ~10GB total RAM usage.
-ngl = 128 and --no-warmup: Total RAM usage before warm-up = ~6GB + ~4GB VRAM = ~10GB total RAM usage; after warm-up it is again ~10GB + ~4GB VRAM = ~14GB total RAM usage.
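For anyone who wants to reproduce figures like the ones above, here is a minimal sketch of one way to take a RAM + VRAM snapshot on Linux. It is an assumption about methodology, not necessarily how the numbers in this post were collected: it reads system-wide usage from /proc/meminfo (MemTotal minus MemAvailable) and GPU usage from nvidia-smi, so other processes are included in both figures. The function names are hypothetical.

```python
import subprocess


def parse_meminfo(text):
    """Return host RAM in use (MiB) from /proc/meminfo text.

    Uses MemTotal - MemAvailable, which is system-wide and therefore
    includes every process, not only the llama.cpp server.
    """
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])  # values are reported in kB
    return (fields["MemTotal"] - fields["MemAvailable"]) / 1024


def parse_nvidia_smi(text):
    """Sum memory.used (MiB) over all GPUs from nvidia-smi CSV output."""
    return sum(int(line.strip()) for line in text.splitlines() if line.strip())


def snapshot():
    """Return (host RAM MiB, VRAM MiB, total MiB) at the time of the call."""
    with open("/proc/meminfo") as f:
        ram = parse_meminfo(f.read())
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    vram = parse_nvidia_smi(out)
    return ram, vram, ram + vram
```

Calling snapshot() once before starting the server, once after loading, and once after warm-up gives three totals whose differences can be compared across -ngl values.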

NB: -ngl = 14 is the maximum number of layers that fit entirely into the 4GB of VRAM. Everything above 14 needs unified memory.

Why is the total RAM usage 14GB in some cases (when trying to allocate significantly more VRAM than is actually available), while in the other cases it is 10GB? Shouldn't it always be roughly constant?

Why is there a sudden jump in RAM usage (from 6GB to 10GB) after warm-up when going from -ngl 18 to -ngl 19? The logs show a difference of only a few hundred MB, so this one layer is not dramatically larger than the others. Also, from -ngl 19 to -ngl 128 (this model has 41 layers) the total RAM usage is always 14GB.
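To make the jump concrete, the reported figures can be tabulated against the -ngl 0 baseline. This is only a sketch of the arithmetic: the values are the approximate readings listed above, and using ~10GB as the expected constant is an assumption taken from the question itself.

```python
# (configuration, host RAM in GB, VRAM in GB), copied from the list above.
# All values are approximate readings, not exact measurements.
MEASUREMENTS = [
    ("-ngl 0, after warm-up", 10, 0),
    ("-ngl 0, --no-warmup", 10, 0),
    ("-ngl 18, after warm-up", 6, 4),
    ("-ngl 19, after warm-up", 10, 4),
    ("-ngl 128, after warm-up", 10, 4),
    ("-ngl 128, --no-warmup, before warm-up", 6, 4),
    ("-ngl 128, --no-warmup, after warm-up", 10, 4),
]

BASELINE_GB = 10  # total at -ngl 0, treated here as the expected constant


def summarize(rows, baseline_gb):
    """Return (configuration, total GB, excess over baseline GB) per row."""
    return [(name, ram + vram, ram + vram - baseline_gb) for name, ram, vram in rows]


for name, total, excess in summarize(MEASUREMENTS, BASELINE_GB):
    print(f"{name:40s} total ~{total:2d} GB  excess ~{excess:+d} GB")
```

Every configuration that exceeds the baseline does so by the same ~4GB, which is also the size of the VRAM.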

Any ideas what I'm doing wrong?

Update: A few observations from today's testing.

llama.cpp allocates 2 main buffers in VRAM: a buffer for the model's layers and a buffer for the KV cache.
The 4GB increase in total RAM usage occurs when these 2 buffers together do not fit into the free VRAM. I'm still looking into why this happens.
Inference performance drops significantly (~10x) when the buffer for the model's layers does not fit into the free VRAM. Performance is totally fine if the buffer for the KV cache is allocated in RAM using unified memory.
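A rough way to check in advance whether the two buffers fit is to estimate the KV cache size and compare the sum with the free VRAM. The sketch below rests on assumptions that do not come from the logs: the architecture values (40 transformer layers, 8 KV heads, head dimension 128) are taken from the published Mistral-Nemo configuration, q4_0 is assumed to store each block of 32 values in 18 bytes, and the cache is assumed to scale linearly with the number of layers. The function names are hypothetical, and the model buffer size has to be read from the llama.cpp load log.

```python
Q4_0_BLOCK_ELEMS = 32  # q4_0 quantizes values in blocks of 32
Q4_0_BLOCK_BYTES = 18  # 2-byte fp16 scale + 16 bytes of packed 4-bit values

GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_ctx, n_kv_heads, head_dim):
    """Estimate the q4_0 KV cache size in bytes for n_layers layers."""
    # K and V each hold n_ctx * n_kv_heads * head_dim values per layer.
    elems_per_layer = 2 * n_ctx * n_kv_heads * head_dim
    blocks_per_layer = elems_per_layer // Q4_0_BLOCK_ELEMS
    return n_layers * blocks_per_layer * Q4_0_BLOCK_BYTES


def oversubscribed(model_buf_bytes, kv_buf_bytes, free_vram_bytes):
    """True if the two buffers together exceed the free VRAM."""
    return model_buf_bytes + kv_buf_bytes > free_vram_bytes


# Assumed architecture values for Mistral-Nemo; -c 64000 as in the command.
full_kv = kv_cache_bytes(n_layers=40, n_ctx=64000, n_kv_heads=8, head_dim=128)
print(f"KV cache for 40 layers at -c 64000: {full_kv / GIB:.2f} GiB")
```

Under these assumptions the full KV cache alone comes to about 2.75 GiB, so on a 4GB card the two buffers stop fitting well before all layers are offloaded.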

kstoykov · 2026-06-07T11:59:35Z

I've done several tests using unified memory, and it is so slow that I don't think it makes sense to investigate this issue further. It was faster just to move some of the layers to the CPU.
