A practical reference for deciding which open-weight model fits on which GPU, at which quantization, with how much context.
This file is intentionally math-first: every number comes from the standard quantization formulas and widely reported model parameter counts, so you can reproduce it yourself. If you need the runtime check, use gpu-guard (see README).
This table shows the smallest GPU class (by VRAM) that can load each model at each common quantization with a 2 GB safety buffer and 4K context. Longer context or parallel requests will push requirements higher; see the KV cache section.
| Model | Params | Q4_K_M | Q5_K_M | Q8_0 | FP16 |
|---|---|---|---|---|---|
| Phi-4 mini | 3.8B | 8 GB | 8 GB | 12 GB | 12 GB |
| Llama 3.1 8B | 8B | 8 GB | 12 GB | 12 GB | 24 GB |
| Mistral 7B | 7.2B | 8 GB | 8 GB | 12 GB | 16 GB |
| Gemma 2 9B | 9B | 12 GB | 12 GB | 16 GB | 24 GB |
| Qwen 2.5 14B | 14B | 12 GB | 16 GB | 24 GB | 32 GB |
| Phi-4 | 14B | 12 GB | 16 GB | 24 GB | 32 GB |
| Gemma 2 27B | 27B | 24 GB | 24 GB | 32 GB | 80 GB |
| Qwen 2.5 32B | 32B | 24 GB | 32 GB | 48 GB | 80 GB |
| Mixtral 8x7B | 46.7B | 32 GB | 48 GB | 80 GB | 2x48 GB |
| Llama 3.3 70B | 70B | 48 GB | 80 GB | 80 GB | 2x80 GB |
| Qwen 2.5 72B | 72B | 48 GB | 80 GB | 80 GB | 2x80 GB |
| Llama 3.1 405B | 405B | 3x80 GB | 4x80 GB | 6x80 GB | 16x80 GB |
Read each cell as: "the smallest GPU class where that model, at that quantization, still fits with a 2 GB buffer."
For dense transformers (Llama, Mistral, Qwen, Gemma, Phi), memory for the weights alone is:
weight_bytes ≈ params × bits_per_param / 8
Typical bits_per_param for common GGUF quantizations:
| Quant | Effective bits | bytes/param |
|---|---|---|
| FP16 / BF16 | 16 | 2.00 |
| Q8_0 | 8.5 | 1.06 |
| Q6_K | 6.56 | 0.82 |
| Q5_K_M | 5.50 | 0.69 |
| Q5_K_S | 5.25 | 0.66 |
| Q4_K_M | 4.58 | 0.57 |
| Q4_K_S | 4.25 | 0.53 |
| Q3_K_M | 3.52 | 0.44 |
| Q2_K | 2.73 | 0.34 |
So for Llama 3.3 70B at Q4_K_M:
70 × 10^9 × 0.57 bytes/param ≈ 40 GB
Add roughly 1 GB of CUDA context overhead for llama.cpp and friends, and you get the ~41 GB figure that matches what you see in nvidia-smi at idle right after the model loads. A 48 GB card (RTX 6000 Ada, L40) fits it comfortably; a 32 GB RTX 5090 does not.
For Mixtral and other mixture-of-experts models, you still need memory for every expert, not just the active ones. Mixtral 8x7B is about 46.7B parameters on disk even though only ~13B are active per token. Multiply by the quant ratio the same way.
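A minimal sketch of the arithmetic above, using the bytes-per-parameter averages from the table and the same rough 1 GB runtime overhead; nothing here is exact, it just reproduces the back-of-the-envelope numbers in this file.

```python
# Rough weight-memory estimator for the formula above:
#   weight_bytes ≈ params × bits_per_param / 8
# bytes/param values are the llama.cpp k-quant averages from the table.

BYTES_PER_PARAM = {
    "FP16": 2.00, "Q8_0": 1.06, "Q6_K": 0.82, "Q5_K_M": 0.69,
    "Q5_K_S": 0.66, "Q4_K_M": 0.57, "Q4_K_S": 0.53,
    "Q3_K_M": 0.44, "Q2_K": 0.34,
}

def weight_gb(params_billion: float, quant: str, overhead_gb: float = 1.0) -> float:
    """Approximate VRAM for the weights plus a rough CUDA context overhead, in GB."""
    return params_billion * BYTES_PER_PARAM[quant] + overhead_gb

# Dense example: Llama 3.3 70B at Q4_K_M -> ~40.9 GB, the ~41 GB figure above.
print(f"Llama 3.3 70B @ Q4_K_M: {weight_gb(70, 'Q4_K_M'):.1f} GB")

# MoE example: Mixtral 8x7B counts all experts (46.7B on disk), not the ~13B active.
print(f"Mixtral 8x7B  @ Q4_K_M: {weight_gb(46.7, 'Q4_K_M'):.1f} GB")
```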
| GPU | VRAM | Practical budget with 2 GB buffer | Practical model ceiling |
|---|---|---|---|
| RTX 4060 | 8 GB | 6 GB | 7B to 8B at Q4_K_M, short context |
| RTX 4060 Ti 16 GB | 16 GB | 14 GB | 14B at Q4_K_M; 8B at Q8_0 |
| RTX 4070 | 12 GB | 10 GB | 9B at Q4_K_M; 7B at Q5_K_M |
| RTX 4070 Ti Super | 16 GB | 14 GB | 14B at Q4_K_M; 9B at Q8_0 |
| RTX 4080 Super | 16 GB | 14 GB | 14B at Q4_K_M; 9B at Q8_0 |
| RTX 4090 | 24 GB | 22 GB | 27B at Q4_K_M; 14B at Q8_0 |
| RTX 5080 | 16 GB | 14 GB | same class as 4080 Super for this purpose |
| RTX 5090 | 32 GB | 30 GB | 32B at Q4_K_M; 27B at Q5_K_M; 14B at Q8_0 |
| RTX 6000 Ada | 48 GB | 46 GB | 70B at Q4_K_M; 32B at Q8_0 |
| L40 / L40S | 48 GB | 46 GB | 70B at Q4_K_M; 32B at Q8_0 |
| A100 40 GB | 40 GB | 38 GB | 70B at Q3_K_M; 32B at Q5_K_M |
| A100 80 GB | 80 GB | 78 GB | 70B at Q8_0; 72B at Q5_K_M |
| H100 80 GB | 80 GB | 78 GB | 70B at Q8_0; 72B at Q5_K_M |
| 2x H100 | 160 GB | 156 GB | 70B at FP16; 405B at Q2_K |
"Practical ceiling" assumes 4K context and no concurrent requests. If you run a batch serving setup, cut the ceiling by 20 to 40 percent depending on max_concurrent_requests and context length.
The weights are only half the story. Every token in your prompt plus every token you generate needs memory in the KV cache. For a transformer with L layers, H KV heads, D head dim, at FP16:
kv_cache_bytes ≈ 2 × L × H × D × sequence_length × 2 bytes
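A small sketch of that formula as code. The layer/head/dim configurations below are assumptions taken from the commonly reported Llama 3 GQA setups, not from this file; check the model card for the model you actually run.

```python
# KV cache per the formula above: 2 (K and V) × layers × kv_heads × head_dim
#   × sequence_length × 2 bytes (FP16).

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int, seq_len: int) -> float:
    return 2 * layers * kv_heads * head_dim * seq_len * 2 / 1024**3

# Assumed configs (widely reported, verify against the model card):
#   Llama 3.1 8B:  32 layers, 8 KV heads, head dim 128
#   Llama 3.3 70B: 80 layers, 8 KV heads, head dim 128
configs = {"Llama 3.1 8B": (32, 8, 128), "Llama 3.3 70B": (80, 8, 128)}
for name, (layers, heads, dim) in configs.items():
    for ctx in (4096, 32768, 131072):
        print(f"{name} @ {ctx // 1024}K: {kv_cache_gb(layers, heads, dim, ctx):.2f} GB")
```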
Or, more practically, precomputed figures for common models at a few context lengths:
| Model | KV cache @ 4K | KV cache @ 32K | KV cache @ 128K |
|---|---|---|---|
| Llama 3.1 8B | ~0.5 GB | ~4 GB | ~16 GB |
| Llama 3.3 70B | ~1.25 GB | ~10 GB | ~40 GB |
| Mistral 7B | ~0.5 GB | ~4 GB | n/a (32K native) |
| Qwen 2.5 32B | ~1 GB | ~8 GB | ~32 GB |
If you try to run Llama 3.3 70B at Q4_K_M on a 48 GB card with 128K context, the weights fit at ~40 GB, but the KV cache adds another ~40 GB, and you get a crash that looks nothing like "the model is too big." The real cause is the sum.
llama.cpp, Ollama, and vLLM all let you cap context length. If you do not cap it, they will happily allocate for the full trained context, even when your request is 200 tokens.
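A sketch of the sum that actually decides whether a model fits, reusing the rough per-parameter and KV numbers from the sketches above (the 80-layer / 8-KV-head / 128-dim config for Llama 3.3 70B is again an assumption from the model card):

```python
def fits(vram_gb, params_b, bytes_per_param, layers, kv_heads, head_dim, ctx,
         buffer_gb=2.0, overhead_gb=1.0):
    """Rough go/no-go: weights + KV cache + overhead + safety buffer vs. VRAM."""
    weights = params_b * bytes_per_param
    kv = 2 * layers * kv_heads * head_dim * ctx * 2 / 1024**3
    need = weights + kv + overhead_gb + buffer_gb
    return need <= vram_gb, need

# The failure case above: Llama 3.3 70B at Q4_K_M on a 48 GB card.
for ctx in (4096, 131072):
    ok, need = fits(48, 70, 0.57, layers=80, kv_heads=8, head_dim=128, ctx=ctx)
    print(f"ctx={ctx}: need ~{need:.0f} GB -> {'fits' if ok else 'does not fit'}")
```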
Q4_K_M is the default for a reason: about 4.58 bits per weight, minimal perplexity loss versus Q5_K_M in most benchmarks, and it unlocks a full class of bigger models on the same card. Start here.
Q5_K_M is worth it when you can afford the extra VRAM and you are running a reasoning-heavy workload (code generation, math, multi-step agents) where the last few percent of capability show up in real output quality. On a 4090 you can run Qwen 2.5 14B at Q5_K_M instead of Q4_K_M at almost no cost.
Q8_0 is effectively lossless relative to FP16 for inference, and it costs half the memory. If you have a 48 GB or 80 GB card, there is rarely a reason to run FP16 for serving.
FP16 / BF16 makes sense for fine-tuning, for models where the quantization hurts (very small models, some reasoning models, certain vision-language models where the projector is sensitive), and for research reproducibility. For everyday serving, it is wasteful.
Q2_K and Q3_K are emergency quants. Use them when you need to run a model that is one class above your GPU and you will tolerate the quality hit. Q2_K on a 70B model is still better than Q4_K_M on a 32B model for some tasks, but not for most.
Multimodal models like LLaVA, Qwen-VL, and Gemma 2 vision variants ship the vision tower as a separate file, typically named something with mmproj in it. That file is loaded into VRAM in addition to the base model.
For Qwen 2.5 VL 7B:
- Base weights at Q4_K_M: ~4.5 GB
- mmproj at Q8: ~0.7 GB
- KV cache at 4K: ~0.5 GB
- Total: ~5.7 GB, not 4.5 GB
Auto-fit logic in llama.cpp does not always include the mmproj file in its calculation. This is a recurring source of "but it should fit" bug reports. If you are checking a VLM, size it with the mmproj added.
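A tiny sketch of the same addition, using the rough Qwen 2.5 VL 7B figures listed above:

```python
# Size a VLM as base weights + mmproj + KV cache, not base weights alone.
base_gb   = 4.5   # base weights at Q4_K_M (rough figure from above)
mmproj_gb = 0.7   # vision projector file at Q8
kv_gb     = 0.5   # KV cache at 4K context
buffer_gb = 2.0   # same safety buffer as the rest of this file

total = base_gb + mmproj_gb + kv_gb
print(f"model total ~{total:.1f} GB; plan for ~{total + buffer_gb:.1f} GB of free VRAM")
```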
If your model does not fit on one card, llama.cpp and vLLM will split it across GPUs. The split is not free: there is a small overhead per layer crossing a PCIe boundary, and it is much larger if you do not have NVLink. As a rule of thumb:
- Two identical cards (e.g., 2x 4090): efficient, expect ~90% of single-card speed per token at equivalent batch size.
- Two different cards (e.g., 4090 + 3090): still works; llama.cpp supports manual layer offloading. Slower than identical cards.
- CPU offload: the model runs, but you will see 5x to 20x slowdowns depending on how many layers end up on CPU. Useful for prototyping, not for serving.
Parameter counts come from the official model cards. Quantization bit-per-weight figures come from llama.cpp's k-quant documentation and are averages; actual values vary slightly per layer because the k-quants mix block sizes.
These numbers are meant to be accurate to within about 5%. For a precise figure on a specific build, run:
```bash
gpu-guard --model-size $(stat -c %s model.gguf | awk '{print $1/1024/1024/1024}') --buffer 2
```

Or just let the guard read the file header directly when that support lands.
If you have verified numbers for a model or GPU that is not on this table, open a PR with:
- Model name and parameter count
- Quantization file URL (Hugging Face preferred)
- GPU model and driver version
- nvidia-smi output after the model loads and is warm
- Context length used
Hardware reports are welcome. We would rather have real measurements than formulas where we can get them.