| Format | Status | Decode 8B Llama | Notes |
|---|---|---|---|
| Q4_K_M | ✅ | 121 t/s | Primary GGUF format, all benches use it |
| Q3_K_M | ✅ | 132 t/s | Smaller VRAM, slight quality drop |
| Q5_K | ✅ | 105 t/s | |
| Q4_0 | ✅ | — | Gemma-4 QAT GGUF line (E2B/E4B/12B/26B/31B) since v0.6.1 — byte-gated vs llama.cpp; 26B-A4B QAT: prefill 1143/646 t/s @p512/p2048, decode 117 t/s. Qwen2.5-Q4_0 stays deliberately gated (needs attn-bias / Q4_1 / Q8_0-lm_head arch features, not the quant). |
| Q6_K | ✅ | — | Less common; chat works |
| Q8_0 | — | Loadable via chat but rejected by bench |
The GGUF loader covers the llama, qwen2 / qwen3 / qwen35 and
gemma4 architectures.
Tokenizer is read from the GGUF file; no extra setup needed.
VulkanForge auto-detects all three FP8 scaling strategies from the
config.json quantization_config block + the SafeTensors header.
| Scale type | Example model | VRAM (GPU) | Decode (GPU) | Prefill pp=512* | CPU lm_head decode |
15-prompt coherence |
|---|---|---|---|---|---|---|
| Per-tensor | neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 |
7.5 GiB | 69 tok/s | 1197 tok/s | 47.6 tok/s | 13/15 |
| Per-channel | larryvrh/Qwen2.5-14B-Instruct-FP8 |
13.8 GiB | 14 tok/s | 450 tok/s | 17.8 tok/s ✓ | 15/15 |
| Block-wise [128,128] | Qwen/Qwen3-8B-FP8 |
8.5 GiB | 62 tok/s | 1118 tok/s | (not benched yet) | 15/15 |
*With native FP8 WMMA on Mesa 26.1+. Routing is capability-driven
since Sprint 47B / v0.3.16: VulkanForge picks the native path
automatically when the driver advertises
shaderFloat8CooperativeMatrix; on Mesa 26.0.x or any driver
without the extension, all three paths use the BF16 conversion
fallback at ~770 / 325 / 757 tok/s respectively.
The CPU lm_head column shows decode tok/s when
VF_CPU_LM_HEAD=1 is set (AVX-512 Q6_K GEMV on Zen 4 7945HX,
v0.3.10). The ✓ marks the 14B win zone: per-channel FP8 with the
CPU offload beats the GPU baseline by 32 % while freeing
970 MB of VRAM. Per-tensor 8B trades 32 % decode for the same
VRAM saving — useful on 12 GB cards or when running multiple 8B
sessions, otherwise leave the flag off.
- Tokenizer not in the FP8 SafeTensors model. Use
--tokenizer-from <gguf>pointing at any GGUF from the same model family. - Block-wise needs
block_sizedivisible by BM=64 and BK=16. The Qwen3 / DeepSeek-V3[128, 128]calibration satisfies both. Other block shapes silently fall through to the GEMV-loop fallback path (Sprint 35). weight_scalevsweight_scale_invsuffix conventions in the SafeTensors header are both auto-detected. Different vendors use different conventions; the loader handles either.
| Architecture | GGUF | FP8 (SafeTensors) | Tokenizer | Notes |
|---|---|---|---|---|
| Llama | ✅ | ✅ per-tensor | gpt2/llama | Llama-3.1, Llama-3.2 (8B + 14B) |
| Qwen2 / Qwen2.5 | ✅ | ✅ per-channel | gpt2 | ChatML, RoPE θ from config.json |
| Qwen3 | ✅ | ✅ block-wise [128,128] | gpt2 | <think> mode, Q/K-norm |
| Mistral 7B | ✅ | SPM | GGUF works; FP8 SPM tokenizer not yet wired | |
| DeepSeek-R1 | ✅ | — | gpt2 | R1-Distill-Llama works on GGUF; native MoE pending |
| Gemma-4 | ✅ | — (BF16/F32 SafeTensors) | gemma4 | incl. 26B-A4B MoE (128 experts, top-8); GGUF Q3_K_M + Q4_0 QAT line (v0.6.1); thought-channel template auto-detected |
gpt2 tokenizer covers Qwen, Llama-3 (which uses BPE), and the
DeepSeek family. SPM (Mistral / Llama-2 family) is GGUF-only at the
moment.
- Mixture-of-experts beyond Gemma-4 (DeepSeek-V3, Mixtral-style routing) — pending. (Gemma-4-26B-A4B MoE — 128 experts, top-8 — runs since v0.5.x; see the table above.)
- Speculative decode / draft-model tokens — single-stream only.
- Batch inference (b > 1 concurrent prompts) — single-stream only.
- Long-context optimizations beyond 4 k tokens — RoPE works to
the model's
ctx_max, but no chunked prefill or paged-attention. - Quantization formats outside the table above — IQ-formats, AWQ, GPTQ, GGML legacy formats are out of scope.
The full 15-prompt coherence suite (results/sprint34cd_logs/bench_vf_15p_8b_q4.txt)
covers:
- Simple sequence (numerics)
- Prime-check Python (code generation)
- Multilingual prose (German / French / Mandarin)
- Emoji input
- Arithmetic with Q4_K precision (numerics edge case)
- C++ snippet
- Go function
- Recursive algorithm explanation
- Haiku composition
- Long-context recall (in-context shadow)
- SQL query construction
- JSON schema generation
- Chat persona consistency
- Arithmetic Q4_K precision (second case)
- Free-form storytelling
Llama-3.1-8B-FP8 + Qwen2.5-14B-FP8 are 15 / 15 on this set. Q4_K_M 8B is 12 / 15 (the three failures are the numeric Q4_K-precision prompts — a known Q4_K format limitation, not an engine bug).