Skip to content

Commit 8de5997

Browse files
committed
docs: recommend default FA quant builds
1 parent e3cce57 commit 8de5997

8 files changed

Lines changed: 71 additions & 45 deletions

File tree

.github/workflows/release.yml

Lines changed: 0 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -496,7 +496,6 @@ jobs:
496496
-DGPU_TARGETS="${GPU_TARGETS}" \
497497
-DGGML_HIP=ON \
498498
-DGGML_CUDA_FA=ON \
499-
-DGGML_CUDA_FA_HALF_QUANTS=ON \
500499
-DHIP_PLATFORM=amd \
501500
-DGGML_HIP_ROCWMMA_FATTN=ON \
502501
-DCMAKE_C_COMPILER_LAUNCHER=ccache \
@@ -700,7 +699,6 @@ jobs:
700699
-DGGML_NATIVE=OFF \
701700
-DGGML_CUDA=ON \
702701
-DGGML_CUDA_FA=ON \
703-
-DGGML_CUDA_FA_HALF_QUANTS=ON \
704702
-DGGML_BACKEND_DL=ON \
705703
-DGGML_CPU_ALL_VARIANTS=ON \
706704
-DCMAKE_C_COMPILER_LAUNCHER=ccache \
@@ -1101,7 +1099,6 @@ jobs:
11011099
-DGGML_CPU=OFF ^
11021100
-DGGML_CUDA=ON ^
11031101
-DGGML_CUDA_FA=ON ^
1104-
-DGGML_CUDA_FA_HALF_QUANTS=ON ^
11051102
-DLLAMA_BUILD_BORINGSSL=ON ^
11061103
-DCMAKE_C_COMPILER_LAUNCHER=ccache ^
11071104
-DCMAKE_CXX_COMPILER_LAUNCHER=ccache ^
@@ -1226,7 +1223,6 @@ jobs:
12261223
-DGGML_HIP_ROCWMMA_FATTN=ON `
12271224
-DGGML_HIP=ON `
12281225
-DGGML_CUDA_FA=ON `
1229-
-DGGML_CUDA_FA_HALF_QUANTS=ON `
12301226
-DCMAKE_C_COMPILER_LAUNCHER=ccache `
12311227
-DCMAKE_CXX_COMPILER_LAUNCHER=ccache `
12321228
-DCMAKE_HIP_COMPILER_LAUNCHER=ccache `

AGENTS.md

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -37,7 +37,7 @@ cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
3737
cmake --build build -j
3838
```
3939

40-
Without `GGML_CUDA_FA_HALF_QUANTS` or `GGML_CUDA_FA_ALL_QUANTS`, the CUDA FlashAttention build compiles Bee's q/KVarN-fallback default: 62 K>=V vec pairs with fp cache types capped at q5 and q8/q6 capped at q4. This default leaves TurboQuant/TCQ FA pairs out; use `GGML_CUDA_FA_HALF_QUANTS=ON` for the larger 208-pair Turbo/TCQ-capable half-matrix, or `GGML_CUDA_FA_ALL_QUANTS=ON` for the full 361-pair matrix. These two flags are mutually exclusive. Add `-DCMAKE_CUDA_ARCHITECTURES=86` for RTX 3090, or `-DCMAKE_CUDA_ARCHITECTURES=89` for RTX 4090, if cross-compiling or building in CI without a GPU.
40+
Without `GGML_CUDA_FA_HALF_QUANTS` or `GGML_CUDA_FA_ALL_QUANTS`, the CUDA FlashAttention build compiles Bee's recommended default for standard q cache types or KVarN fallback: 62 K>=V vec pairs with fp cache types capped at q5 and q8/q6 capped at q4. This default is much faster to compile than the 208-pair HALF or 361-pair ALL modes and leaves TurboQuant/TCQ FA pairs out because TurboQuant/TCQ has not shown a benchmark advantage over standard q or KVarN cache types in current fork benchmarks. The pair policy keeps K>=V because K loses precision faster under quantization and K<V pairs are inefficient, but avoids K>>>V because large K/V tier gaps are unbalanced. Use `GGML_CUDA_FA_HALF_QUANTS=ON` for TurboQuant/TCQ or high-delta K>=V experiments, or `GGML_CUDA_FA_ALL_QUANTS=ON` for the full 361-pair matrix. These two flags are mutually exclusive. Add `-DCMAKE_CUDA_ARCHITECTURES=86` for RTX 3090, or `-DCMAKE_CUDA_ARCHITECTURES=89` for RTX 4090, if cross-compiling or building in CI without a GPU.
4141

4242
Key binaries: `build/bin/llama-server`, `build/bin/llama-cli`, `build/bin/llama-bench`, `build/bin/llama-perplexity`.
4343

@@ -93,13 +93,13 @@ build/bin/llama-perplexity -m model.gguf -f test.txt -c 4096
9393
# Decode speed
9494
build/bin/llama-bench -m model.gguf -p 0 -n 64 -t 1
9595

96-
# Server with DFlash + TurboQuant
96+
# Server with DFlash + recommended q cache
9797
build/bin/llama-server -m target.gguf --spec-type dflash \
9898
--spec-draft-model drafter.gguf \
9999
--spec-draft-n-max 8 \
100100
--spec-branch-budget 0 \
101101
--spec-dflash-cross-ctx 512 \
102-
--flash-attn on --cache-type-k turbo4 --cache-type-v turbo3_tcq \
102+
--flash-attn on --cache-type-k q5_0 --cache-type-v q4_1 \
103103
--port 8080
104104
```
105105

CLAUDE.md

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -37,7 +37,7 @@ cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
3737
cmake --build build -j
3838
```
3939

40-
Without `GGML_CUDA_FA_HALF_QUANTS` or `GGML_CUDA_FA_ALL_QUANTS`, the CUDA FlashAttention build compiles Bee's q/KVarN-fallback default: 62 K>=V vec pairs with fp cache types capped at q5 and q8/q6 capped at q4. This default leaves TurboQuant/TCQ FA pairs out; use `GGML_CUDA_FA_HALF_QUANTS=ON` for the larger 208-pair Turbo/TCQ-capable half-matrix, or `GGML_CUDA_FA_ALL_QUANTS=ON` for the full 361-pair matrix. These two flags are mutually exclusive. Add `-DCMAKE_CUDA_ARCHITECTURES=86` for RTX 3090, or `-DCMAKE_CUDA_ARCHITECTURES=89` for RTX 4090, if cross-compiling or building in CI without a GPU.
40+
Without `GGML_CUDA_FA_HALF_QUANTS` or `GGML_CUDA_FA_ALL_QUANTS`, the CUDA FlashAttention build compiles Bee's recommended default for standard q cache types or KVarN fallback: 62 K>=V vec pairs with fp cache types capped at q5 and q8/q6 capped at q4. This default is much faster to compile than the 208-pair HALF or 361-pair ALL modes and leaves TurboQuant/TCQ FA pairs out because TurboQuant/TCQ has not shown a benchmark advantage over standard q or KVarN cache types in current fork benchmarks. The pair policy keeps K>=V because K loses precision faster under quantization and K<V pairs are inefficient, but avoids K>>>V because large K/V tier gaps are unbalanced. Use `GGML_CUDA_FA_HALF_QUANTS=ON` for TurboQuant/TCQ or high-delta K>=V experiments, or `GGML_CUDA_FA_ALL_QUANTS=ON` for the full 361-pair matrix. These two flags are mutually exclusive. Add `-DCMAKE_CUDA_ARCHITECTURES=86` for RTX 3090, or `-DCMAKE_CUDA_ARCHITECTURES=89` for RTX 4090, if cross-compiling or building in CI without a GPU.
4141

4242
Key binaries: `build/bin/llama-server`, `build/bin/llama-cli`, `build/bin/llama-bench`, `build/bin/llama-perplexity`.
4343

@@ -93,13 +93,13 @@ build/bin/llama-perplexity -m model.gguf -f test.txt -c 4096
9393
# Decode speed
9494
build/bin/llama-bench -m model.gguf -p 0 -n 64 -t 1
9595

96-
# Server with DFlash + TurboQuant
96+
# Server with DFlash + recommended q cache
9797
build/bin/llama-server -m target.gguf --spec-type dflash \
9898
--spec-draft-model drafter.gguf \
9999
--spec-draft-n-max 8 \
100100
--spec-branch-budget 0 \
101101
--spec-dflash-cross-ctx 512 \
102-
--flash-attn on --cache-type-k turbo4 --cache-type-v turbo3_tcq \
102+
--flash-attn on --cache-type-k q5_0 --cache-type-v q4_1 \
103103
--port 8080
104104
```
105105

README.md

Lines changed: 11 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -259,6 +259,8 @@ Target model: [Gemma 4 31B Q4_K_S](https://huggingface.co/unsloth/gemma-4-31b-it
259259

260260
K and V cache types are set independently with `--cache-type-k` and `--cache-type-v`. For the preset rationale and benchmark details, see [KV Cache Quantization Benchmarks for Long Context](https://anbeeld.com/articles/kv-cache-quantization-benchmarks-for-long-context).
261261

262+
Current benchmarked presets recommend standard q cache types or KVarN fallback cache types. TurboQuant/TCQ has not shown a benchmark advantage over standard q or KVarN cache types at the recommended cache footprints, so the default CUDA FlashAttention build does not compile Turbo/TCQ pairs.
263+
262264
### Preset Ladder
263265

264266
| K / V | % of bf16 size | 99.9% precision | What it is for |
@@ -273,8 +275,8 @@ K and V cache types are set independently with `--cache-type-k` and `--cache-typ
273275
| **q5_0 / q4_1** | **32.8** | **92.65%** | **Best default if VRAM-constrained** |
274276
| q5_0 / q4_0 | 31.3 | 91.39% | If q5_0 / q4_1 misses the fit by a narrow margin |
275277
| q4_0 / q4_0 | 28.1 | 88.87% | Memory saving with visible precision loss |
276-
| q4_0 / turbo3_tcq | 24.2 | 84.93% | Smaller than q4, cleaner than symmetric turbo3_tcq |
277-
| **turbo3_tcq / turbo3_tcq** | **20.3** | **81.56%** | **Viable extreme-compression mode** |
278+
| q4_0 / turbo3_tcq | 24.2 | 84.93% | Legacy Turbo/TCQ comparison point; requires a HALF or ALL FA build |
279+
| turbo3_tcq / turbo3_tcq | 20.3 | 81.56% | Legacy extreme-compression comparison point, not the recommended default |
278280
| turbo2_tcq / turbo2_tcq | 14.1 | 54.38% | Last resort: not for code, JSON, math, or tool calls |
279281

280282
*99.9% precision = `100 · exp(−(quantKLD − bf16KLD))` at the 99.9% KL-divergence tail.*
@@ -290,10 +292,10 @@ K and V cache types are set independently with `--cache-type-k` and `--cache-typ
290292
| q4_1 | upstream | 5 | 3.2× | Smaller than q5_0, but weaker in the tail. Prefer q5_0 for K |
291293
| q4_0 | upstream | 4.5 | 3.56× | Default high compression type, decent at its size |
292294
| turbo4 | fork | 4.125 | 3.88× | Barely smaller than q4_0, slower, worse tail |
293-
| turbo3_tcq | fork | 3.25 | 4.92× | Viable compact mode, 82% precision at KLD 99.9%. CUDA-only |
294-
| turbo3 | fork | 3.125 | 5.12× | Weaker than turbo3_tcq. Use only when TCQ is unavailable |
295-
| turbo2_tcq | fork | 2.25 | 7.11× | Last resort, 54% precision at KLD 99.9%. CUDA-only |
296-
| turbo2 | fork | 2.125 | 7.53× | Extreme quality risk. Use only when TCQ is unavailable |
295+
| turbo3_tcq | fork | 3.25 | 4.92× | CUDA-only comparison point; no current benchmark advantage over q/KVarN presets |
296+
| turbo3 | fork | 3.125 | 5.12× | Weaker than turbo3_tcq. Use only for targeted experiments |
297+
| turbo2_tcq | fork | 2.25 | 7.11× | CUDA-only last resort, 54% precision at KLD 99.9% |
298+
| turbo2 | fork | 2.125 | 7.53× | Extreme quality risk. Use only for targeted experiments |
297299

298300
## Installation
299301

@@ -358,7 +360,7 @@ cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
358360
cmake --build build -j
359361
```
360362

361-
Without `GGML_CUDA_FA_HALF_QUANTS` or `GGML_CUDA_FA_ALL_QUANTS`, the CUDA FlashAttention build compiles Bee's q/KVarN-fallback default: 62 K>=V vec pairs with fp cache types capped at q5 and q8/q6 capped at q4. This default leaves TurboQuant/TCQ FA pairs out; use `GGML_CUDA_FA_HALF_QUANTS=ON` for the larger 208-pair Turbo/TCQ-capable half-matrix, or `GGML_CUDA_FA_ALL_QUANTS=ON` for the full 361-pair matrix. These flags are mutually exclusive.
363+
Without `GGML_CUDA_FA_HALF_QUANTS` or `GGML_CUDA_FA_ALL_QUANTS`, the CUDA FlashAttention build compiles Bee's recommended default for standard q cache types or KVarN fallback: 62 K>=V vec pairs with fp cache types capped at q5 and q8/q6 capped at q4. This default is much faster to compile than the 208-pair HALF or 361-pair ALL modes and leaves TurboQuant/TCQ FA pairs out because TurboQuant/TCQ has not shown a benchmark advantage over standard q or KVarN cache types in current fork benchmarks. The pair policy keeps K>=V because K loses precision faster under quantization and K<V pairs are inefficient, but avoids K>>>V because large K/V tier gaps are unbalanced. Use `GGML_CUDA_FA_HALF_QUANTS=ON` for TurboQuant/TCQ or high-delta K>=V experiments, or `GGML_CUDA_FA_ALL_QUANTS=ON` for the full 361-pair matrix. These flags are mutually exclusive.
362364

363365
### Other Backends
364366

@@ -382,13 +384,13 @@ llama-server -m model.gguf -c 16384 -np 4
382384
llama-server -m model.gguf -md draft.gguf
383385
```
384386

385-
### DFlash And TurboQuant Together
387+
### DFlash With Recommended KV Cache
386388

387389
```sh
388390
llama-server -m target.gguf --spec-type dflash \
389391
--spec-draft-model drafter.gguf \
390392
--spec-draft-ngl all \
391-
--flash-attn on --cache-type-k turbo4 --cache-type-v turbo3_tcq
393+
--flash-attn on --cache-type-k q5_0 --cache-type-v q4_1
392394
```
393395

394396
## Documentation

docs/build.md

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -298,7 +298,9 @@ The following compilation options are also available to tweak performance:
298298
| GGML_CUDA_FORCE_CUBLAS | Boolean | false | Force the use of FP16 cuBLAS instead of custom matrix multiplication kernels for quantized models. There may be issues with numerical overflows (except for V100, CDNA and RDNA4 which use FP32 compute type by default) and memory use will be higher. Prompt processing may become faster on recent datacenter GPUs (the custom kernels were tuned primarily for RTX 3000/4000). |
299299
| GGML_CUDA_PEER_MAX_BATCH_SIZE | Positive integer | 128 | Maximum batch size for which to enable peer access between multiple GPUs. Peer access requires either Linux or NVLink. When using NVLink enabling peer access for larger batch sizes is potentially beneficial. |
300300
| GGML_CUDA_FA_ALL_QUANTS | Boolean | false | Compile CUDA FlashAttention vec kernels for the full supported K/V cache type matrix: 361 pairs for the 19-type universe. This is useful for arbitrary asymmetric cache-type combinations. Mutually exclusive with GGML_CUDA_FA_HALF_QUANTS. |
301-
| GGML_CUDA_FA_HALF_QUANTS | Boolean | false | Compile the larger CUDA FlashAttention vec K/V cache pair set for this fork: the useful K>=V half-matrix plus f16 fallback pairs needed by TurboQuant/TCQ dequant paths. This compiles 208 pairs instead of 361 in GGML_CUDA_FA_ALL_QUANTS. Types are ranked from higher precision to lower: f16 > bf16 > q8_0 > q6_1 > q6_0 > q5_1 > q5_0 > turbo4_tcq > turbo4 > q4_1 > q4_0 > q3_1 > turbo3_tcq > turbo3 > q3_0 > q2_1 > turbo2_tcq > turbo2 > q2_0. Mutually exclusive with GGML_CUDA_FA_ALL_QUANTS. Without either flag, the default is 62 q/KVarN-fallback pairs: K>=V, no Turbo/TCQ, no mixed f16/bf16, fp capped at q5, and q8/q6 capped at q4. |
301+
| GGML_CUDA_FA_HALF_QUANTS | Boolean | false | Compile the larger CUDA FlashAttention vec K/V cache pair set for this fork: the useful K>=V half-matrix plus f16 fallback pairs needed by TurboQuant/TCQ dequant paths. This compiles 208 pairs instead of 361 in GGML_CUDA_FA_ALL_QUANTS. Types are ranked from higher precision to lower: f16 > bf16 > q8_0 > q6_1 > q6_0 > q5_1 > q5_0 > turbo4_tcq > turbo4 > q4_1 > q4_0 > q3_1 > turbo3_tcq > turbo3 > q3_0 > q2_1 > turbo2_tcq > turbo2 > q2_0. Mutually exclusive with GGML_CUDA_FA_ALL_QUANTS. |
302+
303+
Recommended CUDA FlashAttention builds should leave both quant flags off. The default compiles 62 much faster q/KVarN-fallback pairs for standard q cache types or KVarN: K>=V, no Turbo/TCQ, no mixed f16/bf16, fp capped at q5, and q8/q6 capped at q4. TurboQuant/TCQ has not shown a benchmark advantage over standard q or KVarN cache types in current fork benchmarks. The default policy keeps K>=V because K loses precision faster under quantization and K<V pairs are inefficient, but avoids K>>>V because large K/V tier gaps are unbalanced. Use `GGML_CUDA_FA_HALF_QUANTS=ON` when you want TurboQuant/TCQ or high-delta K>=V experiments, and `GGML_CUDA_FA_ALL_QUANTS=ON` only when the full 361-pair matrix is required.
302304

303305
## MUSA
304306

0 commit comments

Comments
 (0)