You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Without `GGML_CUDA_FA_HALF_QUANTS` or `GGML_CUDA_FA_ALL_QUANTS`, the CUDA FlashAttention build compiles Bee's q/KVarN-fallback default: 62 K>=V vec pairs with fp cache types capped at q5 and q8/q6 capped at q4. This default leaves TurboQuant/TCQ FA pairs out; use `GGML_CUDA_FA_HALF_QUANTS=ON` for the larger 208-pair Turbo/TCQ-capable half-matrix, or `GGML_CUDA_FA_ALL_QUANTS=ON` for the full 361-pair matrix. These two flags are mutually exclusive. Add `-DCMAKE_CUDA_ARCHITECTURES=86` for RTX 3090, or `-DCMAKE_CUDA_ARCHITECTURES=89` for RTX 4090, if cross-compiling or building in CI without a GPU.
40
+
Without `GGML_CUDA_FA_HALF_QUANTS` or `GGML_CUDA_FA_ALL_QUANTS`, the CUDA FlashAttention build compiles Bee's recommended default for standard q cache types or KVarN fallback: 62 K>=V vec pairs with fp cache types capped at q5 and q8/q6 capped at q4. This default is much faster to compile than the 208-pair HALF or 361-pair ALL modes and leaves TurboQuant/TCQ FA pairs out because TurboQuant/TCQ has not shown a benchmark advantage over standard q or KVarN cache types in current fork benchmarks. The pair policy keeps K>=V because K loses precision faster under quantization and K<V pairs are inefficient, but avoids K>>>V because large K/V tier gaps are unbalanced. Use `GGML_CUDA_FA_HALF_QUANTS=ON` for TurboQuant/TCQ or high-delta K>=V experiments, or `GGML_CUDA_FA_ALL_QUANTS=ON` for the full 361-pair matrix. These two flags are mutually exclusive. Add `-DCMAKE_CUDA_ARCHITECTURES=86` for RTX 3090, or `-DCMAKE_CUDA_ARCHITECTURES=89` for RTX 4090, if cross-compiling or building in CI without a GPU.
Without `GGML_CUDA_FA_HALF_QUANTS` or `GGML_CUDA_FA_ALL_QUANTS`, the CUDA FlashAttention build compiles Bee's q/KVarN-fallback default: 62 K>=V vec pairs with fp cache types capped at q5 and q8/q6 capped at q4. This default leaves TurboQuant/TCQ FA pairs out; use `GGML_CUDA_FA_HALF_QUANTS=ON` for the larger 208-pair Turbo/TCQ-capable half-matrix, or `GGML_CUDA_FA_ALL_QUANTS=ON` for the full 361-pair matrix. These two flags are mutually exclusive. Add `-DCMAKE_CUDA_ARCHITECTURES=86` for RTX 3090, or `-DCMAKE_CUDA_ARCHITECTURES=89` for RTX 4090, if cross-compiling or building in CI without a GPU.
40
+
Without `GGML_CUDA_FA_HALF_QUANTS` or `GGML_CUDA_FA_ALL_QUANTS`, the CUDA FlashAttention build compiles Bee's recommended default for standard q cache types or KVarN fallback: 62 K>=V vec pairs with fp cache types capped at q5 and q8/q6 capped at q4. This default is much faster to compile than the 208-pair HALF or 361-pair ALL modes and leaves TurboQuant/TCQ FA pairs out because TurboQuant/TCQ has not shown a benchmark advantage over standard q or KVarN cache types in current fork benchmarks. The pair policy keeps K>=V because K loses precision faster under quantization and K<V pairs are inefficient, but avoids K>>>V because large K/V tier gaps are unbalanced. Use `GGML_CUDA_FA_HALF_QUANTS=ON` for TurboQuant/TCQ or high-delta K>=V experiments, or `GGML_CUDA_FA_ALL_QUANTS=ON` for the full 361-pair matrix. These two flags are mutually exclusive. Add `-DCMAKE_CUDA_ARCHITECTURES=86` for RTX 3090, or `-DCMAKE_CUDA_ARCHITECTURES=89` for RTX 4090, if cross-compiling or building in CI without a GPU.
K and V cache types are set independently with `--cache-type-k` and `--cache-type-v`. For the preset rationale and benchmark details, see [KV Cache Quantization Benchmarks for Long Context](https://anbeeld.com/articles/kv-cache-quantization-benchmarks-for-long-context).
261
261
262
+
Current benchmarked presets recommend standard q cache types or KVarN fallback cache types. TurboQuant/TCQ has not shown a benchmark advantage over standard q or KVarN cache types at the recommended cache footprints, so the default CUDA FlashAttention build does not compile Turbo/TCQ pairs.
263
+
262
264
### Preset Ladder
263
265
264
266
| K / V | % of bf16 size | 99.9% precision | What it is for |
@@ -273,8 +275,8 @@ K and V cache types are set independently with `--cache-type-k` and `--cache-typ
273
275
|**q5_0 / q4_1**|**32.8**|**92.65%**|**Best default if VRAM-constrained**|
274
276
| q5_0 / q4_0 | 31.3 | 91.39% | If q5_0 / q4_1 misses the fit by a narrow margin |
275
277
| q4_0 / q4_0 | 28.1 | 88.87% | Memory saving with visible precision loss |
276
-
| q4_0 / turbo3_tcq | 24.2 | 84.93% |Smaller than q4, cleaner than symmetric turbo3_tcq|
Without `GGML_CUDA_FA_HALF_QUANTS` or `GGML_CUDA_FA_ALL_QUANTS`, the CUDA FlashAttention build compiles Bee's q/KVarN-fallback default: 62 K>=V vec pairs with fp cache types capped at q5 and q8/q6 capped at q4. This default leaves TurboQuant/TCQ FA pairs out; use `GGML_CUDA_FA_HALF_QUANTS=ON` for the larger 208-pair Turbo/TCQ-capable half-matrix, or `GGML_CUDA_FA_ALL_QUANTS=ON` for the full 361-pair matrix. These flags are mutually exclusive.
363
+
Without `GGML_CUDA_FA_HALF_QUANTS` or `GGML_CUDA_FA_ALL_QUANTS`, the CUDA FlashAttention build compiles Bee's recommended default for standard q cache types or KVarN fallback: 62 K>=V vec pairs with fp cache types capped at q5 and q8/q6 capped at q4. This default is much faster to compile than the 208-pair HALF or 361-pair ALL modes and leaves TurboQuant/TCQ FA pairs out because TurboQuant/TCQ has not shown a benchmark advantage over standard q or KVarN cache types in current fork benchmarks. The pair policy keeps K>=V because K loses precision faster under quantization and K<V pairs are inefficient, but avoids K>>>V because large K/V tier gaps are unbalanced. Use `GGML_CUDA_FA_HALF_QUANTS=ON` for TurboQuant/TCQ or high-delta K>=V experiments, or `GGML_CUDA_FA_ALL_QUANTS=ON` for the full 361-pair matrix. These flags are mutually exclusive.
Copy file name to clipboardExpand all lines: docs/build.md
+3-1Lines changed: 3 additions & 1 deletion
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -298,7 +298,9 @@ The following compilation options are also available to tweak performance:
298
298
| GGML_CUDA_FORCE_CUBLAS | Boolean | false | Force the use of FP16 cuBLAS instead of custom matrix multiplication kernels for quantized models. There may be issues with numerical overflows (except for V100, CDNA and RDNA4 which use FP32 compute type by default) and memory use will be higher. Prompt processing may become faster on recent datacenter GPUs (the custom kernels were tuned primarily for RTX 3000/4000). |
299
299
| GGML_CUDA_PEER_MAX_BATCH_SIZE | Positive integer | 128 | Maximum batch size for which to enable peer access between multiple GPUs. Peer access requires either Linux or NVLink. When using NVLink enabling peer access for larger batch sizes is potentially beneficial. |
300
300
| GGML_CUDA_FA_ALL_QUANTS | Boolean | false | Compile CUDA FlashAttention vec kernels for the full supported K/V cache type matrix: 361 pairs for the 19-type universe. This is useful for arbitrary asymmetric cache-type combinations. Mutually exclusive with GGML_CUDA_FA_HALF_QUANTS. |
301
-
| GGML_CUDA_FA_HALF_QUANTS | Boolean | false | Compile the larger CUDA FlashAttention vec K/V cache pair set for this fork: the useful K>=V half-matrix plus f16 fallback pairs needed by TurboQuant/TCQ dequant paths. This compiles 208 pairs instead of 361 in GGML_CUDA_FA_ALL_QUANTS. Types are ranked from higher precision to lower: f16 > bf16 > q8_0 > q6_1 > q6_0 > q5_1 > q5_0 > turbo4_tcq > turbo4 > q4_1 > q4_0 > q3_1 > turbo3_tcq > turbo3 > q3_0 > q2_1 > turbo2_tcq > turbo2 > q2_0. Mutually exclusive with GGML_CUDA_FA_ALL_QUANTS. Without either flag, the default is 62 q/KVarN-fallback pairs: K>=V, no Turbo/TCQ, no mixed f16/bf16, fp capped at q5, and q8/q6 capped at q4. |
301
+
| GGML_CUDA_FA_HALF_QUANTS | Boolean | false | Compile the larger CUDA FlashAttention vec K/V cache pair set for this fork: the useful K>=V half-matrix plus f16 fallback pairs needed by TurboQuant/TCQ dequant paths. This compiles 208 pairs instead of 361 in GGML_CUDA_FA_ALL_QUANTS. Types are ranked from higher precision to lower: f16 > bf16 > q8_0 > q6_1 > q6_0 > q5_1 > q5_0 > turbo4_tcq > turbo4 > q4_1 > q4_0 > q3_1 > turbo3_tcq > turbo3 > q3_0 > q2_1 > turbo2_tcq > turbo2 > q2_0. Mutually exclusive with GGML_CUDA_FA_ALL_QUANTS. |
302
+
303
+
Recommended CUDA FlashAttention builds should leave both quant flags off. The default compiles 62 much faster q/KVarN-fallback pairs for standard q cache types or KVarN: K>=V, no Turbo/TCQ, no mixed f16/bf16, fp capped at q5, and q8/q6 capped at q4. TurboQuant/TCQ has not shown a benchmark advantage over standard q or KVarN cache types in current fork benchmarks. The default policy keeps K>=V because K loses precision faster under quantization and K<V pairs are inefficient, but avoids K>>>V because large K/V tier gaps are unbalanced. Use `GGML_CUDA_FA_HALF_QUANTS=ON` when you want TurboQuant/TCQ or high-delta K>=V experiments, and `GGML_CUDA_FA_ALL_QUANTS=ON` only when the full 361-pair matrix is required.
0 commit comments