Anbeeld
diff --git a/‎.github/workflows/release.yml‎
Lines changed: 0 additions & 4 deletions b/‎.github/workflows/release.yml‎
Lines changed: 0 additions & 4 deletions
diff --git a/‎AGENTS.md‎
Lines changed: 3 additions & 3 deletions b/‎AGENTS.md‎
Lines changed: 3 additions & 3 deletions
diff --git a/‎CLAUDE.md‎
Lines changed: 3 additions & 3 deletions b/‎CLAUDE.md‎
Lines changed: 3 additions & 3 deletions
diff --git a/‎README.md‎
Lines changed: 11 additions & 9 deletions b/‎README.md‎
Lines changed: 11 additions & 9 deletions
diff --git a/‎docs/build.md‎
Lines changed: 3 additions & 1 deletion b/‎docs/build.md‎
Lines changed: 3 additions & 1 deletion
@@ -496,7 +496,6 @@ jobs:
             -DGPU_TARGETS="${GPU_TARGETS}" \
             -DGGML_HIP=ON \
             -DGGML_CUDA_FA=ON \
-            -DGGML_CUDA_FA_HALF_QUANTS=ON \
             -DHIP_PLATFORM=amd \
             -DGGML_HIP_ROCWMMA_FATTN=ON \
             -DCMAKE_C_COMPILER_LAUNCHER=ccache \
@@ -700,7 +699,6 @@ jobs:
             -DGGML_NATIVE=OFF \
             -DGGML_CUDA=ON \
             -DGGML_CUDA_FA=ON \
-            -DGGML_CUDA_FA_HALF_QUANTS=ON \
             -DGGML_BACKEND_DL=ON \
             -DGGML_CPU_ALL_VARIANTS=ON \
             -DCMAKE_C_COMPILER_LAUNCHER=ccache \
@@ -1101,7 +1099,6 @@ jobs:
             -DGGML_CPU=OFF ^
             -DGGML_CUDA=ON ^
             -DGGML_CUDA_FA=ON ^
-            -DGGML_CUDA_FA_HALF_QUANTS=ON ^
             -DLLAMA_BUILD_BORINGSSL=ON ^
             -DCMAKE_C_COMPILER_LAUNCHER=ccache ^
             -DCMAKE_CXX_COMPILER_LAUNCHER=ccache ^
@@ -1226,7 +1223,6 @@ jobs:
             -DGGML_HIP_ROCWMMA_FATTN=ON `
             -DGGML_HIP=ON `
             -DGGML_CUDA_FA=ON `
-            -DGGML_CUDA_FA_HALF_QUANTS=ON `
             -DCMAKE_C_COMPILER_LAUNCHER=ccache `
             -DCMAKE_CXX_COMPILER_LAUNCHER=ccache `
             -DCMAKE_HIP_COMPILER_LAUNCHER=ccache `
 
@@ -37,7 +37,7 @@ cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
 cmake --build build -j
 ```
 
-Without `GGML_CUDA_FA_HALF_QUANTS` or `GGML_CUDA_FA_ALL_QUANTS`, the CUDA FlashAttention build compiles Bee's q/KVarN-fallback default: 62 K>=V vec pairs with fp cache types capped at q5 and q8/q6 capped at q4. This default leaves TurboQuant/TCQ FA pairs out; use `GGML_CUDA_FA_HALF_QUANTS=ON` for the larger 208-pair Turbo/TCQ-capable half-matrix, or `GGML_CUDA_FA_ALL_QUANTS=ON` for the full 361-pair matrix. These two flags are mutually exclusive. Add `-DCMAKE_CUDA_ARCHITECTURES=86` for RTX 3090, or `-DCMAKE_CUDA_ARCHITECTURES=89` for RTX 4090, if cross-compiling or building in CI without a GPU.
+Without `GGML_CUDA_FA_HALF_QUANTS` or `GGML_CUDA_FA_ALL_QUANTS`, the CUDA FlashAttention build compiles Bee's recommended default for standard q cache types or KVarN fallback: 62 K>=V vec pairs with fp cache types capped at q5 and q8/q6 capped at q4. This default is much faster to compile than the 208-pair HALF or 361-pair ALL modes and leaves TurboQuant/TCQ FA pairs out because TurboQuant/TCQ has not shown a benchmark advantage over standard q or KVarN cache types in current fork benchmarks. The pair policy keeps K>=V because K loses precision faster under quantization and K<V pairs are inefficient, but avoids K>>>V because large K/V tier gaps are unbalanced. Use `GGML_CUDA_FA_HALF_QUANTS=ON` for TurboQuant/TCQ or high-delta K>=V experiments, or `GGML_CUDA_FA_ALL_QUANTS=ON` for the full 361-pair matrix. These two flags are mutually exclusive. Add `-DCMAKE_CUDA_ARCHITECTURES=86` for RTX 3090, or `-DCMAKE_CUDA_ARCHITECTURES=89` for RTX 4090, if cross-compiling or building in CI without a GPU.
 
 Key binaries: `build/bin/llama-server`, `build/bin/llama-cli`, `build/bin/llama-bench`, `build/bin/llama-perplexity`.
 
@@ -93,13 +93,13 @@ build/bin/llama-perplexity -m model.gguf -f test.txt -c 4096
 # Decode speed
 build/bin/llama-bench -m model.gguf -p 0 -n 64 -t 1
 
-# Server with DFlash + TurboQuant
+# Server with DFlash + recommended q cache
 build/bin/llama-server -m target.gguf --spec-type dflash \
   --spec-draft-model drafter.gguf \
   --spec-draft-n-max 8 \
   --spec-branch-budget 0 \
   --spec-dflash-cross-ctx 512 \
-  --flash-attn on --cache-type-k turbo4 --cache-type-v turbo3_tcq \
+  --flash-attn on --cache-type-k q5_0 --cache-type-v q4_1 \
   --port 8080
 ```
 
 
@@ -37,7 +37,7 @@ cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
 cmake --build build -j
 ```
 
-Without `GGML_CUDA_FA_HALF_QUANTS` or `GGML_CUDA_FA_ALL_QUANTS`, the CUDA FlashAttention build compiles Bee's q/KVarN-fallback default: 62 K>=V vec pairs with fp cache types capped at q5 and q8/q6 capped at q4. This default leaves TurboQuant/TCQ FA pairs out; use `GGML_CUDA_FA_HALF_QUANTS=ON` for the larger 208-pair Turbo/TCQ-capable half-matrix, or `GGML_CUDA_FA_ALL_QUANTS=ON` for the full 361-pair matrix. These two flags are mutually exclusive. Add `-DCMAKE_CUDA_ARCHITECTURES=86` for RTX 3090, or `-DCMAKE_CUDA_ARCHITECTURES=89` for RTX 4090, if cross-compiling or building in CI without a GPU.
+Without `GGML_CUDA_FA_HALF_QUANTS` or `GGML_CUDA_FA_ALL_QUANTS`, the CUDA FlashAttention build compiles Bee's recommended default for standard q cache types or KVarN fallback: 62 K>=V vec pairs with fp cache types capped at q5 and q8/q6 capped at q4. This default is much faster to compile than the 208-pair HALF or 361-pair ALL modes and leaves TurboQuant/TCQ FA pairs out because TurboQuant/TCQ has not shown a benchmark advantage over standard q or KVarN cache types in current fork benchmarks. The pair policy keeps K>=V because K loses precision faster under quantization and K<V pairs are inefficient, but avoids K>>>V because large K/V tier gaps are unbalanced. Use `GGML_CUDA_FA_HALF_QUANTS=ON` for TurboQuant/TCQ or high-delta K>=V experiments, or `GGML_CUDA_FA_ALL_QUANTS=ON` for the full 361-pair matrix. These two flags are mutually exclusive. Add `-DCMAKE_CUDA_ARCHITECTURES=86` for RTX 3090, or `-DCMAKE_CUDA_ARCHITECTURES=89` for RTX 4090, if cross-compiling or building in CI without a GPU.
 
 Key binaries: `build/bin/llama-server`, `build/bin/llama-cli`, `build/bin/llama-bench`, `build/bin/llama-perplexity`.
 
@@ -93,13 +93,13 @@ build/bin/llama-perplexity -m model.gguf -f test.txt -c 4096
 # Decode speed
 build/bin/llama-bench -m model.gguf -p 0 -n 64 -t 1
 
-# Server with DFlash + TurboQuant
+# Server with DFlash + recommended q cache
 build/bin/llama-server -m target.gguf --spec-type dflash \
   --spec-draft-model drafter.gguf \
   --spec-draft-n-max 8 \
   --spec-branch-budget 0 \
   --spec-dflash-cross-ctx 512 \
-  --flash-attn on --cache-type-k turbo4 --cache-type-v turbo3_tcq \
+  --flash-attn on --cache-type-k q5_0 --cache-type-v q4_1 \
   --port 8080
 ```
 
 
@@ -259,6 +259,8 @@ Target model: [Gemma 4 31B Q4_K_S](https://huggingface.co/unsloth/gemma-4-31b-it
 
 K and V cache types are set independently with `--cache-type-k` and `--cache-type-v`. For the preset rationale and benchmark details, see [KV Cache Quantization Benchmarks for Long Context](https://anbeeld.com/articles/kv-cache-quantization-benchmarks-for-long-context).
 
+Current benchmarked presets recommend standard q cache types or KVarN fallback cache types. TurboQuant/TCQ has not shown a benchmark advantage over standard q or KVarN cache types at the recommended cache footprints, so the default CUDA FlashAttention build does not compile Turbo/TCQ pairs.
+
 ### Preset Ladder
 
 | K / V | % of bf16 size | 99.9% precision | What it is for |
@@ -273,8 +275,8 @@ K and V cache types are set independently with `--cache-type-k` and `--cache-typ
 | **q5_0 / q4_1** | **32.8** | **92.65%** | **Best default if VRAM-constrained** |
 | q5_0 / q4_0 | 31.3 | 91.39% | If q5_0 / q4_1 misses the fit by a narrow margin |
 | q4_0 / q4_0 | 28.1 | 88.87% | Memory saving with visible precision loss |
-| q4_0 / turbo3_tcq | 24.2 | 84.93% | Smaller than q4, cleaner than symmetric turbo3_tcq |
-| **turbo3_tcq / turbo3_tcq** | **20.3** | **81.56%** | **Viable extreme-compression mode** |
+| q4_0 / turbo3_tcq | 24.2 | 84.93% | Legacy Turbo/TCQ comparison point; requires a HALF or ALL FA build |
+| turbo3_tcq / turbo3_tcq | 20.3 | 81.56% | Legacy extreme-compression comparison point, not the recommended default |
 | turbo2_tcq / turbo2_tcq | 14.1 | 54.38% | Last resort: not for code, JSON, math, or tool calls |
 
 *99.9% precision = `100 · exp(−(quantKLD − bf16KLD))` at the 99.9% KL-divergence tail.*
@@ -290,10 +292,10 @@ K and V cache types are set independently with `--cache-type-k` and `--cache-typ
 | q4_1 | upstream | 5 | 3.2× | Smaller than q5_0, but weaker in the tail. Prefer q5_0 for K |
 | q4_0 | upstream | 4.5 | 3.56× | Default high compression type, decent at its size |
 | turbo4 | fork | 4.125 | 3.88× | Barely smaller than q4_0, slower, worse tail |
-| turbo3_tcq | fork | 3.25 | 4.92× | Viable compact mode, 82% precision at KLD 99.9%. CUDA-only |
-| turbo3 | fork | 3.125 | 5.12× | Weaker than turbo3_tcq. Use only when TCQ is unavailable |
-| turbo2_tcq | fork | 2.25 | 7.11× | Last resort, 54% precision at KLD 99.9%. CUDA-only |
-| turbo2 | fork | 2.125 | 7.53× | Extreme quality risk. Use only when TCQ is unavailable |
+| turbo3_tcq | fork | 3.25 | 4.92× | CUDA-only comparison point; no current benchmark advantage over q/KVarN presets |
+| turbo3 | fork | 3.125 | 5.12× | Weaker than turbo3_tcq. Use only for targeted experiments |
+| turbo2_tcq | fork | 2.25 | 7.11× | CUDA-only last resort, 54% precision at KLD 99.9% |
+| turbo2 | fork | 2.125 | 7.53× | Extreme quality risk. Use only for targeted experiments |
 
 ## Installation
 
@@ -358,7 +360,7 @@ cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
 cmake --build build -j
 ```
 
-Without `GGML_CUDA_FA_HALF_QUANTS` or `GGML_CUDA_FA_ALL_QUANTS`, the CUDA FlashAttention build compiles Bee's q/KVarN-fallback default: 62 K>=V vec pairs with fp cache types capped at q5 and q8/q6 capped at q4. This default leaves TurboQuant/TCQ FA pairs out; use `GGML_CUDA_FA_HALF_QUANTS=ON` for the larger 208-pair Turbo/TCQ-capable half-matrix, or `GGML_CUDA_FA_ALL_QUANTS=ON` for the full 361-pair matrix. These flags are mutually exclusive.
+Without `GGML_CUDA_FA_HALF_QUANTS` or `GGML_CUDA_FA_ALL_QUANTS`, the CUDA FlashAttention build compiles Bee's recommended default for standard q cache types or KVarN fallback: 62 K>=V vec pairs with fp cache types capped at q5 and q8/q6 capped at q4. This default is much faster to compile than the 208-pair HALF or 361-pair ALL modes and leaves TurboQuant/TCQ FA pairs out because TurboQuant/TCQ has not shown a benchmark advantage over standard q or KVarN cache types in current fork benchmarks. The pair policy keeps K>=V because K loses precision faster under quantization and K<V pairs are inefficient, but avoids K>>>V because large K/V tier gaps are unbalanced. Use `GGML_CUDA_FA_HALF_QUANTS=ON` for TurboQuant/TCQ or high-delta K>=V experiments, or `GGML_CUDA_FA_ALL_QUANTS=ON` for the full 361-pair matrix. These flags are mutually exclusive.
 
 ### Other Backends
 
@@ -382,13 +384,13 @@ llama-server -m model.gguf -c 16384 -np 4
 llama-server -m model.gguf -md draft.gguf
 ```
 
-### DFlash And TurboQuant Together
+### DFlash With Recommended KV Cache
 
 ```sh
 llama-server -m target.gguf --spec-type dflash \
   --spec-draft-model drafter.gguf \
   --spec-draft-ngl all \
-  --flash-attn on --cache-type-k turbo4 --cache-type-v turbo3_tcq
+  --flash-attn on --cache-type-k q5_0 --cache-type-v q4_1
 ```
 
 ## Documentation
 
@@ -298,7 +298,9 @@ The following compilation options are also available to tweak performance:
 | GGML_CUDA_FORCE_CUBLAS        | Boolean                | false   | Force the use of FP16 cuBLAS instead of custom matrix multiplication kernels for quantized models. There may be issues with numerical overflows (except for V100, CDNA and RDNA4 which use FP32 compute type by default) and memory use will be higher. Prompt processing may become faster on recent datacenter GPUs (the custom kernels were tuned primarily for RTX 3000/4000).   |
 | GGML_CUDA_PEER_MAX_BATCH_SIZE | Positive integer       | 128     | Maximum batch size for which to enable peer access between multiple GPUs. Peer access requires either Linux or NVLink. When using NVLink enabling peer access for larger batch sizes is potentially beneficial.                                                                                                                                                                  |
 | GGML_CUDA_FA_ALL_QUANTS       | Boolean                | false   | Compile CUDA FlashAttention vec kernels for the full supported K/V cache type matrix: 361 pairs for the 19-type universe. This is useful for arbitrary asymmetric cache-type combinations. Mutually exclusive with GGML_CUDA_FA_HALF_QUANTS.                                                                                                         |
-| GGML_CUDA_FA_HALF_QUANTS      | Boolean                | false   | Compile the larger CUDA FlashAttention vec K/V cache pair set for this fork: the useful K>=V half-matrix plus f16 fallback pairs needed by TurboQuant/TCQ dequant paths. This compiles 208 pairs instead of 361 in GGML_CUDA_FA_ALL_QUANTS. Types are ranked from higher precision to lower: f16 > bf16 > q8_0 > q6_1 > q6_0 > q5_1 > q5_0 > turbo4_tcq > turbo4 > q4_1 > q4_0 > q3_1 > turbo3_tcq > turbo3 > q3_0 > q2_1 > turbo2_tcq > turbo2 > q2_0. Mutually exclusive with GGML_CUDA_FA_ALL_QUANTS. Without either flag, the default is 62 q/KVarN-fallback pairs: K>=V, no Turbo/TCQ, no mixed f16/bf16, fp capped at q5, and q8/q6 capped at q4. |
+| GGML_CUDA_FA_HALF_QUANTS      | Boolean                | false   | Compile the larger CUDA FlashAttention vec K/V cache pair set for this fork: the useful K>=V half-matrix plus f16 fallback pairs needed by TurboQuant/TCQ dequant paths. This compiles 208 pairs instead of 361 in GGML_CUDA_FA_ALL_QUANTS. Types are ranked from higher precision to lower: f16 > bf16 > q8_0 > q6_1 > q6_0 > q5_1 > q5_0 > turbo4_tcq > turbo4 > q4_1 > q4_0 > q3_1 > turbo3_tcq > turbo3 > q3_0 > q2_1 > turbo2_tcq > turbo2 > q2_0. Mutually exclusive with GGML_CUDA_FA_ALL_QUANTS. |
+
+Recommended CUDA FlashAttention builds should leave both quant flags off. The default compiles 62 much faster q/KVarN-fallback pairs for standard q cache types or KVarN: K>=V, no Turbo/TCQ, no mixed f16/bf16, fp capped at q5, and q8/q6 capped at q4. TurboQuant/TCQ has not shown a benchmark advantage over standard q or KVarN cache types in current fork benchmarks. The default policy keeps K>=V because K loses precision faster under quantization and K<V pairs are inefficient, but avoids K>>>V because large K/V tier gaps are unbalanced. Use `GGML_CUDA_FA_HALF_QUANTS=ON` when you want TurboQuant/TCQ or high-delta K>=V experiments, and `GGML_CUDA_FA_ALL_QUANTS=ON` only when the full 361-pair matrix is required.
 
 ## MUSA