Anbeeld
diff --git a/AGENTS.md b/AGENTS.md
Lines changed: 1 addition & 1 deletion
diff --git a/CLAUDE.md b/CLAUDE.md
Lines changed: 1 addition & 1 deletion
diff --git a/README.md b/README.md
Lines changed: 1 addition & 1 deletion
diff --git a/docs/build.md b/docs/build.md
Lines changed: 2 additions & 1 deletion
diff --git a/docs/quickstart-gemma-4-31b-dflash.md b/docs/quickstart-gemma-4-31b-dflash.md
Lines changed: 1 addition & 1 deletion
diff --git a/docs/quickstart-qwen36-dflash.md b/docs/quickstart-qwen36-dflash.md
Lines changed: 1 addition & 1 deletion
diff --git a/ggml/CMakeLists.txt b/ggml/CMakeLists.txt
Lines changed: 1 addition & 0 deletions
diff --git a/ggml/src/ggml-cuda/CMakeLists.txt b/ggml/src/ggml-cuda/CMakeLists.txt
Lines changed: 83 additions & 17 deletions
@@ -37,7 +37,7 @@ cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
 cmake --build build -j
 ```
 
-`GGML_CUDA_FA_ALL_QUANTS=ON` is required for TurboQuant and TCQ cache types. Add `-DCMAKE_CUDA_ARCHITECTURES=86` for RTX 3090, or `-DCMAKE_CUDA_ARCHITECTURES=89` for RTX 4090, if cross-compiling or building in CI without a GPU.
+`GGML_CUDA_FA_ALL_QUANTS=ON` is required for TurboQuant and TCQ cache types. `GGML_CUDA_FA_HALF_QUANTS=ON` is an alternative that compiles only the useful K>=V half of the K/V pair matrix (compiling 91 FA vec K/V pairs instead of 169, reducing FA vec pair instances by ~46% vs ALL_QUANTS). These two flags are mutually exclusive. Add `-DCMAKE_CUDA_ARCHITECTURES=86` for RTX 3090, or `-DCMAKE_CUDA_ARCHITECTURES=89` for RTX 4090, if cross-compiling or building in CI without a GPU.
 
 Key binaries: `build/bin/llama-server`, `build/bin/llama-cli`, `build/bin/llama-bench`, `build/bin/llama-perplexity`.
 
 
@@ -37,7 +37,7 @@ cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
 cmake --build build -j
 ```
 
-`GGML_CUDA_FA_ALL_QUANTS=ON` is required for TurboQuant and TCQ cache types. Add `-DCMAKE_CUDA_ARCHITECTURES=86` for RTX 3090, or `-DCMAKE_CUDA_ARCHITECTURES=89` for RTX 4090, if cross-compiling or building in CI without a GPU.
+`GGML_CUDA_FA_ALL_QUANTS=ON` is required for TurboQuant and TCQ cache types. `GGML_CUDA_FA_HALF_QUANTS=ON` is an alternative that compiles only the useful K>=V half of the K/V pair matrix (compiling 91 FA vec K/V pairs instead of 169, reducing FA vec pair instances by ~46% vs ALL_QUANTS). These two flags are mutually exclusive. Add `-DCMAKE_CUDA_ARCHITECTURES=86` for RTX 3090, or `-DCMAKE_CUDA_ARCHITECTURES=89` for RTX 4090, if cross-compiling or building in CI without a GPU.
 
 Key binaries: `build/bin/llama-server`, `build/bin/llama-cli`, `build/bin/llama-bench`, `build/bin/llama-perplexity`.
 
 
@@ -358,7 +358,7 @@ cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
 cmake --build build -j
 ```
 
-`GGML_CUDA_FA_ALL_QUANTS=ON` is required for TurboQuant and TCQ cache types. Add `-DCMAKE_CUDA_ARCHITECTURES=86` for RTX 3090, or `-DCMAKE_CUDA_ARCHITECTURES=89` for RTX 4090, if cross-compiling or building in CI without a GPU.
+`GGML_CUDA_FA_ALL_QUANTS=ON` is required for TurboQuant and TCQ cache types. `GGML_CUDA_FA_HALF_QUANTS=ON` is an alternative that compiles only the useful K>=V half of the K/V pair matrix (compiling 91 FA vec K/V pairs instead of 169, reducing FA vec pair instances by ~46% vs ALL_QUANTS). These two flags are mutually exclusive.
 
 ### Other Backends
 
 
@@ -297,7 +297,8 @@ The following compilation options are also available to tweak performance:
 | GGML_CUDA_FORCE_MMQ           | Boolean                | false   | Force the use of custom matrix multiplication kernels for quantized models instead of FP16 cuBLAS even if there is no int8 tensor core implementation available (affects V100, CDNA and RDNA3+). MMQ kernels are enabled by default on GPUs with int8 tensor core support. With MMQ force enabled, speed for large batch sizes will be worse but VRAM consumption will be lower. |
 | GGML_CUDA_FORCE_CUBLAS        | Boolean                | false   | Force the use of FP16 cuBLAS instead of custom matrix multiplication kernels for quantized models. There may be issues with numerical overflows (except for V100, CDNA and RDNA4 which use FP32 compute type by default) and memory use will be higher. Prompt processing may become faster on recent datacenter GPUs (the custom kernels were tuned primarily for RTX 3000/4000).   |
 | GGML_CUDA_PEER_MAX_BATCH_SIZE | Positive integer       | 128     | Maximum batch size for which to enable peer access between multiple GPUs. Peer access requires either Linux or NVLink. When using NVLink enabling peer access for larger batch sizes is potentially beneficial.                                                                                                                                                                  |
-| GGML_CUDA_FA_ALL_QUANTS       | Boolean                | false   | Compile support for all KV cache quantization type (combinations) for the FlashAttention CUDA kernels. More fine-grained control over KV cache size but compilation takes much longer.                                                                                                                                                                                           |
+| GGML_CUDA_FA_ALL_QUANTS       | Boolean                | false   | Compile CUDA FlashAttention vec kernels for the full supported K/V cache type matrix (169 pairs for the 13-type universe). Mutually exclusive with GGML_CUDA_FA_HALF_QUANTS.                                                                                                                                                                                                     |
+| GGML_CUDA_FA_HALF_QUANTS      | Boolean                | false   | Compile only the useful K>=V half of the K/V cache type matrix for FlashAttention vec kernels (91 pairs), where types are ranked from higher precision to lower: f16 > bf16 > q8_0 > q6_0 > q5_1 > q5_0 > turbo4 > q4_1 > q4_0 > turbo3_tcq > turbo3 > turbo2_tcq > turbo2. This avoids wasteful reversed asymmetric pairs. Mutually exclusive with GGML_CUDA_FA_ALL_QUANTS.          |
 
 ## MUSA
 
 
@@ -71,7 +71,7 @@ cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=ON \
 cmake --build build -j
 ```
 
-`GGML_CUDA_FA_ALL_QUANTS=ON` is required for TurboQuant and TCQ cache types. Add `-DCMAKE_CUDA_ARCHITECTURES=86` for RTX 3090, or `-DCMAKE_CUDA_ARCHITECTURES=89` for RTX 4090, if cross-compiling or building in CI without a GPU.
+`GGML_CUDA_FA_ALL_QUANTS=ON` is required for TurboQuant and TCQ cache types. `GGML_CUDA_FA_HALF_QUANTS=ON` is an alternative that compiles only the useful K>=V half of the K/V pair matrix (compiling 91 FA vec K/V pairs instead of 169, reducing FA vec pair instances by ~46% vs ALL_QUANTS). These two flags are mutually exclusive.
 
 **macOS (Metal).**
 
 
@@ -71,7 +71,7 @@ cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=ON \
 cmake --build build -j
 ```
 
-`GGML_CUDA_FA_ALL_QUANTS=ON` is required for TurboQuant and TCQ cache types. Add `-DCMAKE_CUDA_ARCHITECTURES=86` for RTX 3090, or `-DCMAKE_CUDA_ARCHITECTURES=89` for RTX 4090, if cross-compiling or building in CI without a GPU.
+`GGML_CUDA_FA_ALL_QUANTS=ON` is required for TurboQuant and TCQ cache types. `GGML_CUDA_FA_HALF_QUANTS=ON` is an alternative that compiles only the useful K>=V half of the K/V pair matrix (compiling 91 FA vec K/V pairs instead of 169, reducing FA vec pair instances by ~46% vs ALL_QUANTS). These two flags are mutually exclusive.
 
 **macOS (Metal).**
 
 
@@ -206,6 +206,7 @@ option(GGML_CUDA_NO_PEER_COPY               "ggml: do not use peer to peer copie
 option(GGML_CUDA_NO_VMM                     "ggml: do not try to use CUDA VMM"                OFF)
 option(GGML_CUDA_FA                         "ggml: compile ggml FlashAttention CUDA kernels"  ON)
 option(GGML_CUDA_FA_ALL_QUANTS              "ggml: compile all quants for FlashAttention"     OFF)
+option(GGML_CUDA_FA_HALF_QUANTS             "ggml: compile only K>=V half of KV cache quant pairs for FlashAttention" OFF)
 option(GGML_CUDA_GRAPHS                     "ggml: use CUDA graphs (llama.cpp only)"          ${GGML_CUDA_GRAPHS_DEFAULT})
 option(GGML_CUDA_NCCL                       "ggml: use NVIDIA Collective Comm. Library"       ON)
 set   (GGML_CUDA_COMPRESSION_MODE "size" CACHE STRING
 
@@ -112,27 +112,93 @@ if (CUDAToolkit_FOUND)
     file(GLOB   SRCS "template-instances/mmf*.cu")
     list(APPEND GGML_SOURCES_CUDA ${SRCS})
 
+    macro(ggml_add_fattn_vec_pair OUT_VAR PREFIX K_TYPE V_TYPE)
+        set(_FILE "${PREFIX}/fattn-vec-instance-${K_TYPE}-${V_TYPE}.cu")
+        if (NOT EXISTS "${CMAKE_CURRENT_SOURCE_DIR}/${_FILE}")
+            message(FATAL_ERROR "Missing FA vec template instance: ${_FILE}")
+        endif()
+        list(APPEND ${OUT_VAR} "${_FILE}")
+    endmacro()
+
+    if (GGML_CUDA_FA_ALL_QUANTS AND GGML_CUDA_FA_HALF_QUANTS)
+        message(FATAL_ERROR
+            "GGML_CUDA_FA_ALL_QUANTS and GGML_CUDA_FA_HALF_QUANTS are mutually exclusive. "
+            "Use GGML_CUDA_FA_ALL_QUANTS for the full K/V pair matrix, or GGML_CUDA_FA_HALF_QUANTS for only K>=V pairs.")
+    endif()
+
+    set(FA_VEC_PREFIX "template-instances")
+
     if (GGML_CUDA_FA_ALL_QUANTS)
-        file(GLOB   SRCS "template-instances/fattn-vec*.cu")
+        file(GLOB   SRCS "${FA_VEC_PREFIX}/fattn-vec-instance-*.cu")
         list(APPEND GGML_SOURCES_CUDA ${SRCS})
         add_compile_definitions(GGML_CUDA_FA_ALL_QUANTS)
+    elseif (GGML_CUDA_FA_HALF_QUANTS)
+        set(GGML_CUDA_FA_KV_TYPES_ORDERED
+            f16
+            bf16
+            q8_0
+            q6_0
+            q5_1
+            q5_0
+            turbo4_0
+            q4_1
+            q4_0
+            turbo3_tcq
+            turbo3_0
+            turbo2_tcq
+            turbo2_0
+        )
+
+        list(LENGTH GGML_CUDA_FA_KV_TYPES_ORDERED GGML_CUDA_FA_KV_TYPES_LEN)
+        math(EXPR GGML_CUDA_FA_KV_TYPES_LAST "${GGML_CUDA_FA_KV_TYPES_LEN} - 1")
+
+        foreach(KI RANGE 0 ${GGML_CUDA_FA_KV_TYPES_LAST})
+            list(GET GGML_CUDA_FA_KV_TYPES_ORDERED ${KI} K_TYPE)
+
+            foreach(VI RANGE ${KI} ${GGML_CUDA_FA_KV_TYPES_LAST})
+                list(GET GGML_CUDA_FA_KV_TYPES_ORDERED ${VI} V_TYPE)
+                ggml_add_fattn_vec_pair(GGML_SOURCES_CUDA "${FA_VEC_PREFIX}" "${K_TYPE}" "${V_TYPE}")
+            endforeach()
+        endforeach()
+
+        add_compile_definitions(GGML_CUDA_FA_HALF_QUANTS)
     else()
-        file(GLOB   SRCS "template-instances/fattn-vec*q4_0-q4_0.cu")
-        list(APPEND GGML_SOURCES_CUDA ${SRCS})
-        file(GLOB   SRCS "template-instances/fattn-vec*q8_0-q8_0.cu")
-        list(APPEND GGML_SOURCES_CUDA ${SRCS})
-        file(GLOB   SRCS "template-instances/fattn-vec*f16-f16.cu")
-        list(APPEND GGML_SOURCES_CUDA ${SRCS})
-        file(GLOB   SRCS "template-instances/fattn-vec*bf16-bf16.cu")
-        list(APPEND GGML_SOURCES_CUDA ${SRCS})
-        file(GLOB   SRCS "template-instances/fattn-vec*turbo2_0*.cu")
-        list(APPEND GGML_SOURCES_CUDA ${SRCS})
-        file(GLOB   SRCS "template-instances/fattn-vec*turbo3_0*.cu")
-        list(APPEND GGML_SOURCES_CUDA ${SRCS})
-        file(GLOB   SRCS "template-instances/fattn-vec*turbo*_tcq*.cu")
-        list(APPEND GGML_SOURCES_CUDA ${SRCS})
-        file(GLOB   SRCS "template-instances/fattn-vec*turbo4_0*.cu")
-        list(APPEND GGML_SOURCES_CUDA ${SRCS})
+        set(GGML_CUDA_FA_DEFAULT_KV_PAIRS
+            f16:f16
+            q4_0:q4_0
+            q8_0:q8_0
+            bf16:bf16
+            turbo2_0:turbo2_0
+            turbo3_0:turbo3_0
+            turbo4_0:turbo4_0
+            turbo3_tcq:turbo3_tcq
+            turbo2_tcq:turbo2_tcq
+            turbo2_0:q8_0
+            turbo3_0:q8_0
+            turbo4_0:q8_0
+            q8_0:turbo2_0
+            q8_0:turbo3_0
+            q8_0:turbo4_0
+            turbo4_0:turbo3_0
+            turbo3_0:turbo4_0
+            turbo2_0:turbo3_0
+            turbo3_0:turbo2_0
+            turbo3_tcq:q8_0
+            turbo2_tcq:q8_0
+            q8_0:turbo3_tcq
+            q8_0:turbo2_tcq
+            turbo3_tcq:turbo2_tcq
+            turbo2_tcq:turbo3_tcq
+            turbo4_0:turbo3_tcq
+            turbo3_0:turbo3_tcq
+        )
+
+        foreach(PAIR ${GGML_CUDA_FA_DEFAULT_KV_PAIRS})
+            string(REPLACE ":" ";" PARTS "${PAIR}")
+            list(GET PARTS 0 K_TYPE)
+            list(GET PARTS 1 V_TYPE)
+            ggml_add_fattn_vec_pair(GGML_SOURCES_CUDA "${FA_VEC_PREFIX}" "${K_TYPE}" "${V_TYPE}")
+        endforeach()
     endif()
 
     ggml_add_backend_library(ggml-cuda
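
Review note: the pair counts quoted in the docs (91 of 169, roughly 46% fewer) can be checked against the nested `foreach` in the `GGML_CUDA_FA_HALF_QUANTS` branch above. The Python sketch below is only an illustration of that loop, not part of the build. The type list is copied from `GGML_CUDA_FA_KV_TYPES_ORDERED`; the helper names `half_quant_pairs` and `instance_file` are hypothetical and exist only for this check.

```python
# Ordered from higher to lower precision, copied from GGML_CUDA_FA_KV_TYPES_ORDERED.
KV_TYPES_ORDERED = [
    "f16", "bf16", "q8_0", "q6_0", "q5_1", "q5_0", "turbo4_0",
    "q4_1", "q4_0", "turbo3_tcq", "turbo3_0", "turbo2_tcq", "turbo2_0",
]


def half_quant_pairs(types):
    """Mirror the nested foreach: the V index starts at the K index,
    so K is never ranked below V in the precision ordering."""
    return [(k, v) for ki, k in enumerate(types) for v in types[ki:]]


def instance_file(k_type, v_type, prefix="template-instances"):
    """Build the path that ggml_add_fattn_vec_pair checks for existence."""
    return f"{prefix}/fattn-vec-instance-{k_type}-{v_type}.cu"


half_pairs = half_quant_pairs(KV_TYPES_ORDERED)
all_pair_count = len(KV_TYPES_ORDERED) ** 2
reduction_pct = 100 * (all_pair_count - len(half_pairs)) / all_pair_count

print(f"half: {len(half_pairs)} pairs, all: {all_pair_count} pairs, "
      f"reduction: {reduction_pct:.1f}%")
print(instance_file(*half_pairs[0]))
```

With 13 types the half matrix has 13 * 14 / 2 = 91 entries including the diagonal, which matches the figure in `docs/build.md`. The same enumeration also shows that a reversed pair such as K=`turbo3_0`, V=`q8_0` is not generated by this branch, while K=`q8_0`, V=`turbo3_0` is.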