
Commit 59798f1

TheTom and Claude authored
fix(cuda): allow f16/bf16 + q8_0 mixed KV without GGML_CUDA_FA_ALL_QUANTS (#82)
The FA dispatcher rejected any K != V type combo unless both types were in the turbo+q8_0 set. This meant common configs like `-ctk f16 -ctv q8_0` fell back to CPU unless built with `-DGGML_CUDA_FA_ALL_QUANTS=ON`.

The vec template instances for f16/bf16 + q8_0 are already compiled (fattn-vec-instance-{f16,bf16}-q8_0.cu and their reverses), so the dispatcher was gating kernels that do exist. Extend the predicate to include f16 and bf16 alongside turbo + q8_0.

Reported by @dentity007 on sm_89 (RTX 4090) and sm_121 (GB10), where `-ctk f16 -ctv q8_0` showed a 340x slowdown indicative of CPU fallback.

Co-Authored-By: tturney@psyguard.ai
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 1073622 · commit 59798f1

1 file changed

Lines changed: 7 additions & 4 deletions

File tree

ggml/src/ggml-cuda/fattn.cu

```diff
@@ -419,11 +419,14 @@ static best_fattn_kernel ggml_cuda_get_best_fattn_kernel(const int device, const

 #ifndef GGML_CUDA_FA_ALL_QUANTS
     if (K->type != V->type) {
-        // Allow mixed turbo KV types (any combination of turbo2, turbo3, q8_0)
-        auto is_turbo = [](ggml_type t) {
-            return t == GGML_TYPE_TURBO2_0 || t == GGML_TYPE_TURBO3_0 || t == GGML_TYPE_TURBO4_0 || t == GGML_TYPE_Q8_0;
+        // Allow mixed KV types for combinations that have FA template instances compiled in:
+        // - turbo2/3/4 + q8_0 (turbo cache work)
+        // - f16/bf16 + q8_0 (common K=f16, V=q8_0 setup)
+        auto is_kv_compat = [](ggml_type t) {
+            return t == GGML_TYPE_TURBO2_0 || t == GGML_TYPE_TURBO3_0 || t == GGML_TYPE_TURBO4_0
+                || t == GGML_TYPE_Q8_0 || t == GGML_TYPE_F16 || t == GGML_TYPE_BF16;
         };
-        if (!is_turbo(K->type) || !is_turbo(V->type)) {
+        if (!is_kv_compat(K->type) || !is_kv_compat(V->type)) {
             return BEST_FATTN_KERNEL_NONE;
         }
     }
```

0 commit comments
