
Commit 59798f1

TheTom and Claude authored
fix(cuda): allow f16/bf16 + q8_0 mixed KV without GGML_CUDA_FA_ALL_QUANTS (#82)
The FA dispatcher rejected any K != V type combo unless both types were in the turbo+q8_0 set. This meant common configs like `-ctk f16 -ctv q8_0` fell back to CPU unless built with `-DGGML_CUDA_FA_ALL_QUANTS=ON`.

The vec template instances for f16/bf16 + q8_0 are already compiled (fattn-vec-instance-{f16,bf16}-q8_0.cu and their reverses), so the dispatcher was gating kernels that do exist. Extend the predicate to include f16 and bf16 alongside turbo + q8_0.

Reported by @dentity007 on sm_89 (RTX 4090) and sm_121 (GB10), where `-ctk f16 -ctv q8_0` showed a 340x slowdown indicative of CPU fallback.

Co-Authored-By: tturney@psyguard.ai
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 1073622 · commit 59798f1

1 file changed

Lines changed: 7 additions & 4 deletions

File tree

ggml/src/ggml-cuda/fattn.cu

```diff
@@ -419,11 +419,14 @@ static best_fattn_kernel ggml_cuda_get_best_fattn_kernel(const int device, const

 #ifndef GGML_CUDA_FA_ALL_QUANTS
     if (K->type != V->type) {
-        // Allow mixed turbo KV types (any combination of turbo2, turbo3, q8_0)
-        auto is_turbo = [](ggml_type t) {
-            return t == GGML_TYPE_TURBO2_0 || t == GGML_TYPE_TURBO3_0 || t == GGML_TYPE_TURBO4_0 || t == GGML_TYPE_Q8_0;
+        // Allow mixed KV types for combinations that have FA template instances compiled in:
+        // - turbo2/3/4 + q8_0 (turbo cache work)
+        // - f16/bf16 + q8_0 (common K=f16, V=q8_0 setup)
+        auto is_kv_compat = [](ggml_type t) {
+            return t == GGML_TYPE_TURBO2_0 || t == GGML_TYPE_TURBO3_0 || t == GGML_TYPE_TURBO4_0
+                || t == GGML_TYPE_Q8_0 || t == GGML_TYPE_F16 || t == GGML_TYPE_BF16;
         };
-        if (!is_turbo(K->type) || !is_turbo(V->type)) {
+        if (!is_kv_compat(K->type) || !is_kv_compat(V->type)) {
             return BEST_FATTN_KERNEL_NONE;
         }
     }
```

0 commit comments
