fix: gate turbo V unpad on V type, not K type (#42)

TheTom · claude · TheTom · commit 009acc30615e · 2026-04-18T17:31:55.000-05:00
When using asymmetric KV (-ctk q8_0 -ctv turbo4), the V unpad code
was gated on k->type being turbo. Since K is q8_0, the unpad was
skipped even when V was turbo and padded to 128. This caused a shape
mismatch at the wo matmul (ggml_can_mul_mat assertion) for models
with non-128-aligned head_dim (e.g., GPT-OSS-120B with head_dim=64,
openai_moe_iswa architecture).

Fix: check v->type instead of k->type for V unpad blocks in both
build_attn overloads. Q rotation remains correctly gated on k->type.
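
Illustration (not part of the patch): a minimal standalone sketch of the
gating logic, to show why the asymmetric case broke. The cache_type enum and
the helpers is_turbo, padded_head_dim and out_head_dim are hypothetical
stand-ins, not ggml or llama.cpp APIs, and rounding up to a multiple of 128
is an assumption based on the "padded to 128" wording above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the ggml type enum; only the names echo the diff.
enum class cache_type { q8_0, turbo2_0, turbo3_0, turbo4_0 };

constexpr bool is_turbo(cache_type t) {
    return t == cache_type::turbo2_0 || t == cache_type::turbo3_0 || t == cache_type::turbo4_0;
}

// Assumption: a turbo V cache pads head_dim up to the next multiple of 128.
constexpr int64_t padded_head_dim(cache_type v, int64_t head_dim) {
    return is_turbo(v) ? ((head_dim + 127) / 128) * 128 : head_dim;
}

// Per-head width of the attention output that reaches the wo matmul.
// `gate` is the type the unpad check looks at: K before the fix, V after it.
constexpr int64_t out_head_dim(cache_type gate, cache_type v, int64_t head_dim) {
    return (is_turbo(gate) && padded_head_dim(v, head_dim) != head_dim)
        ? head_dim                      // unpad ran: original width restored
        : padded_head_dim(v, head_dim); // unpad skipped: width stays as stored
}

// Asymmetric config from the report: K = q8_0, V = turbo4, head_dim = 64.
constexpr cache_type k = cache_type::q8_0;
constexpr cache_type v = cache_type::turbo4_0;

static_assert(out_head_dim(k, v, 64) == 128, "old gate (K type): unpad skipped, width stays padded");
static_assert(out_head_dim(v, v, 64) == 64,  "new gate (V type): unpad runs, width restored");
```

With the old gate the output keeps the padded width of 128 while wo expects
64 per head, which is the mismatch described above. Gating on V restores 64.
When K and V are both turbo, the two gates agree.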

Reported-by: NigelTufnel12345
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
diff --git a/src/llama-graph.cpp b/src/llama-graph.cpp
@@ -2189,7 +2189,8 @@ ggml_tensor * llm_graph_context::build_attn(
 
     // TurboQuant: if V was padded, the output has padded dimensions.
     // Extract original V head_dim after inverse WHT (applied inside build_attn_mha).
-    if (k->type == GGML_TYPE_TURBO3_0 || k->type == GGML_TYPE_TURBO4_0 || k->type == GGML_TYPE_TURBO2_0) {
+    // NOTE: gate on v->type (not k->type) for asymmetric configs where K=q8_0 but V=turbo
+    if (v->type == GGML_TYPE_TURBO3_0 || v->type == GGML_TYPE_TURBO4_0 || v->type == GGML_TYPE_TURBO2_0) {
         const int64_t orig_v_head = hparams.n_embd_head_v(il);
         // cur is 2D: (n_embd_head * n_head, n_tokens) after build_attn_mha
         const int64_t padded_v_head = v->ne[0];
@@ -2415,7 +2416,8 @@ ggml_tensor * llm_graph_context::build_attn(
     cb(cur, "kqv_out", il);
 
     // TurboQuant: if V was padded, extract original V head_dim after inverse WHT
-    if (k->type == GGML_TYPE_TURBO3_0 || k->type == GGML_TYPE_TURBO4_0 || k->type == GGML_TYPE_TURBO2_0) {
+    // NOTE: gate on v->type (not k->type) for asymmetric configs where K=q8_0 but V=turbo
+    if (v->type == GGML_TYPE_TURBO3_0 || v->type == GGML_TYPE_TURBO4_0 || v->type == GGML_TYPE_TURBO2_0) {
         const int64_t orig_v_head = hparams.n_embd_head_v(il);
         const int64_t padded_v_head = v->ne[0];
         if (padded_v_head != orig_v_head) {