kv-cache : disable attn_rot_k/v when SPLIT_MODE_TENSOR is active

yrougy · claude · yrougy · commit dd42d51f9618 · 2026-05-10T19:07:18.000+02:00
The Hadamard rotation path (ggml_mul_mat_aux) reshapes the K/V tensor
from [n_embd_head, n_head_kv, n_tokens] to a 2-D layout for matmul,
then restores it.  The split-axis tracker in ggml-backend-meta.cpp does
not follow this reshape correctly when the source split axis falls on
the collapsed dimension, producing an AXIS_1 tag on the values passed
to ggml_set_rows, which then trips an assertion.

Disabling attn_rot when tensor parallelism is in use sidesteps the
incompatibility without affecting model correctness: the Hadamard
pre-rotation is an orthogonal transform that only improves the
precision of a quantized K/V cache; it is not a model requirement.
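
As a rough illustration of why the rotation is not a model requirement,
the sketch below applies an orthonormal Walsh-Hadamard transform to two
vectors and checks that their dot product is unchanged up to float
rounding. The names hadamard_rotate and dot are hypothetical helpers
written for this note; they are not ggml functions and do not reflect
how ggml_mul_mat_aux is implemented.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of ggml): in-place fast Walsh-Hadamard
// transform, scaled by 1/sqrt(n) so the transform is orthonormal.
// The length n must be a power of two.
void hadamard_rotate(std::vector<float> & x) {
    const std::size_t n = x.size();
    assert(n > 0 && (n & (n - 1)) == 0);
    for (std::size_t len = 1; len < n; len <<= 1) {
        for (std::size_t i = 0; i < n; i += 2 * len) {
            for (std::size_t j = i; j < i + len; ++j) {
                const float a = x[j];
                const float b = x[j + len];
                x[j]       = a + b;  // butterfly: sum
                x[j + len] = a - b;  // butterfly: difference
            }
        }
    }
    const float scale = 1.0f / std::sqrt(static_cast<float>(n));
    for (float & v : x) {
        v *= scale;
    }
}

// Hypothetical helper: plain dot product of two equal-length vectors.
float dot(const std::vector<float> & a, const std::vector<float> & b) {
    assert(a.size() == b.size());
    float s = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        s += a[i] * b[i];
    }
    return s;
}
```

Rotating both q and k with hadamard_rotate leaves dot(q, k) unchanged
in exact arithmetic, and applying the rotation twice returns the
original vector. The rotation therefore changes only how values are
distributed before quantization, not the attention scores themselves.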

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp
@@ -289,12 +289,14 @@ llama_kv_cache::llama_kv_cache(
 
     attn_rot_k =
         !attn_rot_disable &&
+        model.split_mode() != LLAMA_SPLIT_MODE_TENSOR &&
         n_embd_head_k_all > 0 &&
         ggml_is_quantized(type_k) &&
         hparams.n_embd_head_k() % 64 == 0;
 
     attn_rot_v =
         !attn_rot_disable &&
+        model.split_mode() != LLAMA_SPLIT_MODE_TENSOR &&
         n_embd_head_v_all > 0 &&
         ggml_is_quantized(type_v) &&
         hparams.n_embd_head_v() % 64 == 0;