Eval bug: turboquant-kv-cache: crash in build_attn for GQA models with n_head != n_head_kv #78

@bingh0

Description

Name and Version

version: 8821 (45f8a06)
built with GNU 14.2.0 for Linux x86_64

Operating systems

Linux

GGML backends

CPU

Hardware

AMD Ryzen 5 Pro

Models

LFM2-24B-A2B-APEX-I-Mini.gguf (https://huggingface.co/mudler/LFM2-24B-A2B-APEX-GGUF)

Problem description & steps to reproduce

llama-cli -m LFM2-24B-A2B-APEX-I-Mini.gguf --cache-type-k turbo3 --cache-type-v turbo3 -p "hi"

Results in the crash message:
GGML_ASSERT(ggml_nelements(a) == ne0*ne1*ne2) failed
in ggml_reshape_3d, called from llm_graph_context::build_attn
(src/llama-graph.cpp, padded-V reshape block)

Models with n_head == n_head_kv load fine.

Root cause:
The padded-V reshape uses hparams.n_head_kv(il) as the head dimension. For GQA models, where n_head is not equal to n_head_kv, the resulting element count no longer matches the tensor, so ggml_reshape_3d asserts. Switching to hparams.n_head(il) fixes the reshape.

I verified the fix locally with LFM2 and with MHA models (gemma-4-e4b-it and gemma-4-e2b-it), using turboquant 3 and 4.

I can submit a fix if requested, but it's simple enough to change.

First Bad Commit

No response

Relevant log output

Loading model... |llama-cpp-turboquant/ggml/src/ggml.c:3656: GGML_ASSERT(ggml_nelements(a) == ne0*ne1*ne2) failed
  [New LWP 51579]
  [New LWP 51578]                                                                                                                   
                                                                                                                                    
  This GDB supports auto-downloading debuginfo from the following URLs:                                                             
    <https://debuginfod.ubuntu.com>                                                                                                 
  Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]                                              
  Debuginfod has been disabled.                                                                                                     
  To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.                                                     
  Function(s) ^std::(move|forward|as_const|(__)?addressof) will be skipped when stepping.                                           
  Function(s) ^std::(shared|unique)_ptr<.*>::(get|operator) will be skipped when stepping.                                          
  Function(s) ^std::(basic_string|vector|array|deque|(forward_)?list|(unordered_|flat_)?(multi)?(map|set)|span)<.*>::(c?r?(begin|end)|front|back|data|size|empty) will be skipped when stepping.
  Function(s) ^std::(basic_string|vector|array|deque|span)<.*>::operator.] will be skipped when stepping.                           
  [Thread debugging using libthread_db enabled]                                                                                     
  Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".                                                        
  __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56                                                 
  warning: 56    ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S: No such file or directory                                      
  #0  __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56                                             
  56    in ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S                                                                       
  #1  0x00007f3d1c09eb63 in __internal_syscall_cancel (a1=<optimized out>, a2=<optimized out>, a3=<optimized out>, a4=<optimized    
  out>, a5=0, a6=0, nr=61) at ./nptl/cancellation.c:49                                                                              
  warning: 49    ./nptl/cancellation.c: No such file or directory                                                                   
  #2  __syscall_cancel (a1=<optimized out>, a2=<optimized out>, a3=<optimized out>, a4=<optimized out>, a5=a5@entry=0,              
  a6=a6@entry=0, nr=61) at ./nptl/cancellation.c:75                                                                                 
  75    in ./nptl/cancellation.c                                                                                                    
  #3  0x00007f3d1c11ae9f in __GI___wait4 (pid=<optimized out>, stat_loc=<optimized out>, options=<optimized out>, usage=<optimized  
  out>) at ../sysdeps/unix/sysv/linux/wait4.c:30                                                                                    
  warning: 30    ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory                                                      
  #4  0x00007f3d1cc5c293 in ggml_print_backtrace () from llama-cpp-turboquant/build-cpu/bin/libggml-base.so.0   
  #5  0x00007f3d1cc5c43b in ggml_abort () from llama-cpp-turboquant/build-cpu/bin/libggml-base.so.0             
  #6  0x00007f3d1cc632cb in ggml_reshape_3d () from llama-cpp-turboquant/build-cpu/bin/libggml-base.so.0        
  #7  0x00007f3d1c9116b9 in llm_graph_context::build_attn(llm_graph_input_attn_kv*, ggml_tensor*, ggml_tensor*, ggml_tensor*,       
  ggml_tensor*, ggml_tensor*, ggml_tensor*, ggml_tensor*, ggml_tensor*, float, int) const () from                                   
  llama-cpp-turboquant/build-cpu/bin/libllama.so.0                                                              
  #8  0x00007f3d1ca77cfc in llm_build_lfm2<false>::llm_build_lfm2(llama_model const&, llm_graph_params                              
  const&)::{lambda(ggml_tensor*, ggml_tensor*, llm_graph_input_attn_kv*, int)#1}::operator()(ggml_tensor*, ggml_tensor*,            
  llm_graph_input_attn_kv*, int) const () from llama-cpp-turboquant/build-cpu/bin/libllama.so.0                 
  #9  0x00007f3d1ca7869e in llm_build_lfm2<false>::llm_build_lfm2(llama_model const&, llm_graph_params const&) () from              
  llama-cpp-turboquant/build-cpu/bin/libllama.so.0                                                              
  #10 0x00007f3d1c963948 in llama_model::build_graph(llm_graph_params const&) const () from                                         
  llama-cpp-turboquant/build-cpu/bin/libllama.so.0                                                              
  #11 0x00007f3d1c8d65ad in llama_context::graph_reserve(unsigned int, unsigned int, unsigned int, llama_memory_context_i const*,   
  bool, unsigned long*) () from llama-cpp-turboquant/build-cpu/bin/libllama.so.0                                
  #12 0x00007f3d1c8d8598 in llama_context::sched_reserve() () from                                                                  
  llama-cpp-turboquant/build-cpu/bin/libllama.so.0                                                              
  #13 0x00007f3d1c8d9ec9 in llama_context::llama_context(llama_model const&, llama_context_params) () from                          
  llama-cpp-turboquant/build-cpu/bin/libllama.so.0                                                              
  #14 0x00007f3d1c8dac9b in llama_init_from_model () from llama-cpp-turboquant/build-cpu/bin/libllama.so.0      
  #15 0x00007f3d1c8aec12 in llama_get_device_memory_data(char const*, llama_model_params const*, llama_context_params const*,       
  std::vector<ggml_backend_device*, std::allocator<ggml_backend_device*> >&, unsigned int&, unsigned int&, unsigned int&,           
  ggml_log_level) () from llama-cpp-turboquant/build-cpu/bin/libllama.so.0                                      
  #16 0x00007f3d1c8b0094 in llama_params_fit_impl(char const*, llama_model_params*, llama_context_params*, float*,                  
  llama_model_tensor_buft_override*, unsigned long*, unsigned int, ggml_log_level) () from                                          
  llama-cpp-turboquant/build-cpu/bin/libllama.so.0                                                              
  #17 0x00007f3d1c8b4472 in llama_params_fit () from llama-cpp-turboquant/build-cpu/bin/libllama.so.0           
  #18 0x000055912881a6fc in common_init_result::common_init_result(common_params&) ()                                               
  #19 0x000055912881ceca in common_init_from_params(common_params&) ()                                                              
  #20 0x000055912879af5c in server_context_impl::load_model(common_params const&) ()                                                
  #21 0x00005591286b3c7a in main ()                                                                                                 
  [Inferior 1 (process 51575) detached]                                                                                             
  Aborted (core dumped)
