Eval bug: turboquant-kv-cache: crash in build_attn for GQA models with n_head != n_head_kv #78

@bingh0

Description

Name and Version

version: 8821 (45f8a06)
built with GNU 14.2.0 for Linux x86_64

Operating systems

Linux

GGML backends

CPU

Hardware

AMD Ryzen 5 Pro

Models

LFM2-24B-A2B-APEX-I-Mini.gguf (https://huggingface.co/mudler/LFM2-24B-A2B-APEX-GGUF)

Problem description & steps to reproduce

llama-cli -m LFM2-24B-A2B-APEX-I-Mini.gguf --cache-type-k turbo3 --cache-type-v turbo3 -p "hi"

Results in the crash message:
GGML_ASSERT(ggml_nelements(a) == ne0*ne1*ne2) failed
in ggml_reshape_3d, called from llm_graph_context::build_attn
(src/llama-graph.cpp, padded-V reshape block)

Models with n_head == n_head_kv load fine.

Root cause:
The padded-V reshape uses hparams.n_head_kv(il) as the head dimension. For GQA models, where n_head is not equal to n_head_kv, the resulting element count no longer matches the tensor, so ggml_reshape_3d asserts. Switching to hparams.n_head(il) fixes the reshape.

I verified the fix locally with LFM2 and with MHA models (gemma-4-e4b-it and gemma-4-e2b-it), using turboquant 3 and 4.

I can submit a fix if requested, but it's simple enough to change.

First Bad Commit

No response

Relevant log output

Loading model... |llama-cpp-turboquant/ggml/src/ggml.c:3656: GGML_ASSERT(ggml_nelements(a) == ne0*ne1*ne2) failed
  [New LWP 51579]
  [New LWP 51578]                                                                                                                   
                                                                                                                                    
  This GDB supports auto-downloading debuginfo from the following URLs:                                                             
    <https://debuginfod.ubuntu.com>                                                                                                 
  Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]                                              
  Debuginfod has been disabled.                                                                                                     
  To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.                                                     
  Function(s) ^std::(move|forward|as_const|(__)?addressof) will be skipped when stepping.                                           
  Function(s) ^std::(shared|unique)_ptr<.*>::(get|operator) will be skipped when stepping.                                          
  Function(s) ^std::(basic_string|vector|array|deque|(forward_)?list|(unordered_|flat_)?(multi)?(map|set)|span)<.*>::(c?r?(begin|end)|front|back|data|size|empty) will be skipped when stepping.
  Function(s) ^std::(basic_string|vector|array|deque|span)<.*>::operator.] will be skipped when stepping.                           
  [Thread debugging using libthread_db enabled]                                                                                     
  Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".                                                        
  __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56                                                 
  warning: 56    ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S: No such file or directory                                      
  #0  __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56                                             
  56    in ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S                                                                       
  #1  0x00007f3d1c09eb63 in __internal_syscall_cancel (a1=<optimized out>, a2=<optimized out>, a3=<optimized out>, a4=<optimized    
  out>, a5=0, a6=0, nr=61) at ./nptl/cancellation.c:49                                                                              
  warning: 49    ./nptl/cancellation.c: No such file or directory                                                                   
  #2  __syscall_cancel (a1=<optimized out>, a2=<optimized out>, a3=<optimized out>, a4=<optimized out>, a5=a5@entry=0,              
  a6=a6@entry=0, nr=61) at ./nptl/cancellation.c:75                                                                                 
  75    in ./nptl/cancellation.c                                                                                                    
  #3  0x00007f3d1c11ae9f in __GI___wait4 (pid=<optimized out>, stat_loc=<optimized out>, options=<optimized out>, usage=<optimized  
  out>) at ../sysdeps/unix/sysv/linux/wait4.c:30                                                                                    
  warning: 30    ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory                                                      
  #4  0x00007f3d1cc5c293 in ggml_print_backtrace () from llama-cpp-turboquant/build-cpu/bin/libggml-base.so.0   
  #5  0x00007f3d1cc5c43b in ggml_abort () from llama-cpp-turboquant/build-cpu/bin/libggml-base.so.0             
  #6  0x00007f3d1cc632cb in ggml_reshape_3d () from llama-cpp-turboquant/build-cpu/bin/libggml-base.so.0        
  #7  0x00007f3d1c9116b9 in llm_graph_context::build_attn(llm_graph_input_attn_kv*, ggml_tensor*, ggml_tensor*, ggml_tensor*,       
  ggml_tensor*, ggml_tensor*, ggml_tensor*, ggml_tensor*, ggml_tensor*, float, int) const () from                                   
  llama-cpp-turboquant/build-cpu/bin/libllama.so.0                                                              
  #8  0x00007f3d1ca77cfc in llm_build_lfm2<false>::llm_build_lfm2(llama_model const&, llm_graph_params                              
  const&)::{lambda(ggml_tensor*, ggml_tensor*, llm_graph_input_attn_kv*, int)#1}::operator()(ggml_tensor*, ggml_tensor*,            
  llm_graph_input_attn_kv*, int) const () from llama-cpp-turboquant/build-cpu/bin/libllama.so.0                 
  #9  0x00007f3d1ca7869e in llm_build_lfm2<false>::llm_build_lfm2(llama_model const&, llm_graph_params const&) () from              
  llama-cpp-turboquant/build-cpu/bin/libllama.so.0                                                              
  #10 0x00007f3d1c963948 in llama_model::build_graph(llm_graph_params const&) const () from                                         
  llama-cpp-turboquant/build-cpu/bin/libllama.so.0                                                              
  #11 0x00007f3d1c8d65ad in llama_context::graph_reserve(unsigned int, unsigned int, unsigned int, llama_memory_context_i const*,   
  bool, unsigned long*) () from llama-cpp-turboquant/build-cpu/bin/libllama.so.0                                
  #12 0x00007f3d1c8d8598 in llama_context::sched_reserve() () from                                                                  
  llama-cpp-turboquant/build-cpu/bin/libllama.so.0                                                              
  #13 0x00007f3d1c8d9ec9 in llama_context::llama_context(llama_model const&, llama_context_params) () from                          
  llama-cpp-turboquant/build-cpu/bin/libllama.so.0                                                              
  #14 0x00007f3d1c8dac9b in llama_init_from_model () from llama-cpp-turboquant/build-cpu/bin/libllama.so.0      
  #15 0x00007f3d1c8aec12 in llama_get_device_memory_data(char const*, llama_model_params const*, llama_context_params const*,       
  std::vector<ggml_backend_device*, std::allocator<ggml_backend_device*> >&, unsigned int&, unsigned int&, unsigned int&,           
  ggml_log_level) () from llama-cpp-turboquant/build-cpu/bin/libllama.so.0                                      
  #16 0x00007f3d1c8b0094 in llama_params_fit_impl(char const*, llama_model_params*, llama_context_params*, float*,                  
  llama_model_tensor_buft_override*, unsigned long*, unsigned int, ggml_log_level) () from                                          
  llama-cpp-turboquant/build-cpu/bin/libllama.so.0                                                              
  #17 0x00007f3d1c8b4472 in llama_params_fit () from llama-cpp-turboquant/build-cpu/bin/libllama.so.0           
  #18 0x000055912881a6fc in common_init_result::common_init_result(common_params&) ()                                               
  #19 0x000055912881ceca in common_init_from_params(common_params&) ()                                                              
  #20 0x000055912879af5c in server_context_impl::load_model(common_params const&) ()                                                
  #21 0x00005591286b3c7a in main ()                                                                                                 
  [Inferior 1 (process 51575) detached]                                                                                             
  Aborted (core dumped)
