llama-cli -m LFM2-24B-A2B-APEX-I-Mini.gguf --cache-type-k turbo3 --cache-type-v turbo3 -p "hi"
Loading model... llama-cpp-turboquant/ggml/src/ggml.c:3656: GGML_ASSERT(ggml_nelements(a) == ne0*ne1*ne2) failed
[New LWP 51579]
[New LWP 51578]
This GDB supports auto-downloading debuginfo from the following URLs:
<https://debuginfod.ubuntu.com>
Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
Function(s) ^std::(move|forward|as_const|(__)?addressof) will be skipped when stepping.
Function(s) ^std::(shared|unique)_ptr<.*>::(get|operator) will be skipped when stepping.
Function(s) ^std::(basic_string|vector|array|deque|(forward_)?list|(unordered_|flat_)?(multi)?(map|set)|span)<.*>::(c?r?(begin|end)|front|back|data|size|empty) will be skipped when stepping.
Function(s) ^std::(basic_string|vector|array|deque|span)<.*>::operator.] will be skipped when stepping.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
__syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
warning: 56 ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S: No such file or directory
#0 __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
56 in ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S
#1 0x00007f3d1c09eb63 in __internal_syscall_cancel (a1=<optimized out>, a2=<optimized out>, a3=<optimized out>, a4=<optimized
out>, a5=0, a6=0, nr=61) at ./nptl/cancellation.c:49
warning: 49 ./nptl/cancellation.c: No such file or directory
#2 __syscall_cancel (a1=<optimized out>, a2=<optimized out>, a3=<optimized out>, a4=<optimized out>, a5=a5@entry=0,
a6=a6@entry=0, nr=61) at ./nptl/cancellation.c:75
75 in ./nptl/cancellation.c
#3 0x00007f3d1c11ae9f in __GI___wait4 (pid=<optimized out>, stat_loc=<optimized out>, options=<optimized out>, usage=<optimized
out>) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#4 0x00007f3d1cc5c293 in ggml_print_backtrace () from llama-cpp-turboquant/build-cpu/bin/libggml-base.so.0
#5 0x00007f3d1cc5c43b in ggml_abort () from llama-cpp-turboquant/build-cpu/bin/libggml-base.so.0
#6 0x00007f3d1cc632cb in ggml_reshape_3d () from llama-cpp-turboquant/build-cpu/bin/libggml-base.so.0
#7 0x00007f3d1c9116b9 in llm_graph_context::build_attn(llm_graph_input_attn_kv*, ggml_tensor*, ggml_tensor*, ggml_tensor*,
ggml_tensor*, ggml_tensor*, ggml_tensor*, ggml_tensor*, ggml_tensor*, float, int) const () from
llama-cpp-turboquant/build-cpu/bin/libllama.so.0
#8 0x00007f3d1ca77cfc in llm_build_lfm2<false>::llm_build_lfm2(llama_model const&, llm_graph_params
const&)::{lambda(ggml_tensor*, ggml_tensor*, llm_graph_input_attn_kv*, int)#1}::operator()(ggml_tensor*, ggml_tensor*,
llm_graph_input_attn_kv*, int) const () from llama-cpp-turboquant/build-cpu/bin/libllama.so.0
#9 0x00007f3d1ca7869e in llm_build_lfm2<false>::llm_build_lfm2(llama_model const&, llm_graph_params const&) () from
llama-cpp-turboquant/build-cpu/bin/libllama.so.0
#10 0x00007f3d1c963948 in llama_model::build_graph(llm_graph_params const&) const () from
llama-cpp-turboquant/build-cpu/bin/libllama.so.0
#11 0x00007f3d1c8d65ad in llama_context::graph_reserve(unsigned int, unsigned int, unsigned int, llama_memory_context_i const*,
bool, unsigned long*) () from llama-cpp-turboquant/build-cpu/bin/libllama.so.0
#12 0x00007f3d1c8d8598 in llama_context::sched_reserve() () from
llama-cpp-turboquant/build-cpu/bin/libllama.so.0
#13 0x00007f3d1c8d9ec9 in llama_context::llama_context(llama_model const&, llama_context_params) () from
llama-cpp-turboquant/build-cpu/bin/libllama.so.0
#14 0x00007f3d1c8dac9b in llama_init_from_model () from llama-cpp-turboquant/build-cpu/bin/libllama.so.0
#15 0x00007f3d1c8aec12 in llama_get_device_memory_data(char const*, llama_model_params const*, llama_context_params const*,
std::vector<ggml_backend_device*, std::allocator<ggml_backend_device*> >&, unsigned int&, unsigned int&, unsigned int&,
ggml_log_level) () from llama-cpp-turboquant/build-cpu/bin/libllama.so.0
#16 0x00007f3d1c8b0094 in llama_params_fit_impl(char const*, llama_model_params*, llama_context_params*, float*,
llama_model_tensor_buft_override*, unsigned long*, unsigned int, ggml_log_level) () from
llama-cpp-turboquant/build-cpu/bin/libllama.so.0
#17 0x00007f3d1c8b4472 in llama_params_fit () from llama-cpp-turboquant/build-cpu/bin/libllama.so.0
#18 0x000055912881a6fc in common_init_result::common_init_result(common_params&) ()
#19 0x000055912881ceca in common_init_from_params(common_params&) ()
#20 0x000055912879af5c in server_context_impl::load_model(common_params const&) ()
#21 0x00005591286b3c7a in main ()
[Inferior 1 (process 51575) detached]
Aborted (core dumped)
Name and Version
version: 8821 (45f8a06)
built with GNU 14.2.0 for Linux x86_64
Operating systems
Linux
GGML backends
CPU
Hardware
AMD Ryzen 5 Pro
Models
LFM2-24B-A2B-APEX-I-Mini.gguf (https://huggingface.co/mudler/LFM2-24B-A2B-APEX-GGUF)
Problem description & steps to reproduce
llama-cli -m LFM2-24B-A2B-APEX-I-Mini.gguf --cache-type-k turbo3 --cache-type-v turbo3 -p "hi"
Results in crash message:
GGML_ASSERT(ggml_nelements(a) == ne0*ne1*ne2) failed
in ggml_reshape_3d, called from llm_graph_context::build_attn
(src/llama-graph.cpp, padded-V reshape block)
Models with n_head == n_head_kv load fine.
Root cause:
hparams.n_head_kv(il) is used as the head dimension of the reshape, which fails for models where n_head differs from n_head_kv (i.e. GQA models). Switching to hparams.n_head(il) fixes the reshape.
I verified the fix locally with LFM2 and with MHA models (gemma-4-e4b-it and gemma-4-e2b-it), using turboquant 3 and 4.
I can submit a fix if requested, but it's simple enough to change.
First Bad Commit
No response
Relevant log output