Commit e1cb817

memory: respect unified KV cache in hybrid memory for eval tasks (ggml-org#21224)
The hybrid memory paths (`llama-memory-hybrid.cpp` and `llama-memory-hybrid-iswa.cpp`) always used a sequential equal split, ignoring the unified KV cache flag. This caused hellaswag, winogrande, and multiple-choice evaluations to fail on hybrid models (models with both attention and recurrent/SSM layers, such as Qwen3.5-35B-A3B) with:

    split_equal: sequential split is not supported when there are coupled sequences in the input batch (you may need to use the -kvu flag)

PR ggml-org#19954 fixed this for `llama-kv-cache-iswa.cpp` by automatically enabling unified KV mode and setting n_parallel >= 4 for multi-choice eval tasks. However, the hybrid memory paths were not updated.

This commit mirrors the iswa fix: use a non-sequential split when the KV cache is unified (n_stream == 1), which llama-perplexity sets automatically for hellaswag/winogrande/multiple-choice since ggml-org#19954.

Tested on Qwen3.5-35B-A3B (hybrid attention+SSM MoE model):

- HellaSwag: 83.0% (400 tasks)
- Winogrande: 74.5% (400 tasks)
- MMLU: 41.2%
- ARC-Challenge: 56.2%
- TruthfulQA: 37.7%

All of these previously failed with a llama_decode() error.
1 parent: 88d5f8f

2 files changed: 6 additions & 6 deletions

src/llama-memory-hybrid-iswa.cpp

Lines changed: 3 additions & 3 deletions
@@ -73,9 +73,9 @@ llama_memory_context_ptr llama_memory_hybrid_iswa::init_batch(llama_batch_allocr
             // if all tokens are output, split by sequence
             ubatch = balloc.split_seq(n_ubatch);
         } else {
-            // TODO: non-sequential equal split can be done if using unified KV cache
-            // for simplicity, we always use sequential equal split for now
-            ubatch = balloc.split_equal(n_ubatch, true);
+            // Use non-sequential split when KV cache is unified (needed for hellaswag/winogrande/multiple-choice)
+            const bool unified = (mem_attn->get_base()->get_n_stream() == 1);
+            ubatch = balloc.split_equal(n_ubatch, !unified);
         }

         if (ubatch.n_tokens == 0) {

src/llama-memory-hybrid.cpp

Lines changed: 3 additions & 3 deletions
@@ -73,9 +73,9 @@ llama_memory_context_ptr llama_memory_hybrid::init_batch(llama_batch_allocr & ba
             // if all tokens are output, split by sequence
             ubatch = balloc.split_seq(n_ubatch);
         } else {
-            // TODO: non-sequential equal split can be done if using unified KV cache
-            // for simplicity, we always use sequential equal split for now
-            ubatch = balloc.split_equal(n_ubatch, true);
+            // Use non-sequential split when KV cache is unified (needed for hellaswag/winogrande/multiple-choice)
+            const bool unified = (mem_attn->get_n_stream() == 1);
+            ubatch = balloc.split_equal(n_ubatch, !unified);
         }

         if (ubatch.n_tokens == 0) {
