Summary
Symmetric q8_0/q8_0 KV cache fails to create context on Qwen3.5 9B (hybrid DeltaNet + attention architecture, head_dim=256, GQA 4:1). Error: main: error: failed to create context.
Asymmetric configs with q8_0 K work fine (q8_0/turbo3, q8_0/turbo2). Symmetric turbo configs all work (turbo3/turbo3, turbo2/turbo2, turbo4/turbo4, f16/f16). Only q8_0 V appears broken.
Environment
- Model: Qwen3.5-9B Q4_K_M (bartowski GGUF,
attention.key_length=256, attention.value_length=256, full_attention_interval=4, hybrid DeltaNet+attention)
- Hardware: RTX 3080 Ti (SM 8.6, 12GB VRAM)
- Build:
69d8e4be4 (feature/turboquant-kv-cache tip, May 4 2026)
- CMake flags:
-DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86 -DGGML_CUDA_FA_ALL_QUANTS=ON
Reproduction
# FAILS — q8_0/q8_0
llama-bench -m qwen3.5-9b-q4km.gguf -ctk q8_0 -ctv q8_0 -p 512 -n 128
# Error: main: error: failed to create context
# WORKS — q8_0 K + turbo V
llama-bench -m qwen3.5-9b-q4km.gguf -ctk q8_0 -ctv turbo3 -p 512 -n 128
# 3759 t/s pp512, 99.6 t/s tg128
# WORKS — turbo3/turbo3
llama-bench -m qwen3.5-9b-q4km.gguf -ctk turbo3 -ctv turbo3 -p 512 -n 128
# 3842 t/s pp512, 98.8 t/s tg128
# WORKS — f16/f16
llama-bench -m qwen3.5-9b-q4km.gguf -ctk f16 -ctv f16 -p 512 -n 128
# 3680 t/s pp512, 101.9 t/s tg128
Analysis
The model is a hybrid architecture (Qwen3.5 with full_attention_interval=4 — only every 4th layer has full attention, rest are DeltaNet SSM). Both K and V have head_dim=256.
Since q8_0/turbo3 works (q8_0 K is fine), the failure is specifically in q8_0 V handling on this architecture. Possibly related to #47 (head_dim=256 asymmetric corruption, now closed) or the hybrid memory path (llama-memory-hybrid.cpp).
Additional Context
All turbo types (turbo2, turbo3, turbo4) work correctly on this model and actually outperform f16 at long context (118% of f16 prefill at 32K) due to the hybrid architecture having only 8/32 attention layers.
Summary
Symmetric
q8_0/q8_0KV cache fails to create context on Qwen3.5 9B (hybrid DeltaNet + attention architecture,head_dim=256, GQA 4:1). Error:main: error: failed to create context.Asymmetric configs with q8_0 K work fine (
q8_0/turbo3,q8_0/turbo2). Symmetric turbo configs all work (turbo3/turbo3,turbo2/turbo2,turbo4/turbo4,f16/f16). Only q8_0 V appears broken.Environment
attention.key_length=256,attention.value_length=256,full_attention_interval=4, hybrid DeltaNet+attention)69d8e4be4(feature/turboquant-kv-cache tip, May 4 2026)-DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86 -DGGML_CUDA_FA_ALL_QUANTS=ONReproduction
Analysis
The model is a hybrid architecture (Qwen3.5 with
full_attention_interval=4— only every 4th layer has full attention, rest are DeltaNet SSM). Both K and V havehead_dim=256.Since
q8_0/turbo3works (q8_0 K is fine), the failure is specifically in q8_0 V handling on this architecture. Possibly related to #47 (head_dim=256 asymmetric corruption, now closed) or the hybrid memory path (llama-memory-hybrid.cpp).Additional Context
All turbo types (turbo2, turbo3, turbo4) work correctly on this model and actually outperform f16 at long context (118% of f16 prefill at 32K) due to the hybrid architecture having only 8/32 attention layers.