CUDA: q8_0/q8_0 KV cache fails to create context on Qwen3.5 9B hybrid (head_dim=256) #130

@seanrasch

Summary

Symmetric q8_0/q8_0 KV cache fails to create context on Qwen3.5 9B (hybrid DeltaNet + attention architecture, head_dim=256, GQA 4:1). Error: main: error: failed to create context.

Asymmetric configs with q8_0 K work fine (q8_0/turbo3, q8_0/turbo2). Every other symmetric config tested also works (turbo3/turbo3, turbo2/turbo2, turbo4/turbo4, f16/f16). Only q8_0 V appears broken.

Environment

  • Model: Qwen3.5-9B Q4_K_M (bartowski GGUF, attention.key_length=256, attention.value_length=256, full_attention_interval=4, hybrid DeltaNet+attention)
  • Hardware: RTX 3080 Ti (SM 8.6, 12GB VRAM)
  • Build: 69d8e4be4 (feature/turboquant-kv-cache tip, May 4 2026)
  • CMake flags: -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86 -DGGML_CUDA_FA_ALL_QUANTS=ON

Reproduction

# FAILS — q8_0/q8_0
llama-bench -m qwen3.5-9b-q4km.gguf -ctk q8_0 -ctv q8_0 -p 512 -n 128
# Error: main: error: failed to create context

# WORKS — q8_0 K + turbo V
llama-bench -m qwen3.5-9b-q4km.gguf -ctk q8_0 -ctv turbo3 -p 512 -n 128
# 3759 t/s pp512, 99.6 t/s tg128

# WORKS — turbo3/turbo3
llama-bench -m qwen3.5-9b-q4km.gguf -ctk turbo3 -ctv turbo3 -p 512 -n 128
# 3842 t/s pp512, 98.8 t/s tg128

# WORKS — f16/f16
llama-bench -m qwen3.5-9b-q4km.gguf -ctk f16 -ctv f16 -p 512 -n 128
# 3680 t/s pp512, 101.9 t/s tg128

Analysis

The model is a hybrid architecture: Qwen3.5 with full_attention_interval=4, so only every 4th layer has full attention and the rest are DeltaNet SSM layers. Both K and V have head_dim=256.
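To make the layer and cache-size arithmetic concrete, here is a small shell sketch. The 32-layer total and the interval of 4 are taken from this report; the 34-bytes-per-32-values figure assumes the standard ggml q8_0 block layout (one fp16 scale plus 32 int8 values). The helper names are hypothetical, and the turbo types are left out because their block sizes are not stated here.

```shell
# Hypothetical helpers for the arithmetic in this report.

attn_layers() {
    # $1 = total layers, $2 = full_attention_interval
    echo $(( $1 / $2 ))
}

kv_row_bytes() {
    # Bytes needed to store one head's K (or V) row for one token.
    # $1 = cache type (f16 or q8_0), $2 = head_dim
    case "$1" in
        f16)  echo $(( $2 * 2 )) ;;          # 2 bytes per value
        q8_0) echo $(( $2 / 32 * 34 )) ;;    # assumed: 34-byte block per 32 values
        *)    echo "unsupported type: $1" >&2; return 1 ;;
    esac
}
```

With these, `attn_layers 32 4` gives the 8 attention layers mentioned in this report, and `kv_row_bytes q8_0 256` can be compared against `kv_row_bytes f16 256` for head_dim=256.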

Since q8_0/turbo3 works, q8_0 K on its own is fine, so the failure appears tied to q8_0 V on this architecture, or to the q8_0/q8_0 pairing specifically: none of the runs above pairs a non-q8_0 K with a q8_0 V. Possibly related to #47 (head_dim=256 asymmetric corruption, now closed) or the hybrid memory path (llama-memory-hybrid.cpp).
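One way to narrow this down is to sweep every K/V type pair and see which ones fail. The sketch below reuses the llama-bench flags from the reproduction commands. It assumes llama-bench exits non-zero when context creation fails. The `sweep_kv` function and the `BENCH` and `KV_TYPES` variables are hypothetical names, not part of llama.cpp.

```shell
# Sweep all K/V cache type pairs and report which ones fail to run.
# BENCH and KV_TYPES are overridable (hypothetical variable names).
BENCH="${BENCH:-llama-bench}"
KV_TYPES="${KV_TYPES:-f16 q8_0 turbo2 turbo3 turbo4}"

sweep_kv() {
    # $1 = path to the GGUF model
    model="$1"
    for k in $KV_TYPES; do
        for v in $KV_TYPES; do
            # Assumption: a non-zero exit status means the run failed
            # (e.g. "failed to create context").
            if "$BENCH" -m "$model" -ctk "$k" -ctv "$v" -p 512 -n 128 >/dev/null 2>&1; then
                echo "OK   $k/$v"
            else
                echo "FAIL $k/$v"
            fi
        done
    done
}
```

Running `sweep_kv qwen3.5-9b-q4km.gguf` prints one line per pair. If f16/q8_0 and turbo3/q8_0 also fail, the problem is in q8_0 V handling generally. If only q8_0/q8_0 fails, the pairing itself is the more likely culprit.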

Additional Context

All turbo types (turbo2, turbo3, turbo4) work correctly on this model and actually outperform f16 at long context (118% of f16 prefill at 32K) due to the hybrid architecture having only 8/32 attention layers.
