CUDA: q8_0/q8_0 KV cache fails to create context on Qwen3.5 9B hybrid (head_dim=256)

## Summary

Symmetric `q8_0/q8_0` KV cache fails to create context on Qwen3.5 9B (hybrid DeltaNet + attention architecture, `head_dim=256`, GQA 4:1). Error: `main: error: failed to create context`. 

Asymmetric configs with q8_0 K work fine (`q8_0/turbo3`, `q8_0/turbo2`). Symmetric turbo configs all work (`turbo3/turbo3`, `turbo2/turbo2`, `turbo4/turbo4`, `f16/f16`). Only q8_0 V appears broken.

## Environment

- **Model:** Qwen3.5-9B Q4_K_M (bartowski GGUF, `attention.key_length=256`, `attention.value_length=256`, `full_attention_interval=4`, hybrid DeltaNet+attention)
- **Hardware:** RTX 3080 Ti (SM 8.6, 12GB VRAM)
- **Build:** `69d8e4be4` (feature/turboquant-kv-cache tip, May 4 2026)
- **CMake flags:** `-DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86 -DGGML_CUDA_FA_ALL_QUANTS=ON`

## Reproduction

```bash
# FAILS — q8_0/q8_0
llama-bench -m qwen3.5-9b-q4km.gguf -ctk q8_0 -ctv q8_0 -p 512 -n 128
# Error: main: error: failed to create context

# WORKS — q8_0 K + turbo V
llama-bench -m qwen3.5-9b-q4km.gguf -ctk q8_0 -ctv turbo3 -p 512 -n 128
# 3759 t/s pp512, 99.6 t/s tg128

# WORKS — turbo3/turbo3
llama-bench -m qwen3.5-9b-q4km.gguf -ctk turbo3 -ctv turbo3 -p 512 -n 128
# 3842 t/s pp512, 98.8 t/s tg128

# WORKS — f16/f16
llama-bench -m qwen3.5-9b-q4km.gguf -ctk f16 -ctv f16 -p 512 -n 128
# 3680 t/s pp512, 101.9 t/s tg128
```

## Analysis

The model is a hybrid architecture (Qwen3.5 with `full_attention_interval=4` — only every 4th layer has full attention, rest are DeltaNet SSM). Both K and V have `head_dim=256`.

Since `q8_0/turbo3` works (q8_0 K is fine), the failure is specifically in q8_0 V handling on this architecture. Possibly related to #47 (head_dim=256 asymmetric corruption, now closed) or the hybrid memory path (`llama-memory-hybrid.cpp`).

## Additional Context

All turbo types (turbo2, turbo3, turbo4) work correctly on this model and actually **outperform f16 at long context** (118% of f16 prefill at 32K) due to the hybrid architecture having only 8/32 attention layers.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

CUDA: q8_0/q8_0 KV cache fails to create context on Qwen3.5 9B hybrid (head_dim=256) #130

Summary

Environment

Reproduction

Analysis

Additional Context

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Uh oh!

CUDA: q8_0/q8_0 KV cache fails to create context on Qwen3.5 9B hybrid (head_dim=256) #130

Description

Summary

Environment

Reproduction

Analysis

Additional Context

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions