HIP/turbo3: graph-safe decode + inline-dequant TILE prefill on gfx1201 (RDNA4) by KaiFelixBennett · Pull Request #28 · AtomicBot-ai/atomic-llama-cpp-turboquant

KaiFelixBennett · 2026-06-14T11:13:42Z

What

Makes TurboQuant (turbo3) KV cache fully usable on ROCm / HIP graphs on gfx1201 (RDNA4) — both prefill and decode — and brings turbo3 prefill up to roughly f16 speed.

Measured on a Radeon AI PRO R9700 (gfx1201, 32 GB), Windows 11, HIP SDK 7.1, Gemma-4 Q4_K_M, HIP graphs ON. Nothing extrapolated.

Two coupled changes

1. Graph-safe decode (fixes the crash)

With HIP graphs on, turbo KV crashed on the first decode step:

FLASH_ATTN_EXT failed: operation not permitted when stream is capturing

Cause: launch_fattn's f16 dequant temp buffers (K_f16/V_f16) were allocated and freed with raw cudaMalloc/cudaFree, which is illegal while a stream is capturing a graph. Fix: allocation is now capture-aware. It uses the pool allocator while capturing or for small batches, and raw alloc+free for large eager prefill so VRAM stays bounded on the no-VMM card. Separately, decode (Q->ne[1] <= 2) routes to the graph-safe VEC kernel, which dequantizes inline and needs no temp buffer.

This is the same class of decode crash we fixed canonically upstream in TheTom#176 (merged, 7985f6b). This PR adapts that fix to this fork's newer base.
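The routing and allocation rules described above can be sketched as two small decision functions. This is a minimal illustration, not code from the PR: the enum names, the function names, and the `small_batch_max` threshold are all hypothetical (the PR text does not say where "small batch" ends), and the sketch assumes the KV cache is already a turbo type.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names throughout: illustrative only, not identifiers from the PR.
enum class FattnKernel { VecInlineDequant, TileInlineDequant, F16TempBuffer };
enum class TempAlloc   { None, Pool, RawAllocFree };

// Kernel routing for a turbo KV cache, following the PR description:
//   Q->ne[1] <= 2                      -> VEC kernel (inline dequant, graph-safe)
//   Q->ne[1] >= 3, turbo3, head_dim 256 -> inline-dequant TILE kernel
//   anything else                       -> existing path with an f16 temp buffer
inline FattnKernel pick_kernel(int64_t n_q_rows, bool is_turbo3_hd256) {
    if (n_q_rows <= 2) {
        return FattnKernel::VecInlineDequant;
    }
    return is_turbo3_hd256 ? FattnKernel::TileInlineDequant
                           : FattnKernel::F16TempBuffer;
}

// Temp-buffer allocation strategy. Only the f16 temp-buffer path allocates.
// While the stream is capturing, raw alloc/free is not allowed, so the pool
// is used; small batches also use the pool; large eager prefill uses raw
// alloc+free so the buffer is returned right after the step.
// small_batch_max is an assumed tuning parameter, not a value from the PR.
inline TempAlloc pick_temp_alloc(FattnKernel kernel, bool stream_capturing,
                                 int64_t n_q_rows, int64_t small_batch_max) {
    if (kernel != FattnKernel::F16TempBuffer) {
        return TempAlloc::None;
    }
    if (stream_capturing || n_q_rows <= small_batch_max) {
        return TempAlloc::Pool;
    }
    return TempAlloc::RawAllocFree;
}
```

With these rules, a single-token decode step never reaches an allocation at all, which is why it is safe to capture into a HIP graph.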

2. Inline-dequant TILE prefill (makes prefill fast)

The TILE/MMA path hardcoded need_f16_K/V = true and materialized the whole KV cache to an f16 temp buffer every step — a per-step O(KV) dequant tax that negates the 3-bit cache, so turbo3 prefill was stuck on the slow sequential VEC kernel. A new TILE path inline-dequantizes turbo3 K/V during the global→shared tile load (no f16 materialization); turbo3 head_dim=256 multi-row batches (Q->ne[1] >= 3, prefill + spec-verify) route to it.
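To make "inline dequant during the tile load" concrete, here is a CPU-side sketch of the idea. The packed layout is an assumption made purely for illustration: 8 three-bit indices in 3 bytes, one float scale per group, and a shared 8-entry codebook. The real turbo3 block format is not described in this PR text, and `Packed3Group`, `unpack3`, and `load_tile_inline` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// HYPOTHETICAL layout for illustration only (not the real turbo3 block):
// 8 three-bit indices packed little-endian into 3 bytes, plus one scale.
struct Packed3Group {
    uint8_t bits[3];
    float   scale;
};

// Extract the i-th 3-bit index (i in [0, 8)) from a packed group.
inline int unpack3(const Packed3Group & g, int i) {
    const uint32_t word = uint32_t(g.bits[0])
                        | (uint32_t(g.bits[1]) << 8)
                        | (uint32_t(g.bits[2]) << 16);
    return int((word >> (3 * i)) & 0x7u);
}

// Tile load with inline dequant: read packed groups from the cache ("global"
// memory on a GPU) and write dequantized values directly into the tile
// ("shared" memory on a GPU). Only the tile is ever held in expanded form;
// no full-size f16 copy of the KV cache is materialized.
inline void load_tile_inline(const Packed3Group * cache, size_t first_group,
                             size_t n_groups, const float (&codebook)[8],
                             float * tile) {
    for (size_t g = 0; g < n_groups; ++g) {
        const Packed3Group & src = cache[first_group + g];
        for (int i = 0; i < 8; ++i) {
            tile[g * 8 + i] = src.scale * codebook[unpack3(src, i)];
        }
    }
}
```

The point of the design is that the expanded data lives only as long as the tile does, so the per-step cost scales with the tiles actually consumed rather than with a separate pass over the whole cache.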

prefill test	turbo3 VEC, stock (t/s)	turbo3 TILE, this PR (t/s)	speedup	f16 reference (t/s)
pp512	1570	2187	1.39×	2174
pp2048	1038	1929	1.86×	2049
pp4096	1039	1752	1.69×	1984

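turbo3 prefill now matches f16 at pp512 and is within ~6–12% of it at pp2048 and pp4096. The speedup over the VEC path is larger at longer contexts (1.69–1.86× at pp2048/pp4096 versus 1.39× at pp512), which is the long-context regime where the 3-bit cache earns its keep.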
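The speedup and f16-gap figures follow directly from the throughput numbers in the table. A small sketch of the arithmetic, with hypothetical helper names:

```cpp
#include <cassert>
#include <cmath>

// Ratio of new throughput to old throughput (both in tokens/second).
inline double speedup(double new_tps, double old_tps) {
    return new_tps / old_tps;
}

// How far a throughput falls short of a reference, in percent.
// Positive means slower than the reference, negative means faster.
inline double gap_pct(double tps, double ref_tps) {
    return 100.0 * (ref_tps - tps) / ref_tps;
}
```

For example, `speedup(1929, 1038)` gives the pp2048 ratio, and `gap_pct(1752, 1984)` gives the pp4096 shortfall relative to f16.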

Correctness

test-backend-ops -o FLASH_ATTN_EXT, turbo3 hsk=256, nb ∈ {3,4,6 (verify), 64,128,256 (prefill)}, kv ∈ {512,1024}: 12/12 OK (NMSE within tolerance vs the CPU reference).
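For readers unfamiliar with the metric, the sketch below shows one common definition of NMSE (squared error normalized by the energy of the reference) and how the 12 cases arise from the grid of batch sizes and KV lengths. The exact formula and tolerance used by test-backend-ops are not stated in this PR, so treat the definition as an assumption. The function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Normalized mean squared error of `out` against a reference `ref`:
// sum((ref - out)^2) / sum(ref^2). Assumes equal sizes and a non-zero reference.
inline double nmse(const std::vector<float> & ref, const std::vector<float> & out) {
    double err = 0.0;
    double energy = 0.0;
    for (size_t i = 0; i < ref.size(); ++i) {
        const double d = double(ref[i]) - double(out[i]);
        err    += d * d;
        energy += double(ref[i]) * double(ref[i]);
    }
    return err / energy;
}

// Number of test cases in a full grid of batch sizes (nb) and KV lengths (kv).
inline size_t grid_cases(const std::vector<int> & nb, const std::vector<int> & kv) {
    return nb.size() * kv.size();
}
```

Six batch sizes times two KV lengths gives the 12 cases reported above.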

Scope / honesty

Validated for turbo3/turbo3 at head_dim=256 (Gemma-4 family). Other dims/types keep the f16 path.
FlashAttention path only. It does not make self-speculative MTP beat baseline decode on RDNA4 (that wall is GEMM weight-load amortization, not attention).
Full methodology, raw data, correctness logs and a one-command gfx1201 build: https://github.com/KaiFelixBennett/gemma4-turboquant-rdna4

…1 (RDNA4)

Make TurboQuant (turbo3) KV cache usable on ROCm with HIP graphs on gfx1201, both prefill and decode.

1) Graph-safe decode. launch_fattn's f16 dequant temp buffers (K_f16/V_f16) used raw cudaMalloc/cudaFree during graph capture, which is illegal and crashed decode on the first step ("operation not permitted when stream is capturing"). Allocation is now capture-aware: pool alloc while capturing / for small batches, raw alloc+free for large eager prefill (keeps VRAM bounded on the no-VMM card). Decode (Q->ne[1] <= 2) routes to the graph-safe VEC kernel (inline dequant, no temp buffer). Same class of decode crash fixed canonically upstream in TheTom#176 (merged, 7985f6b); this adapts it to the newer base.

2) Inline-dequant TILE prefill. The TILE/MMA path hardcoded need_f16_K/V=true and materialized the whole KV cache to an f16 temp buffer every step. A new TILE path inline-dequantizes turbo3 K/V during the global->shared tile load (no f16 materialization); turbo3 head_dim=256 multi-row batches (Q->ne[1] >= 3, prefill + spec-verify) route to it. Prefill 1.39x/1.86x/1.69x faster (pp512/2048/4096), within ~6-12% of f16 at pp2048/4096 and on par at pp512.

Correctness: test-backend-ops -o FLASH_ATTN_EXT, turbo3 hsk=256, nb in {3,4,6,64,128,256}, kv in {512,1024}: 12/12 OK (NMSE within tol vs CPU ref).

Scope: validated for turbo3/turbo3 head_dim=256 (Gemma-4 family); other dims/types keep the f16 path.

…icBot-ai#28

Key findings from Dejan.ai, unixsysdev, and mudler:
1. QJL naively added back destroys quality (cosine 0.69)
2. Pre-rotate queries eliminates rotation from dequant path
3. WHT abandoned by everyone; dense QR or no rotation preferred
4. unixsysdev gets -0.8% speed loss with fused CUDA kernel
5. We're the only Metal implementation

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

… (issue AtomicBot-ai#28) The block-size divisibility check in llama-context.cpp rejected turbo4 on GLM-4.7 Flash (head_dim=576, QK_TURBO4=128, 576%128≠0) before the KV cache zero-padding code could run. Fix: for turbo types, compute the padded head_dim (ceil to 128) before the divisibility check, matching what llama-kv-cache.cpp actually does. Tested: GLM-4.7 Flash turbo4 loads and runs at 193 t/s. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…-ai#28 follow-up) state_write_data and state_read_data used hparams.n_embd_k_gqa (576) for ggml_row_size, but turbo types zero-pad to 640. For turbo4 (QK=128), 576 % 128 != 0 → ggml_row_size assertion failure during prompt cache save on llama-server slot reuse. Fix: use k->ne[0] / v->ne[0] (actual padded tensor width) instead of hparams values in all four serialization paths (K write, K read, V write, V read). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
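The padding arithmetic behind these two commits can be checked with a short sketch. `pad_to_multiple` is a hypothetical helper name, not a function from the codebase; the values 576, 128, and 640 are the ones quoted in the commit messages.

```cpp
#include <cassert>
#include <cstdint>

// Round n up to the next multiple of block (block > 0).
inline int64_t pad_to_multiple(int64_t n, int64_t block) {
    return (n + block - 1) / block * block;
}

// A row width is valid for a block-quantized type only if it is a whole
// number of blocks.
inline bool divides_evenly(int64_t width, int64_t block) {
    return width % block == 0;
}
```

With head_dim=576 and a block size of 128, the unpadded width fails the divisibility check while the padded width of 640 passes, which is why both the context check and the serialization paths need to use the padded tensor width.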

Systematic check of all ROADMAP items against the fork code:
- AtomicBot-ai#11 EAGLE-3 (PR ggml-org#18039): already integrated (commit 57774253c)
- AtomicBot-ai#12 Coopmat2 (PR ggml-org#19075): already integrated (flash_attn_cm2.comp, SPV generated)
- AtomicBot-ai#20 Tensor Parallelism (PR ggml-org#19378): already integrated (commit d850df3)
- AtomicBot-ai#28 Adaptive MTP (PR ggml-org#22931): the fork has its own implementation (LLAMA_MTP_SKIP_STREAK_THRESHOLD); PR closed/incompatible

Milestone status:
- M1: ✅ done
- M2: ✅ evaluated (AtomicBot-ai#6✅, AtomicBot-ai#7✅, AtomicBot-ai#12✅ already integrated, AtomicBot-ai#9❌, AtomicBot-ai#10❌)
- M3: ⏳ partial (AtomicBot-ai#3✅, AtomicBot-ai#13❌, AtomicBot-ai#14⏭️, AtomicBot-ai#15 open)
- M4: ✅ done (AtomicBot-ai#11✅, AtomicBot-ai#28✅ own implementation)
- M5: ⏳ partial (AtomicBot-ai#12✅, AtomicBot-ai#20✅ already integrated, AtomicBot-ai#21 open)
- M6: ☐ open (Tier 4 research)

AGENTS.md:
- Styx context 224k→196k in table + heading (since 2026-07-19)
- Services table: added Venus + Uranus, replaced outdated script names
- EA Phase 3 (llama-expected-attention.cpp) added to the key files
- Qwen NextN reference corrected (qwen35.cpp is Dense/MoE, not NextN)
- GPU date 8.7→19.7, 188k-cliff note 224k→256k

common/AGENTS.md:
- NextN reference qwen35-nextn.cpp→qwen3next.cpp (the files did not exist)

ROADMAP.md:
- M3 status: item AtomicBot-ai#6 was listed by mistake (it belongs to M2)
- 5 dead plan links removed (features were integrated directly, no plans were created)
- AtomicBot-ai#28 Adaptive MTP: ✅→⏭️ (docs exist, code implementation missing)
- M4 status updated accordingly
- TheTom#88 MTP+TP test: ☐→⏭️ (Uranus blocked by training)

…st fixed

Re-apply LLAMA_MTP_SKIP_STREAK_THRESHOLD in common/speculative.cpp. The skip-streak mechanism was fully implemented in commit 88bd4f0 (2026-06-23), was lost in the AtomicBot sync squash (394963e), and was not re-applied in the MTP 0% fix (4cff93d). The docs still described the feature and ROADMAP AtomicBot-ai#28 was marked ✅, but the code was missing.

Re-applied onto the new common_speculative_impl_draft_mtp struct (multi-seq refactor): member variables, getenv in the constructor, mtp_would_skip_next_draft() helper, skip check + streak update in draft(), reset in begin(). Build verified (llama-common + llama-server). ROADMAP AtomicBot-ai#28 → ✅, M4 → ✅.

github-actions Bot added testing ggml CUDA labels Jun 14, 2026

KaiFelixBennett wants to merge 1 commit into AtomicBot-ai:feature/turboquant-kv-cache from KaiFelixBennett:feat/turbo3-rocm-graphsafe-and-inline-tile-prefill
