Update CHANGELOG.md

Anbeeld · Anbeeld · commit 1a115e218bc4 · 2026-06-15T15:19:47.000+02:00
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -5,9 +5,11 @@
 - Merged latest upstream llama.cpp master. Notable inherited additions include Granite 4 Vision, Gemma 4 MTP (including E2B/E4B assistants), Gemma 4 unified-conversion / audio-projector / no-audio-encoder fixes, multimodal video input (with ffmpeg in the released image) and Qwen-VL frame merge, Mistral-Medium-3.5 conversion, the unified LFM2/LFM2.5 tool parser and `<think>` reasoning round-trip fix, speculative vocab-compatibility checks, a "placeholder bitmap" token-counting / `input_tokens` API, optional server prompt logging, kv-cache cell-sharing and copy-avoidance fixes, `GGML_OP_COL2IM_1D`, ggml bumped to 0.14.0, the CUDA 13.3 base image, HIP gfx1152/gfx1153 (RDNA3.5), and a broad set of WebGPU/Vulkan/SYCL/Metal/CPU backend and Web UI improvements.
 - Added experimental KVarN KV-cache compression, selected through `--cache-type-k`/`--cache-type-v` pseudo names `kvarn2|kvarn3|kvarn4|kvarn5|kvarn6|kvarn8` covering all nine 2/3/4/5/6/8 K/V bit-width combinations over 128-slice-compatible KV heads (larger heads are represented as multiple 128-dimensional pseudo-heads so the CPU/CUDA op ABI stays unchanged). KVarN is wired for Qwen3.6 and Gemma 4 through the normal memory architecture: hybrid and iSWA caches can each be stock KV or KVarN, Gemma-family reuse mappings are validated before sharing producer records, and the iSWA SWA side stays on the compact upstream cache. It is target-context only (DFlash draft and auxiliary contexts stay non-KVarN), validates unsupported placements at runtime, rejects group attention, and preserves normal-KV fallback. KVarN works with both unified KV and non-unified per-stream storage, so `--parallel > 1` slots do not share sink/tail groups; cross-stream `seq_cp` is queued and applied after base KV synchronization, and prompt-cache restore is gated on actual memory-stream safety. Non-KVarN layers (e.g. iSWA sliding-window layers) fall back to a bit-width-matched standard cache type (`kvarn2`→`q2_0`, `kvarn3`→`q3_0`, `kvarn4`→`q4_0`, `kvarn5`→`q5_0`, `kvarn6`→`q6_0`, `kvarn8`→`q8_0`) instead of full `f16`, keeping `kvarn4/kvarn4` total VRAM at or just below the `q4_0/q4_0` baseline (the prior `f16` fallback cost ~860 MiB extra on Gemma 4 31B at 102k context). Implementation includes CPU and CUDA store/materialize ops with precomputed live-group counts, ROCm/HIP (low-shared-memory store path) and Vulkan backend support behind per-backend capability hooks, group-range state serialization that shrinks large-bit-width checkpoints (a kvarn5/5 18k checkpoint drops from ~1992 MiB to ~658 MiB), a side-effect-free `llama_memory_can_seq_rm` API used to keep prompt-cache rollback safe, `llama-bench` support that reports KVarN names, and regression coverage. Internal sink/Sinkhorn defaults come from the selected `kvarnN` type and are not exposed as CLI knobs.
 - Added new KV/cache types: `turbo4_tcq` (TCQ) plus plain quantized `q2_0`, `q2_1`, `q3_0`, `q3_1`, and `q6_1`, each selectable through `--cache-type-k`/`--cache-type-v`, with CPU and CUDA quantize/dequantize, MMQ/vec-dot, and FlashAttention vec coverage. The new low-bit `q2_0`/`q3_0` types also back KVarN's non-KVarN-layer fallback mapping.
-- Reworked the CUDA FlashAttention vec build modes. Added `GGML_CUDA_FA_HALF_QUANTS` as a mutually-exclusive alternative to `GGML_CUDA_FA_ALL_QUANTS`, and changed the no-flag default from the old Turbo-focused set to a 62-pair K≥V set focused on the recommended q cache types and KVarN fallback (fp cache capped at q5, q8/q6 capped at q4) — much faster to compile while covering the common configurations. The three modes now compile 62 (`DEFAULT`) / 208 (`HALF_QUANTS`: the K≥V half-matrix plus the K/f16 fallback pairs TurboQuant/TCQ dequant needs) / 361 (`ALL_QUANTS`: the full 19×19 K/V matrix) vec pairs. A runtime `ggml_cuda_fattn_pair_compiled` guard rejects uncompiled pairs, `GGML_CUDA_FA_ROUTE_DEBUG` adds env-gated routing diagnostics, and mixed classic-K + Turbo-V routing for Gemma-class D256/D512 now decodes the Turbo side to `f16` and goes through the normal MMA/tile selector instead of crashing on the vec path (the D≤128 mixed-Turbo vec fallback is preserved).
+- Reworked the CUDA FlashAttention vec build modes. Added `GGML_CUDA_FA_HALF_QUANTS` as a mutually-exclusive alternative to `GGML_CUDA_FA_ALL_QUANTS`, and changed the no-flag default from the old Turbo-focused set to a 62-pair K≥V set focused on the recommended q cache types and KVarN fallback (fp cache capped at q5, q8/q6 capped at q4) — much faster to compile while covering the common configurations. The three modes now compile 62 (`DEFAULT`) / 217 (`HALF_QUANTS`: the K≥V half-matrix plus the K/f16 fallback pairs TurboQuant/TCQ dequant needs) / 361 (`ALL_QUANTS`: the full 19×19 K/V matrix) vec pairs. A runtime `ggml_cuda_fattn_pair_compiled` guard catches uncompiled pairs — TurboQuant/TCQ decode is routed via its effective compiled types instead of aborting, and `GGML_CUDA_FA_IGNORE_UNCOMPILED_PAIRS=1` warns and continues for anything still missing — while `GGML_CUDA_FA_ROUTE_DEBUG` adds env-gated routing diagnostics, and mixed classic-K + Turbo-V routing for Gemma-class D256/D512 now decodes the Turbo side to `f16` and goes through the normal MMA/tile selector instead of crashing on the vec path (the D≤128 mixed-Turbo vec fallback is preserved).
+- Restored standard q/KVarN-cache CUDA FlashAttention vec decode speed by removing a Bee-specific sparse-V dequant skip (now kept only for TurboQuant/TCQ) and non-upstream launch bounds that had diverged from upstream and slowed q-cache decode.
 - Added DFlash tensor-split Meta placement support. Auto-detected DFlash draft loading keeps tensor-split placement when shared target output tensors live in a Meta buffer, instead of reloading the drafter onto one GPU and crashing during draft warmup; the previous `GGML_DFLASH_ALLOW_TENSOR_SPLIT` env-var gate is removed, and explicit incompatible `--spec-draft-device` placement fails closed with a direct diagnostic. Meta backend support spans TurboQuant WHT split-state propagation, public Meta device introspection for capture placement, Meta-backed hidden/prefill capture allocation, Meta-safe GPU ring readback and recurrent backup row copies, a full-logits drafter fallback when compact argmax/top-k cannot run on Meta output, and a backend-gated compact verifier.
 - Reduced DFlash recurrent backup-stream VRAM. Flat DFlash on recurrent targets now uses recurrent-only backup cells scoped to upstream MTP instead of allocating dead attention/KV backup streams or expanding target `n_ctx`, avoiding the Qwen recurrent-state snapshot memory multiplier that regressed decode speed and VRAM. The server probes GGUF architecture before context allocation and reserves recurrent backup streams only for DFlash targets that actually use recurrent or hybrid state, so attention-only targets such as Gemma no longer double their parallel stream reservation and inflate 32k-context VRAM. DDTree branch rollback stays on explicit attention backup streams.
+- Fixed a crash when DFlash is combined with another speculative type (e.g. `--spec-type dflash,ngram-mod`). DFlash is now detected order-insensitively across the `--spec-type` list, so recurrent backup-sequence allocation, draft-device selection, and rollback planning no longer depend on which type is listed last.
 - Fixed DFlash long-prompt prefill after full prompt reprocessing. DFlash now keeps the drafter on the target model's absolute prompt timeline when flushing only the final cross-attention window, fixing the drafter RoPE/mask misalignment that produced invalid reduced-logit drafts and Qwen3.6 garbage output after long prompts.
 - Capped the active context checkpoints cloned into a prompt-cache entry at the 3 newest fitting ones, so multi-GiB checkpoint lists are no longer copied into cached prompts (fixing a Gemma 4 DFlash follow-up crash/memory spike).
 - Bumped the SYCL compute runtime from 25.40 to 26.18 and IGC from v2.20.5 to v2.34.4 in the Intel Docker images, fixing a ~21% SYCL performance regression.