Update CHANGELOG.md

Anbeeld · Anbeeld · commit d9edb8d18ccd · 2026-06-13T19:02:06.000+02:00
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -2,33 +2,17 @@
 
 ## v0.3.2
 
-- Fixed KVarN VRAM overhead on models with non-KVarN KV layers. Selecting a `kvarnN` cache type previously forced `cache_type_k/v` to `f16`, so layers that cannot use structured KVarN records (e.g. iSWA sliding-window layers) stored their full KV cache uncompressed — on Gemma 4 31B at 102k context this cost ~860 MiB extra VRAM versus `q4_0/q4_0`. Non-KVarN layers now fall back to a bit-width-matched standard cache type (`kvarn2`→`q2_0`, `kvarn3`→`q3_0`, `kvarn4`→`q4_0`, `kvarn5`→`q5_0`, `kvarn6`→`q6_0`, `kvarn8`→`q8_0`), bringing `kvarn4/kvarn4` total VRAM slightly below the `q4_0/q4_0` baseline. The same mapping applies in `llama-bench`.
-- Added experimental KVarN KV-cache compression as cache-type pseudo names: `--cache-type-k kvarn2|kvarn3|kvarn4|kvarn5|kvarn6|kvarn8` and `--cache-type-v kvarn2|kvarn3|kvarn4|kvarn5|kvarn6|kvarn8`. KVarN supports all nine 2/3/4/5/6/8 K/V bit-width combinations over 128-slice-compatible KV heads, generalizing from exact 128-dimensional heads to larger heads represented as multiple 128-dimensional pseudo-heads. KVarN is kept to the target context, validates unsupported placements at runtime, and is wired for Qwen3.6 and Gemma4 paths. Internal defaults such as sink tokens and Sinkhorn iterations come from the selected `kvarnN` type; group-attention use with KVarN is rejected before runtime, iSWA keeps the SWA side on the compact upstream cache, and normal-KV fallback is preserved.
-- Implemented non-unified per-stream KVarN storage so `--parallel > 1` can use KVarN without sharing sink/tail groups across slots. KVarN cache state is written per stream, cross-stream `seq_cp` operations are queued and applied after base KV cache synchronization, and unified multi-sequence KVarN contexts are rejected. KVarN sequence shifts and non-CUDA GPU backends fail fast instead of silently producing invalid cache state.
-- Added CPU and CUDA GGML store/materialize ops for KVarN, a target-context KVarN KV cache wrapper, state serialization with group-range compression, cache memory accounting, and server `n_parallel` safety handling. KVarN cache types with >4 bit widths serialize only the used portion of records tensors (computed from `n_groups_used`), dropping a kvarn5/5 checkpoint at 18k tokens from ~1992 MiB to ~658 MiB. The wire format uses `uint64_t` for cross-platform size fields and `ggml_backend_tensor_memset` for zero-fill on restore.
-- Extended KVarN runtime support from exact 128-dimensional KV heads to 128-slice-compatible heads. Larger heads are represented as multiple 128-dimensional pseudo-heads so the existing KVarN CPU/CUDA op ABI stays unchanged. Hybrid attention memory can now be stock KV or KVarN, iSWA base/SWA caches can each be stock KV or KVarN, and Gemma-family reuse mappings are validated before sharing producer records.
-- Added KVarN ROCm/HIP and Vulkan backend support. HIP builds use a low-shared-memory store path for AMD-class shared-memory limits and `reinterpret_cast` for dynamic shared-memory kernel attributes. Vulkan KVarN store/materialize shaders are wired through pipeline support. Backend capability hooks replace the previous CUDA-name gate so KVarN fails closed per backend.
-- KVarN contexts are forced onto non-unified KV streams across CLI normalization, direct context initialization, runtime validation, server auto parallel selection, and DFlash recurrent-backup setup. Target-only KVarN parameters are cleared from DFlash draft contexts so draft setup stays independent.
-- Added a side-effect-free `llama_memory_can_seq_rm` API for constrained removal checks. Composite memories preflight range removal and short-circuit cell removal so base/SWA and recurrent/attention children do not diverge when one child cannot remove a range. KVarN prompt-cache reuse checks rollback safety before attempting suffix trims, and the server skips prompt checkpoints and falls back to full prefill when the trim would be unsafe.
-- `llama-bench` accepts KVarN cache-type names for benchmarking and reports KVarN names instead of the internal storage fallback type.
-- Added KVarN regression tests covering type parsing, tile layout, quantization, cache ops, parser validation, DFlash plumbing, ISWA splitting, backend fail-closed behavior, internal API separation, experimental preset warnings, two-stream round-trips, and architecture-gated backup streams.
-- Fixed DFlash recurrent rollback allocation and sizing for flat DFlash targets. Recurrent snapshots are now scoped to upstream MTP, and flat DFlash uses recurrent-only backup cells instead of allocating dead attention/KV backup streams. This preserves visible context without expanding target `n_ctx` and avoids the Qwen RS snapshot memory multiplier that regressed decode speed and VRAM. DDTree branch rollback stays on explicit attention backup streams, and recurrent-only sequence cleanup covers fallback backup cells.
-- Gated DFlash backup streams by target architecture. The server now probes GGUF architecture before context allocation and reserves recurrent backup streams only for DFlash targets that use recurrent or hybrid state, preventing Gemma and other attention-only targets from doubling effective parallel stream reservation and inflating VRAM on 32k-context configs.
-- Fixed DFlash long-prompt prefill after full prompt reprocessing. DFlash now keeps the drafter on the target model's absolute prompt timeline when flushing only the final cross-attention window, preventing invalid reduced-logit drafts and Qwen3.6 garbage output after long prompts. The profit controller also fills incomplete lower-depth probes before rotating so depths below max collect enough samples for confirmed adaptive switches.
-- Fixed DFlash + KVarN visible context sizing for recurrent Qwen3.6 targets. Bee's internal recurrent rollback sequence ids are now accounted for before target context allocation, so `-c` / `--ctx-size` keeps the configured per-slot capacity even when KVarN forces non-unified KV streams. Unified KV paths and `n_ctx=0` model-default sizing are left unchanged, preserving upstream visible-slot semantics.
-- Fixed the DFlash + KVarN crash class after prompt-cache reuse. Partial composite state now includes memory children that require exact restore state, and server checkpoint selection follows the upstream restore window instead of preflighting candidates against live memory.
-- Added DFlash tensor-split Meta placement support. Auto-detected DFlash draft loading now keeps tensor-split placement when shared target output tensors live in a Meta buffer, instead of reloading the drafter onto one GPU and crashing during draft warmup. The previous `GGML_DFLASH_ALLOW_TENSOR_SPLIT` env-var gate is removed. Explicit incompatible `--spec-draft-device` placement fails closed with a direct diagnostic. Meta backend support includes TurboQuant WHT split-state propagation, public Meta device introspection for capture placement, Meta-backed hidden/prefill capture allocation, Meta-safe GPU ring readback fallback, Meta-safe recurrent backup row copies, and a full-logits drafter fallback when compact argmax/top-k cannot run on Meta output. The compact DFlash verifier is gated by backend.
-- Fixed prompt checkpoint memory for SWA/hybrid prompt-cache paths. Active checkpoints are now controlled by the existing count and spacing knobs. Prompt-cache saves prune over-limit checkpoint state via an explicit `max_checkpoints` cap (at most 3), cache eviction may remove every oversized entry, and explicit slot clears release retained checkpoint buffers. This fixes the Gemma4 DFlash follow-up crash/memory spike where multi-GiB checkpoint lists were copied into prompt cache.
-- Fixed prompt checkpoint threshold for multimodal reuse so newly appended suffix tokens invalidate stale checkpoints instead of reusing them.
-- Fixed MTP prompt-cache reuse to keep a single boundary checkpoint even when regular mid-prompt checkpoints are disabled.
-- Added the `GGML_CUDA_FA_HALF_QUANTS` build mode as a mutually-exclusive alternative to `ALL_QUANTS`. The three FlashAttention vec-pair build modes now compile: `DEFAULT` (27 curated pairs, same as v0.3.1 default), `HALF_QUANTS` (103 pairs: the K≥V half-matrix plus all K/f16 fallback pairs needed by TurboQuant/TCQ dequant paths), and `ALL_QUANTS` (169 pairs: full 13×13 K/V matrix). `HALF_QUANTS` is now the recommended build flag; `ALL_QUANTS` remains the escape hatch. Runtime `ggml_cuda_fattn_pair_compiled` guard rejects uncompiled pairs, and `GGML_CUDA_FA_ROUTE_DEBUG` env-gated diagnostics are available for validation.
-- Fixed CUDA FlashAttention routing for mixed TurboQuant K/V types on Gemma-class head dimensions. Mixed classic K + Turbo V now decodes the Turbo side to f16 and routes through the normal MMA/tile selector with generic f16 K conversion instead of crashing on the vec path. D=512 mixed classic K + Turbo V is routed away from vec; the D≤128 mixed Turbo fallback is preserved. This also removes the old raw mixed-pair force-vec override.
-- Fixed CI ccache hit rate (three root causes: double-prefixed keys, truthy `append-timestamp: false`, and key prefix mismatch between restore and save). Replaced `ggml-org/ccache-action` with a custom `setup-ccache` composite action that installs ccache per platform, restores via `actions/cache/restore@v4`, and persists `CCACHE_DIR`/`CCACHE_MAXSIZE` across steps.
-- Bumped SYCL compute runtime from 25.40 to 26.18 and IGC from v2.20.5 to v2.34.4 to fix a ~21% SYCL performance regression.
-- Added `-j` to the Windows CPU build job and shortened Docker matrix job names.
-- Added `reinterpret_cast` for `hipFuncSetAttribute` in KVarN ROCm builds (HIP requires `const void*` but cannot implicitly convert kernel function pointers unlike native CUDA).
+- Merged latest upstream llama.cpp master. Notable inherited additions include Granite 4 Vision, Gemma 4 MTP (including E2B/E4B assistants), Gemma 4 unified-conversion / audio-projector / no-audio-encoder fixes, multimodal video input (with ffmpeg in the released image) and Qwen-VL frame merge, Mistral-Medium-3.5 conversion, the unified LFM2/LFM2.5 tool parser and `<think>` reasoning round-trip fix, speculative vocab-compatibility checks, a "placeholder bitmap" token-counting / `input_tokens` API, optional server prompt logging, kv-cache cell-sharing and copy-avoidance fixes, `GGML_OP_COL2IM_1D`, ggml bumped to 0.14.0, the CUDA 13.3 base image, HIP gfx1152/gfx1153 (RDNA3.5), and a broad set of WebGPU/Vulkan/SYCL/Metal/CPU backend and Web UI improvements.
+- Added experimental KVarN KV-cache compression, selected through `--cache-type-k`/`--cache-type-v` pseudo names `kvarn2|kvarn3|kvarn4|kvarn5|kvarn6|kvarn8` covering all nine 2/3/4/5/6/8 K/V bit-width combinations over 128-slice-compatible KV heads (larger heads are represented as multiple 128-dimensional pseudo-heads so the CPU/CUDA op ABI stays unchanged). KVarN is wired for Qwen3.6 and Gemma 4 through the normal memory architecture: hybrid and iSWA caches can each be stock KV or KVarN, Gemma-family reuse mappings are validated before sharing producer records, and the iSWA SWA side stays on the compact upstream cache. It is target-context only (DFlash draft and auxiliary contexts stay non-KVarN), validates unsupported placements at runtime, rejects group attention, and preserves normal-KV fallback. KVarN works with both unified KV and non-unified per-stream storage, so `--parallel > 1` slots do not share sink/tail groups; cross-stream `seq_cp` is queued and applied after base KV synchronization, and prompt-cache restore is gated on actual memory-stream safety. Non-KVarN layers (e.g. iSWA sliding-window layers) fall back to a bit-width-matched standard cache type (`kvarn2`→`q2_0`, `kvarn3`→`q3_0`, `kvarn4`→`q4_0`, `kvarn5`→`q5_0`, `kvarn6`→`q6_0`, `kvarn8`→`q8_0`) instead of full `f16`, keeping `kvarn4/kvarn4` total VRAM at or just below the `q4_0/q4_0` baseline (the prior `f16` fallback cost ~860 MiB extra on Gemma 4 31B at 102k context). Implementation includes CPU and CUDA store/materialize ops with precomputed live-group counts, ROCm/HIP (low-shared-memory store path) and Vulkan backend support behind per-backend capability hooks, group-range state serialization that shrinks large-bit-width checkpoints (a kvarn5/5 18k checkpoint drops from ~1992 MiB to ~658 MiB), a side-effect-free `llama_memory_can_seq_rm` API used to keep prompt-cache rollback safe, `llama-bench` support that reports KVarN names, and regression coverage. Internal sink/Sinkhorn defaults come from the selected `kvarnN` type and are not exposed as CLI knobs.
+- Added new KV/cache types: `turbo4_tcq` (TCQ) plus plain quantized `q2_0`, `q2_1`, `q3_0`, `q3_1`, and `q6_1`, each selectable through `--cache-type-k`/`--cache-type-v`, with CPU and CUDA quantize/dequantize, MMQ/vec-dot, and FlashAttention vec coverage. The new low-bit `q2_0`/`q3_0` types also back KVarN's non-KVarN-layer fallback mapping.
+- Reworked the CUDA FlashAttention vec build modes. Added `GGML_CUDA_FA_HALF_QUANTS` as a mutually-exclusive alternative to `GGML_CUDA_FA_ALL_QUANTS`, and changed the no-flag default from the old Turbo-focused set to a 62-pair K≥V set focused on the recommended q cache types and KVarN fallback (fp cache capped at q5, q8/q6 capped at q4) — much faster to compile while covering the common configurations. The three modes now compile 62 (`DEFAULT`) / 208 (`HALF_QUANTS`: the K≥V half-matrix plus the K/f16 fallback pairs TurboQuant/TCQ dequant needs) / 361 (`ALL_QUANTS`: the full 19×19 K/V matrix) vec pairs. A runtime `ggml_cuda_fattn_pair_compiled` guard rejects uncompiled pairs, `GGML_CUDA_FA_ROUTE_DEBUG` adds env-gated routing diagnostics, and mixed classic-K + Turbo-V routing for Gemma-class D256/D512 now decodes the Turbo side to `f16` and goes through the normal MMA/tile selector instead of crashing on the vec path (the D≤128 mixed-Turbo vec fallback is preserved).
+- Added DFlash tensor-split Meta placement support. Auto-detected DFlash draft loading keeps tensor-split placement when shared target output tensors live in a Meta buffer, instead of reloading the drafter onto one GPU and crashing during draft warmup; the previous `GGML_DFLASH_ALLOW_TENSOR_SPLIT` env-var gate is removed, and explicit incompatible `--spec-draft-device` placement fails closed with a direct diagnostic. Meta backend support spans TurboQuant WHT split-state propagation, public Meta device introspection for capture placement, Meta-backed hidden/prefill capture allocation, Meta-safe GPU ring readback and recurrent backup row copies, a full-logits drafter fallback when compact argmax/top-k cannot run on Meta output, and a backend-gated compact verifier.
+- Reduced DFlash recurrent backup-stream VRAM. Flat DFlash on recurrent targets now uses recurrent-only backup cells scoped to upstream MTP instead of allocating dead attention/KV backup streams or expanding target `n_ctx`, avoiding the Qwen recurrent-state snapshot memory multiplier that regressed decode speed and VRAM. The server probes GGUF architecture before context allocation and reserves recurrent backup streams only for DFlash targets that actually use recurrent or hybrid state, so attention-only targets such as Gemma no longer double their parallel stream reservation and inflate 32k-context VRAM. DDTree branch rollback stays on explicit attention backup streams.
+- Fixed DFlash long-prompt prefill after full prompt reprocessing. DFlash now keeps the drafter on the target model's absolute prompt timeline when flushing only the final cross-attention window, fixing the drafter RoPE/mask misalignment that produced invalid reduced-logit drafts and Qwen3.6 garbage output after long prompts.
+- Capped the active context checkpoints cloned into a prompt-cache entry at the 3 newest fitting ones, so multi-GiB checkpoint lists are no longer copied into cached prompts (fixing a Gemma 4 DFlash follow-up crash/memory spike).
+- Bumped the SYCL compute runtime from 25.40 to 26.18 and IGC from v2.20.5 to v2.34.4 in the Intel Docker images, fixing a ~21% SYCL performance regression.
 - Fixed HIP shuffle compatibility for ROCm builds.
-- Updated preview/release workflow maintenance for v0.3.2, including preview branch routing, asset download filtering for prebuilt packages and CUDA runtimes, release artifact recovery, preview retag cleanup, and avoiding cached release toolchains.
+- Reworked release CI: replaced `ggml-org/ccache-action` with custom `setup-ccache`/`save-rolling-ccache` composite actions to restore a working release-build ccache hit rate, route `v*` branch pushes as rolling preview builds while keeping stable `v*` tag releases main-owned, filter release-asset downloads to prebuilt packages and CUDA runtimes, and tidy Docker links and preview release notes.
 
 ## v0.3.1