Anbeeld
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 14 additions & 18 deletions
@@ -2,24 +2,20 @@
 
 ## v0.2.0
 
-- Added compatibility with upstream DFlash PR drafter GGUFs that use `general.architecture = dflash`. Bee now keeps this as a separate schema from `dflash-draft`, reads upstream metadata keys such as `dflash.block_size` and `dflash.target_layer_ids`, uses upstream tensor names such as `fc.weight`, `hidden_norm.weight`, and `blk.N.ffn_norm.weight`, and keeps existing Bee/buun `dflash-draft` tensor and metadata names unchanged.
-- Hardened recurrent memory resizing and prompt-cache restore paths. Recurrent resize now repairs tails, sequence IDs, source links, `used`, `head`, `n`, and recurrent-state size metadata after shrink/expand, while the server shrinks recurrent state before prompt-cache save/load when it is safe to do so.
-- Fixed DFlash backup-sequence cleanup and recurrent copy behavior. Server slots now track the backup sequence ID they created, clear leaked backup cells on release, and preserve the local DFlash recurrent copy-plan invalidation path after recurrent memory changes.
-- Added unified-KV admission deferral for non-parent server tasks so large pending prompts do not over-commit shared KV cells while active slots still own prompt or task tokens.
-- Added categorized DFlash profiling and diagnostics. `GGML_DFLASH_PROFILE` can enable summary, replay, copy, prefill, verify, and trace logging, and DFlash logs now report draft/verify/accept timing, graph reuse, reduced-verifier decisions, stream-copy behavior, and contract details.
-- Improved DFlash CUDA stream ordering. GPU hidden capture, recurrent replay, K/V projection-cache updates, backup copies, graph-copied hidden tensors, and DFlash stream waits now use explicit backend/DFlash stream ordering helpers instead of broad synchronization where possible.
-- Added DFlash drafter K/V projection caching for the cross-attention window, including ring-only persistent K/V storage, chronological D2D append/interleave helpers, K-only/V-only diagnostic isolation, a kill switch, CUDA graph-capture exclusions, and multi-GPU draft-placement fallback.
-- Reduced DFlash verification overhead in greedy paths. The verifier can consume reduced top-k logits for eligible DFlash verify batches, skips raw-logit readback when the reduced path is active, preserves seed-token row alignment, and falls back when grammar, sampling, or reasoning state requires full logits.
-- Fixed reasoning-end forcing with reduced DFlash verification. When an EOG token appears during active reasoning, the sampler now forces the reasoning-end token through the normal forcing path instead of accepting an unsafe reduced candidate set.
-- Reworked DFlash prefill capture and flush handling. Prefill capture now tracks per-slot plans, GPU staging buffers, source-aware CPU/GPU ring validity, suffix spans across internal ubatches, graph-reuse keys for source/destination offsets, and fail-closed behavior for partial or mismatched captures.
-- Hardened target hidden-state capture across Qwen3.5, Qwen3.5-MoE, Gemma4-ISWA, GPU tape, hidden-only, and prefill-only contexts. Capture layer assignment, callback suppression, token-count derivation, per-view plans, multi-slot GPU cross data, and shape validation now have explicit checks and regression coverage.
-- Added DFlash input and contract validation. The drafter now rejects cross feature-size mismatches, validates mask tokens and target layer metadata, logs target/drafter vocab and context contracts, and provides debug toggles such as `GGML_DFLASH_DEBUG`, `GGML_DFLASH_INPUT_DEBUG`, `GGML_DFLASH_CUDA_DEBUG`, `GGML_DFLASH_FORCE_CPU_CROSS`, and `GGML_DFLASH_VERBOSE_CONTRACT`.
-- Extended CUDA FlashAttention template coverage for 512-wide quantized K/V combinations, including TurboQuant and TCQ cache types, and added generator/test coverage so the 512-dim instances are not dropped.
-- Fixed long-context DFlash and CUDA stability issues, including GPU ring crashes in DFlash K/V updates at long-context prefill, TurboQuant GPU-ring hangs, CUDA driver link propagation, op-table drift, Gated DeltaNet kernel hardening, and stream-safe DFlash replay.
-- Reduced peak memory in the perplexity tool and fixed streaming perplexity/KLD logits memory. Streaming perplexity now writes bounded chunks, checks stream errors, and avoids retaining unbounded logits for long-context KL runs.
-- Improved DFlash draft model discovery and download plumbing so sibling DFlash draft GGUFs can be found more reliably from related model repositories.
-- Improved DFlash converter support and diagnostics. DFlash conversion now handles `dflash_config` metadata nesting, logs metadata warnings and summaries, scopes Gemma4 tokenizer handling, and validates DFlash-specific metadata more clearly.
-- Expanded regression coverage for DFlash server invariants, recurrent prompt-cache shrink/expand, reduced verifier behavior, CUDA stream ordering, prefill staging, ring validity, contract validation, converter metadata, adaptive draft-max behavior, sampling/reasoning-end forcing, and streaming perplexity plumbing.
+- Added compatibility with upstream DFlash PR drafter GGUFs that use `general.architecture = dflash`. Bee now keeps this separate from the older `dflash-draft` schema, understands upstream metadata keys such as `dflash.block_size` and `dflash.target_layer_ids`, reads upstream tensor names, and keeps existing Bee/buun draft GGUF naming intact.
+- Tightened DFlash draft model discovery and converter behavior. Bee now prefers exact sibling DFlash draft directories, supports nested `dflash_config` metadata, scopes Gemma4 tokenizer handling correctly, and logs clearer DFlash metadata warnings and summaries during conversion.
+- Hardened recurrent memory, prompt-cache restore, and unified-KV scheduling. Recurrent resize now repairs its metadata after shrink/expand, the server shrinks recurrent state before prompt-cache save/load when it is safe, backup-sequence cleanup is tracked correctly, and non-parent tasks defer unified-KV admission so large pending prompts do not over-commit shared cells.
+- Added richer DFlash diagnostics, profiling, and validation. `GGML_DFLASH_PROFILE` now exposes categorized summary/replay/copy/prefill/verify/trace logging, routine decode timing is hidden behind debug logging instead of always printing, the profit controller now logs when it disables speculative depth, drafter/target contract and input validation are stricter, and Bee also exposes targeted debug envs such as `GGML_DFLASH_DEBUG`, `GGML_DFLASH_INPUT_DEBUG`, `GGML_DFLASH_CUDA_DEBUG`, `GGML_DFLASH_FORCE_CPU_CROSS`, `GGML_DFLASH_VERBOSE_CONTRACT`, and `GGML_DFLASH_CRASH_TRACE`.
+- Improved DFlash CUDA ordering and split-buffer correctness. Hidden capture, recurrent replay, backup copies, K/V projection-cache updates, and DFlash stream waits now use explicit ordering helpers and safer backend ownership checks instead of broader synchronization or wrong-buffer access.
+- Added DFlash drafter K/V projection caching for the cross-attention window. Bee now keeps ring-backed drafter K/V state for recent target hidden-state windows, supports chronological D2D append/interleave on CUDA, excludes the unsafe parts from graph capture when needed, and falls back more safely on placements that cannot use the fast GPU path.
+- Reworked DFlash prefill capture and flush handling. Prefill capture now uses per-slot and per-view plans, GPU staging buffers, source-aware CPU/GPU ring validity, suffix-span tracking across internal ubatches, graph-reuse keys for source/destination offsets, callback suppression for irrelevant ubatches, and fail-closed behavior for partial or mismatched captures.
+- Hardened target hidden-state capture across Qwen3.5, Qwen3.5-MoE, Gemma4-ISWA, hidden-only contexts, GPU tape, and multi-slot GPU cross data. Capture layer assignment, token-count derivation, callback routing, and GPU multi-slot cross collection now have explicit correctness checks.
+- Reduced greedy DFlash verification overhead and made verifier control stricter. Eligible verify batches can use reduced top-k logits without raw-logit readback, Bee keeps seed-row alignment correct, the flat verify horizon is capped, server-side depth control is authoritative, and the reduced path falls back when grammar, sampler, or reasoning state requires full logits.
+- Hardened DFlash reasoning, draft, and suffix handling. Reasoning-end forcing now goes through the normal full-logits path when needed, invalid reduced-logits drafts are rejected instead of crashing or looping, empty drafts fall back safely, accepted-prefix full-KV commits respect the drafter window, explicit `--spec-draft-ctx-size` overrides are tracked correctly, Bee keeps the DFlash auto-`-cd 256` default path when no draft ctx is passed, and the drafter stays aligned with the live accepted suffix.
+- Improved Gemma 4 support substantially. Bee added Gemma4-ISWA DFlash target plumbing and profiling callbacks, ported the cleaner upstream Gemma4 graph and loader path back onto Bee hooks, restored Bee precision behavior where needed, synced SWA max-position authority and 512-dim FlashAttention selection with upstream, and fixed Gemma multimodal image decode and dynamic resize bounds.
+- Extended CUDA kernel coverage and backend hardening. Bee now keeps 512-wide quantized FlashAttention instances for standard and TurboQuant/TCQ KV combinations, syncs upstream Hadamard rotation plumbing, propagates CUDA driver links correctly, and hardens op-table / Gated DeltaNet integration alongside long-context GPU ring stability fixes.
+- Reduced peak memory in the perplexity tool and fixed streaming perplexity / KLD cache handling. Streaming perplexity now writes bounded chunks, checks stream errors, avoids retaining unbounded logits for long-context KL runs, and keeps the logits-cache format versioning compatible with the legacy magic.
+- Completed the malformed tool-call guard path for non-stream responses. Final OpenAI-compatible responses now quarantine malformed raw tool-looking text the same way streamed tool-parsing responses already did.
 
 ## v0.1.2