|
2 | 2 |
|
3 | 3 | ## v0.2.0 |
4 | 4 |
|
5 | | -- Added compatibility with upstream DFlash PR drafter GGUFs that use `general.architecture = dflash`. Bee now keeps this as a separate schema from `dflash-draft`, reads upstream metadata keys such as `dflash.block_size` and `dflash.target_layer_ids`, uses upstream tensor names such as `fc.weight`, `hidden_norm.weight`, and `blk.N.ffn_norm.weight`, and keeps existing Bee/buun `dflash-draft` tensor and metadata names unchanged. |
6 | | -- Hardened recurrent memory resizing and prompt-cache restore paths. Recurrent resize now repairs tails, sequence IDs, source links, `used`, `head`, `n`, and recurrent-state size metadata after shrink/expand, while the server shrinks recurrent state before prompt-cache save/load when it is safe to do so. |
7 | | -- Fixed DFlash backup-sequence cleanup and recurrent copy behavior. Server slots now track the backup sequence ID they created, clear leaked backup cells on release, and preserve the local DFlash recurrent copy-plan invalidation path after recurrent memory changes. |
8 | | -- Added unified-KV admission deferral for non-parent server tasks so large pending prompts do not over-commit shared KV cells while active slots still own prompt or task tokens. |
9 | | -- Added categorized DFlash profiling and diagnostics. `GGML_DFLASH_PROFILE` can enable summary, replay, copy, prefill, verify, and trace logging, and DFlash logs now report draft/verify/accept timing, graph reuse, reduced-verifier decisions, stream-copy behavior, and contract details. |
10 | | -- Improved DFlash CUDA stream ordering. GPU hidden capture, recurrent replay, K/V projection-cache updates, backup copies, graph-copied hidden tensors, and DFlash stream waits now use explicit backend/DFlash stream ordering helpers instead of broad synchronization where possible. |
11 | | -- Added DFlash drafter K/V projection caching for the cross-attention window, including ring-only persistent K/V storage, chronological D2D append/interleave helpers, K-only/V-only diagnostic isolation, a kill switch, CUDA graph-capture exclusions, and multi-GPU draft-placement fallback. |
12 | | -- Reduced DFlash verification overhead in greedy paths. The verifier can consume reduced top-k logits for eligible DFlash verify batches, skips raw-logit readback when the reduced path is active, preserves seed-token row alignment, and falls back when grammar, sampling, or reasoning state requires full logits. |
13 | | -- Fixed reasoning-end forcing with reduced DFlash verification. When an EOG token appears during active reasoning, the sampler now forces the reasoning-end token through the normal forcing path instead of accepting an unsafe reduced candidate set. |
14 | | -- Reworked DFlash prefill capture and flush handling. Prefill capture now tracks per-slot plans, GPU staging buffers, source-aware CPU/GPU ring validity, suffix spans across internal ubatches, graph-reuse keys for source/destination offsets, and fail-closed behavior for partial or mismatched captures. |
15 | | -- Hardened target hidden-state capture across Qwen3.5, Qwen3.5-MoE, Gemma4-ISWA, GPU tape, hidden-only, and prefill-only contexts. Capture layer assignment, callback suppression, token-count derivation, per-view plans, multi-slot GPU cross data, and shape validation now have explicit checks and regression coverage. |
16 | | -- Added DFlash input and contract validation. The drafter now rejects cross feature-size mismatches, validates mask tokens and target layer metadata, logs target/drafter vocab and context contracts, and provides debug toggles such as `GGML_DFLASH_DEBUG`, `GGML_DFLASH_INPUT_DEBUG`, `GGML_DFLASH_CUDA_DEBUG`, `GGML_DFLASH_FORCE_CPU_CROSS`, and `GGML_DFLASH_VERBOSE_CONTRACT`. |
17 | | -- Extended CUDA FlashAttention template coverage for 512-wide quantized K/V combinations, including TurboQuant and TCQ cache types, and added generator/test coverage so the 512-dim instances are not dropped. |
18 | | -- Fixed long-context DFlash and CUDA stability issues, including GPU ring crashes in DFlash K/V updates at long-context prefill, TurboQuant GPU-ring hangs, CUDA driver link propagation, op-table drift, Gated DeltaNet kernel hardening, and stream-safe DFlash replay. |
19 | | -- Reduced peak memory in the perplexity tool and fixed streaming perplexity/KLD logits memory. Streaming perplexity now writes bounded chunks, checks stream errors, and avoids retaining unbounded logits for long-context KL runs. |
20 | | -- Improved DFlash draft model discovery and download plumbing so sibling DFlash draft GGUFs can be found more reliably from related model repositories. |
21 | | -- Improved DFlash converter support and diagnostics. DFlash conversion now handles `dflash_config` metadata nesting, logs metadata warnings and summaries, scopes Gemma4 tokenizer handling, and validates DFlash-specific metadata more clearly. |
22 | | -- Expanded regression coverage for DFlash server invariants, recurrent prompt-cache shrink/expand, reduced verifier behavior, CUDA stream ordering, prefill staging, ring validity, contract validation, converter metadata, adaptive draft-max behavior, sampling/reasoning-end forcing, and streaming perplexity plumbing. |
| 5 | +- Added compatibility with upstream DFlash PR drafter GGUFs that use `general.architecture = dflash`. Bee now keeps this separate from the older `dflash-draft` schema, understands upstream metadata keys such as `dflash.block_size` and `dflash.target_layer_ids`, reads upstream tensor names, and keeps existing Bee/buun draft GGUF naming intact. |
| 6 | +- Tightened DFlash draft model discovery and converter behavior. Bee now prefers exact sibling DFlash draft directories, supports nested `dflash_config` metadata, scopes Gemma4 tokenizer handling correctly, and logs clearer DFlash metadata warnings and summaries during conversion. |
| 7 | +- Hardened recurrent memory, prompt-cache restore, and unified-KV scheduling. Recurrent resize now repairs its metadata after shrink/expand, the server shrinks recurrent state before prompt-cache save/load when it is safe, backup-sequence cleanup is tracked correctly, and non-parent tasks defer unified-KV admission so large pending prompts do not over-commit shared cells. |
| 8 | +- Added richer DFlash diagnostics, profiling, and validation. `GGML_DFLASH_PROFILE` now exposes categorized summary/replay/copy/prefill/verify/trace logging, routine decode timing is hidden behind debug logging instead of always printing, and the profit controller now logs when it disables speculative depth. Drafter/target contract and input validation are stricter, and Bee also exposes targeted debug environment variables such as `GGML_DFLASH_DEBUG`, `GGML_DFLASH_INPUT_DEBUG`, `GGML_DFLASH_CUDA_DEBUG`, `GGML_DFLASH_FORCE_CPU_CROSS`, `GGML_DFLASH_VERBOSE_CONTRACT`, and `GGML_DFLASH_CRASH_TRACE`. |
| 9 | +- Improved DFlash CUDA ordering and split-buffer correctness. Hidden capture, recurrent replay, backup copies, K/V projection-cache updates, and DFlash stream waits now use explicit ordering helpers and safer backend ownership checks instead of broader synchronization or wrong-buffer access. |
| 10 | +- Added DFlash drafter K/V projection caching for the cross-attention window. Bee now keeps ring-backed drafter K/V state for recent target hidden-state windows, supports chronological D2D append/interleave on CUDA, excludes the unsafe parts from graph capture when needed, and falls back more safely on placements that cannot use the fast GPU path. |
| 11 | +- Reworked DFlash prefill capture and flush handling. Prefill capture now uses per-slot and per-view plans, GPU staging buffers, source-aware CPU/GPU ring validity, suffix-span tracking across internal ubatches, graph-reuse keys for source/destination offsets, callback suppression for irrelevant ubatches, and fail-closed behavior for partial or mismatched captures. |
| 12 | +- Hardened target hidden-state capture across Qwen3.5, Qwen3.5-MoE, Gemma4-ISWA, hidden-only contexts, GPU tape, and multi-slot GPU cross data. Capture layer assignment, token-count derivation, callback routing, and GPU multi-slot cross collection now have explicit correctness checks. |
| 13 | +- Reduced greedy DFlash verification overhead and made verifier control stricter. Eligible verify batches can use reduced top-k logits without raw-logit readback, Bee keeps seed-row alignment correct, the flat verify horizon is capped, server-side depth control is authoritative, and the reduced path falls back when grammar, sampler, or reasoning state requires full logits. |
| 14 | +- Hardened DFlash reasoning, draft, and suffix handling. Reasoning-end forcing now goes through the normal full-logits path when needed, invalid reduced-logits drafts are rejected instead of crashing or looping, and empty drafts fall back safely. Accepted-prefix full-KV commits respect the drafter window, explicit `--spec-draft-ctx-size` overrides are tracked correctly, Bee keeps the DFlash auto-`-cd 256` default path when no draft context size is passed, and the drafter stays aligned with the live accepted suffix. |
| 15 | +- Improved Gemma 4 support. Bee now has Gemma4-ISWA DFlash target plumbing and profiling callbacks, ports the cleaner upstream Gemma4 graph and loader path back onto Bee hooks, restores Bee precision behavior where needed, syncs SWA max-position authority and 512-dim FlashAttention selection with upstream, and fixes Gemma multimodal image decode and dynamic resize bounds. |
| 16 | +- Extended CUDA kernel coverage and backend hardening. Bee now keeps 512-wide quantized FlashAttention instances for standard and TurboQuant/TCQ KV combinations, syncs upstream Hadamard rotation plumbing, propagates CUDA driver links correctly, and hardens op-table / Gated DeltaNet integration alongside long-context GPU ring stability fixes. |
| 17 | +- Reduced peak memory in the perplexity tool and fixed streaming perplexity / KLD cache handling. Streaming perplexity now writes bounded chunks, checks stream errors, avoids retaining unbounded logits for long-context KL runs, and keeps the logits-cache format versioning compatible with the legacy magic. |
| 18 | +- Completed the malformed tool-call guard path for non-stream responses. Final OpenAI-compatible responses now quarantine malformed raw text that looks like a tool call, the same way streamed tool-parsing responses already did. |
23 | 19 |
|
24 | 20 | ## v0.1.2 |
25 | 21 |
|
|