
Commit 633cd34

v0.1.2
- Fixed the adaptive `profit` controller's no-spec baseline path. Profit mode now seeds baseline samples before positive-depth warmup, can shut DFlash fully off when the measured baseline wins, and no longer makes speculative decisions from draft-only telemetry.
- Fixed profit-controller reset handling across context-bucket and configuration changes so cleared baseline telemetry cannot leave the controller in a stale active or off state.
- Added low-frequency profit-controller baseline reprobes with `--spec-dm-profit-baseline-interval` / `LLAMA_ARG_SPEC_DM_PROFIT_BASELINE_INTERVAL` so runs can refresh target-only timing as context grows. The default interval is 1024 active speculative cycles; reprobes resume the previous active draft depth and avoid off-probe counter starvation.
- Hardened active-reasoning EOS handling. When an end-of-generation token appears while reasoning output is still active, the sampler now forces the reasoning-end sequence through the normal full-logits path; reduced DFlash verification rejects that case instead of accepting an unsafe reduced candidate set.
- Hardened DFlash on split CUDA / multi-GPU placement. GPU cross-ring setup, hidden capture, CUDA graph capture, K/V projection cache updates, recurrent replay, conv replay, and async tensor get/set paths now check buffer/backend ownership and fall back to safer CPU or owning-buffer paths instead of reading or writing recurrent state through the wrong CUDA backend.
- Added clearer diagnostics and regression coverage for multi-GPU DFlash fallback decisions, CUDA graph buffer visibility, wrong-device async tensor access, active-reasoning reduced-sampling rejection, adaptive DM defaults, and profit-controller baseline behavior.
- Fixed ROCm 7 build: added `cudaPointerAttributes` / `cudaMemoryType` shim aliases to `hip.h`, extended `CUDART_VERSION >= 10000` guards with `|| defined(GGML_USE_HIP)` so the `.type` field path is taken on HIP, and removed the `WIN32` guard around TurboQuant flash-attention instance compilation so Linux ROCm builds include the turbo KV-cache kernels (acerspyro#11).
- Known limitation: the current multi-GPU DFlash path is a correctness fallback, not a performant split-GPU implementation. On split target placement it can be slower than non-speculative decoding because recurrent replay and hidden capture avoid unsafe single-backend GPU fast paths. A performant implementation still needs per-device replay graphs or a scheduler that follows ggml's split-buffer ownership model.
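As a rough illustration of the baseline-reprobe behavior described in the commit message, the sketch below counts active speculative cycles, schedules one target-only cycle when the interval is reached, and then resumes the previous draft depth. `ReprobeScheduler` and all of its members are hypothetical names invented for this example; the real controller is not shown in this commit and may be structured quite differently.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch: none of these names come from the project source.
struct ReprobeScheduler {
    uint32_t interval;           // active speculative cycles between baseline reprobes
    uint32_t active_cycles = 0;  // advanced only by cycles that ran with depth > 0
    int      active_depth;       // draft depth used while speculating (0 = off)
    int      saved_depth = 0;    // depth to restore after a reprobe
    bool     probing = false;    // true while a baseline reprobe is in flight

    ReprobeScheduler(uint32_t interval_, int depth)
        : interval(interval_), active_depth(depth) {}

    // Returns the draft depth for the next cycle; 0 means a target-only baseline cycle.
    int next_depth() {
        if (probing) {
            // The baseline cycle has finished: resume the previous active depth.
            probing = false;
            active_depth = saved_depth;
        }
        if (active_depth > 0 && active_cycles >= interval) {
            // Interval reached: remember the depth and run one baseline cycle.
            saved_depth = active_depth;
            active_cycles = 0;
            probing = true;
            return 0;
        }
        if (active_depth > 0) {
            ++active_cycles;  // baseline cycles never advance this counter
        }
        return active_depth;
    }
};
```

With an interval of 3 and a starting depth of 4, successive calls to `next_depth()` yield 4, 4, 4, 0, 4: three active cycles, one baseline reprobe, then the saved depth again. Counting only active cycles is one simple way to keep repeated baseline cycles from starving the counter; whether the project does it this way is an assumption.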
1 parent 3109a0b commit 633cd34

1 file changed

Lines changed: 3 additions & 3 deletions


CHANGELOG.md

@@ -3,12 +3,12 @@
 ## v0.1.2
 
 - Fixed the adaptive `profit` controller's no-spec baseline path. Profit mode now seeds baseline samples before positive-depth warmup, can shut DFlash fully off when the measured baseline wins, and no longer makes speculative decisions from draft-only telemetry.
-- Fixed a profit-controller bucket-transition deadlock where telemetry reset could clear the no-spec baseline while preserving a positive active draft depth, causing all later cycles to run as unrecorded single-token baseline and leaving DFlash permanently disabled.
-- Added low-frequency profit-controller baseline reprobes with `--spec-dm-profit-baseline-interval` / `LLAMA_ARG_SPEC_DM_PROFIT_BASELINE_INTERVAL` so runs can refresh target-only timing as context grows. The default interval is 1024 active speculative cycles to keep probe overhead minimal; bucket transitions still seed a fresh baseline. Off-state probes now restart with the configured probe depth instead of jumping straight to full draft depth.
-- Stabilized profit depth selection on the production ladder (`0`-`8`, `10`, `12`, `14`, `16`, and the configured max) while preserving the previous active depth across baseline reprobes and avoiding off-probe counter starvation from repeated baseline cycles.
+- Fixed profit-controller reset handling across context-bucket and configuration changes so cleared baseline telemetry cannot leave the controller in a stale active or off state.
+- Added low-frequency profit-controller baseline reprobes with `--spec-dm-profit-baseline-interval` / `LLAMA_ARG_SPEC_DM_PROFIT_BASELINE_INTERVAL` so runs can refresh target-only timing as context grows. The default interval is 1024 active speculative cycles; reprobes resume the previous active draft depth and avoid off-probe counter starvation.
 - Hardened active-reasoning EOS handling. When an end-of-generation token appears while reasoning output is still active, the sampler now forces the reasoning-end sequence through the normal full-logits path; reduced DFlash verification rejects that case instead of accepting an unsafe reduced candidate set.
 - Hardened DFlash on split CUDA / multi-GPU placement. GPU cross-ring setup, hidden capture, CUDA graph capture, K/V projection cache updates, recurrent replay, conv replay, and async tensor get/set paths now check buffer/backend ownership and fall back to safer CPU or owning-buffer paths instead of reading or writing recurrent state through the wrong CUDA backend.
 - Added clearer diagnostics and regression coverage for multi-GPU DFlash fallback decisions, CUDA graph buffer visibility, wrong-device async tensor access, active-reasoning reduced-sampling rejection, adaptive DM defaults, and profit-controller baseline behavior.
+- Fixed ROCm 7 build: added `cudaPointerAttributes` / `cudaMemoryType` shim aliases to `hip.h`, extended `CUDART_VERSION >= 10000` guards with `|| defined(GGML_USE_HIP)` so the `.type` field path is taken on HIP, and removed the `WIN32` guard around TurboQuant flash-attention instance compilation so Linux ROCm builds include the turbo KV-cache kernels (acerspyro#11).
 - Known limitation: the current multi-GPU DFlash path is a correctness fallback, not a performant split-GPU implementation. On split target placement it can be slower than non-speculative decoding because recurrent replay and hidden capture avoid unsafe single-backend GPU fast paths. A performant implementation still needs per-device replay graphs or a scheduler that follows ggml's split-buffer ownership model.
 
 ## v0.1.1
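The ROCm 7 entry in v0.1.2 describes widening a `CUDART_VERSION >= 10000` guard with `|| defined(GGML_USE_HIP)` so that HIP builds read the `.type` field. The sketch below shows the shape of such a guard using stand-in types so it compiles without the CUDA or HIP toolkits. `MockPointerAttributes`, `MockMemoryType`, and `is_device_pointer` are hypothetical, and the fallback field name `memoryType` is an assumption about the legacy path, not something this commit states.

```cpp
#include <cassert>

// Stand-in types for illustration only; real code would use the toolkit's
// pointer-attribute struct (or the hip.h shim aliases named in the changelog).
enum MockMemoryType { MOCK_HOST = 1, MOCK_DEVICE = 2 };

// One macro captures the widened guard so the struct and its reader agree.
#if (defined(CUDART_VERSION) && CUDART_VERSION >= 10000) || defined(GGML_USE_HIP)
#define SKETCH_HAS_TYPE_FIELD 1
#else
#define SKETCH_HAS_TYPE_FIELD 0
#endif

struct MockPointerAttributes {
#if SKETCH_HAS_TYPE_FIELD
    MockMemoryType type;        // field read on newer CUDA and on HIP
#else
    MockMemoryType memoryType;  // assumed legacy field name
#endif
};

// Reads whichever field exists for the active toolchain.
inline bool is_device_pointer(const MockPointerAttributes & attr) {
#if SKETCH_HAS_TYPE_FIELD
    return attr.type == MOCK_DEVICE;
#else
    return attr.memoryType == MOCK_DEVICE;
#endif
}
```

Without the `defined(GGML_USE_HIP)` term, a HIP build that does not define a sufficiently large `CUDART_VERSION` would fall into the legacy branch and reference a field the struct may no longer have, which is the kind of build failure the changelog entry appears to address.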
