ddtree: support dflash server cache and chain validation #1
Conversation
Add three ggml extension ops for DDTree-style speculative decoding verify:
- ggml_ssm_conv_tree (src[2] = parent_ids)
- ggml_gated_delta_net_tree (src[6] = parent_ids)
- ggml_gated_delta_net_tree_persist (src[7] = persist_inter)

Tree mode is signalled via src[] slots; no new op codes. The gated_delta_net result tensor is enlarged to pack per-token intermediate recurrent states so the spec-decode loop can roll back without a replay forward.

Referenced by standalone Luce DFlash (https://github.com/Luce-Org/luce-dflash).
Implements TQ3_0 quantization for the KV cache, achieving a 22% memory reduction vs Q4_0 (3.5 vs 4.5 bpv) with comparable decode speed.

Key components:
- block_tq3_0 struct (14 bytes/32 elements, group-quantized with FWHT rotation)
- CUDA quantize/dequantize kernels (cpy, set_rows, getrows)
- Flash attention vec kernel integration (vec_dot_KQ, dequantize_V)
- Chunked FA kernel for TQ3_0 prefill (fallback when MMA unavailable)
- FWHT rotation: K/V stored in rotated space, Q rotated at attention time
- Auto-routing to chunked kernel for TQ3_0 during prefill

Benchmarks on RTX 3090, Qwen3.5-27B at 262K context:
- KV memory: 3584 MiB (TQ3_0) vs 4608 MiB (Q4_0) — 22% savings
- Decode: ~40 tok/s (identical)
- Prefill: ~10% slower than Q4_0 (FWHT overhead)

Usage: llama-server -ctk tq3_0 -ctv tq3_0
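The headline 22% follows directly from the block geometry; a quick arithmetic check (not part of the patch):

```python
# TQ3_0 block geometry: 14 bytes store 32 elements
bits_per_value = 14 * 8 / 32        # -> 3.5 bpv, as stated
theoretical_ratio = 3.5 / 4.5       # vs Q4_0's 4.5 bpv
measured_savings = 1 - 3584 / 4608  # KV MiB from the RTX 3090 benchmark
```

Both the theoretical ratio and the measured MiB numbers land on the same ~22%.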
feat: add TQ3_0 (TurboQuant 3.5bpv) KV cache type
Add tree-mode batch path for DDTree-style speculative decode verify. Default (parent_id == NULL) keeps chain mode behavior unchanged.

Public API:
- llama_batch.parent_id (int32_t *, NULL = chain mode)
- llama_batch_init_tree() / free symmetrically handles parent_id

Internal:
- llama_ubatch propagates parent_id; asserts tree batches fit one ubatch
- llm_graph_input_tree builds parent_ids (i32) + tree_mask (f32 ancestor-only)
- llama_decode rejects tree batches on non-QWEN35 architectures
- delta-net dispatches to build_delta_net_tree when parent_ids is set
- qwen35 full-attn substitutes tree_mask for KQ_MASK; ssm_conv_tree replaces ssm_conv on linear-attn layers when parent_ids is present

Out of scope (Phase 2+): persist-buffer variant, snapshot/restore, M-RoPE 4-axis positions (UNKNOWN-3 in roadmap).
Add tests/test-qwen35-tree.cpp: loads Qwen3.5 GGUF, runs forward in either chain or tree mode, dumps F32 logits to a binary file (int32 n_tokens, int32 vocab_size, then n_tokens*vocab_size floats).

Tree fixture format: JSON with committed_offset and a list of nodes (flat_idx, token_id, parent_idx, depth). Phase 1 fixture is 5 nodes (root + chain + 1 sibling) to exercise both ancestor-walk and branching.

CMake target gated by -DLLAMA_BUILD_TESTS_QWEN35_TREE=ON (off by default; the test requires a 16+ GB GGUF not in CI). tests/.gitignore: re-include fixtures/ subtree.
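For consumers of the dump, the layout reads back with a few lines of Python (a sketch; the function name is ours, and little-endian is assumed to match the int32 LE convention used elsewhere in the harness):

```python
import struct

def read_logits_dump(path):
    # Header: int32 n_tokens, int32 vocab_size (assumed little-endian).
    # Payload: n_tokens * vocab_size float32 logits, row-major per token.
    with open(path, "rb") as f:
        n_tokens, vocab_size = struct.unpack("<ii", f.read(8))
        flat = struct.unpack("<%df" % (n_tokens * vocab_size),
                             f.read(4 * n_tokens * vocab_size))
    return [list(flat[i * vocab_size:(i + 1) * vocab_size]) for i in range(n_tokens)]
```

Under a --abs-tol 0 style comparison, two dumps match only if every float is bit-identical.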
split_equal increments n_used per token before calling ubatch_add, so the previous 'n_used == 0' guard fired on the first (and only) tree ubatch. Check 'all tokens emitted in this ubatch' instead, which is the actual semantic we want.
build_inp_tree allocates the mask as F32 (for clarity in set_input), but ggml_flash_attn_ext asserts mask->type == F16 when FA is enabled. Cast at construction time so all downstream consumers see the right type, mirroring how self_kq_mask_cnv is built.
Previous attempt allocated a separate tree_mask graph input and substituted self_kq_mask_cnv. That orphaned the F32 self_kq_mask, so gallocr left it without a backend buffer and set_input_kq_mask asserted on dst->buffer == NULL. Better: keep the standard kq_mask path intact and, when ubatch->parent_id is set, have llama_kv_cache::set_input_kq_mask write the ancestor-only pattern directly into dst. Phase 1 asserts n_stream==1 and n_kv==n_tokens (no past); these will relax in later phases. Drops: - llm_graph_input_tree::inp_tree_mask (no longer needed) - llm_graph_context::tree_mask field - qwen35.cpp self_kq_mask_cnv substitution
Tree mode allocates KV cells for the new batch tokens, but n_kv can be padded above n_tokens by the cache layer. Surplus columns are filled with -inf (treated as empty cells), so the assertion only needs n_kv to be at least n_tokens.
Public C API:
- llama_seq_snapshot / restore / release for recurrent state
- LLAMA_MEM_SNAPSHOT_INVALID sentinel
Internal:
- llama_memory_recurrent::{snapshot,restore,release} allocate per-layer
backup tensors on the same backend as the main cache and use
ggml_backend_tensor_copy to capture/restore SSM + conv state for one
seq_id. No-op when called on non-recurrent memory.
- llama_kv_cache::seq_compact_tree copies K/V rows from accepted DFS
slots to spine slots [0, commit_n) and truncates the cache logical
length. v_trans=true path is skipped (qwen3.5 uses v_trans=false).
- build_delta_net_tree gains an optional persist_inter parameter that
routes to ggml_gated_delta_net_tree_persist; existing callers pass
nullptr.
Punted (Phase 2.4 in roadmap): wiring per-layer F16 persist buffers
through llm_build_qwen35 — needs clearer decision on lifetime/ownership
in llama_model vs llama_context. Documented in build_delta_net_tree
header comment.
tests/test-qwen35-tree-rollback.cpp: prompt prefill, snapshot, decode K chain steps, restore, decode K chain steps with the same first token, dump pre/post logits. Driver expects bit-equality after restore; verify with scripts/compare_logits.py --abs-tol 0.

CMake target gated by -DLLAMA_BUILD_TESTS_QWEN35_TREE_ROLLBACK=ON (default OFF, not registered with ctest).

Test 2.B (vs test_dflash golden state) deferred — depends on test_dflash --dump-state-at-commit (Phase 0 prerequisite).
Only the last position has logits enabled (batch.logits[n-1]=1), so llama_get_logits_ith(ctx, 0) returned a null row and segfaulted. Read at n_prompt-1 instead.
Qwen3.5 hybrid uses llama_memory_hybrid which contains a llama_memory_recurrent component plus a KV cache. The C wrappers were dynamic_cast'ing to llama_memory_recurrent directly, which fails on hybrid models — so llama_seq_snapshot returned LLAMA_MEM_SNAPSHOT_INVALID even though the underlying recurrent state exists. Fall back to hyb->get_mem_recr() when the direct cast misses.
ggml_backend_tensor_copy dereferences src->buffer directly, not src->view_src->buffer, so copying through a no_alloc view tensor crashes with the buffer assertion. Use ggml_backend_tensor_get/set with explicit row offsets instead — they correctly resolve view_src and just bounce a single row through a host buffer per layer.
Tensor data alone isn't enough — after restore, the recurrent memory still tracks the post-run-1 position, so M-RoPE rejects the next batch because X (=23) >= Y (=16). Capture cell_id, cells[cell_id].pos, cells[cell_id].src, and cells[seq_id].tail at snapshot time and write them back in restore.
New arch LLM_ARCH_DFLASH_DRAFT (5-layer non-causal Qwen3 variant):
- llama-arch: register arch + 3 new tensor names (fc, hidden_norm, out_norm)
- llama-hparams: dflash_target_capture_layers / target_n_embd / mask_token_id / block_size fields
- llama-model: GGUF metadata parsing under "dflash_draft.*" keys, tensor loader, model fields for dflash_fc / dflash_hidden_norm
- src/models/dflash-draft.cpp: 5-layer forward — Q from noise, K/V from fc(target_feat) || noise concat, NEOX rope theta=10M, FA mask=null (non-causal), GQA 32:8, SwiGLU FFN, shared target lm_head via out_norm + lora_mm

Hidden capture for target qwen35 (opt-in):
- llama-context: capture_hidden flag + llama_set_capture_hidden / llama_get_hidden_capture C API
- llama-graph: t_hidden_capture result tensor
- qwen35.cpp: when capture_hidden, allocate [n_embd, 5*n_tokens] buffer and ggml_cpy post-FFN residual at the 5 capture layers; zero overhead when capture_hidden=false (no graph nodes added)

Token embed lookup helper:
- llama_model_token_embd_lookup: F32 / F16 (via fp16_to_fp32_row); -1 for out-of-range or unsupported quant types

Punted: persist-buffer wiring through llm_build_qwen35 (Phase 2.4 still stands; will be revisited with Phase 4's DDTree driver). Mask token embed source for the host driver is exposed via the API above; actual spec-decode wiring lives in Phase 4.
tests/test-dflash-draft.cpp — loads target+draft, looks up [last_tok, MASK*15] from target token_embd, reads target_feat binary, runs draft forward, dumps 16 x vocab logits.

tests/test-qwen35-chain-capture.cpp — Mode A: capture enabled, asserts shape/NaN/nonzero on the capture buffer; Mode B: capture disabled, chain re-decode, asserts logits bit-equal to a previously captured baseline.

tests/fixtures/ddtree/dflash_draft_metadata_smoke.json — expected metadata fields for the converted draft GGUF (used by scripts/check_dflash_draft_gguf.py at the super-repo level).

CMake target gated by -DLLAMA_BUILD_TESTS_DFLASH_DRAFT=ON (default OFF, not registered with ctest). Test 3.B (vs dflash dump-draft-output) deferred — depends on test_dflash --dump-draft-output (Phase 0 prerequisite).
- chain-capture: llama_kv_self_clear was removed; switch to llama_memory_clear(llama_get_memory(ctx), true).
- dflash-draft: llama_model_token_embd_lookup is single-token; loop over the 16-token noise batch instead of passing an array.
qwen35.cpp allocates the capture buffer as [n_embd, 5*n_tokens] (slots stacked along ne[1]). The test was asserting the transposed [5*n_embd, n_tokens] form. Accept both — capture data is the same either way; downstream consumers can pick whichever stride they prefer.
ggml_set_input on the capture buffer was wrong — the buffer is written by GPU ggml_cpy ops during forward and read by host code afterward, which is the output pattern. Drop set_input from qwen35.cpp; have llm_graph_result::set_outputs register t_hidden_capture so the backend syncs the writes back to host memory. Also reset the field in llm_graph_result::reset. Without this fix the host saw an all-zero buffer because the GPU writes never propagated to the input-side host pointer.
qwen35.cpp: ggml_set_input on hidden_cap_buf is required for ggml-alloc to give the leaf tensor a backend (OUTPUT-only path leaves backend_id = -1 and crashes gallocr). Both flags coexist: INPUT for backend assignment, OUTPUT for post-forward host sync. llama_model_token_embd_lookup: handle quantized tok_embd via the type traits' to_float dequantize path. Required for the dflash draft test which feeds noise embeddings looked up from a Q4_K_M target.
Hidden capture: replaced pre-allocated INPUT buffer + cpy-to-view pattern (which silently wrote zeros because gallocr put the dst on CPU-pinned host while src lived on device) with the cleaner approach: collect per-layer hidden states into cap_slots[5] using ggml_cont, then ggml_concat them along dim 1 into [n_embd, 5*n_tokens] AFTER the layer loop, register as t_hidden_capture (OUTPUT). The concat node is a regular compute output — gallocr schedules it on the device backend and the sched syncs it to host automatically (same pattern as t_logits / t_embd). dflash-draft FATTN: ggml_flash_attn_ext result tensor must have name prefix "__fattn__-<il>" or sched_reserve asserts. Add the cb() call that build_attn_mha makes; minimal mirror of the existing helper.
common/speculative-tree.{h,cpp}: pure CPU algorithms moved out of
test_dflash for reuse. Includes:
- build_ddtree (best-first heap with optional chain seed; budget
counts total nodes including root)
- follow_verified_tree (greedy walk along target argmax)
- build_tree_visibility (ancestor-only mask from parent_idx)
- extract_top_k_logprobs (online logsumexp + min-heap top-K, descending)
tests/test-speculative-tree: 4 standalone test cases, no model
required. Registered with ctest via llama_build_and_test.
Validates: a small tree build, a greedy-walk acceptance trace,
a hand-computed visibility mask, and top-K extract numerics.
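The ancestor-only rule is easy to state in a few lines of Python (an illustrative re-implementation of what build_tree_visibility computes from parent_idx, not the C++ source):

```python
def build_tree_visibility(parent_idx):
    """Ancestor-only mask: node i may attend to node j iff j == i or j is an
    ancestor of i in the tree (parent_idx[root] == -1)."""
    n = len(parent_idx)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:          # walk up to the root, marking each ancestor
            mask[i][j] = True
            j = parent_idx[j]
    return mask

# Root + chain of two + one sibling, like the Phase 1 fixture shape:
#   0 -> 1 -> 2
#   0 -> 3
vis = build_tree_visibility([-1, 0, 1, 0])
```

Node 2 sees {0, 1, 2} but not its uncle 3; node 3 sees only {0, 3}.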
llama_get_hidden_capture returns a ggml_tensor* whose data field is a device pointer when the model is offloaded — dereferencing it from host code segfaults. Mirror the logits/embd pattern: keep an std::vector<float> in llama_context, populate it via ggml_backend_tensor_get_async during decode synchronization, and expose it via a new llama_get_hidden_capture_data accessor that returns a CPU pointer plus ne0/ne1 dims. Update the chain-capture test to use the new accessor.
Encoder-side copy was wired but the chain-capture test goes through llama_context::decode(); duplicate the same get_async block there.
target_feat injection (Phase 3.4 gap):
- llm_graph_input_target_feat (new) wires three dflash-draft inputs:
target_feat_raw, pos_q, pos_k. set_input copies pending host data
via memcpy.
- llama_context::set_target_feat_raw stashes data + dims; the next
llama_decode on a draft context picks them up.
- public C API: llama_set_target_feat_raw + llama_kv_cache_seq_compact_tree.
- dflash-draft.cpp builds these inputs through the new helper instead
of standalone ggml_set_input nodes.
Phase 4 spec-decode driver:
- common/speculative-tree-driver.{h,cpp} implements the 10-step DDTree
loop: noise embed lookup → target_feat repack from capture →
draft forward → top-K → build_ddtree → tree-verify → posterior →
follow_verified_tree → KV compact → SSM rollback via snapshot+replay.
The snapshot+replay path is slower than the reference's per-layer
capture but matches Phase 2 invariants. Documented in source.
- examples/speculative-tree/main.cpp CLI wrapper, gated by
LLAMA_BUILD_EXAMPLES_SPECULATIVE_TREE=ON (default OFF).
tests/test-speculative-tree-e2e.cpp: runs the same prompt through chain-mode greedy decode (reference) and through the spec-decode driver (via llama_speculative_tree_driver_step), dumps both token streams as int32 LE binaries.

Acceptance: with --temp 0 the first min(chain_n, spec_n) tokens must be bit-identical (DDTree is lossless for greedy sampling).

CMake target gated by -DLLAMA_BUILD_TESTS_SPECULATIVE_TREE_E2E=ON (default OFF, not registered with ctest). Test 4.B (vs test_dflash golden trajectory) deferred — chain-vs-spec comparison gives strong functional verification independently.
Bug: the driver assumed llama_get_hidden_capture_data returned cumulative hidden states across decodes, but the capture buffer only holds the most recent decode's n_tokens columns. Castle test 4.A diverged at token 1 because the driver was slicing 24 cols from a buffer of just 1.

Fix: the driver now owns a [5*n_embd, target_feat_cap=8192] linear ring indexed by committed_pos. After each target decode (prompt prefill, prompt-style chain steps, or tree-verify) the new tokens' hidden states are copied from the per-decode capture into the ring. On spec-step start, the driver slices the most recent ctx_len = min(committed, 2048) columns and feeds them via llama_set_target_feat_raw. Tree-verify ingest uses accepted_dfs[] so only accepted nodes' hidden states enter the ring (matches the dflash reference semantics).

New public API:
- llama_speculative_tree_driver_ingest_prompt_capture(d, n_prompt) — call after the chain-mode prompt prefill that populated capture, before the first spec step.

example/main.cpp and tests/test-speculative-tree-e2e.cpp both call the new API at the right spot.

Limitation: 8192-column linear buffer (no rotation); aborts on overflow. True ring rotation is a Phase 5 follow-up.
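The ingest/slice accounting described above can be sketched as follows (hypothetical class name; strings stand in for the [5*n_embd] hidden-state columns the real driver stores):

```python
class CaptureRing:
    """Linear (non-rotating) buffer of per-token hidden columns, indexed by
    committed position. Asserts on overflow, mirroring the stated limitation."""
    def __init__(self, cap=8192):
        self.cols = []
        self.cap = cap

    def ingest(self, decode_capture, accepted_idx=None):
        # Tree-verify passes the accepted DFS indices so only accepted
        # nodes' hidden states enter the ring; chain decodes pass everything.
        new = decode_capture if accepted_idx is None else [decode_capture[i] for i in accepted_idx]
        assert len(self.cols) + len(new) <= self.cap, "linear buffer overflow"
        self.cols.extend(new)

    def window(self, max_ctx=2048):
        # The most recent min(committed, max_ctx) columns fed to the draft.
        return self.cols[-min(len(self.cols), max_ctx):]
```

This is accounting only; the real driver moves float columns with memcpy-style copies.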
Phase 1 wrote the tree mask assuming KV slot j == ubatch token j and n_past = 0. After prompt prefill, that's wrong: tree tokens land at slots [n_past, n_past+n_tree) and prompt cells stay at [0, n_past). Each tree query needs to see ALL prompt cells plus its tree ancestors. Switch to the full kv-cells walk: identify the prompt/tree boundary via min(ubatch->pos), then for each KV cell decide visibility based on whether its position is past (always visible) or matches one of the query's ancestor positions in the current tree. Mirrors the dflash reference (test_dflash.cpp:423-442).
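The corrected rule works on positions rather than slot indices; a hedged sketch (hypothetical helper, not the set_input_kq_mask source):

```python
def kq_mask_row(cell_pos, n_past, ancestor_pos):
    # One tree query's mask row over the KV cells: a cell is visible iff it is
    # a prompt/past cell (pos < n_past) or its position is one of the query's
    # tree-ancestor positions (the ancestor set includes the query itself).
    neg_inf = float("-inf")
    return [0.0 if (p < n_past or p in ancestor_pos) else neg_inf for p in cell_pos]
```

Surplus padded cells fall through to -inf, consistent with the earlier padding fix.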
Driver returns [root_echo, draft_accepts..., bonus] where bonus is the next step's root_token and is NOT in KV. Test was treating the whole result as committed (advancing committed_pos by result.size() and appending all to output), so each step double-counted the bonus from the previous step. Result: spec output had a duplicate token at every step boundary, diverging from the chain reference. Mirror the example's correct accounting: out += result[:-1], root_token = result.back(), committed_pos += result.size() - 1.
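The corrected per-step accounting, in Python terms (illustrative; names mirror the description above):

```python
def commit_step(result, out, committed_pos):
    """result = [root_echo, draft_accepts..., bonus]; the bonus token becomes
    the next step's root and is NOT yet in KV, so it must not be committed."""
    out.extend(result[:-1])            # commit everything except the bonus
    root_token = result[-1]            # bonus seeds the next step
    committed_pos += len(result) - 1
    return root_token, committed_pos
```

Committing the full result instead double-counts the bonus at every step boundary, which is exactly the divergence the test hit.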
Result: {"status":"keep","tps":7.73072,"e2e_tps":1.251564,"spec_sec":25.568,"gen_tokens":32,"steps":17,"committed":33,"step_ms":243.49,"pack_ms":4.49,"draft_ms":171.03,"topk_ms":19.51,"exact_ms":48.43,"exact_decode_ms":0,"acceptance":1.941}
Result: {"status":"keep","tps":8.388756,"e2e_tps":1.262128,"spec_sec":25.354,"gen_tokens":32,"steps":17,"committed":33,"step_ms":224.39,"pack_ms":2.24,"draft_ms":154.39,"topk_ms":19.77,"exact_ms":47.96,"exact_decode_ms":0,"acceptance":1.941}
Result: {"status":"keep","tps":8.751467,"e2e_tps":1.277649,"spec_sec":25.046,"gen_tokens":32,"steps":17,"committed":33,"step_ms":215.09,"pack_ms":1.09,"draft_ms":144.76,"topk_ms":20.79,"exact_ms":48.41,"exact_decode_ms":0,"acceptance":1.941}
Result: {"status":"keep","tps":9.112863,"e2e_tps":1.2838,"spec_sec":24.926,"gen_tokens":32,"steps":17,"committed":33,"step_ms":206.56,"pack_ms":1.06,"draft_ms":137.65,"topk_ms":19.8,"exact_ms":48.01,"exact_decode_ms":0,"acceptance":1.941}
Result: {"status":"keep","tps":9.820771,"e2e_tps":1.295075,"spec_sec":24.709,"gen_tokens":32,"steps":16,"committed":33,"step_ms":203.65,"pack_ms":1.08,"draft_ms":134.43,"topk_ms":17.84,"exact_ms":50.27,"exact_decode_ms":0,"acceptance":2.062}
Result: {"status":"keep","tps":10.147906,"e2e_tps":1.29749,"spec_sec":24.663,"gen_tokens":32,"steps":14,"committed":33,"step_ms":225.24,"pack_ms":1.12,"draft_ms":145.35,"topk_ms":20.04,"exact_ms":58.69,"exact_decode_ms":0,"acceptance":2.357}
Result: {"status":"keep","tps":10.443251,"e2e_tps":1.302932,"spec_sec":24.56,"gen_tokens":32,"steps":14,"committed":33,"step_ms":218.87,"pack_ms":1.11,"draft_ms":140.22,"topk_ms":19.49,"exact_ms":58.02,"exact_decode_ms":0,"acceptance":2.357}
Result: {"status":"keep","tps":10.507099,"e2e_tps":1.308686,"spec_sec":24.452,"gen_tokens":32,"steps":14,"committed":33,"step_ms":217.54,"pack_ms":1.11,"draft_ms":138.33,"topk_ms":19.95,"exact_ms":58.11,"exact_decode_ms":0,"acceptance":2.357}
Result: {"status":"keep","tps":11.924636,"e2e_tps":1.328573,"spec_sec":24.086,"gen_tokens":32,"steps":14,"committed":33,"step_ms":191.68,"pack_ms":1.1,"draft_ms":114.78,"topk_ms":17.91,"exact_ms":57.86,"exact_decode_ms":0,"acceptance":2.357}
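The harness fields are self-consistent; taking the first result line above, tps and acceptance can be recomputed from the breakdown (a quick check using the ms-to-seconds-corrected formula):

```python
import json

r = json.loads('{"tps":7.73072,"gen_tokens":32,"steps":17,'
               '"step_ms":243.49,"committed":33,"acceptance":1.941}')

tps = r["gen_tokens"] / (r["steps"] * r["step_ms"] / 1000.0)  # tok/s of spec decode
acceptance = r["committed"] / r["steps"]                      # tokens committed per step
```

The same committed/steps relation reproduces the acceptance values on every run above (33/17 = 1.941, 33/16 = 2.062, 33/14 = 2.357).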
…exact-validation path on real_rendered_prompt
Result: {"status":"keep","tps":9.411626}
Result: {"status":"keep","tps":10.195434,"e2e_tps":1.306336,"spec_sec":24.496,"gen_tokens":32,"steps":14,"committed":33,"step_ms":224.19,"pack_ms":1.16,"draft_ms":143.91,"topk_ms":20.75,"exact_ms":58.32,"acceptance":2.357}
Result: {"status":"keep","tps":10.147005,"acceptance":2.357,"committed":33,"draft_ms":145.55,"e2e_tps":1.302932,"exact_ms":58.27,"gen_tokens":32,"pack_ms":0.94,"spec_sec":24.56,"step_ms":225.26,"steps":14,"topk_ms":20.47}
…oes not regress baseline
Result: {"status":"keep","tps":9.892726,"acceptance":2.429,"committed":34,"draft_ms":147.88,"e2e_tps":0.951446,"exact_ms":61.21,"gen_tokens":32,"pack_ms":1.11,"spec_sec":33.633,"step_ms":231.05,"steps":14,"topk_ms":20.82}
… speedup vs budget 32
Result: {"status":"keep","tps":11.808554,"acceptance":2.267,"committed":34,"draft_ms":106.3,"e2e_tps":1.32714,"exact_ms":55.5,"gen_tokens":32,"pack_ms":1.06,"spec_sec":24.112,"step_ms":180.66,"steps":15,"topk_ms":17.77}
Result: {"status":"keep","tps":10.364171,"acceptance":2.357,"committed":33,"draft_ms":142.29,"e2e_tps":1.309329,"exact_ms":57.78,"gen_tokens":32,"pack_ms":1.1,"spec_sec":24.44,"step_ms":220.54,"steps":14,"topk_ms":19.33}
… work, new best TPS
Result: {"status":"keep","tps":10.254438,"acceptance":2.357,"committed":33,"draft_ms":144.06,"e2e_tps":1.306389,"exact_ms":57.84,"gen_tokens":32,"pack_ms":1.11,"spec_sec":24.495,"step_ms":222.9,"steps":14,"topk_ms":19.86}
…performs chain seed with K=4
Result: {"status":"keep","tps":10.334649,"acceptance":2.357,"committed":33,"draft_ms":142.02,"e2e_tps":1.29907,"exact_ms":57.89,"gen_tokens":32,"pack_ms":1.09,"spec_sec":24.633,"step_ms":221.17,"steps":14,"topk_ms":20.13}
Result: {"status":"keep","tps":10.589855,"e2e_tps":1.303038,"spec_sec":24.558,"gen_tokens":32,"steps":14,"committed":33,"step_ms":215.84,"pack_ms":1.14,"draft_ms":137.23,"topk_ms":19.59,"exact_ms":57.84,"acceptance":2.357}
…topk cost
Result: {"status":"keep","tps":10.666453,"e2e_tps":1.305377,"spec_sec":24.514,"gen_tokens":32,"steps":14,"committed":33,"step_ms":214.29,"pack_ms":1.11,"draft_ms":140.9,"topk_ms":14.71,"exact_ms":57.53,"acceptance":2.357}
5 issues found across 10 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="autoresearch.md">
<violation number="1" location="autoresearch.md:7">
P2: The documented `tps` formula is missing ms→s conversion, so it describes tokens/ms while labeling tok/s.</violation>
</file>
<file name="multi_prompt_probe.sh">
<violation number="1" location="multi_prompt_probe.sh:17">
P2: The script always outputs a final `METRIC tps=0`, which can make benchmark consumers read a false zero TPS result.</violation>
</file>
<file name="autoresearch.sh">
<violation number="1" location="autoresearch.sh:48">
P2: The SSH command string is vulnerable to shell-quoting breakage/injection when env-provided values contain single quotes. Build the remote invocation without interpolating raw values into a quoted shell string.</violation>
</file>
<file name="src/llama-context.cpp">
<violation number="1" location="src/llama-context.cpp:1200">
P2: The new failure path leaks `ggml_context` when per-layer buffer allocation fails before ownership is moved into `dflash_persist_ctxs_bufs`.</violation>
</file>
<file name="common/speculative-tree-driver.cpp">
<violation number="1" location="common/speculative-tree-driver.cpp:555">
P1: Clamp `top_k` to the vocabulary size before using it; otherwise `extract_top_k_logprobs` can access past the heap bounds when `top_k > n_vocab`.</violation>
</file>
From common/speculative-tree-driver.cpp (the hunk under review):

```cpp
// Read draft logits: [block_size, n_vocab].
// Skip position 0 (root slot fixed to root_token) and use positions 1..block_size-1.
const int L = (int)block_size - 1; // draft positions with meaningful predictions
const int K = (d->params.top_k > 0) ? d->params.top_k : ((d->params.budget > L) ? 8 : 1);
```
P1: Clamp top_k to the vocabulary size before using it; otherwise extract_top_k_logprobs can access past the heap bounds when top_k > n_vocab.
<file context>
@@ -550,7 +552,7 @@ std::vector<llama_token> llama_speculative_tree_driver_step(
// Skip position 0 (root slot fixed to root_token) and use positions 1..block_size-1.
const int L = (int)block_size - 1; // draft positions with meaningful predictions
- const int K = (d->params.budget > L) ? 8 : 1;
+ const int K = (d->params.top_k > 0) ? d->params.top_k : ((d->params.budget > L) ? 8 : 1);
d->top_log_probs.resize((size_t)L * K);
</file context>
Suggested change:

```diff
-const int K = (d->params.top_k > 0) ? d->params.top_k : ((d->params.budget > L) ? 8 : 1);
+const int K = std::max(1, std::min((int)n_vocab, (d->params.top_k > 0) ? d->params.top_k : ((d->params.budget > L) ? 8 : 1)));
```
From autoresearch.md (the hunk under review):

```markdown
Improve DFlash + DDTree decode throughput for the Qwen3.5-27B llama.cpp server port on Castle without sacrificing greedy bit-equal correctness in the e2e harness. The current user-visible TPS is far below the standalone DFlash baseline and below the target-only llama.cpp server reference.

## Metrics
- **Primary**: `tps` (tok/s, higher is better) — decode-only speculative throughput, computed as `generated_tokens / (steps * step_ms)` from the harness timing breakdown.
```
P2: The documented tps formula is missing ms→s conversion, so it describes tokens/ms while labeling tok/s.
<file context>
@@ -0,0 +1,62 @@
+Improve DFlash + DDTree decode throughput for the Qwen3.5-27B llama.cpp server port on Castle without sacrificing greedy bit-equal correctness in the e2e harness. The current user-visible TPS is far below the standalone DFlash baseline and below the target-only llama.cpp server reference.
+
+## Metrics
+- **Primary**: `tps` (tok/s, higher is better) — decode-only speculative throughput, computed as `generated_tokens / (steps * step_ms)` from the harness timing breakdown.
+- **Secondary**: `e2e_tps`, `step_ms`, `draft_ms`, `exact_decode_ms`, `pack_ms`, `topk_ms`, `acceptance`, `gen_tokens`, `spec_sec`.
+
</file context>
Suggested change:

```diff
-- **Primary**: `tps` (tok/s, higher is better) — decode-only speculative throughput, computed as `generated_tokens / (steps * step_ms)` from the harness timing breakdown.
+- **Primary**: `tps` (tok/s, higher is better) — decode-only speculative throughput, computed as `generated_tokens / (steps * step_ms / 1000.0)` from the harness timing breakdown.
```
From multi_prompt_probe.sh (the hunk under review):

```sh
  committed=$(echo "$out" | python3 -c "import sys,re; m=re.search(r'committed=(\d+)', sys.stdin.read()); print(m.group(1) if m else '0')")
  echo "result: prompt=$p tps=$tps same=$same diff=$diff steps=$steps committed=$committed"
done
echo "METRIC tps=0"
```
P2: The script always outputs a final METRIC tps=0, which can make benchmark consumers read a false zero TPS result.
<file context>
@@ -0,0 +1,17 @@
+ committed=$(echo "$out" | python3 -c "import sys,re; m=re.search(r'committed=(\d+)', sys.stdin.read()); print(m.group(1) if m else '0')")
+ echo "result: prompt=$p tps=$tps same=$same diff=$diff steps=$steps committed=$committed"
+done
+echo "METRIC tps=0"
</file context>
From autoresearch.sh (the hunk under review):

```sh
}

out_file=$(mktemp /tmp/autoresearch_ddtree.XXXXXX)
ssh "$REMOTE" "cd '$REMOTE_DIR' && \
```
P2: The SSH command string is vulnerable to shell-quoting breakage/injection when env-provided values contain single quotes. Build the remote invocation without interpolating raw values into a quoted shell string.
<file context>
@@ -0,0 +1,134 @@
+}
+
+out_file=$(mktemp /tmp/autoresearch_ddtree.XXXXXX)
+ssh "$REMOTE" "cd '$REMOTE_DIR' && \
+ LLAMA_DDTREE_PROFILE='$PROFILE' \
+ LLAMA_DDTREE_BLOCK_SIZE='$BLOCK_SIZE' \
</file context>
From src/llama-context.cpp (the hunk under review):

```cpp
ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors_from_buft(ctx, buft);
if (!buf) {
    LLAMA_LOG_ERROR("%s: failed to allocate persist buffer for layer %d (n_tokens=%lld)\n",
            __func__, il, (long long)n_tokens);
    dflash_persist_inter_l.clear();
    dflash_persist_conv_l.clear();
    dflash_persist_ctxs_bufs.clear();
    dflash_persist_failed_n_tokens = std::max(dflash_persist_failed_n_tokens, n_tokens);
    return;
```
P2: The new failure path leaks ggml_context when per-layer buffer allocation fails before ownership is moved into dflash_persist_ctxs_bufs.
<file context>
@@ -1205,24 +1196,20 @@ void llama_context::ensure_dflash_persist_capacity(int64_t n_tokens) {
- size_t total_bytes = 0;
- for (auto & [buft, ctx] : ctx_map) {
- ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors_from_buft(ctx.get(), buft);
+ ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors_from_buft(ctx, buft);
if (!buf) {
- LLAMA_LOG_ERROR("%s: failed to allocate persist buffers (n_tokens=%lld)\n",
</file context>
Suggested change:

```diff
 ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors_from_buft(ctx, buft);
 if (!buf) {
     LLAMA_LOG_ERROR("%s: failed to allocate persist buffer for layer %d (n_tokens=%lld)\n",
             __func__, il, (long long)n_tokens);
     dflash_persist_inter_l.clear();
     dflash_persist_conv_l.clear();
     dflash_persist_ctxs_bufs.clear();
     dflash_persist_failed_n_tokens = std::max(dflash_persist_failed_n_tokens, n_tokens);
+    ggml_free(ctx);
     return;
 }
```
Add DFlash draft top-K runtime
Summary
- `output.weight` with the DFlash draft model to avoid duplicate lm_head VRAM use
- `LLAMA_DDTREE_FAST_ROLLBACK=1` so default snapshot/replay mode can fit full draft GPU offload
- `LLAMA_DDTREE_FAST_BATCHED=1` or trace env vars

Verification
- `cmake --build build-cpu -j 8 --target test-speculative-tree-e2e llama-server`
- `cmake --build build-server -j 16 --target test-speculative-tree-e2e llama-server`
- `PASS: all 8 token positions are bit-equal`
- `max_tokens=8`: `snap=0.00 target_tree=0.00 posterior=0.00`

Notes
This PR is for tracking the current DFlash + DDTree implementation work in the fork. Fast batched verification remains opt-in because prior diagnostics showed batched/exact divergences on real prompts.
Summary by cubic
Adds DDTree speculative decoding with a DFlash draft model and server integration using chain-only exact validation. Also introduces `TQ3_0` KV cache for memory savings, chunked prefill FlashAttention for faster long-prompt prefill, and a tunable target feature window for the draft.

New Features
- `llama-server`: `--speculative-mode ddtree` with `-md <draft>`; chain-only verify by default with grammar-aware validation; single-slot only.
- `LLM_ARCH_DFLASH_DRAFT` loader and forward; hidden-state capture on the target; target feature injection to the draft; tunable target feature window for draft conditioning.
- `batch.parent_id`, KV compaction after accept (spine-aware), and SSM/conv rollback using per-layer persist buffers.
- `llama_set_capture_hidden`, `llama_get_hidden_capture_data`, `llama_seq_snapshot/restore/release`, `llama_kv_cache_seq_compact_tree`, `llama_set_target_feat_raw`.
- `TQ3_0` quantization with FWHT rotation and chunked prefill attention; use `-ctk tq3_0 -ctv tq3_0`.
- `llama-speculative-tree` (off by default) and unit/e2e tests for DDTree and draft paths (opt-in via CMake flags).

Migration
- `--speculative-mode ddtree -md <draft-gguf>`; optional `--ddtree-budget`, `--ddtree-temp`, `--ddtree-no-chain-seed`, and the target feature window flag; keep `--parallel 1`.
- `output.weight` and declare arch `dflash-draft`.

Written for commit b63fd0d.