ddtree: support dflash server cache and chain validation #1
Conversation
Add three ggml extension ops for DDTree-style speculative decoding verify:
- ggml_ssm_conv_tree (src[2] = parent_ids)
- ggml_gated_delta_net_tree (src[6] = parent_ids)
- ggml_gated_delta_net_tree_persist (src[7] = persist_inter)

Tree mode is signalled via src[] slots; no new op codes. The gated_delta_net result tensor is enlarged to pack per-token intermediate recurrent states so the spec-decode loop can roll back without a replay forward.

Referenced by standalone Luce DFlash (https://github.com/Luce-Org/luce-dflash).
Implements TQ3_0 quantization for the KV cache, achieving a 22% memory reduction vs Q4_0 (3.5 vs 4.5 bpv) with comparable decode speed.

Key components:
- block_tq3_0 struct (14 bytes/32 elements, group-quantized with FWHT rotation)
- CUDA quantize/dequantize kernels (cpy, set_rows, getrows)
- Flash attention vec kernel integration (vec_dot_KQ, dequantize_V)
- Chunked FA kernel for TQ3_0 prefill (fallback when MMA unavailable)
- FWHT rotation: K/V stored in rotated space, Q rotated at attention time
- Auto-routing to chunked kernel for TQ3_0 during prefill

Benchmarks on RTX 3090, Qwen3.5-27B at 262K context:
- KV memory: 3584 MiB (TQ3_0) vs 4608 MiB (Q4_0) — 22% savings
- Decode: ~40 tok/s (identical)
- Prefill: ~10% slower than Q4_0 (FWHT overhead)

Usage: llama-server -ctk tq3_0 -ctv tq3_0
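The headline 22% follows directly from the block geometry; a quick arithmetic check (not part of the patch):

```python
# TQ3_0 block geometry: 14 bytes store 32 elements
bits_per_value = 14 * 8 / 32        # -> 3.5 bpv, as stated
theoretical_ratio = 3.5 / 4.5       # vs Q4_0's 4.5 bpv
measured_savings = 1 - 3584 / 4608  # KV MiB from the RTX 3090 benchmark
```

Both the theoretical ratio and the measured MiB numbers land on the same ~22%.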
feat: add TQ3_0 (TurboQuant 3.5bpv) KV cache type
Add tree-mode batch path for DDTree-style speculative decode verify. Default (parent_id == NULL) keeps chain mode behavior unchanged.

Public API:
- llama_batch.parent_id (int32_t *, NULL = chain mode)
- llama_batch_init_tree() / free symmetrically handles parent_id

Internal:
- llama_ubatch propagates parent_id; asserts tree batches fit one ubatch
- llm_graph_input_tree builds parent_ids (i32) + tree_mask (f32 ancestor-only)
- llama_decode rejects tree batches on non-QWEN35 architectures
- delta-net dispatches to build_delta_net_tree when parent_ids is set
- qwen35 full-attn substitutes tree_mask for KQ_MASK; ssm_conv_tree replaces ssm_conv on linear-attn layers when parent_ids is present

Out of scope (Phase 2+): persist-buffer variant, snapshot/restore, M-RoPE 4-axis positions (UNKNOWN-3 in roadmap).
Add tests/test-qwen35-tree.cpp: loads Qwen3.5 GGUF, runs forward in either chain or tree mode, dumps F32 logits to a binary file (int32 n_tokens, int32 vocab_size, then n_tokens*vocab_size floats).

Tree fixture format: JSON with committed_offset and a list of nodes (flat_idx, token_id, parent_idx, depth). Phase 1 fixture is 5 nodes (root + chain + 1 sibling) to exercise both ancestor-walk and branching.

CMake target gated by -DLLAMA_BUILD_TESTS_QWEN35_TREE=ON (off by default; the test requires a 16+ GB GGUF not in CI). tests/.gitignore: re-include fixtures/ subtree.
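For consumers of the dump, the layout reads back with a few lines of Python (a sketch; the function name is ours, and little-endian is assumed to match the int32 LE convention used elsewhere in the harness):

```python
import struct

def read_logits_dump(path):
    # Header: int32 n_tokens, int32 vocab_size (assumed little-endian).
    # Payload: n_tokens * vocab_size float32 logits, row-major per token.
    with open(path, "rb") as f:
        n_tokens, vocab_size = struct.unpack("<ii", f.read(8))
        flat = struct.unpack("<%df" % (n_tokens * vocab_size),
                             f.read(4 * n_tokens * vocab_size))
    return [list(flat[i * vocab_size:(i + 1) * vocab_size]) for i in range(n_tokens)]
```

Under a --abs-tol 0 style comparison, two dumps match only if every float is bit-identical.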
split_equal increments n_used per token before calling ubatch_add, so the previous 'n_used == 0' guard fired on the first (and only) tree ubatch. Check 'all tokens emitted in this ubatch' instead, which is the actual semantic we want.
build_inp_tree allocates the mask as F32 (for clarity in set_input), but ggml_flash_attn_ext asserts mask->type == F16 when FA is enabled. Cast at construction time so all downstream consumers see the right type, mirroring how self_kq_mask_cnv is built.
Previous attempt allocated a separate tree_mask graph input and substituted self_kq_mask_cnv. That orphaned the F32 self_kq_mask, so gallocr left it without a backend buffer and set_input_kq_mask asserted on dst->buffer == NULL. Better: keep the standard kq_mask path intact and, when ubatch->parent_id is set, have llama_kv_cache::set_input_kq_mask write the ancestor-only pattern directly into dst. Phase 1 asserts n_stream==1 and n_kv==n_tokens (no past); these will relax in later phases. Drops: - llm_graph_input_tree::inp_tree_mask (no longer needed) - llm_graph_context::tree_mask field - qwen35.cpp self_kq_mask_cnv substitution
Tree mode allocates KV cells for the new batch tokens, but n_kv can be padded above n_tokens by the cache layer. Surplus columns are filled with -inf (treated as empty cells), so the assertion only needs n_kv to be at least n_tokens.
Public C API:
- llama_seq_snapshot / restore / release for recurrent state
- LLAMA_MEM_SNAPSHOT_INVALID sentinel
Internal:
- llama_memory_recurrent::{snapshot,restore,release} allocate per-layer
backup tensors on the same backend as the main cache and use
ggml_backend_tensor_copy to capture/restore SSM + conv state for one
seq_id. No-op when called on non-recurrent memory.
- llama_kv_cache::seq_compact_tree copies K/V rows from accepted DFS
slots to spine slots [0, commit_n) and truncates the cache logical
length. v_trans=true path is skipped (qwen3.5 uses v_trans=false).
- build_delta_net_tree gains an optional persist_inter parameter that
routes to ggml_gated_delta_net_tree_persist; existing callers pass
nullptr.
Punted (Phase 2.4 in roadmap): wiring per-layer F16 persist buffers
through llm_build_qwen35 — needs clearer decision on lifetime/ownership
in llama_model vs llama_context. Documented in build_delta_net_tree
header comment.
tests/test-qwen35-tree-rollback.cpp: prompt prefill, snapshot, decode K chain steps, restore, decode K chain steps with the same first token, dump pre/post logits. Driver expects bit-equality after restore; verify with scripts/compare_logits.py --abs-tol 0.

CMake target gated by -DLLAMA_BUILD_TESTS_QWEN35_TREE_ROLLBACK=ON (default OFF, not registered with ctest).

Test 2.B (vs test_dflash golden state) deferred — depends on test_dflash --dump-state-at-commit (Phase 0 prerequisite).
Only the last position has logits enabled (batch.logits[n-1]=1), so llama_get_logits_ith(ctx, 0) returned a null row and segfaulted. Read at n_prompt-1 instead.
Qwen3.5 hybrid uses llama_memory_hybrid which contains a llama_memory_recurrent component plus a KV cache. The C wrappers were dynamic_cast'ing to llama_memory_recurrent directly, which fails on hybrid models — so llama_seq_snapshot returned LLAMA_MEM_SNAPSHOT_INVALID even though the underlying recurrent state exists. Fall back to hyb->get_mem_recr() when the direct cast misses.
ggml_backend_tensor_copy dereferences src->buffer directly, not src->view_src->buffer, so copying through a no_alloc view tensor crashes with the buffer assertion. Use ggml_backend_tensor_get/set with explicit row offsets instead — they correctly resolve view_src and just bounce a single row through a host buffer per layer.
Tensor data alone isn't enough — after restore, the recurrent memory still tracks the post-run-1 position, so M-RoPE rejects the next batch because X (=23) >= Y (=16). Capture cell_id, cells[cell_id].pos, cells[cell_id].src, and cells[seq_id].tail at snapshot time and write them back in restore.
New arch LLM_ARCH_DFLASH_DRAFT (5-layer non-causal Qwen3 variant):
- llama-arch: register arch + 3 new tensor names (fc, hidden_norm, out_norm)
- llama-hparams: dflash_target_capture_layers / target_n_embd / mask_token_id / block_size fields
- llama-model: GGUF metadata parsing under "dflash_draft.*" keys, tensor loader, model fields for dflash_fc / dflash_hidden_norm
- src/models/dflash-draft.cpp: 5-layer forward — Q from noise, K/V from fc(target_feat) || noise concat, NEOX rope theta=10M, FA mask=null (non-causal), GQA 32:8, SwiGLU FFN, shared target lm_head via out_norm + lora_mm

Hidden capture for target qwen35 (opt-in):
- llama-context: capture_hidden flag + llama_set_capture_hidden / llama_get_hidden_capture C API
- llama-graph: t_hidden_capture result tensor
- qwen35.cpp: when capture_hidden, allocate [n_embd, 5*n_tokens] buffer and ggml_cpy post-FFN residual at the 5 capture layers; zero overhead when capture_hidden=false (no graph nodes added)

Token embed lookup helper:
- llama_model_token_embd_lookup: F32 / F16 (via fp16_to_fp32_row); -1 for out-of-range or unsupported quant types

Punted: persist-buffer wiring through llm_build_qwen35 (Phase 2.4 still stands; will be revisited with Phase 4's DDTree driver). Mask token embed source for the host driver is exposed via the API above; actual spec-decode wiring lives in Phase 4.
tests/test-dflash-draft.cpp — loads target+draft, looks up [last_tok, MASK*15] from target token_embd, reads target_feat binary, runs draft forward, dumps 16 x vocab logits.

tests/test-qwen35-chain-capture.cpp — Mode A: capture enabled, asserts shape/NaN/nonzero on the capture buffer; Mode B: capture disabled, chain re-decode, asserts logits bit-equal to a previously captured baseline.

tests/fixtures/ddtree/dflash_draft_metadata_smoke.json — expected metadata fields for the converted draft GGUF (used by scripts/check_dflash_draft_gguf.py at the super-repo level).

CMake target gated by -DLLAMA_BUILD_TESTS_DFLASH_DRAFT=ON (default OFF, not registered with ctest). Test 3.B (vs dflash dump-draft-output) deferred — depends on test_dflash --dump-draft-output (Phase 0 prerequisite).
- chain-capture: llama_kv_self_clear was removed; switch to llama_memory_clear(llama_get_memory(ctx), true).
- dflash-draft: llama_model_token_embd_lookup is single-token; loop over the 16-token noise batch instead of passing an array.
qwen35.cpp allocates the capture buffer as [n_embd, 5*n_tokens] (slots stacked along ne[1]). The test was asserting the transposed [5*n_embd, n_tokens] form. Accept both — capture data is the same either way; downstream consumers can pick whichever stride they prefer.
ggml_set_input on the capture buffer was wrong — the buffer is written by GPU ggml_cpy ops during forward and read by host code afterward, which is the output pattern. Drop set_input from qwen35.cpp; have llm_graph_result::set_outputs register t_hidden_capture so the backend syncs the writes back to host memory. Also reset the field in llm_graph_result::reset. Without this fix the host saw an all-zero buffer because the GPU writes never propagated to the input-side host pointer.
qwen35.cpp: ggml_set_input on hidden_cap_buf is required for ggml-alloc to give the leaf tensor a backend (OUTPUT-only path leaves backend_id = -1 and crashes gallocr). Both flags coexist: INPUT for backend assignment, OUTPUT for post-forward host sync. llama_model_token_embd_lookup: handle quantized tok_embd via the type traits' to_float dequantize path. Required for the dflash draft test which feeds noise embeddings looked up from a Q4_K_M target.
Hidden capture: replaced pre-allocated INPUT buffer + cpy-to-view pattern (which silently wrote zeros because gallocr put the dst on CPU-pinned host while src lived on device) with the cleaner approach: collect per-layer hidden states into cap_slots[5] using ggml_cont, then ggml_concat them along dim 1 into [n_embd, 5*n_tokens] AFTER the layer loop, register as t_hidden_capture (OUTPUT). The concat node is a regular compute output — gallocr schedules it on the device backend and the sched syncs it to host automatically (same pattern as t_logits / t_embd). dflash-draft FATTN: ggml_flash_attn_ext result tensor must have name prefix "__fattn__-<il>" or sched_reserve asserts. Add the cb() call that build_attn_mha makes; minimal mirror of the existing helper.
common/speculative-tree.{h,cpp}: pure CPU algorithms moved out of
test_dflash for reuse. Includes:
- build_ddtree (best-first heap with optional chain seed; budget
counts total nodes including root)
- follow_verified_tree (greedy walk along target argmax)
- build_tree_visibility (ancestor-only mask from parent_idx)
- extract_top_k_logprobs (online logsumexp + min-heap top-K, descending)
tests/test-speculative-tree: 4 standalone test cases, no model
required. Registered with ctest via llama_build_and_test.
Validates: a small tree build, a greedy-walk acceptance trace,
a hand-computed visibility mask, and top-K extract numerics.
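The ancestor-only rule is easy to state in a few lines of Python (an illustrative re-implementation of what build_tree_visibility computes from parent_idx, not the C++ source):

```python
def build_tree_visibility(parent_idx):
    """Ancestor-only mask: node i may attend to node j iff j == i or j is an
    ancestor of i in the tree (parent_idx[root] == -1)."""
    n = len(parent_idx)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:          # walk up to the root, marking each ancestor
            mask[i][j] = True
            j = parent_idx[j]
    return mask

# Root + chain of two + one sibling, like the Phase 1 fixture shape:
#   0 -> 1 -> 2
#   0 -> 3
vis = build_tree_visibility([-1, 0, 1, 0])
```

Node 2 sees {0, 1, 2} but not its uncle 3; node 3 sees only {0, 3}.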
llama_get_hidden_capture returns a ggml_tensor* whose data field is a device pointer when the model is offloaded — dereferencing it from host code segfaults. Mirror the logits/embd pattern: keep an std::vector<float> in llama_context, populate it via ggml_backend_tensor_get_async during decode synchronization, and expose it via a new llama_get_hidden_capture_data accessor that returns a CPU pointer plus ne0/ne1 dims. Update the chain-capture test to use the new accessor.
Encoder-side copy was wired but the chain-capture test goes through llama_context::decode(); duplicate the same get_async block there.
target_feat injection (Phase 3.4 gap):
- llm_graph_input_target_feat (new) wires three dflash-draft inputs:
target_feat_raw, pos_q, pos_k. set_input copies pending host data
via memcpy.
- llama_context::set_target_feat_raw stashes data + dims; the next
llama_decode on a draft context picks them up.
- public C API: llama_set_target_feat_raw + llama_kv_cache_seq_compact_tree.
- dflash-draft.cpp builds these inputs through the new helper instead
of standalone ggml_set_input nodes.
Phase 4 spec-decode driver:
- common/speculative-tree-driver.{h,cpp} implements the 10-step DDTree
loop: noise embed lookup → target_feat repack from capture →
draft forward → top-K → build_ddtree → tree-verify → posterior →
follow_verified_tree → KV compact → SSM rollback via snapshot+replay.
The snapshot+replay path is slower than the reference's per-layer
capture but matches Phase 2 invariants. Documented in source.
- examples/speculative-tree/main.cpp CLI wrapper, gated by
LLAMA_BUILD_EXAMPLES_SPECULATIVE_TREE=ON (default OFF).
tests/test-speculative-tree-e2e.cpp: runs the same prompt through chain-mode greedy decode (reference) and through the spec-decode driver (via llama_speculative_tree_driver_step), dumps both token streams as int32 LE binaries.

Acceptance: with --temp 0 the first min(chain_n, spec_n) tokens must be bit-identical (DDTree is lossless for greedy sampling).

CMake target gated by -DLLAMA_BUILD_TESTS_SPECULATIVE_TREE_E2E=ON (default OFF, not registered with ctest). Test 4.B (vs test_dflash golden trajectory) deferred — chain-vs-spec comparison gives strong functional verification independently.
Bug: the driver assumed llama_get_hidden_capture_data returned cumulative hidden states across decodes, but the capture buffer only holds the most recent decode's n_tokens columns. Castle test 4.A diverged at token 1 because the driver was slicing 24 cols from a buffer of just 1.

Fix: the driver now owns a [5*n_embd, target_feat_cap=8192] linear ring indexed by committed_pos. After each target decode (prompt prefill, prompt-style chain steps, or tree-verify) the new tokens' hidden states are copied from the per-decode capture into the ring. On spec-step start, the driver slices the most recent ctx_len = min(committed, 2048) columns and feeds them via llama_set_target_feat_raw. Tree-verify ingest uses accepted_dfs[] so only accepted nodes' hidden states enter the ring (matches the dflash reference semantics).

New public API:
- llama_speculative_tree_driver_ingest_prompt_capture(d, n_prompt) — call after the chain-mode prompt prefill that populated capture, before the first spec step.

example/main.cpp and tests/test-speculative-tree-e2e.cpp both call the new API at the right spot.

Limitation: 8192-column linear buffer (no rotation); aborts on overflow. True ring rotation is a Phase 5 follow-up.
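The ingest/slice accounting described above can be sketched as follows (hypothetical class name; strings stand in for the [5*n_embd] hidden-state columns the real driver stores):

```python
class CaptureRing:
    """Linear (non-rotating) buffer of per-token hidden columns, indexed by
    committed position. Asserts on overflow, mirroring the stated limitation."""
    def __init__(self, cap=8192):
        self.cols = []
        self.cap = cap

    def ingest(self, decode_capture, accepted_idx=None):
        # Tree-verify passes the accepted DFS indices so only accepted
        # nodes' hidden states enter the ring; chain decodes pass everything.
        new = decode_capture if accepted_idx is None else [decode_capture[i] for i in accepted_idx]
        assert len(self.cols) + len(new) <= self.cap, "linear buffer overflow"
        self.cols.extend(new)

    def window(self, max_ctx=2048):
        # The most recent min(committed, max_ctx) columns fed to the draft.
        return self.cols[-min(len(self.cols), max_ctx):]
```

This is accounting only; the real driver moves float columns with memcpy-style copies.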
Phase 1 wrote the tree mask assuming KV slot j == ubatch token j and n_past = 0. After prompt prefill, that's wrong: tree tokens land at slots [n_past, n_past+n_tree) and prompt cells stay at [0, n_past). Each tree query needs to see ALL prompt cells plus its tree ancestors. Switch to the full kv-cells walk: identify the prompt/tree boundary via min(ubatch->pos), then for each KV cell decide visibility based on whether its position is past (always visible) or matches one of the query's ancestor positions in the current tree. Mirrors the dflash reference (test_dflash.cpp:423-442).
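The corrected rule works on positions rather than slot indices; a hedged sketch (hypothetical helper, not the set_input_kq_mask source):

```python
def kq_mask_row(cell_pos, n_past, ancestor_pos):
    # One tree query's mask row over the KV cells: a cell is visible iff it is
    # a prompt/past cell (pos < n_past) or its position is one of the query's
    # tree-ancestor positions (the ancestor set includes the query itself).
    neg_inf = float("-inf")
    return [0.0 if (p < n_past or p in ancestor_pos) else neg_inf for p in cell_pos]
```

Surplus padded cells fall through to -inf, consistent with the earlier padding fix.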
Driver returns [root_echo, draft_accepts..., bonus] where bonus is the next step's root_token and is NOT in KV. Test was treating the whole result as committed (advancing committed_pos by result.size() and appending all to output), so each step double-counted the bonus from the previous step. Result: spec output had a duplicate token at every step boundary, diverging from the chain reference. Mirror the example's correct accounting: out += result[:-1], root_token = result.back(), committed_pos += result.size() - 1.
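The corrected per-step accounting, in Python terms (illustrative; names mirror the description above):

```python
def commit_step(result, out, committed_pos):
    """result = [root_echo, draft_accepts..., bonus]; the bonus token becomes
    the next step's root and is NOT yet in KV, so it must not be committed."""
    out.extend(result[:-1])            # commit everything except the bonus
    root_token = result[-1]            # bonus seeds the next step
    committed_pos += len(result) - 1
    return root_token, committed_pos
```

Committing the full result instead double-counts the bonus at every step boundary, which is exactly the divergence the test hit.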
Result: {"status":"keep","tps":7.73072,"e2e_tps":1.251564,"spec_sec":25.568,"gen_tokens":32,"steps":17,"committed":33,"step_ms":243.49,"pack_ms":4.49,"draft_ms":171.03,"topk_ms":19.51,"exact_ms":48.43,"exact_decode_ms":0,"acceptance":1.941}
Result: {"status":"keep","tps":8.388756,"e2e_tps":1.262128,"spec_sec":25.354,"gen_tokens":32,"steps":17,"committed":33,"step_ms":224.39,"pack_ms":2.24,"draft_ms":154.39,"topk_ms":19.77,"exact_ms":47.96,"exact_decode_ms":0,"acceptance":1.941}
Result: {"status":"keep","tps":8.751467,"e2e_tps":1.277649,"spec_sec":25.046,"gen_tokens":32,"steps":17,"committed":33,"step_ms":215.09,"pack_ms":1.09,"draft_ms":144.76,"topk_ms":20.79,"exact_ms":48.41,"exact_decode_ms":0,"acceptance":1.941}
Result: {"status":"keep","tps":9.112863,"e2e_tps":1.2838,"spec_sec":24.926,"gen_tokens":32,"steps":17,"committed":33,"step_ms":206.56,"pack_ms":1.06,"draft_ms":137.65,"topk_ms":19.8,"exact_ms":48.01,"exact_decode_ms":0,"acceptance":1.941}
Result: {"status":"keep","tps":9.820771,"e2e_tps":1.295075,"spec_sec":24.709,"gen_tokens":32,"steps":16,"committed":33,"step_ms":203.65,"pack_ms":1.08,"draft_ms":134.43,"topk_ms":17.84,"exact_ms":50.27,"exact_decode_ms":0,"acceptance":2.062}
Result: {"status":"keep","tps":10.147906,"e2e_tps":1.29749,"spec_sec":24.663,"gen_tokens":32,"steps":14,"committed":33,"step_ms":225.24,"pack_ms":1.12,"draft_ms":145.35,"topk_ms":20.04,"exact_ms":58.69,"exact_decode_ms":0,"acceptance":2.357}
Result: {"status":"keep","tps":10.443251,"e2e_tps":1.302932,"spec_sec":24.56,"gen_tokens":32,"steps":14,"committed":33,"step_ms":218.87,"pack_ms":1.11,"draft_ms":140.22,"topk_ms":19.49,"exact_ms":58.02,"exact_decode_ms":0,"acceptance":2.357}
Result: {"status":"keep","tps":10.507099,"e2e_tps":1.308686,"spec_sec":24.452,"gen_tokens":32,"steps":14,"committed":33,"step_ms":217.54,"pack_ms":1.11,"draft_ms":138.33,"topk_ms":19.95,"exact_ms":58.11,"exact_decode_ms":0,"acceptance":2.357}
Result: {"status":"keep","tps":11.924636,"e2e_tps":1.328573,"spec_sec":24.086,"gen_tokens":32,"steps":14,"committed":33,"step_ms":191.68,"pack_ms":1.1,"draft_ms":114.78,"topk_ms":17.91,"exact_ms":57.86,"exact_decode_ms":0,"acceptance":2.357}
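The harness fields are self-consistent; taking the first result line above, tps and acceptance can be recomputed from the breakdown (a quick check using the ms-to-seconds-corrected formula):

```python
import json

r = json.loads('{"tps":7.73072,"gen_tokens":32,"steps":17,'
               '"step_ms":243.49,"committed":33,"acceptance":1.941}')

tps = r["gen_tokens"] / (r["steps"] * r["step_ms"] / 1000.0)  # tok/s of spec decode
acceptance = r["committed"] / r["steps"]                      # tokens committed per step
```

The same committed/steps relation reproduces the acceptance values on every run above (33/17 = 1.941, 33/16 = 2.062, 33/14 = 2.357).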
…exact-validation path on real_rendered_prompt
Result: {"status":"keep","tps":9.411626}
Result: {"status":"keep","tps":10.195434,"e2e_tps":1.306336,"spec_sec":24.496,"gen_tokens":32,"steps":14,"committed":33,"step_ms":224.19,"pack_ms":1.16,"draft_ms":143.91,"topk_ms":20.75,"exact_ms":58.32,"acceptance":2.357}
Result: {"status":"keep","tps":10.147005,"acceptance":2.357,"committed":33,"draft_ms":145.55,"e2e_tps":1.302932,"exact_ms":58.27,"gen_tokens":32,"pack_ms":0.94,"spec_sec":24.56,"step_ms":225.26,"steps":14,"topk_ms":20.47}
…oes not regress baseline
Result: {"status":"keep","tps":9.892726,"acceptance":2.429,"committed":34,"draft_ms":147.88,"e2e_tps":0.951446,"exact_ms":61.21,"gen_tokens":32,"pack_ms":1.11,"spec_sec":33.633,"step_ms":231.05,"steps":14,"topk_ms":20.82}
… speedup vs budget 32
Result: {"status":"keep","tps":11.808554,"acceptance":2.267,"committed":34,"draft_ms":106.3,"e2e_tps":1.32714,"exact_ms":55.5,"gen_tokens":32,"pack_ms":1.06,"spec_sec":24.112,"step_ms":180.66,"steps":15,"topk_ms":17.77}
Result: {"status":"keep","tps":10.364171,"acceptance":2.357,"committed":33,"draft_ms":142.29,"e2e_tps":1.309329,"exact_ms":57.78,"gen_tokens":32,"pack_ms":1.1,"spec_sec":24.44,"step_ms":220.54,"steps":14,"topk_ms":19.33}
… work, new best TPS
Result: {"status":"keep","tps":10.254438,"acceptance":2.357,"committed":33,"draft_ms":144.06,"e2e_tps":1.306389,"exact_ms":57.84,"gen_tokens":32,"pack_ms":1.11,"spec_sec":24.495,"step_ms":222.9,"steps":14,"topk_ms":19.86}
…performs chain seed with K=4
Result: {"status":"keep","tps":10.334649,"acceptance":2.357,"committed":33,"draft_ms":142.02,"e2e_tps":1.29907,"exact_ms":57.89,"gen_tokens":32,"pack_ms":1.09,"spec_sec":24.633,"step_ms":221.17,"steps":14,"topk_ms":20.13}
Result: {"status":"keep","tps":10.589855,"e2e_tps":1.303038,"spec_sec":24.558,"gen_tokens":32,"steps":14,"committed":33,"step_ms":215.84,"pack_ms":1.14,"draft_ms":137.23,"topk_ms":19.59,"exact_ms":57.84,"acceptance":2.357}
…topk cost
Result: {"status":"keep","tps":10.666453,"e2e_tps":1.305377,"spec_sec":24.514,"gen_tokens":32,"steps":14,"committed":33,"step_ms":214.29,"pack_ms":1.11,"draft_ms":140.9,"topk_ms":14.71,"exact_ms":57.53,"acceptance":2.357}
5 issues found across 10 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="autoresearch.md">
<violation number="1" location="autoresearch.md:7">
P2: The documented `tps` formula is missing ms→s conversion, so it describes tokens/ms while labeling tok/s.</violation>
</file>
<file name="multi_prompt_probe.sh">
<violation number="1" location="multi_prompt_probe.sh:17">
P2: The script always outputs a final `METRIC tps=0`, which can make benchmark consumers read a false zero TPS result.</violation>
</file>
<file name="autoresearch.sh">
<violation number="1" location="autoresearch.sh:48">
P2: The SSH command string is vulnerable to shell-quoting breakage/injection when env-provided values contain single quotes. Build the remote invocation without interpolating raw values into a quoted shell string.</violation>
</file>
<file name="src/llama-context.cpp">
<violation number="1" location="src/llama-context.cpp:1200">
P2: The new failure path leaks `ggml_context` when per-layer buffer allocation fails before ownership is moved into `dflash_persist_ctxs_bufs`.</violation>
</file>
<file name="common/speculative-tree-driver.cpp">
<violation number="1" location="common/speculative-tree-driver.cpp:555">
P1: Clamp `top_k` to the vocabulary size before using it; otherwise `extract_top_k_logprobs` can access past the heap bounds when `top_k > n_vocab`.</violation>
</file>
From common/speculative-tree-driver.cpp (the hunk under review):

```cpp
// Read draft logits: [block_size, n_vocab].
// Skip position 0 (root slot fixed to root_token) and use positions 1..block_size-1.
const int L = (int)block_size - 1; // draft positions with meaningful predictions
const int K = (d->params.top_k > 0) ? d->params.top_k : ((d->params.budget > L) ? 8 : 1);
```
P1: Clamp top_k to the vocabulary size before using it; otherwise extract_top_k_logprobs can access past the heap bounds when top_k > n_vocab.
<file context>
@@ -550,7 +552,7 @@ std::vector<llama_token> llama_speculative_tree_driver_step(
// Skip position 0 (root slot fixed to root_token) and use positions 1..block_size-1.
const int L = (int)block_size - 1; // draft positions with meaningful predictions
- const int K = (d->params.budget > L) ? 8 : 1;
+ const int K = (d->params.top_k > 0) ? d->params.top_k : ((d->params.budget > L) ? 8 : 1);
d->top_log_probs.resize((size_t)L * K);
</file context>
Suggested change:

```diff
-const int K = (d->params.top_k > 0) ? d->params.top_k : ((d->params.budget > L) ? 8 : 1);
+const int K = std::max(1, std::min((int)n_vocab, (d->params.top_k > 0) ? d->params.top_k : ((d->params.budget > L) ? 8 : 1)));
```
From autoresearch.md (the hunk under review):

```markdown
Improve DFlash + DDTree decode throughput for the Qwen3.5-27B llama.cpp server port on Castle without sacrificing greedy bit-equal correctness in the e2e harness. The current user-visible TPS is far below the standalone DFlash baseline and below the target-only llama.cpp server reference.

## Metrics
- **Primary**: `tps` (tok/s, higher is better) — decode-only speculative throughput, computed as `generated_tokens / (steps * step_ms)` from the harness timing breakdown.
```
P2: The documented tps formula is missing ms→s conversion, so it describes tokens/ms while labeling tok/s.
<file context>
@@ -0,0 +1,62 @@
+Improve DFlash + DDTree decode throughput for the Qwen3.5-27B llama.cpp server port on Castle without sacrificing greedy bit-equal correctness in the e2e harness. The current user-visible TPS is far below the standalone DFlash baseline and below the target-only llama.cpp server reference.
+
+## Metrics
+- **Primary**: `tps` (tok/s, higher is better) — decode-only speculative throughput, computed as `generated_tokens / (steps * step_ms)` from the harness timing breakdown.
+- **Secondary**: `e2e_tps`, `step_ms`, `draft_ms`, `exact_decode_ms`, `pack_ms`, `topk_ms`, `acceptance`, `gen_tokens`, `spec_sec`.
+
</file context>
Suggested change:

```diff
-- **Primary**: `tps` (tok/s, higher is better) — decode-only speculative throughput, computed as `generated_tokens / (steps * step_ms)` from the harness timing breakdown.
+- **Primary**: `tps` (tok/s, higher is better) — decode-only speculative throughput, computed as `generated_tokens / (steps * step_ms / 1000.0)` from the harness timing breakdown.
```
From multi_prompt_probe.sh (the hunk under review):

```sh
  committed=$(echo "$out" | python3 -c "import sys,re; m=re.search(r'committed=(\d+)', sys.stdin.read()); print(m.group(1) if m else '0')")
  echo "result: prompt=$p tps=$tps same=$same diff=$diff steps=$steps committed=$committed"
done
echo "METRIC tps=0"
```
P2: The script always outputs a final METRIC tps=0, which can make benchmark consumers read a false zero TPS result.
<file context>
@@ -0,0 +1,17 @@
+ committed=$(echo "$out" | python3 -c "import sys,re; m=re.search(r'committed=(\d+)', sys.stdin.read()); print(m.group(1) if m else '0')")
+ echo "result: prompt=$p tps=$tps same=$same diff=$diff steps=$steps committed=$committed"
+done
+echo "METRIC tps=0"
</file context>
From autoresearch.sh (the hunk under review):

```sh
}

out_file=$(mktemp /tmp/autoresearch_ddtree.XXXXXX)
ssh "$REMOTE" "cd '$REMOTE_DIR' && \
```
P2: The SSH command string is vulnerable to shell-quoting breakage/injection when env-provided values contain single quotes. Build the remote invocation without interpolating raw values into a quoted shell string.
<file context>
@@ -0,0 +1,134 @@
+}
+
+out_file=$(mktemp /tmp/autoresearch_ddtree.XXXXXX)
+ssh "$REMOTE" "cd '$REMOTE_DIR' && \
+ LLAMA_DDTREE_PROFILE='$PROFILE' \
+ LLAMA_DDTREE_BLOCK_SIZE='$BLOCK_SIZE' \
</file context>
From src/llama-context.cpp (the hunk under review):

```cpp
ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors_from_buft(ctx, buft);
if (!buf) {
    LLAMA_LOG_ERROR("%s: failed to allocate persist buffer for layer %d (n_tokens=%lld)\n",
            __func__, il, (long long)n_tokens);
    dflash_persist_inter_l.clear();
    dflash_persist_conv_l.clear();
    dflash_persist_ctxs_bufs.clear();
    dflash_persist_failed_n_tokens = std::max(dflash_persist_failed_n_tokens, n_tokens);
    return;
```
P2: The new failure path leaks ggml_context when per-layer buffer allocation fails before ownership is moved into dflash_persist_ctxs_bufs.
<file context>
@@ -1205,24 +1196,20 @@ void llama_context::ensure_dflash_persist_capacity(int64_t n_tokens) {
- size_t total_bytes = 0;
- for (auto & [buft, ctx] : ctx_map) {
- ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors_from_buft(ctx.get(), buft);
+ ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors_from_buft(ctx, buft);
if (!buf) {
- LLAMA_LOG_ERROR("%s: failed to allocate persist buffers (n_tokens=%lld)\n",
</file context>
Suggested change:

```diff
 ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors_from_buft(ctx, buft);
 if (!buf) {
     LLAMA_LOG_ERROR("%s: failed to allocate persist buffer for layer %d (n_tokens=%lld)\n",
             __func__, il, (long long)n_tokens);
     dflash_persist_inter_l.clear();
     dflash_persist_conv_l.clear();
     dflash_persist_ctxs_bufs.clear();
     dflash_persist_failed_n_tokens = std::max(dflash_persist_failed_n_tokens, n_tokens);
+    ggml_free(ctx);
     return;
 }
```
Add DFlash draft top-K runtime
Summary
- `output.weight` with the DFlash draft model to avoid duplicate lm_head VRAM use
- `LLAMA_DDTREE_FAST_ROLLBACK=1` so default snapshot/replay mode can fit full draft GPU offload
- `LLAMA_DDTREE_FAST_BATCHED=1` or trace env vars

Verification
- `cmake --build build-cpu -j 8 --target test-speculative-tree-e2e llama-server`
- `cmake --build build-server -j 16 --target test-speculative-tree-e2e llama-server`
- `PASS: all 8 token positions are bit-equal`
- `max_tokens=8`: `snap=0.00 target_tree=0.00 posterior=0.00`

Notes
This PR is for tracking the current DFlash + DDTree implementation work in the fork. Fast batched verification remains opt-in because prior diagnostics showed batched/exact divergences on real prompts.
Summary by cubic
Adds DDTree speculative decoding with a DFlash draft model and server integration using chain-only exact validation. Also introduces `TQ3_0` KV cache for memory savings, chunked prefill FlashAttention for faster long-prompt prefill, and a tunable target feature window for the draft.

New Features
- `llama-server`: `--speculative-mode ddtree` with `-md <draft>`; chain-only verify by default with grammar-aware validation; single-slot only.
- `LLM_ARCH_DFLASH_DRAFT` loader and forward; hidden-state capture on the target; target feature injection to the draft; tunable target feature window for draft conditioning.
- `batch.parent_id`, KV compaction after accept (spine-aware), and SSM/conv rollback using per-layer persist buffers.
- `llama_set_capture_hidden`, `llama_get_hidden_capture_data`, `llama_seq_snapshot/restore/release`, `llama_kv_cache_seq_compact_tree`, `llama_set_target_feat_raw`.
- `TQ3_0` quantization with FWHT rotation and chunked prefill attention; use `-ctk tq3_0 -ctv tq3_0`.
- `llama-speculative-tree` (off by default) and unit/e2e tests for DDTree and draft paths (opt-in via CMake flags).

Migration
- `--speculative-mode ddtree -md <draft-gguf>`; optional `--ddtree-budget`, `--ddtree-temp`, `--ddtree-no-chain-seed`, and the target feature window flag; keep `--parallel 1`.
- `output.weight` and declare arch `dflash-draft`.

Written for commit b63fd0d.