
ddtree: support dflash server cache and chain validation#1

Open
Leechael wants to merge 87 commits into master from codex/dflash-ddtree-server-cache-chainonly

Conversation

@Leechael (Owner) commented Apr 29, 2026

Summary

  • add/finish DDTree + DFlash server integration paths for the llama.cpp fork
  • share target output.weight with the DFlash draft model to avoid duplicate lm_head VRAM use
  • gate DFlash persist-buffer allocation behind LLAMA_DDTREE_FAST_ROLLBACK=1 so default snapshot/replay mode can fit full draft GPU offload
  • restore prompt cache/checkpoint support for DDTree by re-decoding the last 2048 prompt tokens to rebuild the DFlash target-feature ring
  • add a DDTree rebuild-boundary checkpoint so repeated long prompts restore close to the ring rebuild boundary
  • make default correctness mode use chain-only exact validation; keep batched tree verify behind LLAMA_DDTREE_FAST_BATCHED=1 or trace env vars
  • add profiling/logging and E2E coverage for the DDTree server path

Verification

  • local build: cmake --build build-cpu -j 8 --target test-speculative-tree-e2e llama-server
  • Castle CUDA build: cmake --build build-server -j 16 --target test-speculative-tree-e2e llama-server
  • Castle E2E: PASS: all 8 token positions are bit-equal
  • Castle server smoke with a 20,410-token rendered prompt, max_tokens=8:
    • request 1: 22.76s
    • request 2: 4.86s
    • snap=0.00 target_tree=0.00 posterior=0.00

Notes

This PR tracks the current DFlash + DDTree implementation work in the fork. Fast batched verification remains opt-in because prior diagnostics showed batched/exact divergences on real prompts.


Summary by cubic

Adds DDTree speculative decoding with a DFlash draft model and server integration using chain-only exact validation. Also introduces TQ3_0 KV cache for memory savings, chunked prefill FlashAttention for faster long‑prompt prefill, and a tunable target feature window for the draft.

  • New Features

    • DDTree in llama-server: --speculative-mode ddtree with -md <draft>; chain-only verify by default with grammar-aware validation; single-slot only.
    • Draft model support: new LLM_ARCH_DFLASH_DRAFT loader and forward; hidden-state capture on the target; target feature injection to the draft; tunable target feature window for draft conditioning.
    • Tree verify path: tree-mode batches via batch.parent_id, KV compaction after accept (spine-aware), and SSM/conv rollback using per-layer persist buffers.
    • APIs: llama_set_capture_hidden, llama_get_hidden_capture_data, llama_seq_snapshot/restore/release, llama_kv_cache_seq_compact_tree, llama_set_target_feat_raw (usage sketch after this list).
    • KV cache: new TQ3_0 quantization with FWHT rotation and chunked prefill attention; use -ctk tq3_0 -ctv tq3_0.
    • Tooling: example llama-speculative-tree (off by default) and unit/e2e tests for DDTree and draft paths (opt-in via CMake flags).
    • Stats: split timing for exact validation in server traces.
  • Migration

    • To use DDTree in the server: run with --speculative-mode ddtree -md <draft-gguf>; optional --ddtree-budget, --ddtree-temp, --ddtree-no-chain-seed, and the target feature window flag; keep --parallel 1.
    • Draft GGUF must contain output.weight and declare arch dflash-draft.
    • Existing chain mode and non-DDTree paths are unchanged.
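
As a usage sketch of the capture/injection APIs listed above: the exact signatures below are assumptions inferred from the descriptions in this PR (a CPU pointer-plus-dims accessor, a raw-feature setter picked up by the next decode), not copied from the fork's headers.

```cpp
#include "llama.h"

// Hedged sketch: capture target hidden states, then feed them to the draft.
// Signatures of the three new calls are assumed, not verbatim.
static void feed_draft_from_capture(llama_context * ctx_target,
                                    llama_context * ctx_draft,
                                    llama_batch     batch) {
    llama_set_capture_hidden(ctx_target, true);  // opt in before the forward
    llama_decode(ctx_target, batch);             // forward writes the 5-layer capture

    int64_t ne0 = 0, ne1 = 0;                    // capture shape: [n_embd, 5*n_tokens]
    const float * cap = llama_get_hidden_capture_data(ctx_target, &ne0, &ne1);

    // stash features for the draft; the next llama_decode on the draft
    // context picks them up via the target_feat graph input
    llama_set_target_feat_raw(ctx_draft, cap, ne0, ne1);
}
```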

Written for commit b63fd0d. Summary will update on new commits.

davide and others added 30 commits April 17, 2026 11:33
Add three ggml extension ops for DDTree-style speculative decoding verify:
  - ggml_ssm_conv_tree        (src[2] = parent_ids)
  - ggml_gated_delta_net_tree (src[6] = parent_ids)
  - ggml_gated_delta_net_tree_persist (src[7] = persist_inter)

Tree mode is signalled via src[] slots; no new op codes. The gated_delta_net
result tensor is enlarged to pack per-token intermediate recurrent states so
the spec-decode loop can roll back without a replay forward.

Referenced by standalone Luce DFlash (https://github.com/Luce-Org/luce-dflash).
Implements TQ3_0 quantization for KV cache, achieving 22% memory
reduction vs Q4_0 (3.5 vs 4.5 bpv) with comparable decode speed.

Key components:
- block_tq3_0 struct (14 bytes/32 elements, group-quantized with FWHT rotation)
- CUDA quantize/dequantize kernels (cpy, set_rows, getrows)
- Flash attention vec kernel integration (vec_dot_KQ, dequantize_V)
- Chunked FA kernel for TQ3_0 prefill (fallback when MMA unavailable)
- FWHT rotation: K/V stored in rotated space, Q rotated at attention time
- Auto-routing to chunked kernel for TQ3_0 during prefill

Benchmarks on RTX 3090, Qwen3.5-27B at 262K context:
  KV memory: 3584 MiB (TQ3_0) vs 4608 MiB (Q4_0) — 22% savings
  Decode:    ~40 tok/s (identical)
  Prefill:   ~10% slower than Q4_0 (FWHT overhead)

Usage: llama-server -ctk tq3_0 -ctv tq3_0
feat: add TQ3_0 (TurboQuant 3.5bpv) KV cache type
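
The 3.5 bpv figure follows from the block layout: 14 bytes × 8 bits / 32 elements = 3.5 bits per value. The FWHT rotation itself is small; below is a minimal CPU sketch of the transform for orientation (the actual CUDA kernels apply it per quantization group, and folding the 1/sqrt(n) normalization into the group scale is an assumption here, not a detail confirmed by the commit).

```cpp
#include <cstddef>

// In-place fast Walsh-Hadamard transform over one group of n floats
// (n must be a power of two). Butterfly structure, O(n log n) adds.
static void fwht_inplace(float * x, size_t n) {
    for (size_t h = 1; h < n; h <<= 1) {
        for (size_t i = 0; i < n; i += h << 1) {
            for (size_t j = i; j < i + h; ++j) {
                const float a = x[j];
                const float b = x[j + h];
                x[j]     = a + b;  // sum lane
                x[j + h] = a - b;  // difference lane
            }
        }
    }
}
```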
Add tree-mode batch path for DDTree-style speculative decode verify.
Default (parent_id == NULL) keeps chain mode behavior unchanged.

Public API:
- llama_batch.parent_id (int32_t *, NULL = chain mode)
- llama_batch_init_tree() / free symmetrically handles parent_id

Internal:
- llama_ubatch propagates parent_id; asserts tree batches fit one ubatch
- llm_graph_input_tree builds parent_ids (i32) + tree_mask (f32 ancestor-only)
- llama_decode rejects tree batches on non-QWEN35 architectures
- delta-net dispatches to build_delta_net_tree when parent_ids is set
- qwen35 full-attn substitutes tree_mask for KQ_MASK; ssm_conv_tree replaces
  ssm_conv on linear-attn layers when parent_ids is present

Out of scope (Phase 2+): persist-buffer variant, snapshot/restore,
M-RoPE 4-axis positions (UNKNOWN-3 in roadmap).
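
To make the parent_id contract concrete, here is a hedged sketch of filling the Phase 1 five-node fixture shape (root + chain + one sibling). It assumes llama_batch_init_tree mirrors llama_batch_init's allocation and that -1 marks the root; both are assumptions, not the fork's documented convention.

```cpp
#include "llama.h"

// Tree: root -> a -> b -> c, plus one sibling of a branching off the root.
// parent_id indexes into this batch's flat (DFS) order.
static void fill_tree_batch(llama_batch & batch, llama_pos n_past,
                            const llama_token toks[5]) {
    const int32_t parents[5] = { -1, 0, 1, 2, 0 };
    const int32_t depths[5]  = {  0, 1, 2, 3, 1 };
    for (int i = 0; i < 5; ++i) {
        batch.token[i]     = toks[i];
        batch.pos[i]       = n_past + depths[i]; // position = committed + depth
        batch.parent_id[i] = parents[i];
        batch.n_seq_id[i]  = 1;
        batch.seq_id[i][0] = 0;
        batch.logits[i]    = 1;                  // need logits at every node
    }
    batch.n_tokens = 5;
}
```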
Add tests/test-qwen35-tree.cpp: loads Qwen3.5 GGUF, runs forward in
either chain or tree mode, dumps F32 logits to a binary file
(int32 n_tokens, int32 vocab_size, then n_tokens*vocab_size floats).

Tree fixture format: JSON with committed_offset and a list of nodes
(flat_idx, token_id, parent_idx, depth). Phase 1 fixture is 5 nodes
(root + chain + 1 sibling) to exercise both ancestor-walk and branching.

CMake target gated by -DLLAMA_BUILD_TESTS_QWEN35_TREE=ON (off by default;
test requires a 16+ GB GGUF not in CI).

tests/.gitignore: re-include fixtures/ subtree.
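
A hypothetical reader for that dump format (not part of the PR), handy when cross-checking dumps outside scripts/compare_logits.py:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Reads the dump format described above: int32 n_tokens, int32 vocab_size,
// then n_tokens * vocab_size F32 logits.
static std::vector<float> read_logits_dump(const char * path,
                                           int32_t & n_tokens, int32_t & n_vocab) {
    std::vector<float> logits;
    FILE * f = std::fopen(path, "rb");
    if (!f) return logits;
    if (std::fread(&n_tokens, sizeof n_tokens, 1, f) == 1 &&
        std::fread(&n_vocab,  sizeof n_vocab,  1, f) == 1) {
        logits.resize((size_t) n_tokens * n_vocab);
        if (std::fread(logits.data(), sizeof(float), logits.size(), f) != logits.size()) {
            logits.clear(); // truncated file
        }
    }
    std::fclose(f);
    return logits;
}
```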
split_equal increments n_used per token before calling ubatch_add, so
the previous 'n_used == 0' guard fired on the first (and only) tree
ubatch. Check 'all tokens emitted in this ubatch' instead, which is
the actual semantic we want.
build_inp_tree allocates the mask as F32 (for clarity in set_input),
but ggml_flash_attn_ext asserts mask->type == F16 when FA is enabled.
Cast at construction time so all downstream consumers see the right
type, mirroring how self_kq_mask_cnv is built.
Previous attempt allocated a separate tree_mask graph input and
substituted self_kq_mask_cnv. That orphaned the F32 self_kq_mask, so
gallocr left it without a backend buffer and set_input_kq_mask
asserted on dst->buffer == NULL.

Better: keep the standard kq_mask path intact and, when ubatch->parent_id
is set, have llama_kv_cache::set_input_kq_mask write the ancestor-only
pattern directly into dst. Phase 1 asserts n_stream==1 and
n_kv==n_tokens (no past); these will relax in later phases.

Drops:
- llm_graph_input_tree::inp_tree_mask (no longer needed)
- llm_graph_context::tree_mask field
- qwen35.cpp self_kq_mask_cnv substitution
Tree mode allocates KV cells for the new batch tokens, but n_kv can be
padded above n_tokens by the cache layer. Surplus columns are filled
with -inf (treated as empty cells), so the assertion only needs n_kv
to be at least n_tokens.
Public C API:
- llama_seq_snapshot / restore / release for recurrent state
- LLAMA_MEM_SNAPSHOT_INVALID sentinel

Internal:
- llama_memory_recurrent::{snapshot,restore,release} allocate per-layer
  backup tensors on the same backend as the main cache and use
  ggml_backend_tensor_copy to capture/restore SSM + conv state for one
  seq_id. No-op when called on non-recurrent memory.
- llama_kv_cache::seq_compact_tree copies K/V rows from accepted DFS
  slots to spine slots [0, commit_n) and truncates the cache logical
  length. v_trans=true path is skipped (qwen3.5 uses v_trans=false).
- build_delta_net_tree gains an optional persist_inter parameter that
  routes to ggml_gated_delta_net_tree_persist; existing callers pass
  nullptr.

Punted (Phase 2.4 in roadmap): wiring per-layer F16 persist buffers
through llm_build_qwen35 — needs clearer decision on lifetime/ownership
in llama_model vs llama_context. Documented in build_delta_net_tree
header comment.
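
A hedged sketch of how the snapshot API brackets one verify step. The names follow the API list above, but the handle type and exact signatures are assumptions:

```cpp
#include "llama.h"

// Capture recurrent (SSM + conv) state, decode speculatively, roll back.
static bool verify_with_rollback(llama_context * ctx, llama_seq_id seq,
                                 const llama_batch & tree_batch) {
    auto snap = llama_seq_snapshot(ctx, seq);        // per-layer backup tensors
    if (snap == LLAMA_MEM_SNAPSHOT_INVALID) {
        return false;                                // non-recurrent memory: no-op
    }
    llama_decode(ctx, tree_batch);                   // speculative tree verify
    // ... inspect logits, decide which nodes are accepted ...
    llama_seq_restore(ctx, seq, snap);               // roll recurrent state back
    llama_seq_release(ctx, seq, snap);               // free backup tensors
    return true;
}
```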
tests/test-qwen35-tree-rollback.cpp: prompt prefill, snapshot, decode
K chain steps, restore, decode K chain steps with the same first
token, dump pre/post logits. Driver expects bit-equality after restore;
verify with scripts/compare_logits.py --abs-tol 0.

CMake target gated by -DLLAMA_BUILD_TESTS_QWEN35_TREE_ROLLBACK=ON
(default OFF, not registered with ctest).

Test 2.B (vs test_dflash golden state) deferred — depends on
test_dflash --dump-state-at-commit (Phase 0 prerequisite).
Only the last position has logits enabled (batch.logits[n-1]=1), so
llama_get_logits_ith(ctx, 0) returned a null row and segfaulted.
Read at n_prompt-1 instead.
Qwen3.5 hybrid uses llama_memory_hybrid which contains a
llama_memory_recurrent component plus a KV cache. The C wrappers
were dynamic_cast'ing to llama_memory_recurrent directly, which
fails on hybrid models — so llama_seq_snapshot returned
LLAMA_MEM_SNAPSHOT_INVALID even though the underlying recurrent
state exists.

Fall back to hyb->get_mem_recr() when the direct cast misses.
ggml_backend_tensor_copy dereferences src->buffer directly, not
src->view_src->buffer, so copying through a no_alloc view tensor
crashes with the buffer assertion. Use ggml_backend_tensor_get/set
with explicit row offsets instead — they correctly resolve view_src
and just bounce a single row through a host buffer per layer.
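
The workaround in miniature, a sketch using the stock ggml-backend API (the real code does this per layer):

```cpp
#include <cstdint>
#include <vector>
#include "ggml-backend.h"

// Bounce one row through a host buffer. ggml_backend_tensor_get/set resolve
// view offsets internally, unlike ggml_backend_tensor_copy, which dereferences
// src->buffer directly and asserts on no_alloc view tensors.
static void copy_row_via_host(struct ggml_tensor * dst,
                              const struct ggml_tensor * src, int64_t row) {
    const size_t row_bytes = ggml_row_size(src->type, src->ne[0]);
    std::vector<uint8_t> host(row_bytes);
    ggml_backend_tensor_get(src, host.data(), (size_t) row * src->nb[1], row_bytes);
    ggml_backend_tensor_set(dst, host.data(), (size_t) row * dst->nb[1], row_bytes);
}
```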
Tensor data alone isn't enough — after restore, the recurrent memory
still tracks the post-run-1 position, so M-RoPE rejects the next
batch because X (=23) >= Y (=16). Capture cell_id, cells[cell_id].pos,
cells[cell_id].src, and cells[seq_id].tail at snapshot time and
write them back in restore.
New arch LLM_ARCH_DFLASH_DRAFT (5-layer non-causal Qwen3 variant):
- llama-arch: register arch + 3 new tensor names (fc, hidden_norm, out_norm)
- llama-hparams: dflash_target_capture_layers / target_n_embd /
  mask_token_id / block_size fields
- llama-model: GGUF metadata parsing under "dflash_draft.*" keys, tensor
  loader, model fields for dflash_fc / dflash_hidden_norm
- src/models/dflash-draft.cpp: 5-layer forward — Q from noise, K/V from
  fc(target_feat) || noise concat, NEOX rope theta=10M, FA mask=null
  (non-causal), GQA 32:8, SwiGLU FFN, shared target lm_head via
  out_norm + lora_mm

Hidden capture for target qwen35 (opt-in):
- llama-context: capture_hidden flag + llama_set_capture_hidden /
  llama_get_hidden_capture C API
- llama-graph: t_hidden_capture result tensor
- qwen35.cpp: when capture_hidden, allocate [n_embd, 5*n_tokens] buffer
  and ggml_cpy post-FFN residual at the 5 capture layers; zero overhead
  when capture_hidden=false (no graph nodes added)

Token embed lookup helper:
- llama_model_token_embd_lookup: F32 / F16 (via fp16_to_fp32_row);
  -1 for out-of-range or unsupported quant types

Punted: persist-buffer wiring through llm_build_qwen35 (Phase 2.4 still
stands; will be revisited with Phase 4's DDTree driver). Mask token
embed source for the host driver is exposed via the API above; actual
spec-decode wiring lives in Phase 4.
tests/test-dflash-draft.cpp — loads target+draft, looks up
[last_tok, MASK*15] from target token_embd, reads target_feat binary,
runs draft forward, dumps 16 x vocab logits.

tests/test-qwen35-chain-capture.cpp — Mode A: capture enabled, asserts
shape/NaN/nonzero on capture buffer; Mode B: capture disabled, chain
re-decode, asserts logits bit-equal to a previously captured baseline.

tests/fixtures/ddtree/dflash_draft_metadata_smoke.json — expected
metadata fields for the converted draft GGUF (used by
scripts/check_dflash_draft_gguf.py at the super-repo level).

CMake target gated by -DLLAMA_BUILD_TESTS_DFLASH_DRAFT=ON
(default OFF, not registered with ctest).

Test 3.B (vs dflash dump-draft-output) deferred — depends on
test_dflash --dump-draft-output (Phase 0 prerequisite).
- chain-capture: llama_kv_self_clear was removed; switch to
  llama_memory_clear(llama_get_memory(ctx), true).
- dflash-draft: llama_model_token_embd_lookup is single-token; loop
  over the 16-token noise batch instead of passing an array.
qwen35.cpp allocates the capture buffer as [n_embd, 5*n_tokens]
(slots stacked along ne[1]). The test was asserting the transposed
[5*n_embd, n_tokens] form. Accept both — capture data is the same
either way; downstream consumers can pick whichever stride they prefer.
ggml_set_input on the capture buffer was wrong — the buffer is written
by GPU ggml_cpy ops during forward and read by host code afterward,
which is the output pattern. Drop set_input from qwen35.cpp; have
llm_graph_result::set_outputs register t_hidden_capture so the
backend syncs the writes back to host memory. Also reset the field
in llm_graph_result::reset.

Without this fix the host saw an all-zero buffer because the GPU
writes never propagated to the input-side host pointer.
qwen35.cpp: ggml_set_input on hidden_cap_buf is required for ggml-alloc
to give the leaf tensor a backend (OUTPUT-only path leaves backend_id
= -1 and crashes gallocr). Both flags coexist: INPUT for backend
assignment, OUTPUT for post-forward host sync.

llama_model_token_embd_lookup: handle quantized tok_embd via the
type traits' to_float dequantize path. Required for the dflash draft
test which feeds noise embeddings looked up from a Q4_K_M target.
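
A sketch of that dequantize path in isolation, assuming the tensor data is host-accessible (the PR's actual helper is llama_model_token_embd_lookup):

```cpp
#include <vector>
#include "ggml.h"

// Expand one quantized row of tok_embd into F32 via the type traits' to_float.
static std::vector<float> dequant_embd_row(const struct ggml_tensor * tok_embd,
                                           int64_t token) {
    const int64_t n_embd = tok_embd->ne[0];
    std::vector<float> row(n_embd);
    const auto * traits = ggml_get_type_traits(tok_embd->type);
    const char * src = (const char *) tok_embd->data + token * tok_embd->nb[1];
    traits->to_float(src, row.data(), n_embd); // dequantize one row
    return row;
}
```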
Hidden capture: replaced pre-allocated INPUT buffer + cpy-to-view
pattern (which silently wrote zeros because gallocr put the dst on
CPU-pinned host while src lived on device) with the cleaner approach:
collect per-layer hidden states into cap_slots[5] using ggml_cont,
then ggml_concat them along dim 1 into [n_embd, 5*n_tokens] AFTER the
layer loop, register as t_hidden_capture (OUTPUT). The concat node is
a regular compute output — gallocr schedules it on the device backend
and the sched syncs it to host automatically (same pattern as
t_logits / t_embd).

dflash-draft FATTN: ggml_flash_attn_ext result tensor must have name
prefix "__fattn__-<il>" or sched_reserve asserts. Add the cb() call
that build_attn_mha makes; minimal mirror of the existing helper.
common/speculative-tree.{h,cpp}: pure CPU algorithms moved out of
test_dflash for reuse. Includes:
- build_ddtree (best-first heap with optional chain seed; budget
  counts total nodes including root)
- follow_verified_tree (greedy walk along target argmax)
- build_tree_visibility (ancestor-only mask from parent_idx)
- extract_top_k_logprobs (online logsumexp + min-heap top-K, descending)

tests/test-speculative-tree: 4 standalone test cases, no model
required. Registered with ctest via llama_build_and_test.
Validates: a small tree build, a greedy-walk acceptance trace,
a hand-computed visibility mask, and top-K extract numerics.
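
For reference, an illustrative version of the extract_top_k_logprobs idea (not the PR's exact code): one pass keeps a numerically stable online logsumexp while a size-K min-heap tracks the largest logits, and the result comes out as descending log-probabilities.

```cpp
#include <algorithm>
#include <cmath>
#include <queue>
#include <utility>
#include <vector>

static std::vector<std::pair<float, int>> top_k_logprobs(const float * logits, int n, int k) {
    float m = -INFINITY, s = 0.0f; // running max and rescaled sum for logsumexp
    std::priority_queue<std::pair<float, int>,
                        std::vector<std::pair<float, int>>,
                        std::greater<>> heap; // min-heap of (logit, token id)
    for (int i = 0; i < n; ++i) {
        const float x = logits[i];
        if (x > m) { s = s * std::exp(m - x) + 1.0f; m = x; } // rescale to new max
        else       { s += std::exp(x - m); }
        if ((int) heap.size() < k)          heap.emplace(x, i);
        else if (x > heap.top().first)      { heap.pop(); heap.emplace(x, i); }
    }
    const float lse = m + std::log(s);
    std::vector<std::pair<float, int>> out;
    while (!heap.empty()) { out.emplace_back(heap.top().first - lse, heap.top().second); heap.pop(); }
    std::reverse(out.begin(), out.end()); // descending log-probs
    return out;
}
```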
llama_get_hidden_capture returns a ggml_tensor* whose data field is a
device pointer when the model is offloaded — dereferencing it from
host code segfaults. Mirror the logits/embd pattern: keep an
std::vector<float> in llama_context, populate it via
ggml_backend_tensor_get_async during decode synchronization, and
expose it via a new llama_get_hidden_capture_data accessor that
returns a CPU pointer plus ne0/ne1 dims.

Update the chain-capture test to use the new accessor.
Encoder-side copy was wired but the chain-capture test goes through
llama_context::decode(); duplicate the same get_async block there.
target_feat injection (Phase 3.4 gap):
- llm_graph_input_target_feat (new) wires three dflash-draft inputs:
  target_feat_raw, pos_q, pos_k. set_input copies pending host data
  via memcpy.
- llama_context::set_target_feat_raw stashes data + dims; the next
  llama_decode on a draft context picks them up.
- public C API: llama_set_target_feat_raw + llama_kv_cache_seq_compact_tree.
- dflash-draft.cpp builds these inputs through the new helper instead
  of standalone ggml_set_input nodes.

Phase 4 spec-decode driver:
- common/speculative-tree-driver.{h,cpp} implements the 10-step DDTree
  loop: noise embed lookup → target_feat repack from capture →
  draft forward → top-K → build_ddtree → tree-verify → posterior →
  follow_verified_tree → KV compact → SSM rollback via snapshot+replay.
  The snapshot+replay path is slower than the reference's per-layer
  capture but matches Phase 2 invariants. Documented in source.
- examples/speculative-tree/main.cpp CLI wrapper, gated by
  LLAMA_BUILD_EXAMPLES_SPECULATIVE_TREE=ON (default OFF).
tests/test-speculative-tree-e2e.cpp: runs the same prompt through
chain-mode greedy decode (reference) and through the spec-decode driver
(via llama_speculative_tree_driver_step), dumps both token streams as
int32 LE binaries. Acceptance: with --temp 0 the first
min(chain_n, spec_n) tokens must be bit-identical (DDTree is lossless
for greedy sampling).

CMake target gated by -DLLAMA_BUILD_TESTS_SPECULATIVE_TREE_E2E=ON
(default OFF, not registered with ctest). Test 4.B (vs test_dflash
golden trajectory) deferred — chain-vs-spec comparison gives strong
functional verification independently.
Bug: the driver assumed llama_get_hidden_capture_data returned cumulative
hidden states across decodes, but the capture buffer only holds the most
recent decode's n_tokens columns. Castle test 4.A diverged at token 1
because the driver was slicing 24 cols from a buffer of just 1.

Fix: driver now owns a [5*n_embd, target_feat_cap=8192] linear ring
indexed by committed_pos. After each target decode (prompt prefill,
prompt-style chain steps, or tree-verify) the new tokens' hidden
states get copied from the per-decode capture into the ring. On
spec-step start, the driver slices the most recent ctx_len = min(
committed, 2048) columns and feeds them via llama_set_target_feat_raw.

Tree verify ingest uses accepted_dfs[] so only accepted nodes' hidden
states enter the ring (matches the dflash reference semantics).

New public API:
- llama_speculative_tree_driver_ingest_prompt_capture(d, n_prompt)
  Call after the chain-mode prompt prefill that populated capture,
  before the first spec step.

example/main.cpp and tests/test-speculative-tree-e2e.cpp both call
the new API at the right spot.

Limitation: 8192-column linear buffer (no rotation); aborts on
overflow. True ring rotation is Phase 5 follow-up.
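
A simplified version of that driver-owned buffer, to pin down the bookkeeping (dimensions and the 2048-column window follow the commit; everything else is illustrative):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// [feat_dim, cap] column-major feature buffer indexed by committed position.
// Linear for now: overflow aborts; true rotation is the Phase 5 follow-up.
struct target_feat_ring {
    int64_t feat_dim  = 0;     // 5 * n_embd
    int64_t cap       = 8192;  // column capacity
    int64_t committed = 0;     // columns filled so far
    std::vector<float> data;

    explicit target_feat_ring(int64_t fd) : feat_dim(fd), data((size_t)(fd * cap)) {}

    void ingest(const float * cols, int64_t n) {   // append n accepted columns
        assert(committed + n <= cap && "feature ring overflow (no rotation yet)");
        std::copy(cols, cols + n * feat_dim, data.begin() + committed * feat_dim);
        committed += n;
    }
    // Most recent min(committed, 2048) columns, fed via llama_set_target_feat_raw.
    const float * window(int64_t & ctx_len) const {
        ctx_len = std::min<int64_t>(committed, 2048);
        return data.data() + (committed - ctx_len) * feat_dim;
    }
};
```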
Phase 1 wrote the tree mask assuming KV slot j == ubatch token j and
n_past = 0. After prompt prefill, that's wrong: tree tokens land at
slots [n_past, n_past+n_tree) and prompt cells stay at [0, n_past).
Each tree query needs to see ALL prompt cells plus its tree ancestors.

Switch to the full kv-cells walk: identify the prompt/tree boundary
via min(ubatch->pos), then for each KV cell decide visibility based
on whether its position is past (always visible) or matches one of
the query's ancestor positions in the current tree. Mirrors the
dflash reference (test_dflash.cpp:423-442).
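
The visibility rule in isolation (illustrative; the real logic lives in set_input_kq_mask and works over kv-cells):

```cpp
#include <cstdint>
#include <vector>

// A tree query q attends to a KV cell iff the cell is committed prompt
// (pos < n_past) or holds one of q's ancestors (including q itself).
static bool cell_visible(int32_t cell_pos, int32_t n_past, int q,
                         const std::vector<int32_t> & node_pos, // pos per tree node
                         const std::vector<int32_t> & parent) { // -1 at root
    if (cell_pos < n_past) return true;            // prompt cells: always visible
    for (int a = q; a != -1; a = parent[a]) {      // walk q's ancestor chain
        if (node_pos[a] == cell_pos) return true;  // cell holds an ancestor
    }
    return false;
}
```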
Driver returns [root_echo, draft_accepts..., bonus] where bonus is the
next step's root_token and is NOT in KV. Test was treating the whole
result as committed (advancing committed_pos by result.size() and
appending all to output), so each step double-counted the bonus from
the previous step. Result: spec output had a duplicate token at every
step boundary, diverging from the chain reference.

Mirror the example's correct accounting: out += result[:-1], root_token
= result.back(), committed_pos += result.size() - 1.
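
The corrected accounting in code form (condensed and simplified; result = [root_echo, accepts..., bonus], with the bonus not yet in KV):

```cpp
#include <cstdint>
#include <vector>
#include "llama.h"

// Commit all but the trailing bonus token; the bonus seeds the next step.
static llama_token commit_step(const std::vector<llama_token> & result,
                               std::vector<llama_token> & out,
                               int64_t & committed_pos) {
    out.insert(out.end(), result.begin(), result.end() - 1); // all but bonus
    committed_pos += (int64_t) result.size() - 1;
    return result.back();                                    // next root_token
}
```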
Leechael added 19 commits April 30, 2026 04:35
Result: {"status":"keep","tps":7.73072,"e2e_tps":1.251564,"spec_sec":25.568,"gen_tokens":32,"steps":17,"committed":33,"step_ms":243.49,"pack_ms":4.49,"draft_ms":171.03,"topk_ms":19.51,"exact_ms":48.43,"exact_decode_ms":0,"acceptance":1.941}
Result: {"status":"keep","tps":8.388756,"e2e_tps":1.262128,"spec_sec":25.354,"gen_tokens":32,"steps":17,"committed":33,"step_ms":224.39,"pack_ms":2.24,"draft_ms":154.39,"topk_ms":19.77,"exact_ms":47.96,"exact_decode_ms":0,"acceptance":1.941}
Result: {"status":"keep","tps":8.751467,"e2e_tps":1.277649,"spec_sec":25.046,"gen_tokens":32,"steps":17,"committed":33,"step_ms":215.09,"pack_ms":1.09,"draft_ms":144.76,"topk_ms":20.79,"exact_ms":48.41,"exact_decode_ms":0,"acceptance":1.941}
Result: {"status":"keep","tps":9.112863,"e2e_tps":1.2838,"spec_sec":24.926,"gen_tokens":32,"steps":17,"committed":33,"step_ms":206.56,"pack_ms":1.06,"draft_ms":137.65,"topk_ms":19.8,"exact_ms":48.01,"exact_decode_ms":0,"acceptance":1.941}
Result: {"status":"keep","tps":9.820771,"e2e_tps":1.295075,"spec_sec":24.709,"gen_tokens":32,"steps":16,"committed":33,"step_ms":203.65,"pack_ms":1.08,"draft_ms":134.43,"topk_ms":17.84,"exact_ms":50.27,"exact_decode_ms":0,"acceptance":2.062}
Result: {"status":"keep","tps":10.147906,"e2e_tps":1.29749,"spec_sec":24.663,"gen_tokens":32,"steps":14,"committed":33,"step_ms":225.24,"pack_ms":1.12,"draft_ms":145.35,"topk_ms":20.04,"exact_ms":58.69,"exact_decode_ms":0,"acceptance":2.357}
Result: {"status":"keep","tps":10.443251,"e2e_tps":1.302932,"spec_sec":24.56,"gen_tokens":32,"steps":14,"committed":33,"step_ms":218.87,"pack_ms":1.11,"draft_ms":140.22,"topk_ms":19.49,"exact_ms":58.02,"exact_decode_ms":0,"acceptance":2.357}
Result: {"status":"keep","tps":10.507099,"e2e_tps":1.308686,"spec_sec":24.452,"gen_tokens":32,"steps":14,"committed":33,"step_ms":217.54,"pack_ms":1.11,"draft_ms":138.33,"topk_ms":19.95,"exact_ms":58.11,"exact_decode_ms":0,"acceptance":2.357}
Result: {"status":"keep","tps":11.924636,"e2e_tps":1.328573,"spec_sec":24.086,"gen_tokens":32,"steps":14,"committed":33,"step_ms":191.68,"pack_ms":1.1,"draft_ms":114.78,"topk_ms":17.91,"exact_ms":57.86,"exact_decode_ms":0,"acceptance":2.357}
…exact-validation path on real_rendered_prompt

Result: {"status":"keep","tps":9.411626}
Result: {"status":"keep","tps":10.195434,"e2e_tps":1.306336,"spec_sec":24.496,"gen_tokens":32,"steps":14,"committed":33,"step_ms":224.19,"pack_ms":1.16,"draft_ms":143.91,"topk_ms":20.75,"exact_ms":58.32,"acceptance":2.357}
Result: {"status":"keep","tps":10.147005,"acceptance":2.357,"committed":33,"draft_ms":145.55,"e2e_tps":1.302932,"exact_ms":58.27,"gen_tokens":32,"pack_ms":0.94,"spec_sec":24.56,"step_ms":225.26,"steps":14,"topk_ms":20.47}
…oes not regress baseline

Result: {"status":"keep","tps":9.892726,"acceptance":2.429,"committed":34,"draft_ms":147.88,"e2e_tps":0.951446,"exact_ms":61.21,"gen_tokens":32,"pack_ms":1.11,"spec_sec":33.633,"step_ms":231.05,"steps":14,"topk_ms":20.82}
… speedup vs budget 32

Result: {"status":"keep","tps":11.808554,"acceptance":2.267,"committed":34,"draft_ms":106.3,"e2e_tps":1.32714,"exact_ms":55.5,"gen_tokens":32,"pack_ms":1.06,"spec_sec":24.112,"step_ms":180.66,"steps":15,"topk_ms":17.77}
Result: {"status":"keep","tps":10.364171,"acceptance":2.357,"committed":33,"draft_ms":142.29,"e2e_tps":1.309329,"exact_ms":57.78,"gen_tokens":32,"pack_ms":1.1,"spec_sec":24.44,"step_ms":220.54,"steps":14,"topk_ms":19.33}
… work, new best TPS

Result: {"status":"keep","tps":10.254438,"acceptance":2.357,"committed":33,"draft_ms":144.06,"e2e_tps":1.306389,"exact_ms":57.84,"gen_tokens":32,"pack_ms":1.11,"spec_sec":24.495,"step_ms":222.9,"steps":14,"topk_ms":19.86}
…performs chain seed with K=4

Result: {"status":"keep","tps":10.334649,"acceptance":2.357,"committed":33,"draft_ms":142.02,"e2e_tps":1.29907,"exact_ms":57.89,"gen_tokens":32,"pack_ms":1.09,"spec_sec":24.633,"step_ms":221.17,"steps":14,"topk_ms":20.13}
Result: {"status":"keep","tps":10.589855,"e2e_tps":1.303038,"spec_sec":24.558,"gen_tokens":32,"steps":14,"committed":33,"step_ms":215.84,"pack_ms":1.14,"draft_ms":137.23,"topk_ms":19.59,"exact_ms":57.84,"acceptance":2.357}
…topk cost

Result: {"status":"keep","tps":10.666453,"e2e_tps":1.305377,"spec_sec":24.514,"gen_tokens":32,"steps":14,"committed":33,"step_ms":214.29,"pack_ms":1.11,"draft_ms":140.9,"topk_ms":14.71,"exact_ms":57.53,"acceptance":2.357}

@cubic-dev-ai (Bot) left a comment

5 issues found across 10 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="autoresearch.md">

<violation number="1" location="autoresearch.md:7">
P2: The documented `tps` formula is missing ms→s conversion, so it describes tokens/ms while labeling tok/s.</violation>
</file>

<file name="multi_prompt_probe.sh">

<violation number="1" location="multi_prompt_probe.sh:17">
P2: The script always outputs a final `METRIC tps=0`, which can make benchmark consumers read a false zero TPS result.</violation>
</file>

<file name="autoresearch.sh">

<violation number="1" location="autoresearch.sh:48">
P2: The SSH command string is vulnerable to shell-quoting breakage/injection when env-provided values contain single quotes. Build the remote invocation without interpolating raw values into a quoted shell string.</violation>
</file>

<file name="src/llama-context.cpp">

<violation number="1" location="src/llama-context.cpp:1200">
P2: The new failure path leaks `ggml_context` when per-layer buffer allocation fails before ownership is moved into `dflash_persist_ctxs_bufs`.</violation>
</file>

<file name="common/speculative-tree-driver.cpp">

<violation number="1" location="common/speculative-tree-driver.cpp:555">
P1: Clamp `top_k` to the vocabulary size before using it; otherwise `extract_top_k_logprobs` can access past the heap bounds when `top_k > n_vocab`.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread common/speculative-tree-driver.cpp Outdated
// Read draft logits: [block_size, n_vocab].
// Skip position 0 (root slot fixed to root_token) and use positions 1..block_size-1.
const int L = (int)block_size - 1; // draft positions with meaningful predictions
const int K = (d->params.top_k > 0) ? d->params.top_k : ((d->params.budget > L) ? 8 : 1);

P1: Clamp top_k to the vocabulary size before using it; otherwise extract_top_k_logprobs can access past the heap bounds when top_k > n_vocab.


<file context>
@@ -550,7 +552,7 @@ std::vector<llama_token> llama_speculative_tree_driver_step(
     // Skip position 0 (root slot fixed to root_token) and use positions 1..block_size-1.
     const int L = (int)block_size - 1; // draft positions with meaningful predictions
-    const int K = (d->params.budget > L) ? 8 : 1;
+    const int K = (d->params.top_k > 0) ? d->params.top_k : ((d->params.budget > L) ? 8 : 1);
 
     d->top_log_probs.resize((size_t)L * K);
</file context>
Suggested change:
- const int K = (d->params.top_k > 0) ? d->params.top_k : ((d->params.budget > L) ? 8 : 1);
+ const int K = std::max(1, std::min((int)n_vocab, (d->params.top_k > 0) ? d->params.top_k : ((d->params.budget > L) ? 8 : 1)));

Comment thread autoresearch.md
Improve DFlash + DDTree decode throughput for the Qwen3.5-27B llama.cpp server port on Castle without sacrificing greedy bit-equal correctness in the e2e harness. The current user-visible TPS is far below the standalone DFlash baseline and below the target-only llama.cpp server reference.

## Metrics
- **Primary**: `tps` (tok/s, higher is better) — decode-only speculative throughput, computed as `generated_tokens / (steps * step_ms)` from the harness timing breakdown.

P2: The documented tps formula is missing ms→s conversion, so it describes tokens/ms while labeling tok/s.


<file context>
@@ -0,0 +1,62 @@
+Improve DFlash + DDTree decode throughput for the Qwen3.5-27B llama.cpp server port on Castle without sacrificing greedy bit-equal correctness in the e2e harness. The current user-visible TPS is far below the standalone DFlash baseline and below the target-only llama.cpp server reference.
+
+## Metrics
+- **Primary**: `tps` (tok/s, higher is better) — decode-only speculative throughput, computed as `generated_tokens / (steps * step_ms)` from the harness timing breakdown.
+- **Secondary**: `e2e_tps`, `step_ms`, `draft_ms`, `exact_decode_ms`, `pack_ms`, `topk_ms`, `acceptance`, `gen_tokens`, `spec_sec`.
+
</file context>
Suggested change:
- - **Primary**: `tps` (tok/s, higher is better) — decode-only speculative throughput, computed as `generated_tokens / (steps * step_ms)` from the harness timing breakdown.
+ - **Primary**: `tps` (tok/s, higher is better) — decode-only speculative throughput, computed as `generated_tokens / (steps * step_ms / 1000.0)` from the harness timing breakdown.

Comment thread multi_prompt_probe.sh
committed=$(echo "$out" | python3 -c "import sys,re; m=re.search(r'committed=(\d+)', sys.stdin.read()); print(m.group(1) if m else '0')")
echo "result: prompt=$p tps=$tps same=$same diff=$diff steps=$steps committed=$committed"
done
echo "METRIC tps=0"

P2: The script always outputs a final METRIC tps=0, which can make benchmark consumers read a false zero TPS result.


<file context>
@@ -0,0 +1,17 @@
+    committed=$(echo "$out" | python3 -c "import sys,re; m=re.search(r'committed=(\d+)', sys.stdin.read()); print(m.group(1) if m else '0')")
+    echo "result: prompt=$p tps=$tps same=$same diff=$diff steps=$steps committed=$committed"
+done
+echo "METRIC tps=0"
</file context>

Comment thread autoresearch.sh
}

out_file=$(mktemp /tmp/autoresearch_ddtree.XXXXXX)
ssh "$REMOTE" "cd '$REMOTE_DIR' && \

P2: The SSH command string is vulnerable to shell-quoting breakage/injection when env-provided values contain single quotes. Build the remote invocation without interpolating raw values into a quoted shell string.


<file context>
@@ -0,0 +1,134 @@
+}
+
+out_file=$(mktemp /tmp/autoresearch_ddtree.XXXXXX)
+ssh "$REMOTE" "cd '$REMOTE_DIR' && \
+  LLAMA_DDTREE_PROFILE='$PROFILE' \
+  LLAMA_DDTREE_BLOCK_SIZE='$BLOCK_SIZE' \
</file context>

Comment thread src/llama-context.cpp
Comment on lines +1200 to +1208
ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors_from_buft(ctx, buft);
if (!buf) {
LLAMA_LOG_ERROR("%s: failed to allocate persist buffer for layer %d (n_tokens=%lld)\n",
__func__, il, (long long)n_tokens);
dflash_persist_inter_l.clear();
dflash_persist_conv_l.clear();
dflash_persist_ctxs_bufs.clear();
dflash_persist_failed_n_tokens = std::max(dflash_persist_failed_n_tokens, n_tokens);
return;

P2: The new failure path leaks ggml_context when per-layer buffer allocation fails before ownership is moved into dflash_persist_ctxs_bufs.


<file context>
@@ -1205,24 +1196,20 @@ void llama_context::ensure_dflash_persist_capacity(int64_t n_tokens) {
-    size_t total_bytes = 0;
-    for (auto & [buft, ctx] : ctx_map) {
-        ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors_from_buft(ctx.get(), buft);
+        ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors_from_buft(ctx, buft);
         if (!buf) {
-            LLAMA_LOG_ERROR("%s: failed to allocate persist buffers (n_tokens=%lld)\n",
</file context>
Suggested change (adds ggml_free(ctx) on the failure path):

    ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors_from_buft(ctx, buft);
    if (!buf) {
        LLAMA_LOG_ERROR("%s: failed to allocate persist buffer for layer %d (n_tokens=%lld)\n",
                        __func__, il, (long long)n_tokens);
        dflash_persist_inter_l.clear();
        dflash_persist_conv_l.clear();
        dflash_persist_ctxs_bufs.clear();
        dflash_persist_failed_n_tokens = std::max(dflash_persist_failed_n_tokens, n_tokens);
+       ggml_free(ctx);
        return;
    }
