[Track A / PR-2] llama: castle DFlash drafter + DDTree spec-decode (full feature stack) by Leechael · Pull Request #5 · Leechael/llama.cpp-dflash-ggml

Leechael · 2026-05-05T17:57:32Z

Tracking: #3 (Phase 1, Track A, llama layer)

Stack

This is PR-2, stacked on PR-1 (#4):

PR-1: track-a/ggml → base/luce-org-tq3
PR-2 (this): track-a/llama → track-a/ggml

When reviewing/testing, use track-a/llama as the working tree — it includes both layers.

Scope

All llama-level changes for DFlash + DDTree.

Major areas

batch — parent_id field + llama_batch_init_tree + tree-mode ubatch propagation
graph — build_inp_tree, ancestor mask in kq_mask, read_only_tree graph reuse
kv-cache — llama_kv_cache_seq_compact_tree
memory-recurrent — snapshot/restore/release API (replaceable by upstream ggml-org#19493 checkpoint)
context — DFlash persist buffers, rollback, recurrent_tail_pos, capture_hidden, dflash_draft_top_k, target_feat injection
model — LLM_ARCH_DFLASH_DRAFT load path, dflash_target_capture_layers hparams
src/models/dflash-draft.cpp — 5-layer draft graph (3 modes: full / fuse_only / kv_update_only); shared lm_head with target
src/models/qwen35.cpp — hidden-state capture for 5 layers + tree-mode dispatch
common/speculative-tree.{h,cpp} — DDTree builder + verifier walk + visibility mask
common/speculative-tree-driver.{h,cpp} — spec-decode coordinator: chain validate, fast batched, fast rollback, AR fallback, snapshot replay
common/speculative-draft-backend.{h,cpp} — draft backend abstraction
common/sampling — common_sampler_grammar_token_valid for grammar-aware verify
tools/server/server-context.cpp — DDTree slot lifecycle + grammar-aware verify_cbs
examples/speculative-tree/main.cpp — standalone CLI
tests — test-speculative-tree*.cpp, test-qwen35-*.cpp, test-dflash-draft.cpp unit + acceptance suites
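
The graph item above mentions an ancestor mask in kq_mask driven by the new parent_id field. As a rough illustration of that idea (not the PR's code), the sketch below derives a tree visibility mask from parent indices. The function name `build_ancestor_mask`, the 0/1 byte layout, and the requirement that parents precede their children in the batch are all assumptions made for this example; a real attention mask would also cover the already-committed KV prefix and would typically store 0 / -inf values instead of bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, for illustration only (not an API from this PR).
// parent_id[i] is the batch index of node i's parent, or -1 for a root.
// Returns a row-major n x n mask where mask[i*n + j] == 1 means node i may
// attend to node j, i.e. j is i itself or one of i's ancestors.
std::vector<uint8_t> build_ancestor_mask(const std::vector<int32_t> & parent_id) {
    const size_t n = parent_id.size();
    std::vector<uint8_t> mask(n * n, 0);
    for (size_t i = 0; i < n; ++i) {
        const int32_t p = parent_id[i];
        if (p >= 0) {
            // Assumption: nodes are ordered so that a parent always comes first.
            assert((size_t) p < i && "parents must precede their children");
            // A node sees everything its parent sees ...
            for (size_t j = 0; j < n; ++j) {
                mask[i * n + j] = mask[(size_t) p * n + j];
            }
        }
        // ... plus itself.
        mask[i * n + i] = 1;
    }
    return mask;
}
```

For a tree with `parent_id = {-1, 0, 0, 1}`, node 3 ends up seeing nodes 0, 1 and 3 but not its uncle, node 2, which is what lets sibling branches share one forward pass without contaminating each other.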

Upstream overlap reference

| Area | Upstream PR | Action |
| --- | --- | --- |
| `dflash-draft.cpp` model + capture | ggml-org#22105 (open) | Track A keeps castle impl; Track B will replace with ggml-org#22105 |
| Recurrent snapshot/restore | ggml-org#19493 (merged) | drop/migrate eventually |
| `LLAMA_DDTREE_*` env knobs | none | fork-only |
| DDTree builder + driver | none | fork-only / wait unified spec API |

Stat

55 files changed, 10857 insertions, 113 deletions.

Status

Not for merge into luce-org master. This is a fork-side organizational record of the castle stack as of cleanup/no-research-artifacts (= spike/dflash-verifier-fastpath minus research artifacts).

Summary by cubic

Add DDTree speculative decoding backed by a DFlash draft model to enable multi-token verification per target forward. This introduces a new draft-model arch, tree-mode decode path, server support, a CLI example, and tests.
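
To make "multi-token verification per target forward" concrete, here is a minimal sketch of a greedy verifier walk over a draft tree. It is an illustration under stated assumptions, not the PR's verifier: the `draft_node` struct, the `verify_walk` name, and the convention that node 0 is the root (the last committed token) and that children always have larger indices than their parents are hypothetical. `target_next[i]` stands for the token the target model selects after consuming node i, for example the argmax of its logits for that row.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical node layout, for illustration only.
struct draft_node {
    int32_t token;   // drafted token stored at this node
    int32_t parent;  // index of the parent node, -1 for the root
};

// Walks from the root and, at each step, follows the child whose drafted token
// equals the target model's choice at the current node. Returns the accepted
// node indices, root first. Assumes children have larger indices than parents.
std::vector<int32_t> verify_walk(const std::vector<draft_node> & nodes,
                                 const std::vector<int32_t>    & target_next) {
    std::vector<int32_t> accepted;
    if (nodes.empty() || nodes.size() != target_next.size()) {
        return accepted;
    }
    int32_t cur = 0;
    accepted.push_back(cur);
    for (;;) {
        int32_t next = -1;
        // Only look at later nodes, so the walk always moves forward and terminates.
        for (size_t i = (size_t) cur + 1; i < nodes.size(); ++i) {
            if (nodes[i].parent == cur && nodes[i].token == target_next[(size_t) cur]) {
                next = (int32_t) i;
                break;
            }
        }
        if (next < 0) {
            break; // the target disagrees with every drafted child
        }
        accepted.push_back(next);
        cur = next;
    }
    return accepted;
}
```

After the walk, `target_next[accepted.back()]` is the target's own next token, so under these assumptions each verification pass yields at least one new token even when no drafted child is accepted.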

New Features
- Tree-mode decoding: llama_batch.parent_id, llama_batch_init_tree, ancestor mask in kq_mask, and read-only graph reuse.
- New draft model: LLM_ARCH_DFLASH_DRAFT (5-layer), target hidden-state injection, shared lm_head, configurable draft top-K.
- Qwen3.5 capture: 5-layer hidden capture, context APIs to enable/read capture, and grammar-aware verify via common_sampler_grammar_token_valid.
- KV/recurrent: seq_compact_tree for accepted spines; snapshot/restore/release API plus DFlash persist rollback fast path.
- Spec driver: DDTree builder/visibility mask, draft backend, and a step coordinator that supports chain-validate, fast batched verify, rollback, and AR fallback.
- Integration: server --speculative-mode ddtree with draft model support; examples/speculative-tree CLI; unit/acceptance tests.

Migration
- Server: pass --speculative-mode ddtree and -md <draft.gguf>; optional --ddtree-budget, --ddtree-temp, --ddtree-no-chain-seed.
- API additions (opt‑in): llama_batch.parent_id (tree only), llama_mem_snapshot_id with llama_seq_snapshot/restore/release, and context helpers for hidden capture and draft top-K.
- Constraints (current phase): single slot (--parallel 1), Qwen3.5 target with draft pairing, draft top-K auto unless overridden.
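
The summary mentions a DDTree builder, a draft top-K, and a budget option, but it does not describe the construction itself. The sketch below shows one plausible way to grow a draft tree under a node budget: a best-first expansion over per-depth top-K candidates. Everything here is an assumption for illustration (the `draft_cand`, `tree_node` and `frontier_item` types, the `build_tree` name, candidates that depend only on depth and are sorted by descending log-probability, and a budget that counts drafted nodes excluding the root); the PR's actual builder may work differently.

```cpp
#include <cstddef>
#include <cstdint>
#include <queue>
#include <vector>

// Hypothetical types, for illustration only.
struct draft_cand { int32_t token; float logp; };
struct tree_node  { int32_t token; int32_t parent; int32_t depth; float score; };

struct frontier_item {
    float   score;   // cumulative log-probability of the path ending here
    int32_t parent;  // index of the parent node in the tree
    int32_t depth;   // depth of the candidate node (root has depth 0)
    int32_t rank;    // position within the top-K list for this depth
    bool operator<(const frontier_item & o) const { return score < o.score; }
};

// cands[d] holds the top-K candidates for depth d + 1, best first.
std::vector<tree_node> build_tree(int32_t root_token,
                                  const std::vector<std::vector<draft_cand>> & cands,
                                  size_t budget) {
    std::vector<tree_node> nodes;
    nodes.push_back({root_token, -1, 0, 0.0f});
    std::priority_queue<frontier_item> pq; // max-heap on cumulative score
    if (!cands.empty() && !cands[0].empty()) {
        pq.push({cands[0][0].logp, 0, 1, 0});
    }
    while (!pq.empty() && nodes.size() < budget + 1) {
        const frontier_item it = pq.top();
        pq.pop();
        const std::vector<draft_cand> & level = cands[(size_t) it.depth - 1];
        const int32_t idx = (int32_t) nodes.size();
        nodes.push_back({level[(size_t) it.rank].token, it.parent, it.depth, it.score});
        // Next-best sibling under the same parent.
        if ((size_t) it.rank + 1 < level.size()) {
            const float base = nodes[(size_t) it.parent].score;
            pq.push({base + level[(size_t) it.rank + 1].logp, it.parent, it.depth, it.rank + 1});
        }
        // Best child of the node that was just added.
        if ((size_t) it.depth < cands.size() && !cands[(size_t) it.depth].empty()) {
            pq.push({it.score + cands[(size_t) it.depth][0].logp, idx, it.depth + 1, 0});
        }
    }
    return nodes;
}
```

A side effect of this expansion order is that every parent is emitted before its children, which is the ordering a parent-index based tree batch would naturally want.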

Written for commit 819cc8b. Summary will update on new commits.

All llama-level changes for DFlash + DDTree, on top of luce-org ggml stack. This is the entire castle implementation: DFlash drafter, DDTree builder & verifier, persist-rollback recurrent cache, server slot integration, tests, plus the LLAMA_DDTREE_* fast-path env knobs.

Major areas:
- batch: parent_id field + llama_batch_init_tree + tree-mode ubatch propagation
- graph: build_inp_tree + ancestor mask in kq_mask + read_only_tree reuse
- kv-cache: llama_kv_cache_seq_compact_tree
- memory-recurrent: snapshot/restore/release API (note: replaceable by upstream ggml-org#19493 checkpoint mechanism)
- context: dflash persist buffers + rollback + recurrent_tail_pos + capture_hidden + dflash_draft_top_k + target_feat injection
- model: LLM_ARCH_DFLASH_DRAFT load path + capture_layers hparams
- src/models/dflash-draft.cpp: 5-layer draft graph (3 modes: full / fuse_only / kv_update_only), shared lm_head with target
- src/models/qwen35.cpp: hidden-state capture for 5 layers + tree-mode dispatch
- common/speculative-tree.{h,cpp}: ddtree builder + verifier walk + visibility mask
- common/speculative-tree-driver.{h,cpp}: spec-decode coordinator with chain validate, fast batched, fast rollback, AR fallback, snapshot replay paths
- common/speculative-draft-backend.{h,cpp}: draft backend abstraction
- common/sampling: common_sampler_grammar_token_valid for grammar-aware verify
- tools/server/server-context.cpp: DDTree slot lifecycle + grammar-aware verify_cbs
- examples/speculative-tree/main.cpp: standalone CLI
- tests/test-speculative-tree*.cpp + test-qwen35-*.cpp + test-dflash-draft.cpp: unit + acceptance suites

Track A: production castle stack. Stacked on track-a/ggml.

Tracking: #3 (Phase 1, Track A, llama layer)
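
The kv-cache item names llama_kv_cache_seq_compact_tree without spelling out what compaction involves. The sketch below illustrates the general idea on a plain vector standing in for one sequence's cache cells: after verification, entries for rejected tree nodes are dropped and the accepted spine is packed contiguously. The function `compact_accepted_spine`, the flat vector representation, and the argument conventions are hypothetical; the real function operates on llama.cpp's KV cache and its signature is not shown in this PR description.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one sequence's cache, for illustration only.
// cells[base ...] holds one entry per tree node in batch order; `accepted`
// lists accepted node offsets relative to `base`, in ascending order.
// Packs the accepted entries contiguously starting at `base`, drops the rest,
// and returns the new cache length.
size_t compact_accepted_spine(std::vector<int32_t> & cells,
                              size_t base,
                              const std::vector<int32_t> & accepted) {
    if (base > cells.size()) {
        return cells.size(); // nothing to compact
    }
    size_t dst = base;
    for (const int32_t off : accepted) {
        if (off < 0) {
            break;
        }
        const size_t src = base + (size_t) off;
        // Stop on out-of-range or non-ascending offsets instead of corrupting data.
        if (src >= cells.size() || src < dst) {
            break;
        }
        if (src != dst) {
            cells[dst] = cells[src];
        }
        ++dst;
    }
    // Everything past the packed spine belonged to rejected branches.
    cells.erase(cells.begin() + (std::ptrdiff_t) dst, cells.end());
    return dst;
}
```

With three committed cells followed by five tree nodes and accepted offsets `{0, 2, 4}`, the cache shrinks from eight entries to six. A real implementation would also need to reassign positions so that accepted nodes sit at consecutive sequence positions, which this sketch leaves out.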

Leechael mentioned this pull request May 5, 2026

[Track A / PR-1] ggml: castle DDTree extensions on top of luce-org TQ3+tree-kernels #4

Draft

github-actions Bot added the documentation, testing, examples, python, server, and model labels on May 5, 2026

Leechael mentioned this pull request May 5, 2026

[meta] Branch & upstream status (DFlash + DDTree) #3

Open

5 tasks


Leechael wants to merge 1 commit into track-a/ggml from track-a/llama
