[Track A / PR-2] llama: castle DFlash drafter + DDTree spec-decode (full feature stack)#5
Draft
Leechael wants to merge 1 commit into track-a/ggml from
Conversation
All llama-level changes for DFlash + DDTree, on top of luce-org ggml stack.
This is the entire castle implementation — DFlash drafter, DDTree builder &
verifier, persist-rollback recurrent cache, server slot integration, tests,
plus the LLAMA_DDTREE_* fast-path env knobs.
Major areas:
- batch: parent_id field + llama_batch_init_tree + tree-mode ubatch propagation
- graph: build_inp_tree + ancestor mask in kq_mask + read_only_tree reuse
- kv-cache: llama_kv_cache_seq_compact_tree
- memory-recurrent: snapshot/restore/release API
(note: replaceable by upstream ggml-org#19493 checkpoint mechanism)
- context: dflash persist buffers + rollback + recurrent_tail_pos +
capture_hidden + dflash_draft_top_k + target_feat injection
- model: LLM_ARCH_DFLASH_DRAFT load path + capture_layers hparams
- src/models/dflash-draft.cpp: 5-layer draft graph (3 modes: full / fuse_only /
kv_update_only), shared lm_head with target
- src/models/qwen35.cpp: hidden-state capture for 5 layers + tree-mode dispatch
- common/speculative-tree.{h,cpp}: ddtree builder + verifier walk + visibility mask
- common/speculative-tree-driver.{h,cpp}: spec-decode coordinator with chain validate,
fast batched, fast rollback, AR fallback, snapshot replay paths
- common/speculative-draft-backend.{h,cpp}: draft backend abstraction
- common/sampling: common_sampler_grammar_token_valid for grammar-aware verify
- tools/server/server-context.cpp: DDTree slot lifecycle + grammar-aware verify_cbs
- examples/speculative-tree/main.cpp: standalone CLI
- tests/test-speculative-tree*.cpp + test-qwen35-*.cpp + test-dflash-draft.cpp:
unit + acceptance suites
Track A: production castle stack. Stacked on track-a/ggml.
Tracking: #3 (Phase 1, Track A, llama layer)
Stack
This is PR-2, stacked on PR-1 (#4):
track-a/ggml → base/luce-org-tq3
track-a/llama → track-a/ggml

When reviewing/testing, use track-a/llama as the working tree — it includes both layers.

Scope
All llama-level changes for DFlash + DDTree.
See the "Major areas" list above.

Upstream overlap reference
- dflash-draft.cpp model + capture
- LLAMA_DDTREE_* env knobs

Stats
55 files changed, 10857 insertions, 113 deletions.
Status
Not for merge into luce-org master. This is a fork-side organizational record of the castle stack as of cleanup/no-research-artifacts (= spike/dflash-verifier-fastpath minus research artifacts).

Summary by cubic
Add DDTree speculative decoding backed by a DFlash draft model to enable multi-token verification per target forward. This introduces a new draft-model arch, tree-mode decode path, server support, a CLI example, and tests.
New Features
- llama_batch.parent_id, llama_batch_init_tree, ancestor mask in kq_mask, and read-only graph reuse.
- LLM_ARCH_DFLASH_DRAFT (5-layer), target hidden-state injection, shared lm_head, configurable draft top-K.
- common_sampler_grammar_token_valid.
- seq_compact_tree for accepted spines; snapshot/restore/release API plus DFlash persist rollback fast path.
- --speculative-mode ddtree with draft model support; examples/speculative-tree CLI; unit/acceptance tests.

Migration
- --speculative-mode ddtree and -md <draft.gguf>; optional --ddtree-budget, --ddtree-temp, --ddtree-no-chain-seed.
- llama_batch.parent_id (tree only), llama_mem_snapshot_id with llama_seq_snapshot/restore/release, and context helpers for hidden capture and draft top-K.
- (--parallel 1), Qwen3.5 target with draft pairing, draft top-K auto unless overridden.

Written for commit 819cc8b. Summary will update on new commits.