
[Track A / PR-2] llama: castle DFlash drafter + DDTree spec-decode (full feature stack) #5

Draft
Leechael wants to merge 1 commit into track-a/ggml from track-a/llama

Leechael commented May 5, 2026

Tracking: #3 (Phase 1, Track A, llama layer)

Stack

This is PR-2, stacked on PR-1 (#4):

  • PR-1: track-a/ggml → base/luce-org-tq3
  • PR-2 (this): track-a/llama → track-a/ggml

When reviewing/testing, use track-a/llama as the working tree — it includes both layers.

Scope

All llama-level changes for DFlash + DDTree.

Major areas

  • batch: parent_id field + llama_batch_init_tree + tree-mode ubatch propagation
  • graph: build_inp_tree, ancestor mask in kq_mask, read_only_tree graph reuse
  • kv-cache: llama_kv_cache_seq_compact_tree
  • memory-recurrent: snapshot/restore/release API (replaceable by upstream ggml-org#19493 checkpoint)
  • context: DFlash persist buffers, rollback, recurrent_tail_pos, capture_hidden, dflash_draft_top_k, target_feat injection
  • model: LLM_ARCH_DFLASH_DRAFT load path, dflash_target_capture_layers hparams
  • src/models/dflash-draft.cpp: 5-layer draft graph (3 modes: full / fuse_only / kv_update_only); shared lm_head with target
  • src/models/qwen35.cpp: hidden-state capture for 5 layers + tree-mode dispatch
  • common/speculative-tree.{h,cpp}: DDTree builder + verifier walk + visibility mask
  • common/speculative-tree-driver.{h,cpp}: spec-decode coordinator: chain validate, fast batched, fast rollback, AR fallback, snapshot replay
  • common/speculative-draft-backend.{h,cpp}: draft backend abstraction
  • common/sampling: common_sampler_grammar_token_valid for grammar-aware verify
  • tools/server/server-context.cpp: DDTree slot lifecycle + grammar-aware verify_cbs
  • examples/speculative-tree/main.cpp: standalone CLI
  • tests: test-speculative-tree*.cpp, test-qwen35-*.cpp, test-dflash-draft.cpp unit + acceptance suites

Upstream overlap reference

| Area | Upstream PR | Action |
| --- | --- | --- |
| dflash-draft.cpp model + capture | ggml-org#22105 (open) | Track A keeps castle impl; Track B will replace with ggml-org#22105 |
| Recurrent snapshot/restore | ggml-org#19493 (merged) | drop/migrate eventually |
| LLAMA_DDTREE_* env knobs | none | fork-only |
| DDTree builder + driver | none | fork-only / wait unified spec API |

Stat

55 files changed, 10857 insertions, 113 deletions.

Status

Not for merge into luce-org master. This is a fork-side organizational record of the castle stack as of cleanup/no-research-artifacts (= spike/dflash-verifier-fastpath minus research artifacts).


Summary by cubic

Add DDTree speculative decoding backed by a DFlash draft model to enable multi-token verification per target forward. This introduces a new draft-model arch, tree-mode decode path, server support, a CLI example, and tests.

  • New Features

    • Tree-mode decoding: llama_batch.parent_id, llama_batch_init_tree, ancestor mask in kq_mask, and read-only graph reuse.
    • New draft model: LLM_ARCH_DFLASH_DRAFT (5-layer), target hidden-state injection, shared lm_head, configurable draft top-K.
    • Qwen3.5 capture: 5-layer hidden capture, context APIs to enable/read capture, and grammar-aware verify via common_sampler_grammar_token_valid.
    • KV/recurrent: seq_compact_tree for accepted spines; snapshot/restore/release API plus DFlash persist rollback fast path.
    • Spec driver: DDTree builder/visibility mask, draft backend, and a step coordinator that supports chain-validate, fast batched verify, rollback, and AR fallback.
    • Integration: server --speculative-mode ddtree with draft model support; examples/speculative-tree CLI; unit/acceptance tests.
  • Migration

    • Server: pass --speculative-mode ddtree and -md <draft.gguf>; optional --ddtree-budget, --ddtree-temp, --ddtree-no-chain-seed.
    • API additions (opt‑in): llama_batch.parent_id (tree only), llama_mem_snapshot_id with llama_seq_snapshot/restore/release, and context helpers for hidden capture and draft top-K.
    • Constraints (current phase): single slot (--parallel 1), Qwen3.5 target with draft pairing, draft top-K auto unless overridden.
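
The "accepted spine" mentioned above can be pictured with a small sketch of a greedy verifier walk. It is a simplified illustration under stated assumptions, not the verifier in common/speculative-tree: TreeNode, accept_spine, and the target_next array (the token the target model picks after each tree node) are hypothetical names, node 0 is assumed to be the root, and children are assumed to have larger indices than their parents.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical node layout for illustration only.
struct TreeNode {
    int32_t token;  // drafted token at this node
    int32_t parent; // index of the parent node, -1 for the root
};

// Walk from the root, following the child whose drafted token matches what
// the target model chose after the current node. Returns the accepted node
// indices (the spine), root included.
std::vector<int32_t> accept_spine(const std::vector<TreeNode> & nodes,
                                  const std::vector<int32_t>  & target_next) {
    assert(nodes.size() == target_next.size());
    std::vector<int32_t> spine;
    if (nodes.empty()) {
        return spine;
    }
    int32_t cur = 0;
    spine.push_back(cur);
    for (;;) {
        int32_t next = -1;
        for (int32_t i = cur + 1; i < (int32_t) nodes.size(); ++i) {
            if (nodes[i].parent == cur && nodes[i].token == target_next[cur]) {
                next = i;
                break;
            }
        }
        if (next < 0) {
            break; // no drafted child matches: stop accepting here
        }
        spine.push_back(next);
        cur = next;
    }
    return spine;
}
```

After the walk, target_next at the last spine node would be the extra token obtained from the same forward pass, and only the cache entries belonging to the spine need to be kept, which is the role the summary attributes to seq_compact_tree.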

Written for commit 819cc8b. Summary will update on new commits.

All llama-level changes for DFlash + DDTree, on top of luce-org ggml stack.
This is the entire castle implementation — DFlash drafter, DDTree builder &
verifier, persist-rollback recurrent cache, server slot integration, tests,
plus the LLAMA_DDTREE_* fast-path env knobs.

Major areas:
- batch: parent_id field + llama_batch_init_tree + tree-mode ubatch propagation
- graph: build_inp_tree + ancestor mask in kq_mask + read_only_tree reuse
- kv-cache: llama_kv_cache_seq_compact_tree
- memory-recurrent: snapshot/restore/release API
  (note: replaceable by upstream ggml-org#19493 checkpoint mechanism)
- context: dflash persist buffers + rollback + recurrent_tail_pos +
  capture_hidden + dflash_draft_top_k + target_feat injection
- model: LLM_ARCH_DFLASH_DRAFT load path + capture_layers hparams
- src/models/dflash-draft.cpp: 5-layer draft graph (3 modes: full / fuse_only /
  kv_update_only), shared lm_head with target
- src/models/qwen35.cpp: hidden-state capture for 5 layers + tree-mode dispatch
- common/speculative-tree.{h,cpp}: ddtree builder + verifier walk + visibility mask
- common/speculative-tree-driver.{h,cpp}: spec-decode coordinator with chain validate,
  fast batched, fast rollback, AR fallback, snapshot replay paths
- common/speculative-draft-backend.{h,cpp}: draft backend abstraction
- common/sampling: common_sampler_grammar_token_valid for grammar-aware verify
- tools/server/server-context.cpp: DDTree slot lifecycle + grammar-aware verify_cbs
- examples/speculative-tree/main.cpp: standalone CLI
- tests/test-speculative-tree*.cpp + test-qwen35-*.cpp + test-dflash-draft.cpp:
  unit + acceptance suites

Track A: production castle stack. Stacked on track-a/ggml.

Tracking: #3 (Phase 1, Track A, llama layer)

Labels: documentation, examples, model, python, server, testing
