[meta] Branch & upstream status (DFlash + DDTree) #3

@Leechael

This issue is the long-term status snapshot for the castle DFlash/DDTree fork. It will be updated as upstream evolves. Comments below record incremental phase progress; this body holds the current state.

Last updated: 2026-05-06


Branches on this fork

| Branch | Commit | Role | Status |
|---|---|---|---|
| `master` | a279d0f0f | Sync with Luce-Org/llama.cpp-dflash-ggml:master (which mirrors ggml-org/llama.cpp:master) | Tracks upstream |
| `spike/dflash-verifier-fastpath` | 428de4508 | Original frozen reference of the castle research spike (100+ commits, ~18k LoC) | Frozen, don't touch |
| `cleanup/no-research-artifacts` | ec8c0aa98 | History-rewritten copy of the spike with `autoresearch.*`, `multi_prompt_probe.sh`, and `scripts/bench_dflash_datasets_llamacpp.py` removed via git filter-repo. No common ancestor with the other branches (the filter rewrote all hashes), so it cannot be PR'd. Used as the source for splitting `track-a/*`. | Frozen reference |
| `base/luce-org-tq3` | 1823460262 | Pointer at the luce-org PR #1 merge commit ("Merge pull request #1 from dusterbloom/feature/tq3-kv-cache"). Used as the PR base for `track-a/ggml`. | Frozen pin |
| `track-a/ggml` | c3692ea68 | All castle ggml-layer changes on top of luce-org PR #1 (= luce-org PRs #2–#5 mirrored + castle-only ggml extensions: CPU SSM tree, WITH_PERSIST template) | PR #4 (draft) |
| `track-a/llama` | 819cc8b8c | All castle llama-layer changes (DFlash drafter, DDTree builder/verifier/driver, persist-rollback, server slot, tests, `LLAMA_DDTREE_*` knobs). Stacked on `track-a/ggml`. | PR #5 (draft, stacked) |
| `track-b/dflash-on-22105` | 67cb0d507 | Mirror of the head of ggml-org/llama.cpp PR ggml-org#22105. No castle adjustments needed — built and benchmarked stock on castle hardware. | PR #6 (draft, reference) |

PRs on this fork: #4 (`track-a/ggml`), #5 (`track-a/llama`, stacked on #4), and #6 (`track-b/dflash-on-22105`); all drafts, as listed in the Status column above.

Upstream PR map

ggml-org/llama.cpp (canonical upstream)

| PR | State | Topic | Relation to this fork |
|---|---|---|---|
| #22397 | merged 2026-04-28 | spec params refactor (`--spec-*`) | Will affect any castle CLI plumbing if/when ported |
| #19493 | merged 2026-04-19 | server: speculative checkpointing for hybrid SSM | Castle's `llama_seq_snapshot`/`restore`/`release` is a parallel implementation; ggml-org#19493 is the upstream successor |
| #22227 | merged 2026-04-22 | speculative-simple checkpoint integration | Same area as ggml-org#19493 |
| #22105 | OPEN since 2026-04-19 | DFlash drafter (am17an / ruixiang63) | `track-b/dflash-on-22105` mirrors this. The author is waiting on ggml-org#18039 + the unified spec API. |
| #18039 | OPEN since 2025-12-14 | EAGLE3 (NVIDIA + GGML) | Hot; blocked on ggerganov's "unified spec API" refactor. Blocks ggml-org#22105. |
| #21089 | OPEN since 2026-03-27 | CPU TBQ3_0 / TBQ4_0 KV cache | Conceptually overlaps with luce-org's TQ3_0 (different name + scope; CPU only) |
| #21038 | merged 2026-04-01 | Hadamard rotation for activation outliers | ggerganov's preemptive baseline ahead of vibe-coded TurboQuant PRs. Different from luce-org's `turbo_wht`. |

Luce-Org/llama.cpp-dflash-ggml (luce-org)

| PR | State | Topic | In our fork |
|---|---|---|---|
| #1 (137228317 + merge 1823460262) | merged | TQ3_0 KV cache (dusterbloom) | yes — pinned via `base/luce-org-tq3` |
| #2 | merged | fattn-chunked routing fix (mrciffa) | yes — included in `track-a/ggml` |
| #3 | merged | sm_120 consumer Blackwell fix (easel) | yes — included in `track-a/ggml` |
| #4 | merged | cuMem pool race fix (easel) | yes — included in `track-a/ggml` |
| #5 | merged | `turbo_wht` parallel (mrciffa) | yes — included in `track-a/ggml` |
| b16de6590 (direct push?) | landed on luce-dflash | tree-mode SSM/GDN kernels (davide) | yes — included via the 1823460262 base + ggml diff |
| (none) | — | castle's CPU SSM tree kernel + WITH_PERSIST template | not yet PR'd to luce-org; currently fork-only in `track-a/ggml` |

Topology

```
ggml-org/llama.cpp:master  (upstream master)
        │
        ├── #22397 ✓  spec params refactor
        ├── #19493 ✓  spec checkpointing
        ├── #22227 ✓  spec-simple checkpoint
        ├── #21038 ✓  Hadamard rotation
        │
        ├── #18039 ◯  EAGLE3 (waits ggerganov refactor)
        │     └─ blocks #22105 ◯  DFlash drafter ────────┐
        │                                                │
        └── #21089 ◯  TBQ3_0 CPU                         │
                                                         │
                                                         │ track-b mirrors
                                                         │ this PR head
Luce-Org/llama.cpp-dflash-ggml:luce-dflash               │
        │                                                │
        ├── PR #1 ✓  TQ3_0 KV (dusterbloom)              │
        │     │  ── 1823460262  ←─────  base/luce-org-tq3 (pin)
        │     │
        │     ├── PR #2 ✓  fattn-chunked fix             │
        │     ├── PR #3 ✓  sm120 fix                     │
        │     ├── PR #4 ✓  vmm pool fix                  │
        │     └── PR #5 ✓  turbo_wht parallel            │
        │                                                │
        └── b16de6590  tree-mode kernels (davide)        │
                                                         │
   Leechael/llama.cpp-dflash-ggml (this fork)            │
        │                                                │
        ├── spike/dflash-verifier-fastpath (frozen)      │
        │                                                │
        ├── cleanup/no-research-artifacts                │
        │   └── (history-rewritten, no merge candidate)  │
        │                                                │
        ├── track-a/ggml (PR #4 here)                    │
        │   └── track-a/llama (PR #5 here, stacked)      │
        │       — castle DFlash + DDTree full stack      │
        │                                                │
        └── track-b/dflash-on-22105 (PR #6) ◀────────────┘
            — upstream #22105 mirror, no adjustments
```

Latest benchmark snapshot

Castle hardware: RTX 4090 (sm_89), CUDA 12.6, target = Qwen3.5-27B-Q4_K_M.gguf (16 GB), draft model varies by stack.

30-prompt mean (HumanEval / GSM8K / Math500, 10 each, seed=42 shuffle, gen=256):
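The prompt protocol above can be sketched roughly as follows (a hypothetical reconstruction: `load_first_n()` stands in for whatever dataset loaders the real harness uses; only the 10+10+10 mix and the seed-42 shuffle come from the benchmark description):

```python
import random

# Rough sketch of the 30-prompt benchmark set: 10 prompts each from
# HumanEval, GSM8K, and Math500, shuffled with seed 42 so every stack
# sees the same order. load_first_n() is a hypothetical stand-in for
# the real harness's dataset loaders.
def load_first_n(dataset: str, n: int) -> list[str]:
    return [f"{dataset}-prompt-{i}" for i in range(n)]

prompts = (
    load_first_n("humaneval", 10)
    + load_first_n("gsm8k", 10)
    + load_first_n("math500", 10)
)
random.Random(42).shuffle(prompts)  # fixed seed => reproducible order

assert len(prompts) == 30
```

Seeding a dedicated `random.Random(42)` instance (rather than the module-level RNG) keeps the shuffle order stable regardless of what else the harness randomizes.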

| stack | avg tok/s | bit-equal | speedup vs AR | notes |
|---|---|---|---|---|
| AR baseline | 46.4 | 10/10 (by definition) | 1.0× | chain decode |
| castle self-impl, exact-gated | ~40 | 10/10 | 0.87× | `validate_tree_with_chain` — N chain decodes/step |
| upstream ggml-org#22105 stock (track-b) | 91 | bit-eq via ggml-org#19493 checkpoint | 2.0× | `--dflash`, q8_0 KV, n-batch 2048 |
| castle + TARGET_TOP1 | 123 | 4.7/10 | 2.7× | unsafe knob; AL preserved |
| castle + unsafe trust batched | 137 | 4.7/10 | 3.0× | unsafe knobs; sacrifices correctness |

(castle self-impl numbers from docs/ddtree-dataset-eval-plan.md on track-a/llama. ggml-org#22105 numbers from PR #6 description / castle bench 2026-05-06.)
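The speedup column is plain arithmetic: each stack's mean tok/s divided by the 46.4 tok/s AR baseline. A quick check of the exact-valued rows:

```python
# Sanity-check the "speedup vs AR" column: mean tok/s divided by the
# 46.4 tok/s AR baseline, rounded to one decimal place.
AR_BASELINE = 46.4

mean_toks = {
    "upstream #22105 stock (track-b)": 91.0,
    "castle + TARGET_TOP1": 123.0,
    "castle + unsafe trust batched": 137.0,
}

speedups = {name: round(t / AR_BASELINE, 1) for name, t in mean_toks.items()}
# 91/46.4 ≈ 2.0, 123/46.4 ≈ 2.7, 137/46.4 ≈ 3.0 — matching the table
```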

What triggers a re-review of this fork

| Trigger | Likely action |
|---|---|
| ggml-org#22105 merged | Drop or rename track-b; decide whether to port castle DDTree on top |
| ggml-org#18039 merged + ggerganov publishes the unified spec API design | Reassess castle's `batch.parent_id` / tree-mode kernels — may become PR-able to ggml-org |
| luce-org adds CI / accepts external PRs | PR castle's CPU SSM tree + WITH_PERSIST template extensions |
| Castle benchmark needs >2× AR with bit-equal | Port castle DDTree on top of track-b (#22105 + DDTree); estimated 1–2 weeks |
| Castle wants to drop unsafe-trust-batched | Switch castle production from track-a to track-b and accept 91 tok/s with full correctness |
| None of the above for ≥3 months | Re-run the 30-prompt benchmark on whatever is latest; update this snapshot |

Open work explicitly not scheduled

How to refresh this snapshot

When circling back:

  1. `gh pr view 22105 --json state,mergedAt --repo ggml-org/llama.cpp` — check whether upstream DFlash merged
  2. `gh pr view 18039 --json state,mergedAt --repo ggml-org/llama.cpp` — check EAGLE3
  3. Re-run `bench_track_b.py` on castle if the hardware/stack changed
  4. Update the tables above and add a comment noting what changed
