This issue is the long-term status snapshot for the castle DFlash/DDTree fork. It will be updated as upstream evolves. Comments below record incremental phase progress; this body holds the current state.
Last updated: 2026-05-06
## Branches on this fork

| Branch | Commit | Role | Status |
|---|---|---|---|
| `master` | a279d0f0f | Sync with `Luce-Org/llama.cpp-dflash-ggml:master` (which mirrors `ggml-org/llama.cpp:master`) | Tracks upstream |
| `spike/dflash-verifier-fastpath` | 428de4508 | Original frozen reference of the castle research spike (~100+ commits, ~18k LoC) | Frozen, don't touch |
| `cleanup/no-research-artifacts` | ec8c0aa98 | History-rewritten copy of the spike with `autoresearch.*` / `multi_prompt_probe.sh` / `scripts/bench_dflash_datasets_llamacpp.py` removed via `git filter-repo`. No common ancestor with other branches (filter rewrote all hashes), so cannot be PR'd. Used as the source for splitting `track-a/*`. | Frozen reference |
| `base/luce-org-tq3` | 1823460262 | Pointer at luce-org PR #1 merge commit (`Merge pull request #1 from dusterbloom/feature/tq3-kv-cache`). Used as PR base for `track-a/ggml`. | Frozen pin |
| `track-a/ggml` | c3692ea68 | All castle ggml-layer changes on top of luce-org PR #1 (= luce-org PR #2-#5 mirrored + castle-only ggml extensions: CPU SSM tree, WITH_PERSIST template). | PR #4 (draft) |
| `track-a/llama` | 819cc8b8c | All castle llama-layer changes (DFlash drafter, DDTree builder/verifier/driver, persist-rollback, server slot, tests, `LLAMA_DDTREE_*` knobs). Stacked on `track-a/ggml`. | PR #5 (draft, stacked) |
| `track-b/dflash-on-22105` | 67cb0d507 | Mirror of `ggml-org/llama.cpp` PR ggml-org#22105 head. No castle adjustments needed — built and benchmarked stock on castle hardware. | PR #6 (draft, reference) |
PRs on this fork: #4 (`track-a/ggml`), #5 (`track-a/llama`, stacked on #4), #6 (`track-b/dflash-on-22105`), all drafts.
## Upstream PR map

### `ggml-org/llama.cpp` (canonical upstream)

| PR | State | Topic | Relation to this fork |
|---|---|---|---|
| #22397 | merged 2026-04-28 | spec params refactor (`--spec-*`) | Will affect any castle CLI plumbing if/when ported |
| #19493 | merged 2026-04-19 | server: speculative checkpointing for hybrid SSM | Castle's `llama_seq_snapshot/restore/release` is a parallel implementation; ggml-org#19493 is the upstream successor. |
| #22227 | merged 2026-04-22 | speculative-simple checkpoint integration | Same area as ggml-org#19493 |
| #22105 | OPEN since 2026-04-19 | DFlash drafter (am17an / ruixiang63) | `track-b/dflash-on-22105` mirrors this. Author waits on ggml-org#18039 + unified spec API. |
| #18039 | OPEN since 2025-12-14 | EAGLE3 (NVIDIA + GGML) | Hot, blocked on ggerganov's "unified spec API" refactor. Blocks ggml-org#22105. |
| #21089 | OPEN since 2026-03-27 | CPU TBQ3_0 / TBQ4_0 KV cache | Conceptually overlaps with luce-org's TQ3_0 (different name + scope; CPU only) |
| #21038 | merged 2026-04-01 | Hadamard rotation for activation outliers | ggerganov's preemptive baseline before vibe-coded TurboQuant PRs. Different from luce-org's `turbo_wht` |
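The checkpointing PRs above (ggml-org#19493, #22227) and castle's parallel `llama_seq_snapshot/restore/release` exist for the same reason: recurrent (SSM) state, unlike a KV cache, cannot simply be truncated after rejected draft tokens, so speculative verification must snapshot state first and restore + replay on partial acceptance. A toy sketch of that pattern (plain Python; this is *not* the llama.cpp or castle API, and all names here are illustrative):

```python
# Toy model of snapshot/restore around speculative verification.
# NOT the llama.cpp API -- a conceptual illustration only.

class ToyRecurrentLM:
    """Stand-in for a hybrid-SSM target: state mutates on every decode and,
    unlike a KV cache, cannot be truncated after rejected tokens."""

    def __init__(self) -> None:
        self.state = 1

    def predict(self) -> int:
        # Greedy "next token" derived from the current state.
        return self.state % 7

    def decode(self, token: int) -> None:
        # Recurrent update: the new state folds in the token irreversibly.
        self.state = self.state * 31 + token


def verify_draft(model: ToyRecurrentLM, draft: list[int]) -> list[int]:
    """Verify a drafted sequence in one pass over the target, then keep only
    the agreed prefix. The pass advances state past any rejected tokens, so
    we snapshot first and restore + replay afterwards (the role played by a
    seq snapshot/restore API)."""
    checkpoint = model.state                  # snapshot
    predictions = []
    for token in draft:                       # single verification pass
        predictions.append(model.predict())
        model.decode(token)

    accepted = []
    for token, predicted in zip(draft, predictions):
        if token != predicted:                # first disagreement ends it
            break
        accepted.append(token)

    model.state = checkpoint                  # restore pre-speculation state
    for token in accepted:                    # replay only accepted tokens
        model.decode(token)
    return accepted
```

Without the restore + replay step, the rejected tokens would remain folded into the recurrent state, which is exactly the failure mode the checkpoint API prevents.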
### `Luce-Org/llama.cpp-dflash-ggml` (luce-org)

| PR | State | Topic | In our fork |
|---|---|---|---|
| #1 (137228317 + merge 182346026) | merged | TQ3_0 KV cache (dusterbloom) | yes — pinned via `base/luce-org-tq3` |
| #2 | merged | fattn-chunked routing fix (mrciffa) | yes — included in `track-a/ggml` |
| #3 | merged | sm_120 consumer Blackwell fix (easel) | yes — included in `track-a/ggml` |
| #4 | merged | cuMem pool race fix (easel) | yes — included in `track-a/ggml` |
| #5 | merged | turbo_wht parallel (mrciffa) | yes — included in `track-a/ggml` |
| b16de6590 (direct push?) | landed on `luce-dflash` | tree-mode SSM/GDN kernels (davide) | yes — included via 1823460262 base + ggml diff |
| (none) | — | castle's CPU SSM tree kernel + WITH_PERSIST template | not yet PR'd to luce-org — currently fork-only in `track-a/ggml` |
## Topology

```
ggml-org/llama.cpp:master (upstream master)
│
├── #22397 ✓ spec params refactor
├── #19493 ✓ spec checkpointing
├── #22227 ✓ spec-simple checkpoint
├── #21038 ✓ Hadamard rotation
│
├── #18039 ◯ EAGLE3 (waits ggerganov refactor)
│    └─ blocks #22105 ◯ DFlash drafter ────────┐
│                                              │
└── #21089 ◯ TBQ3_0 CPU                        │
                                               │
                            track-b mirrors    │
                            this PR head       │
Luce-Org/llama.cpp-dflash-ggml:luce-dflash     │
│                                              │
├── PR #1 ✓ TQ3_0 KV (dusterbloom)             │
│    │ ── 1823460262 ←───── base/luce-org-tq3 (pin)
│    │                                         │
│    ├── PR #2 ✓ fattn-chunked fix             │
│    ├── PR #3 ✓ sm120 fix                     │
│    ├── PR #4 ✓ vmm pool fix                  │
│    └── PR #5 ✓ turbo_wht parallel            │
│                                              │
└── b16de6590 tree-mode kernels (davide)       │
                                               │
Leechael/llama.cpp-dflash-ggml (this fork)     │
│                                              │
├── spike/dflash-verifier-fastpath (frozen)    │
│                                              │
├── cleanup/no-research-artifacts              │
│   └── (history-rewritten, no merge candidate)│
│                                              │
├── track-a/ggml (PR #4 here)                  │
│    └── track-a/llama (PR #5 here, stacked)   │
│         — castle DFlash + DDTree full stack  │
│                                              │
└── track-b/dflash-on-22105 (PR #6) ◀──────────┘
     — upstream #22105 mirror, no adjustments
```
## Latest benchmark snapshot

Castle hardware: RTX 4090 (sm_89), CUDA 12.6, target = `Qwen3.5-27B-Q4_K_M.gguf` (16 GB), draft model varies by stack.

30-prompt mean (HumanEval / GSM8K / Math500, 10 each, seed=42 shuffle, gen=256):

| stack | avg tok/s | bit-equal | speedup vs AR | notes |
|---|---|---|---|---|
| AR baseline | 46.4 | 10/10 (def) | 1.0× | chain decode |
| castle self-impl exact-gated | ~40 | 10/10 | 0.87× | `validate_tree_with_chain` — N chain decodes/step |
| upstream ggml-org#22105 stock (track-b) | 91 | bit-eq via ggml-org#19493 checkpoint | 2.0× | `--dflash`, q8_0 KV, n-batch 2048 |
| castle + TARGET_TOP1 | 123 | 4.7/10 | 2.7× | unsafe knob; AL preserved |
| castle + unsafe trust batched | 137 | 4.7/10 | 3.0× | unsafe knobs; sacrifices correctness |

(castle self-impl numbers from `docs/ddtree-dataset-eval-plan.md` on `track-a/llama`. ggml-org#22105 numbers from PR #6 description / castle bench 2026-05-06.)
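When refreshing this table, the two derived columns are cheap to recompute. A throwaway sketch (plain Python; the function names are ours for illustration, not from `bench_track_b.py` or any castle script):

```python
# Helpers for recomputing the derived benchmark columns above.
# Hypothetical names -- not taken from any castle bench script.

AR_BASELINE_TOKS = 46.4  # AR chain-decode mean tok/s from the table

def speedup_vs_ar(stack_toks: float, ar_toks: float = AR_BASELINE_TOKS) -> float:
    """'speedup vs AR' column: stack throughput over the AR baseline,
    rounded to one decimal place as in the table."""
    return round(stack_toks / ar_toks, 1)

def bit_equal_score(ar_outputs: list, spec_outputs: list) -> str:
    """'bit-equal' column: how many prompts produced token-identical output
    under the speculative stack vs. plain AR decoding."""
    matches = sum(a == s for a, s in zip(ar_outputs, spec_outputs))
    return f"{matches}/{len(ar_outputs)}"
```

For example, `speedup_vs_ar(91)` reproduces the 2.0× entry for the stock ggml-org#22105 stack.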
## What triggers a re-review of this fork

| Trigger | Likely action |
|---|---|
| ggml-org#22105 merged | Drop or rename track-b; decide whether to port castle DDTree on top |
| ggml-org#18039 merged + ggerganov publishes unified spec API design | Reassess castle's `batch.parent_id` / tree-mode kernels — may become PR-able to ggml-org |
| luce-org adds CI / accepts external PRs | PR castle's CPU SSM tree + WITH_PERSIST template extensions |
| Castle benchmark needs >2× AR with bit-equal | Port castle DDTree on top of track-b (#22105 + DDTree); estimated 1-2 weeks |
| Castle wants to drop unsafe-trust-batched | Switch castle production from track-a to track-b and accept 91 tok/s with full correctness |
| None of the above for ≥3 months | Re-run 30-prompt benchmark on whatever is latest, update snapshot here |
## Open work explicitly not scheduled

- Port castle DDTree (`parent_id` batch + ancestor mask + tree-mode kernels + speculative-tree + driver) onto `track-b`; rewrite anything that conflicts with ggml-org#22105's dflash drafter; rebench
- PR castle's CPU SSM tree kernel + WITH_PERSIST template to luce-org (`track-a/ggml` minus the PR #2-#5 mirror = castle-only diff; ~280 LoC across 5 files)
- Migrate castle's `llama_seq_snapshot/restore/release` to the upstream ggml-org#19493 checkpoint API
- Trim the `LLAMA_DDTREE_*` env knobs to a documented subset
## How to refresh this snapshot

When circling back:

- `gh pr view 22105 --json state,mergedAt --repo ggml-org/llama.cpp` — check if upstream DFlash merged
- `gh pr view 18039 --json state,mergedAt --repo ggml-org/llama.cpp` — check EAGLE3
- Re-run `bench_track_b.py` on castle if hardware/stack changed
- Update tables above + add a comment noting what changed
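The PR checks above can be scripted. A sketch assuming the `gh` CLI is installed and authenticated (`fetch_pr` and `rereview_reasons` are hypothetical helper names we introduce here, and the action strings paraphrase the trigger table):

```python
# Sketch: poll the watched upstream PRs and report which re-review
# triggers fired. Assumes `gh` is installed and authenticated; the
# helper names are illustrative, not from any existing script.
import json
import subprocess

WATCHED_PRS = (22105, 18039)  # upstream DFlash drafter and EAGLE3

def fetch_pr(number: int, repo: str = "ggml-org/llama.cpp") -> dict:
    """Wrap the exact `gh pr view` invocation from the checklist above."""
    result = subprocess.run(
        ["gh", "pr", "view", str(number),
         "--json", "state,mergedAt", "--repo", repo],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)

def rereview_reasons(pr_states: dict) -> list:
    """Map merged watched PRs to actions from the re-review trigger table."""
    actions = {
        22105: "drop or rename track-b; decide whether to port "
               "castle DDTree on top",
        18039: "reassess batch.parent_id / tree-mode kernels "
               "for upstreaming",
    }
    return [
        f"#{num} merged ({info.get('mergedAt')}): {actions[num]}"
        for num, info in pr_states.items()
        if num in actions and info.get("state") == "MERGED"
    ]

# Usage (needs network + gh auth):
#   states = {num: fetch_pr(num) for num in WATCHED_PRS}
#   print(rereview_reasons(states) or ["no re-review triggers fired"])
```

Keeping the decision logic in a pure function (`rereview_reasons`) means it can be exercised without network access, while `fetch_pr` stays a thin wrapper over the same `gh` command the checklist already documents.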