This issue is the long-term status snapshot for the castle DFlash/DDTree fork. It will be updated as upstream evolves. Comments below record incremental phase progress; this body holds the current state.
Last updated: 2026-05-06
## Branches on this fork

| Branch | Commit | Role | Status |
|---|---|---|---|
| `master` | a279d0f0f | Sync with `Luce-Org/llama.cpp-dflash-ggml:master` (which mirrors `ggml-org/llama.cpp:master`) | Tracks upstream |
| `spike/dflash-verifier-fastpath` | 428de4508 | Original frozen reference of the castle research spike (~100+ commits, ~18k LoC) | Frozen, don't touch |
| `cleanup/no-research-artifacts` | ec8c0aa98 | History-rewritten copy of the spike with `autoresearch.*` / `multi_prompt_probe.sh` / `scripts/bench_dflash_datasets_llamacpp.py` removed via `git filter-repo`. No common ancestor with other branches (filter rewrote all hashes), so cannot be PR'd. Used as the source for splitting `track-a/*`. | Frozen reference |
| `base/luce-org-tq3` | 1823460262 | Pointer at luce-org PR #1 merge commit (`Merge pull request #1 from dusterbloom/feature/tq3-kv-cache`). Used as PR base for `track-a/ggml`. | Frozen pin |
| `track-a/ggml` | c3692ea68 | All castle ggml-layer changes on top of luce-org PR #1 (= luce-org PR #2-#5 mirrored + castle-only ggml extensions: CPU SSM tree, WITH_PERSIST template). | PR #4 (draft) |
| `track-a/llama` | 819cc8b8c | All castle llama-layer changes (DFlash drafter, DDTree builder/verifier/driver, persist-rollback, server slot, tests, `LLAMA_DDTREE_*` knobs). Stacked on `track-a/ggml`. | PR #5 (draft, stacked) |
| `track-b/dflash-on-22105` | 67cb0d507 | Mirror of `ggml-org/llama.cpp` PR ggml-org#22105 head. No castle adjustments needed — built and benchmarked stock on castle hardware. | PR #6 (draft, reference) |
PRs on this fork: #4 (`track-a/ggml`), #5 (`track-a/llama`, stacked on #4), #6 (`track-b/dflash-on-22105`), all drafts.
## Upstream PR map

### `ggml-org/llama.cpp` (canonical upstream)

| PR | State | Topic | Relation to this fork |
|---|---|---|---|
| #22397 | merged 2026-04-28 | spec params refactor (`--spec-*`) | Will affect any castle CLI plumbing if/when ported |
| #19493 | merged 2026-04-19 | server: speculative checkpointing for hybrid SSM | Castle's `llama_seq_snapshot/restore/release` is a parallel implementation; ggml-org#19493 is the upstream successor. |
| #22227 | merged 2026-04-22 | speculative-simple checkpoint integration | Same area as ggml-org#19493 |
| #22105 | OPEN since 2026-04-19 | DFlash drafter (am17an / ruixiang63) | `track-b/dflash-on-22105` mirrors this. Author waits on ggml-org#18039 + unified spec API. |
| #18039 | OPEN since 2025-12-14 | EAGLE3 (NVIDIA + GGML) | Hot, blocked on ggerganov's "unified spec API" refactor. Blocks ggml-org#22105. |
| #21089 | OPEN since 2026-03-27 | CPU TBQ3_0 / TBQ4_0 KV cache | Conceptually overlaps with luce-org's TQ3_0 (different name + scope; CPU only) |
| #21038 | merged 2026-04-01 | Hadamard rotation for activation outliers | ggerganov's preemptive baseline before vibe-coded TurboQuant PRs. Different from luce-org's `turbo_wht` |
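The checkpointing PRs above (ggml-org#19493, #22227) and castle's parallel `llama_seq_snapshot/restore/release` exist for the same reason: recurrent (SSM) state, unlike a KV cache, cannot simply be truncated after rejected draft tokens, so speculative verification must snapshot state first and restore + replay on partial acceptance. A toy sketch of that pattern (plain Python; this is *not* the llama.cpp or castle API, and all names here are illustrative):

```python
# Toy model of snapshot/restore around speculative verification.
# NOT the llama.cpp API -- a conceptual illustration only.

class ToyRecurrentLM:
    """Stand-in for a hybrid-SSM target: state mutates on every decode and,
    unlike a KV cache, cannot be truncated after rejected tokens."""

    def __init__(self) -> None:
        self.state = 1

    def predict(self) -> int:
        # Greedy "next token" derived from the current state.
        return self.state % 7

    def decode(self, token: int) -> None:
        # Recurrent update: the new state folds in the token irreversibly.
        self.state = self.state * 31 + token


def verify_draft(model: ToyRecurrentLM, draft: list[int]) -> list[int]:
    """Verify a drafted sequence in one pass over the target, then keep only
    the agreed prefix. The pass advances state past any rejected tokens, so
    we snapshot first and restore + replay afterwards (the role played by a
    seq snapshot/restore API)."""
    checkpoint = model.state                  # snapshot
    predictions = []
    for token in draft:                       # single verification pass
        predictions.append(model.predict())
        model.decode(token)

    accepted = []
    for token, predicted in zip(draft, predictions):
        if token != predicted:                # first disagreement ends it
            break
        accepted.append(token)

    model.state = checkpoint                  # restore pre-speculation state
    for token in accepted:                    # replay only accepted tokens
        model.decode(token)
    return accepted
```

Without the restore + replay step, the rejected tokens would remain folded into the recurrent state, which is exactly the failure mode the checkpoint API prevents.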
### `Luce-Org/llama.cpp-dflash-ggml` (luce-org)

| PR | State | Topic | In our fork |
|---|---|---|---|
| #1 (137228317 + merge 182346026) | merged | TQ3_0 KV cache (dusterbloom) | yes — pinned via `base/luce-org-tq3` |
| #2 | merged | fattn-chunked routing fix (mrciffa) | yes — included in `track-a/ggml` |
| #3 | merged | sm_120 consumer Blackwell fix (easel) | yes — included in `track-a/ggml` |
| #4 | merged | cuMem pool race fix (easel) | yes — included in `track-a/ggml` |
| #5 | merged | turbo_wht parallel (mrciffa) | yes — included in `track-a/ggml` |
| b16de6590 (direct push?) | landed on `luce-dflash` | tree-mode SSM/GDN kernels (davide) | yes — included via 1823460262 base + ggml diff |
| (none) | — | castle's CPU SSM tree kernel + WITH_PERSIST template | not yet PR'd to luce-org — currently fork-only in `track-a/ggml` |
## Topology

```
ggml-org/llama.cpp:master (upstream master)
│
├── #22397 ✓ spec params refactor
├── #19493 ✓ spec checkpointing
├── #22227 ✓ spec-simple checkpoint
├── #21038 ✓ Hadamard rotation
│
├── #18039 ◯ EAGLE3 (waits ggerganov refactor)
│    └─ blocks #22105 ◯ DFlash drafter ────────┐
│                                              │
└── #21089 ◯ TBQ3_0 CPU                        │
                                               │
                            track-b mirrors    │
                            this PR head       │
Luce-Org/llama.cpp-dflash-ggml:luce-dflash     │
│                                              │
├── PR #1 ✓ TQ3_0 KV (dusterbloom)             │
│    │ ── 1823460262 ←───── base/luce-org-tq3 (pin)
│    │                                         │
│    ├── PR #2 ✓ fattn-chunked fix             │
│    ├── PR #3 ✓ sm120 fix                     │
│    ├── PR #4 ✓ vmm pool fix                  │
│    └── PR #5 ✓ turbo_wht parallel            │
│                                              │
└── b16de6590 tree-mode kernels (davide)       │
                                               │
Leechael/llama.cpp-dflash-ggml (this fork)     │
│                                              │
├── spike/dflash-verifier-fastpath (frozen)    │
│                                              │
├── cleanup/no-research-artifacts              │
│   └── (history-rewritten, no merge candidate)│
│                                              │
├── track-a/ggml (PR #4 here)                  │
│    └── track-a/llama (PR #5 here, stacked)   │
│         — castle DFlash + DDTree full stack  │
│                                              │
└── track-b/dflash-on-22105 (PR #6) ◀──────────┘
     — upstream #22105 mirror, no adjustments
```
## Latest benchmark snapshot

Castle hardware: RTX 4090 (sm_89), CUDA 12.6, target = `Qwen3.5-27B-Q4_K_M.gguf` (16 GB), draft model varies by stack.

30-prompt mean (HumanEval / GSM8K / Math500, 10 each, seed=42 shuffle, gen=256):

| stack | avg tok/s | bit-equal | speedup vs AR | notes |
|---|---|---|---|---|
| AR baseline | 46.4 | 10/10 (def) | 1.0× | chain decode |
| castle self-impl exact-gated | ~40 | 10/10 | 0.87× | `validate_tree_with_chain` — N chain decodes/step |
| upstream ggml-org#22105 stock (track-b) | 91 | bit-eq via ggml-org#19493 checkpoint | 2.0× | `--dflash`, q8_0 KV, n-batch 2048 |
| castle + TARGET_TOP1 | 123 | 4.7/10 | 2.7× | unsafe knob; AL preserved |
| castle + unsafe trust batched | 137 | 4.7/10 | 3.0× | unsafe knobs; sacrifices correctness |

(castle self-impl numbers from `docs/ddtree-dataset-eval-plan.md` on `track-a/llama`. ggml-org#22105 numbers from PR #6 description / castle bench 2026-05-06.)
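When refreshing this table, the two derived columns are cheap to recompute. A throwaway sketch (plain Python; the function names are ours for illustration, not from `bench_track_b.py` or any castle script):

```python
# Helpers for recomputing the derived benchmark columns above.
# Hypothetical names -- not taken from any castle bench script.

AR_BASELINE_TOKS = 46.4  # AR chain-decode mean tok/s from the table

def speedup_vs_ar(stack_toks: float, ar_toks: float = AR_BASELINE_TOKS) -> float:
    """'speedup vs AR' column: stack throughput over the AR baseline,
    rounded to one decimal place as in the table."""
    return round(stack_toks / ar_toks, 1)

def bit_equal_score(ar_outputs: list, spec_outputs: list) -> str:
    """'bit-equal' column: how many prompts produced token-identical output
    under the speculative stack vs. plain AR decoding."""
    matches = sum(a == s for a, s in zip(ar_outputs, spec_outputs))
    return f"{matches}/{len(ar_outputs)}"
```

For example, `speedup_vs_ar(91)` reproduces the 2.0× entry for the stock ggml-org#22105 stack.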
## What triggers a re-review of this fork

| Trigger | Likely action |
|---|---|
| ggml-org#22105 merged | Drop or rename track-b; decide whether to port castle DDTree on top |
| ggml-org#18039 merged + ggerganov publishes unified spec API design | Reassess castle's `batch.parent_id` / tree-mode kernels — may become PR-able to ggml-org |
| luce-org adds CI / accepts external PRs | PR castle's CPU SSM tree + WITH_PERSIST template extensions |
| Castle benchmark needs >2× AR with bit-equal | Port castle DDTree on top of track-b (#22105 + DDTree); estimated 1-2 weeks |
| Castle wants to drop unsafe-trust-batched | Switch castle production from track-a to track-b and accept 91 tok/s with full correctness |
| None of the above for ≥3 months | Re-run 30-prompt benchmark on whatever is latest, update snapshot here |
## Open work explicitly not scheduled

- Port castle DDTree (`parent_id` batch + ancestor mask + tree-mode kernels + speculative-tree + driver) onto `track-b`; rewrite anything that conflicts with ggml-org#22105's dflash drafter; rebench
- PR castle's CPU SSM tree kernel + WITH_PERSIST template to luce-org (`track-a/ggml` minus the PR #2-#5 mirror = castle-only diff; ~280 LoC across 5 files)
- Migrate castle's `llama_seq_snapshot/restore/release` to the upstream ggml-org#19493 checkpoint API
- Trim the `LLAMA_DDTREE_*` env knobs to a documented subset
## How to refresh this snapshot

When circling back:

- `gh pr view 22105 --json state,mergedAt --repo ggml-org/llama.cpp` — check if upstream DFlash merged
- `gh pr view 18039 --json state,mergedAt --repo ggml-org/llama.cpp` — check EAGLE3
- Re-run `bench_track_b.py` on castle if hardware/stack changed
- Update tables above + add a comment noting what changed
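The PR checks above can be scripted. A sketch assuming the `gh` CLI is installed and authenticated (`fetch_pr` and `rereview_reasons` are hypothetical helper names we introduce here, and the action strings paraphrase the trigger table):

```python
# Sketch: poll the watched upstream PRs and report which re-review
# triggers fired. Assumes `gh` is installed and authenticated; the
# helper names are illustrative, not from any existing script.
import json
import subprocess

WATCHED_PRS = (22105, 18039)  # upstream DFlash drafter and EAGLE3

def fetch_pr(number: int, repo: str = "ggml-org/llama.cpp") -> dict:
    """Wrap the exact `gh pr view` invocation from the checklist above."""
    result = subprocess.run(
        ["gh", "pr", "view", str(number),
         "--json", "state,mergedAt", "--repo", repo],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)

def rereview_reasons(pr_states: dict) -> list:
    """Map merged watched PRs to actions from the re-review trigger table."""
    actions = {
        22105: "drop or rename track-b; decide whether to port "
               "castle DDTree on top",
        18039: "reassess batch.parent_id / tree-mode kernels "
               "for upstreaming",
    }
    return [
        f"#{num} merged ({info.get('mergedAt')}): {actions[num]}"
        for num, info in pr_states.items()
        if num in actions and info.get("state") == "MERGED"
    ]

# Usage (needs network + gh auth):
#   states = {num: fetch_pr(num) for num in WATCHED_PRS}
#   print(rereview_reasons(states) or ["no re-review triggers fired"])
```

Keeping the decision logic in a pure function (`rereview_reasons`) means it can be exercised without network access, while `fetch_pr` stays a thin wrapper over the same `gh` command the checklist already documents.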