[Track A / PR-1] ggml: castle DDTree extensions on top of luce-org TQ3+tree-kernels by Leechael · Pull Request #4 · Leechael/llama.cpp-dflash-ggml

Leechael · 2026-05-05T17:57:06Z

Tracking: #3 (Phase 1, Track A, ggml layer)

Stack

This is PR-1 in a stacked pair:

PR-1 (this): track-a/ggml → base/luce-org-tq3 (luce-org PR #1, "ddtree: support dflash server cache and chain validation", merge commit 1823460262)
PR-2 (next): track-a/llama → track-a/ggml

Scope

All ggml-layer changes that sit on top of luce-org PR #1 (merge commit 1823460262: TQ3_0 KV cache + tree-mode SSM/GDN kernels).

luce-org-side fixes (already PR'd to luce-org separately as PR #2-#5, included here as a stacked snapshot)

3e80ebc8a fattn-chunked routing fix (fattn-chunked.cu, fattn.cu)
c253e49b9 consumer Blackwell sm_120 skip (no FP4 MMA)
6de9f7bb2 cuMem pool extension race fix
07fe012aa turbo_wht parallelization (1 → 128 threads/block)

castle-side extensions (no upstream yet)

CPU-side ssm_conv_tree kernel (ggml/src/ggml-cpu/ops.cpp)
WITH_PERSIST template branch for ssm_conv_tree (ggml-cuda/ssm-conv.cu)
ggml op declarations / registrations (ggml.h, ggml.c)

Stat

5 files changed, 265 insertions, 33 deletions.

Status

Not for merge into luce-org master — this is a fork-side organizational record. Castle's working set assumes this branch is merged on top of the luce-org 1823460262 snapshot.

Summary by cubic

Adds tree-mode SSM conv with per-token persistent state for DDTree/DFS decoding, plus CPU/CUDA support and a new API ggml_ssm_conv_tree_persist. Also makes chunked flash-attention prefill more resilient under low VRAM by auto-reducing chunk size. Part of Linear #3 (Phase 1, Track A).

New Features
- New ggml_ssm_conv_tree_persist(ctx, sx, c, parent_ids, persist_inter) op that writes each token’s (K-1)-element post-state to persist_inter (contiguous F32, shape [K-1, d_inner, n_tokens, n_seqs]).
- CPU: ssm_conv forward now supports tree mode via parent_ids and optionally emits per-token post-state; gated_delta_net_one_chunk supports parent_ids rollback and external state persistence (F32/F16).
- CUDA: adds WITH_PERSIST path to tree-mode ssm_conv and plumbs persist_inter through ggml_cuda_op_ssm_conv.
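To make the tree-mode semantics concrete, here is a minimal CPU reference sketch of what the summary above describes. It is not the kernel from this PR. It assumes that parent_ids[t] is the batch index of token t's parent, that -1 means the token hangs directly off the committed conv state, and that parents precede children in the batch. The function name ssm_conv_tree_ref, the separate init buffer, and the single-sequence layout are hypothetical simplifications; only the [K-1, d_inner, n_tokens] ordering of the persisted state (K-1 fastest) is taken from the description.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reference, not the ggml kernel. Assumes K >= 2, one sequence.
struct TreeConvResult {
    std::vector<float> y;        // [n_tokens][d_inner] conv outputs
    std::vector<float> persist;  // [n_tokens][d_inner][K-1] per-token post-state
};

// init       : [d_inner][K-1] committed conv state, oldest sample first
// x          : [n_tokens][d_inner] new inputs, one row per tree node
// w          : [d_inner][K] per-channel conv weights
// parent_ids : parent index per token, -1 = child of the committed state;
//              assumed topologically ordered (parent_ids[t] < t)
TreeConvResult ssm_conv_tree_ref(int K, int d_inner, int n_tokens,
                                 const std::vector<float>& init,
                                 const std::vector<float>& x,
                                 const std::vector<float>& w,
                                 const std::vector<int32_t>& parent_ids) {
    const int S = K - 1;  // state length per channel
    TreeConvResult r;
    r.y.assign((size_t)n_tokens * d_inner, 0.0f);
    r.persist.assign((size_t)n_tokens * d_inner * S, 0.0f);
    std::vector<float> win(K);

    for (int t = 0; t < n_tokens; ++t) {
        const int p = parent_ids[t];
        for (int c = 0; c < d_inner; ++c) {
            // The window starts from the parent's post-state, not from token t-1.
            const float* prev = p < 0
                ? &init[(size_t)c * S]
                : &r.persist[((size_t)p * d_inner + c) * S];
            for (int k = 0; k < S; ++k) win[k] = prev[k];
            win[S] = x[(size_t)t * d_inner + c];

            float acc = 0.0f;
            for (int k = 0; k < K; ++k) acc += win[k] * w[(size_t)c * K + k];
            r.y[(size_t)t * d_inner + c] = acc;

            // Post-state: the window shifted by one, ending with this token's input.
            float* post = &r.persist[((size_t)t * d_inner + c) * S];
            for (int k = 0; k < S; ++k) post[k] = win[k + 1];
        }
    }
    return r;
}
```

Under these assumptions, two sibling tokens with the same parent read the same K-1 history and differ only in their own input, which is why each token needs its own persisted slot: whichever branch is accepted, its post-state can be picked up without recomputation.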
Bug Fixes
- ggml-cuda/fattn-chunked.cu: handle scratch OOM by halving tbq_chunk, freeing failed buffers, and retrying instead of aborting (with a warning).

Written for commit c3692ea. Summary will update on new commits.

Aggregates all ggml-layer changes that sit on top of luce-org PR #1 (merge 1823460, TQ3_0 KV cache + b16de65 tree-mode SSM/GDN kernels):

luce-org-side fixes (already PR'd separately to luce-org as PR #2-#5):
- 3e80ebc fattn-chunked routing fix (fattn-chunked.cu, fattn.cu)
- c253e49 consumer Blackwell sm_120 skip
- 6de9f7b cuMem pool extension race fix
- 07fe012 turbo_wht parallelization

castle-side extensions (no upstream yet):
- CPU-side ssm_conv tree kernel (ggml/src/ggml-cpu/ops.cpp)
- WITH_PERSIST template branch for ssm_conv_tree (ssm-conv.cu)
- ggml op declarations / registrations (ggml.h, ggml.c)

Tracking: #3 (Phase 1, Track A, ggml layer)

github-actions Bot added Nvidia GPU ggml labels May 5, 2026

This was referenced May 5, 2026

[Track A / PR-2] llama: castle DFlash drafter + DDTree spec-decode (full feature stack) #5

Draft

[meta] Branch & upstream status (DFlash + DDTree) #3

Open

Leechael wants to merge 1 commit into base/luce-org-tq3 from track-a/ggml

Leechael commented May 5, 2026 (edited by cubic-dev-ai Bot)

1 participant
