[Track A / PR-1] ggml: castle DDTree extensions on top of luce-org TQ3+tree-kernels#4
Draft
Leechael wants to merge 1 commit into
Aggregates all ggml-layer changes that sit on top of luce-org PR #1 (merge 1823460, TQ3_0 KV cache + b16de65 tree-mode SSM/GDN kernels):

luce-org-side fixes (already PR'd separately to luce-org as PR #2-#5):
- 3e80ebc fattn-chunked routing fix (fattn-chunked.cu, fattn.cu)
- c253e49 consumer Blackwell sm_120 skip
- 6de9f7b cuMem pool extension race fix
- 07fe012 turbo_wht parallelization

castle-side extensions (no upstream yet):
- CPU-side ssm_conv tree kernel (ggml/src/ggml-cpu/ops.cpp)
- WITH_PERSIST template branch for ssm_conv_tree (ssm-conv.cu)
- ggml op declarations / registrations (ggml.h, ggml.c)

Tracking: #3 (Phase 1, Track A, ggml layer)
This was referenced May 5, 2026
Tracking: #3 (Phase 1, Track A, ggml layer)
Stack
This is PR-1 in a stacked pair:
- `track-a/ggml` → `base/luce-org-tq3` (luce-org PR #1, "ddtree: support dflash server cache and chain validation", merged commit `1823460262`)
- `track-a/llama` → `track-a/ggml`

Scope

All ggml-layer changes that sit on top of luce-org PR #1 (TQ3_0 KV cache + tree-mode SSM/GDN kernels, merged as `1823460262`).

luce-org-side fixes (already PR'd to luce-org separately as PR #2-#5, included here as a stacked snapshot)

- `3e80ebc8a` fattn-chunked routing fix (`fattn-chunked.cu`, `fattn.cu`)
- `c253e49b9` consumer Blackwell `sm_120` skip (no FP4 MMA)
- `6de9f7bb2` cuMem pool extension race fix
- `07fe012aa` `turbo_wht` parallelization (1 → 128 threads/block)

castle-side extensions (no upstream yet)

- CPU-side `ssm_conv_tree` kernel (`ggml/src/ggml-cpu/ops.cpp`)
- `WITH_PERSIST` template branch for `ssm_conv_tree` (`ggml-cuda/ssm-conv.cu`)
- ggml op declarations / registrations (`ggml.h`, `ggml.c`)

Stat
5 files changed, 265 insertions, 33 deletions.
Status
Not for merge into luce-org master. This is a fork-side organizational record. Castle's working set assumes this branch is merged on top of the luce-org `1823460262` snapshot.

Summary by cubic
Adds tree-mode SSM conv with per-token persistent state for DDTree/DFS decoding, plus CPU/CUDA support and a new API, `ggml_ssm_conv_tree_persist`. Also makes chunked flash-attention prefill more resilient under low VRAM by auto-reducing chunk size. Part of Linear #3 (Phase 1, Track A).

New Features

- New `ggml_ssm_conv_tree_persist(ctx, sx, c, parent_ids, persist_inter)` op that writes each token’s (K-1)-element post-state to `persist_inter` (contiguous F32, shape [K-1, d_inner, n_tokens, n_seqs]).
- `ssm_conv` forward now supports tree mode via `parent_ids` and optionally emits per-token post-state; `gated_delta_net_one_chunk` supports `parent_ids` rollback and external state persistence (F32/F16).
- CUDA: adds a `WITH_PERSIST` path to tree-mode `ssm_conv` and plumbs `persist_inter` through `ggml_cuda_op_ssm_conv`.

Bug Fixes

- `ggml-cuda/fattn-chunked.cu`: handle scratch OOM by halving `tbq_chunk`, freeing failed buffers, and retrying instead of aborting (with a warning).

Written for commit c3692ea. Summary will update on new commits.
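For reviewers who have not seen the tree-mode conv before, the per-token post-state described under New Features can be sketched as a scalar reference in plain C++. This is not the ggml kernel: `ssm_conv_tree_ref`, `TreeConvResult`, the single-sequence restriction, the weight layout (oldest tap first), and the convention that a parent id of -1 means "start from the initial state" are all assumptions made for illustration. Only the `[K-1, d_inner, n_tokens]` persist layout is taken from the summary above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference result; names are illustrative, not ggml API.
struct TreeConvResult {
    std::vector<float> y;        // [d_inner, n_tokens]
    std::vector<float> persist;  // [K-1, d_inner, n_tokens], K-1 fastest
};

// Depthwise causal conv of width K over a token tree (single sequence).
// Each token reads the (K-1)-element post-state of its parent, not of
// the previous token in memory order. ASSUMPTION: parent_ids[t] == -1
// selects init_state, and parents always precede their children.
TreeConvResult ssm_conv_tree_ref(const std::vector<float>& init_state,  // [K-1, d_inner]
                                 const std::vector<float>& x,           // [d_inner, n_tokens]
                                 const std::vector<float>& w,           // [K, d_inner], oldest tap first
                                 const std::vector<int>& parent_ids,    // [n_tokens]
                                 int K, int d_inner) {
    assert(K >= 2 && d_inner >= 1);
    const size_t S = (size_t) K - 1;  // state length per channel
    const size_t n_tokens = parent_ids.size();
    TreeConvResult r;
    r.y.assign((size_t) d_inner * n_tokens, 0.0f);
    r.persist.assign(S * d_inner * n_tokens, 0.0f);

    for (size_t t = 0; t < n_tokens; ++t) {
        const int p = parent_ids[t];
        assert(p < (int) t);  // parent must already have been processed
        for (size_t ch = 0; ch < (size_t) d_inner; ++ch) {
            const float* prev = p < 0
                ? &init_state[S * ch]
                : &r.persist[S * (ch + (size_t) d_inner * (size_t) p)];
            float* post = &r.persist[S * (ch + (size_t) d_inner * t)];
            const float xt = x[ch + (size_t) d_inner * t];

            // Window is [prev[0..S-1], xt]; output is its dot product with w.
            float acc = w[S + (size_t) K * ch] * xt;
            for (size_t k = 0; k < S; ++k) {
                acc += w[k + (size_t) K * ch] * prev[k];
            }
            r.y[ch + (size_t) d_inner * t] = acc;

            // Post-state: drop the oldest element, append the new input.
            for (size_t k = 0; k + 1 < S; ++k) {
                post[k] = prev[k + 1];
            }
            post[S - 1] = xt;
        }
    }
    return r;
}
```

With K = 3, one channel, weights {1, 10, 100}, a zero initial state, inputs {1, 2, 3} and parents {-1, 0, 0}, tokens 1 and 2 are siblings: both start from token 0's post-state {0, 1}, so the outputs are 100, 210 and 310. In plain chain mode token 2 would instead continue from token 1's state. Keeping every token's post-state is what would let a DDTree/DFS decoder resume from whichever branch is accepted without recomputing the conv.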
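The OOM handling described under Bug Fixes follows a common halve-and-retry pattern, sketched below in host-side C++. It is a minimal illustration under stated assumptions, not the code in `fattn-chunked.cu`: `alloc_scratch_with_backoff`, `ChunkScratch`, the two-buffer layout, the `min_chunk` floor, and the injected allocator callbacks are all hypothetical. The real change presumably goes through the CUDA pool allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <functional>

// Hypothetical allocator hooks: alloc returns nullptr on failure.
using alloc_fn = std::function<void*(size_t)>;
using free_fn  = std::function<void(void*)>;

// Two scratch buffers sized by the chunk length (illustrative layout).
struct ChunkScratch {
    void*  k;
    void*  v;
    size_t chunk;  // chunk size that finally fit, 0 if nothing fit
};

// Try to allocate both buffers; on failure free whatever succeeded,
// warn, halve the chunk size and retry until min_chunk is undercut.
ChunkScratch alloc_scratch_with_backoff(size_t chunk, size_t min_chunk,
                                        size_t bytes_per_row,
                                        const alloc_fn& alloc,
                                        const free_fn& release) {
    assert(min_chunk > 0);
    while (chunk >= min_chunk) {
        void* k = alloc(chunk * bytes_per_row);
        void* v = k ? alloc(chunk * bytes_per_row) : nullptr;
        if (k && v) {
            return {k, v, chunk};
        }
        if (k) {
            release(k);  // do not leak the half of the pair that fit
        }
        std::fprintf(stderr,
                     "warning: scratch alloc failed at chunk=%zu, halving\n",
                     chunk);
        chunk /= 2;
    }
    return {nullptr, nullptr, 0};  // caller decides how to fail
}
```

Freeing the partially allocated pair before retrying matters: otherwise the smaller attempt competes with the leaked buffer for the same memory. A floor such as `min_chunk` keeps the loop from degrading into many tiny chunks; whether the actual fix has such a floor is not stated in the summary.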