perf(cuda): add in-place GDN state write by dusterbloom · Pull Request #24 · Luce-Org/lucebox-ggml

dusterbloom · 2026-06-28T14:33:45Z

What

Adds an opt-in in-place final-state write path for the CUDA gated-delta-net (GDN) op.

- New API: ggml_gated_delta_net_inplace().
- The in-place variant sets op_params[1] so CUDA writes the final recurrent state directly to src[5] (state) instead of the packed result tensor's final-state slot.
- Existing ggml_gated_delta_net() behavior is unchanged.
- The intermediate-skip / tensor-compaction path is intentionally left to the already-merged lucebox-ggml#26 API (ggml_gated_delta_net_set_skip_intermediate).

Why

The Lucebox pure-AR decode graph currently needs a downstream copy from the GDN packed result into
the persistent SSM state. Writing the final state directly removes that copy without changing
existing callers.

This is now a single-concern PR. The previous mixed version also included props-check skipping and
launch-wrapper work; that concern is split into #27.

Validation

Rebased on current luce-dflash after #26 (perf(qwen): skip unused GDN intermediate writes); base 070b7d7d.
Built CUDA backend locally with CUDA 12.6 and explicit SM86:
- env CUDACXX=/usr/local/cuda/bin/nvcc cmake -S /tmp/ggml-pr24-inplace -B /tmp/ggml-pr24-inplace/build-cuda126-sm86 -DGGML_CUDA=ON -DGGML_CCACHE=OFF -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_ARCHITECTURES=86
- cmake --build /tmp/ggml-pr24-inplace/build-cuda126-sm86 --target ggml-cuda -j 8
A first configure attempt without explicit CMAKE_CUDA_ARCHITECTURES=86 failed because this
environment cannot probe CUDA_ARCHITECTURES=native without a visible GPU.

github-actions bot added the ggml and CUDA labels Jun 28, 2026

dusterbloom force-pushed the perf/gdn-inplace-props-skip branch from 3e57037 to 9fc7329 Compare June 30, 2026 19:26

dusterbloom changed the title ~~perf(cuda): in-place GDN state write (pure-AR) + thread-local props-check skip~~ perf(cuda): add in-place GDN state write Jun 30, 2026

dusterbloom mentioned this pull request Jun 30, 2026

perf(qwen35): use in-place GDN state write in pure AR Luce-Org/lucebox-hub#473

Draft

perf(cuda): add in-place GDN state write

f321139

dusterbloom force-pushed the perf/gdn-inplace-props-skip branch from c279bc4 to f321139 Compare July 1, 2026 10:01

dusterbloom mentioned this pull request Jul 1, 2026

perf(cuda): recover qwen dense decode path #27

Open

dusterbloom wants to merge 1 commit into Luce-Org:luce-dflash from dusterbloom:perf/gdn-inplace-props-skip
