perf(cuda): add in-place GDN state write #24

Open
dusterbloom wants to merge 1 commit into Luce-Org:luce-dflash from dusterbloom:perf/gdn-inplace-props-skip
dusterbloom commented Jun 28, 2026

What

Adds an opt-in in-place final-state write path for the CUDA gated-delta-net (GDN) op.

  • New API: ggml_gated_delta_net_inplace().
  • The in-place variant sets op_params[1] so CUDA writes the final recurrent state directly to
    src[5] (state) instead of the packed result tensor's final-state slot.
  • Existing ggml_gated_delta_net() behavior is unchanged.
  • The intermediate-skip / tensor-compaction path is intentionally left to the already-merged
    lucebox-ggml#26 API (ggml_gated_delta_net_set_skip_intermediate).
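
The destination choice described in the list above can be sketched as follows. This is only an illustration: `toy_tensor`, `toy_final_state_dst`, and the `final_state_offset` parameter are hypothetical stand-ins, not the real ggml tensor layout or the CUDA code in this PR. The only details taken from the description are that `op_params[1]` acts as the in-place flag and that `src[5]` is the state tensor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a tensor; NOT the real ggml_tensor layout. */
enum { TOY_MAX_SRC = 6, TOY_MAX_PARAMS = 4 };

typedef struct toy_tensor {
    int32_t            op_params[TOY_MAX_PARAMS]; /* op_params[1] != 0: in-place (per the PR text) */
    struct toy_tensor *src[TOY_MAX_SRC];          /* src[5] is the recurrent state (per the PR text) */
    float             *data;
} toy_tensor;

/* Return the address the final recurrent state should be written to.
 * final_state_offset (in floats) is an ASSUMED position of the final-state
 * slot inside the packed result; the real packing is not shown in this PR. */
static float *toy_final_state_dst(const toy_tensor *dst, size_t final_state_offset) {
    assert(dst != NULL && dst->data != NULL);
    if (dst->op_params[1] != 0) {
        /* In-place variant: write straight into the persistent state tensor. */
        assert(dst->src[5] != NULL && dst->src[5]->data != NULL);
        return dst->src[5]->data;
    }
    /* Default variant: write into the packed result's final-state slot. */
    return dst->data + final_state_offset;
}
```

With the flag left at zero the function returns the slot inside the packed result, which matches the statement that existing callers see no change.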

Why

The Lucebox pure-AR decode graph currently needs a downstream copy from the GDN packed result into
the persistent SSM state. Writing the final state directly removes that copy without changing
existing callers.
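
As a rough model of the saving, the sketch below runs a toy decode loop both ways and counts the per-step copies. Everything here is assumed for illustration: `toy_step` is a placeholder recurrence rather than the GDN math, and the packed layout (outputs first, final state after) and the `toy_*` names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

enum { TOY_STATE_LEN = 4, TOY_OUT_LEN = 2 };

/* Placeholder recurrence (NOT the real GDN update). Safe when
 * state_in == state_out because each element is read before it is written. */
static void toy_step(const float *state_in, float x, float *out, float *state_out) {
    for (size_t i = 0; i < TOY_STATE_LEN; ++i) {
        state_out[i] = 0.5f * state_in[i] + x;
    }
    for (size_t i = 0; i < TOY_OUT_LEN; ++i) {
        out[i] = state_out[i];
    }
}

/* Default path: the final state lands in the packed result and is then
 * copied into the persistent state. Returns the number of copies made. */
static int toy_decode_packed(float *persistent, const float *xs, size_t n) {
    float packed[TOY_OUT_LEN + TOY_STATE_LEN];
    int copies = 0;
    for (size_t t = 0; t < n; ++t) {
        toy_step(persistent, xs[t], packed, packed + TOY_OUT_LEN);
        memcpy(persistent, packed + TOY_OUT_LEN, sizeof(float) * TOY_STATE_LEN);
        ++copies;
    }
    return copies;
}

/* In-place path: the final state is written directly to the persistent
 * state, so no downstream copy is needed. */
static int toy_decode_inplace(float *persistent, const float *xs, size_t n) {
    float out[TOY_OUT_LEN];
    for (size_t t = 0; t < n; ++t) {
        toy_step(persistent, xs[t], out, persistent);
    }
    return 0;
}
```

Both loops leave the same values in `persistent`; the default path performs one copy per decode step and the in-place path performs none.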

This is now a single-concern PR. The previous mixed version also included props-check skipping and
launch-wrapper work; that concern is split into #27.

Validation

  • Rebased on current luce-dflash after #26 (perf(qwen): skip unused GDN intermediate writes); base 070b7d7d.
  • Built CUDA backend locally with CUDA 12.6 and explicit SM86:
    • env CUDACXX=/usr/local/cuda/bin/nvcc cmake -S /tmp/ggml-pr24-inplace -B /tmp/ggml-pr24-inplace/build-cuda126-sm86 -DGGML_CUDA=ON -DGGML_CCACHE=OFF -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_ARCHITECTURES=86
    • cmake --build /tmp/ggml-pr24-inplace/build-cuda126-sm86 --target ggml-cuda -j 8
  • A first configure attempt without explicit CMAKE_CUDA_ARCHITECTURES=86 failed because this
    environment cannot probe CUDA_ARCHITECTURES=native without a visible GPU.

dusterbloom force-pushed the perf/gdn-inplace-props-skip branch from 3e57037 to 9fc7329 on June 30, 2026 19:26
dusterbloom changed the title from "perf(cuda): in-place GDN state write (pure-AR) + thread-local props-check skip" to "perf(cuda): add in-place GDN state write" on Jun 30, 2026