Commit a6ebe8a
Runtime dispatch: recurrent (T=1) vs chunked (T>1) inside triton_op
Move decode/prefill dispatch inside the chunk_gated_delta_rule triton_op
instead of using torch.cond at model level. This follows the same pattern
as the SDPA triton_op (pow2/non-pow2 dispatch) and avoids torch.cond's
incompatibility with AOTI's FunctionalTensor pipeline.
Changes:
- chunk_gated_delta_rule.py: Add fused recurrent Triton kernel for T=1,
refactor chunked pipeline into _launch_chunked(), dispatch via Python
if inside the @triton_op wrapper
- model.py: Remove torch.cond from GatedDeltaNet.forward(), call
triton_op directly (dispatch is internal)
- export.py: Single-method export with dynamic seq_len dim
- main.cpp: Fix create_text_llm_runner API signature

1 parent 5465d8b · commit a6ebe8a
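The dispatch described above can be sketched as follows. This is a minimal illustration of the pattern, not the real kernels: the two launch functions are hypothetical plain-Python stand-ins for the fused recurrent Triton kernel and the chunked pipeline, and the wrapper shows only the runtime branch on sequence length, which is resolved per call with an ordinary Python `if` rather than a torch.cond node in the exported graph.

```python
def _launch_recurrent(seq):
    """Hypothetical stand-in for the fused recurrent kernel (decode, T == 1)."""
    return ("recurrent", [v * 2 for v in seq])

def _launch_chunked(seq):
    """Hypothetical stand-in for the chunked pipeline (prefill, T > 1)."""
    return ("chunked", [v * 2 for v in seq])

def chunk_gated_delta_rule(seq):
    # Plain Python `if` inside the op wrapper: the branch is taken at
    # runtime on the call's actual sequence length, so no torch.cond
    # appears in the traced graph and AOTI sees a single opaque op.
    T = len(seq)
    if T == 1:
        return _launch_recurrent(seq)
    return _launch_chunked(seq)

print(chunk_gated_delta_rule([3])[0])        # decode step (T == 1)
print(chunk_gated_delta_rule([1, 2, 3])[0])  # prefill (T > 1)
```

Because the branch lives inside the op boundary, the model-level forward can call the op unconditionally for both decode and prefill, which is what enables the single-method export with a dynamic seq_len dimension.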
File tree

4 files changed (+272 −202 lines changed)
- backends/cuda/triton/kernels
- examples/models/qwen3_5_moe