Commit a6ebe8a
Runtime dispatch: recurrent (T=1) vs chunked (T>1) inside triton_op
Move decode/prefill dispatch inside the chunk_gated_delta_rule triton_op
instead of using torch.cond at model level. This follows the same pattern
as the SDPA triton_op (pow2/non-pow2 dispatch) and avoids torch.cond's
incompatibility with AOTI's FunctionalTensor pipeline.
Changes:
- chunk_gated_delta_rule.py: Add fused recurrent Triton kernel for T=1,
refactor chunked pipeline into _launch_chunked(), dispatch via Python
if inside the @triton_op wrapper
- model.py: Remove torch.cond from GatedDeltaNet.forward(), call
triton_op directly (dispatch is internal)
- export.py: Single-method export with dynamic seq_len dim
- main.cpp: Fix create_text_llm_runner API signature

1 parent 5465d8b · commit a6ebe8a
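The dispatch described above can be sketched as follows. This is a minimal illustration of the pattern, not the real kernels: the two launch functions are hypothetical plain-Python stand-ins for the fused recurrent Triton kernel and the chunked pipeline, and the wrapper shows only the runtime branch on sequence length, which is resolved per call with an ordinary Python `if` rather than a torch.cond node in the exported graph.

```python
def _launch_recurrent(seq):
    """Hypothetical stand-in for the fused recurrent kernel (decode, T == 1)."""
    return ("recurrent", [v * 2 for v in seq])

def _launch_chunked(seq):
    """Hypothetical stand-in for the chunked pipeline (prefill, T > 1)."""
    return ("chunked", [v * 2 for v in seq])

def chunk_gated_delta_rule(seq):
    # Plain Python `if` inside the op wrapper: the branch is taken at
    # runtime on the call's actual sequence length, so no torch.cond
    # appears in the traced graph and AOTI sees a single opaque op.
    T = len(seq)
    if T == 1:
        return _launch_recurrent(seq)
    return _launch_chunked(seq)

print(chunk_gated_delta_rule([3])[0])        # decode step (T == 1)
print(chunk_gated_delta_rule([1, 2, 3])[0])  # prefill (T > 1)
```

Because the branch lives inside the op boundary, the model-level forward can call the op unconditionally for both decode and prefill, which is what enables the single-method export with a dynamic seq_len dimension.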
File tree

4 files changed (+272 −202 lines changed)
- backends/cuda/triton/kernels
- examples/models/qwen3_5_moe