Commit ecca812
committed
Use max example seq_len when exporting Qwen3.5 MoE
The previous example used T=2, which caused AOTI to compile the
chunk_gated_delta_rule kernel for a single chunk (NT=1). At runtime,
prompts longer than 64 tokens (requiring NT>1 chunks) failed with
"Error resizing tensor at input 0". Using max_seq_len-1 as the
example ensures AOTI generalizes intermediate buffer sizes for the
full sequence length range.
Comparison against original export (tq4_sdpa fused kernel)
on H100 (Qwen3.5-35B-A3B, HQQ-INT4, max_seq_len=4096, 5 runs median):
Original (tq4_sdpa) Baseline (Triton SDPA)
Decode tok/s 68.4 61.7
Prefill tok/s 275.7 378.2
Baseline prefill is 1.37x faster; decode is 0.90x (tq4_sdpa's fused
decode kernel is faster than the tiled Triton SDPA at L_q=1). The
split-K commit addresses the decode gap.
This PR was authored with the assistance of Claude1 parent 40e361b commit ecca812
1 file changed
Lines changed: 7 additions & 2 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
626 | 626 | | |
627 | 627 | | |
628 | 628 | | |
| 629 | + | |
| 630 | + | |
| 631 | + | |
| 632 | + | |
629 | 633 | | |
630 | | - | |
631 | | - | |
| 634 | + | |
| 635 | + | |
| 636 | + | |
632 | 637 | | |
633 | 638 | | |
634 | 639 | | |
| |||
0 commit comments