Commit 151692c
Use max example seq_len when exporting Qwen3.5 MoE
The previous example used T=2, which caused AOTI to compile the
chunk_gated_delta_rule kernel for a single chunk (NT=1). At runtime,
prompts longer than 64 tokens (requiring NT>1 chunks) failed with
"Error resizing tensor at input 0". Using max_seq_len-1 as the
example ensures AOTI generalizes intermediate buffer sizes for the
full sequence length range.
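
The chunk arithmetic behind this failure can be sketched as follows. This is a minimal illustration, not the kernel's code: the chunk size of 64 is inferred from the "longer than 64 tokens" threshold above, and `CHUNK_SIZE` and `num_chunks` are hypothetical names.

```python
import math

# Assumption: chunk size of 64, inferred from the 64-token threshold in the message.
CHUNK_SIZE = 64


def num_chunks(seq_len: int, chunk_size: int = CHUNK_SIZE) -> int:
    """Return NT, the number of chunks needed to cover seq_len tokens."""
    return math.ceil(seq_len / chunk_size)


max_seq_len = 4096
old_example_len = 2                # previous export example (T=2)
new_example_len = max_seq_len - 1  # example used after this commit

print(num_chunks(old_example_len))  # 1: only the single-chunk case is seen at export
print(num_chunks(65))               # 2: a 65-token prompt already needs a second chunk
print(num_chunks(new_example_len))  # 64: the export example now spans many chunks
```

Under these assumptions, an example of T=2 only exercises NT=1, while max_seq_len-1 exercises the multi-chunk path.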
Comparison against original export (tq4_sdpa fused kernel)
on H100 (Qwen3.5-35B-A3B, HQQ-INT4, max_seq_len=4096, 5 runs median):
| Metric | Original (tq4_sdpa) | Baseline (Triton SDPA) |
|---|---|---|
| Decode tok/s | 68.4 | 61.7 |
| Prefill tok/s | 275.7 | 378.2 |
Baseline prefill is 1.37x faster; decode is 0.90x (tq4_sdpa's fused
decode kernel is faster than the tiled Triton SDPA at L_q=1). The
split-K commit addresses the decode gap.

1 parent 0e977fd, commit 151692c
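
The 1.37x and 0.90x figures quoted in the commit message follow directly from the reported medians. A short check, using only the numbers from the comparison (the variable names are illustrative):

```python
# Median throughputs (tok/s) copied from the comparison in the commit message.
original = {"decode": 68.4, "prefill": 275.7}  # tq4_sdpa export
baseline = {"decode": 61.7, "prefill": 378.2}  # Triton SDPA export

# Ratio of baseline throughput to original throughput for each phase.
ratios = {phase: baseline[phase] / original[phase] for phase in original}

for phase, ratio in ratios.items():
    print(f"{phase}: baseline is {ratio:.2f}x the original")
# decode: baseline is 0.90x the original
# prefill: baseline is 1.37x the original
```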
1 file changed
Lines changed: 7 additions & 2 deletions
[Diff body not preserved. Hunk layout: new lines 428-431 added; original lines 429-430 removed; new lines 433-435 added.]