Commit e06db27
Use max example seq_len when exporting Qwen3.5 MoE
The previous example used T=2, which caused AOTI to compile the
chunk_gated_delta_rule kernel for a single chunk (NT=1). At runtime,
prompts longer than 64 tokens (requiring NT>1 chunks) failed with
"Error resizing tensor at input 0". Using max_seq_len-1 as the
example ensures AOTI generalizes intermediate buffer sizes for the
full sequence length range.
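
As a rough illustration of why the example length matters, the chunk count can be sketched as a ceiling division. The chunk size of 64 is an assumption inferred from the 64-token threshold mentioned above, and `num_chunks` is a hypothetical helper, not a function from the export script.

```python
# Assumption: the chunked kernel processes 64 tokens per chunk, inferred from
# the message's note that prompts longer than 64 tokens need NT > 1.
CHUNK_SIZE = 64


def num_chunks(seq_len: int, chunk_size: int = CHUNK_SIZE) -> int:
    """Hypothetical helper: number of chunks (NT) for a sequence of seq_len tokens."""
    if seq_len < 1:
        raise ValueError("seq_len must be positive")
    # Integer ceiling division, avoids floating-point rounding.
    return (seq_len + chunk_size - 1) // chunk_size


max_seq_len = 4096  # value quoted in the benchmark setup

old_example_nt = num_chunks(2)                # previous export example, T=2
new_example_nt = num_chunks(max_seq_len - 1)  # new export example, T=max_seq_len-1
short_prompt_nt = num_chunks(64)              # longest prompt that still fits one chunk
long_prompt_nt = num_chunks(65)               # first prompt length that needs a second chunk

print(old_example_nt, new_example_nt, short_prompt_nt, long_prompt_nt)
```

Under this assumption the old example exercises only the single-chunk case, while the new example covers a multi-chunk shape close to the upper end of the range.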
Comparison against original export (tq4_sdpa fused kernel)
on H100 (Qwen3.5-35B-A3B, HQQ-INT4, max_seq_len=4096, 5 runs median):
                Original (tq4_sdpa)   Baseline (Triton SDPA)
Decode tok/s    68.4                  61.7
Prefill tok/s   275.7                 378.2
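
The relative throughput implied by these measurements can be checked with a few lines of arithmetic. This is only a sketch: the dictionary layout and the `ratio` helper are illustrative, and the numbers are copied from the table.

```python
# Median throughput in tokens/s, copied from the table above.
decode = {"original": 68.4, "baseline": 61.7}
prefill = {"original": 275.7, "baseline": 378.2}


def ratio(metric: dict) -> float:
    """Baseline throughput relative to the original export (hypothetical helper)."""
    return metric["baseline"] / metric["original"]


prefill_ratio = round(ratio(prefill), 2)
decode_ratio = round(ratio(decode), 2)

print(f"prefill: {prefill_ratio:.2f}x, decode: {decode_ratio:.2f}x")
```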
Baseline prefill is 1.37x faster; decode is 0.90x (tq4_sdpa's fused
decode kernel is faster than the tiled Triton SDPA at L_q=1). The
split-K commit addresses the decode gap.

1 parent 82641e8 commit e06db27
1 file changed
Lines changed: 7 additions & 3 deletions
@@ -398,9 +398,13 @@ original lines 401-403 removed, new lines 401-407 added