Commit 217ad45
AutoTune MoE kernel block sizes to accelerate inference (#18551)
## Summary
This PR introduces Triton autotuning for MoE kernels, improving Qwen3.5
MoE model inference from **66.8 token/s → 77.7 token/s**.
## Motivation
Profiling the Qwen3.5 MoE model (prior to GQA/MQA support in Triton
SDPA) shows MoE dominates GPU time:
| Category | Total (ms) | % GPU |
|---|---|---|
| **MoE** | **1,420** | **54.7%** |
| Triton fused ops | 433 | 16.7% |
| SDPA | 288 | 11.1% |
| int4mm | 240 | 9.2% |
| chunk_gated_delta_rule | 151 | 5.8% |
| Router | 65 | 2.5% |
The `fused_moe` kernel is the single largest bottleneck, making it the
highest-leverage optimization target.
## Approach
Due to hardware constraints, exhaustive autotuning at `aoti-compile`
time is impractical. Instead, we:
1. **Benchmarked** all hyperparameter combinations for MoE kernels on an
A100 server ([full
results](https://gist.github.com/Gasoonjia/baae2475684d1246c82865ff5cbd949d))
2. **Selected** the top-5 configurations plus the original `(N=32,
K=32)` baseline
3. **Registered** them as `@triton.autotune` configs for the MoE kernels
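The offline-benchmark-then-prune flow above can be sketched in a few lines. This is a hypothetical illustration, not the PR's actual code: the config tuples, timings, and the `select_autotune_configs` helper are invented for the example, and the `@triton.autotune` registration is shown only schematically in a comment.

```python
def select_autotune_configs(bench_results, baseline, top_n=5):
    """Pick the top_n fastest configs from offline benchmark results,
    always keeping the original baseline config in the list.

    bench_results: {config_tuple: latency_us} measured offline (e.g. on A100).
    baseline: the original config tuple, e.g. (N=32, K=32) block sizes.
    """
    ranked = sorted(bench_results, key=bench_results.get)  # fastest first
    configs = ranked[:top_n]
    if baseline not in configs:
        configs.append(baseline)
    return configs

# With Triton available, the pruned list would then be registered
# roughly like this (schematic, assuming (block_n, block_k, warps, stages)
# tuples; parameter names are illustrative):
#
#   @triton.autotune(
#       configs=[
#           triton.Config({"BLOCK_SIZE_N": n, "BLOCK_SIZE_K": k},
#                         num_warps=w, num_stages=s)
#           for (n, k, w, s) in selected_configs
#       ],
#       key=["N", "K"],
#   )
#   @triton.jit
#   def fused_moe_kernel(...):
#       ...
```

Pruning to a handful of pre-vetted configs keeps the runtime autotuning cost low: Triton only times the registered candidates on first launch per `key`, rather than searching the full hyperparameter space at `aoti-compile` time.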
## Results — MoE Kernel
| Kernel | Best Config | Baseline | Best | Improvement |
|---|---|---|---|---|
| GEMM1 | `(8, 256, w2, s2)` | 60.4 µs | 32.8 µs | **45.8% faster** |
| GEMM2 | `(8, 128, w2, s4)` | 29.2 µs | 26.1 µs | **10.6% faster** |
**MoE kernel overall: 89.6 µs → 58.9 µs (34.3% improvement)**
## Results — End-to-End Inference
| | Token/s |
|---|---|
| Baseline | 66.8 |
| With this PR | **77.7** |

1 parent 186eb4b, commit 217ad45
2 files changed
Lines changed: 37 additions & 9 deletions