Commit 579bec2
Add MoE benchmark for Triton fused_moe kernel vs eager/compile baselines
Sweeps prompt lengths [1..4095] with Qwen3.5-35B-A3B shapes (256 experts,
top-8, INT4 W4A16). Validates correctness against loop-based eager reference
at small M, benchmarks vectorized eager, torch.compile, and Triton fused_moe.
Handles OOM gracefully at large M where eager/compile dequantize all experts.
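
A minimal sketch of the kind of OOM-tolerant sweep described above, not the code from this commit. It uses only the standard library: `run_sweep`, `fake_eager`, and `fake_fused` are hypothetical names, the M values are illustrative, and `MemoryError` stands in for whatever exception the real backends raise (in a PyTorch script this would presumably be something like `torch.cuda.OutOfMemoryError`).

```python
import time


def run_sweep(backends, m_values, oom_errors=(MemoryError,)):
    """Time each backend at each M; store None where a backend runs out of memory."""
    results = {}
    for name, fn in backends.items():
        results[name] = {}
        for m in m_values:
            try:
                start = time.perf_counter()
                fn(m)
                results[name][m] = time.perf_counter() - start
            except oom_errors:
                # Record the failure and keep sweeping instead of aborting the run.
                results[name][m] = None
    return results


def fake_eager(m, limit=1024):
    # Hypothetical stand-in for a baseline that dequantizes every expert
    # up front and therefore fails once M grows past some limit.
    if m > limit:
        raise MemoryError(f"simulated OOM at M={m}")
    return sum(range(m))


def fake_fused(m):
    # Hypothetical stand-in for a kernel that never materializes all experts.
    return sum(range(m))


results = run_sweep({"eager": fake_eager, "fused": fake_fused}, [1, 512, 4095])
for name, timings in results.items():
    for m, seconds in timings.items():
        label = "OOM" if seconds is None else f"{seconds * 1e3:.3f} ms"
        print(f"{name:>5}  M={m:<5} {label}")
```

Storing `None` rather than raising keeps the results table rectangular, so the backends that do fit in memory can still be compared at large M.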
This PR was authored with the assistance of Claude.
1 parent 168794b
1 file changed
Lines changed: 423 additions & 0 deletions