Commit 579bec2

Add MoE benchmark for Triton fused_moe kernel vs eager/compile baselines

Sweeps prompt lengths [1..4095] with Qwen3.5-35B-A3B shapes (256 experts, top-8, INT4 W4A16). Validates correctness against a loop-based eager reference at small M, then benchmarks vectorized eager, torch.compile, and the Triton fused_moe kernel. Handles OOM gracefully at large M, where the eager and torch.compile baselines dequantize all experts. This PR was authored with the assistance of Claude.
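The OOM handling described above could look roughly like the sketch below. It is not the code in benchmark_moe.py; the names `time_or_skip` and `sweep` and the `oom_exc` parameter are hypothetical, chosen only for illustration. The sketch uses the standard library alone and catches `MemoryError` by default; on a CUDA run one would presumably pass `torch.cuda.OutOfMemoryError` instead and call `torch.cuda.synchronize()` before reading the clock.

```python
import time
from functools import partial
from typing import Callable, Dict, List, Optional, Tuple, Type


def time_or_skip(
    fn: Callable[[], object],
    warmup: int = 2,
    iters: int = 5,
    oom_exc: Tuple[Type[BaseException], ...] = (MemoryError,),
) -> Optional[float]:
    """Return the median wall-clock time of fn in ms, or None on OOM.

    Hypothetical helper: oom_exc is an assumption so the sketch runs
    without a GPU; a CUDA benchmark would pass the CUDA OOM exception.
    """
    samples: List[float] = []
    try:
        for _ in range(warmup):
            fn()  # warm-up calls are not timed
        for _ in range(iters):
            start = time.perf_counter()
            fn()
            samples.append((time.perf_counter() - start) * 1e3)
    except oom_exc:
        return None  # record the backend as skipped at this size
    samples.sort()
    return samples[len(samples) // 2]


def sweep(
    backends: Dict[str, Callable[[int], object]],
    prompt_lens: List[int],
) -> Dict[int, Dict[str, Optional[float]]]:
    """Time every backend at every prompt length M; None marks an OOM skip."""
    results: Dict[int, Dict[str, Optional[float]]] = {}
    for m in prompt_lens:
        results[m] = {
            name: time_or_skip(partial(fn, m)) for name, fn in backends.items()
        }
    return results
```

With this shape, a baseline that runs out of memory at large M shows up as `None` in the results table instead of aborting the whole sweep, so the remaining backends still get measured.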

1 parent 168794b, commit 579bec2

1 file changed

backends/cuda/benchmarks
- benchmark_moe.py
