Enable fused softmax/sigmoid + topk path for 1024 experts #252
JianyuLi01 wants to merge 2 commits into
Pull request overview
Adds explicit support for the fused topk gating kernel path when num_experts == 1024, aiming to improve MoE routing performance on XPU for large expert counts.
Changes:
- Adds a case 1024 specialization in topk_gating_kernel_launcher to dispatch to LAUNCH_TOPK(1024, ...).
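For readers unfamiliar with the operation being fused, the following is a minimal CPU reference sketch, assuming the usual MoE routing semantics (softmax over one token's expert logits, then selection of the k most probable experts). The function name `topk_softmax_reference` is hypothetical and is not part of this PR; the real kernel runs on XPU and may differ in tie-breaking and renormalization details.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical reference (not the PR's kernel): softmax over the logits of
// one token, followed by top-k selection. Requires a non-empty `logits`
// vector and 0 < k <= logits.size().
std::vector<std::pair<int, float>> topk_softmax_reference(
    const std::vector<float>& logits, int k) {
  // Subtract the maximum logit for numerical stability.
  const float max_logit = *std::max_element(logits.begin(), logits.end());
  std::vector<float> probs(logits.size());
  float sum = 0.0f;
  for (std::size_t i = 0; i < logits.size(); ++i) {
    probs[i] = std::exp(logits[i] - max_logit);
    sum += probs[i];
  }
  for (float& p : probs) p /= sum;

  // Order expert indices by descending probability; ties go to the lower
  // index (an assumption made here for determinism).
  std::vector<int> idx(logits.size());
  std::iota(idx.begin(), idx.end(), 0);
  std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                    [&](int a, int b) {
                      return probs[a] > probs[b] ||
                             (probs[a] == probs[b] && a < b);
                    });

  // Return (expert index, routing weight) pairs for the k selected experts.
  std::vector<std::pair<int, float>> out;
  out.reserve(static_cast<std::size_t>(k));
  for (int i = 0; i < k; ++i) out.emplace_back(idx[i], probs[idx[i]]);
  return out;
}
```

A fused kernel performs both steps in one pass without writing the full probability vector to memory, which is why it can avoid an intermediate workspace.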
    case 1024:
      LAUNCH_TOPK(
          1024, WARPS_PER_TB, BYTES_PER_LDG_POWER_OF_2, ScoringFuncParam);
      break;
With the new 1024-expert fast path, topk_softmax/topk_sigmoid will still allocate scoring_workspace because needs_workspace = !is_pow_2 || num_experts > 256 (see topk_softmax/topk_sigmoid). For num_experts == 1024 the switch now takes the fused LAUNCH_TOPK(1024, ...) path, which doesn’t use scoring_workspace, so this becomes avoidable extra device memory traffic/pressure for the exact case this PR is optimizing. Consider tightening needs_workspace (e.g., allocate only for the default path) or introducing a small helper that mirrors the switch cases to decide when workspace is actually needed.
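The helper suggested above could look roughly like the sketch below. It assumes the fused switch cases are the powers of two up to 256 plus the new 1024 case, which is inferred from the existing needs_workspace condition and not confirmed by the launcher source in this thread. The names `has_fused_topk_case` and `needs_scoring_workspace` are hypothetical.

```cpp
// Hypothetical sketch: decide whether scoring_workspace is required by
// mirroring the launcher's switch, not by re-deriving a separate condition.

constexpr bool is_pow_2(int n) { return n > 0 && (n & (n - 1)) == 0; }

// Assumption: fused cases are powers of two up to 256, plus 1024.
// This list must be kept in sync with the real switch statement.
constexpr bool has_fused_topk_case(int num_experts) {
  return (is_pow_2(num_experts) && num_experts <= 256) || num_experts == 1024;
}

// Workspace is only needed when the default (non-fused) path is taken.
constexpr bool needs_scoring_workspace(int num_experts) {
  return !has_fused_topk_case(num_experts);
}

// Compile-time checks of the intended behavior.
static_assert(!needs_scoring_workspace(256), "fused path, no workspace");
static_assert(needs_scoring_workspace(512), "default path, workspace needed");
static_assert(!needs_scoring_workspace(1024), "new fused path, no workspace");
```

Keeping the predicate next to the switch makes it harder for the allocation logic and the dispatch logic to drift apart when further cases are added.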
@jerrychenhf PTAL.
According to measurements, the fused path delivers better performance when the number of experts is 1024.
1 token + 1024 experts: average uplift ~3%
64 tokens + 1024 experts: average uplift ~6%
128 tokens + 1024 experts: average uplift ~7%
256 tokens + 1024 experts: average uplift ~45%
Current MoE models do not yet use as many as 1024 experts. However, when customers compare performance at 1024 experts, this optimization yields better numbers.
Signed-off-by: LiJianyu <jianyu.li@intel.com>
Is there any model using 1024 experts?
Thanks for checking.
According to measurements on B60, the fused path delivers better performance when the number of experts is 1024.
1 token + 1024 experts: average uplift ~3%
64 tokens + 1024 experts: average uplift ~6%
128 tokens + 1024 experts: average uplift ~7%
256 tokens + 1024 experts: average uplift ~45%
Current MoE models do not yet use as many as 1024 experts. However, when customers compare performance at 1024 experts, this optimization yields better numbers.
P.S. benchmark/benchmark_topk.py already covers the 1024-expert case.
Purpose
Improve the performance of topk_softmax/topk_sigmoid
Test Plan
Test Result
(Optional) Documentation Update