Enable fused softmax/sigmoid + topk path for 1024 experts #252

Open
JianyuLi01 wants to merge 2 commits into vllm-project:main from JianyuLi01:main

Conversation

@JianyuLi01

Measurements on B60 show that the fused path performs better when the number of experts is 1024:
1 token + 1024 experts: average uplift ~3%
64 tokens + 1024 experts: average uplift ~6%
128 tokens + 1024 experts: average uplift ~7%
256 tokens + 1024 experts: average uplift ~45%
No current MoE model uses as many as 1024 experts. However, when customers benchmark at 1024 experts, this optimization gives better results.
P.S. benchmark/benchmark_topk.py already covers the 1024-expert case.

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS ABOVE HAVE BEEN CONSIDERED.

Purpose

Improve the performance of topk_softmax/topk_sigmoid when the number of experts is 1024.

Test Plan

Test Result

(Optional) Documentation Update

BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing (anything written below this line will be removed by GitHub Actions)

Copilot AI review requested due to automatic review settings April 3, 2026 05:08
Contributor

Copilot AI left a comment

Pull request overview

Adds explicit support for the fused topk gating kernel path when num_experts == 1024, aiming to improve MoE routing performance on XPU for large expert counts.

Changes:

  • Adds a case 1024 specialization in topk_gating_kernel_launcher to dispatch to LAUNCH_TOPK(1024, ...).
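The dispatch pattern described in the overview can be sketched as follows. This is a minimal illustration, not the code in csrc/moe/topk.cpp: `launch_fused_sketch`, `launch_fallback_sketch`, and `dispatch_topk_sketch` are hypothetical names, the case list is an assumption, and the "launch" only returns the compile-time expert count so the routing can be observed.

```cpp
#include <cassert>

// Hypothetical stand-in for a fused kernel launch. The expert count is a
// template parameter, so it is known at compile time.
template <int kNumExperts>
int launch_fused_sketch() {
  static_assert(kNumExperts > 0, "expert count must be positive");
  return kNumExperts;  // A real launcher would enqueue a kernel here.
}

// Hypothetical stand-in for the generic (non-fused) fallback path.
inline int launch_fallback_sketch() { return -1; }

// Maps the runtime expert count onto a compile-time specialization.
// The case list is an assumption for illustration, not the real switch.
inline int dispatch_topk_sketch(int num_experts) {
  assert(num_experts > 0);
  switch (num_experts) {
    case 64:
      return launch_fused_sketch<64>();
    case 128:
      return launch_fused_sketch<128>();
    case 256:
      return launch_fused_sketch<256>();
    case 1024:  // The kind of specialization this PR adds.
      return launch_fused_sketch<1024>();
    default:
      return launch_fallback_sketch();
  }
}
```

Under these assumptions, `dispatch_topk_sketch(1024)` reaches the specialized path, while an unlisted count such as 96 falls through to the fallback.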

Comment thread csrc/moe/topk.cpp
Comment on lines +741 to +744
case 1024:
  LAUNCH_TOPK(
      1024, WARPS_PER_TB, BYTES_PER_LDG_POWER_OF_2, ScoringFuncParam);
  break;

Copilot AI Apr 3, 2026

With the new 1024-expert fast path, topk_softmax/topk_sigmoid still allocate scoring_workspace, because needs_workspace = !is_pow_2 || num_experts > 256. For num_experts == 1024 the switch now takes the fused LAUNCH_TOPK(1024, ...) path, which does not use scoring_workspace, so the allocation adds avoidable device memory traffic/pressure in exactly the case this PR optimizes. Consider tightening needs_workspace (e.g., allocate only for the default path) or introducing a small helper that mirrors the switch cases to decide when the workspace is actually needed.
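One possible shape for such a helper is sketched below. Everything here is illustrative: `is_pow_2_sketch`, `has_fused_path_sketch`, `needs_workspace_current`, and `needs_workspace_tightened` are hypothetical names, and the list of fused expert counts (powers of two up to 256, plus 1024) is an assumption inferred from the condition quoted above rather than taken from the actual switch.

```cpp
#include <cassert>

// True when n is a positive power of two.
inline bool is_pow_2_sketch(int n) { return n > 0 && (n & (n - 1)) == 0; }

// Assumed set of expert counts that have a fused specialization. In real
// code this list would have to be kept in sync with the launcher's switch.
inline bool has_fused_path_sketch(int num_experts) {
  assert(num_experts > 0);
  switch (num_experts) {
    case 1:
    case 2:
    case 4:
    case 8:
    case 16:
    case 32:
    case 64:
    case 128:
    case 256:
    case 1024:
      return true;
    default:
      return false;
  }
}

// The condition quoted in the review comment.
inline bool needs_workspace_current(int num_experts) {
  return !is_pow_2_sketch(num_experts) || num_experts > 256;
}

// Tightened variant: allocate only when the default path will run.
inline bool needs_workspace_tightened(int num_experts) {
  return !has_fused_path_sketch(num_experts);
}
```

With these definitions the two predicates agree everywhere except at 1024, where the current condition still requests a workspace and the tightened one does not.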

@jikunshang
Member

@jerrychenhf PTAL.

Measurements show that the fused path performs better when the number of experts is 1024:
1 token + 1024 experts: average uplift ~3%
64 tokens + 1024 experts: average uplift ~6%
128 tokens + 1024 experts: average uplift ~7%
256 tokens + 1024 experts: average uplift ~45%
No current MoE model uses as many as 1024 experts. However, when customers benchmark at 1024 experts, this optimization gives better results.

Signed-off-by: LiJianyu <jianyu.li@intel.com>
@jikunshang
Member

Is there any model using 1024 experts?

@JianyuLi01
Author

Thanks for checking.
We don't really see models with 1024 experts today. This change only delivers better performance for vllm-xpu-kernels/benchmark/benchmark_topk.py.
