Enable fused softmax/sigmoid + topk path for 1024 experts by JianyuLi01 · Pull Request #252 · vllm-project/vllm-xpu-kernels

JianyuLi01 · 2026-04-03T05:08:24Z

Measurements on B60 show that the fused path delivers better performance when the number of experts is 1024.
1 token + 1024 experts: average uplift ~3%
64 tokens + 1024 experts: average uplift ~6%
128 tokens + 1024 experts: average uplift ~7%
256 tokens + 1024 experts: average uplift ~45%
Current MoE models do not yet support as many as 1024 experts. However, when customers compare performance at 1024 experts, this optimization can provide better performance metrics.
p.s. benchmark/benchmark_topk.py already covers the performance of 1024 experts.

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.


Purpose

Improve the performance of topk_softmax/topk_sigmoid

Test Plan

Test Result

(Optional) Documentation Update

BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing (anything written below this line will be removed by GitHub Actions)

Copilot

Pull request overview

Adds explicit support for the fused topk gating kernel path when num_experts == 1024, aiming to improve MoE routing performance on XPU for large expert counts.

Changes:

Adds a case 1024 specialization in topk_gating_kernel_launcher to dispatch to LAUNCH_TOPK(1024, ...).


Copilot · 2026-04-03T05:11:14Z

+    case 1024:
+      LAUNCH_TOPK(
+          1024, WARPS_PER_TB, BYTES_PER_LDG_POWER_OF_2, ScoringFuncParam);
+      break;


With the new 1024-expert fast path, topk_softmax/topk_sigmoid will still allocate scoring_workspace because needs_workspace = !is_pow_2 || num_experts > 256 (see topk_softmax/topk_sigmoid). For num_experts == 1024 the switch now takes the fused LAUNCH_TOPK(1024, ...) path, which doesn’t use scoring_workspace, so this becomes avoidable extra device memory traffic/pressure for the exact case this PR is optimizing. Consider tightening needs_workspace (e.g., allocate only for the default path) or introducing a small helper that mirrors the switch cases to decide when workspace is actually needed.
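The "small helper that mirrors the switch cases" suggested above could look roughly like the sketch below. This is only an illustration, not code from the PR: the names `has_fused_topk_path` and `needs_scoring_workspace` are hypothetical, and the list of fused cases (powers of two up to 256, plus the new 1024) is an assumption inferred from the quoted `needs_workspace` condition. It would need to be checked against the real switch in `topk_gating_kernel_launcher`.

```cpp
// Hypothetical helper (not part of this PR): reports whether num_experts is
// handled by a fused LAUNCH_TOPK specialization.
// ASSUMPTION: the case list below is inferred from the existing condition
// `!is_pow_2 || num_experts > 256` plus the new 1024 case; it must be kept
// in sync with the actual switch in topk_gating_kernel_launcher.
constexpr bool has_fused_topk_path(int num_experts) {
  switch (num_experts) {
    case 1:
    case 2:
    case 4:
    case 8:
    case 16:
    case 32:
    case 64:
    case 128:
    case 256:
    case 1024:  // fast path added by this PR
      return true;
    default:
      return false;
  }
}

// Hypothetical replacement for the needs_workspace expression: only the
// default (non-fused) path is assumed to use scoring_workspace.
constexpr bool needs_scoring_workspace(int num_experts) {
  return !has_fused_topk_path(num_experts);
}

// Compile-time checks for the cases discussed in this review.
static_assert(!needs_scoring_workspace(1024), "1024 takes the fused path");
static_assert(!needs_scoring_workspace(256), "256 takes the fused path");
static_assert(needs_scoring_workspace(512), "512 assumed to use the default path");
```

With a helper like this, `topk_softmax`/`topk_sigmoid` could skip the allocation for `num_experts == 1024`. The drawback is a second list of cases to maintain next to the switch.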

jikunshang · 2026-04-07T00:15:41Z

@jerrychenhf PTAL.

Measurements show that the fused path delivers better performance when the number of experts is 1024.
1 token + 1024 experts: average uplift ~3%
64 tokens + 1024 experts: average uplift ~6%
128 tokens + 1024 experts: average uplift ~7%
256 tokens + 1024 experts: average uplift ~45%
Current MoE models do not yet support as many as 1024 experts. However, when customers compare performance at 1024 experts, this optimization can provide better performance metrics.
Signed-off-by: LiJianyu <jianyu.li@intel.com>

jikunshang · 2026-04-14T12:40:36Z

Is there any model using 1024 experts?

JianyuLi01 · 2026-04-15T00:53:11Z

Thanks for checking.
We don't really see models with 1024 experts today. This change only delivers better performance for vllm-xpu-kernels/benchmark/benchmark_topk.py.

Copilot AI review requested due to automatic review settings April 3, 2026 05:08

Copilot started reviewing on behalf of JianyuLi01 April 3, 2026 05:08

JianyuLi01 force-pushed the main branch from 5ebd748 to 79fc4b3 April 3, 2026 05:10

Copilot AI reviewed Apr 3, 2026


JianyuLi01 force-pushed the main branch from 79fc4b3 to a742d4b April 7, 2026 05:07

Merge branch 'main' into main

cd67d0d

JianyuLi01 wants to merge 2 commits into vllm-project:main from JianyuLi01:main
