optimized fused_grouped_topk SYCL kernel for MoE expert routing#253
optimized fused_grouped_topk SYCL kernel for MoE expert routing#253xiaolong-intel wants to merge 13 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR updates the XPU MoE “grouped topk” path by introducing a substantially revised SYCL implementation for fused_grouped_topk, updating its Torch bindings, and adjusting test coverage to validate correctness against a PyTorch baseline.
Changes:
- Add
_moe_Cimport to the Python fused MoE interface to ensure MoE ops are registered. - Rework
csrc/moe/fused_grouped_topk.cppwith a new SYCL kernel implementation and update the Torch op schema formatting. - Update grouped-topk tests and baselines (param ranges, determinism settings, and wrapper behavior).
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
vllm_xpu_kernels/fused_moe_interface.py |
Ensures _moe_C is imported so MoE ops (including grouped topk) are available. |
tests/test_grouped_topk.py |
Expands parameter coverage (notably token counts and expert counts) for grouped topk validation. |
tests/ops/grouped_topk_op.py |
Adjusts the reference implementation for determinism and updates the SYCL wrapper behavior. |
csrc/moe/torch_bindings.cpp |
Reformats/declares the fused_grouped_topk op schema. |
csrc/moe/moe_ops.h |
Minor header formatting around the fused_grouped_topk declaration. |
csrc/moe/fused_grouped_topk.cpp |
Replaces the previous kernel with a new SYCL implementation and host-side dispatch/validation logic. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| @pytest.mark.parametrize("n_token", [64, 50000,100000]) | ||
| @pytest.mark.parametrize("n_hidden", [1024, 2048]) | ||
| @pytest.mark.parametrize("n_expert", [16]) | ||
| @pytest.mark.parametrize("topk", [2]) | ||
| @pytest.mark.parametrize("n_expert", [128, 256]) | ||
| @pytest.mark.parametrize("topk", [8]) |
There was a problem hiding this comment.
The default parametrization makes this test extremely large (n_token up to 100000, plus multiple values for n_hidden/n_expert/dtype/etc.), which is likely to cause very long runtimes and/or XPU OOM in the normal (non-MINI) test run. Consider keeping unit-test sizes small by default and gating large shapes behind the existing MINI_PYTEST_PARAMS/CI env, or moving the large-shape coverage to a dedicated benchmark/perf test.
| auto input_size = gating_output.sizes(); | ||
| int64_t num_tokens = input_size[0]; | ||
| int64_t num_experts = input_size[1]; | ||
| int64_t n_group = n_expert_group; | ||
| int64_t topk_group = n_topk_group; | ||
| int64_t topk = n_topk; | ||
|
|
||
| TORCH_CHECK(hidden_states.sizes()[0] == gating_output.sizes()[0], | ||
| "Number of tokens mismatch"); | ||
| TORCH_CHECK(input_size.size() == 2, "gating_output must be a 2D Tensor"); |
There was a problem hiding this comment.
num_tokens = input_size[0] / num_experts = input_size[1] is computed before verifying gating_output is 2D, which can read out of bounds for unexpected inputs. Also, the previous implementation validated bias shape matched num_experts; that check is now missing, but the kernel indexes routingBias[...] assuming length == num_experts. Please add TORCH_CHECKs to validate gating_output.dim()==2 before indexing sizes, and if bias is provided ensure it is 1D (or broadcastable in an explicitly supported way) with bias.numel()==num_experts (and ideally on the same device).
| auto input_size = gating_output.sizes(); | |
| int64_t num_tokens = input_size[0]; | |
| int64_t num_experts = input_size[1]; | |
| int64_t n_group = n_expert_group; | |
| int64_t topk_group = n_topk_group; | |
| int64_t topk = n_topk; | |
| TORCH_CHECK(hidden_states.sizes()[0] == gating_output.sizes()[0], | |
| "Number of tokens mismatch"); | |
| TORCH_CHECK(input_size.size() == 2, "gating_output must be a 2D Tensor"); | |
| TORCH_CHECK(gating_output.dim() == 2, "gating_output must be a 2D Tensor"); | |
| auto input_size = gating_output.sizes(); | |
| int64_t num_tokens = input_size[0]; | |
| int64_t num_experts = input_size[1]; | |
| int64_t n_group = n_expert_group; | |
| int64_t topk_group = n_topk_group; | |
| int64_t topk = n_topk; | |
| TORCH_CHECK(hidden_states.sizes()[0] == gating_output.sizes()[0], | |
| "Number of tokens mismatch"); | |
| if (has_bias) { | |
| TORCH_CHECK(bias->dim() == 1, "bias must be a 1D Tensor"); | |
| TORCH_CHECK(bias->numel() == num_experts, | |
| "bias must have num_experts elements"); | |
| TORCH_CHECK(bias->device() == gating_output.device(), | |
| "bias must be on the same device as gating_output"); | |
| } |
|
|
||
| constexpr int MaxExpertCandidatesPerLane = NumDeepseekExperts / WARP_SIZE; | ||
| T localCandidateScores[MaxExpertCandidatesPerLane]; | ||
| IdxT localCandidateIdx[MaxExpertCandidatesPerLane]; |
There was a problem hiding this comment.
In the grouped-routing path, selectedExpertScores/Idx are sized to DefaultMaxNumTopExperts (8), but topk is only checked as <= 32. If topk > 8 with UseGroups=true, reduceTopK writes past these arrays. Either enforce topk <= DefaultMaxNumTopExperts for the grouped path, or add additional kernel instantiations / dynamic storage sized for larger topk.
| IdxT localCandidateIdx[MaxExpertCandidatesPerLane]; | |
| IdxT localCandidateIdx[MaxExpertCandidatesPerLane]; | |
| TORCH_CHECK( | |
| topk <= DefaultMaxNumTopExperts, | |
| "Grouped routing only supports topk <= ", DefaultMaxNumTopExperts, | |
| ", but got topk=", topk, | |
| ". Increase grouped-path storage/kernel support for larger topk values."); |
| int32_t totalCandidates = topkGroup * numExpertsPerGroup; | ||
| for (int32_t candidate = laneIdx; candidate < totalCandidates; | ||
| candidate += WARP_SIZE) { | ||
| int32_t localSlot = candidate / WARP_SIZE; | ||
| int32_t selectedGroup = candidate / numExpertsPerGroup; | ||
| int32_t expertInGroup = candidate % numExpertsPerGroup; | ||
| int32_t gid = selectedGroupIdx[selectedGroup]; | ||
| int32_t idx = gid * numExpertsPerGroup + expertInGroup; | ||
| T candidateScore = neg_inf<T>(); | ||
|
|
||
| T input = scoresToken[idx]; | ||
| if (is_finite(input)) { | ||
| T score = apply_scoring<SF>(input); | ||
| candidateScore = score; | ||
| if (has_bias) { | ||
| candidateScore = candidateScore + sycl_cast<T, BiasT>(routingBias[idx]); | ||
| } | ||
| } | ||
|
|
||
| localCandidateScores[localSlot] = candidateScore; | ||
| localCandidateIdx[localSlot] = static_cast<IdxT>(idx); |
There was a problem hiding this comment.
localCandidateScores/localCandidateIdx are sized assuming at most NumDeepseekExperts candidates (256 total, 8 per lane), but totalCandidates = topkGroup * numExpertsPerGroup can exceed 256 for valid inputs (e.g., small numGroup with larger numExpertsPerGroup). When that happens, localSlot = candidate / WARP_SIZE will exceed MaxExpertCandidatesPerLane and write out of bounds. Add a host-side TORCH_CHECK that topkGroup * numExpertsPerGroup <= 256 (or dispatch a kernel variant sized for the actual maximum candidates).
|
what's benefit come from? how do you benchmark? |
Benifit comes from:
I used https://github.com/xiaolong-intel/vllm-xpu-kernels/blob/grouped_topk/tests/test_grouped_topk.py. to test the precision and accuracy of the op. |
| assert hidden_states.size(0) == gating_output.size(0), ( | ||
| "Number of tokens mismatch") | ||
| if scoring_func == "softmax": | ||
| scores = torch.softmax(gating_output, dim=-1) |
There was a problem hiding this comment.
Have you compared the performance of the softmax scoring_func?
There was a problem hiding this comment.
yes. I tested the kernel_time under softmax scoring_func with seqlenth=100000. The above is the original version, and the below is the optimized version.
I am currently testing with https://github.com/vllm-project/vllm-xpu-kernels/blob/main/benchmark/benchmark_grouped_topk.py , and I will provide the results later.

| try: | ||
| from . import _C # noqa: F401 | ||
| from . import _xpu_C # noqa: F401 | ||
| from . import _moe_C # noqa: F401 |
There was a problem hiding this comment.
Please do not add unnecessary import.
|
Please provide benchmark results as benchmark_grouped_topk do. |
| #include <sycl/sycl.hpp> | ||
|
|
||
| #include "../utils.h" | ||
| #include <c10/xpu/XPUStream.h> |
There was a problem hiding this comment.
please don't include here.
../utils.h should cover this.
| return 1.0f / (1.0f + sycl::native::exp(-x)); | ||
| } | ||
| // Type trait: bfloat16 -> float for computation, everything else stays as-is | ||
| template <typename T> |
There was a problem hiding this comment.
for such util, move to utils.h
There was a problem hiding this comment.
Okay, thank you for your suggestion, I will make the correction.
Hi mayuyuace: |
jikunshang
left a comment
There was a problem hiding this comment.
pls fix pre-commit issue and do not change oneDNN commit.
| n_expert_range = [16, 64, 128] | ||
| topk_range = [2, 4] | ||
| topk_group_range = [4, 8] | ||
| n_token_range = [50000] |
There was a problem hiding this comment.
please don't remove previous one.
|
|
||
| benchmark = get_benchmark() | ||
| benchmark.run(print_data=True, save_path=args.save_path) | ||
| benchmark.run(print_data=True, save_path=args.save_path) No newline at end of file |
| std::is_same_v<T, sycl::ext::oneapi::bfloat16> || | ||
| std::is_same_v<T, sycl::half>;; |
|
|
||
|
|
||
| @pytest.mark.parametrize("n_token", [1, 33, 64]) | ||
| @pytest.mark.parametrize("n_token", [64, 50000,100000]) |
There was a problem hiding this comment.
we don't want use such large shape in CI which may make job longer to complete.
| @pytest.mark.parametrize("topk_group", [2]) | ||
| @pytest.mark.parametrize("scoring_func", ["softmax", "sigmoid"]) | ||
| @pytest.mark.parametrize("topk_group", [4]) | ||
| @pytest.mark.parametrize("scoring_func", ["sigmoid","softmax"]) |
There was a problem hiding this comment.
please revert unnecessary change. same for blank line
pre-commit is completed. onednn. It seems I haven't made any changes😂, just synced with the latest vllm-xpu-kernels. |
|
You may need rebase the branch and only keep the key code changing. |
Okay, I understand, thank you |
|
can you rebase and fix DCO? |
367b868 to
fa68c74
Compare
fa68c74 to
367b868
Compare
515d662 to
367b868
Compare
Signed-off-by: xiaolong <xiaolong.guo@intel.com>
Signed-off-by: root <xiaolong.guo@intel.com> Signed-off-by: xiaolong <xiaolong.guo@intel.com>
Signed-off-by: xiaolong-intel <xiaolong.guo@intel.com> Signed-off-by: xiaolong <xiaolong.guo@intel.com>
Signed-off-by: root <xiaolong.guo@intel.com> Signed-off-by: xiaolong <xiaolong.guo@intel.com>
Signed-off-by: xiaolong-intel <xiaolong.guo@intel.com> Signed-off-by: xiaolong <xiaolong.guo@intel.com>
Signed-off-by: xiaolong-intel <xiaolong.guo@intel.com> Signed-off-by: xiaolong <xiaolong.guo@intel.com>
Signed-off-by: root <xiaolong.guo@intel.com> Signed-off-by: xiaolong <xiaolong.guo@intel.com>
Signed-off-by: xiaolong-intel <xiaolong.guo@intel.com> Signed-off-by: xiaolong <xiaolong.guo@intel.com>
…onversion of renormalization Signed-off-by: root <xiaolong.guo@intel.com> Signed-off-by: xiaolong <xiaolong.guo@intel.com>
367b868 to
5fb85ba
Compare
Signed-off-by: root <xiaolong.guo@intel.com> Signed-off-by: <xiaolong.guo@intel.com> Signed-off-by: xiaolong <xiaolong.guo@intel.com>
91fe795 to
b6e2254
Compare
Done.Thanks |
Signed-off-by: xiaolong <xiaolong.guo@intel.com>
Signed-off-by: root <xiaolong.guo@intel.com>

Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS ABOVE HAVE BEEN CONSIDERED.
Purpose
optimized
fused_grouped_topkSYCL kernel for MoE expert routingTest Plan
I wrote test cases in https://github.com/xiaolong-intel/vllm-xpu-kernels/blob/grouped_topk/tests/test_grouped_topk.py. Tested the consistency of the forward_xpu operator with the torch version of grouped_topk on B60
Test Result
test cases:


test results:
All test cases passed successfully
Tested operator performance on GPU B60 with the following configuration:
(Optional) Documentation Update
BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing (anything written below this line will be removed by GitHub Actions)