[Benchmark] Add SGLANG_SIMULATE_UNIFORM_EXPERTS for balanced expert routing with dummy weights #25571
Merged
ByronHsu merged 3 commits on May 18, 2026
…expert routing

When benchmarking with `--load-format dummy`, random gate weights cause severe expert imbalance: some experts receive all tokens while others get none. This triggers DeepEP dispatch buffer overflows and OOM from hot-expert memory spikes, making it impossible to benchmark MoE models with dummy weights at scale.

`SGLANG_SIMULATE_UNIFORM_EXPERTS=1` overrides the gating output with a deterministic round-robin expert assignment: each token picks `k` experts evenly spaced across all `num_experts`, with a random per-token offset for cross-token balance. This guarantees every EP rank receives roughly the same number of dispatched tokens.

The override is applied before `_post_process_topk_ids` so EP remapping, fused shared expert handling, and logical-to-physical ID translation all work correctly.

This flag is for profiling/benchmarking only and should not be used in production serving.

Co-authored-by: Cursor <cursoragent@cursor.com>
Collaborator
|
This is very helpful |
Motivation
When benchmarking MoE models with `--load-format dummy`, random gate weights cause severe expert imbalance. This flag forces a uniform expert distribution, which represents the most optimistic (best-case) token routing. This is useful for benchmarking to set an upper bound on serving performance.

Change
Add a `SGLANG_SIMULATE_UNIFORM_EXPERTS=1` env var that overrides the gating output with a deterministic round-robin expert assignment:
- Each token picks `k` experts evenly spaced across all `num_experts` (stride = `num_experts // k`)
- A random per-token offset for cross-token balance
- Equal weights (`1/k`) for all selected experts

The override is applied before `_post_process_topk_ids` so EP remapping, fused shared expert handling, and logical-to-physical ID translation all work correctly.

Usage
This flag is for profiling/benchmarking only and should not be used in production serving.
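
For readers who want to see the routing pattern concretely, here is a minimal pure-Python sketch of the round-robin assignment described under Change. It is not the code from this PR: the function name `simulate_uniform_topk`, the `seed` parameter, and the choice to draw the per-token offset from the full range `[0, num_experts)` are assumptions made for illustration, and the real override presumably operates on batched tensors inside SGLang's top-k path.

```python
import random
from collections import Counter


def simulate_uniform_topk(num_tokens, num_experts, k, seed=0):
    """Hypothetical sketch: evenly spaced expert ids with a random per-token offset.

    Returns (topk_ids, topk_weights), each a list of `num_tokens` rows of length `k`.
    """
    if not 0 < k <= num_experts:
        raise ValueError("k must satisfy 0 < k <= num_experts")

    rng = random.Random(seed)
    stride = num_experts // k  # spacing between the k experts chosen for one token

    topk_ids, topk_weights = [], []
    for _ in range(num_tokens):
        # Assumption: the offset is uniform over all experts, so every expert
        # is equally likely to be picked across tokens.
        offset = rng.randrange(num_experts)
        # j * stride < num_experts for j < k, so the k ids are always distinct.
        topk_ids.append([(offset + j * stride) % num_experts for j in range(k)])
        # Every selected expert gets the same weight, 1/k.
        topk_weights.append([1.0 / k] * k)
    return topk_ids, topk_weights


if __name__ == "__main__":
    ids, weights = simulate_uniform_topk(num_tokens=1000, num_experts=8, k=2)
    load = Counter(expert for row in ids for expert in row)
    print(sorted(load.items()))  # per-expert token counts, roughly equal
```

With a uniform offset, each expert is selected by a given token with probability `k / num_experts`, so the per-expert load concentrates around `num_tokens * k / num_experts` as the batch grows. That is the balance property the flag relies on.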
Experiment
Test plan
CI States
Latest PR Test (Base): Run #26017132080
Latest PR Test (Extra): ⚠️ Not enabled; add the `run-ci-extra` label to opt in.