
[Benchmark] Add SGLANG_SIMULATE_UNIFORM_EXPERTS for balanced expert routing with dummy weights#25571

Merged
ByronHsu merged 3 commits into
sgl-project:mainfrom
ByronHsu:byron/simulate-uniform-experts
May 18, 2026

@ByronHsu
Collaborator

@ByronHsu ByronHsu commented May 18, 2026

Motivation

When benchmarking MoE models with --load-format dummy, random gate weights cause severe expert imbalance. This flag forces a uniform expert distribution, which is the best-case token routing, so benchmark results obtained with it act as an upper bound on serving performance.

Change

Add SGLANG_SIMULATE_UNIFORM_EXPERTS=1 env var that overrides the gating output with a deterministic round-robin expert assignment:

  • Each token picks k experts evenly spaced across all num_experts (stride = num_experts // k)
  • A random per-token offset ensures cross-token balance across EP ranks
  • Uniform weights (1/k) for all selected experts
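
The three rules above can be sketched in plain Python. This is an illustrative sketch only, not the code merged in this PR: the function name `simulate_uniform_topk`, its parameters, and the use of nested lists instead of tensors are all assumptions made for the example.

```python
import random


def simulate_uniform_topk(num_tokens, num_experts, k, rng=None):
    """Hypothetical helper: build strided top-k expert IDs and uniform weights.

    Assumes 1 <= k <= num_experts. Not the actual SGLang implementation.
    """
    rng = rng or random.Random()
    stride = num_experts // k  # spacing between the k experts of one token
    topk_ids, topk_weights = [], []
    for _ in range(num_tokens):
        # Random per-token starting expert, so different tokens start
        # at different points in the expert range.
        offset = rng.randrange(num_experts)
        # k experts spaced `stride` apart, wrapping around num_experts.
        ids = [(offset + j * stride) % num_experts for j in range(k)]
        topk_ids.append(ids)
        # Every selected expert gets the same weight 1/k.
        topk_weights.append([1.0 / k] * k)
    return topk_ids, topk_weights


if __name__ == "__main__":
    # Arbitrary example sizes, chosen only for readability.
    ids, weights = simulate_uniform_topk(num_tokens=4, num_experts=16, k=4)
    for token_ids, token_weights in zip(ids, weights):
        print(token_ids, token_weights)
```

Because `k * stride <= num_experts`, the `k` IDs chosen for a token never collide after the modulo, so each token always selects `k` distinct experts.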

The override is applied before _post_process_topk_ids so EP remapping, fused shared expert handling, and logical-to-physical ID translation all work correctly.

Usage

SGLANG_SIMULATE_UNIFORM_EXPERTS=1 python -m sglang.launch_server \
  --model-path <model> --load-format dummy ...

This flag is for profiling/benchmarking only and should not be used in production serving.

Experiment

image

Test plan

  • Verified expert IDs are spread across all EP ranks with uniform distribution
  • Tested with Kimi K2 fp8 dummy weights on 4-node H200 (TP=32, DP=32, EP=32)
  • Confirmed no DeepEP dispatch assertion failures at bs=32 with spec decode
  • Works with overlap schedule (no shape changes in dispatch/combine tensors)
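
The first check in the test plan can be approximated offline with a sketch like the one below. It makes two assumptions not stated in this PR: each EP rank owns a contiguous block of `num_experts // ep_size` experts, and `tokens_per_ep_rank` is a hypothetical helper rather than an SGLang function. The sizes are arbitrary.

```python
import random


def tokens_per_ep_rank(topk_ids, num_experts, ep_size):
    """Hypothetical helper: count (token, expert) dispatches per EP rank.

    Assumes rank r owns the contiguous experts
    [r * experts_per_rank, (r + 1) * experts_per_rank).
    """
    experts_per_rank = num_experts // ep_size
    counts = [0] * ep_size
    for ids in topk_ids:
        for expert_id in ids:
            counts[expert_id // experts_per_rank] += 1
    return counts


# Arbitrary example sizes; k equals ep_size here, so the stride matches
# the number of experts per rank.
num_experts, k, ep_size, num_tokens = 64, 8, 8, 1024
stride = num_experts // k
rng = random.Random(0)

# Strided assignment with a random per-token offset, as described in "Change".
topk_ids = []
for _ in range(num_tokens):
    offset = rng.randrange(num_experts)
    topk_ids.append([(offset + j * stride) % num_experts for j in range(k)])

per_rank = tokens_per_ep_rank(topk_ids, num_experts, ep_size)
print(per_rank)
```

In this particular configuration every token lands on each rank exactly once, so all ranks receive the same count. With other combinations of `k`, `num_experts`, and `ep_size`, the balance would be expected to hold only approximately.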

CI States

Latest PR Test (Base): Run #26017132080
Latest PR Test (Extra): ⚠️ Not enabled — add run-ci-extra label to opt in.

…expert routing

When benchmarking with `--load-format dummy`, random gate weights cause
severe expert imbalance — some experts receive all tokens while others
get none. This triggers DeepEP dispatch buffer overflows and OOM from
hot-expert memory spikes, making it impossible to benchmark MoE models
with dummy weights at scale.

`SGLANG_SIMULATE_UNIFORM_EXPERTS=1` overrides the gating output with
a deterministic round-robin expert assignment: each token picks `k`
experts evenly spaced across all `num_experts`, with a random per-token
offset for cross-token balance. This guarantees every EP rank receives
roughly the same number of dispatched tokens.

The override is applied before `_post_process_topk_ids` so EP remapping,
fused shared expert handling, and logical-to-physical ID translation
all work correctly.

This flag is for profiling/benchmarking only and should not be used in
production serving.

Co-authored-by: Cursor <cursoragent@cursor.com>
@ByronHsu
Collaborator Author

/tag-and-rerun-ci

@ishandhanani
Collaborator

This is very helpful

@ByronHsu ByronHsu merged commit d96e593 into sgl-project:main May 18, 2026
113 of 120 checks passed