Sm90 mega moe on sgl dev#36

Merged
Fridge003 merged 2 commits into sgl-project:dev from qiushixiaoyu:sm90-mega-moe-on-sgl-dev
Jun 15, 2026

Conversation

@qiushixiaoyu

@qiushixiaoyu qiushixiaoyu commented May 19, 2026

MegaMoE SM90 Perf Summary

Flash vs normal baseline

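| Batch | Status | Fused avg (µs) | Baseline avg (µs) | Speedup | Timing ranks | Note |
|---|---|---|---|---|---|---|
| 1 | 0 | 207.5 | 491.1 | 2.37x | 8/8 | |
| 2 | 0 | 235.2 | 614.5 | 2.61x | 8/8 | |
| 4 | 0 | 350.4 | 817.8 | 2.33x | 8/8 | |
| 8 | 0 | 423.8 | 938.1 | 2.21x | 8/8 | |
| 16 | 0 | 485.8 | 1056.9 | 2.18x | 8/8 | |
| 32 | 0 | 494.6 | 1090.2 | 2.20x | 8/8 | |
| 64 | 0 | 495.2 | 1070.6 | 2.16x | 8/8 | |
| 128 | 0 | 508.1 | 1062.2 | 2.09x | 8/8 | |
| 256 | 0 | 524.6 | 1169.4 | 2.23x | 8/8 | |
| 512 | 0 | 874.0 | 1237.8 | 1.42x | 8/8 | |
| 1024 | 0 | 1637.5 | 2091.4 | 1.28x | 8/8 | |
| 2048 | 0 | 2856.1 | 3572.2 | 1.25x | 8/8 | |
| 4096 | 0 | 5213.8 | 6375.6 | 1.22x | 8/8 | |
| 8192 | 0 | 9812.0 | 11974.9 | 1.22x | 8/8 | |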
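The speedup column in these tables appears to be the baseline latency divided by the fused latency. That definition is an assumption, since the PR does not state it, but the reported numbers agree with it. A minimal sketch that recomputes a few rows from the "Flash vs normal baseline" table (the helper names `speedup` and `fmt` are illustrative and not part of the benchmark script):

```python
# Rows copied from the "Flash vs normal baseline" table:
# (batch, fused avg us, baseline avg us, reported speedup)
ROWS = [
    (1, 207.5, 491.1, "2.37x"),
    (256, 524.6, 1169.4, "2.23x"),
    (512, 874.0, 1237.8, "1.42x"),
    (8192, 9812.0, 11974.9, "1.22x"),
]


def speedup(fused_us: float, baseline_us: float) -> float:
    """Assumed definition: baseline latency divided by fused latency."""
    return baseline_us / fused_us


def fmt(ratio: float) -> str:
    """Format a ratio the way the tables do, e.g. 2.37x."""
    return f"{ratio:.2f}x"


for batch, fused_us, baseline_us, reported in ROWS:
    computed = fmt(speedup(fused_us, baseline_us))
    print(f"batch={batch:>5}  computed={computed}  reported={reported}")
```

Under this reading, the drop from roughly 2.2x at batch 256 to 1.42x at batch 512 comes from the fused time rising by about two thirds (524.6 to 874.0 µs) while the baseline barely moves (1169.4 to 1237.8 µs).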

Flash vs low-latency baseline

| Batch | Status | Fused avg (µs) | Baseline avg (µs) | Speedup | Timing ranks | Note |
|---|---|---|---|---|---|---|
| 1 | 1 | 172.6 | 199.0 | 1.15x | 8/8 | nonzero exit after timing |
| 2 | 1 | 239.8 | 272.2 | 1.14x | 8/8 | nonzero exit after timing |
| 4 | 1 | 349.5 | 390.4 | 1.12x | 8/8 | nonzero exit after timing |
| 8 | 1 | 444.6 | 473.0 | 1.06x | 8/8 | nonzero exit after timing |
| 16 | 1 | 497.0 | 514.9 | 1.04x | 8/8 | nonzero exit after timing |
| 32 | 1 | 520.1 | 515.4 | 0.99x | 8/8 | nonzero exit after timing |
| 64 | 1 | 505.4 | 521.0 | 1.03x | 8/8 | nonzero exit after timing |
| 128 | 1 | 510.1 | 542.6 | 1.06x | 8/8 | nonzero exit after timing |
| 256 | 1 | 530.0 | 616.5 | 1.16x | 8/8 | nonzero exit after timing |

Pro vs normal baseline

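| Batch | Status | Fused avg (µs) | Baseline avg (µs) | Speedup | Timing ranks | Note |
|---|---|---|---|---|---|---|
| 1 | 0 | 415.4 | 908.9 | 2.19x | 8/8 | |
| 2 | 0 | 572.5 | 1275.4 | 2.23x | 8/8 | |
| 4 | 0 | 861.6 | 1850.9 | 2.15x | 8/8 | |
| 8 | 0 | 1256.6 | 2715.6 | 2.16x | 8/8 | |
| 16 | 0 | 1519.5 | 3169.6 | 2.09x | 8/8 | |
| 32 | 0 | 1562.2 | 3273.9 | 2.10x | 8/8 | |
| 64 | 0 | 1580.5 | 3277.9 | 2.07x | 8/8 | |
| 128 | 0 | 1586.9 | 3322.0 | 2.09x | 8/8 | |
| 256 | 0 | 1615.5 | 3360.5 | 2.08x | 8/8 | |
| 512 | 0 | 3021.2 | 3461.9 | 1.15x | 8/8 | |
| 1024 | 0 | 4773.5 | 5365.8 | 1.12x | 8/8 | |
| 2048 | 0 | 7856.6 | 8744.8 | 1.11x | 8/8 | |
| 4096 | 0 | 13650.2 | 15095.4 | 1.11x | 8/8 | |
| 8192 | 0 | 25454.1 | 28132.8 | 1.11x | 8/8 | |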

Pro vs low-latency baseline

| Batch | Status | Fused avg (µs) | Baseline avg (µs) | Speedup | Timing ranks | Note |
|---|---|---|---|---|---|---|
| 1 | 1 | 420.6 | 466.5 | 1.11x | 8/8 | nonzero exit after timing |
| 2 | 1 | 587.1 | 635.1 | 1.08x | 8/8 | nonzero exit after timing |
| 4 | 1 | 849.6 | 951.0 | 1.12x | 8/8 | nonzero exit after timing |
| 8 | 1 | 1255.4 | 1344.4 | 1.07x | 8/8 | nonzero exit after timing |
| 16 | 1 | 1529.1 | 1601.2 | 1.05x | 8/8 | nonzero exit after timing |
| 32 | 1 | 1586.4 | 1676.0 | 1.06x | 8/8 | nonzero exit after timing |
| 64 | 1 | 1567.8 | 1692.2 | 1.08x | 8/8 | nonzero exit after timing |
| 128 | 1 | 1597.1 | 1727.5 | 1.08x | 8/8 | nonzero exit after timing |
| 256 | 1 | 1611.9 | 1795.9 | 1.11x | 8/8 | nonzero exit after timing |

Benchmark DeepSeekV4Flash

| Shape | Track | Off tok/s | On tok/s | On/off | Off RR/MC | On RR/MC | On SLO status |
|---|---|---|---|---|---|---|---|
| 3500/1500 | SLO-Compliant | 4778.35 | 5007.36 | 1.05x | 4.00/28 | 4.00/28 | OK |
| 3500/1500 | Max-Throughput | 4669.26 | 5027.19 | 1.08x | 4.00/28 | 4.00/28 | OK |
| 32000/1500 | SLO-Compliant | 11373.42 | 14708.10 | 1.29x | 2.00/12 | 2.20/20 | OK |
| 32000/1500 | Max-Throughput | 12742.43 | 15359.18 | 1.21x | 2.00/20 | 2.20/24 | VIOLATE |
| 128000/1500 | SLO-Compliant | 13014.58 | 15362.97 | 1.18x | 1.00/4 | 1.00/4 | OK |
| 128000/1500 | Max-Throughput | 15116.68 | 18694.88 | 1.24x | 1.00/12 | 1.60/12 | VIOLATE |

OP Accuracy

Correctness: 28 scenarios PASS, max diff 0.0006

E2E Accuracy

| MegaMoE | Accuracy | Invalid | Latency (s) | Output tok/s | Throughput vs off |
|---|---|---|---|---|---|
| off | 0.956 | 0.000 | 206.924 | 578.897 | 1.00x |
| on | 0.952 | 0.000 | 151.799 | 788.357 | 1.36x |
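The "Throughput vs off" figure can be cross-checked against the other columns of the E2E table. The sketch below assumes the column is the on/off ratio of output tok/s, which the PR does not state explicitly. The dictionary keys are illustrative names, not identifiers from the benchmark script.

```python
# Values copied from the E2E Accuracy table.
off = {"accuracy": 0.956, "latency_s": 206.924, "tok_s": 578.897}
on = {"accuracy": 0.952, "latency_s": 151.799, "tok_s": 788.357}

# Assumed definition of "Throughput vs off": output tok/s with MegaMoE on,
# divided by output tok/s with MegaMoE off.
throughput_ratio = on["tok_s"] / off["tok_s"]

# The latency ratio (off / on) should be similar if both runs
# generated a comparable number of tokens.
latency_ratio = off["latency_s"] / on["latency_s"]

# Change in accuracy when MegaMoE is enabled.
accuracy_delta = on["accuracy"] - off["accuracy"]

print(f"throughput on/off: {throughput_ratio:.2f}x")
print(f"latency off/on:    {latency_ratio:.2f}x")
print(f"accuracy delta:    {accuracy_delta:+.3f}")
```

Both ratios round to 1.36x, matching the table, and accuracy changes by 0.004 (0.956 to 0.952).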

@Fridge003

Collaborator

@qiushixiaoyu Can we upstream this change to the original DeepGEMM, so that we can use it more conveniently in the future?

@qiushixiaoyu

qiushixiaoyu commented May 29, 2026

Author

> @qiushixiaoyu Can we upstream this change to the original DeepGEMM, so that we can use it more conveniently in the future?

@Fridge003 OK. I already opened a PR (#323) against the upstream repository, but it included some changes specific to sgl-project/DeepGEMM. I'll revise the patch and strip out those project-specific modifications. That said, I've heard it's unlikely to be accepted into deepseek-ai/DeepGEMM.

Comment thread tests/test_mega_moe.py Outdated
Comment thread tests/test_mega_moe_sm90.py Outdated
Comment thread deep_gemm/utils/layout.py Outdated
Comment thread csrc/jit_kernels/heuristics/mega_moe.hpp Outdated
Comment thread csrc/jit/handle.hpp Outdated
Comment thread csrc/jit/compiler.hpp Outdated
Comment thread csrc/apis/mega.hpp Outdated
Comment thread csrc/apis/mega.hpp
Comment thread deep_gemm/include/deep_gemm/scheduler/mega_moe.cuh Outdated
Comment thread deep_gemm/mega/__init__.py Outdated
@yz-tang

yz-tang commented Jun 2, 2026

When `--run-low-latency-baseline` is enabled, will there be a performance degradation?

@qiushixiaoyu qiushixiaoyu force-pushed the sm90-mega-moe-on-sgl-dev branch 4 times, most recently from 067fc03 to 78772d1 Compare June 4, 2026 11:58
@qiushixiaoyu qiushixiaoyu force-pushed the sm90-mega-moe-on-sgl-dev branch from 78772d1 to fce68b3 Compare June 5, 2026 04:38
@qiushixiaoyu

Author

> When `--run-low-latency-baseline` is enabled, will there be a performance degradation?

I don’t think so. This is only for comparing performance against the low-latency baseline. While testing, I found that the performance with small batch sizes is not very stable. I’m still investigating it.

@yz-tang

yz-tang commented Jun 9, 2026

@qiushixiaoyu Can you share your sglang start command? I tried this PR, but the sglang output is wrong.
My sglang start command:

SGLANG_DSV4_FP4_EXPERTS=0 \
sglang serve \
  --trust-remote-code \
  --model-path /data1/DeepSeek-V4-Flash-FP8 \
  --tp 8 \
  --moe-a2a-backend megamoe \
  --tool-call-parser deepseekv4 \
  --reasoning-parser deepseek-v4 \
  --host 0.0.0.0 \
  --port 8055

@qiushixiaoyu

qiushixiaoyu commented Jun 11, 2026

Author

> @qiushixiaoyu Can you share your sglang start command? I tried this PR, but the sglang output is wrong.
> My sglang start command:
>
> SGLANG_DSV4_FP4_EXPERTS=0 \
> sglang serve \
>   --trust-remote-code \
>   --model-path /data1/DeepSeek-V4-Flash-FP8 \
>   --tp 8 \
>   --moe-a2a-backend megamoe \
>   --tool-call-parser deepseekv4 \
>   --reasoning-parser deepseek-v4 \
>   --host 0.0.0.0 \
>   --port 8055

@yz-tang
export SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE=1
export SGLANG_OPT_FIX_MEGA_MOE_MEMORY=1
export SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK=4096
python -m sglang.launch_server \
  --trust-remote-code \
  --model-path "${MODEL_PATH}" \
  --tp 8 \
  --ep-size 8 \
  --chunked-prefill-size 4096 \
  --moe-a2a-backend deepep \
  --moe-runner-backend deep_gemm \
  --deepep-mode auto \
  --cuda-graph-max-bs 32 \
  --max-running-requests 32 \
  --speculative-algo EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --enable-metrics \
  --host 0.0.0.0 \
  --port "${SERVER_PORT}" \
  --mem-fraction-static 0.75 \
  --tool-call-parser deepseekv4 \
  --reasoning-parser deepseek-v4

I also have an SGLang PR with the corresponding changes that hasn't been merged yet:
sgl-project/sglang@5776a42

@Fridge003 Fridge003 merged commit 35d4d8c into sgl-project:dev Jun 15, 2026
3 participants