Sm90 mega moe on sgl dev by qiushixiaoyu · Pull Request #36 · sgl-project/DeepGEMM

qiushixiaoyu · 2026-05-19T09:05:01Z

MegaMoE SM90 Perf Summary

Flash vs normal baseline

batch	fused avg us	baseline avg us	speedup	timing ranks
1	207.5	491.1	2.37x	8/8
2	235.2	614.5	2.61x	8/8
4	350.4	817.8	2.33x	8/8
8	423.8	938.1	2.21x	8/8
16	485.8	1056.9	2.18x	8/8
32	494.6	1090.2	2.20x	8/8
64	495.2	1070.6	2.16x	8/8
128	508.1	1062.2	2.09x	8/8
256	524.6	1169.4	2.23x	8/8
512	874.0	1237.8	1.42x	8/8
1024	1637.5	2091.4	1.28x	8/8
2048	2856.1	3572.2	1.25x	8/8
4096	5213.8	6375.6	1.22x	8/8
8192	9812.0	11974.9	1.22x	8/8
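
The speedup column appears to be the baseline average divided by the fused average, rounded to two decimals. A minimal sketch that recomputes it for three rows copied from the table above (the `speedup` helper is a hypothetical name, not something taken from the PR's benchmark script):

```python
# Recompute the "speedup" column as baseline_avg_us / fused_avg_us.
# Rows are copied from the "Flash vs normal baseline" table; the helper
# name is illustrative and not taken from the PR's benchmark code.

def speedup(fused_avg_us: float, baseline_avg_us: float) -> float:
    """Return how many times faster the fused path is than the baseline."""
    return baseline_avg_us / fused_avg_us

rows = [
    # (batch, fused avg us, baseline avg us)
    (1, 207.5, 491.1),
    (256, 524.6, 1169.4),
    (8192, 9812.0, 11974.9),
]

for batch, fused, baseline in rows:
    print(f"batch={batch:<5d} speedup={speedup(fused, baseline):.2f}x")
```

The same ratio reproduces the speedup values listed in the other three perf tables.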

Flash vs low-latency baseline

batch	status	fused avg us	baseline avg us	speedup	timing ranks	note
1	1	172.6	199.0	1.15x	8/8	nonzero exit after timing
2	1	239.8	272.2	1.14x	8/8	nonzero exit after timing
4	1	349.5	390.4	1.12x	8/8	nonzero exit after timing
8	1	444.6	473.0	1.06x	8/8	nonzero exit after timing
16	1	497.0	514.9	1.04x	8/8	nonzero exit after timing
32	1	520.1	515.4	0.99x	8/8	nonzero exit after timing
64	1	505.4	521.0	1.03x	8/8	nonzero exit after timing
128	1	510.1	542.6	1.06x	8/8	nonzero exit after timing
256	1	530.0	616.5	1.16x	8/8	nonzero exit after timing

Pro vs normal baseline

batch	fused avg us	baseline avg us	speedup	timing ranks
1	415.4	908.9	2.19x	8/8
2	572.5	1275.4	2.23x	8/8
4	861.6	1850.9	2.15x	8/8
8	1256.6	2715.6	2.16x	8/8
16	1519.5	3169.6	2.09x	8/8
32	1562.2	3273.9	2.10x	8/8
64	1580.5	3277.9	2.07x	8/8
128	1586.9	3322.0	2.09x	8/8
256	1615.5	3360.5	2.08x	8/8
512	3021.2	3461.9	1.15x	8/8
1024	4773.5	5365.8	1.12x	8/8
2048	7856.6	8744.8	1.11x	8/8
4096	13650.2	15095.4	1.11x	8/8
8192	25454.1	28132.8	1.11x	8/8

Pro vs low-latency baseline

batch	status	fused avg us	baseline avg us	speedup	timing ranks	note
1	1	420.6	466.5	1.11x	8/8	nonzero exit after timing
2	1	587.1	635.1	1.08x	8/8	nonzero exit after timing
4	1	849.6	951.0	1.12x	8/8	nonzero exit after timing
8	1	1255.4	1344.4	1.07x	8/8	nonzero exit after timing
16	1	1529.1	1601.2	1.05x	8/8	nonzero exit after timing
32	1	1586.4	1676.0	1.06x	8/8	nonzero exit after timing
64	1	1567.8	1692.2	1.08x	8/8	nonzero exit after timing
128	1	1597.1	1727.5	1.08x	8/8	nonzero exit after timing
256	1	1611.9	1795.9	1.11x	8/8	nonzero exit after timing

Benchmark DeepSeekV4Flash

Shape	Track	off tok/s	on tok/s	on/off	off RR/MC	on RR/MC	on SLO status
3500/1500	SLO-Compliant	4778.35	5007.36	1.05x	4.00/28	4.00/28	OK
3500/1500	Max-Throughput	4669.26	5027.19	1.08x	4.00/28	4.00/28	OK
32000/1500	SLO-Compliant	11373.42	14708.10	1.29x	2.00/12	2.20/20	OK
32000/1500	Max-Throughput	12742.43	15359.18	1.21x	2.00/20	2.20/24	VIOLATE
128000/1500	SLO-Compliant	13014.58	15362.97	1.18x	1.00/4	1.00/4	OK
128000/1500	Max-Throughput	15116.68	18694.88	1.24x	1.00/12	1.60/12	VIOLATE

OP Accuracy

Correctness: 28 scenarios PASS, max diff 0.0006
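
For readers unfamiliar with this kind of check, here is a rough sketch of a max-absolute-difference comparison between a fused output and a reference output. The PR does not show how its 28 scenarios are evaluated, so the tolerance, the function names, and the toy values below are all assumptions for illustration only:

```python
# Sketch of a max-absolute-difference check between a fused kernel's output
# and a reference output. The tolerance, names, and sample values are
# assumptions; the PR's actual test harness is not shown in this thread.

def max_abs_diff(actual, expected):
    """Largest element-wise absolute difference between two equal-length sequences."""
    if len(actual) != len(expected):
        raise ValueError("outputs must have the same length")
    return max((abs(a - e) for a, e in zip(actual, expected)), default=0.0)

def check_scenario(actual, expected, tol=1e-3):
    """Return (passed, diff) for one scenario under an assumed tolerance."""
    diff = max_abs_diff(actual, expected)
    return diff <= tol, diff

# Toy values, not measurements from the PR.
reference = [0.1250, -0.5000, 0.7500]
fused = [0.1253, -0.5004, 0.7499]

passed, diff = check_scenario(fused, reference)
print("PASS" if passed else "FAIL", f"max diff {diff:.4f}")
```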

E2E Accuracy

MegaMoE	Accuracy	Invalid	Latency(s)	Output tok/s	Throughput vs off
off	0.956	0.000	206.924	578.897	1.00x
on	0.952	0.000	151.799	788.357	1.36x
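
The relative changes implied by the two rows above can be derived as follows. This is a small sketch using only the numbers in the table; the dictionary keys are names chosen here, not fields from the evaluation tool:

```python
# Derive relative changes from the E2E accuracy table (MegaMoE off vs on).
# Values are copied from the table; key names are illustrative.
off = {"accuracy": 0.956, "latency_s": 206.924, "output_tok_s": 578.897}
on = {"accuracy": 0.952, "latency_s": 151.799, "output_tok_s": 788.357}

throughput_ratio = on["output_tok_s"] / off["output_tok_s"]
latency_reduction = 1.0 - on["latency_s"] / off["latency_s"]
accuracy_delta = on["accuracy"] - off["accuracy"]

print(f"throughput vs off: {throughput_ratio:.2f}x")
print(f"latency reduction: {latency_reduction:.1%}")
print(f"accuracy delta:    {accuracy_delta:+.3f}")
```

The throughput ratio matches the 1.36x in the last column; the accuracy difference between the two runs is 0.004.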

Fridge003 · 2026-05-29T06:52:14Z

@qiushixiaoyu Can we upstream this change to the original DeepGEMM, so that we can use it more conveniently in the future?

qiushixiaoyu · 2026-05-29T07:20:15Z

> @qiushixiaoyu Can we upstream this change to the original DeepGEMM, so that we can use it more conveniently in the future?

@Fridge003
OK. I already opened a PR (323) against the upstream repository, but it contained some changes specific to sgl-project/DeepGEMM. I’ll revise the patch and strip out those project-specific modifications. That said, I’ve heard it’s unlikely to be accepted into deepseek-ai/DeepGEMM.

yz-tang · 2026-06-02T06:55:55Z

When --run-low-latency-baseline is enabled, will there be a performance degradation?

qiushixiaoyu · 2026-06-05T09:18:11Z

> When --run-low-latency-baseline is enabled, will there be a performance degradation?

I don’t think so. This is only for comparing performance against the low-latency baseline. While testing, I found that the performance with small batch sizes is not very stable. I’m still investigating it.

yz-tang · 2026-06-09T08:05:51Z

@qiushixiaoyu Can you share your sglang start command? I tried to use this PR, but the sglang output is erroneous.
My sglang start command:

SGLANG_DSV4_FP4_EXPERTS=0 \
sglang serve \
  --trust-remote-code \
  --model-path /data1/DeepSeek-V4-Flash-FP8 \
  --tp 8 \
  --moe-a2a-backend megamoe \
  --tool-call-parser deepseekv4 \
  --reasoning-parser deepseek-v4 \
  --host 0.0.0.0 \
  --port 8055

qiushixiaoyu · 2026-06-11T02:34:13Z

> @qiushixiaoyu Can you share your sglang start command? I tried to use this PR, but the sglang output is erroneous. My sglang start command:
>
> SGLANG_DSV4_FP4_EXPERTS=0 \
> sglang serve \
>   --trust-remote-code \
>   --model-path /data1/DeepSeek-V4-Flash-FP8 \
>   --tp 8 \
>   --moe-a2a-backend megamoe \
>   --tool-call-parser deepseekv4 \
>   --reasoning-parser deepseek-v4 \
>   --host 0.0.0.0 \
>   --port 8055

@yz-tang
export SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE=1
export SGLANG_OPT_FIX_MEGA_MOE_MEMORY=1
export SGLANG_OPT_DEEPGEMM_MEGA_MOE_NUM_MAX_TOKENS_PER_RANK=4096
python -m sglang.launch_server \
  --trust-remote-code \
  --model-path "${MODEL_PATH}" \
  --tp 8 \
  --ep-size 8 \
  --chunked-prefill-size 4096 \
  --moe-a2a-backend deepep \
  --moe-runner-backend deep_gemm \
  --deepep-mode auto \
  --cuda-graph-max-bs 32 \
  --max-running-requests 32 \
  --speculative-algo EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --enable-metrics \
  --host 0.0.0.0 \
  --port "${SERVER_PORT}" \
  --mem-fraction-static 0.75 \
  --tool-call-parser deepseekv4 \
  --reasoning-parser deepseek-v4

I still have an SGLang change PR that hasn’t been merged yet.
sgl-project/sglang@5776a42

This was referenced May 19, 2026

MegaMOE adaptation for SM90 #24 (Closed)

DeepSeek V4 Roadmap sgl-project/sglang#23602 (Open)

Fridge003 requested changes May 29, 2026

qiushixiaoyu force-pushed the sm90-mega-moe-on-sgl-dev branch 4 times, most recently from 067fc03 to 78772d1, on June 4, 2026 11:58

Add SM90 MegaMoE support with TVM FFI bindings

fce68b3

qiushixiaoyu force-pushed the sm90-mega-moe-on-sgl-dev branch from 78772d1 to fce68b3 on June 5, 2026 04:38

Apply SM90 MegaMoE kernel fixes

3f9268b

Fridge003 merged commit 35d4d8c into sgl-project:dev Jun 15, 2026

Fridge003 merged 2 commits into sgl-project:dev from qiushixiaoyu:sm90-mega-moe-on-sgl-dev
