
Add DSR1-MXFP4 recipe for MI355X (Team Jons contest submission, 2840/3000) #786

Open
j0ons wants to merge 2 commits into ROCm:main from j0ons:main

Conversation


j0ons commented on May 14, 2026

PR: Team Jons — DSR1-MXFP4 launchers for MI355X (2840/3000)

Summary

This PR adds Team Jons's production-tested launcher scripts for DeepSeek-R1-0528-MXFP4
inference on AMD MI355X under the dsr1-fp4-atom-mtp-mi355x track. Two scripts cover all
three concurrencies and produced our locked leaderboard scores totaling 2840 / 3000.

Conc    tput_per_gpu    Score          Event ID
4       757.12          840 / 1000     d2eb2378c2d540248005d9e1882a11b1
32      2351.06         1000 / 1000    474be027ba7c4ec992371ff5f50508f2
128     3537.19         1000 / 1000    (May 8 submission)
Total                   2840 / 3000

All three runs passed the harness GSM8K accuracy check (≥ 0.93). Measured GSM8K across the
locked runs: 0.9348–0.9500.

What's New

  1. launch_atom_c4_level3_mtp_moefp4.sh — winning conc=4 launcher (757.12). Uses the
    DeepSeek-R1-0528-MXFP4-MTP-MoEFP4 weights with --level 3 and tight cudagraph capture
    [1,2,4,8].
  2. launch_atom_tp8_spec3_bigbatch.sh — winning conc=32 and conc=128 launcher
    (2351.06 / 3537.19). Aggressive prefill batching via --max-num-batched-tokens 131072
    and --max-num-seqs 256.
  3. submit_c4_moefp4.sh — wrapper that boots the conc=4 server, waits until it is healthy, then runs
    ./dsr1_benchmark submit Jons.
  4. launch_atom_c4_level3.sh and run_dsr1_c4only_moefp4.sh — earlier reference launchers
    kept for reproducibility against intermediate submissions.

Shared baseline across all three (see the sketch after this list):

  • TP=8, --kv_cache_dtype fp8, --max-model-len 10240
  • --method mtp --num-speculative-tokens 3 (max supported by AITER's fp8 MLA kernel —
    qo_len ≤ 4 enforced in asm_mla.cu:281; DSR1's MTP head is only trained for spec=3).
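
For orientation, here is a minimal sketch of what such a launch looks like. The atom serve entry point and the --tensor-parallel-size, --cudagraph-capture-sizes, and --port flag names are assumptions; the remaining flags are the ones quoted from the launchers above.

#!/usr/bin/env bash
# Hedged sketch only -- the real launchers live under team-jons/launchers/.
# "atom serve", --tensor-parallel-size, --cudagraph-capture-sizes and --port are
# assumed names; the other flags are taken verbatim from this PR description.

# conc=4 uses the MoEFP4 variant; conc=32/128 use the standard MXFP4 weights.
MODEL=DeepSeek-R1-0528-MXFP4-MTP-MoEFP4

ARGS=(
  --tensor-parallel-size 8            # TP=8 across the MI355X node
  --kv_cache_dtype fp8
  --max-model-len 10240
  --method mtp
  --num-speculative-tokens 3          # qo_len <= 4 cap in the fp8 MLA ASM kernel
  --level 3
  --cudagraph-capture-sizes 1,2,4,8   # tight capture set for conc=4
  --port 8888
)
# The conc=32/128 launcher additionally sets:
#   --max-num-batched-tokens 131072 --max-num-seqs 256

atom serve "$MODEL" "${ARGS[@]}"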

Key Technical Contribution

The single biggest insight was that DeepSeek-R1-0528-MXFP4-MTP-MoEFP4 is faster than
the standard DeepSeek-R1-0528-MXFP4 at conc=4, despite the two sharing an identical
architecture. The MoE-FP4 quantization gives a lower mean TPOT (5.64 ms vs 5.95 ms),
translating to +15 tok/s/GPU at peak (757.12 vs 742). This looks counter-intuitive at
first, but the MoEFP4 variant is smaller (350 GB vs 376 GB) and uses fewer shards
(76 vs 82); at conc=4 the smaller active MoE footprint fits cache better, and that win
dominates any quant-error overhead. At conc=32/128 the standard model is faster
(we benched both variants at all three concurrencies).

What Was Tried and Did Not Help (Negative Results)

Full catalogue in TECHNICAL_APPROACH.md. Highlights of confirmed dead ends on
rocm/atom:rocm7.2.1-ubuntu24.04-pytorch2.9.1-atom0.1.2:

  • --enable-dp-attention, --data-parallel-size > 1, --enable_prefix_caching:
    engine-init bugs in ATOM v0.1.2.
  • --num-speculative-tokens ≥ 4 in fp8: hard C++ assert in asm_mla.cu:281
    (qo_len ≤ 4).
  • --num-speculative-tokens ≥ 4 in bf16: GSM8K collapses to ~0.05 (DSR1 MTP head not
    trained for spec > 3).
  • TP=4: MoE memory-access fault on batch sizes > captured cudagraph sizes; also bench
    divides by 8 so no rank gain.
  • SGLang v0.5.9-rocm700-mi35x with EAGLE/NEXTN: 4 cascading bugs in MTP+TP=8+MXFP4
    load path (partial patches included in prototypes/sglang_patches/).
  • The "AMD env stack" (HIP_FORCE_DEV_KERNARG=1 etc.) alone: −75 % at conc=4 on our
    config (only works paired with very specific kernel sequencing — not general).

Bonus Prototypes (Not Used in Submitted Runs)

prototypes/triton_mla_fp8_multi.py — a Triton fp8 MLA decode kernel that bypasses the
qo_len ≤ 4 ASM cap. Functionally correct (GSM8K 0.9447 at qo_len=4) but ~8× slower than
the hand-tuned ASM kernel; would need significant kernel-level optimization to be useful.

prototypes/sglang_patches/deepseek_weight_loader.py — partial SGLang MTP loader fixes
documenting the 4-bug chain we hit (3 patched; the 4th, a gemm_a8w8_bpreshuffle
fp8-output issue, remains).

Files in This Submission

team-jons/
├── README.md                       — overview + leaderboard standings
├── TECHNICAL_APPROACH.md           — what we changed + dead-knob catalogue
├── PERFORMANCE_METRICS.md          — full bench numbers
├── launchers/
│   ├── launch_atom_c4_level3.sh                — earlier c4 reference
│   ├── launch_atom_c4_level3_mtp_moefp4.sh     — ⭐ winning conc=4 (757.12)
│   ├── launch_atom_tp8_spec3_bigbatch.sh       — winning conc=32 AND conc=128
│   ├── run_dsr1_c4only_moefp4.sh               — c4 driver
│   └── submit_c4_moefp4.sh                     — c4 submit wrapper
├── results/
│   ├── peak_c4_757_moefp4.json                          — 757.12 c4 result
│   ├── submit_c32_bb_level3_20260513T092735Z.json.json  — 2351.06 c32 result
│   ├── submit_bigbatch_c128_20260508T172456Z.json.json  — 3537.19 c128 result
│   └── submit_tp8_fp8_level3_c4_20260512T173100Z.json.json  — earlier c4 baseline
└── prototypes/
    ├── TRITON_FP8_MLA_HANDOFF.md
    ├── triton_mla_fp8_multi.py
    └── sglang_patches/deepseek_weight_loader.py

How to Reproduce

# conc=128 (Score 1000/1000) — also used for conc=32
bash dsr1-fp4-atom-mtp-mi355x/team-jons/launchers/launch_atom_tp8_spec3_bigbatch.sh &
# wait for "Uvicorn running on http://0.0.0.0:8888"
CONC=128 ./dsr1_benchmark submit <team>
CONC=32  ./dsr1_benchmark submit <team>

# conc=4 (Score 840/1000) — uses the MoEFP4 model variant
bash dsr1-fp4-atom-mtp-mi355x/team-jons/launchers/launch_atom_c4_level3_mtp_moefp4.sh &
CONC=4   ./dsr1_benchmark submit <team>
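
If you prefer to script the "wait for Uvicorn" step rather than watch the log, a simple readiness poll works; the /v1/models path is an assumption about the server's OpenAI-compatible API, and the port comes from the log line above.

# Hedged sketch: block until the server answers before submitting.
# /v1/models is an assumed OpenAI-compatible endpoint; adjust it if ATOM
# exposes a different health route.
until curl -sf http://0.0.0.0:8888/v1/models > /dev/null; do
  sleep 5
done
CONC=128 ./dsr1_benchmark submit <team>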


chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a80a36bf7f


sleep 15
[ $((SECONDS % 60)) -lt 15 ] && log "[${SECONDS}s] waiting"
done
log "=== healthy"

P2: Fail when the server never becomes healthy

If the server hangs or takes longer than 15 minutes without emitting one of the grepped fatal strings, this loop exits by timeout and immediately logs === healthy, then runs dsr1_benchmark submit against an unavailable endpoint. The perf driver in this same commit tracks a HEALTHY flag and exits on timeout, so the submit wrapper should do the same to avoid recording failed submissions as if the server came up.
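
A minimal sketch of that change, assuming the wrapper's existing log helper and a 15-minute budget; the curl probe and the HEALTHY/DEADLINE names are illustrative rather than the wrapper's actual code.

# Hedged sketch: track readiness explicitly and abort on timeout instead of
# logging "=== healthy" unconditionally. The curl probe and variable names are
# illustrative; match the probe to whatever the wrapper actually greps for.
HEALTHY=0
DEADLINE=$((SECONDS + 900))   # 15-minute budget
while [ "$SECONDS" -lt "$DEADLINE" ]; do
  if curl -sf http://0.0.0.0:8888/v1/models > /dev/null; then
    HEALTHY=1
    break
  fi
  sleep 15
  [ $((SECONDS % 60)) -lt 15 ] && log "[${SECONDS}s] waiting"
done

if [ "$HEALTHY" -ne 1 ]; then
  log "=== server never became healthy, aborting submit"
  exit 1
fi
log "=== healthy"
./dsr1_benchmark submit Jons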


