Add DSR1-MXFP4 recipe for MI355X (Team Jons contest submission, 2840/3000) #786

j0ons wants to merge 2 commits into
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a80a36bf7f
```bash
  sleep 15
  [ $((SECONDS % 60)) -lt 15 ] && log "[${SECONDS}s] waiting"
done
log "=== healthy"
```
**Fail when the server never becomes healthy**

If the server hangs or takes longer than 15 minutes without emitting one of the grepped fatal strings, this loop exits by timeout and immediately logs `=== healthy`, then runs `./dsr1_benchmark submit` against an unavailable endpoint. The perf driver in this same commit tracks a `HEALTHY` flag and exits on timeout, so the submit wrapper should do the same to avoid recording failed submissions as if the server came up.
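To make the suggestion concrete, here is a minimal sketch of the HEALTHY-flag pattern, reusing the `log` helper and `=== healthy` string from the snippet above; `HEALTH_URL`, the curl probe, and the exact 15-minute budget are hypothetical, not the PR's actual code.

```bash
# Sketch of the suggested fix: track a HEALTHY flag and abort on timeout
# instead of falling through to "=== healthy". HEALTH_URL and the curl
# probe are hypothetical; the wrapper's real check may differ.
HEALTHY=0
DEADLINE=$((SECONDS + 900))   # 15-minute budget, matching the perf driver
while [ "$SECONDS" -lt "$DEADLINE" ]; do
  if curl -sf "$HEALTH_URL" >/dev/null; then
    HEALTHY=1
    break
  fi
  sleep 15
  [ $((SECONDS % 60)) -lt 15 ] && log "[${SECONDS}s] waiting"
done
if [ "$HEALTHY" -ne 1 ]; then
  log "=== server never became healthy; skipping submit"
  exit 1
fi
log "=== healthy"
./dsr1_benchmark submit Jons
```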
PR: Team Jons — DSR1-MXFP4 launchers for MI355X (2840/3000)
Summary
This PR adds Team Jons's production-tested launcher scripts for `DeepSeek-R1-0528-MXFP4` inference on AMD MI355X under the `dsr1-fp4-atom-mtp-mi355x` track. Two scripts cover all three concurrencies and produced our locked leaderboard scores totaling 2840 / 3000.

`d2eb2378c2d540248005d9e1882a11b1474be027ba7c4ec992371ff5f50508f2`

All three runs passed the harness GSM8K accuracy check (≥ 0.93). Measured GSM8K across the locked runs: 0.9348–0.9500.
What's New
- `launch_atom_c4_level3_mtp_moefp4.sh` — winning conc=4 launcher (757.12). Uses the `DeepSeek-R1-0528-MXFP4-MTP-MoEFP4` weights with `--level 3` and tight cudagraph capture `[1,2,4,8]`.
- `launch_atom_tp8_spec3_bigbatch.sh` — winning conc=32 and conc=128 launcher (2351.06 / 3537.19). Aggressive prefill batching via `--max-num-batched-tokens 131072` and `--max-num-seqs 256` (a hedged flag sketch follows this list).
- `submit_c4_moefp4.sh` — wrapper that boots the conc=4 server, waits until healthy, then runs `./dsr1_benchmark submit Jons`.
- `launch_atom_c4_level3.sh` and `run_dsr1_c4only_moefp4.sh` — earlier reference launchers kept for reproducibility against intermediate submissions.
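For orientation, a sketch of the flag set the big-batch launcher implies, as referenced in the list above. The `atom serve` entry point, model path, and port are illustrative guesses; only the batching flags are quoted from this PR, and TP=8 is inferred from the script name.

```bash
# Hypothetical sketch of the conc=32/128 big-batch launch. "atom serve",
# the model path, and the port are illustrative guesses; TP=8 is inferred
# from the script name. Only the batching flags are quoted from the PR.
atom serve /models/DeepSeek-R1-0528-MXFP4 \
  --tensor-parallel-size 8 \
  --max-num-batched-tokens 131072 \
  --max-num-seqs 256 \
  --port 8000
```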
Shared baseline across all three: `--kv_cache_dtype fp8`, `--max-model-len 10240`, and `--method mtp --num-speculative-tokens 3` (the max supported by AITER's fp8 MLA kernel, which enforces `qo_len ≤ 4` in `asm_mla.cu:281`; DSR1's MTP head is also only trained for spec=3). A conc=4 sketch combining these flags follows.
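A corresponding hedged sketch for conc=4: the serve entry point and the `--cudagraph-capture-sizes` spelling are guesses (the PR only says capture sizes `[1,2,4,8]`); the authoritative invocation is `launch_atom_c4_level3_mtp_moefp4.sh`.

```bash
# Hypothetical conc=4 sketch combining the shared baseline with the
# MoEFP4-specific flags. The entry point and the cudagraph flag spelling
# are guesses; the real launcher is launch_atom_c4_level3_mtp_moefp4.sh.
atom serve /models/DeepSeek-R1-0528-MXFP4-MTP-MoEFP4 \
  --kv_cache_dtype fp8 \
  --max-model-len 10240 \
  --method mtp --num-speculative-tokens 3 \
  --level 3 \
  --cudagraph-capture-sizes 1,2,4,8 \
  --port 8000
```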
Key technical contribution

The single biggest insight was that `DeepSeek-R1-0528-MXFP4-MTP-MoEFP4` is faster than the standard `DeepSeek-R1-0528-MXFP4` at conc=4 despite identical architecture. The MoE-FP4 quantization gives lower mean TPOT (5.64 ms vs 5.95 ms), translating to +15 tok/s/GPU at peak (757.12 vs 742). Counter-intuitive at first, but the MoEFP4 variant is smaller (350 GB vs 376 GB) and uses fewer shards (76 vs 82) — at conc=4 the smaller active MoE footprint fits cache better, dominating any quant-error overhead. At conc=32/128 the standard model is faster (we benched both at all three concurrencies).
What Was Tried and Did Not Help (Negative Results)
Full catalogue in `TECHNICAL_APPROACH.md`. Highlights of confirmed dead ends on `rocm/atom:rocm7.2.1-ubuntu24.04-pytorch2.9.1-atom0.1.2`:

- `--enable-dp-attention`, `--data-parallel-size > 1`, `--enable_prefix_caching`: engine-init bugs in ATOM v0.1.2.
- `--num-speculative-tokens ≥ 4` in fp8: hard C++ assert in `asm_mla.cu:281` (`qo_len ≤ 4`).
- `--num-speculative-tokens ≥ 4` in bf16: GSM8K collapses to ~0.05 (DSR1 MTP head not trained for spec > 3).
- … divides by 8 so no rank gain.
- SGLang `v0.5.9-rocm700-mi35x` with EAGLE/NEXTN: 4 cascading bugs in the MTP+TP=8+MXFP4 load path (partial patches included in `prototypes/sglang_patches/`).
- Env-var tuning (`HIP_FORCE_DEV_KERNARG=1` etc.) alone: −75 % at conc=4 on our config (only works paired with very specific kernel sequencing — not general).
Bonus Prototypes (Not Used in Submitted Runs)
- `prototypes/triton_mla_fp8_multi.py` — a Triton fp8 MLA decode kernel that bypasses the `qo_len ≤ 4` ASM cap. Functionally correct (GSM8K 0.9447 at qo_len=4) but ~8× slower than the hand-tuned ASM kernel; would need significant kernel-level optimization to be useful.
- `prototypes/sglang_patches/deepseek_weight_loader.py` — partial SGLang MTP loader fixes documenting the 4-bug chain we hit (3 patched, 4th `gemm_a8w8_bpreshuffle` fp8-output issue remains).
Files in This Submission
How to Reproduce
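A plausible end-to-end flow, assuming the container image named above and this PR's script names; the docker flags and the ordering are assumptions, not documented steps.

```bash
# Hypothetical reproduction flow. The image tag and script/binary names
# come from this PR; the docker flags and the ordering are assumptions.
docker run --rm -it --device=/dev/kfd --device=/dev/dri --group-add video \
  rocm/atom:rocm7.2.1-ubuntu24.04-pytorch2.9.1-atom0.1.2 bash

# conc=4: boot the MoEFP4 server, wait until healthy, then submit
./submit_c4_moefp4.sh

# conc=32 / conc=128: start the big-batch server, then run the harness
./launch_atom_tp8_spec3_bigbatch.sh &
./dsr1_benchmark submit Jons
```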