
Add DSR1-MXFP4 recipe for MI355X (Team Jons contest submission, 2840/3000) #786

Open
j0ons wants to merge 2 commits into ROCm:main from j0ons:main

Conversation


j0ons commented on May 14, 2026

PR: Team Jons — DSR1-MXFP4 launchers for MI355X (2840/3000)

Summary

This PR adds Team Jons's production-tested launcher scripts for DeepSeek-R1-0528-MXFP4
inference on AMD MI355X under the dsr1-fp4-atom-mtp-mi355x track. Two scripts cover all
three concurrencies and produced our locked leaderboard scores totaling 2840 / 3000.

Conc    tput_per_gpu    Score          Event ID
4       757.12          840 / 1000     d2eb2378c2d540248005d9e1882a11b1
32      2351.06         1000 / 1000    474be027ba7c4ec992371ff5f50508f2
128     3537.19         1000 / 1000    (May 8 submission)
Total                   2840 / 3000

All three runs passed the harness GSM8K accuracy check (≥ 0.93). Measured GSM8K across the
locked runs: 0.9348–0.9500.

What's New

  1. launch_atom_c4_level3_mtp_moefp4.sh — winning conc=4 launcher (757.12). Uses the
    DeepSeek-R1-0528-MXFP4-MTP-MoEFP4 weights with --level 3 and tight cudagraph capture
    [1,2,4,8].
  2. launch_atom_tp8_spec3_bigbatch.sh — winning conc=32 and conc=128 launcher
    (2351.06 / 3537.19). Aggressive prefill batching via --max-num-batched-tokens 131072
    and --max-num-seqs 256.
  3. submit_c4_moefp4.sh — wrapper that boots the conc=4 server, waits until it is healthy, then runs
    ./dsr1_benchmark submit Jons.
  4. launch_atom_c4_level3.sh and run_dsr1_c4only_moefp4.sh — earlier reference launchers
    kept for reproducibility against intermediate submissions.

Shared baseline across all three (see the sketch after this list):

  • TP=8, --kv_cache_dtype fp8, --max-model-len 10240
  • --method mtp --num-speculative-tokens 3 (max supported by AITER's fp8 MLA kernel —
    qo_len ≤ 4 enforced in asm_mla.cu:281; DSR1's MTP head is only trained for spec=3).
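
For orientation, here is a minimal sketch of what such a launch looks like. The atom serve entry point and the --tensor-parallel-size, --cudagraph-capture-sizes, and --port flag names are assumptions; the remaining flags are the ones quoted from the launchers above.

#!/usr/bin/env bash
# Hedged sketch only -- the real launchers live under team-jons/launchers/.
# "atom serve", --tensor-parallel-size, --cudagraph-capture-sizes and --port are
# assumed names; the other flags are taken verbatim from this PR description.

# conc=4 uses the MoEFP4 variant; conc=32/128 use the standard MXFP4 weights.
MODEL=DeepSeek-R1-0528-MXFP4-MTP-MoEFP4

ARGS=(
  --tensor-parallel-size 8            # TP=8 across the MI355X node
  --kv_cache_dtype fp8
  --max-model-len 10240
  --method mtp
  --num-speculative-tokens 3          # qo_len <= 4 cap in the fp8 MLA ASM kernel
  --level 3
  --cudagraph-capture-sizes 1,2,4,8   # tight capture set for conc=4
  --port 8888
)
# The conc=32/128 launcher additionally sets:
#   --max-num-batched-tokens 131072 --max-num-seqs 256

atom serve "$MODEL" "${ARGS[@]}"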

Key Technical Contribution

The single biggest insight was that DeepSeek-R1-0528-MXFP4-MTP-MoEFP4 is faster than
the standard DeepSeek-R1-0528-MXFP4 at conc=4, despite the two sharing an identical
architecture. The MoE-FP4 quantization gives a lower mean TPOT (5.64 ms vs 5.95 ms),
translating to +15 tok/s/GPU at peak (757.12 vs 742). This looks counter-intuitive at
first, but the MoEFP4 variant is smaller (350 GB vs 376 GB) and uses fewer shards
(76 vs 82); at conc=4 the smaller active MoE footprint fits cache better, and that win
dominates any quant-error overhead. At conc=32/128 the standard model is faster
(we benched both variants at all three concurrencies).

What Was Tried and Did Not Help (Negative Results)

Full catalogue in TECHNICAL_APPROACH.md. Highlights of confirmed dead ends on
rocm/atom:rocm7.2.1-ubuntu24.04-pytorch2.9.1-atom0.1.2:

  • --enable-dp-attention, --data-parallel-size > 1, --enable_prefix_caching:
    engine-init bugs in ATOM v0.1.2.
  • --num-speculative-tokens ≥ 4 in fp8: hard C++ assert in asm_mla.cu:281
    (qo_len ≤ 4).
  • --num-speculative-tokens ≥ 4 in bf16: GSM8K collapses to ~0.05 (DSR1 MTP head not
    trained for spec > 3).
  • TP=4: MoE memory-access fault on batch sizes > captured cudagraph sizes; also bench
    divides by 8 so no rank gain.
  • SGLang v0.5.9-rocm700-mi35x with EAGLE/NEXTN: 4 cascading bugs in MTP+TP=8+MXFP4
    load path (partial patches included in prototypes/sglang_patches/).
  • The "AMD env stack" (HIP_FORCE_DEV_KERNARG=1 etc.) alone: −75 % at conc=4 on our
    config (only works paired with very specific kernel sequencing — not general).

Bonus Prototypes (Not Used in Submitted Runs)

prototypes/triton_mla_fp8_multi.py — a Triton fp8 MLA decode kernel that bypasses the
qo_len ≤ 4 ASM cap. Functionally correct (GSM8K 0.9447 at qo_len=4) but ~8× slower than
the hand-tuned ASM kernel; would need significant kernel-level optimization to be useful.

prototypes/sglang_patches/deepseek_weight_loader.py — partial SGLang MTP loader fixes
documenting the 4-bug chain we hit (3 patched; the 4th, a gemm_a8w8_bpreshuffle
fp8-output issue, remains).

Files in This Submission

team-jons/
├── README.md                       — overview + leaderboard standings
├── TECHNICAL_APPROACH.md           — what we changed + dead-knob catalogue
├── PERFORMANCE_METRICS.md          — full bench numbers
├── launchers/
│   ├── launch_atom_c4_level3.sh                — earlier c4 reference
│   ├── launch_atom_c4_level3_mtp_moefp4.sh     — ⭐ winning conc=4 (757.12)
│   ├── launch_atom_tp8_spec3_bigbatch.sh       — winning conc=32 AND conc=128
│   ├── run_dsr1_c4only_moefp4.sh               — c4 driver
│   └── submit_c4_moefp4.sh                     — c4 submit wrapper
├── results/
│   ├── peak_c4_757_moefp4.json                          — 757.12 c4 result
│   ├── submit_c32_bb_level3_20260513T092735Z.json.json  — 2351.06 c32 result
│   ├── submit_bigbatch_c128_20260508T172456Z.json.json  — 3537.19 c128 result
│   └── submit_tp8_fp8_level3_c4_20260512T173100Z.json.json  — earlier c4 baseline
└── prototypes/
    ├── TRITON_FP8_MLA_HANDOFF.md
    ├── triton_mla_fp8_multi.py
    └── sglang_patches/deepseek_weight_loader.py

How to Reproduce

# conc=128 (Score 1000/1000) — also used for conc=32
bash dsr1-fp4-atom-mtp-mi355x/team-jons/launchers/launch_atom_tp8_spec3_bigbatch.sh &
# wait for "Uvicorn running on http://0.0.0.0:8888"
CONC=128 ./dsr1_benchmark submit <team>
CONC=32  ./dsr1_benchmark submit <team>

# conc=4 (Score 840/1000) — uses the MoEFP4 model variant
bash dsr1-fp4-atom-mtp-mi355x/team-jons/launchers/launch_atom_c4_level3_mtp_moefp4.sh &
CONC=4   ./dsr1_benchmark submit <team>
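
If you prefer to script the "wait for Uvicorn" step rather than watch the log, a simple readiness poll works; the /v1/models path is an assumption about the server's OpenAI-compatible API, and the port comes from the log line above.

# Hedged sketch: block until the server answers before submitting.
# /v1/models is an assumed OpenAI-compatible endpoint; adjust it if ATOM
# exposes a different health route.
until curl -sf http://0.0.0.0:8888/v1/models > /dev/null; do
  sleep 5
done
CONC=128 ./dsr1_benchmark submit <team>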


chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a80a36bf7f


sleep 15
[ $((SECONDS % 60)) -lt 15 ] && log "[${SECONDS}s] waiting"
done
log "=== healthy"

P2: Fail when the server never becomes healthy

If the server hangs or takes longer than 15 minutes without emitting one of the grepped fatal strings, this loop exits by timeout and immediately logs === healthy, then runs dsr1_benchmark submit against an unavailable endpoint. The perf driver in this same commit tracks a HEALTHY flag and exits on timeout, so the submit wrapper should do the same to avoid recording failed submissions as if the server came up.
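
A minimal sketch of that change, assuming the wrapper's existing log helper and a 15-minute budget; the curl probe and the HEALTHY/DEADLINE names are illustrative rather than the wrapper's actual code.

# Hedged sketch: track readiness explicitly and abort on timeout instead of
# logging "=== healthy" unconditionally. The curl probe and variable names are
# illustrative; match the probe to whatever the wrapper actually greps for.
HEALTHY=0
DEADLINE=$((SECONDS + 900))   # 15-minute budget
while [ "$SECONDS" -lt "$DEADLINE" ]; do
  if curl -sf http://0.0.0.0:8888/v1/models > /dev/null; then
    HEALTHY=1
    break
  fi
  sleep 15
  [ $((SECONDS % 60)) -lt 15 ] && log "[${SECONDS}s] waiting"
done

if [ "$HEALTHY" -ne 1 ]; then
  log "=== server never became healthy, aborting submit"
  exit 1
fi
log "=== healthy"
./dsr1_benchmark submit Jons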


