
[Klaud Cold] Add minimaxm3-fp4-mi355x-atom #1812

Closed
indianspeedster wants to merge 4 commits into SemiAnalysisAI:main from indianspeedster:feat/minimaxm3-fp4-mi355x-atom

@indianspeedster indianspeedster commented Jun 17, 2026

Contributor

Summary

Adds the minimaxm3-fp4-mi355x-atom config — MiniMax-M3 MXFP4 (amd/MiniMax-M3-MXFP4) on MI355X, single-node atom engine — for the 1k/1k and 8k/1k fixed-seq-len cells, TP4.

Follows the ROCm/ATOM MiniMax-M3 recipe (FP4 on 4×MI355 section).

  • .github/configs/amd-master.yaml: new config entry + search space (TP4, conc 1→128, image rocm/atom-dev:M3).
  • benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_mi355x_atom.sh: atom serve script — --block-size 128 (mandatory for MiniMax MSA), --gpu-memory-utilization 0.8, --trust-remote-code. KV cache left at the default dtype: this MXFP4 checkpoint ships no calibrated FP8 KV scales, so --kv_cache_dtype fp8 asserts (k_scale is None) in the MSA fused_qknorm kernel during init.
  • runners/launch_mi355x-amds.sh: route amd/MiniMax-M3* weights to the NFS cache (alongside the existing MiniMaxAI/MiniMax-M3* rule).
  • perf-changelog entry.
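
The weight-routing change in the third bullet can be illustrated with a small sketch. The real rule lives in the shell script runners/launch_mi355x-amds.sh and its exact matching logic is not shown in this PR; the function name `uses_nfs_cache` and the use of glob matching below are assumptions made only for illustration.

```python
from fnmatch import fnmatchcase

# Patterns taken from the PR description: the existing MiniMaxAI rule
# plus the newly added amd rule.
NFS_CACHED_PATTERNS = (
    "MiniMaxAI/MiniMax-M3*",
    "amd/MiniMax-M3*",
)


def uses_nfs_cache(model_id: str) -> bool:
    """Return True if the model id matches one of the NFS-cached patterns.

    Hypothetical helper: the launcher itself is a shell script and may
    match differently (for example with a case statement).
    """
    return any(fnmatchcase(model_id, pattern) for pattern in NFS_CACHED_PATTERNS)


if __name__ == "__main__":
    for model in ("amd/MiniMax-M3-MXFP4", "MiniMaxAI/MiniMax-M3", "amd/SomeOtherModel"):
        print(model, uses_nfs_cache(model))
```

Under these assumptions, amd/MiniMax-M3-MXFP4 is routed to the NFS cache in the same way as the MiniMaxAI checkpoints, while unrelated amd/ models are not.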

Validation

  • generate_sweep_configs.py test-config → 16 configs: minimaxm3_1k1k and minimaxm3_8k1k, each TP4 at conc {1,2,4,8,16,32,64,128}; max-model-len = 2304 (1k1k) / 9472 (8k1k); framework atom.
  • Smoke-tested on real MI355X hardware (TP4 / conc-1 / 1k1k): atom server came up across 4 ranks, served, and the benchmark wrote a well-formed result JSON.
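
The 16-config count and the max-model-len values above can be reproduced with a minimal sketch. This is not the code of generate_sweep_configs.py: the token counts behind the "1k" and "8k" labels (1024 and 8192), the 256-token headroom, and the dictionary field names are all assumptions, chosen because they match the reported totals of 2304 and 9472.

```python
from itertools import product

# Assumed token counts for the "1k" and "8k" labels; the PR only names the cells.
CELLS = {
    "minimaxm3_1k1k": (1024, 1024),  # (input length, output length)
    "minimaxm3_8k1k": (8192, 1024),
}
CONCURRENCIES = [1, 2, 4, 8, 16, 32, 64, 128]
HEADROOM = 256  # assumption: makes the totals equal 2304 and 9472


def expand_sweep(tp=4, framework="atom"):
    """Enumerate one config per (cell, concurrency) pair."""
    configs = []
    for (cell, (isl, osl)), conc in product(CELLS.items(), CONCURRENCIES):
        configs.append({
            "cell": cell,
            "tp": tp,
            "conc": conc,
            "max_model_len": isl + osl + HEADROOM,
            "framework": framework,
        })
    return configs


if __name__ == "__main__":
    configs = expand_sweep()
    print(len(configs))  # 2 cells x 8 concurrency levels = 16
```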

🤖 Generated with Claude Code


Note

Low Risk
Benchmark and launch-routing changes only; no production serving or auth paths touched.

Overview
Adds day-zero fixed-seq-len benchmarking for MiniMax-M3 MXFP4 (amd/MiniMax-M3-MXFP4) on MI355X using the ATOM engine (minimaxm3-fp4-mi355x-atom).

The new amd-master.yaml entry uses image rocm/atom-dev:M3, TP4, concurrency 1→128, and 1k/1k and 8k/1k cells per the ROCm/ATOM recipe. A new serve script starts atom.entrypoints.openai_server with --block-size 128 (MSA requirement), 0.8 GPU memory utilization, and default KV cache dtype (no FP8 KV — the MXFP4 checkpoint has no calibrated scales). launch_mi355x-amds.sh now NFS-mounts weights for amd/MiniMax-M3* as well as MiniMaxAI/MiniMax-M3*. perf-changelog documents the new config key.
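
The serve flags described above can be sketched as an argument list. Only the module name and the four flags come from this PR; launching through `python3 -m`, the helper name `build_atom_serve_args`, and the guard that rejects FP8 KV cache are assumptions for illustration. The model, tensor-parallel, and max-model-len arguments are left out because their exact spellings are not given here.

```python
def build_atom_serve_args(kv_cache_dtype=None):
    """Build the server argv using only flags named in this PR.

    Hypothetical helper; the real launch is done by the shell script
    minimaxm3_fp4_mi355x_atom.sh.
    """
    if kv_cache_dtype == "fp8":
        # Per the PR: the MXFP4 checkpoint ships no calibrated FP8 KV scales,
        # so requesting fp8 asserts during init. Fail early instead.
        raise ValueError("amd/MiniMax-M3-MXFP4 has no calibrated FP8 KV scales")

    args = [
        "python3", "-m", "atom.entrypoints.openai_server",  # launch form is assumed
        "--block-size", "128",                # required by MiniMax MSA
        "--gpu-memory-utilization", "0.8",
        "--trust-remote-code",
    ]
    if kv_cache_dtype is not None:
        args += ["--kv_cache_dtype", kv_cache_dtype]
    return args


if __name__ == "__main__":
    print(" ".join(build_atom_serve_args()))
```

Leaving `kv_cache_dtype` unset mirrors the PR's choice to keep the KV cache at the default dtype.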

Reviewed by Cursor Bugbot for commit a68c303. Bugbot is set up for automated code reviews on this repo.

Smoke-tested on MI355X (mia1-p01-g07): TP4 conc-1 1k1k served and benched
clean (mean TPOT 6.8ms). KV cache left at default dtype — amd/MiniMax-M3-MXFP4
has no calibrated FP8 KV scales, so --kv_cache_dtype fp8 asserts in the MSA
fused_qknorm kernel.
@indianspeedster indianspeedster changed the title from "[Klaud Cold] Add minimaxm3-fp4-mi355x-atom single-node atom benchmark" to "[Klaud Cold] Add minimaxm3-fp4-mi355x-atom" on Jun 17, 2026

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 9a2b0f4.

Comment thread benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_mi355x_atom.sh
@functionstackx

Collaborator

@indianspeedster thanks for the contribution, can u or one of ur teammates create an upstream branch & add full-sweep-enabled PR validation?

@andyluo7

Collaborator

@functionstackx done — mirrored this to an upstream branch so the GPU sweep can run (fork PRs can't access the self-hosted runners): #1813 (feat/minimaxm3-fp4-mi355x-atom on the upstream repo, same commits, credits @indianspeedster). Added the full-sweep-enabled label and kicked off full-sweep PR validation there — it's running now across the full matrix (1k1k + 8k1k, TP4, conc 1→128) on MI355X: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27718177974

(Also: the earlier Cursor Bugbot MAX_MODEL_LEN comment is already resolved by commit a68c303, which uses the matrix $MAX_MODEL_LEN.)

@functionstackx

Collaborator

thanks @andyluo7

superseded by #1813
