Closed
7 changes: 7 additions & 0 deletions benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_mi300x.sh
@@ -34,6 +34,13 @@ SERVER_LOG=/workspace/server.log
export VLLM_ENGINE_READY_TIMEOUT_S=3600
export VLLM_USE_BREAKABLE_CUDAGRAPH=0

export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_MHA=0

export TORCH_BLAS_PREFER_HIPBLASLT=1
export NCCL_MIN_NCHANNELS="${NCCL_MIN_NCHANNELS:-112}"
export GPU_MAX_HW_QUEUES="${GPU_MAX_HW_QUEUES:-2}"
Comment on lines +37 to +42

Collaborator

@JohnQinAMD thank you for the PR. Can you please update the image in amd-master.yaml to include these changes and use an upstream branch instead of a forked branch so that we can kick off CI?

Collaborator

@JohnQinAMD can you also update https://github.com/vllm-project/recipes/tree/main with these new env vars?

Collaborator Author

> @JohnQinAMD thank you for the PR. Can you please update the image in amd-master.yaml to include these changes and use an upstream branch instead of a forked branch so that we can kick off CI?

@functionstackx, an upstream branch has been created. I will close this PR and switch to #1808 to trigger the CI tests.

The image tag in this PR currently uses the same default image tag as amd-master.yaml on the main branch.

> @JohnQinAMD can you also update https://github.com/vllm-project/recipes/tree/main with these new env vars?

I am updating the vLLM recipes via vllm-project/recipes#556.


if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
fi
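
The hunk above pins `VLLM_ROCM_USE_AITER` and `TORCH_BLAS_PREFER_HIPBLASLT` unconditionally, but writes the two tuning knobs as `${VAR:-default}` so a caller can still override them. The following is a minimal sketch of that behavior, not part of the PR: `resolve_knobs` is a hypothetical helper introduced only for illustration, and the override value 64 is an arbitrary example.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the PR): applies the same default-with-override
# expansion that the benchmark script uses for its two tuning knobs.
resolve_knobs() {
  # ${VAR:-default} keeps the caller's value and falls back to the default
  # only when the variable is unset or empty.
  export NCCL_MIN_NCHANNELS="${NCCL_MIN_NCHANNELS:-112}"
  export GPU_MAX_HW_QUEUES="${GPU_MAX_HW_QUEUES:-2}"
  echo "NCCL_MIN_NCHANNELS=${NCCL_MIN_NCHANNELS} GPU_MAX_HW_QUEUES=${GPU_MAX_HW_QUEUES}"
}

# Case 1: the caller sets nothing, so both defaults from the diff apply.
default_out="$(unset NCCL_MIN_NCHANNELS GPU_MAX_HW_QUEUES; resolve_knobs)"

# Case 2: the caller overrides one knob (64 is an arbitrary example value).
override_out="$(NCCL_MIN_NCHANNELS=64; unset GPU_MAX_HW_QUEUES; resolve_knobs)"

echo "default : ${default_out}"
echo "override: ${override_out}"
```

Each case runs in a subshell, so the sketch leaves the calling environment untouched. The practical upshot is that a sweep can vary `NCCL_MIN_NCHANNELS` or `GPU_MAX_HW_QUEUES` from the outside without editing the script.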
8 changes: 8 additions & 0 deletions perf-changelog.yaml
@@ -3863,3 +3863,11 @@
    - "Align MiniMax-M3 B200 vLLM fixed-sequence serving with MiniMax-M2.5 FP8 B200 settings by setting VLLM_FLOAT32_MATMUL_PRECISION=high and restoring max cudagraph capture size 2048."
    - "Add TP4+EP4 coverage for MiniMax-M3 B200: DP-attention rows for 1k1k/8k1k and the missing non-DP-attention row for 8k1k."
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1779

- config-keys:
    - minimaxm3-fp8-mi300x-vllm
  description:
    - "Enable AITER kernels for MiniMax-M3 MXFP8 on MI300X/gfx942 via the single master toggle VLLM_ROCM_USE_AITER=1: the stock image left it unset, so the hot decode GEMMs and fused MoE ran on the generic kernels. The per-component AITER flags (MoE, linear, RMSNorm, FP8 batched-GEMM) default to True and are gated behind the master flag, so they are left at their defaults. Keep attention on TRITON_ATTN (VLLM_ROCM_USE_AITER_MHA=0, which defaults to True) because the MXFP8 checkpoint lacks calibrated q/prob scales for ROCm FP8 attention."
    - "Add AMD-recommended, numerically-inert MI300X runtime knobs: TORCH_BLAS_PREFER_HIPBLASLT=1, NCCL_MIN_NCHANNELS=112 (raises RCCL channels above the ~32-64 default for TP8), GPU_MAX_HW_QUEUES=2 (caps HIP streams below the default of 4)."
    - "Measured uplift on 8xMI300X, 1k1k random sweep (total tok/s/gpu): conc256 782.7->856.1 (+9.4%), conc128 598.9->637.0 (+6.4%), conc64 365.1->392.0 (+7.4%), conc32 295.6->327.4 (+10.8%), conc16 203.1->216.5 (+6.6%), conc8 127.6->136.6 (+7.1%), conc4 80.1->84.6 (+5.6%); conc1-2 unchanged (latency-bound). GSM8K exact-match holds at ~0.95 (kernel-selection change only)."
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/PENDING
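
The uplift percentages quoted in the changelog entry can be re-derived from the before/after throughput pairs as (after - before) / before. The sketch below does that arithmetic with `awk`; `pct_uplift` is a hypothetical helper written for this check, and the numbers are copied from the description above rather than measured again.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the PR): relative uplift in percent,
# rounded to one decimal place. LC_ALL=C forces "." as the decimal separator.
pct_uplift() {
  LC_ALL=C awk -v before="$1" -v after="$2" \
    'BEGIN { printf "%.1f\n", (after - before) / before * 100 }'
}

# Columns: concurrency, baseline tok/s/gpu, tok/s/gpu with the new env vars
# (values copied from the changelog description).
while read -r conc before after; do
  printf 'conc%-3s %6s -> %6s  +%s%%\n' \
    "$conc" "$before" "$after" "$(pct_uplift "$before" "$after")"
done <<'EOF'
256 782.7 856.1
128 598.9 637.0
64 365.1 392.0
32 295.6 327.4
16 203.1 216.5
8 127.6 136.6
4 80.1 84.6
EOF
```

Running it gives the same rounded figures as the entry, from +9.4% at conc256 down to +5.6% at conc4, so the quoted percentages are consistent with the raw throughput numbers.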