minimaxm2.5-fp8-h200-vllm: switch 8k/1k attention backend to FLASH_ATTN #1668
Switch the attention backend for the 8k/1k cell of minimaxm2.5-fp8-h200-vllm from FLASHINFER to FLASH_ATTN. ISL-conditional: the 1k/1k cell is unchanged (keeps FLASHINFER + --enable-flashinfer-autotune, byte-identical to prior behavior); only ISL=8192 triggers the swap. Appends a perf-changelog entry.
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your single-node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.
PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. Failures are often just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow
As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
- config-keys:
    - minimaxm2.5-fp8-h200-vllm
  description:
    - "Switch attention backend from FLASHINFER to FLASH_ATTN for the 8k/1k cell of MiniMax-M2.5 FP8 H200 vLLM."
    - "1k/1k cell not changed in this PR: at 1k/1k all three measured configs."
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1667
🔴 The new perf-changelog entry at lines 3478-3483 has two documentation defects: (1) pr-link is set to /pull/1667 but this is PR #1668 — every other entry in the file links to its own introducing PR (see lines 3460/3467/3476), and the latest commit 424fe77 was explicitly intended to set this link correctly; (2) the second description bullet "1k/1k cell not changed in this PR: at 1k/1k all three measured configs." is grammatically incomplete — the clause after the colon has no verb. Fix the PR link to /pull/1668 and rewrite the bullet to match the PR description, e.g. "At 1k/1k, keep FLASHINFER + --enable-flashinfer-autotune unchanged (byte-identical to prior behavior)."
Extended reasoning...
Two defects in the new perf-changelog entry
The block added at perf-changelog.yaml lines 3478-3483 has two separate documentation issues that should both be fixed in one pass since they share the same entry.
Defect 1: wrong pr-link (#1667 instead of #1668). This PR is #1668 per the PR metadata, but the new entry sets:
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1667
The convention in perf-changelog.yaml is that each entry's pr-link points to the PR that introduced the entry itself. Walking the three immediately preceding entries:
- Line 3460 → /pull/1648 (introduced by "[NV] Add MiniMax-M2.5 FP8 GB200 Dynamo vLLM recipes", #1648)
- Line 3467 → /pull/1663 (introduced by "[NV] Add MiniMax-M2.5 FP8 B300 Dynamo vLLM recipes", #1663)
- Line 3476 → /pull/1544 (introduced by "[NV] Update H100 Qwen3.5 SGLang agg config", #1544)
The most recent commit on this branch — 424fe77 chore: set perf-changelog pr-link for minimaxm2.5-fp8-h200-vllm FA3 swap — explicitly states the intent was to set this link, so 1667 is an off-by-one typo and the intended value is 1668.
Defect 2: second description bullet is truncated. Line 3482 reads:
- "1k/1k cell not changed in this PR: at 1k/1k all three measured configs."
The clause after the colon ("at 1k/1k all three measured configs.") has no verb and trails off without conveying anything. The PR description states the actual intent clearly: "the 1k/1k cell is unchanged (keeps FLASHINFER + --enable-flashinfer-autotune, byte-identical to prior behavior); only ISL=8192 triggers the swap". The YAML bullet appears to be a mid-edit truncation that lost the explanation.
Step-by-step proof.
- PR metadata block above shows <pr number="1668">: this PR is #1668.
- The diff at perf-changelog.yaml line 3483 (post-context) shows pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1667.
- 1668 ≠ 1667, so the link points to a different (likely unrelated) PR. A reader following the link would land on the wrong page.
- The three prior entries in the diff context (lines 3460/3467/3476) all use their own PR number as pr-link, establishing the convention.
- For the bullet, parse "1k/1k cell not changed in this PR: at 1k/1k all three measured configs." as English: the subject "all three measured configs" is followed by a period with no verb. The sentence is incomplete by any reasonable reading.
Impact. This is documentation-only — no runtime, benchmark, or correctness effect. But perf-changelog.yaml is human-facing changelog metadata; the wrong link points readers at an unrelated PR, and the broken bullet conveys no information. Both are easy fixes.
Suggested fix. Apply both edits to the new block:
- config-keys:
    - minimaxm2.5-fp8-h200-vllm
  description:
    - "Switch attention backend from FLASHINFER to FLASH_ATTN for the 8k/1k cell of MiniMax-M2.5 FP8 H200 vLLM."
    - "At 1k/1k, keep FLASHINFER + --enable-flashinfer-autotune unchanged (byte-identical to prior behavior)."
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1668
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26986223642
/reuse-sweep-run
Note
Low Risk
Benchmark-only serving flags for one model/recipe; no application auth or data-path changes.
Overview
The MiniMax-M2.5 FP8 H200 vLLM launch script now picks the attention stack from ISL: ISL=8192 (8k/1k) uses FLASH_ATTN and drops --enable-flashinfer-autotune; other lengths keep FLASHINFER plus autotune. 1k/1k is unchanged.
A perf-changelog.yaml entry documents the minimaxm2.5-fp8-h200-vllm 8k/1k backend change.
Reviewed by Cursor Bugbot for commit e5edb8c.
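The ISL-conditional selection described in the overview can be sketched as follows. This is an illustration only, not the launch script from this PR: the names `AttentionConfig` and `select_attention` are hypothetical, and the sketch deliberately does not show how the backend name is handed to vLLM, since the PR text does not say. Only the backend names, the `--enable-flashinfer-autotune` flag, and the ISL=8192 condition come from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionConfig:
    """Hypothetical container for the per-cell attention choice."""
    backend: str
    extra_flags: tuple = ()


def select_attention(isl: int) -> AttentionConfig:
    """Pick the attention stack from the input sequence length (ISL).

    Mirrors the PR description: only ISL=8192 switches to FLASH_ATTN and
    drops the autotune flag; every other length keeps FLASHINFER + autotune.
    """
    if isl == 8192:
        return AttentionConfig(backend="FLASH_ATTN")
    return AttentionConfig(
        backend="FLASHINFER",
        extra_flags=("--enable-flashinfer-autotune",),
    )


for isl in (1024, 8192):  # 1024 is an assumed value for the "1k" cell
    print(isl, select_attention(isl))
```

Keying the swap on an exact ISL match, rather than a threshold, is what keeps every other cell's behavior unchanged.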