minimaxm2.5-fp8-h200-vllm: switch 8k/1k attention backend to FLASH_ATTN by RohitNagraj · Pull Request #1668 · SemiAnalysisAI/InferenceX

RohitNagraj · 2026-06-04T23:35:24Z

Switch the attention backend for the 8k/1k cell of minimaxm2.5-fp8-h200-vllm from FLASHINFER to FLASH_ATTN. ISL-conditional: the 1k/1k cell is unchanged (keeps FLASHINFER + --enable-flashinfer-autotune, byte-identical to prior behavior); only ISL=8192 triggers the swap.

Appends a perf-changelog entry.

Note

Low Risk
Benchmark-only serving flags for one model/recipe; no application auth or data-path changes.

Overview
The MiniMax-M2.5 FP8 H200 vLLM launch script now picks the attention stack from ISL: ISL=8192 (8k/1k) uses FLASH_ATTN and drops --enable-flashinfer-autotune; other lengths keep FLASHINFER plus autotune. 1k/1k is unchanged.

A perf-changelog.yaml entry documents the minimaxm2.5-fp8-h200-vllm 8k/1k backend change.
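
The ISL-conditional selection described above can be sketched roughly as follows. This is a minimal illustration, not the launch script itself: the function name `pick_attention` and its return shape are hypothetical, and how the chosen backend is actually handed to vLLM is not shown in this PR page, so the sketch returns only the backend name and the extra flags. The values `FLASH_ATTN`, `FLASHINFER`, `8192`, and `--enable-flashinfer-autotune` come from the description.

```python
from typing import List, Tuple

# Flag name taken from the PR description; everything else here is illustrative.
AUTOTUNE_FLAG = "--enable-flashinfer-autotune"


def pick_attention(isl: int) -> Tuple[str, List[str]]:
    """Return (backend name, extra serve flags) for an input sequence length.

    Hypothetical helper: the real launch script may structure this differently.
    """
    if isl == 8192:
        # 8k/1k cell: use FLASH_ATTN and drop the FlashInfer autotune flag.
        return "FLASH_ATTN", []
    # Every other ISL (including 1k/1k) keeps the previous behavior.
    return "FLASHINFER", [AUTOTUNE_FLAG]


if __name__ == "__main__":
    for isl in (1024, 8192):
        backend, flags = pick_attention(isl)
        print(isl, backend, flags)
```

Keying the branch on an exact match for `ISL=8192` rather than a threshold mirrors the stated intent that only the 8k/1k cell changes while every other length keeps its prior flags.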

Reviewed by Cursor Bugbot for commit e5edb8c. Bugbot is set up for automated code reviews on this repo.

github-actions · 2026-06-04T23:35:32Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. Failures are often just flakes, and simply re-running the failed jobs will fix them. If failed jobs are re-run, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

claude · 2026-06-04T23:38:54Z

+- config-keys:
+    - minimaxm2.5-fp8-h200-vllm
+  description:
+    - "Switch attention backend from FLASHINFER to FLASH_ATTN for the 8k/1k cell of MiniMax-M2.5 FP8 H200 vLLM."
+    - "1k/1k cell not changed in this PR: at 1k/1k all three measured configs."
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1667


🔴 The new perf-changelog entry at lines 3478-3483 has two documentation defects: (1) pr-link is set to /pull/1667 but this is PR #1668 — every other entry in the file links to its own introducing PR (see lines 3460/3467/3476), and the latest commit 424fe77 was explicitly intended to set this link correctly; (2) the second description bullet "1k/1k cell not changed in this PR: at 1k/1k all three measured configs." is grammatically incomplete — the clause after the colon has no verb. Fix the PR link to /pull/1668 and rewrite the bullet to match the PR description, e.g. "At 1k/1k, keep FLASHINFER + --enable-flashinfer-autotune unchanged (byte-identical to prior behavior)."

Extended reasoning...

Two defects in the new perf-changelog entry

The block added at perf-changelog.yaml lines 3478-3483 has two separate documentation issues that should both be fixed in one pass since they share the same entry.

Defect 1 — wrong pr-link (#1667 instead of #1668). This PR is #1668 per the PR metadata, but the new entry sets:

pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1667

The convention in perf-changelog.yaml is that each entry's pr-link points to the PR that introduced the entry itself. Walking the three immediately preceding entries:

Line 3460 → /pull/1648 (introduced by [NV] Add MiniMax-M2.5 FP8 GB200 Dynamo vLLM recipes #1648)

Line 3467 → /pull/1663 (introduced by [NV] Add MiniMax-M2.5 FP8 B300 Dynamo vLLM recipes #1663)

Line 3476 → /pull/1544 (introduced by [NV] Update H100 Qwen3.5 SGLang agg config #1544)

The most recent commit on this branch — 424fe77 chore: set perf-changelog pr-link for minimaxm2.5-fp8-h200-vllm FA3 swap — explicitly states the intent was to set this link, so 1667 is an off-by-one typo and the intended value is 1668.

Defect 2 — second description bullet is truncated. Line 3482 reads:

- "1k/1k cell not changed in this PR: at 1k/1k all three measured configs."

The clause after the colon — "at 1k/1k all three measured configs." — has no verb and trails off without conveying anything. The PR description states the actual intent clearly: "the 1k/1k cell is unchanged (keeps FLASHINFER + --enable-flashinfer-autotune, byte-identical to prior behavior); only ISL=8192 triggers the swap". The YAML bullet appears to be a mid-edit truncation that lost the explanation.

Step-by-step proof.

PR metadata block above shows <pr number="1668"> — this PR is minimaxm2.5-fp8-h200-vllm: switch 8k/1k attention backend to FLASH_ATTN #1668.

The diff at perf-changelog.yaml line 3483 (post-context) shows pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1667.

1668 ≠ 1667, so the link points to a different (likely unrelated) PR. A reader following the link would land on the wrong page.

The three prior entries in the diff context (lines 3460/3467/3476) all use their own PR number as pr-link, establishing the convention.

For the bullet, parse "1k/1k cell not changed in this PR: at 1k/1k all three measured configs." as English: the subject "all three measured configs" is followed by a period with no verb. The sentence is incomplete by any reasonable reading.

Impact. This is documentation-only — no runtime, benchmark, or correctness effect. But perf-changelog.yaml is human-facing changelog metadata; the wrong link points readers at an unrelated PR, and the broken bullet conveys no information. Both are easy fixes.

Suggested fix. Apply both edits to the new block:

- config-keys:
    - minimaxm2.5-fp8-h200-vllm
  description:
    - "Switch attention backend from FLASHINFER to FLASH_ATTN for the 8k/1k cell of MiniMax-M2.5 FP8 H200 vLLM."
    - "At 1k/1k, keep FLASHINFER + --enable-flashinfer-autotune unchanged (byte-identical to prior behavior)."
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1668
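
The pr-link convention argued above (each entry links to the PR that introduced it) could be checked mechanically. The sketch below is an assumption-laden illustration, not an existing check in this repo: the function name `check_entry` is hypothetical, the entry is assumed to be already parsed into a dict with the keys shown in the diff, and the URL pattern is inferred from the links quoted in this review.

```python
import re
from typing import List

# URL shape inferred from the pr-link values quoted in the review (assumption).
PR_LINK_RE = re.compile(
    r"^https://github\.com/SemiAnalysisAI/InferenceX/pull/(\d+)$"
)


def check_entry(entry: dict, pr_number: int) -> List[str]:
    """Return a list of problems found in one parsed perf-changelog entry."""
    problems = []
    link = str(entry.get("pr-link", ""))
    match = PR_LINK_RE.match(link)
    if match is None:
        problems.append(f"pr-link is missing or malformed: {link!r}")
    elif int(match.group(1)) != pr_number:
        # The entry links to some other PR than the one introducing it.
        problems.append(
            f"pr-link points to #{match.group(1)}, expected #{pr_number}"
        )
    for bullet in entry.get("description", []):
        if not str(bullet).strip():
            problems.append("empty description bullet")
    return problems


if __name__ == "__main__":
    entry = {
        "config-keys": ["minimaxm2.5-fp8-h200-vllm"],
        "description": ["Switch attention backend for the 8k/1k cell."],
        "pr-link": "https://github.com/SemiAnalysisAI/InferenceX/pull/1667",
    }
    print(check_entry(entry, pr_number=1668))
```

A check like this catches only the link mismatch (Defect 1); the truncated description bullet (Defect 2) is a prose problem that still needs a human read.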

github-actions · 2026-06-05T00:14:46Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26986223642
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26986223642

github-actions · 2026-06-05T00:26:06Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26986223642
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26986223642

Oseltamivir

lgtm

Oseltamivir · 2026-06-05T02:26:05Z

/reuse-sweep-run

RohitNagraj added 2 commits June 4, 2026 15:02

chore: set perf-changelog pr-link for minimaxm2.5-fp8-h200-vllm FA3 swap

424fe77

RohitNagraj requested a review from a team June 4, 2026 23:35

github-project-automation Bot added this to InferenceMAX Board Jun 4, 2026

chore: point perf-changelog pr-link to PR #1668

e5edb8c

RohitNagraj mentioned this pull request Jun 4, 2026

minimaxm2.5-fp8-h200-vllm: switch 8k/1k attention backend to FLASH_ATTN #1667

Closed

claude Bot reviewed Jun 4, 2026

Ankur-singh added the full-sweep-enabled label Jun 4, 2026

Oseltamivir approved these changes Jun 5, 2026

faradawn mentioned this pull request Jun 5, 2026

add attention backend recommendation for Minimax 2.5 vllm-project/recipes#512

Merged

RohitNagraj merged commit c138338 into main Jun 5, 2026
98 of 99 checks passed

RohitNagraj deleted the minimax-m25-h200-vllm-8k1k-fa3 branch June 5, 2026 04:43

github-project-automation Bot moved this to Done in InferenceMAX Board Jun 5, 2026

RohitNagraj merged 3 commits into main from minimax-m25-h200-vllm-8k1k-fa3
