add attention backend recommendation for Minimax 2.5 #512

Merged
esmeetu merged 1 commit into vllm-project:main from faradawn:switch-flashattn on Jun 5, 2026

Conversation

Collaborator

@faradawn faradawn commented Jun 5, 2026

For longer input sequences (e.g., 8192), use the FlashAttention backend for the best MiniMax-M2.5 performance on H200 FP8.

Reference: SemiAnalysisAI/InferenceX#1668

Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
Contributor

vercel Bot commented Jun 5, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| vllm-recipes | Ready | Preview, Comment | Jun 5, 2026 4:18am |

@faradawn faradawn requested a review from esmeetu June 5, 2026 04:17
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates the deployment guide for MiniMax-M2.5 to recommend specific attention backends (FLASHINFER or FLASH_ATTN) based on sequence length for H200 FP8. The reviewer suggested phrasing improvements to make the instructions clearer and more direct, along with minor punctuation corrections.

Comment on lines +190 to +193
For H200 FP8, choose the attention backend by sequence length for best performance:
shorter sequences (e.g. 1024) should keep the command above with `--attention-backend FLASHINFER`
and `--enable-flashinfer-autotune`, while longer input sequences (e.g. 8192)
can prefer FlashAttention by replacing those flags with `--attention-backend FLASH_ATTN`.
medium

The phrase 'can prefer FlashAttention' is slightly awkward. It is clearer and more direct to state that longer sequences 'should use' FlashAttention for optimal performance. Also, adding commas after 'e.g.' is standard style.

  For H200 FP8, choose the attention backend by sequence length for best performance:
  shorter sequences (e.g., 1024) should keep the command above with `--attention-backend FLASHINFER`
  and `--enable-flashinfer-autotune`, while longer input sequences (e.g., 8192)
  should use FlashAttention by replacing those flags with `--attention-backend FLASH_ATTN`.
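
The selection rule in the suggestion above could be sketched as a small helper, for example in a launch script. This is only an illustration: the function name and the crossover threshold are hypothetical. The PR gives 1024 and 8192 as examples of short and long sequences but does not state where the switch should happen, so the default below is an assumption to tune by benchmarking.

```python
from typing import List

# Hypothetical crossover point (assumption): the PR only cites 1024 as a
# "shorter" and 8192 as a "longer" sequence length; benchmark to pick a value.
ASSUMED_LONG_SEQ_THRESHOLD = 4096


def attention_backend_flags(
    input_seq_len: int,
    long_threshold: int = ASSUMED_LONG_SEQ_THRESHOLD,
) -> List[str]:
    """Return the attention-related CLI flags for H200 FP8 by sequence length.

    Hypothetical helper; only the flag names come from the deployment guide.
    """
    if input_seq_len < 1:
        raise ValueError("input_seq_len must be positive")
    if input_seq_len >= long_threshold:
        # Longer input sequences: switch to FlashAttention.
        return ["--attention-backend", "FLASH_ATTN"]
    # Shorter sequences: keep FlashInfer with autotuning enabled.
    return ["--attention-backend", "FLASHINFER", "--enable-flashinfer-autotune"]


if __name__ == "__main__":
    for seq_len in (1024, 8192):
        print(seq_len, " ".join(attention_backend_flags(seq_len)))
```

The returned list is meant to be appended to the remaining arguments of the serve command from the guide, which is not reproduced here.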

@esmeetu esmeetu merged commit 66bc7f2 into vllm-project:main Jun 5, 2026
4 checks passed