Support FP8 block-quant TP when intermediate size per rank is not divisible by block_n#39046

Closed
wzhao18 wants to merge 1 commit into
vllm-project:mainfrom
wzhao18:wzhao/fix-minimax-tp8

@wzhao18 wzhao18 commented Apr 5, 2026

Purpose

Currently, FP8 block-quant TP requires the intermediate size per rank to be divisible by block_n. Otherwise, the following check raises:

if intermediate_size_per_partition % block_n != 0:
    raise ValueError(
        f"The output_size of gate's and up's weight = "
        f"{intermediate_size_per_partition} is not divisible by "
        f"weight quantization block_n = {block_n}."
    )

This PR lifts that restriction by padding the intermediate size per rank up to a multiple of block_n.
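As a rough illustration of the padding arithmetic, the sketch below rounds a per-rank intermediate size up to the next multiple of block_n. The helper name `round_up` and the sizes used are hypothetical and chosen only for illustration; they are not taken from this PR's diff.

```python
def round_up(value: int, multiple: int) -> int:
    """Round value up to the nearest multiple of `multiple` (hypothetical helper)."""
    return ((value + multiple - 1) // multiple) * multiple


# Illustrative numbers only, not read from any model config.
intermediate_size = 1536
tp_size = 8
block_n = 128

per_rank = intermediate_size // tp_size   # 192, not divisible by 128
padded = round_up(per_rank, block_n)      # 256
pad_rows = padded - per_rank              # 64 rows of padding per rank

print(per_rank, padded, pad_rows)
```

With these numbers each rank would allocate 256 rows instead of 192, so the block-wise scale tensor has a whole number of blocks per rank.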

Test Plan

  • Unit tests: tests/kernels/moe/test_moe_weight_loading_padded.py with added block-quant weight loading tests
  • End-to-end model test: MiniMax M2.5 with TP8

Test Result

vllm serve MiniMaxAI/MiniMax-M2.5 --trust-remote-code --tensor-parallel-size 8

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9242|±  |0.0073|
|     |       |strict-match    |     5|exact_match|↑  |0.9181|±  |0.0076|



@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request adds support for padded intermediate sizes in FP8 block-quantized MoE layers to ensure alignment for quantization scales. It introduces a new method for calculating checkpoint shard offsets and updates the weight loading process to handle these padded dimensions. Review feedback points out a significant correctness issue in the shard offset calculation for block quantization, noting that using unpadded sizes leads to misalignment between weights and their scales. Additionally, the reviewer suggested a more robust way to distinguish between weight and scale tensors during the loading process.
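One way to picture the misalignment concern is to compute where an unpadded per-rank shard starts relative to the checkpoint's scale blocks. The sketch below is illustrative arithmetic only; the function name and the sizes are hypothetical and do not come from the PR's code.

```python
def scale_blocks_for_shard(rank: int, size_per_rank: int, block_n: int):
    """Return (first_block, last_block, offset_in_first_block) for a rank's
    weight rows, assuming rows are sharded contiguously by rank (an assumption)."""
    start = rank * size_per_rank
    end = start + size_per_rank          # exclusive
    first_block = start // block_n
    last_block = (end - 1) // block_n
    return first_block, last_block, start % block_n


# Illustrative sizes: 192 rows per rank, block_n = 128.
for rank in range(3):
    print(rank, scale_blocks_for_shard(rank, 192, 128))
```

A nonzero offset means the shard begins partway through a scale block, so weight offsets and scale offsets derived from the unpadded size do not line up block for block.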

Comment thread vllm/model_executor/layers/quantization/fp8.py Outdated
Comment thread vllm/model_executor/layers/quantization/fp8.py Outdated

mergify Bot commented Apr 5, 2026

Hi @wzhao18, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

@wzhao18 wzhao18 marked this pull request as draft April 5, 2026 22:25
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
@wzhao18 wzhao18 force-pushed the wzhao/fix-minimax-tp8 branch from 6a4def8 to 8a1f82b Compare April 5, 2026 23:41

wzhao18 commented Apr 5, 2026

/gemini review

@wzhao18 wzhao18 marked this pull request as ready for review April 5, 2026 23:49

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces support for padding the intermediate size of MoE layers to ensure alignment with block-wise quantization requirements (block_n). It updates the FusedMoE weight loading logic to correctly narrow both hidden and intermediate dimensions when padding is applied, adds a rounding mechanism in the FP8 quantization configuration, and includes comprehensive unit tests to verify weight loading across different tensor parallelism ranks.


mergify Bot commented May 23, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @wzhao18.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label May 23, 2026
@wzhao18 wzhao18 closed this May 25, 2026