Support FP8 block-quant TP when intermediate size per rank is not divisible by block_n by wzhao18 · Pull Request #39046 · vllm-project/vllm

wzhao18 · 2026-04-05T21:16:04Z

Purpose

Currently, FP8 block-quant TP requires the intermediate size per rank to be divisible by block_n; otherwise the following check raises an error:

if intermediate_size_per_partition % block_n != 0:
    raise ValueError(
        f"The output_size of gate's and up's weight = "
        f"{intermediate_size_per_partition} is not divisible by "
        f"weight quantization block_n = {block_n}."
    )

This PR lifts that restriction by padding the intermediate size.
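The padding idea can be sketched as rounding the per-rank intermediate size up to the next multiple of block_n. This is a minimal illustration, not the PR's actual code: the helper names `round_up` and `padded_intermediate_size` and the example sizes are assumptions made for this sketch.

```python
def round_up(x: int, multiple: int) -> int:
    """Round x up to the nearest multiple of `multiple`."""
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    return ((x + multiple - 1) // multiple) * multiple


def padded_intermediate_size(intermediate_size_per_partition: int, block_n: int) -> int:
    """Hypothetical helper: per-rank size after padding to a block_n boundary."""
    return round_up(intermediate_size_per_partition, block_n)


# Illustrative numbers only: 192 rows per rank with block_n = 128.
size = padded_intermediate_size(192, 128)
pad_rows = size - 192  # rows of padding added on this rank
print(size, pad_rows)
```

With a padded size, every rank holds a whole number of quantization blocks, so the divisibility check no longer needs to reject the configuration.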

Test Plan

Unit tests: tests/kernels/moe/test_moe_weight_loading_padded.py, with added block-quant weight loading tests
End-to-end model test: MiniMax M2.5 with TP8

Test Result

vllm serve MiniMaxAI/MiniMax-M2.5 --trust-remote-code --tensor-parallel-size 8

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9242|±  |0.0073|
|     |       |strict-match    |     5|exact_match|↑  |0.9181|±  |0.0076|

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

gemini-code-assist

Code Review

This pull request adds support for padded intermediate sizes in FP8 block-quantized MoE layers to ensure alignment for quantization scales. It introduces a new method for calculating checkpoint shard offsets and updates the weight loading process to handle these padded dimensions. Review feedback points out a significant correctness issue in the shard offset calculation for block quantization, noting that using unpadded sizes leads to misalignment between weights and their scales. Additionally, the reviewer suggested a more robust way to distinguish between weight and scale tensors during the loading process.
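The misalignment the review describes can be illustrated with a small calculation. The sketch below is not the reviewer's or the PR's code; the function names and the sizes (1536 rows split across 8 ranks, block_n = 128) are assumptions chosen only to show that an unpadded shard boundary can fall in the middle of a checkpoint scale block.

```python
def weight_row_range(rank: int, tp_size: int, intermediate_size: int) -> tuple[int, int]:
    """Rows [start, end) of the unpadded checkpoint weight owned by `rank`.

    Assumes intermediate_size is divisible by tp_size.
    """
    per_rank = intermediate_size // tp_size
    start = rank * per_rank
    return start, start + per_rank


def scale_block_range(
    rank: int, tp_size: int, intermediate_size: int, block_n: int
) -> tuple[int, int]:
    """Checkpoint scale blocks [first, last) that overlap this rank's rows."""
    start, end = weight_row_range(rank, tp_size, intermediate_size)
    first = start // block_n
    last = -(-end // block_n)  # ceiling division
    return first, last


# Illustrative sizes: 1536 / 8 = 192 rows per rank, which is not a multiple of 128.
r0 = scale_block_range(0, 8, 1536, 128)
r1 = scale_block_range(1, 8, 1536, 128)
print(r0, r1)  # the two ranges overlap in one scale block
```

In this example, ranks 0 and 1 both need scale block 1, so the scale offset for a rank cannot be derived by simply dividing its weight offset by block_n and assuming block-aligned shards.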

mergify · 2026-04-05T21:23:19Z

Hi @wzhao18, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?

mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>

wzhao18 · 2026-04-05T23:45:56Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces support for padding the intermediate size of MoE layers to ensure alignment with block-wise quantization requirements (block_n). It updates the FusedMoE weight loading logic to correctly narrow both hidden and intermediate dimensions when padding is applied, adds a rounding mechanism in the FP8 quantization configuration, and includes comprehensive unit tests to verify weight loading across different tensor parallelism ranks.
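Loading an unpadded checkpoint shard into a padded parameter amounts to copying into a narrowed view of the destination. The sketch below shows that idea with plain nested lists so it runs without any dependencies; it is an assumption-laden illustration, and `copy_into_padded` is a hypothetical name, not a function from vLLM.

```python
def copy_into_padded(param: list[list[float]], loaded: list[list[float]]) -> list[list[float]]:
    """Copy a smaller 2-D shard into the top-left corner of a padded buffer.

    Rows and columns of `param` beyond the shape of `loaded` keep their
    existing (zero) padding values.
    """
    rows, cols = len(loaded), len(loaded[0])
    if rows > len(param) or cols > len(param[0]):
        raise ValueError("loaded shard does not fit into the padded parameter")
    for i in range(rows):
        # Narrow the destination row to the loaded width before assigning.
        param[i][:cols] = loaded[i]
    return param


# A 3x4 zero-initialized "padded" parameter receiving a 2x3 shard.
padded = [[0.0] * 4 for _ in range(3)]
shard = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
result = copy_into_padded(padded, shard)
print(result)
```

Narrowing both dimensions matters because padding may be present on the hidden dimension, the intermediate dimension, or both.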

mergify · 2026-05-23T07:59:13Z

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @wzhao18.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

wzhao18 requested review from WoosukKwon, mgoin, pavanimajety, robertgshaw2-redhat, tlrmchlsmth and yewentao256 as code owners April 5, 2026 21:16

wzhao18 force-pushed the wzhao/fix-minimax-tp8 branch from e99e7a4 to 0ab0142 Compare April 5, 2026 21:16

gemini-code-assist Bot reviewed Apr 5, 2026

Comment thread vllm/model_executor/layers/quantization/fp8.py Outdated

Comment thread vllm/model_executor/layers/quantization/fp8.py Outdated

wzhao18 marked this pull request as draft April 5, 2026 22:25

Fix minimax tp8

8a1f82b

Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>

wzhao18 force-pushed the wzhao/fix-minimax-tp8 branch from 6a4def8 to 8a1f82b Compare April 5, 2026 23:41

wzhao18 marked this pull request as ready for review April 5, 2026 23:49

gemini-code-assist Bot reviewed Apr 5, 2026

cferra mentioned this pull request Apr 12, 2026

[Bug]: Gemma 4 31B FP8_BLOCK checkpoint produces garbage repetitive output — logit saturation at softcap wall due to absorbed activation scales being double-applied #39407

Open

mergify Bot added the needs-rebase label May 23, 2026

wzhao18 closed this May 25, 2026

wzhao18 wants to merge 1 commit into vllm-project:main from wzhao18:wzhao/fix-minimax-tp8
