Support FP8 block-quant TP when intermediate size per rank is not divisible by block_n #39046
wzhao18 wants to merge 1 commit into
Code Review
This pull request adds support for padded intermediate sizes in FP8 block-quantized MoE layers to ensure alignment for quantization scales. It introduces a new method for calculating checkpoint shard offsets and updates the weight loading process to handle these padded dimensions. Review feedback points out a significant correctness issue in the shard offset calculation for block quantization, noting that using unpadded sizes leads to misalignment between weights and their scales. Additionally, the reviewer suggested a more robust way to distinguish between weight and scale tensors during the loading process.
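The misalignment concern can be illustrated with a small sketch. It is not taken from the PR: the function name and the example sizes (192 rows per rank, `block_n` of 128) are hypothetical, and it assumes one scale per `block_n` checkpoint rows with shards cut at unpadded offsets `rank * per_rank`.

```python
def checkpoint_blocks_for_rank(rank: int, per_rank: int, block_n: int):
    """Return (offset into first block, checkpoint scale blocks touched).

    Hypothetical helper: assumes each rank owns rows
    [rank * per_rank, (rank + 1) * per_rank) of the checkpoint weight
    and that scale block i covers rows [i * block_n, (i + 1) * block_n).
    """
    start = rank * per_rank
    end = start + per_rank  # exclusive
    first_block = start // block_n
    last_block = (end - 1) // block_n
    return start % block_n, list(range(first_block, last_block + 1))


# Example sizes are illustrative only.
for rank in range(3):
    offset, blocks = checkpoint_blocks_for_rank(rank, per_rank=192, block_n=128)
    print(f"rank {rank}: offset in first block = {offset}, blocks = {blocks}")
```

With these numbers, rank 1 starts 64 rows into checkpoint block 1, and block 1 is also touched by rank 0, so a per-rank scale tensor cannot simply be sliced at a block boundary.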
Hi @wzhao18, the pre-commit checks have failed. Please run:

    uv pip install "pre-commit>=4.5.1"
    pre-commit install
    pre-commit run --all-files

Then, commit the changes and push to your branch.
Signed-off-by: wzhao18 <wzhao18.sz@gmail.com>
/gemini review
Code Review
This pull request introduces support for padding the intermediate size of MoE layers to ensure alignment with block-wise quantization requirements (block_n). It updates the FusedMoE weight loading logic to correctly narrow both hidden and intermediate dimensions when padding is applied, adds a rounding mechanism in the FP8 quantization configuration, and includes comprehensive unit tests to verify weight loading across different tensor parallelism ranks.
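A minimal sketch of loading an unpadded checkpoint shard into a padded parameter follows. It uses plain Python lists in place of tensors, and the function name, the zero fill, and the choice to place padding at the end of the rank's rows are all assumptions for illustration, not the PR's actual implementation.

```python
def load_shard_into_padded(checkpoint_rows, rank, per_rank, padded_per_rank):
    """Copy one rank's rows into a zero-filled, padded buffer.

    Hypothetical helper: the real loader operates on tensors and
    narrows both the hidden and intermediate dimensions.
    """
    start = rank * per_rank
    shard = checkpoint_rows[start:start + per_rank]
    param = [0.0] * padded_per_rank   # padded parameter, zero-initialized
    param[:len(shard)] = shard        # write only the unpadded region
    return param


rows = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(load_shard_into_padded(rows, rank=1, per_rank=3, padded_per_rank=4))
```

Only the leading `per_rank` entries are overwritten, so the padded tail keeps its initial value in this sketch.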
This pull request has merge conflicts that must be resolved before it can be
Purpose
Currently, FP8 block-quant TP requires the intermediate size per rank to be divisible by block_n.
This PR lifts that requirement by padding the intermediate size so that it aligns with block_n.
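The padding arithmetic can be sketched as a round-up to the next multiple of `block_n`. The function names and example sizes below are hypothetical, and the sketch assumes the total intermediate size divides evenly across TP ranks.

```python
def round_up(value: int, multiple: int) -> int:
    """Round value up to the nearest multiple of `multiple`."""
    return ((value + multiple - 1) // multiple) * multiple


def padded_intermediate_per_rank(intermediate_size: int, tp_size: int, block_n: int) -> int:
    """Per-rank intermediate size after padding to a multiple of block_n.

    Assumption: intermediate_size is divisible by tp_size.
    """
    per_rank = intermediate_size // tp_size
    return round_up(per_rank, block_n)


# 1536 / 8 = 192 rows per rank, which is not a multiple of 128.
print(padded_intermediate_per_rank(1536, tp_size=8, block_n=128))
```

When the per-rank size is already a multiple of `block_n`, the function returns it unchanged, so the padding only applies to the non-divisible case.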
Test Plan
tests/kernels/moe/test_moe_weight_loading_padded.py with added block-quant weight loading tests
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.