Commit f238d93
vLLM fakequant fold weight_quantizer for megatron export (#1246)
### What does this PR do?
Type of change: Bug fix
During Megatron→vLLM fakequant export
(`export_mcore_gpt_to_hf_vllm_fq`), the `weight_quantizer` is now
applied as fake-quantization (quantize + dequantize) directly into the
exported weight tensor, and its amax is no longer saved to
`quantizer_state.pth`. On reload, if `weight_quantizer` keys are absent
from the checkpoint (because they were folded at export time), the
corresponding quantizer modules are disabled.
This is especially useful when `weight_quantizer` amax values are not synced across experts: folding preserves the per-expert amax differences in the exported weights, which gives better accuracy.
### Usage
```python
# Unchanged — export API is the same
export_mcore_gpt_to_hf_vllm_fq(model, pretrained_model_name_or_path=..., export_dir=...)
```
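On the reload side, the disable-on-missing-keys behavior can be sketched roughly as follows; `DummyQuantizer` and `restore_quantizer_state` are illustrative stand-ins, not the real ModelOpt quantizer modules or loading code:

```python
import torch.nn as nn

class DummyQuantizer(nn.Module):
    """Stand-in for a weight quantizer module with an enable/disable flag."""
    def __init__(self):
        super().__init__()
        self.enabled = True

    def disable(self):
        self.enabled = False

def restore_quantizer_state(model: nn.Module, quantizer_state: dict) -> None:
    """Restore quantizer state; disable weight quantizers folded at export."""
    for name, module in model.named_modules():
        if name.endswith("weight_quantizer"):
            if name in quantizer_state:
                module.load_state_dict(quantizer_state[name])  # amax etc. present
            else:
                module.disable()  # folded into weights; skip runtime fake-quant
```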
### Testing
Step 1 — Quantize (run from Megatron-LM
`examples/post_training/modelopt`):
```bash
HF_MODEL_CKPT=<path/to/hf/weights> MLM_MODEL_SAVE=<quant-ckpt-name> \
bash quantize.sh <hf-model-id> NVFP4_DEFAULT_CFG
```
Step 2 — Export for vLLM fakequant:
```bash
MLM_EXTRA_ARGS=--export-vllm-fq \
HF_MODEL_CKPT=<path/to/hf/weights> \
MLM_MODEL_CKPT=<quant-ckpt-name> \
EXPORT_DIR=<export-dir> \
bash export.sh <hf-model-id>
```
Step 3 — Serve (run from examples/vllm_serve):
```bash
QUANT_CFG=NVFP4_DEFAULT_CFG \
QUANT_FILE_PATH=<export-dir>/quantizer_state.pth \
python3 vllm_serve_fakequant.py <export-dir> \
-tp 1 --served-model-name <model-name> \
--host 0.0.0.0 --port 8000 \
--trust-remote-code --enforce-eager \
--disable-custom-all-reduce \
--gpu-memory-utilization 0.8
```
### Before your PR is "*Ready for review*"
Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`).
Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A
## Summary by CodeRabbit
* **Bug Fixes**
* Better handling when loading checkpoints: missing weight-quantizer entries are validated and corresponding modules are disabled to avoid load failures.
* **Improvements**
* Export now folds enabled weight quantizers into exported weights when present and omits internal weight-quantizer tensors from the exported state to produce cleaner exports.
---------
Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
File tree: 3 files changed (+106 −17) in `examples/vllm_serve` and `modelopt/torch/export/plugins`.