Commit 010b220
vLLM fakequant export update for AWQ checkpoint (#1242)
### What does this PR do?
Type of change: Bug fix
Enables end-to-end AWQ checkpoint export and reload in the vLLM
fake-quant serving path (`MODELOPT_STATE_PATH`). Previously, grouped
quantizers such as `qkv_proj` reloaded an incorrect `pre_quant_scale`:
the export simply took the first `input_quantizer.pre_quant_scale` in
the group. This PR adds `_resmooth_experts_for_export`, which
non-mutatively averages `pre_quant_scale` across MoE experts and unifies
the input `_amax`, required because vLLM uses a single input quantizer
per expert group. It also adds `merge_amax_tensors_for_group`
(element-wise max for same-shape tensors, concatenation for GQA,
scalar-max fallback), replacing the scalar-collapsing
`torch.stack().max()` that dropped per-channel `_amax` structure.
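The merging rules above can be sketched in plain Python. This is a hypothetical illustration of the described behavior, not the actual `merge_amax_tensors_for_group` implementation; lists stand in for 1-D `_amax` tensors:

```python
# Hypothetical sketch of the merging rules described above; plain Python
# lists stand in for the per-channel _amax tensors of the fused linears.

def merge_amax_for_group(amaxes: list[list[float]]) -> list[float]:
    """Merge per-linear _amax values for a fused group such as qkv_proj."""
    if len({len(a) for a in amaxes}) == 1:
        # Same shape: element-wise max preserves per-channel structure.
        return [max(vals) for vals in zip(*amaxes)]
    if all(len(a) > 1 for a in amaxes):
        # Mismatched per-channel shapes (e.g. GQA q vs. kv projections):
        # concatenate rather than lose channels.
        return [v for a in amaxes for v in a]
    # Fallback: collapse to a single scalar max.
    return [max(max(a) for a in amaxes)]
```

By contrast, the replaced `torch.stack().max()` collapsed even the same-shape case to a scalar, discarding per-channel `_amax` information.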
### Usage
```python
# Export AWQ checkpoint from HF model
from modelopt.torch.export.plugins.vllm_fakequant_hf import export_hf_vllm_fq_checkpoint
export_hf_vllm_fq_checkpoint(model, export_dir="./awq_vllm_checkpoint")
```
### Testing
**Step 1 — Export the quantized checkpoint:**
```bash
python examples/llm_ptq/hf_ptq.py \
--pyt_ckpt_path <MODEL_PATH> \
--recipe <AWQ_RECIPE> \
--calib_size 512 \
--export_path <EXPORT_DIR> \
--vllm_fakequant_export
```
This produces `<EXPORT_DIR>/vllm_fq_modelopt_state.pth`, which now includes
the averaged per-expert `pre_quant_scale` and unified `_amax`.
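The per-expert averaging can be sketched similarly. This is a hypothetical illustration of the averaging described above, not the actual `_resmooth_experts_for_export` code; lists stand in for per-channel scale tensors:

```python
# Hypothetical sketch: vLLM keeps a single input quantizer per expert
# group, so the exporter averages each expert's pre_quant_scale into one
# shared vector without mutating the originals.

def average_pre_quant_scales(expert_scales: list[list[float]]) -> list[float]:
    n = len(expert_scales)
    return [sum(channel) / n for channel in zip(*expert_scales)]
```

Because the shared scale differs from each expert's original one, the expert weights are resmoothed (with optional requantization) against the averaged scale during export.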
**Step 2 — Serve via the vLLM fakequant worker:**
```bash
MODELOPT_STATE_PATH=<EXPORT_DIR>/vllm_fq_modelopt_state.pth \
python examples/vllm_serve/vllm_serve_fakequant.py \
<EXPORT_DIR> --tensor-parallel-size <TP>
```
Tested with the following quantization configurations:
```
FP8_DEFAULT_CFG
FP8_DEFAULT_CFG (input_q disabled)
INT8_SMOOTHQUANT_CFG
INT8_WEIGHT_ONLY_CFG
NVFP4_DEFAULT_CFG
NVFP4_AWQ_LITE_CFG
INT4_AWQ_CFG
NVFP4_AWQ_CFG
NVFP4_DEFAULT_CFG (input_q disabled)
```
### Before your PR is "*Ready for review*"
Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`).
Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A
### Additional Information
## Summary by CodeRabbit
* **New Features**
* Added Nemotron-style MoE export support and group-aware AWQ resmoothing with optional requantization during export.
* Improved handling for shared-input / expert groups and tensor-parallel sharding of pre-quantization scales.
* **Bug Fixes**
* Removed AWQ reload limitation from known issues; improved checkpoint validation and safer save/load behavior.
* Better detection and handling of enabled weight-quantizers and clearer warnings for mismatched checkpoint keys.
---------
Signed-off-by: Kinjal Patel <kinjalpravin@nvidia.com>
File tree (8 files changed, +911/-161 lines):
- examples/vllm_serve
- modelopt/torch/export
  - plugins
- tests
  - gpu/torch/export
  - unit/torch/export