Commit f26b9c3
skip generate option for large models and mxfp8 (#942)
## What does this PR do?
**Type of change:** New feature
**Overview:** Adds a `--skip_generate` flag to `hf_ptq.py` that skips
the pre/post-quantization generation preview calls. These calls run
`model.generate()` which crashes for very large models (500B+) that are
split across GPU and CPU via `device_map="auto"` (e.g., models with
Mamba/Triton kernels that cannot handle CPU-offloaded tensors).
## Usage
```
python examples/llm_ptq/hf_ptq.py \
--pyt_ckpt_path /path/to/model \
--export_path /path/to/output \
--qformat mxfp8 \
--trust_remote_code \
--export_fmt hf \
--batch_size 1 \
--skip_generate \
--kv_cache_qformat none
```
## Testing
Tested with a 500B-parameter NemotronH hybrid Mamba/attention model on
4x GB200 GPUs. Without `--skip_generate`, the script crashes at
`model.generate()` because the Mamba Triton kernels fail on CPU-offloaded
tensors. With `--skip_generate`, the generation preview is skipped and
quantization proceeds normally.
## Before your PR is "*Ready for review*"
- **Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)** and your commits are signed.
- **Is this change backward compatible?**: Yes/No
- **Did you write any new necessary tests?**: Yes/No
- **Did you add or update any necessary documentation?**: Yes/No
- **Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?**: Yes/No
## Additional Information
The `--skip_generate` flag sets `generated_ids_before_ptq = None` early,
which also causes the post-quantization generate call to be skipped via
the existing `if generated_ids_before_ptq is None: pass` guard. Combined
with `--batch_size 1` (which skips the `get_max_batch_size` forward-pass
probe), this eliminates all forward passes that can crash for
device-map-split models.
## Summary by CodeRabbit
* **New Features**
* Introduced `--skip_generate` CLI option to skip the pre- and
post-quantization text and image generation previews, reducing
processing time for very large models where generation previews are
computationally expensive or crash-prone.
---------
Signed-off-by: adithyare <adithyare@nvidia.com>
Signed-off-by: Adi Renduchintala <adithya.r@gmail.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

1 parent ba29ad7, commit f26b9c3
1 file changed: 13 additions & 2 deletions