[3.1/4] Diffusion Quantized ckpt export - WAN 2.2 14B (#855)
## What does this PR do?
**Type of change:** documentation <!-- Use one of the following: Bug
fix, new feature, new example, new tests, documentation. -->
**Overview:**
1. Added multi‑backbone support for quantization: --backbone now accepts
space- or comma-separated lists and resolves to a list of backbone
modules.
2. Introduced PipelineManager.iter_backbones() to iterate named backbone
modules and updated get_backbone() to return a single module or a
ModuleList for multi‑backbone.
3. Updated ExportManager to save/restore per‑backbone checkpoints when a
directory is provided, with {backbone_name}.pt files, and to create
target directories when missing.
4. Simplified save_checkpoint() calls to rely on the registered
pipeline_manager by default.
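
As a rough illustration of item 1, the snippet below shows one way a space- or comma-separated `--backbone` value could be flattened into a list of names. It is a sketch only: `parse_backbone_names` is a hypothetical helper, not the function used in `quantize.py`, and the `nargs="+"` declaration is an assumption about how the flag is defined.

```python
import argparse
import re


def parse_backbone_names(values):
    """Flatten space- and/or comma-separated backbone names into one ordered list.

    Hypothetical helper for illustration; not taken from quantize.py.
    """
    names = []
    for value in values:
        # Each argv token may itself hold several comma-separated names.
        for name in re.split(r"[,\s]+", value):
            if name:  # skip empty pieces left by stray commas
                names.append(name)
    return names


parser = argparse.ArgumentParser()
# Assumption: the flag collects one or more tokens.
parser.add_argument("--backbone", nargs="+", default=["transformer"])

# Both spellings resolve to the same list of names.
spaced = parser.parse_args(["--backbone", "transformer", "transformer_2"])
commas = parser.parse_args(["--backbone", "transformer,transformer_2"])
backbones = parse_backbone_names(spaced.backbone)
backbones_from_commas = parse_backbone_names(commas.backbone)
```

Each resulting name would then be looked up on the pipeline to obtain the corresponding module.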
**Usage:**
```bash
python quantize.py --model wan2.2-t2v-14b --format fp4 --batch-size 1 --calib-size 32 \
--n-steps 30 --backbone transformer transformer_2 --model-dtype BFloat16 \
--quantized-torch-ckpt-save-path ./wan22_mo_ckpts \
--hf-ckpt-dir ./wan2.2-t2v-14b
```
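
When the checkpoint path is a directory, item 3 above implies one `{backbone_name}.pt` file per backbone, with the directory created if it is missing. A minimal sketch of that layout follows; `backbone_checkpoint_paths` is a hypothetical helper, and the real `ExportManager` logic may differ.

```python
import tempfile
from pathlib import Path


def backbone_checkpoint_paths(ckpt_dir, backbone_names):
    """Map each backbone name to `<ckpt_dir>/<backbone_name>.pt`.

    Hypothetical helper for illustration; creates the directory when missing.
    """
    root = Path(ckpt_dir)
    root.mkdir(parents=True, exist_ok=True)
    return {name: root / f"{name}.pt" for name in backbone_names}


# Demo in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    paths = backbone_checkpoint_paths(
        Path(tmp) / "wan22_mo_ckpts", ["transformer", "transformer_2"]
    )
    for name, path in paths.items():
        # A real save step would write each backbone's state to `path` here.
        print(name, "->", path.name)
```

Restoring would walk the same mapping and load each file back into the backbone with the matching name.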
Plans
- [x] [1/4] Add the basic functionalities to support limited image
models with NVFP4 + FP8, with some refactoring on the previous LLM code
and the diffusers example. PIC: @jingyu-ml
- [x] [2/4] Add support to more video gen models. PIC: @jingyu-ml
- [x] [3/4] Add test cases, refactor on the doc, and all related README.
PIC: @jingyu-ml
- [ ] [4/4] Add the final support to ComfyUI. PIC: @jingyu-ml
## Testing
<!-- Mention how have you tested your change if applicable. -->
## Before your PR is "*Ready for review*"
<!-- If you haven't finished some of the above items you can still open
`Draft` PR. -->
- **Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)**
and your commits are signed.
- **Is this change backward compatible?**: No <!--- If No, explain why.
-->
- **Did you write any new necessary tests?**: No
- **Did you add or update any necessary documentation?**: Yes
- **Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?**:
Yes <!--- Only for new features, API changes, critical bug fixes or bw
breaking changes. -->
## Additional Information
<!-- E.g. related issue. -->
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
## Release Notes
* **New Features**
  * Unified Hugging Face export support for diffusers pipelines and components
  * LTX-2 and Wan2.2 (T2V) support in diffusers quantization workflow
  * Comprehensive ONNX export and TensorRT engine build documentation for diffusion models
* **Documentation**
  * Updated to clarify support for both transformers and diffusers models in unified export API
  * Expanded diffusers examples with LoRA fusion guidance and additional model options (Flux, SD3, SDXL variants)
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
 **[Optimize]** Model Optimizer provides Python APIs for users to easily compose the above model optimization techniques and export an optimized quantized checkpoint.
 Model Optimizer is also integrated with [NVIDIA Megatron-Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge), [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) and [Hugging Face Accelerate](https://github.com/huggingface/accelerate) for training required inference optimization techniques.

-**[Export for deployment]** Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized checkpoint generated from Model Optimizer is ready for deployment in downstream inference frameworks like [SGLang](https://github.com/sgl-project/sglang), [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization), [TensorRT](https://github.com/NVIDIA/TensorRT), or [vLLM](https://github.com/vllm-project/vllm).
+**[Export for deployment]** Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized checkpoint generated from Model Optimizer is ready for deployment in downstream inference frameworks like [SGLang](https://github.com/sgl-project/sglang), [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization), [TensorRT](https://github.com/NVIDIA/TensorRT), or [vLLM](https://github.com/vllm-project/vllm). The unified Hugging Face export API now supports both transformers and diffusers models.
-We support exporting modelopt-optimized Huggingface models and Megatron Core models to a unified checkpoint format that can be deployed in various inference frameworks such as TensorRT-LLM, vLLM, and SGLang.
+We support exporting modelopt-optimized Hugging Face models (transformers and diffusers pipelines/components) and Megatron Core models to a unified checkpoint format that can be deployed in various inference frameworks such as TensorRT-LLM, vLLM, and SGLang.

 The workflow is as follows:
@@ -32,6 +32,10 @@ The export API (:meth:`export_hf_checkpoint <modelopt.torch.export.unified_expor
 export_dir, # The directory where the exported files will be stored.
 )

+.. note::
+    ``export_hf_checkpoint`` also supports diffusers pipelines and components (e.g., UNet/transformer). See the
+    diffusers quantization examples for end-to-end workflows and CLI usage.
docs/source/getting_started/1_overview.rst (1 addition, 1 deletion)
@@ -9,7 +9,7 @@ Minimizing inference costs presents a significant challenge as generative AI mod
 The `NVIDIA Model Optimizer <https://github.com/NVIDIA/Model-Optimizer>`_ (referred to as Model Optimizer, or ModelOpt)
 is a library comprising state-of-the-art model optimization techniques including quantization and sparsity to compress model.
 It accepts a torch or ONNX model as input and provides Python APIs for users to easily stack different model optimization
-techniques to produce optimized & quantized checkpoints. Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized checkpoint generated from Model Optimizer is ready for deployment in downstream inference frameworks like `TensorRT-LLM <https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization>`_ or `TensorRT <https://github.com/NVIDIA/TensorRT>`_ (Linux). ModelOpt is integrated with `NVIDIA NeMo <https://github.com/NVIDIA-NeMo/NeMo>`_ and `Megatron-LM <https://github.com/NVIDIA/Megatron-LM>`_ for training-in-the-loop optimization techniques. For enterprise users, the 8-bit quantization with Stable Diffusion is also available on `NVIDIA NIM <https://developer.nvidia.com/blog/nvidia-nim-offers-optimized-inference-microservices-for-deploying-ai-models-at-scale/>`_.
+techniques to produce optimized & quantized checkpoints. Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized checkpoint generated from Model Optimizer is ready for deployment in downstream inference frameworks like `TensorRT-LLM <https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization>`_ or `TensorRT <https://github.com/NVIDIA/TensorRT>`_ (Linux). The unified Hugging Face export API supports both transformers and diffusers models. ModelOpt is integrated with `NVIDIA NeMo <https://github.com/NVIDIA-NeMo/NeMo>`_ and `Megatron-LM <https://github.com/NVIDIA/Megatron-LM>`_ for training-in-the-loop optimization techniques. For enterprise users, the 8-bit quantization with Stable Diffusion is also available on `NVIDIA NIM <https://developer.nvidia.com/blog/nvidia-nim-offers-optimized-inference-microservices-for-deploying-ai-models-at-scale/>`_.

 For Windows users, the `Model Optimizer for Windows <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/windows>`_ (ModelOpt-Windows) delivers model compression techniques, including quantization, on Windows RTX PC systems. ModelOpt-Windows is optimized for efficient quantization, featuring local GPU calibration, reduced system and video memory consumption, and swift processing times. It integrates seamlessly with the Windows ecosystem, with optimized ONNX models as output for `Microsoft DirectML <https://github.com/microsoft/DirectML>`_ and `TensorRT-RTX <https://github.com/NVIDIA/TensorRT-RTX>`_ backends. Furthermore, ModelOpt-Windows supports SDKs such as `Microsoft Olive <https://github.com/microsoft/Olive>`_ and `ONNX Runtime <https://github.com/microsoft/onnxruntime>`_, enabling the deployment of quantized models across various independent hardware vendors through the DirectML path.
@@ -89,44 +89,45 @@ We support calibration for INT8, FP8 and FP4 precision and for both weights and
 We also provide instructions on deploying and running E2E diffusion pipelines with Model Optimizer quantized INT8 and FP8 Backbone to generate images and measure latency on target GPUs. Note, Jetson devices are not supported at this time due to the incompatibility of the software.

 > [!NOTE]
-> Model calibration requires relatively more GPU computing power then deployment.It does not need to be on the same GPUs as the deployment target GPUs. Using the command line below will execute both calibration and ONNX export.
+> Model calibration requires relatively more GPU computing power then deployment.It does not need to be on the same GPUs as the deployment target GPUs. ONNX export and TensorRT engine instructions live in [`quantization/ONNX-TRT-Deployment.md`](./quantization/ONNX-TRT-Deployment.md).

-### Quantize and export scripts
+### Quantize scripts

-#### 8-bit Quantize and ONNX Export [Script](./quantization/build_sdxl_8bit_engine.sh)
-
-You can run the following script to quantize SDXL backbone to INT8 or FP8 and generate an onnx model built with default settings for SDXL. You can then directly head to the [Build the TRT engine for the Quantized ONNX Backbone](#build-the-trt-engine-for-the-quantized-onnx-backbone) section to run E2E pipeline and generate images.
-
-```sh
-bash build_sdxl_8bit_engine.sh --format {FORMAT} # FORMAT can be int8 or fp8
-```
-
-If you prefer to customize parameters in calibration or run other models, please follow the instructions below.
@@ -135,7 +136,7 @@ We recommend using a device with a minimum of 48GB of combined CPU and GPU memor
 - `calib-size`: For SDXL INT8, we recommend 32 or 64, for SDXL FP8, 128 is recommended.
 - `n_steps`: Recommendation: SD/SDXL 20 or 30, SDXL-Turbo 4.

-**Then, we can load the generated checkpoint and export the INT8/FP8 quantized model in the next step. For FP8, we only support the TRT deployment on Ada/Hopper GPUs.**
+**You can use the generated checkpoint directly in PyTorch, export a Hugging Face checkpoint (`--hf-ckpt-dir`) to deploy the model on SGLang/vLLM/TRTLLM, or follow the ONNX/TensorRT workflow in [`quantization/ONNX-TRT-Deployment.md`](./quantization/ONNX-TRT-Deployment.md).**
### Build the TRT engine for the Quantized ONNX Backbone
-
-> [!IMPORTANT]
-> TensorRT environment must be setup prior -- Please see [Pre-Requisites](#pre-requisites)
-> INT8 requires **TensorRT version >= 9.2.0**. If you prefer to use the FP8 TensorRT, ensure you have **TensorRT version 10.2.0 or higher**. You can download the latest version of TensorRT at [here](https://developer.nvidia.com/tensorrt/download). Deployment of SVDQuant is currently not supported.
**Please note that `maxShapes` represents the maximum shape of the given tensor. If you want to use a larger batch size or any other dimensions, feel free to adjust the value accordingly.**
-
-### Run End-to-end Stable Diffusion Pipeline with Model Optimizer Quantized ONNX Model and demoDiffusion
-
-#### demoDiffusion
-
-If you want to run end-to-end SD/SDXL pipeline with Model Optimizer quantized UNet to generate images and measure latency on target GPUs, here are the steps:
-
-- Clone a copy of [demo/Diffusion repo](https://github.com/NVIDIA/TensorRT/tree/release/10.2/demo/Diffusion).
-
-- Following the README from demoDiffusion to set up the pipeline, and run a baseline txt2img example (fp16):
-
-```sh
-# SDXL
-python demo_txt2img_xl.py "enchanted winter forest, soft diffuse light on a snow-filled day, serene nature scene, the forest is illuminated by the snow" --negative-prompt "normal quality, low quality, worst quality, low res, blurry, nsfw, nude" --version xl-1.0 --scheduler Euler --denoising-steps 30 --seed 2946901
-# Please refer to the examples provided in the demoDiffusion SD/SDXL pipeline.
-```
-
-Note, it will take some time to build TRT engines for the first time
-
-- Replace the fp16 backbone TRT engine with int8 engine generated in [Build the TRT engine for the Quantized ONNX Backbone](#build-the-trt-engine-for-the-quantized-onnx-backbone), e.g.,:
-
-```sh
-cp -r {YOUR_UNETXL}.plan ./engine/
-```
-
-Note, the engines must be built on the same GPU, and ensure that the INT8 engine name matches the names of the FP16 engines to enable compatibility with the demoDiffusion pipeline.
-
-- Run the above txt2img example command again. You can compare the generated images and latency for fp16 vs int8.
-Similarly, you could run end-to-end pipeline with Model Optimizer quantized backbone and corresponding examples in demoDiffusion with other diffusion models.
-
-### Running the inference pipeline with DeviceModel
-
-DeviceModel is an interface designed to run TensorRT engines like torch models. It takes torch inputs and returns torch outputs. Under the hood, DeviceModel exports a torch checkpoint to ONNX and then generates a TensorRT engine from it. This allows you to swap the backbone of the diffusion pipeline with DeviceModel and execute the pipeline for your desired prompt.
-
-Generate a quantized torch checkpoint using the [Script](./quantization/quantize.py) shown below: