
Commit 1afa523

jingyu-ml authored and kevalmorabia97 committed
[3.1/4] Diffusion Quantized ckpt export - WAN 2.2 14B (#855)
## What does this PR do?

**Type of change:** Documentation

**Overview:**

1. Added multi-backbone support for quantization: `--backbone` now accepts space- or comma-separated lists and resolves to a list of backbone modules.
2. Introduced `PipelineManager.iter_backbones()` to iterate named backbone modules, and updated `get_backbone()` to return a single module, or a `ModuleList` for multi-backbone.
3. Updated `ExportManager` to save/restore per-backbone checkpoints when a directory is provided, writing `{backbone_name}.pt` files and creating target directories when missing.
4. Simplified `save_checkpoint()` calls to rely on the registered `pipeline_manager` by default.

**Usage:**

```bash
python quantize.py --model wan2.2-t2v-14b --format fp4 --batch-size 1 --calib-size 32 \
    --n-steps 30 --backbone transformer transformer_2 --model-dtype BFloat16 \
    --quantized-torch-ckpt-save-path ./wan22_mo_ckpts \
    --hf-ckpt-dir ./wan2.2-t2v-14b
```

**Plans:**

- [x] [1/4] Add the basic functionality to support limited image models with NVFP4 + FP8, with some refactoring of the previous LLM code and the diffusers example. PIC: @jingyu-ml
- [x] [2/4] Add support for more video-generation models. PIC: @jingyu-ml
- [x] [3/4] Add test cases, refactor the docs, and update all related READMEs. PIC: @jingyu-ml
- [ ] [4/4] Add the final support for ComfyUI. PIC: @jingyu-ml

## Before your PR is "*Ready for review*"

- **Make sure you read and follow the [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)** and your commits are signed.
- **Is this change backward compatible?**: No
- **Did you write any new necessary tests?**: No
- **Did you add or update any necessary documentation?**: Yes
- **Did you update the [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?**: Yes

## Summary by CodeRabbit

**Release Notes**

- **New Features**
  - Unified Hugging Face export support for diffusers pipelines and components
  - LTX-2 and Wan2.2 (T2V) support in the diffusers quantization workflow
  - Comprehensive ONNX export and TensorRT engine build documentation for diffusion models
- **Documentation**
  - Updated to clarify support for both transformers and diffusers models in the unified export API
  - Expanded diffusers examples with LoRA fusion guidance and additional model options (Flux, SD3, SDXL variants)

Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
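The multi-backbone behavior described in the overview (list-valued `--backbone`, `iter_backbones()`, and per-backbone `{backbone_name}.pt` files) can be sketched roughly as follows. All class and function names here are hypothetical stand-ins for the real ModelOpt code, and `pickle` stands in for `torch.save` so the sketch stays dependency-free:

```python
import os
import pickle


def parse_backbones(arg):
    """Sketch of --backbone parsing: space- or comma-separated names."""
    return arg.replace(",", " ").split()


class PipelineManager:
    """Hypothetical stand-in for the named-backbone helpers described above.

    `pipe` is any object whose attributes hold the backbone modules.
    """

    def __init__(self, pipe, backbone_names):
        self.pipe = pipe
        self.backbone_names = list(backbone_names)

    def iter_backbones(self):
        # Yield (name, module) pairs for each requested backbone.
        for name in self.backbone_names:
            yield name, getattr(self.pipe, name)

    def get_backbone(self):
        # Single module for one backbone, a list for several
        # (the real code reportedly returns an nn.ModuleList).
        modules = [m for _, m in self.iter_backbones()]
        return modules[0] if len(modules) == 1 else modules


def save_checkpoint(manager, path, save_fn=pickle.dumps):
    """Directory target -> one {backbone_name}.pt per backbone; else one file."""
    if path.endswith(".pt"):
        with open(path, "wb") as f:
            f.write(save_fn(manager.get_backbone()))
    else:
        os.makedirs(path, exist_ok=True)  # create the target dir when missing
        for name, module in manager.iter_backbones():
            with open(os.path.join(path, f"{name}.pt"), "wb") as f:
                f.write(save_fn(module))
```

For WAN 2.2 14B's two transformers, `--backbone transformer transformer_2` would resolve to both modules and the directory checkpoint would contain `transformer.pt` and `transformer_2.pt`.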
1 parent 9e313ad commit 1afa523

File tree

10 files changed: +268 −160 lines changed


CHANGELOG.rst

Lines changed: 2 additions & 0 deletions

@@ -17,6 +17,8 @@ NVIDIA Model Optimizer Changelog (Linux)
  - Add support for calibration data with multiple samples in ``npz`` format in the ONNX Autocast workflow.
  - Add ``--opset`` option to ONNX quantization CLI to specify the target opset version for the quantized model.
  - Add support for context parallelism in Eagle speculative decoding for huggingface and megatron core models.
+ - Add unified Hugging Face export support for diffusers pipelines/components.
+ - Add LTX-2 and Wan2.2 (T2V) support in the diffusers quantization workflow.
  - Add PTQ support for GLM-4.7, including loading MTP layer weights from a separate ``mtp.safetensors`` file and export as-is.
  - Add support for image-text data calibration in PTQ for Nemotron VL models.

README.md

Lines changed: 1 addition & 1 deletion

@@ -22,7 +22,7 @@ ______________________________________________________________________
  **[Optimize]** Model Optimizer provides Python APIs for users to easily compose the above model optimization techniques and export an optimized quantized checkpoint.
  Model Optimizer is also integrated with [NVIDIA Megatron-Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge), [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) and [Hugging Face Accelerate](https://github.com/huggingface/accelerate) for training required inference optimization techniques.

- **[Export for deployment]** Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized checkpoint generated from Model Optimizer is ready for deployment in downstream inference frameworks like [SGLang](https://github.com/sgl-project/sglang), [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization), [TensorRT](https://github.com/NVIDIA/TensorRT), or [vLLM](https://github.com/vllm-project/vllm).
+ **[Export for deployment]** Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized checkpoint generated from Model Optimizer is ready for deployment in downstream inference frameworks like [SGLang](https://github.com/sgl-project/sglang), [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization), [TensorRT](https://github.com/NVIDIA/TensorRT), or [vLLM](https://github.com/vllm-project/vllm). The unified Hugging Face export API now supports both transformers and diffusers models.

  ## Latest News

docs/source/deployment/3_unified_hf.rst

Lines changed: 5 additions & 1 deletion

@@ -2,7 +2,7 @@
  Unified HuggingFace Checkpoint
  =================================================================

- We support exporting modelopt-optimized Huggingface models and Megatron Core models to a unified checkpoint format that can be deployed in various inference frameworks such as TensorRT-LLM, vLLM, and SGLang.
+ We support exporting modelopt-optimized Hugging Face models (transformers and diffusers pipelines/components) and Megatron Core models to a unified checkpoint format that can be deployed in various inference frameworks such as TensorRT-LLM, vLLM, and SGLang.

  The workflow is as follows:

@@ -32,6 +32,10 @@ The export API (:meth:`export_hf_checkpoint <modelopt.torch.export.unified_expor
      export_dir, # The directory where the exported files will be stored.
  )

+ .. note::
+     ``export_hf_checkpoint`` also supports diffusers pipelines and components (e.g., UNet/transformer). See the
+     diffusers quantization examples for end-to-end workflows and CLI usage.
+
  Deployment Support Matrix
  ==============================================
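The note added above says the unified export entry point accepts diffusers pipelines and components as well as transformers models. A dependency-free sketch of what such a dispatch might look like — every name here is a hypothetical stand-in, not the actual `export_hf_checkpoint` implementation, and JSON stubs stand in for safetensors weight files:

```python
import json
import os


def export_hf_checkpoint(model, export_dir):
    """Hypothetical dispatch: route diffusers-style pipelines vs. single models."""
    os.makedirs(export_dir, exist_ok=True)
    if hasattr(model, "components"):  # diffusers pipelines expose .components
        exported = []
        for name, component in model.components.items():
            if component is None:  # optional components may be absent
                continue
            _export_component(component, os.path.join(export_dir, name))
            exported.append(name)
        _write_index(export_dir, kind="pipeline", components=exported)
    else:  # single component (e.g. a transformer) or a transformers model
        _export_component(model, export_dir)
        _write_index(export_dir, kind="component", components=[])


def _export_component(component, target_dir):
    os.makedirs(target_dir, exist_ok=True)
    # Stand-in for writing quantized weights + quantization config.
    with open(os.path.join(target_dir, "config.json"), "w") as f:
        json.dump({"class_name": type(component).__name__}, f)


def _write_index(export_dir, kind, components):
    with open(os.path.join(export_dir, "model_index.json"), "w") as f:
        json.dump({"kind": kind, "components": components}, f)
```

The design point this illustrates is why a *unified* API is possible at all: a pipeline export reduces to exporting each component into its own subdirectory plus an index, so the single-model path is reused per component.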

docs/source/getting_started/1_overview.rst

Lines changed: 1 addition & 1 deletion

@@ -9,7 +9,7 @@ Minimizing inference costs presents a significant challenge as generative AI mod
  The `NVIDIA Model Optimizer <https://github.com/NVIDIA/Model-Optimizer>`_ (referred to as Model Optimizer, or ModelOpt)
  is a library comprising state-of-the-art model optimization techniques including quantization and sparsity to compress model.
  It accepts a torch or ONNX model as input and provides Python APIs for users to easily stack different model optimization
- techniques to produce optimized & quantized checkpoints. Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized checkpoint generated from Model Optimizer is ready for deployment in downstream inference frameworks like `TensorRT-LLM <https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization>`_ or `TensorRT <https://github.com/NVIDIA/TensorRT>`_ (Linux). ModelOpt is integrated with `NVIDIA NeMo <https://github.com/NVIDIA-NeMo/NeMo>`_ and `Megatron-LM <https://github.com/NVIDIA/Megatron-LM>`_ for training-in-the-loop optimization techniques. For enterprise users, the 8-bit quantization with Stable Diffusion is also available on `NVIDIA NIM <https://developer.nvidia.com/blog/nvidia-nim-offers-optimized-inference-microservices-for-deploying-ai-models-at-scale/>`_.
+ techniques to produce optimized & quantized checkpoints. Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized checkpoint generated from Model Optimizer is ready for deployment in downstream inference frameworks like `TensorRT-LLM <https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization>`_ or `TensorRT <https://github.com/NVIDIA/TensorRT>`_ (Linux). The unified Hugging Face export API supports both transformers and diffusers models. ModelOpt is integrated with `NVIDIA NeMo <https://github.com/NVIDIA-NeMo/NeMo>`_ and `Megatron-LM <https://github.com/NVIDIA/Megatron-LM>`_ for training-in-the-loop optimization techniques. For enterprise users, the 8-bit quantization with Stable Diffusion is also available on `NVIDIA NIM <https://developer.nvidia.com/blog/nvidia-nim-offers-optimized-inference-microservices-for-deploying-ai-models-at-scale/>`_.

  For Windows users, the `Model Optimizer for Windows <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/windows>`_ (ModelOpt-Windows) delivers model compression techniques, including quantization, on Windows RTX PC systems. ModelOpt-Windows is optimized for efficient quantization, featuring local GPU calibration, reduced system and video memory consumption, and swift processing times. It integrates seamlessly with the Windows ecosystem, with optimized ONNX models as output for `Microsoft DirectML <https://github.com/microsoft/DirectML>`_ and `TensorRT-RTX <https://github.com/NVIDIA/TensorRT-RTX>`_ backends. Furthermore, ModelOpt-Windows supports SDKs such as `Microsoft Olive <https://github.com/microsoft/Olive>`_ and `ONNX Runtime <https://github.com/microsoft/onnxruntime>`_, enabling the deployment of quantized models across various independent hardware vendors through the DirectML path.

examples/diffusers/README.md

Lines changed: 25 additions & 130 deletions

@@ -89,44 +89,45 @@ We support calibration for INT8, FP8 and FP4 precision and for both weights and
  We also provide instructions on deploying and running E2E diffusion pipelines with Model Optimizer quantized INT8 and FP8 Backbone to generate images and measure latency on target GPUs. Note, Jetson devices are not supported at this time due to the incompatibility of the software.

  > [!NOTE]
- > Model calibration requires relatively more GPU computing power then deployment.It does not need to be on the same GPUs as the deployment target GPUs. Using the command line below will execute both calibration and ONNX export.
+ > Model calibration requires relatively more GPU computing power then deployment. It does not need to be on the same GPUs as the deployment target GPUs. ONNX export and TensorRT engine instructions live in [`quantization/ONNX-TRT-Deployment.md`](./quantization/ONNX-TRT-Deployment.md).

- ### Quantize and export scripts
+ ### Quantize scripts

- #### 8-bit Quantize and ONNX Export [Script](./quantization/build_sdxl_8bit_engine.sh)
-
- You can run the following script to quantize SDXL backbone to INT8 or FP8 and generate an onnx model built with default settings for SDXL. You can then directly head to the [Build the TRT engine for the Quantized ONNX Backbone](#build-the-trt-engine-for-the-quantized-onnx-backbone) section to run E2E pipeline and generate images.
-
- ```sh
- bash build_sdxl_8bit_engine.sh --format {FORMAT} # FORMAT can be int8 or fp8
- ```
-
- If you prefer to customize parameters in calibration or run other models, please follow the instructions below.
-
- #### FLUX-Dev|SD3-Medium|SDXL|SDXL-Turbo INT8 [Script](./quantization/quantize.py)
+ #### FLUX|SD3|SDXL INT8 [Script](./quantization/quantize.py)

  ```sh
  python quantize.py \
-     --model {flux-dev|sdxl-1.0|sdxl-turbo|sd3-medium} \
+     --model {flux-dev|flux-schnell|sdxl-1.0|sdxl-turbo|sd3-medium|sd3.5-medium} \
      --format int8 --batch-size 2 \
      --calib-size 32 --alpha 0.8 --n-steps 20 \
-     --model-dtype {Half/BFloat16} --trt-high-precision-dtype {Half|BFloat16} \
-     --quantized-torch-ckpt-save-path ./{MODEL_NAME}.pt --onnx-dir {ONNX_DIR}
+     --model-dtype {Half/BFloat16} \
+     --quantized-torch-ckpt-save-path ./{MODEL_NAME}.pt \
+     --hf-ckpt-dir ./hf_ckpt
  ```

- #### FLUX-Dev|SDXL|SDXL-Turbo|LTX-Video FP8/FP4 [Script](./quantization/quantize.py)
-
- *In our example code, FP4 is only supported for Flux. However, you can modify our script to enable FP4 format support for your own model.*
+ #### FLUX|SD3|SDXL|LTX|WAN2.2 FP8/FP4 [Script](./quantization/quantize.py)

  ```sh
  python quantize.py \
-     --model {flux-dev|sdxl-1.0|sdxl-turbo|ltx-video-dev} --model-dtype {Half|BFloat16} --trt-high-precision-dtype {Half|BFloat16} \
+     --model {flux-dev|flux-schnell|sdxl-1.0|sdxl-turbo|sd3-medium|sd3.5-medium|ltx-video-dev|wan2.2-t2v-14b|wan2.2-t2v-5b} \
+     --model-dtype {Half|BFloat16} \
      --format {fp8|fp4} --batch-size 2 --calib-size {128|256} --quantize-mha \
      --n-steps 20 --quantized-torch-ckpt-save-path ./{MODEL_NAME}.pt --collect-method default \
-     --onnx-dir {ONNX_DIR}
+     --hf-ckpt-dir ./hf_ckpt
  ```

- We recommend using a device with a minimum of 48GB of combined CPU and GPU memory for exporting ONNX models. If not, please use CPU for onnx export.
+ #### [LTX-2](https://github.com/Lightricks/LTX-2) FP4 (torch checkpoint export)
+
+ ```sh
+ python quantize.py \
+     --model ltx-2 --format fp4 --batch-size 1 --calib-size 32 --n-steps 40 \
+     --extra-param checkpoint_path=./ltx-2-19b-dev-fp8.safetensors \
+     --extra-param distilled_lora_path=./ltx-2-19b-distilled-lora-384.safetensors \
+     --extra-param spatial_upsampler_path=./ltx-2-spatial-upscaler-x2-1.0.safetensors \
+     --extra-param gemma_root=./gemma-3-12b-it-qat-q4_0-unquantized \
+     --extra-param fp8transformer=true \
+     --quantized-torch-ckpt-save-path ./ltx-2-transformer.pt
+ ```

  #### Important Parameters

@@ -135,7 +136,7 @@ We recommend using a device with a minimum of 48GB of combined CPU and GPU memor
  - `calib-size`: For SDXL INT8, we recommend 32 or 64, for SDXL FP8, 128 is recommended.
  - `n_steps`: Recommendation: SD/SDXL 20 or 30, SDXL-Turbo 4.

- **Then, we can load the generated checkpoint and export the INT8/FP8 quantized model in the next step. For FP8, we only support the TRT deployment on Ada/Hopper GPUs.**
+ **You can use the generated checkpoint directly in PyTorch, export a Hugging Face checkpoint (`--hf-ckpt-dir`) to deploy the model on SGLang/vLLM/TRTLLM, or follow the ONNX/TensorRT workflow in [`quantization/ONNX-TRT-Deployment.md`](./quantization/ONNX-TRT-Deployment.md).**

  ## Quantization Aware Training (QAT)

@@ -222,113 +223,7 @@ transformer, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(

  ## Build and Run with TensorRT Compiler Framework

- ### Build the TRT engine for the Quantized ONNX Backbone
-
- > [!IMPORTANT]
- > TensorRT environment must be setup prior -- Please see [Pre-Requisites](#pre-requisites)
- > INT8 requires **TensorRT version >= 9.2.0**. If you prefer to use the FP8 TensorRT, ensure you have **TensorRT version 10.2.0 or higher**. You can download the latest version of TensorRT at [here](https://developer.nvidia.com/tensorrt/download). Deployment of SVDQuant is currently not supported.
-
- Generate INT8/FP8 Backbone Engine
-
- ```bash
- # For SDXL
- trtexec --builderOptimizationLevel=4 --stronglyTyped --onnx=./model.onnx \
-     --minShapes=sample:2x4x128x128,timestep:1,encoder_hidden_states:2x77x2048,text_embeds:2x1280,time_ids:2x6 \
-     --optShapes=sample:16x4x128x128,timestep:1,encoder_hidden_states:16x77x2048,text_embeds:16x1280,time_ids:16x6 \
-     --maxShapes=sample:16x4x128x128,timestep:1,encoder_hidden_states:16x77x2048,text_embeds:16x1280,time_ids:16x6 \
-     --saveEngine=model.plan
-
- # For SD3-Medium
- trtexec --builderOptimizationLevel=4 --stronglyTyped --onnx=./model.onnx \
-     --minShapes=hidden_states:2x16x128x128,timestep:2,encoder_hidden_states:2x333x4096,pooled_projections:2x2048 \
-     --optShapes=hidden_states:16x16x128x128,timestep:16,encoder_hidden_states:16x333x4096,pooled_projections:16x2048 \
-     --maxShapes=hidden_states:16x16x128x128,timestep:16,encoder_hidden_states:16x333x4096,pooled_projections:16x2048 \
-     --saveEngine=model.plan
-
- # For FLUX-Dev FP8
- trtexec --onnx=./model.onnx --fp8 --bf16 --stronglyTyped \
-     --minShapes=hidden_states:1x4096x64,img_ids:4096x3,encoder_hidden_states:1x512x4096,txt_ids:512x3,timestep:1,pooled_projections:1x768,guidance:1 \
-     --optShapes=hidden_states:1x4096x64,img_ids:4096x3,encoder_hidden_states:1x512x4096,txt_ids:512x3,timestep:1,pooled_projections:1x768,guidance:1 \
-     --maxShapes=hidden_states:1x4096x64,img_ids:4096x3,encoder_hidden_states:1x512x4096,txt_ids:512x3,timestep:1,pooled_projections:1x768,guidance:1 \
-     --saveEngine=model.plan
- ```
-
- **Please note that `maxShapes` represents the maximum shape of the given tensor. If you want to use a larger batch size or any other dimensions, feel free to adjust the value accordingly.**
-
- ### Run End-to-end Stable Diffusion Pipeline with Model Optimizer Quantized ONNX Model and demoDiffusion
-
- #### demoDiffusion
-
- If you want to run end-to-end SD/SDXL pipeline with Model Optimizer quantized UNet to generate images and measure latency on target GPUs, here are the steps:
-
- - Clone a copy of [demo/Diffusion repo](https://github.com/NVIDIA/TensorRT/tree/release/10.2/demo/Diffusion).
-
- - Following the README from demoDiffusion to set up the pipeline, and run a baseline txt2img example (fp16):
-
- ```sh
- # SDXL
- python demo_txt2img_xl.py "enchanted winter forest, soft diffuse light on a snow-filled day, serene nature scene, the forest is illuminated by the snow" --negative-prompt "normal quality, low quality, worst quality, low res, blurry, nsfw, nude" --version xl-1.0 --scheduler Euler --denoising-steps 30 --seed 2946901
- # Please refer to the examples provided in the demoDiffusion SD/SDXL pipeline.
- ```
-
- Note, it will take some time to build TRT engines for the first time
-
- - Replace the fp16 backbone TRT engine with int8 engine generated in [Build the TRT engine for the Quantized ONNX Backbone](#build-the-trt-engine-for-the-quantized-onnx-backbone), e.g.,:
-
- ```sh
- cp -r {YOUR_UNETXL}.plan ./engine/
- ```
-
- Note, the engines must be built on the same GPU, and ensure that the INT8 engine name matches the names of the FP16 engines to enable compatibility with the demoDiffusion pipeline.
-
- - Run the above txt2img example command again. You can compare the generated images and latency for fp16 vs int8.
-   Similarly, you could run end-to-end pipeline with Model Optimizer quantized backbone and corresponding examples in demoDiffusion with other diffusion models.
-
- ### Running the inference pipeline with DeviceModel
-
- DeviceModel is an interface designed to run TensorRT engines like torch models. It takes torch inputs and returns torch outputs. Under the hood, DeviceModel exports a torch checkpoint to ONNX and then generates a TensorRT engine from it. This allows you to swap the backbone of the diffusion pipeline with DeviceModel and execute the pipeline for your desired prompt.
-
- Generate a quantized torch checkpoint using the [Script](./quantization/quantize.py) shown below:
-
- ```bash
- python quantize.py \
-     --model {sdxl-1.0|sdxl-turbo|sd3-medium|flux-dev} \
-     --format fp8 \
-     --batch-size {1|2} \
-     --calib-size 128 \
-     --n-steps 20 \
-     --quantized-torch-ckpt-save-path ./{MODEL}_fp8.pt \
-     --collect-method default
- ```
-
- Generate images for the quantized checkpoint with the following [Script](./quantization/diffusion_trt.py):
-
- ```bash
- python diffusion_trt.py \
-     --model {sdxl-1.0|sdxl-turbo|sd3-medium|flux-dev} \
-     --prompt "A cat holding a sign that says hello world" \
-     [--override-model-path /path/to/model] \
-     [--restore-from ./{MODEL}_fp8.pt] \
-     [--onnx-load-path {ONNX_DIR}] \
-     [--trt-engine-load-path {ENGINE_DIR}] \
-     [--dq-only] \
-     [--torch] \
-     [--save-image-as /path/to/image] \
-     [--benchmark] \
-     [--torch-compile] \
-     [--skip-image]
- ```
-
- This script will save the output image as `./{MODEL}.png` and report the latency of the TensorRT backbone.
- To generate the image with FP16|BF16 precision, you can run the command shown above without the `--restore-from` argument.
-
- While loading a TensorRT engine using the --trt-engine-load-path argument, it is recommended to load only engines generated using this pipeline.
-
- #### Demo Images
-
- | SDXL FP16 | SDXL INT8 |
- |:---------:|:---------:|
- | ![FP16](./quantization/assets/xl_base-fp16.png) | ![INT8](./quantization/assets/xl_base-int8.png) |
+ ONNX export and TensorRT engine instructions are documented in [`quantization/ONNX-TRT-Deployment.md`](./quantization/ONNX-TRT-Deployment.md).

  ### LoRA
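The README changes above keep the quantized torch-checkpoint flow as the main path, and the PR's `ExportManager` changes add a restore side for the per-backbone files. A small sketch of what that restore half could look like — names are hypothetical stand-ins, and `pickle` stands in for `torch.load` so the sketch runs anywhere:

```python
import os
import pickle


def restore_checkpoint(pipe, ckpt_path, backbone_names):
    """Restore backbones saved as per-backbone {backbone_name}.pt files.

    A directory holds one file per backbone (e.g. transformer.pt and
    transformer_2.pt for a wan2.2-t2v-14b run); a single .pt file
    restores the one requested backbone.
    """
    if os.path.isdir(ckpt_path):
        for name in backbone_names:
            with open(os.path.join(ckpt_path, f"{name}.pt"), "rb") as f:
                setattr(pipe, name, pickle.load(f))
    else:
        (name,) = backbone_names  # single-backbone checkpoint file
        with open(ckpt_path, "rb") as f:
            setattr(pipe, name, pickle.load(f))
    return pipe
```

Keying the files by backbone name is what lets the same checkpoint directory round-trip unambiguously between save and restore for multi-backbone pipelines.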
334229

0 commit comments