2 changes: 1 addition & 1 deletion CHANGELOG.rst
@@ -20,7 +20,7 @@ NVIDIA Model Optimizer Changelog
- Add ``nvfp4_omlp_only`` quantization format for NVFP4 quantization. This is similar to ``nvfp4_mlp_only`` but also quantizes the output projection layer in attention.
- ``pass_through_bwd`` in the quantization config now defaults to ``True``. Set it to ``False`` if you want to use STE with zeroed outlier gradients for potentially better QAT accuracy.
- Add :meth:`compute_quantization_mse <modelopt.torch.quantization.model_quant.compute_quantization_mse>` API to measure per-quantizer mean-squared quantization error, with flexible wildcard and callable filtering.
- **AutoQDQ**: New tool for automated Q/DQ (Quantize/Dequantize) placement optimization for ONNX models. Uses TensorRT latency measurements to choose insertion schemes that minimize inference time. Discovers regions automatically, groups them by structural pattern, and tests multiple Q/DQ schemes per pattern. Supports INT8 and FP8 quantization, pattern cache for warm-start on similar models, checkpoint/resume, and importing patterns from an existing QDQ baseline. CLI: ``python -m modelopt.onnx.quantization.autotune``. See the AutoQDQ guide in the documentation.
- **Autotune**: New tool for automated Q/DQ (Quantize/Dequantize) placement optimization for ONNX models. Uses TensorRT latency measurements to choose insertion schemes that minimize inference time. Discovers regions automatically, groups them by structural pattern, and tests multiple Q/DQ schemes per pattern. Supports INT8 and FP8 quantization, pattern cache for warm-start on similar models, checkpoint/resume, and importing patterns from an existing QDQ baseline. CLI: ``python -m modelopt.onnx.quantization.autotune``. See the Autotune guide in the documentation.
- Add ``get_auto_quantize_config`` API to extract a flat quantization config from ``auto_quantize`` search results, enabling re-quantization at different effective bit targets without re-running calibration.
- Improve ``auto_quantize`` checkpoint/resume: calibration state is now saved and restored across runs, avoiding redundant calibration when resuming a search.
- Add support for Nemotron-3 (NemotronHForCausalLM) model quantization and for NemotronH MoE experts in ``auto_quantize`` grouping and scoring rules.
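To make the ``compute_quantization_mse`` entry above concrete, here is a minimal, self-contained sketch of what a per-quantizer mean-squared quantization error measures. This is not the modelopt API itself; ``fake_quant_int8`` and ``quantization_mse`` are illustrative helpers, assuming symmetric int8 quantization with an ``amax`` scale.

```python
# Hedged sketch: illustrates the quantity "per-quantizer MSE", not the
# modelopt.torch.quantization.model_quant.compute_quantization_mse API.

def fake_quant_int8(values, amax):
    """Quantize-dequantize a list of floats with a symmetric int8 scale."""
    scale = amax / 127.0
    out = []
    for v in values:
        q = round(v / scale)
        q = max(-128, min(127, q))  # clamp to the int8 range
        out.append(q * scale)      # dequantize back to float
    return out

def quantization_mse(values, amax):
    """Mean-squared error between original and quantize-dequantized values."""
    deq = fake_quant_int8(values, amax)
    return sum((v - d) ** 2 for v, d in zip(values, deq)) / len(values)

weights = [0.5, -1.2, 0.03, 2.0, -0.7]
mse = quantization_mse(weights, amax=max(abs(v) for v in weights))
print(f"per-quantizer MSE: {mse:.6e}")
```

A too-small ``amax`` clips large values and inflates the MSE, which is exactly the kind of per-quantizer signal such an API can surface.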
@@ -1,5 +1,5 @@
===============================================
Automated Q/DQ Placement Optimization (ONNX)
Autotune (ONNX)
===============================================

.. contents:: Table of Contents
@@ -9,7 +9,7 @@ Automated Q/DQ Placement Optimization (ONNX)
Overview
========

The ``modelopt.onnx.quantization.autotune`` module automates Q/DQ (Quantize/Dequantize) placement in ONNX models. It explores placement strategies and uses TensorRT latency measurements to choose a configuration that minimizes inference time.
The ``modelopt.onnx.quantization.autotune`` module automates Q/DQ (Quantize/Dequantize) placement optimization in ONNX models. It explores placement strategies and uses TensorRT latency measurements to choose a configuration that minimizes inference time.
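The selection loop the paragraph above describes can be sketched in a few lines. This is a conceptual illustration only: ``measure_latency`` stands in for a real TensorRT benchmark run, and the scheme names are made up, not the module's actual identifiers.

```python
# Hedged sketch of latency-driven scheme selection: measure each candidate
# Q/DQ insertion scheme and keep the fastest. measure_latency is a stand-in
# for a real TensorRT timing run (an assumption, not the modelopt code).

def pick_fastest_scheme(schemes, measure_latency):
    """Return (scheme, latency_ms) with the lowest measured latency."""
    best_scheme, best_latency = None, float("inf")
    for scheme in schemes:
        latency = measure_latency(scheme)
        if latency < best_latency:
            best_scheme, best_latency = scheme, latency
    return best_scheme, best_latency

# Toy stand-in latencies instead of real TensorRT measurements.
toy_latencies = {"no_qdq": 9.0, "qdq_inputs_only": 7.5, "qdq_full": 6.8}
scheme, ms = pick_fastest_scheme(toy_latencies, toy_latencies.__getitem__)
print(scheme, ms)  # qdq_full 6.8
```

The real tool layers region discovery, pattern grouping, and caching on top of this basic measure-and-pick loop.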

**Key Features:**

2 changes: 1 addition & 1 deletion examples/cnn_qat/README.md
@@ -143,4 +143,4 @@ Your actual results will vary based on the dataset, specific hyperparameters, an

## Deployment with TensorRT

The final model after QAT, saved using `mto.save()`, contains both the model weights and the quantization metadata. This model can be deployed to TensorRT for inference after ONNX export. The process is generally similar to [deploying a ONNX PTQ](../onnx_ptq/README.md#evaluate-the-quantized-onnx-model) model from ModelOpt.
The final model after QAT, saved using `mto.save()`, contains both the model weights and the quantization metadata. This model can be deployed to TensorRT for inference after ONNX export. The process is generally similar to [deploying an ONNX PTQ](../onnx_ptq/README.md#evaluate-the-quantized-onnx-model) model from ModelOpt.
19 changes: 19 additions & 0 deletions examples/onnx_ptq/README.md
@@ -219,6 +219,25 @@ trtexec --onnx=/path/to/identity_neural_network.quant.onnx \
--staticPlugins=/path/to/libidentity_conv_iplugin_v2_io_ext.so
```

### Optimize Q/DQ node placement with Autotune

This feature automates Q/DQ (Quantize/Dequantize) node placement optimization for ONNX models using TensorRT performance measurements.
For more information on the standalone toolkit, please refer to [autotune](./autotune).

To enable this feature in the ONNX quantization workflow, add the `--autotune` flag to your CLI invocation:

```bash
python -m modelopt.onnx.quantization \
--onnx_path=vit_base_patch16_224.onnx \
--quantize_mode=<fp8|int8|int4> \
--calibration_data=calib.npy \
--calibration_method=<max|entropy|awq_clip|rtn_dq> \
--output_path=vit_base_patch16_224.quant.onnx \
    --autotune=<quick|default|extensive>
```

For finer-grained control over Autotune flags, refer to the [API guide](https://nvidia.github.io/Model-Optimizer/guides/_onnx_quantization.html).

## Resources

- 📅 [Roadmap](https://github.com/NVIDIA/Model-Optimizer/issues/146)