Commit 69c0d47

[OMNIML-3252][ONNX] MOQ + Autotune moq integration docs (#1026)
### What does this PR do?

**Type of change**: documentation

**Overview**: This PR updates the documentation and does some folder re-structuring and file re-naming related to #951.

### Usage Documentation

### Testing Documentation

### Before your PR is "*Ready for review*"

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: ✅
- Did you write any new necessary tests?: N/A
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅ (renamed `AutoQDQ` to `Autotune`)

## Summary by CodeRabbit

- **Documentation**
  - Renamed AutoQDQ to Autotune across guides and changelog.
  - Updated Autotune guide descriptions and wording.
  - Added a new section on optimizing Q/DQ node placement with Autotune, including CLI usage and API links (appears twice in one README).
  - Applied minor grammar and capitalization corrections.

---

Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
1 parent 72a5b3d commit 69c0d47

18 files changed (+23 −4 lines)

CHANGELOG.rst

Lines changed: 1 addition & 1 deletion
@@ -20,7 +20,7 @@ NVIDIA Model Optimizer Changelog
  - Add ``nvfp4_omlp_only`` quantization format for NVFP4 quantization. This is similar to ``nvfp4_mlp_only`` but also quantizes the output projection layer in attention.
  - ``pass_through_bwd`` in the quantization config is now default to True. Please set it to False if you want to use STE with zeroed outlier gradients for potentially better QAT accuracy.
  - Add :meth:`compute_quantization_mse <modelopt.torch.quantization.model_quant.compute_quantization_mse>` API to measure per-quantizer mean-squared quantization error, with flexible wildcard and callable filtering.
- - **AutoQDQ**: New tool for automated Q/DQ (Quantize/Dequantize) placement optimization for ONNX models. Uses TensorRT latency measurements to choose insertion schemes that minimize inference time. Discovers regions automatically, groups them by structural pattern, and tests multiple Q/DQ schemes per pattern. Supports INT8 and FP8 quantization, pattern cache for warm-start on similar models, checkpoint/resume, and importing patterns from an existing QDQ baseline. CLI: ``python -m modelopt.onnx.quantization.autotune``. See the AutoQDQ guide in the documentation.
+ - **Autotune**: New tool for automated Q/DQ (Quantize/Dequantize) placement optimization for ONNX models. Uses TensorRT latency measurements to choose insertion schemes that minimize inference time. Discovers regions automatically, groups them by structural pattern, and tests multiple Q/DQ schemes per pattern. Supports INT8 and FP8 quantization, pattern cache for warm-start on similar models, checkpoint/resume, and importing patterns from an existing QDQ baseline. CLI: ``python -m modelopt.onnx.quantization.autotune``. See the Autotune guide in the documentation.
  - Add ``get_auto_quantize_config`` API to extract a flat quantization config from ``auto_quantize`` search results, enabling re-quantization at different effective bit targets without re-running calibration.
  - Improve ``auto_quantize`` checkpoint/resume: calibration state is now saved and restored across runs, avoiding redundant calibration when resuming a search.
  - Add support for Nemotron-3 (NemotronHForCausalLM) model quantization and support for NemotronH MoE expert support in ``auto_quantize`` grouping and scoring rules.
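The changelog entry above repeatedly refers to Q/DQ (Quantize/Dequantize) node pairs. Their per-element semantics are simple, and a minimal pure-Python sketch may help; this mirrors the math of the ONNX `QuantizeLinear`/`DequantizeLinear` operators for INT8 and is illustrative only, not ModelOpt or TensorRT code:

```python
# Illustrative INT8 Q/DQ (Quantize/Dequantize) pair semantics.
# Sketch only: mirrors ONNX QuantizeLinear/DequantizeLinear math,
# not the ModelOpt implementation.

def quantize(x, scale, zero_point=0):
    """QuantizeLinear: q = clip(round(x / scale) + zero_point, -128, 127)."""
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))

def dequantize(q, scale, zero_point=0):
    """DequantizeLinear: x ~= (q - zero_point) * scale."""
    return (q - zero_point) * scale

# A Q/DQ pair inserted into a graph computes dequantize(quantize(x)):
# the value is snapped to the INT8 grid, introducing a bounded
# rounding error of at most scale / 2 (when not clipped).
scale = 0.1
for x in (0.25, -1.04, 3.14):
    x_hat = dequantize(quantize(x, scale), scale)
    assert abs(x - x_hat) <= scale / 2 + 1e-9
```

Where a backend such as TensorRT fuses the surrounding ops into a quantized kernel, the placement of these pairs determines which regions run in INT8/FP8, which is what the tool searches over.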
Lines changed: 2 additions & 2 deletions
@@ -1,5 +1,5 @@
  ===============================================
- Automated Q/DQ Placement Optimization (ONNX)
+ Autotune (ONNX)
  ===============================================

  .. contents:: Table of Contents

@@ -9,7 +9,7 @@ Automated Q/DQ Placement Optimization (ONNX)
  Overview
  ========

- The ``modelopt.onnx.quantization.autotune`` module automates Q/DQ (Quantize/Dequantize) placement in ONNX models. It explores placement strategies and uses TensorRT latency measurements to choose a configuration that minimizes inference time.
+ The ``modelopt.onnx.quantization.autotune`` module automates Q/DQ (Quantize/Dequantize) placement optimization in ONNX models. It explores placement strategies and uses TensorRT latency measurements to choose a configuration that minimizes inference time.

  **Key Features:**

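The guide text above describes the core loop: discover regions, group them by structural pattern, time several candidate Q/DQ schemes per pattern, and keep the fastest. A toy sketch of that selection step, with a table of made-up latencies standing in for real TensorRT measurements (the pattern and scheme names are hypothetical):

```python
# Toy sketch of latency-driven Q/DQ scheme selection. Regions are grouped
# by structural pattern; several candidate schemes are timed per pattern;
# the lowest-latency scheme wins. All names and numbers below are
# invented for illustration -- the real tool measures TensorRT engines.

CANDIDATE_SCHEMES = ["no_qdq", "qdq_inputs", "qdq_inputs_and_weights"]

# Hypothetical measurements in milliseconds, keyed by (pattern, scheme).
FAKE_LATENCY_MS = {
    ("conv_bn_relu", "no_qdq"): 1.80,
    ("conv_bn_relu", "qdq_inputs"): 1.25,
    ("conv_bn_relu", "qdq_inputs_and_weights"): 1.10,
    ("matmul_add", "no_qdq"): 0.90,
    ("matmul_add", "qdq_inputs"): 1.05,
    ("matmul_add", "qdq_inputs_and_weights"): 0.95,
}

def best_scheme(pattern):
    """Pick the candidate scheme with the lowest measured latency."""
    return min(CANDIDATE_SCHEMES, key=lambda s: FAKE_LATENCY_MS[(pattern, s)])

choices = {p: best_scheme(p) for p in ("conv_bn_relu", "matmul_add")}
# In this fake data, conv_bn_relu benefits from quantization while
# matmul_add is fastest left unquantized -- the per-pattern decision
# is exactly what makes the search worthwhile.
```

Grouping by pattern is the key cost saver: one timing decision is reused for every region that shares the same structure, instead of timing each region independently.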
examples/cnn_qat/README.md

Lines changed: 1 addition & 1 deletion
@@ -143,4 +143,4 @@ Your actual results will vary based on the dataset, specific hyperparameters, an

  ## Deployment with TensorRT

- The final model after QAT, saved using `mto.save()`, contains both the model weights and the quantization metadata. This model can be deployed to TensorRT for inference after ONNX export. The process is generally similar to [deploying a ONNX PTQ](../onnx_ptq/README.md#evaluate-the-quantized-onnx-model) model from ModelOpt.
+ The final model after QAT, saved using `mto.save()`, contains both the model weights and the quantization metadata. This model can be deployed to TensorRT for inference after ONNX export. The process is generally similar to [deploying an ONNX PTQ](../onnx_ptq/README.md#evaluate-the-quantized-onnx-model) model from ModelOpt.

examples/onnx_ptq/README.md

Lines changed: 19 additions & 0 deletions
@@ -219,6 +219,25 @@ trtexec --onnx=/path/to/identity_neural_network.quant.onnx \
      --staticPlugins=/path/to/libidentity_conv_iplugin_v2_io_ext.so
  ```

+ ### Optimize Q/DQ node placement with Autotune
+
+ This feature automates Q/DQ (Quantize/Dequantize) node placement optimization for ONNX models using TensorRT performance measurements.
+ For more information on the standalone toolkit, please refer to [autotune](./autotune).
+
+ To access this feature in the ONNX quantization workflow, simply add `--autotune` in your CLI:
+
+ ```bash
+ python -m modelopt.onnx.quantization \
+     --onnx_path=vit_base_patch16_224.onnx \
+     --quantize_mode=<fp8|int8|int4> \
+     --calibration_data=calib.npy \
+     --calibration_method=<max|entropy|awq_clip|rtn_dq> \
+     --output_path=vit_base_patch16_224.quant.onnx \
+     --autotune=<quick,default,extensive>
+ ```
+
+ For more fine-tuned Autotune flags, please refer to the [API guide](https://nvidia.github.io/Model-Optimizer/guides/_onnx_quantization.html).
+
  ## Resources

  - 📅 [Roadmap](https://github.com/NVIDIA/Model-Optimizer/issues/146)
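The README diff above exposes an effort knob, `--autotune=<quick,default,extensive>`, trading search time for coverage. As a hedged sketch of how such a choice flag is typically wired up, here is a hypothetical version in plain `argparse`; the preset names come from the CLI above, but the budget numbers and all wiring are assumptions, not ModelOpt's actual implementation:

```python
import argparse

# Hypothetical effort presets for an --autotune flag. The three names
# mirror the CLI shown in the README; the budget values are invented
# purely for illustration.
EFFORT_PRESETS = {
    "quick": {"schemes_per_pattern": 2, "timing_runs": 1},
    "default": {"schemes_per_pattern": 4, "timing_runs": 3},
    "extensive": {"schemes_per_pattern": 8, "timing_runs": 5},
}

def parse_args(argv):
    """Parse an --autotune effort level, rejecting unknown presets."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--autotune", choices=EFFORT_PRESETS, default="default")
    return parser.parse_args(argv)

args = parse_args(["--autotune=extensive"])
budget = EFFORT_PRESETS[args.autotune]
# With "extensive", more schemes are tried per pattern and each engine
# is timed more often, at the cost of a longer search.
```

The general pattern, whatever the real flag values do internally, is that a higher effort level widens the candidate set and repeats latency measurements to reduce timing noise.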
5 files renamed without changes.

tests/gpu/onnx/test_quantize_onnx_torch_int4_awq.py renamed to tests/gpu/onnx/quantization/test_quantize_onnx_torch_int4_awq.py

File renamed without changes.
