2 changes: 1 addition & 1 deletion CHANGELOG.rst
@@ -20,7 +20,7 @@ NVIDIA Model Optimizer Changelog
- Add ``nvfp4_omlp_only`` quantization format for NVFP4 quantization. This is similar to ``nvfp4_mlp_only`` but also quantizes the output projection layer in attention.
- ``pass_through_bwd`` in the quantization config now defaults to ``True``. Set it to ``False`` if you want to use STE with zeroed outlier gradients for potentially better QAT accuracy.
- Add :meth:`compute_quantization_mse <modelopt.torch.quantization.model_quant.compute_quantization_mse>` API to measure per-quantizer mean-squared quantization error, with flexible wildcard and callable filtering.
- **AutoQDQ**: New tool for automated Q/DQ (Quantize/Dequantize) placement optimization for ONNX models. Uses TensorRT latency measurements to choose insertion schemes that minimize inference time. Discovers regions automatically, groups them by structural pattern, and tests multiple Q/DQ schemes per pattern. Supports INT8 and FP8 quantization, pattern cache for warm-start on similar models, checkpoint/resume, and importing patterns from an existing QDQ baseline. CLI: ``python -m modelopt.onnx.quantization.autotune``. See the AutoQDQ guide in the documentation.
- **Autotune**: New tool for automated Q/DQ (Quantize/Dequantize) placement optimization for ONNX models. Uses TensorRT latency measurements to choose insertion schemes that minimize inference time. Discovers regions automatically, groups them by structural pattern, and tests multiple Q/DQ schemes per pattern. Supports INT8 and FP8 quantization, pattern cache for warm-start on similar models, checkpoint/resume, and importing patterns from an existing QDQ baseline. CLI: ``python -m modelopt.onnx.quantization.autotune``. See the Autotune guide in the documentation.
- Add ``get_auto_quantize_config`` API to extract a flat quantization config from ``auto_quantize`` search results, enabling re-quantization at different effective bit targets without re-running calibration.
- Improve ``auto_quantize`` checkpoint/resume: calibration state is now saved and restored across runs, avoiding redundant calibration when resuming a search.
- Add support for Nemotron-3 (NemotronHForCausalLM) model quantization and for NemotronH MoE experts in ``auto_quantize`` grouping and scoring rules.
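To make the ``compute_quantization_mse`` entry above concrete, here is a minimal, self-contained sketch of what a per-quantizer mean-squared quantization error measures. This is not the modelopt API itself; ``fake_quant_int8`` and ``quantization_mse`` are illustrative helpers, assuming symmetric int8 quantization with an ``amax`` scale.

```python
# Hedged sketch: illustrates the quantity "per-quantizer MSE", not the
# modelopt.torch.quantization.model_quant.compute_quantization_mse API.

def fake_quant_int8(values, amax):
    """Quantize-dequantize a list of floats with a symmetric int8 scale."""
    scale = amax / 127.0
    out = []
    for v in values:
        q = round(v / scale)
        q = max(-128, min(127, q))  # clamp to the int8 range
        out.append(q * scale)      # dequantize back to float
    return out

def quantization_mse(values, amax):
    """Mean-squared error between original and quantize-dequantized values."""
    deq = fake_quant_int8(values, amax)
    return sum((v - d) ** 2 for v, d in zip(values, deq)) / len(values)

weights = [0.5, -1.2, 0.03, 2.0, -0.7]
mse = quantization_mse(weights, amax=max(abs(v) for v in weights))
print(f"per-quantizer MSE: {mse:.6e}")
```

A too-small ``amax`` clips large values and inflates the MSE, which is exactly the kind of per-quantizer signal such an API can surface.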
@@ -1,5 +1,5 @@
===============================================
Automated Q/DQ Placement Optimization (ONNX)
Autotune (ONNX)
===============================================

.. contents:: Table of Contents
@@ -9,7 +9,7 @@ Automated Q/DQ Placement Optimization (ONNX)
Overview
========

The ``modelopt.onnx.quantization.autotune`` module automates Q/DQ (Quantize/Dequantize) placement in ONNX models. It explores placement strategies and uses TensorRT latency measurements to choose a configuration that minimizes inference time.
The ``modelopt.onnx.quantization.autotune`` module automates Q/DQ (Quantize/Dequantize) placement optimization in ONNX models. It explores placement strategies and uses TensorRT latency measurements to choose a configuration that minimizes inference time.
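The selection loop the paragraph above describes can be sketched in a few lines. This is a conceptual illustration only: ``measure_latency`` stands in for a real TensorRT benchmark run, and the scheme names are made up, not the module's actual identifiers.

```python
# Hedged sketch of latency-driven scheme selection: measure each candidate
# Q/DQ insertion scheme and keep the fastest. measure_latency is a stand-in
# for a real TensorRT timing run (an assumption, not the modelopt code).

def pick_fastest_scheme(schemes, measure_latency):
    """Return (scheme, latency_ms) with the lowest measured latency."""
    best_scheme, best_latency = None, float("inf")
    for scheme in schemes:
        latency = measure_latency(scheme)
        if latency < best_latency:
            best_scheme, best_latency = scheme, latency
    return best_scheme, best_latency

# Toy stand-in latencies instead of real TensorRT measurements.
toy_latencies = {"no_qdq": 9.0, "qdq_inputs_only": 7.5, "qdq_full": 6.8}
scheme, ms = pick_fastest_scheme(toy_latencies, toy_latencies.__getitem__)
print(scheme, ms)  # qdq_full 6.8
```

The real tool layers region discovery, pattern grouping, and caching on top of this basic measure-and-pick loop.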

**Key Features:**

2 changes: 1 addition & 1 deletion examples/cnn_qat/README.md
@@ -143,4 +143,4 @@ Your actual results will vary based on the dataset, specific hyperparameters, an

## Deployment with TensorRT

The final model after QAT, saved using `mto.save()`, contains both the model weights and the quantization metadata. This model can be deployed to TensorRT for inference after ONNX export. The process is generally similar to [deploying a ONNX PTQ](../onnx_ptq/README.md#evaluate-the-quantized-onnx-model) model from ModelOpt.
The final model after QAT, saved using `mto.save()`, contains both the model weights and the quantization metadata. This model can be deployed to TensorRT for inference after ONNX export. The process is generally similar to [deploying an ONNX PTQ](../onnx_ptq/README.md#evaluate-the-quantized-onnx-model) model from ModelOpt.
19 changes: 19 additions & 0 deletions examples/onnx_ptq/README.md
@@ -219,6 +219,25 @@ trtexec --onnx=/path/to/identity_neural_network.quant.onnx \
--staticPlugins=/path/to/libidentity_conv_iplugin_v2_io_ext.so
```

### Optimize Q/DQ node placement with Autotune

This feature automates Q/DQ (Quantize/Dequantize) node placement optimization for ONNX models using TensorRT performance measurements.
For more information on the standalone toolkit, please refer to [autotune](./autotune).

To enable this feature in the ONNX quantization workflow, add the `--autotune` flag to your CLI invocation:

```bash
python -m modelopt.onnx.quantization \
--onnx_path=vit_base_patch16_224.onnx \
--quantize_mode=<fp8|int8|int4> \
--calibration_data=calib.npy \
--calibration_method=<max|entropy|awq_clip|rtn_dq> \
--output_path=vit_base_patch16_224.quant.onnx \
    --autotune=<quick|default|extensive>
```

For finer-grained control over Autotune flags, refer to the [API guide](https://nvidia.github.io/Model-Optimizer/guides/_onnx_quantization.html).

## Resources

- 📅 [Roadmap](https://github.com/NVIDIA/Model-Optimizer/issues/146)