Commit 69c0d47

[OMNIML-3252][ONNX] MOQ + Autotune moq integration docs (#1026)
### What does this PR do?

**Type of change**: documentation

**Overview**: This PR updates the documentation and does some folder re-structuring and file re-naming related to #951.

### Usage Documentation

### Testing Documentation

### Before your PR is "*Ready for review*"

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: ✅
- Did you write any new necessary tests?: N/A
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅ (renamed `AutoQDQ` to `Autotune`)

## Summary by CodeRabbit

- **Documentation**
  - Renamed AutoQDQ to Autotune across guides and changelog.
  - Updated Autotune guide descriptions and wording.
  - Added a new section on optimizing Q/DQ node placement with Autotune, including CLI usage and API links (appears twice in one README).
  - Applied minor grammar and capitalization corrections.

---

Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
1 parent 72a5b3d commit 69c0d47

18 files changed (+23 −4 lines)

CHANGELOG.rst

Lines changed: 1 addition & 1 deletion
@@ -20,7 +20,7 @@ NVIDIA Model Optimizer Changelog
  - Add ``nvfp4_omlp_only`` quantization format for NVFP4 quantization. This is similar to ``nvfp4_mlp_only`` but also quantizes the output projection layer in attention.
  - ``pass_through_bwd`` in the quantization config is now default to True. Please set it to False if you want to use STE with zeroed outlier gradients for potentially better QAT accuracy.
  - Add :meth:`compute_quantization_mse <modelopt.torch.quantization.model_quant.compute_quantization_mse>` API to measure per-quantizer mean-squared quantization error, with flexible wildcard and callable filtering.
- - **AutoQDQ**: New tool for automated Q/DQ (Quantize/Dequantize) placement optimization for ONNX models. Uses TensorRT latency measurements to choose insertion schemes that minimize inference time. Discovers regions automatically, groups them by structural pattern, and tests multiple Q/DQ schemes per pattern. Supports INT8 and FP8 quantization, pattern cache for warm-start on similar models, checkpoint/resume, and importing patterns from an existing QDQ baseline. CLI: ``python -m modelopt.onnx.quantization.autotune``. See the AutoQDQ guide in the documentation.
+ - **Autotune**: New tool for automated Q/DQ (Quantize/Dequantize) placement optimization for ONNX models. Uses TensorRT latency measurements to choose insertion schemes that minimize inference time. Discovers regions automatically, groups them by structural pattern, and tests multiple Q/DQ schemes per pattern. Supports INT8 and FP8 quantization, pattern cache for warm-start on similar models, checkpoint/resume, and importing patterns from an existing QDQ baseline. CLI: ``python -m modelopt.onnx.quantization.autotune``. See the Autotune guide in the documentation.
  - Add ``get_auto_quantize_config`` API to extract a flat quantization config from ``auto_quantize`` search results, enabling re-quantization at different effective bit targets without re-running calibration.
  - Improve ``auto_quantize`` checkpoint/resume: calibration state is now saved and restored across runs, avoiding redundant calibration when resuming a search.
  - Add support for Nemotron-3 (NemotronHForCausalLM) model quantization and support for NemotronH MoE expert support in ``auto_quantize`` grouping and scoring rules.
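The changelog entry above repeatedly refers to Q/DQ (Quantize/Dequantize) node pairs. Their per-element semantics are simple, and a minimal pure-Python sketch may help; this mirrors the math of the ONNX `QuantizeLinear`/`DequantizeLinear` operators for INT8 and is illustrative only, not ModelOpt or TensorRT code:

```python
# Illustrative INT8 Q/DQ (Quantize/Dequantize) pair semantics.
# Sketch only: mirrors ONNX QuantizeLinear/DequantizeLinear math,
# not the ModelOpt implementation.

def quantize(x, scale, zero_point=0):
    """QuantizeLinear: q = clip(round(x / scale) + zero_point, -128, 127)."""
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))

def dequantize(q, scale, zero_point=0):
    """DequantizeLinear: x ~= (q - zero_point) * scale."""
    return (q - zero_point) * scale

# A Q/DQ pair inserted into a graph computes dequantize(quantize(x)):
# the value is snapped to the INT8 grid, introducing a bounded
# rounding error of at most scale / 2 (when not clipped).
scale = 0.1
for x in (0.25, -1.04, 3.14):
    x_hat = dequantize(quantize(x, scale), scale)
    assert abs(x - x_hat) <= scale / 2 + 1e-9
```

Where a backend such as TensorRT fuses the surrounding ops into a quantized kernel, the placement of these pairs determines which regions run in INT8/FP8, which is what the tool searches over.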
Lines changed: 2 additions & 2 deletions
@@ -1,5 +1,5 @@
  ===============================================
- Automated Q/DQ Placement Optimization (ONNX)
+ Autotune (ONNX)
  ===============================================

  .. contents:: Table of Contents

@@ -9,7 +9,7 @@ Automated Q/DQ Placement Optimization (ONNX)
  Overview
  ========

- The ``modelopt.onnx.quantization.autotune`` module automates Q/DQ (Quantize/Dequantize) placement in ONNX models. It explores placement strategies and uses TensorRT latency measurements to choose a configuration that minimizes inference time.
+ The ``modelopt.onnx.quantization.autotune`` module automates Q/DQ (Quantize/Dequantize) placement optimization in ONNX models. It explores placement strategies and uses TensorRT latency measurements to choose a configuration that minimizes inference time.

  **Key Features:**

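The guide text above describes the core loop: discover regions, group them by structural pattern, time several candidate Q/DQ schemes per pattern, and keep the fastest. A toy sketch of that selection step, with a table of made-up latencies standing in for real TensorRT measurements (the pattern and scheme names are hypothetical):

```python
# Toy sketch of latency-driven Q/DQ scheme selection. Regions are grouped
# by structural pattern; several candidate schemes are timed per pattern;
# the lowest-latency scheme wins. All names and numbers below are
# invented for illustration -- the real tool measures TensorRT engines.

CANDIDATE_SCHEMES = ["no_qdq", "qdq_inputs", "qdq_inputs_and_weights"]

# Hypothetical measurements in milliseconds, keyed by (pattern, scheme).
FAKE_LATENCY_MS = {
    ("conv_bn_relu", "no_qdq"): 1.80,
    ("conv_bn_relu", "qdq_inputs"): 1.25,
    ("conv_bn_relu", "qdq_inputs_and_weights"): 1.10,
    ("matmul_add", "no_qdq"): 0.90,
    ("matmul_add", "qdq_inputs"): 1.05,
    ("matmul_add", "qdq_inputs_and_weights"): 0.95,
}

def best_scheme(pattern):
    """Pick the candidate scheme with the lowest measured latency."""
    return min(CANDIDATE_SCHEMES, key=lambda s: FAKE_LATENCY_MS[(pattern, s)])

choices = {p: best_scheme(p) for p in ("conv_bn_relu", "matmul_add")}
# In this fake data, conv_bn_relu benefits from quantization while
# matmul_add is fastest left unquantized -- the per-pattern decision
# is exactly what makes the search worthwhile.
```

Grouping by pattern is the key cost saver: one timing decision is reused for every region that shares the same structure, instead of timing each region independently.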
examples/cnn_qat/README.md

Lines changed: 1 addition & 1 deletion
@@ -143,4 +143,4 @@ Your actual results will vary based on the dataset, specific hyperparameters, an

  ## Deployment with TensorRT

- The final model after QAT, saved using `mto.save()`, contains both the model weights and the quantization metadata. This model can be deployed to TensorRT for inference after ONNX export. The process is generally similar to [deploying a ONNX PTQ](../onnx_ptq/README.md#evaluate-the-quantized-onnx-model) model from ModelOpt.
+ The final model after QAT, saved using `mto.save()`, contains both the model weights and the quantization metadata. This model can be deployed to TensorRT for inference after ONNX export. The process is generally similar to [deploying an ONNX PTQ](../onnx_ptq/README.md#evaluate-the-quantized-onnx-model) model from ModelOpt.

examples/onnx_ptq/README.md

Lines changed: 19 additions & 0 deletions
@@ -219,6 +219,25 @@ trtexec --onnx=/path/to/identity_neural_network.quant.onnx \
      --staticPlugins=/path/to/libidentity_conv_iplugin_v2_io_ext.so
  ```

+ ### Optimize Q/DQ node placement with Autotune
+
+ This feature automates Q/DQ (Quantize/Dequantize) node placement optimization for ONNX models using TensorRT performance measurements.
+ For more information on the standalone toolkit, please refer to [autotune](./autotune).
+
+ To access this feature in the ONNX quantization workflow, simply add `--autotune` in your CLI:
+
+ ```bash
+ python -m modelopt.onnx.quantization \
+     --onnx_path=vit_base_patch16_224.onnx \
+     --quantize_mode=<fp8|int8|int4> \
+     --calibration_data=calib.npy \
+     --calibration_method=<max|entropy|awq_clip|rtn_dq> \
+     --output_path=vit_base_patch16_224.quant.onnx \
+     --autotune=<quick,default,extensive>
+ ```
+
+ For more fine-tuned Autotune flags, please refer to the [API guide](https://nvidia.github.io/Model-Optimizer/guides/_onnx_quantization.html).
+
  ## Resources

  - 📅 [Roadmap](https://github.com/NVIDIA/Model-Optimizer/issues/146)
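The README diff above exposes an effort knob, `--autotune=<quick,default,extensive>`, trading search time for coverage. As a hedged sketch of how such a choice flag is typically wired up, here is a hypothetical version in plain `argparse`; the preset names come from the CLI above, but the budget numbers and all wiring are assumptions, not ModelOpt's actual implementation:

```python
import argparse

# Hypothetical effort presets for an --autotune flag. The three names
# mirror the CLI shown in the README; the budget values are invented
# purely for illustration.
EFFORT_PRESETS = {
    "quick": {"schemes_per_pattern": 2, "timing_runs": 1},
    "default": {"schemes_per_pattern": 4, "timing_runs": 3},
    "extensive": {"schemes_per_pattern": 8, "timing_runs": 5},
}

def parse_args(argv):
    """Parse an --autotune effort level, rejecting unknown presets."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--autotune", choices=EFFORT_PRESETS, default="default")
    return parser.parse_args(argv)

args = parse_args(["--autotune=extensive"])
budget = EFFORT_PRESETS[args.autotune]
# With "extensive", more schemes are tried per pattern and each engine
# is timed more often, at the cost of a longer search.
```

The general pattern, whatever the real flag values do internally, is that a higher effort level widens the candidate set and repeats latency measurements to reduce timing noise.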
5 files renamed without changes.

tests/gpu/onnx/test_quantize_onnx_torch_int4_awq.py renamed to tests/gpu/onnx/quantization/test_quantize_onnx_torch_int4_awq.py

File renamed without changes.
