diff --git a/CHANGELOG.rst b/CHANGELOG.rst
index 51ef630f6c..86b17315a3 100755
--- a/CHANGELOG.rst
+++ b/CHANGELOG.rst
@@ -20,7 +20,7 @@ NVIDIA Model Optimizer Changelog
 - Add ``nvfp4_omlp_only`` quantization format for NVFP4 quantization. This is similar to ``nvfp4_mlp_only`` but also quantizes the output projection layer in attention.
 - ``pass_through_bwd`` in the quantization config now defaults to True. Please set it to False if you want to use STE with zeroed outlier gradients for potentially better QAT accuracy.
 - Add :meth:`compute_quantization_mse` API to measure per-quantizer mean-squared quantization error, with flexible wildcard and callable filtering.
-- **AutoQDQ**: New tool for automated Q/DQ (Quantize/Dequantize) placement optimization for ONNX models. Uses TensorRT latency measurements to choose insertion schemes that minimize inference time. Discovers regions automatically, groups them by structural pattern, and tests multiple Q/DQ schemes per pattern. Supports INT8 and FP8 quantization, pattern cache for warm-start on similar models, checkpoint/resume, and importing patterns from an existing QDQ baseline. CLI: ``python -m modelopt.onnx.quantization.autotune``. See the AutoQDQ guide in the documentation.
+- **Autotune**: New tool for automated Q/DQ (Quantize/Dequantize) placement optimization for ONNX models. Uses TensorRT latency measurements to choose insertion schemes that minimize inference time. Discovers regions automatically, groups them by structural pattern, and tests multiple Q/DQ schemes per pattern. Supports INT8 and FP8 quantization, a pattern cache for warm-starting on similar models, checkpoint/resume, and importing patterns from an existing QDQ baseline. CLI: ``python -m modelopt.onnx.quantization.autotune``. See the Autotune guide in the documentation.
 - Add ``get_auto_quantize_config`` API to extract a flat quantization config from ``auto_quantize`` search results, enabling re-quantization at different effective bit targets without re-running calibration.
 - Improve ``auto_quantize`` checkpoint/resume: calibration state is now saved and restored across runs, avoiding redundant calibration when resuming a search.
 - Add support for Nemotron-3 (NemotronHForCausalLM) model quantization and for NemotronH MoE experts in ``auto_quantize`` grouping and scoring rules.
diff --git a/docs/source/guides/9_autoqdq.rst b/docs/source/guides/9_autotune.rst
similarity index 99%
rename from docs/source/guides/9_autoqdq.rst
rename to docs/source/guides/9_autotune.rst
index 041f17ce3d..6561ba1d20 100644
--- a/docs/source/guides/9_autoqdq.rst
+++ b/docs/source/guides/9_autotune.rst
@@ -1,5 +1,5 @@
 ===============================================
-Automated Q/DQ Placement Optimization (ONNX)
+Autotune (ONNX)
 ===============================================
 
 .. contents:: Table of Contents
@@ -9,7 +9,7 @@ Automated Q/DQ Placement Optimization (ONNX)
 Overview
 ========
 
-The ``modelopt.onnx.quantization.autotune`` module automates Q/DQ (Quantize/Dequantize) placement in ONNX models. It explores placement strategies and uses TensorRT latency measurements to choose a configuration that minimizes inference time.
+The ``modelopt.onnx.quantization.autotune`` module automates Q/DQ (Quantize/Dequantize) placement optimization in ONNX models. It explores placement strategies and uses TensorRT latency measurements to choose a configuration that minimizes inference time.
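+
+For orientation, a minimal invocation of the CLI entry point might look like the sketch below. The flag names are illustrative assumptions borrowed from the main ``modelopt.onnx.quantization`` CLI, not a verified option list:
+
+.. code-block:: bash
+
+    # Sketch only: --onnx_path and --output_path are assumed flag names,
+    # borrowed from the main quantization CLI. Verify against the
+    # module's --help output before use.
+    python -m modelopt.onnx.quantization.autotune \
+        --onnx_path=model.quant.onnx \
+        --output_path=model.autotuned.onnx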
 
 **Key Features:**
diff --git a/examples/cnn_qat/README.md b/examples/cnn_qat/README.md
index c421ce868c..3d578c5930 100644
--- a/examples/cnn_qat/README.md
+++ b/examples/cnn_qat/README.md
@@ -143,4 +143,4 @@ Your actual results will vary based on the dataset, specific hyperparameters, an
 
 ## Deployment with TensorRT
 
-The final model after QAT, saved using `mto.save()`, contains both the model weights and the quantization metadata. This model can be deployed to TensorRT for inference after ONNX export. The process is generally similar to [deploying a ONNX PTQ](../onnx_ptq/README.md#evaluate-the-quantized-onnx-model) model from ModelOpt.
+The final model after QAT, saved using `mto.save()`, contains both the model weights and the quantization metadata. This model can be deployed to TensorRT for inference after ONNX export. The process is generally similar to [deploying an ONNX PTQ](../onnx_ptq/README.md#evaluate-the-quantized-onnx-model) model from ModelOpt.
diff --git a/examples/onnx_ptq/README.md b/examples/onnx_ptq/README.md
index 980a264938..0cfd4ea62f 100644
--- a/examples/onnx_ptq/README.md
+++ b/examples/onnx_ptq/README.md
@@ -219,6 +219,25 @@ trtexec --onnx=/path/to/identity_neural_network.quant.onnx \
     --staticPlugins=/path/to/libidentity_conv_iplugin_v2_io_ext.so
 ```
 
+### Optimize Q/DQ node placement with Autotune
+
+This feature automates Q/DQ (Quantize/Dequantize) node placement optimization for ONNX models using TensorRT performance measurements.
+For more information on the standalone toolkit, please refer to [autotune](./autotune).
+
+To enable this feature in the ONNX quantization workflow, simply add `--autotune` to your CLI command:
+
+```bash
+python -m modelopt.onnx.quantization \
+    --onnx_path=vit_base_patch16_224.onnx \
+    --quantize_mode=<int8|fp8> \
+    --calibration_data=calib.npy \
+    --calibration_method=<max|entropy> \
+    --output_path=vit_base_patch16_224.quant.onnx \
+    --autotune
+```
+
+For finer-grained Autotune options and the full list of flags, please refer to the [API guide](https://nvidia.github.io/Model-Optimizer/guides/_onnx_quantization.html).
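+
+Because Autotune selects placements based on measured TensorRT latency, a quick way to sanity-check the result is to time the autotuned model with `trtexec` and compare it against a model quantized the same way but without `--autotune` (the file name below is from the example above):
+
+```bash
+# Build a TensorRT engine from the autotuned model and report latency;
+# run the same command on a model quantized without --autotune to compare.
+trtexec --onnx=vit_base_patch16_224.quant.onnx
+```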
+
 ## Resources
 
 - 📅 [Roadmap](https://github.com/NVIDIA/Model-Optimizer/issues/146)
diff --git a/examples/onnx/autoqdq/README.md b/examples/onnx_ptq/autotune/README.md
similarity index 100%
rename from examples/onnx/autoqdq/README.md
rename to examples/onnx_ptq/autotune/README.md
diff --git a/tests/gpu/onnx/test_concat_elim.py b/tests/gpu/onnx/quantization/test_concat_elim.py
similarity index 100%
rename from tests/gpu/onnx/test_concat_elim.py
rename to tests/gpu/onnx/quantization/test_concat_elim.py
diff --git a/tests/gpu/onnx/test_plugin.py b/tests/gpu/onnx/quantization/test_plugin.py
similarity index 100%
rename from tests/gpu/onnx/test_plugin.py
rename to tests/gpu/onnx/quantization/test_plugin.py
diff --git a/tests/gpu/onnx/test_qdq_utils_fp8.py b/tests/gpu/onnx/quantization/test_qdq_utils_fp8.py
similarity index 100%
rename from tests/gpu/onnx/test_qdq_utils_fp8.py
rename to tests/gpu/onnx/quantization/test_qdq_utils_fp8.py
diff --git a/tests/gpu/onnx/test_quantize_fp8.py b/tests/gpu/onnx/quantization/test_quantize_fp8.py
similarity index 100%
rename from tests/gpu/onnx/test_quantize_fp8.py
rename to tests/gpu/onnx/quantization/test_quantize_fp8.py
diff --git a/tests/gpu/onnx/test_quantize_onnx_torch_int4_awq.py b/tests/gpu/onnx/quantization/test_quantize_onnx_torch_int4_awq.py
similarity index 100%
rename from tests/gpu/onnx/test_quantize_onnx_torch_int4_awq.py
rename to tests/gpu/onnx/quantization/test_quantize_onnx_torch_int4_awq.py
diff --git a/tests/unit/onnx/test_convtranspose_qdq.py b/tests/unit/onnx/quantization/test_convtranspose_qdq.py
similarity index 100%
rename from tests/unit/onnx/test_convtranspose_qdq.py
rename to tests/unit/onnx/quantization/test_convtranspose_qdq.py
diff --git a/tests/unit/onnx/test_dq_transpose_surgery.py b/tests/unit/onnx/quantization/test_dq_transpose_surgery.py
similarity index 100%
rename from tests/unit/onnx/test_dq_transpose_surgery.py
rename to tests/unit/onnx/quantization/test_dq_transpose_surgery.py
diff --git a/tests/unit/onnx/test_qdq_rules_int8.py b/tests/unit/onnx/quantization/test_qdq_rules_int8.py
similarity index 100%
rename from tests/unit/onnx/test_qdq_rules_int8.py
rename to tests/unit/onnx/quantization/test_qdq_rules_int8.py
diff --git a/tests/unit/onnx/test_qdq_utils.py b/tests/unit/onnx/quantization/test_qdq_utils.py
similarity index 100%
rename from tests/unit/onnx/test_qdq_utils.py
rename to tests/unit/onnx/quantization/test_qdq_utils.py
diff --git a/tests/unit/onnx/test_quant_utils.py b/tests/unit/onnx/quantization/test_quant_utils.py
similarity index 100%
rename from tests/unit/onnx/test_quant_utils.py
rename to tests/unit/onnx/quantization/test_quant_utils.py
diff --git a/tests/unit/onnx/test_quantize_api.py b/tests/unit/onnx/quantization/test_quantize_api.py
similarity index 100%
rename from tests/unit/onnx/test_quantize_api.py
rename to tests/unit/onnx/quantization/test_quantize_api.py
diff --git a/tests/unit/onnx/test_quantize_int8.py b/tests/unit/onnx/quantization/test_quantize_int8.py
similarity index 100%
rename from tests/unit/onnx/test_quantize_int8.py
rename to tests/unit/onnx/quantization/test_quantize_int8.py
diff --git a/tests/unit/onnx/test_quantize_zint4.py b/tests/unit/onnx/quantization/test_quantize_zint4.py
similarity index 100%
rename from tests/unit/onnx/test_quantize_zint4.py
rename to tests/unit/onnx/quantization/test_quantize_zint4.py