CHANGELOG-Windows.rst
NVIDIA Model Optimizer Changelog (Windows)
==========================================

0.41 (TBD)
^^^^^^^^^^

**Bug Fixes**

- Fix ONNX 1.19 compatibility issues with CuPy during ONNX INT4 AWQ quantization. ONNX 1.19 uses ``ml_dtypes.int4`` instead of ``numpy.int8``, which caused CuPy failures.

**New Features**

- Add support for ONNX mixed-precision weight-only quantization using INT4 and INT8 precisions. Refer to the quantization `example for GenAI LLMs <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/windows/onnx_ptq/genai_llm>`_.
- Add support for quantization of some diffusion models on Windows. Refer to the `example script <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/windows/torch_onnx/diffusers>`_ for details.
- Add `Perplexity <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/windows/accuracy_benchmark/perplexity_metrics>`_ and `KL-Divergence <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/windows/accuracy_benchmark/kl_divergence_metrics>`_ accuracy benchmarks.

0.33 (2025-07-21)
^^^^^^^^^^^^^^^^^
- This is the first official release of Model Optimizer for Windows.
- **ONNX INT4 Quantization:** :meth:`modelopt.onnx.quantization.quantize_int4 <modelopt.onnx.quantization.int4.quantize>` now supports ONNX INT4 quantization for DirectML and TensorRT* deployment. See :ref:`Support_Matrix` for details about supported features and models.
- **LLM Quantization with Olive:** Enabled LLM quantization through Olive, streamlining model optimization workflows. Refer to the `Olive example <https://github.com/microsoft/Olive/tree/main/examples/phi3#quantize-models-with-nvidia-Model-Optimizer>`_.
- **MMLU Benchmark for Accuracy Evaluations:** Introduced `MMLU benchmarking <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/windows/accuracy_benchmark/README.md>`_ for accuracy evaluation of ONNX models on DirectML (DML).
- **Published quantized ONNX models collection:** Published quantized ONNX models at HuggingFace `NVIDIA collections <https://huggingface.co/collections/nvidia/optimized-onnx-models-for-nvidia-rtx-gpus>`_.
CHANGELOG.rst
NVIDIA Model Optimizer Changelog (Linux)
========================================
- Add support for Kimi K2 Thinking model quantization from the original int4 checkpoint.
- Add support for MiniMax M2.1 model quantization from the original FP8 checkpoint.
- Add support for ``params`` constraint based automatic neural architecture search in Minitron pruning (``mcore_minitron``) as an alternative to manual pruning (using ``export_config``). See `examples/pruning/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/pruning>`_ for more details on its usage.
- New example for Minitron pruning with the Megatron-Bridge framework, along with advanced pruning usage via the new ``params`` constraint based pruning. Check `examples/megatron_bridge/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/megatron_bridge>`_ for example scripts.
- Add support for calibration data with multiple samples in ``npz`` format in the ONNX Autocast workflow.
- Add ``--opset`` option to the ONNX quantization CLI to specify the target opset version for the quantized model.
- Add support for context parallelism in Eagle speculative decoding for Hugging Face and Megatron Core models.
- Add PTQ support for GLM-4.7, including loading MTP layer weights from a separate ``mtp.safetensors`` file and exporting them as-is.
- Add support for image-text data calibration in PTQ for Nemotron VL models.

0.41 (2026-01-19)
^^^^^^^^^^^^^^^^^
- Add support for UNet ONNX quantization.
- Enable ``concat_elimination`` pass by default to improve the performance of quantized ONNX models.
- Enable the redundant Cast elimination pass by default in :meth:`moq.quantize <modelopt.onnx.quantization.quantize>`.
- Add new attribute ``parallel_state`` to :class:`QuantModule <modelopt.torch.quantization.nn.modules.quant_module.QuantModule>` to support distributed parallelism such as data parallel and tensor parallel.
- Add MXFP8, NVFP4 quantized ONNX export support.
- Add a new example for torch quantization to ONNX for MXFP8, NVFP4 precision.
**[Input]** Model Optimizer currently supports [Hugging Face](https://huggingface.co/), [PyTorch](https://github.com/pytorch/pytorch) or [ONNX](https://github.com/onnx/onnx) models as input.

**[Optimize]** Model Optimizer provides Python APIs for users to easily compose the above model optimization techniques and export an optimized quantized checkpoint.
Model Optimizer is also integrated with [NVIDIA Megatron-Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge), [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) and [Hugging Face Accelerate](https://github.com/huggingface/accelerate) for inference optimization techniques that require training.

**[Export for deployment]** Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized checkpoint generated from Model Optimizer is ready for deployment in downstream inference frameworks like [SGLang](https://github.com/sgl-project/sglang), [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization), [TensorRT](https://github.com/NVIDIA/TensorRT), or [vLLM](https://github.com/vllm-project/vllm).
docs/source/deployment/2_onnxruntime.rst
.. _Onnxruntime_Deployment:

===========
Onnxruntime
===========

Once an ONNX FP16 model is quantized using Model Optimizer on Windows, the resulting quantized ONNX model can be deployed via `ONNX Runtime GenAI <https://onnxruntime.ai/docs/genai/>`_ or `ONNX Runtime <https://onnxruntime.ai/>`_.

ONNX Runtime uses execution providers (EPs) to run models efficiently across a range of backends, including:

- **CUDA EP:** Utilizes NVIDIA GPUs for fast inference with the CUDA and cuDNN libraries.
- **DirectML EP:** Enables deployment on a wide range of GPUs.
- **TensorRT-RTX EP:** Targets NVIDIA RTX GPUs, leveraging TensorRT for further optimized inference.
- **CPU EP:** Provides a fallback to run inference on the system's CPU when specialized hardware is unavailable.

Choose the EP that best matches your model, hardware, and deployment requirements.

.. note:: Currently, the DirectML backend doesn't support 8-bit precision, so 8-bit quantized models should be deployed on other backends such as ORT-CUDA. However, the DML path does support INT4 quantized models.
ONNX Runtime GenAI offers a streamlined solution for deploying generative AI models.
- **Control Options**: Use the high-level ``generate()`` method for rapid deployment, or execute each iteration of the model in a loop for fine-grained control.
- **Multi-Language API Support**: Provides APIs for Python, C#, and C/C++, allowing seamless integration across a range of applications.

.. note::

   ONNX Runtime GenAI models are typically tied to the execution provider (EP) they were built with; a model exported for one EP (e.g., CUDA or DirectML) is generally not compatible with other EPs. To run inference on a different backend, re-export or convert the model specifically for that target EP.

**Getting Started**:

Refer to the `ONNX Runtime GenAI documentation <https://onnxruntime.ai/docs/genai/>`_ for an in-depth guide on installation, setup, and usage.
For further details and examples, please refer to the ONNX Runtime documentation.
Collection of optimized ONNX models
===================================

The ready-to-deploy optimized ONNX models from ModelOpt-Windows are available at HuggingFace `NVIDIA collections <https://huggingface.co/collections/nvidia/optimized-onnx-models-for-nvidia-rtx-gpus>`_. Follow the instructions provided along with the published models for deployment.
docs/source/getting_started/1_overview.rst
It accepts a torch or ONNX model as input and provides Python APIs for users to easily stack different model optimization techniques to produce optimized & quantized checkpoints. Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized checkpoint generated from Model Optimizer is ready for deployment in downstream inference frameworks like `TensorRT-LLM <https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization>`_ or `TensorRT <https://github.com/NVIDIA/TensorRT>`_ (Linux). ModelOpt is integrated with `NVIDIA NeMo <https://github.com/NVIDIA-NeMo/NeMo>`_ and `Megatron-LM <https://github.com/NVIDIA/Megatron-LM>`_ for training-in-the-loop optimization techniques. For enterprise users, 8-bit quantization with Stable Diffusion is also available on `NVIDIA NIM <https://developer.nvidia.com/blog/nvidia-nim-offers-optimized-inference-microservices-for-deploying-ai-models-at-scale/>`_.

For Windows users, `Model Optimizer for Windows <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/windows>`_ (ModelOpt-Windows) delivers model compression techniques, including quantization, on Windows RTX PC systems. ModelOpt-Windows is optimized for efficient quantization, featuring local GPU calibration, reduced system and video memory consumption, and swift processing times. It integrates seamlessly with the Windows ecosystem, producing optimized ONNX models as output for the `Microsoft DirectML <https://github.com/microsoft/DirectML>`_ and `TensorRT-RTX <https://github.com/NVIDIA/TensorRT-RTX>`_ backends. Furthermore, ModelOpt-Windows supports SDKs such as `Microsoft Olive <https://github.com/microsoft/Olive>`_ and `ONNX Runtime <https://github.com/microsoft/onnxruntime>`_, enabling the deployment of quantized models across various independent hardware vendors through the DirectML path.

Model Optimizer for both Linux and Windows is available for free to all developers on `NVIDIA PyPI <https://pypi.org/project/nvidia-modelopt/>`_. Visit the `Model Optimizer GitHub repository <https://github.com/NVIDIA/Model-Optimizer>`_ for end-to-end example scripts and recipes optimized for NVIDIA GPUs.
- Make sure to use a GPU-compatible driver and other dependencies (e.g. torch). For instance, support for Blackwell GPUs may require an NVIDIA 570+ driver and CUDA 12.8+.
- We currently support a *Single-GPU* configuration.

The Model Optimizer - Windows can be used in the following ways:
docs/source/getting_started/windows/_installation_standalone.rst
Before using ModelOpt-Windows, the following components must be installed:
- NVIDIA GPU and Graphics Driver
- Python version >= 3.10 and < 3.13
- Visual Studio 2022 / MSVC / C/C++ Build Tools
- CUDA Toolkit and cuDNN, for using the CUDA path during calibration (e.g. for calibration of ONNX models using ``onnxruntime-gpu`` or the CUDA EP)

Update the ``PATH`` environment variable as needed for the above prerequisites.
It is recommended to use a virtual environment for managing Python dependencies.
$ python -m venv .\myEnv
$ .\myEnv\Scripts\activate

In the newly created virtual environment, none of the required packages (e.g., onnx, onnxruntime, onnxruntime-directml, onnxruntime-gpu, nvidia-modelopt) will be pre-installed.

**3. Install ModelOpt-Windows Wheel**

To install the ONNX module of ModelOpt-Windows, run the following command:

.. code-block:: bash

    pip install "nvidia-modelopt[onnx]"

If you install ModelOpt-Windows without the extra ``[onnx]`` option, only the minimal core dependencies and the PyTorch module (``torch``) will be installed. Support for ONNX model quantization requires installing with ``[onnx]``.

**4. ONNX Model Quantization: Set Up an ONNX Runtime Execution Provider for Calibration**

The Post-Training Quantization (PTQ) process for ONNX models usually involves running the base model with user-supplied inputs, a process called calibration. The user-supplied model inputs are referred to as calibration data. To perform calibration, the base model must be run using a suitable ONNX Execution Provider (EP), such as *DmlExecutionProvider* (DirectML EP) or *CUDAExecutionProvider* (CUDA EP). There are different ONNX Runtime packages for each EP:

- *onnxruntime-directml* provides the DirectML EP.
- *onnxruntime-trt-rtx* provides the TensorRT-RTX EP.
- *onnxruntime-gpu* provides the CUDA EP.
- *onnxruntime* provides the CPU EP.

By default, ModelOpt-Windows installs *onnxruntime-gpu*. The default CUDA version needed for *onnxruntime-gpu* since v1.19.0 is 12.x. The *onnxruntime-gpu* package (i.e., the CUDA EP) has CUDA and cuDNN dependencies:

- Install CUDA and cuDNN: for the ONNX Runtime GPU package, you need to install the appropriate versions of CUDA and cuDNN. Refer to the `CUDA Execution Provider requirements <https://onnxruntime.ai/docs/install/#cuda-and-cudnn/>`_ for compatible versions of CUDA and cuDNN.

If you need to use any other EP for calibration, uninstall the existing *onnxruntime-gpu* package and install the corresponding package. For example, to use the DirectML EP, uninstall *onnxruntime-gpu* and install *onnxruntime-directml*:

.. code-block:: bash

    pip uninstall onnxruntime-gpu
    pip install onnxruntime-directml
**5. Setup GPU Acceleration Tool for Quantization**
By default, ModelOpt-Windows utilizes the `cupy-cuda12x <https://cupy.dev/>`_ tool for GPU acceleration during quantization.
Ensure the following steps are verified:

- **Task Manager**: Check that the GPU appears in the Task Manager, indicating that the graphics driver is installed and functioning.
- **Python Interpreter**: Open the command line and type ``python``. The Python interpreter should start, displaying the Python version.
- **Onnxruntime Package**: Ensure that exactly one of the following is installed:

  - *onnxruntime-directml* (DirectML EP)
  - *onnxruntime-trt-rtx* (TensorRT-RTX EP)
  - *onnxruntime-gpu* (CUDA EP)
  - *onnxruntime* (CPU EP)

- **Onnx and Onnxruntime Import**: Ensure that the following Python command runs successfully.
0 commit comments