Commit 4a848c4

Modelopt-windows documentation update (#812)
## What does this PR do?

Documentation.

**Overview:** Update the support matrix, changelog, deployment page, and example READMEs to reflect recent feature and model support on the Windows side.

## Testing

No testing; this is a documentation-only change.

## Summary by CodeRabbit

* **New Features**
  * Added ONNX Mixed Precision Weight-only quantization (INT4/INT8) support.
  * Introduced diffusion-model quantization on Windows.
  * Added new accuracy benchmarks (Perplexity and KL-Divergence).
  * Expanded deployment with multiple ONNX Runtime Execution Providers (CUDA, DirectML, TensorRT-RTX).
* **Bug Fixes**
  * Fixed an ONNX 1.19 compatibility issue with CuPy during INT4 AWQ quantization.
* **Documentation**
  * Updated installation guides with system requirements and multiple backend options.
  * Reorganized deployment documentation with comprehensive execution-provider guidance.
  * Expanded example workflows with improved setup instructions and support matrices.

Signed-off-by: vipandya <vipandya@nvidia.com>
1 parent 2c73de0 commit 4a848c4

File tree

15 files changed: +157 -163 lines

CHANGELOG-Windows.rst

Lines changed: 15 additions & 2 deletions

@@ -1,6 +1,19 @@
 NVIDIA Model Optimizer Changelog (Windows)
 ==========================================

+0.41 (TBD)
+^^^^^^^^^^
+
+**Bug Fixes**
+
+- Fix ONNX 1.19 compatibility issues with CuPy during ONNX INT4 AWQ quantization. ONNX 1.19 uses ml_dtypes.int4 instead of numpy.int8, which caused CuPy failures.
+
+**New Features**
+
+- Add support for ONNX Mixed Precision Weight-only quantization using INT4 and INT8 precisions. Refer to the quantization `example for GenAI LLMs <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/windows/onnx_ptq/genai_llm>`_.
+- Add support for quantization of some diffusion models on Windows. Refer to the `example script <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/windows/torch_onnx/diffusers>`_ for details.
+- Add `Perplexity <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/windows/accuracy_benchmark/perplexity_metrics>`_ and `KL-Divergence <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/windows/accuracy_benchmark/kl_divergence_metrics>`_ accuracy benchmarks.
+
 0.33 (2025-07-21)
 ^^^^^^^^^^^^^^^^^

@@ -25,8 +38,8 @@ NVIDIA Model Optimizer Changelog (Windows)

 - This is the first official release of Model Optimizer for Windows
 - **ONNX INT4 Quantization:** :meth:`modelopt.onnx.quantization.quantize_int4 <modelopt.onnx.quantization.int4.quantize>` now supports ONNX INT4 quantization for DirectML and TensorRT* deployment. See :ref:`Support_Matrix` for details about supported features and models.
-- **LLM Quantization with Olive:** Enabled LLM quantization through Olive, streamlining model optimization workflows. Refer `example <https://github.com/microsoft/Olive/tree/main/examples/phi3#quantize-models-with-nvidia-Model-Optimizer>`_
-- **DirectML Deployment Guide:** Added DML deployment guide. Refer :ref:`DirectML_Deployment`.
+- **LLM Quantization with Olive:** Enabled LLM quantization through Olive, streamlining model optimization workflows. Refer to the `Olive example <https://github.com/microsoft/Olive/tree/main/examples/phi3#quantize-models-with-nvidia-Model-Optimizer>`_.
+- **DirectML Deployment Guide:** Added DML deployment guide. Refer to the :ref:`Onnxruntime_Deployment` guide for details.
 - **MMLU Benchmark for Accuracy Evaluations:** Introduced `MMLU benchmarking <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/windows/accuracy_benchmark/README.md>`_ for accuracy evaluation of ONNX models on DirectML (DML).
 - **Published quantized ONNX models collection:** Published quantized ONNX models at HuggingFace `NVIDIA collections <https://huggingface.co/collections/nvidia/optimized-onnx-models-for-nvidia-rtx-gpus>`_.
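The Perplexity and KL-Divergence benchmarks added in the changelog above measure how closely a quantized model tracks a reference model. As a rough, self-contained illustration of the two metrics (this is not the benchmark scripts themselves, just the underlying math), they can be computed from token probabilities like this:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(-mean log-probability of the reference tokens)."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) between two token probability distributions,
    e.g. a baseline FP16 model's output vs. the quantized model's output."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# A model assigning probability 0.5 to every reference token has perplexity 2;
# identical distributions have (near-)zero KL divergence.
print(perplexity([math.log(0.5)] * 4))        # ~2.0
print(kl_divergence([0.25] * 4, [0.25] * 4))  # ~0.0
```

Lower is better for both: perplexity of the quantized model should stay close to the FP16 baseline, and the KL divergence between the two output distributions should stay near zero.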

Lines changed: 18 additions & 6 deletions

@@ -1,11 +1,19 @@
-.. _DirectML_Deployment:
+.. _Onnxruntime_Deployment:

-===================
-DirectML
-===================
+===========
+Onnxruntime
+===========

+Once an ONNX FP16 model is quantized using Model Optimizer on Windows, the resulting quantized ONNX model can be deployed via `ONNX Runtime GenAI <https://onnxruntime.ai/docs/genai/>`_ or `ONNX Runtime <https://onnxruntime.ai/>`_.

-Once an ONNX FP16 model is quantized using Model Optimizer on Windows, the resulting quantized ONNX model can be deployed on the DirectML (DML) backend via the `ONNX Runtime GenAI <https://onnxruntime.ai/docs/genai/>`_ or `ONNX Runtime <https://onnxruntime.ai/>`_.
+ONNX Runtime uses execution providers (EPs) to run models efficiently across a range of backends, including:
+
+- **CUDA EP:** Utilizes NVIDIA GPUs for fast inference with the CUDA and cuDNN libraries.
+- **DirectML EP:** Enables deployment on a wide range of GPUs.
+- **TensorRT-RTX EP:** Targets NVIDIA RTX GPUs, leveraging TensorRT for further optimized inference.
+- **CPU EP:** Provides a fallback to run inference on the system's CPU when specialized hardware is unavailable.
+
+Choose the EP that best matches your model, hardware, and deployment requirements.

 .. note:: Currently, the DirectML backend doesn't support 8-bit precision, so 8-bit quantized models should be deployed on other backends like ORT-CUDA. The DML path does, however, support INT4 quantized models.

@@ -21,6 +29,10 @@ ONNX Runtime GenAI offers a streamlined solution for deploying generative AI mod
 - **Control Options**: Use the high-level ``generate()`` method for rapid deployment or execute each iteration of the model in a loop for fine-grained control.
 - **Multi-Language API Support**: Provides APIs for Python, C#, and C/C++, allowing seamless integration across a range of applications.

+.. note::
+
+   ONNX Runtime GenAI models are typically tied to the execution provider (EP) they were built with; a model exported for one EP (e.g., CUDA or DirectML) is generally not compatible with other EPs. To run inference on a different backend, re-export or convert the model specifically for that target EP.
+
 **Getting Started**:

 Refer to the `ONNX Runtime GenAI documentation <https://onnxruntime.ai/docs/genai/>`_ for an in-depth guide on installation, setup, and usage.

@@ -42,4 +54,4 @@ For further details and examples, please refer to the `ONNX Runtime documentatio
 Collection of optimized ONNX models
 ===================================

-The ready-to-deploy optimized ONNX models from ModelOpt-Windows are available at HuggingFace `NVIDIA collections <https://huggingface.co/collections/nvidia/optimized-onnx-models-for-nvidia-rtx-gpus>`_. These models can be deployed using DirectML backend. Follow the instructions provided along with the published models for deployment.
+The ready-to-deploy optimized ONNX models from ModelOpt-Windows are available in the HuggingFace `NVIDIA collections <https://huggingface.co/collections/nvidia/optimized-onnx-models-for-nvidia-rtx-gpus>`_. Follow the instructions provided along with the published models for deployment.
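The execution providers listed in this deployment page imply a natural preference order at inference time. Below is a minimal sketch of how an application might pick the best available EP. The DirectML, CUDA, and CPU provider strings are the standard ONNX Runtime names; the TensorRT-RTX provider string is an assumption here, so check your installed package for the exact name.

```python
# Preference order is illustrative; reorder to match your deployment needs.
EP_PREFERENCE = [
    "NvTensorRtRtxExecutionProvider",  # TensorRT-RTX EP (name assumed)
    "CUDAExecutionProvider",           # CUDA EP
    "DmlExecutionProvider",            # DirectML EP
    "CPUExecutionProvider",            # CPU fallback
]

def pick_execution_provider(available):
    """Return the highest-priority execution provider present in `available`."""
    for ep in EP_PREFERENCE:
        if ep in available:
            return ep
    raise RuntimeError("No supported ONNX Runtime execution provider found")

# In a real application, query the installed onnxruntime package:
#   import onnxruntime; available = onnxruntime.get_available_providers()
print(pick_execution_provider(["CPUExecutionProvider", "DmlExecutionProvider"]))
# -> DmlExecutionProvider
```

Remember the note above: for ONNX Runtime GenAI models this selection must match the EP the model was exported for, not just whatever is available.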

docs/source/getting_started/1_overview.rst

Lines changed: 1 addition & 1 deletion

@@ -11,7 +11,7 @@ is a library comprising state-of-the-art model optimization techniques including
 It accepts a torch or ONNX model as input and provides Python APIs for users to easily stack different model optimization
 techniques to produce optimized & quantized checkpoints. Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized checkpoint generated from Model Optimizer is ready for deployment in downstream inference frameworks like `TensorRT-LLM <https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization>`_ or `TensorRT <https://github.com/NVIDIA/TensorRT>`_ (Linux). ModelOpt is integrated with `NVIDIA NeMo <https://github.com/NVIDIA-NeMo/NeMo>`_ and `Megatron-LM <https://github.com/NVIDIA/Megatron-LM>`_ for training-in-the-loop optimization techniques. For enterprise users, the 8-bit quantization with Stable Diffusion is also available on `NVIDIA NIM <https://developer.nvidia.com/blog/nvidia-nim-offers-optimized-inference-microservices-for-deploying-ai-models-at-scale/>`_.

-For Windows users, the `Model Optimizer for Windows <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/windows/README.md>`_ (ModelOpt-Windows) delivers model compression techniques, including quantization, on Windows RTX PC systems. ModelOpt-Windows is optimized for efficient quantization, featuring local GPU calibration, reduced system and video memory consumption, and swift processing times. It integrates seamlessly with the Windows ecosystem, with optimized ONNX models as output for `Microsoft DirectML <https://github.com/microsoft/DirectML>`_ backends. Furthermore, ModelOpt-Windows supports SDKs such as `Microsoft Olive <https://github.com/microsoft/Olive>`_ and `ONNX Runtime <https://github.com/microsoft/onnxruntime>`_, enabling the deployment of quantized models across various independent hardware vendors through the DirectML path.
+For Windows users, the `Model Optimizer for Windows <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/windows>`_ (ModelOpt-Windows) delivers model compression techniques, including quantization, on Windows RTX PC systems. ModelOpt-Windows is optimized for efficient quantization, featuring local GPU calibration, reduced system and video memory consumption, and swift processing times. It integrates seamlessly with the Windows ecosystem, with optimized ONNX models as output for `Microsoft DirectML <https://github.com/microsoft/DirectML>`_ and `TensorRT-RTX <https://github.com/NVIDIA/TensorRT-RTX>`_ backends. Furthermore, ModelOpt-Windows supports SDKs such as `Microsoft Olive <https://github.com/microsoft/Olive>`_ and `ONNX Runtime <https://github.com/microsoft/onnxruntime>`_, enabling the deployment of quantized models across various independent hardware vendors through the DirectML path.

 Model Optimizer for both Linux and Windows are available for free for all developers on `NVIDIA PyPI <https://pypi.org/project/nvidia-modelopt/>`_. Visit the `Model Optimizer GitHub repository <https://github.com/NVIDIA/Model-Optimizer>`_ for end-to-end
 example scripts and recipes optimized for NVIDIA GPUs.

docs/source/getting_started/windows/_installation_for_Windows.rst

Lines changed: 1 addition & 1 deletion

@@ -25,7 +25,7 @@ The following system requirements are necessary to install and use Model Optimiz
 +-------------------------+-----------------------------+

 .. note::
-    Make sure to use GPU-compatible driver and other dependencies (e.g. torch etc.). For instance, support for Blackwell GPU might be present in Nvidia 570+ driver, and CUDA-12.8.
+    Make sure to use a GPU-compatible driver and compatible dependencies (e.g., torch). For instance, Blackwell GPU support may be available with an NVIDIA 570+ driver and CUDA 12.8+.
     - We currently support *Single-GPU* configuration.

 The Model Optimizer - Windows can be used in following ways:

docs/source/getting_started/windows/_installation_standalone.rst

Lines changed: 13 additions & 18 deletions

@@ -13,6 +13,7 @@ Before using ModelOpt-Windows, the following components must be installed:
 - NVIDIA GPU and Graphics Driver
 - Python version >= 3.10 and < 3.13
 - Visual Studio 2022 / MSVC / C/C++ Build Tools
+- CUDA Toolkit and cuDNN, for using the CUDA path during calibration (e.g. calibrating ONNX models with ``onnxruntime-gpu`` / the CUDA EP)

 Update ``PATH`` environment variable as needed for above prerequisites.

@@ -26,45 +27,38 @@ It is recommended to use a virtual environment for managing Python dependencies.
     $ python -m venv .\myEnv
     $ .\myEnv\Scripts\activate

-In the newly created virtual environment, none of the required packages (e.g., onnx, onnxruntime, onnxruntime-directml, onnxruntime-gpu, nvidia-modelopt) will be pre-installed.
+In the newly created virtual environment, none of the required packages (e.g., onnx, onnxruntime, onnxruntime-directml, onnxruntime-gpu, nvidia-modelopt, etc.) will be pre-installed.

 **3. Install ModelOpt-Windows Wheel**

-To install the ModelOpt-Windows wheel, run the following command:
+To install the ONNX module of ModelOpt-Windows, run the following command:

 .. code-block:: bash

     pip install "nvidia-modelopt[onnx]"

-This command installs ModelOpt-Windows and its ONNX module, along with the *onnxruntime-directml* (v1.20.0) package. If ModelOpt-Windows is installed without the additional parameter, only the bare minimum dependencies will be installed, without the relevant module and dependencies.
+If you install ModelOpt-Windows without the extra ``[onnx]`` option, only the minimal core dependencies and the PyTorch module (``torch``) will be installed. Support for ONNX model quantization requires installing with ``[onnx]``.

-**4. Setup ONNX Runtime (ORT) for Calibration**
+**4. ONNX Model Quantization: Set Up an ONNX Runtime Execution Provider for Calibration**

-The ONNX Post-Training Quantization (PTQ) process involves running the base model with user-supplied inputs, a process called calibration. The user-supplied model inputs are referred to as calibration data. To perform calibration, the base model must be run using a suitable ONNX Execution Provider (EP), such as *DmlExecutionProvider* (DirectML EP) or *CUDAExecutionProvider* (CUDA EP). There are different ONNX Runtime packages for each EP:
+The Post-Training Quantization (PTQ) process for ONNX models usually involves running the base model with user-supplied inputs, a process called calibration. The user-supplied model inputs are referred to as calibration data. To perform calibration, the base model must be run using a suitable ONNX Execution Provider (EP), such as *DmlExecutionProvider* (DirectML EP) or *CUDAExecutionProvider* (CUDA EP). There are different ONNX Runtime packages for each EP:

 - *onnxruntime-directml* provides the DirectML EP.
+- *onnxruntime-trt-rtx* provides the TensorRT-RTX EP.
 - *onnxruntime-gpu* provides the CUDA EP.
 - *onnxruntime* provides the CPU EP.

-By default, ModelOpt-Windows installs *onnxruntime-directml* and uses the DirectML EP (v1.20.0) for calibration. No additional dependencies are required.
-If you prefer to use the CUDA EP for calibration, uninstall the existing *onnxruntime-directml* package and install the *onnxruntime-gpu* package, which requires CUDA and cuDNN dependencies:
-
-- Uninstall *onnxruntime-directml*:
-
-  .. code-block:: bash
-
-      pip uninstall onnxruntime-directml
+By default, ModelOpt-Windows installs *onnxruntime-gpu*; since v1.19.0, the default CUDA version for *onnxruntime-gpu* is 12.x. The *onnxruntime-gpu* package (i.e. the CUDA EP) has CUDA and cuDNN dependencies:

 - Install CUDA and cuDNN:
     - For the ONNX Runtime GPU package, you need to install the appropriate version of CUDA and cuDNN. Refer to the `CUDA Execution Provider requirements <https://onnxruntime.ai/docs/install/#cuda-and-cudnn/>`_ for compatible versions of CUDA and cuDNN.

-- Install ONNX Runtime GPU (CUDA 12.x):
+If you need a different EP for calibration, uninstall the existing *onnxruntime-gpu* package and install the corresponding package. For example, to use the DirectML EP:

 .. code-block:: bash

-    pip install onnxruntime-gpu
-
-- The default CUDA version for *onnxruntime-gpu* since v1.19.0 is 12.x.
+    pip uninstall onnxruntime-gpu
+    pip install onnxruntime-directml

 **5. Setup GPU Acceleration Tool for Quantization**

@@ -75,8 +69,9 @@ By default, ModelOpt-Windows utilizes the `cupy-cuda12x <https://cupy.dev//>`_ t
 Ensure the following steps are verified:
 - **Task Manager**: Check that the GPU appears in the Task Manager, indicating that the graphics driver is installed and functioning.
 - **Python Interpreter**: Open the command line and type python. The Python interpreter should start, displaying the Python version.
-- **Onnxruntime Package**: Ensure that one of the following is installed:
+- **Onnxruntime Package**: Ensure that exactly one of the following is installed:
     - *onnxruntime-directml* (DirectML EP)
+    - *onnxruntime-trt-rtx* (TensorRT-RTX EP)
     - *onnxruntime-gpu* (CUDA EP)
     - *onnxruntime* (CPU EP)
 - **Onnx and Onnxruntime Import**: Ensure that the following Python command runs successfully.
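The "exactly one ONNX Runtime package" check above can be automated. A small sketch using only the standard library's ``importlib.metadata``, with the package names taken from the list in this guide:

```python
from importlib import metadata

# The ONNX Runtime variants named in the guide; exactly one should be installed.
ORT_VARIANTS = [
    "onnxruntime-directml",  # DirectML EP
    "onnxruntime-trt-rtx",   # TensorRT-RTX EP
    "onnxruntime-gpu",       # CUDA EP
    "onnxruntime",           # CPU EP
]

def installed_ort_variants():
    """Return the ONNX Runtime packages found in the current environment."""
    found = []
    for name in ORT_VARIANTS:
        try:
            metadata.version(name)  # raises if the package is absent
            found.append(name)
        except metadata.PackageNotFoundError:
            pass
    return found

if __name__ == "__main__":
    found = installed_ort_variants()
    if len(found) != 1:
        print(f"Expected exactly one ONNX Runtime package, found: {found or 'none'}")
```

Running this in the virtual environment flags both the "none installed" and the "conflicting variants installed" cases before calibration is attempted.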

docs/source/getting_started/windows/_installation_with_olive.rst

Lines changed: 5 additions & 5 deletions

@@ -4,7 +4,7 @@
 Install ModelOpt-Windows with Olive
 ===================================

-ModelOpt-Windows can be installed and used through Olive to quantize Large Language Models (LLMs) in ONNX format for deployment with DirectML. Follow the steps below to configure Olive for use with ModelOpt-Windows.
+ModelOpt-Windows can be installed and used through Olive to perform model optimization using quantization techniques. Follow the steps below to configure Olive for use with ModelOpt-Windows.

 Setup Steps for Olive with ModelOpt-Windows
 -------------------------------------------

@@ -17,7 +17,7 @@ Setup Steps for Olive with ModelOpt-Windows

     pip install olive-ai[nvmo]

-- **Install Prerequisites:** Ensure all required dependencies are installed. Use the following commands to install the necessary packages:
+- **Install Prerequisites:** Ensure all required dependencies are installed. For example, to use DirectML Execution Provider (EP) based onnxruntime and onnxruntime-genai packages, run the following commands:

 .. code-block:: shell

@@ -31,11 +31,11 @@ Setup Steps for Olive with ModelOpt-Windows
 **2. Configure Olive for Model Optimizer – Windows**

 - **New Olive Pass:** Olive introduces a new pass, ``NVModelOptQuantization`` (or “nvmo”), specifically designed for model quantization using Model Optimizer – Windows.
-- **Add to Configuration:** To apply quantization to your target model, include this pass in the Olive configuration file. [Refer `phi3 <https://github.com/microsoft/Olive/tree/main/examples/phi3#quantize-models-with-nvidia-Model-Optimizer>`_ Olive example].
+- **Add to Configuration:** To apply quantization to your target model, include this pass in the Olive configuration file. Refer to `this guide <https://github.com/microsoft/Olive/blob/main/docs/source/features/quantization.md#nvidia-tensorrt-model-optimizer-windows>`_ for details about the pass.

 **3. Setup Other Passes in Olive Configuration**

-- **Add Other Passes:** Add additional passes to the Olive configuration file as needed for the desired Olive workflow of your input model. [Refer `phi3 <https://github.com/microsoft/Olive/tree/main/examples/phi3#quantize-models-with-nvidia-Model-Optimizer>`_ Olive example]
+- **Add Other Passes:** Add additional passes to the Olive configuration file as needed for the desired Olive workflow of your input model.

 **4. Install other dependencies**

@@ -62,4 +62,4 @@ Setup Steps for Olive with ModelOpt-Windows
 **Note**:

 #. Currently, the Model Optimizer - Windows only supports Onnx Runtime GenAI based LLM models in the Olive workflow.
-#. To try out different LLMs and EPs in the Olive workflow of ModelOpt-Windows, refer the details provided in `phi3 <https://github.com/microsoft/Olive/tree/main/examples/phi3#quantize-models-with-nvidia-Model-Optimizer>`_ Olive example.
+#. To get started with Olive, refer to the official `Olive documentation <https://microsoft.github.io/Olive/>`_.
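The ``NVModelOptQuantization`` pass described in step 2 is wired in through the ``passes`` section of an Olive configuration file. A minimal, hypothetical fragment follows; only the pass type comes from this page, so consult the linked Olive quantization guide for the authoritative schema and any required fields:

```json
{
  "passes": {
    "quantize": {
      "type": "NVModelOptQuantization"
    }
  }
}
```

Additional passes for the rest of the workflow (step 3) would be added as further entries under ``passes`` in the same file.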
