diff --git a/CHANGELOG-Windows.rst b/CHANGELOG-Windows.rst index cea2aac1d4..279e5e6781 100644 --- a/CHANGELOG-Windows.rst +++ b/CHANGELOG-Windows.rst @@ -1,6 +1,19 @@ NVIDIA Model Optimizer Changelog (Windows) ========================================== +0.41 (TBD) +^^^^^^^^^^ + +**Bug Fixes** + +- Fix ONNX 1.19 compatibility issues with CuPy during ONNX INT4 AWQ quantization. ONNX 1.19 uses ``ml_dtypes.int4`` instead of ``numpy.int8``, which caused CuPy failures. + +**New Features** + +- Add support for ONNX Mixed Precision Weight-only quantization using INT4 and INT8 precisions. Refer to the quantization `example for GenAI LLMs `_. +- Add support for quantization of some diffusion models on Windows. Refer to the `example script `_ for details. +- Add `Perplexity `_ and `KL-Divergence `_ accuracy benchmarks. + 0.33 (2025-07-21) ^^^^^^^^^^^^^^^^^ @@ -25,8 +38,8 @@ NVIDIA Model Optimizer Changelog (Windows) - This is the first official release of Model Optimizer for Windows - **ONNX INT4 Quantization:** :meth:`modelopt.onnx.quantization.quantize_int4 ` now supports ONNX INT4 quantization for DirectML and TensorRT* deployment. See :ref:`Support_Matrix` for details about supported features and models. -- **LLM Quantization with Olive:** Enabled LLM quantization through Olive, streamlining model optimization workflows. Refer `example `_ -- **DirectML Deployment Guide:** Added DML deployment guide. Refer :ref:`DirectML_Deployment`. +- **LLM Quantization with Olive:** Enabled LLM quantization through Olive, streamlining model optimization workflows. Refer to the `Olive example `_. +- **DirectML Deployment Guide:** Added DML deployment guide. Refer to the :ref:`Onnxruntime_Deployment` guide for details. - **MMLU Benchmark for Accuracy Evaluations:** Introduced `MMLU benchmarking `_ for accuracy evaluation of ONNX models on DirectML (DML). - **Published quantized ONNX models collection:** Published quantized ONNX models at HuggingFace `NVIDIA collections `_.
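The int4 dtype mismatch noted in the bug fix above can be illustrated without a model. A minimal sketch, assuming the `ml_dtypes` package (a dependency of ONNX 1.19) is available; the example values are illustrative:

```python
import numpy as np
import ml_dtypes  # provides the int4 dtype that ONNX 1.19 uses for 4-bit tensors

# A 4-bit weight tensor as ONNX 1.19 would expose it (int4 values fit in [-8, 7]).
w_int4 = np.array([1, -2, 7, -8], dtype=ml_dtypes.int4)

# CuPy does not understand ml_dtypes.int4, so cast to a CuPy-supported dtype
# such as numpy.int8 before moving the data to the GPU.
w_int8 = w_int4.astype(np.int8)
print(w_int8.dtype, w_int8.tolist())
```

The cast is lossless here because every int4 value is representable in int8.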
diff --git a/docs/source/deployment/2_directml.rst b/docs/source/deployment/2_onnxruntime.rst similarity index 66% rename from docs/source/deployment/2_directml.rst rename to docs/source/deployment/2_onnxruntime.rst index 90a4a31a9d..266190c084 100644 --- a/docs/source/deployment/2_directml.rst +++ b/docs/source/deployment/2_onnxruntime.rst @@ -1,11 +1,19 @@ -.. _DirectML_Deployment: +.. _Onnxruntime_Deployment: -=================== -DirectML -=================== +=========== +Onnxruntime +=========== +Once an ONNX FP16 model is quantized using Model Optimizer on Windows, the resulting quantized ONNX model can be deployed via the `ONNX Runtime GenAI `_ or `ONNX Runtime `_. -Once an ONNX FP16 model is quantized using Model Optimizer on Windows, the resulting quantized ONNX model can be deployed on the DirectML (DML) backend via the `ONNX Runtime GenAI `_ or `ONNX Runtime `_. +ONNX Runtime uses execution providers (EPs) to run models efficiently across a range of backends, including: + +- **CUDA EP:** Utilizes NVIDIA GPUs for fast inference with CUDA and cuDNN libraries. +- **DirectML EP:** Enables deployment on a wide range of GPUs. +- **TensorRT-RTX EP:** Targets NVIDIA RTX GPUs, leveraging TensorRT for further optimized inference. +- **CPU EP:** Provides a fallback to run inference on the system's CPU when specialized hardware is unavailable. + +Choose the EP that best matches your model, hardware and deployment requirements. .. note:: Currently, DirectML backend doesn't support 8-bit precision. So, 8-bit quantized models should be deployed on other backends like ORT-CUDA etc. However, DML path does support INT4 quantized models. @@ -21,6 +29,10 @@ ONNX Runtime GenAI offers a streamlined solution for deploying generative AI mod - **Control Options**: Use the high-level ``generate()`` method for rapid deployment or execute each iteration of the model in a loop for fine-grained control. 
- **Multi-Language API Support**: Provides APIs for Python, C#, and C/C++, allowing seamless integration across a range of applications. +.. note:: + + ONNX Runtime GenAI models are typically tied to the execution provider (EP) they were built with; a model exported for one EP (e.g., CUDA or DirectML) is generally not compatible with other EPs. To run inference on a different backend, re-export or convert the model specifically for that target EP. + **Getting Started**: Refer to the `ONNX Runtime GenAI documentation `_ for an in-depth guide on installation, setup, and usage. @@ -42,4 +54,4 @@ For further details and examples, please refer to the `ONNX Runtime documentatio Collection of optimized ONNX models =================================== -The ready-to-deploy optimized ONNX models from ModelOpt-Windows are available at HuggingFace `NVIDIA collections `_. These models can be deployed using DirectML backend. Follow the instructions provided along with the published models for deployment. +The ready-to-deploy optimized ONNX models from ModelOpt-Windows are available at HuggingFace `NVIDIA collections `_. Follow the instructions provided along with the published models for deployment. diff --git a/docs/source/getting_started/1_overview.rst b/docs/source/getting_started/1_overview.rst index 698819ba5e..795e8281ff 100644 --- a/docs/source/getting_started/1_overview.rst +++ b/docs/source/getting_started/1_overview.rst @@ -11,7 +11,7 @@ is a library comprising state-of-the-art model optimization techniques including It accepts a torch or ONNX model as input and provides Python APIs for users to easily stack different model optimization techniques to produce optimized & quantized checkpoints. Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized checkpoint generated from Model Optimizer is ready for deployment in downstream inference frameworks like `TensorRT-LLM `_ or `TensorRT `_ (Linux). 
ModelOpt is integrated with `NVIDIA NeMo `_ and `Megatron-LM `_ for training-in-the-loop optimization techniques. For enterprise users, the 8-bit quantization with Stable Diffusion is also available on `NVIDIA NIM `_. -For Windows users, the `Model Optimizer for Windows `_ (ModelOpt-Windows) delivers model compression techniques, including quantization, on Windows RTX PC systems. ModelOpt-Windows is optimized for efficient quantization, featuring local GPU calibration, reduced system and video memory consumption, and swift processing times. It integrates seamlessly with the Windows ecosystem, with optimized ONNX models as output for `Microsoft DirectML `_ backends. Furthermore, ModelOpt-Windows supports SDKs such as `Microsoft Olive `_ and `ONNX Runtime `_, enabling the deployment of quantized models across various independent hardware vendors through the DirectML path. +For Windows users, the `Model Optimizer for Windows `_ (ModelOpt-Windows) delivers model compression techniques, including quantization, on Windows RTX PC systems. ModelOpt-Windows is optimized for efficient quantization, featuring local GPU calibration, reduced system and video memory consumption, and swift processing times. It integrates seamlessly with the Windows ecosystem, with optimized ONNX models as output for `Microsoft DirectML `_ and `TensorRT-RTX `_ backends. Furthermore, ModelOpt-Windows supports SDKs such as `Microsoft Olive `_ and `ONNX Runtime `_, enabling the deployment of quantized models across various independent hardware vendors through the DirectML path. Model Optimizer for both Linux and Windows are available for free for all developers on `NVIDIA PyPI `_. Visit the `Model Optimizer GitHub repository `_ for end-to-end example scripts and recipes optimized for NVIDIA GPUs. 
diff --git a/docs/source/getting_started/windows/_installation_for_Windows.rst b/docs/source/getting_started/windows/_installation_for_Windows.rst index a386fd30f7..f68ee90b5d 100644 --- a/docs/source/getting_started/windows/_installation_for_Windows.rst +++ b/docs/source/getting_started/windows/_installation_for_Windows.rst @@ -25,7 +25,7 @@ The following system requirements are necessary to install and use Model Optimiz +-------------------------+-----------------------------+ .. note:: - - Make sure to use GPU-compatible driver and other dependencies (e.g. torch etc.). For instance, support for Blackwell GPU might be present in Nvidia 570+ driver, and CUDA-12.8. + - Make sure to use a GPU-compatible driver and other dependencies (e.g., torch). For instance, support for Blackwell GPUs may require NVIDIA driver 570+ and CUDA 12.8+. - We currently support *Single-GPU* configuration. The Model Optimizer - Windows can be used in following ways: diff --git a/docs/source/getting_started/windows/_installation_standalone.rst b/docs/source/getting_started/windows/_installation_standalone.rst index 47f36050c2..500b480e12 100644 --- a/docs/source/getting_started/windows/_installation_standalone.rst +++ b/docs/source/getting_started/windows/_installation_standalone.rst @@ -13,6 +13,7 @@ Before using ModelOpt-Windows, the following components must be installed: - NVIDIA GPU and Graphics Driver - Python version >= 3.10 and < 3.13 - Visual Studio 2022 / MSVC / C/C++ Build Tools + - CUDA Toolkit and cuDNN, required when using the CUDA path during calibration (e.g., calibrating ONNX models with `onnxruntime-gpu` / the CUDA EP) Update ``PATH`` environment variable as needed for above prerequisites. @@ -26,45 +27,38 @@ It is recommended to use a virtual environment for managing Python dependencies.
$ python -m venv .\myEnv $ .\myEnv\Scripts\activate -In the newly created virtual environment, none of the required packages (e.g., onnx, onnxruntime, onnxruntime-directml, onnxruntime-gpu, nvidia-modelopt) will be pre-installed. +In the newly created virtual environment, none of the required packages (such as onnx, onnxruntime, onnxruntime-directml, onnxruntime-gpu, or nvidia-modelopt) will be pre-installed. **3. Install ModelOpt-Windows Wheel** -To install the ModelOpt-Windows wheel, run the following command: +To install the ONNX module of ModelOpt-Windows, run the following command: .. code-block:: bash pip install "nvidia-modelopt[onnx]" -This command installs ModelOpt-Windows and its ONNX module, along with the *onnxruntime-directml* (v1.20.0) package. If ModelOpt-Windows is installed without the additional parameter, only the bare minimum dependencies will be installed, without the relevant module and dependencies. +If you install ModelOpt-Windows without the extra ``[onnx]`` option, only the minimal core dependencies and the PyTorch module (``torch``) will be installed. Support for ONNX model quantization requires installing with ``[onnx]``. -**4. Setup ONNX Runtime (ORT) for Calibration** +**4. ONNX Model Quantization: Setup ONNX Runtime Execution Provider for Calibration** -The ONNX Post-Training Quantization (PTQ) process involves running the base model with user-supplied inputs, a process called calibration. The user-supplied model inputs are referred to as calibration data. To perform calibration, the base model must be run using a suitable ONNX Execution Provider (EP), such as *DmlExecutionProvider* (DirectML EP) or *CUDAExecutionProvider* (CUDA EP). There are different ONNX Runtime packages for each EP: +The Post-Training Quantization (PTQ) process for ONNX models usually involves running the base model with user-supplied inputs, a process called calibration. The user-supplied model inputs are referred to as calibration data.
To perform calibration, the base model must be run using a suitable ONNX Execution Provider (EP), such as *DmlExecutionProvider* (DirectML EP) or *CUDAExecutionProvider* (CUDA EP). There are different ONNX Runtime packages for each EP: - *onnxruntime-directml* provides the DirectML EP. +- *onnxruntime-trt-rtx* provides the TensorRT-RTX EP. - *onnxruntime-gpu* provides the CUDA EP. - *onnxruntime* provides the CPU EP. -By default, ModelOpt-Windows installs *onnxruntime-directml* and uses the DirectML EP (v1.20.0) for calibration. No additional dependencies are required. If you prefer to use the CUDA EP for calibration, uninstall the existing *onnxruntime-directml* package and install the *onnxruntime-gpu* package, which requires CUDA and cuDNN dependencies: - -- Uninstall *onnxruntime-directml*: - - .. code-block:: bash - - pip uninstall onnxruntime-directml +By default, ModelOpt-Windows installs *onnxruntime-gpu*. Since v1.19.0, *onnxruntime-gpu* targets CUDA 12.x by default. The *onnxruntime-gpu* package (i.e., the CUDA EP) has CUDA and cuDNN dependencies: - Install CUDA and cuDNN: - For the ONNX Runtime GPU package, you need to install the appropriate version of CUDA and cuDNN. Refer to the `CUDA Execution Provider requirements `_ for compatible versions of CUDA and cuDNN. -- Install ONNX Runtime GPU (CUDA 12.x): +If you need another EP for calibration, uninstall the existing *onnxruntime-gpu* package and install the corresponding package. For example, to use the DirectML EP: .. code-block:: bash - pip install onnxruntime-gpu - - - The default CUDA version for *onnxruntime-gpu* since v1.19.0 is 12.x. + pip uninstall onnxruntime-gpu + pip install onnxruntime-directml **5.
Setup GPU Acceleration Tool for Quantization** @@ -75,8 +69,9 @@ By default, ModelOpt-Windows utilizes the `cupy-cuda12x `_ t Ensure the following steps are verified: - **Task Manager**: Check that the GPU appears in the Task Manager, indicating that the graphics driver is installed and functioning. - **Python Interpreter**: Open the command line and type python. The Python interpreter should start, displaying the Python version. - - **Onnxruntime Package**: Ensure that one of the following is installed: + - **Onnxruntime Package**: Ensure that exactly one of the following is installed: - *onnxruntime-directml* (DirectML EP) + - *onnxruntime-trt-rtx* (TensorRT-RTX EP) - *onnxruntime-gpu* (CUDA EP) - *onnxruntime* (CPU EP) - **Onnx and Onnxruntime Import**: Ensure that following python command runs successfully. diff --git a/docs/source/getting_started/windows/_installation_with_olive.rst b/docs/source/getting_started/windows/_installation_with_olive.rst index a05155278f..544ecd2df8 100644 --- a/docs/source/getting_started/windows/_installation_with_olive.rst +++ b/docs/source/getting_started/windows/_installation_with_olive.rst @@ -4,7 +4,7 @@ Install ModelOpt-Windows with Olive =================================== -ModelOpt-Windows can be installed and used through Olive to quantize Large Language Models (LLMs) in ONNX format for deployment with DirectML. Follow the steps below to configure Olive for use with ModelOpt-Windows. +ModelOpt-Windows can be installed and used through Olive to perform model optimization using quantization techniques. Follow the steps below to configure Olive for use with ModelOpt-Windows. Setup Steps for Olive with ModelOpt-Windows ------------------------------------------- @@ -17,7 +17,7 @@ Setup Steps for Olive with ModelOpt-Windows pip install olive-ai[nvmo] - - **Install Prerequisites:** Ensure all required dependencies are installed.
Use the following commands to install the necessary packages: + - **Install Prerequisites:** Ensure all required dependencies are installed. For example, to use the DirectML Execution-Provider (EP) based onnxruntime and onnxruntime-genai packages, run the following commands: .. code-block:: shell @@ -31,11 +31,11 @@ Setup Steps for Olive with ModelOpt-Windows **2. Configure Olive for Model Optimizer – Windows** - **New Olive Pass:** Olive introduces a new pass, ``NVModelOptQuantization`` (or “nvmo”), specifically designed for model quantization using Model Optimizer – Windows. - - **Add to Configuration:** To apply quantization to your target model, include this pass in the Olive configuration file. [Refer `phi3 `_ Olive example]. + - **Add to Configuration:** To apply quantization to your target model, include this pass in the Olive configuration file. [Refer to `this `_ guide for details about this pass.] **3. Setup Other Passes in Olive Configuration** - - **Add Other Passes:** Add additional passes to the Olive configuration file as needed for the desired Olive workflow of your input model. [Refer `phi3 `_ Olive example] + - **Add Other Passes:** Add additional passes to the Olive configuration file as needed for the desired Olive workflow of your input model. **4. Install other dependencies** @@ -62,4 +62,4 @@ Setup Steps for Olive with ModelOpt-Windows **Note**: #. Currently, the Model Optimizer - Windows only supports Onnx Runtime GenAI based LLM models in the Olive workflow. -#. To try out different LLMs and EPs in the Olive workflow of ModelOpt-Windows, refer the details provided in `phi3 `_ Olive example. +#. To get started with Olive, refer to the official `Olive documentation `_.
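The installation checks in these guides ask that exactly one ONNX Runtime package variant be present. A quick standard-library sketch to see which variants are installed (distribution names are taken from the docs above):

```python
from importlib import metadata

# ONNX Runtime package variants named in the installation guide;
# each ships a different default Execution Provider.
ORT_VARIANTS = (
    "onnxruntime-directml",  # DirectML EP
    "onnxruntime-trt-rtx",   # TensorRT-RTX EP
    "onnxruntime-gpu",       # CUDA EP
    "onnxruntime",           # CPU EP
)

def installed_ort_variants():
    """Return which ONNX Runtime distributions are present in this environment."""
    found = []
    for name in ORT_VARIANTS:
        try:
            metadata.version(name)
            found.append(name)
        except metadata.PackageNotFoundError:
            pass
    return found

# Ideally this prints exactly one entry, matching the EP you calibrate with.
print(installed_ort_variants())
```

If more than one variant appears, uninstall the extras before calibration, as the guide recommends.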
diff --git a/docs/source/guides/0_support_matrix.rst b/docs/source/guides/0_support_matrix.rst index 69e860e4b7..0e5ddeea5b 100644 --- a/docs/source/guides/0_support_matrix.rst +++ b/docs/source/guides/0_support_matrix.rst @@ -63,7 +63,7 @@ Feature Support Matrix * Uses AWQ Algorithm * GPUs: Ampere and Later - PyTorch*, ONNX - - ORT-DirectML, TensorRT*, TensorRT-LLM* + - ORT-DML, ORT-CUDA, ORT-TRT-RTX, TensorRT*, TensorRT-LLM* * - W4A8 (INT4 Weights, FP8 Activations) - * Block-wise INT4 Weights, Per-Tensor FP8 Activations * Uses AWQ Algorithm @@ -84,7 +84,9 @@ Feature Support Matrix - PyTorch*, ONNX - TensorRT*, TensorRT-LLM*, ORT-CUDA -.. note:: Features marked with an asterisk (*) are considered experimental. +.. note:: + - Features marked with an asterisk (*) are considered experimental. + - ``ORT-CUDA``, ``ORT-DML``, and ``ORT-TRT-RTX`` are ONNX Runtime Execution Providers (EPs) for CUDA, DirectML, and TensorRT-RTX respectively. Support for different deployment backends can vary across models. Model Support Matrix @@ -96,87 +98,4 @@ Model Support Matrix .. tab:: Windows - .. 
list-table:: - :header-rows: 1 - - * - Model - - ONNX INT4 AWQ (W4A16) - - ONNX INT8 Max (W8A8) - - ONNX FP8 Max (W8A8) - * - Llama3.1-8B-Instruct - - Yes - - No - - No - * - Phi3.5-mini-Instruct - - Yes - - No - - No - * - Mistral-7B-Instruct-v0.3 - - Yes - - No - - No - * - Llama3.2-3B-Instruct - - Yes - - No - - No - * - Gemma-2b-it - - Yes - - No - - No - * - Gemma-2-2b - - Yes - - No - - No - * - Gemma-2-9b - - Yes - - No - - No - * - Nemotron Mini 4B Instruct - - Yes - - No - - No - * - Qwen2.5-7B-Instruct - - Yes - - No - - No - * - DeepSeek-R1-Distill-Llama-8B - - Yes - - No - - No - * - DeepSeek-R1-Distil-Qwen-1.5B - - Yes - - No - - No - * - DeepSeek-R1-Distil-Qwen-7B - - Yes - - No - - No - * - DeepSeek-R1-Distill-Qwen-14B - - Yes - - No - - No - * - Mistral-NeMo-Minitron-2B-128k-Instruct - - Yes - - No - - No - * - Mistral-NeMo-Minitron-4B-128k-Instruct - - Yes - - No - - No - * - Mistral-NeMo-Minitron-8B-128k-Instruct - - Yes - - No - - No - * - whisper-large - - No - - Yes - - Yes - * - sam2-hiera-large - - No - - Yes - - Yes - - .. note:: - - ``ONNX INT8 Max`` means INT8 (W8A8) quantization of ONNX model using Max calibration. Similar holds true for the term ``ONNX FP8 Max``. - - The LLMs in above table are `GenAI `_ built LLMs unless specified otherwise. - - Check `examples `_ for specific instructions and scripts. + Please checkout the model support matrix `details `_. diff --git a/docs/source/guides/windows_guides/_ONNX_PTQ_guide.rst b/docs/source/guides/windows_guides/_ONNX_PTQ_guide.rst index 9e60611c07..b79415178c 100644 --- a/docs/source/guides/windows_guides/_ONNX_PTQ_guide.rst +++ b/docs/source/guides/windows_guides/_ONNX_PTQ_guide.rst @@ -155,4 +155,4 @@ To save a quantized ONNX model with external data, use the following code: Deploy Quantized ONNX Model --------------------------- -Inference of the quantized models can be done using tools like `GenAI `_, `OnnxRunTime (ORT) `_. These APIs can do inference on backends like DML. 
For details about DirectML deployment of quantized models, see :ref:`DirectML_Deployment`. Also, refer `example scripts `_ for any possible model-specific inference guidance or script (if any). +Inference of the quantized models can be done using tools like `GenAI `_ and `ONNX Runtime (ORT) `_. These APIs can run inference on backends like DML, CUDA, and TensorRT-RTX. For details about ONNX Runtime deployment of quantized models, see the :ref:`Onnxruntime_Deployment` guide. Also, refer to the `example scripts `_ for any model-specific inference guidance or scripts. diff --git a/docs/source/support/2_faqs.rst b/docs/source/support/2_faqs.rst index 02970a1f91..3d0afaa3c8 100644 --- a/docs/source/support/2_faqs.rst +++ b/docs/source/support/2_faqs.rst @@ -15,7 +15,7 @@ ModelOpt-Windows Awq-scale search should complete in minutes with NVIDIA GPU acceleration. If stalled: -- **GPU acceleration may be disabled.** If CUDA 12.x is not available, quantization will fall back to slower ``numpy`` implementation instead of ``cupy-cuda12x``. +- **GPU acceleration may be disabled.** If CUDA 12.x is not available, quantization will fall back to the slower ``numpy`` implementation instead of ``cupy-cuda12x``. Make sure that the ``cupy`` package is compatible with the installed CUDA toolkit. - **Low GPU memory.** Quantization needs 20-24GB VRAM; low memory forces slower shared memory usage. - **Using CPU for quantization.** Install ORT-GPU (supports CUDA EP) or ORT-DML (supports DML EP) for better speed. @@ -45,21 +45,21 @@ Make sure that the output directory is clean before each quantization run otherw `Error Unrecognized attribute: block_size for operator DequantizeLinear` -ModelOpt-Windows uses ONNX's `DequantizeLinear `_ (DQ) nodes. The int4 data-type support in DeQuantizeLinear node came in opset-21. And, *block_size* attribute was added in DeQuantizeLinear node in Opset-21. Make sure that quantized model's opset version is 21 or higher. Refer :ref:`Apply_ONNX_PTQ` for details.
+ModelOpt-Windows uses ONNX's `DequantizeLinear `_ (DQ) nodes. The int4 data-type support in the DequantizeLinear node came in opset-21, along with the *block_size* attribute. Make sure that the quantized model's opset version is 21 or higher. Refer to :ref:`Apply_ONNX_PTQ` for details. 6. Running INT4 quantized ONNX model on DirectML backend gives following kind of error. What can be the issue? -------------------------------------------------------------------------------------------------------------- `Error: Type 'tensor(int4)' of input parameter (onnx::MatMul_6508_i4) of operator (DequantizeLinear) in node (onnx::MatMul_6508_DequantizeLinear) is invalid.` -One possible reason for above error is that INT4 quantized ONNX model's opset version (default or onnx domain) is less than 21. Ensure the INT4 quantized model's opset version is 21 or higher since INT4 data-type support in DeQuantizeLinear ONNX node came in opset-21. +One possible reason for the above error is that the INT4-quantized ONNX model's opset version (default or onnx domain) is less than 21. Ensure the INT4-quantized model's opset version is 21 or higher, since INT4 data-type support in the DequantizeLinear ONNX node came in opset-21. 7. Running 8-bit quantized ONNX model with ORT-DML gives onnxruntime error about using 8-bit data-type (e.g. INT8/FP8). What can be the issue? ----------------------------------------------------------------------------------------------------------------------------------------------- Currently, DirectML backend (ORT-DML) doesn't support 8-bit precision. So, it expectedly complains about 8-bit data-type. Try using ORT-CUDA or other 8-bit compatible backend. -8. How to resolve onnxruntime error about invalid use of FP8 type in QuantizeLinear / DeQuantizeLinear node? +8. How to resolve onnxruntime error about invalid use of FP8 type in QuantizeLinear / DequantizeLinear node?
------------------------------------------------------------------------------------------------------------- The FP8 type support in QuantizeLinear / DeQuantizeLinear node came with Opset-19. So, ensure that opset of ONNX model is 19+. diff --git a/examples/windows/README.md b/examples/windows/README.md index 2c9b212389..30e6cad1d3 100644 --- a/examples/windows/README.md +++ b/examples/windows/README.md @@ -117,7 +117,7 @@ onnx.save_model( ) ``` -For detailed instructions about deployment of quantized models with DirectML backend (ORT-DML), see the [DirectML](https://nvidia.github.io/Model-Optimizer/deployment/2_directml.html#directml-deployment). +For detailed instructions about deployment of quantized models with ONNX Runtime, see the [ONNX Runtime Deployment Guide](https://nvidia.github.io/Model-Optimizer/deployment/2_onnxruntime.html). > [!Note] > The ready-to-deploy optimized ONNX models from ModelOpt-Windows are available at HuggingFace [NVIDIA collections](https://huggingface.co/collections/nvidia/optimized-onnx-models-for-nvidia-rtx-gpus). @@ -134,7 +134,12 @@ For detailed instructions about deployment of quantized models with DirectML bac ## Support Matrix -Please refer to [support matrix](https://nvidia.github.io/Model-Optimizer/guides/0_support_matrix.html) for a full list of supported features and models. 
+| Model Type | Support Matrix | +|------------|----------------| +| Large Language Models (LLMs) | [View Support Matrix](./onnx_ptq/genai_llm/README.md#support-matrix) | +| Automatic Speech Recognition | [View Support Matrix](./onnx_ptq/whisper/README.md#support-matrix) | +| Segmentation Models | [View Support Matrix](./onnx_ptq/sam2/README.md#support-matrix) | +| Diffusion Models | [View Support Matrix](./torch_onnx/diffusers/README.md#support-matrix) | ## Benchmark Results diff --git a/examples/windows/onnx_ptq/genai_llm/README.md b/examples/windows/onnx_ptq/genai_llm/README.md index 1a46a687e4..9127bc752c 100644 --- a/examples/windows/onnx_ptq/genai_llm/README.md +++ b/examples/windows/onnx_ptq/genai_llm/README.md @@ -1,12 +1,23 @@ +## Table of Contents + +- [Overview](#overview) +- [Setup](#setup) +- [Prepare ORT-GenAI Compatible Base Model](#prepare-ort-genai-compatible-base-model) +- [Quantization](#quantization) +- [Evaluate the Quantized Model](#evaluate-the-quantized-model) +- [Deployment](#deployment) +- [Support Matrix](#support-matrix) +- [Troubleshoot](#troubleshoot) + ## Overview -The example script showcases how to utilize the **ModelOpt-Windows** toolkit for optimizing ONNX (Open Neural Network Exchange) models through quantization. This toolkit is designed for developers looking to enhance model performance, reduce size, and accelerate inference times, while preserving the accuracy of neural networks deployed with backends like DirectML on local RTX GPUs running Windows. +The example script showcases how to utilize the **ModelOpt-Windows** toolkit for optimizing ONNX (Open Neural Network Exchange) models through quantization. This toolkit is designed for developers looking to enhance model performance, reduce size, and accelerate inference times, while preserving the accuracy of neural networks deployed with backends like TensorRT-RTX, DirectML, and CUDA on local RTX GPUs running Windows.
Quantization is a technique that converts models from floating-point to lower-precision formats, such as integers, which are more computationally efficient. This process can significantly speed up execution on supported hardware, while also reducing memory and bandwidth requirements. This example takes an ONNX model as input, along with the necessary quantization settings, and generates a quantized ONNX model as output. This script can be used for quantizing popular, [ONNX Runtime GenAI](https://onnxruntime.ai/docs/genai) built Large Language Models (LLMs) in the ONNX format. -### Setup +## Setup 1. Install ModelOpt-Windows. Refer [installation instructions](../../README.md). @@ -16,15 +27,15 @@ This example takes an ONNX model as input, along with the necessary quantization pip install -r requirements.txt ``` -### Prepare ORT-GenAI Compatible Base Model +## Prepare ORT-GenAI Compatible Base Model -You may generate the base model using the model builder that comes with onnxruntime-genai. The ORT-GenAI's [model-builder](https://github.com/microsoft/onnxruntime-genai/tree/main/src/python/py/models) downloads the original Pytorch model from Hugging Face, and produces an ONNX GenAI compatible base model in ONNX format. See example command-line below: +You may generate the base model using the model builder that comes with onnxruntime-genai. ORT-GenAI's [model-builder](https://github.com/microsoft/onnxruntime-genai/tree/main/src/python/py/models) downloads the original PyTorch model from Hugging Face and produces an ONNX GenAI-compatible base model in ONNX format.
See the example command line below: ```bash python -m onnxruntime_genai.models.builder -m meta-llama/Meta-Llama-3-8B -p fp16 -e dml -o E:\llama3-8b-fp16-dml-genai ``` -### Quantization +## Quantization To begin quantization, run the script like below: @@ -35,13 +46,13 @@ python quantize.py --model_name=meta-llama/Meta-Llama-3-8B \ --calib_size=32 --algo=awq_lite --dataset=cnn ``` -#### Command Line Arguments +### Command Line Arguments The table below lists key command-line arguments of the ONNX PTQ example script. | **Argument** | **Supported Values** | **Description** | |---------------------------|------------------------------------------------------|-------------------------------------------------------------| -| `--calib_size` | 32 (default), 64, 128 | Specifies the calibration size. | +| `--calib_size` | 32, 64, 128 (default) | Specifies the calibration size. | | `--dataset` | cnn (default), pilevel | Choose calibration dataset: cnn_dailymail or pile-val. | | `--algo` | awq_lite (default), awq_clip, rtn, rtn_dq | Select the quantization algorithm. | | `--onnx_path` | input .onnx file path | Path to the input ONNX model. | @@ -54,10 +65,10 @@ The table below lists key command-line arguments of the ONNX PTQ example script.
| `--awqclip_alpha_step` | 0.05 (default) | Step-size for AWQ weight clipping, user-defined | | `--awqclip_alpha_min` | 0.5 (default) | Minimum AWQ weight-clipping threshold, user-defined | | `--awqclip_bsz_col` | 1024 (default) | Chunk size in columns during weight clipping, user-defined | -| `--calibration_eps` | dml, cuda, cpu, NvTensorRtRtx (default: [dml,cpu]) | List of execution-providers to use for session run during calibration | -| `--no_position_ids` | Default: position_ids input enabled | Use this option to disable position_ids input in calibration data| -| `--enable_mixed_quant` | Default: disabled mixed quant | Use this option to enable mixed precsion quantization| -| `--layers_8bit` | Default: None | Use this option to Overrides default mixed quant strategy| +| `--calibration_eps` | dml, cuda, cpu, NvTensorRtRtx (default: [cuda,cpu]) | List of execution-providers to use for session run during calibration | +| `--add_position_ids` | Default: position_ids input is disabled | Use this option to enable position_ids input in calibration data| +| `--enable_mixed_quant` | Default: mixed-quant is disabled | Use this option to enable mixed precision quantization| +| `--layers_8bit` | Default: None | Use this option to override the default mixed-quant strategy| | `--gather_quantize_axis` | Default: None | Use this option to enable INT4 quantization of Gather nodes - choose 0 or 1| | `--gather_block_size` | Default: 32 | Block-size for Gather node's INT4 quantization (when its enabled using gather_quantize_axis option)| @@ -69,24 +80,24 @@ python quantize.py --help Note: -1. For the `algo` argument, we have following options to choose form: awq_lite, awq_clip, rtn, rtn_dq. +1. For the `algo` argument, we have the following options to choose from: awq_lite, awq_clip, rtn, rtn_dq. - The 'awq_lite' option does core AWQ scale search and INT4 quantization. - The 'awq_clip' option primarily does weight clipping and INT4 quantization.
- The 'rtn' option does INT4 RTN quantization with Q->DQ nodes for weights. - The 'rtn_dq' option does INT4 RTN quantization with only DQ nodes for weights. 1. RTN algorithm doesn't use calibration-data. -1. If needed for the input base model, use `--no_position_ids` command-line option to disable +1. If needed for the input base model, use the `--add_position_ids` command-line option to enable generating position_ids calibration input. The GenAI built LLM models produced with DML EP has position_ids input but ones produced with CUDA EP, NvTensorRtRtx EP don't have position_ids input. Use `--help` or command-line options table above to inspect default values. Please refer to `quantize.py` for further details on command-line parameters. -#### Mixed Precision Quantization (INT4 + INT8) +### Mixed Precision Quantization (INT4 + INT8) ModelOpt-Windows supports **mixed precision quantization**, where different layers in the model can be quantized to different bit-widths. This approach combines INT4 quantization for most layers (for maximum compression and speed) with INT8 quantization for important or sensitive layers (to preserve accuracy). -##### Why Use Mixed Precision? +#### Why Use Mixed Precision? Mixed precision quantization provides an optimal balance between: @@ -107,7 +118,7 @@ Based on benchmark results, mixed precision quantization shows significant advan As shown above, mixed precision significantly improves accuracy with minimal disk size increase (~85-109 MB). -##### How Mixed Precision Works +#### How Mixed Precision Works The quantization strategy selects which layers to quantize to INT8 vs INT4: @@ -117,9 +128,9 @@ The quantization strategy selects which layers to quantize to INT8 vs INT4: This strategy preserves accuracy for the most sensitive layers while maintaining aggressive compression elsewhere.
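The INT8-vs-INT4 selection described above can be pictured as a per-layer precision plan keyed on layer names. A hypothetical sketch — the layer names and patterns below are illustrative only, not ModelOpt's actual strategy:

```python
# Hypothetical: patterns marking accuracy-sensitive layers that should stay INT8;
# everything else gets INT4 for maximum compression.
SENSITIVE_PATTERNS = ("lm_head", "down_proj")  # illustrative choices only

def plan_precision(layer_names, sensitive=SENSITIVE_PATTERNS):
    """Map each layer name to a bit-width: 8 for sensitive layers, 4 otherwise."""
    return {
        name: 8 if any(p in name for p in sensitive) else 4
        for name in layer_names
    }

plan = plan_precision([
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.mlp.down_proj",
    "lm_head",
])
print(plan)
```

A custom plan of this shape corresponds to what the `--layers_8bit` option lets you express by hand.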
-##### Using Mixed Precision Quantization +#### Using Mixed Precision Quantization -###### Method 1: Use the default mixed precision strategy +##### Method 1: Use the default mixed precision strategy ```bash python quantize.py --model_name=meta-llama/Meta-Llama-3.2-1B \ @@ -132,7 +143,7 @@ python quantize.py --model_name=meta-llama/Meta-Llama-3.2-1B \ The `--enable_mixed_quant` flag automatically applies the default strategy. -###### Method 2: Specify custom layers for INT8 +##### Method 2: Specify custom layers for INT8 ```bash python quantize.py --model_name=meta-llama/Meta-Llama-3.2-1B \ @@ -158,24 +169,45 @@ The `--layers_8bit` option allows you to manually specify which layers to quanti For more benchmark results and detailed accuracy metrics, refer to the [Benchmark Guide](../../Benchmark.md). -### Evaluate the Quantized Model +## Evaluate the Quantized Model To evaluate the quantized model, please refer to the [accuracy benchmarking](../../accuracy_benchmark/README.md) and [onnxruntime-genai performance benchmarking](https://github.com/microsoft/onnxruntime-genai/tree/main/benchmark/python). -### Deployment +## Deployment -Once an ONNX FP16 model is quantized using ModelOpt-Windows, the resulting quantized ONNX model can be deployed on the DirectML backend using [ORT-GenAI](https://onnxruntime.ai/) or [ORT](https://onnxruntime.ai/). +Once an ONNX FP16 model is quantized using ModelOpt-Windows, the resulting quantized ONNX model can be deployed using [ORT-GenAI](https://onnxruntime.ai/) or [ORT](https://onnxruntime.ai/). Refer to the following example scripts and tutorials for deployment: 1. [ORT GenAI examples](https://github.com/microsoft/onnxruntime-genai/tree/main/examples/python) 1. [ONNX Runtime documentation](https://onnxruntime.ai/docs/api/python/) -### Model Support Matrix - -Please refer to [support matrix](https://nvidia.github.io/Model-Optimizer/guides/0_support_matrix.html) for a full list of supported features and models. 
-
-### Troubleshoot
+## Support Matrix
+
+| Model | ONNX INT4 AWQ (W4A16) |
+| :---: | :---: |
+| Llama3.1-8B-Instruct | ✅ |
+| Phi3.5-mini-Instruct | ✅ |
+| Mistral-7B-Instruct-v0.3 | ✅ |
+| Llama3.2-3B-Instruct | ✅ |
+| Gemma-2b-it | ✅ |
+| Gemma-2-2b | ✅ |
+| Gemma-2-9b | ✅ |
+| Nemotron Mini 4B Instruct | ✅ |
+| Qwen2.5-7B-Instruct | ✅ |
+| DeepSeek-R1-Distill-Llama-8B | ✅ |
+| DeepSeek-R1-Distill-Qwen-1.5B | ✅ |
+| DeepSeek-R1-Distill-Qwen-7B | ✅ |
+| DeepSeek-R1-Distill-Qwen-14B | ✅ |
+| Mistral-NeMo-Minitron-2B-128k-Instruct | ✅ |
+| Mistral-NeMo-Minitron-4B-128k-Instruct | ✅ |
+| Mistral-NeMo-Minitron-8B-128k-Instruct | ✅ |
+
+> *All LLMs in the above table are [GenAI](https://github.com/microsoft/onnxruntime-genai/) built LLMs.*
+
+> *`ONNX INT4 AWQ (W4A16)` means INT4 weights and FP16 activations using the AWQ algorithm.*
+
+## Troubleshoot

1. **Configure Directories**

@@ -208,4 +240,4 @@ Please refer to [support matrix](https://nvidia.github.io/Model-Optimizer/guides

1. **Error - Invalid Position-IDs input to the ONNX model**

-   The ONNX models produced using ONNX GenerativeAI (GenAI) have different IO bindings for models produced using different execution-providers (EPs). For instance, model built with DML EP has position-ids input in the ONNX model but models builts using CUDA EP or NvTensorRtRtx EP don't have position-ids inputs. So, if base model requires, use `no_position_ids` command-line argument for disabling position_ids calibration input or set "add_position_ids" variable to `False` value (hard-code) in the quantize script if required.
+   The ONNX models produced using ONNX GenerativeAI (GenAI) have different IO bindings for models produced using different execution-providers (EPs). For instance, a model built with DML EP has a position-ids input in the ONNX model, but models built using CUDA EP or NvTensorRtRtx EP don't have position-ids inputs.
So, if the base model requires it, use the `--add_position_ids` command-line argument to enable the position_ids calibration input, or hard-code the `add_position_ids` variable to `True` in the quantize script.
diff --git a/examples/windows/onnx_ptq/genai_llm/quantize.py b/examples/windows/onnx_ptq/genai_llm/quantize.py
index 57021ed4d6..e59b388fd0 100644
--- a/examples/windows/onnx_ptq/genai_llm/quantize.py
+++ b/examples/windows/onnx_ptq/genai_llm/quantize.py
@@ -230,7 +230,7 @@ def get_calib_inputs(
     )

     if "cnn" in dataset_name:
-        dataset2 = load_dataset("cnn_dailymail", name="3.0.0", split="train").select(
+        dataset2 = load_dataset("abisee/cnn_dailymail", name="3.0.0", split="train").select(
             range(max_calib_rows_to_load)
         )
         column = "article"
@@ -589,10 +589,10 @@ def main(args):
         help="True when --use_gqa was passed during export.",
     )
     parser.add_argument(
-        "--no_position_ids",
+        "--add_position_ids",
         dest="add_position_ids",
-        action="store_false",
-        default=True,
+        action="store_true",
+        default=False,
         help="True when we want to also pass position_ids input to model",
     )
     parser.add_argument(
@@ -605,7 +605,7 @@
     parser.add_argument(
         "--calibration_eps",
         type=parse_calibration_eps,  # Use the custom parser
-        default=["dml", "cpu"],  # Default as a list
+        default=["cuda", "cpu"],  # Default as a list
         help="Comma-separated list of calibration endpoints.
Choose from 'cuda', 'cpu', 'dml', 'NvTensorRtRtx'.",
     )
     parser.add_argument(
diff --git a/examples/windows/onnx_ptq/sam2/README.md b/examples/windows/onnx_ptq/sam2/README.md
index 24eb9a634a..8d9e5fed3e 100644
--- a/examples/windows/onnx_ptq/sam2/README.md
+++ b/examples/windows/onnx_ptq/sam2/README.md
@@ -6,6 +6,7 @@ This repository contains an example to demontrate 8-bit quantization of SAM2 ONN

- [ONNX export and Inference tool](#onnx-export-and-inference-tool)
- [Quantization](#quantization)
+- [Support Matrix](#support-matrix)
- [Validated Settings](#validated-settings)
- [Troubleshoot](#troubleshoot)

@@ -59,6 +60,14 @@ python .\sam2_onnx_quantization.py --onnx_path=E:\base\sam2_hiera_large.encoder.
```

+## Support Matrix
+
+| Model | ONNX INT8 Max (W8A8) | ONNX FP8 Max (W8A8) |
+| :---: | :---: | :---: |
+| sam2-hiera-large | ✅ | ✅ |
+
+> *`ONNX INT8 Max` means INT8 (W8A8) quantization of an ONNX model using Max calibration. The same holds true for `ONNX FP8 Max`.*
+
## Validated Settings

This example is currently validated with following settings:
diff --git a/examples/windows/onnx_ptq/whisper/README.md b/examples/windows/onnx_ptq/whisper/README.md
index 7d82c0dc45..8757aaeb53 100644
--- a/examples/windows/onnx_ptq/whisper/README.md
+++ b/examples/windows/onnx_ptq/whisper/README.md
@@ -7,6 +7,7 @@ This repository contains an example to demontrate 8-bit quantization of Whisper

- [ONNX export](#onnx-export)
- [Inference script](#inference-script)
- [Quantization script](#quantization-script)
+- [Support Matrix](#support-matrix)
- [Validated Settings](#validated-settings)
- [Troubleshoot](#troubleshoot)

@@ -149,6 +150,14 @@ python .\whisper_onnx_quantization.py --model_name=openai/whisper-large --base_m

- In case, ONNX installation unexpectedly throws error, then one can try with other ONNX versions.
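The quantize.py hunk above passes `type=parse_calibration_eps` for `--calibration_eps`, but that helper isn't shown in the diff. A plausible sketch of such a parser (an assumption for illustration, not the actual implementation) that splits and validates the comma-separated EP list:

```python
import argparse

# Hypothetical sketch of the custom argparse `type` callable referenced in
# the quantize.py diff. Validates each execution provider against the set
# the README documents; the exact behavior of the real helper may differ.

def parse_calibration_eps(value):
    """Split a comma-separated EP list and reject unknown entries."""
    valid = {"cuda", "cpu", "dml", "NvTensorRtRtx"}
    eps = [ep.strip() for ep in value.split(",") if ep.strip()]
    bad = [ep for ep in eps if ep not in valid]
    if bad:
        raise argparse.ArgumentTypeError(
            f"Invalid execution provider(s): {bad}. Choose from {sorted(valid)}."
        )
    return eps
```

With `argparse`, a `type` callable like this runs once per occurrence of the option, so `--calibration_eps=cuda,cpu` yields the list `["cuda", "cpu"]` directly in the parsed namespace.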
+## Support Matrix
+
+| Model | ONNX INT8 Max (W8A8) | ONNX FP8 Max (W8A8) |
+| :---: | :---: | :---: |
+| whisper-large | ✅ | ✅ |
+
+> *`ONNX INT8 Max` means INT8 (W8A8) quantization of an ONNX model using Max calibration. The same holds true for `ONNX FP8 Max`.*
+
## Validated Settings

These scripts are currently validated with following settings:
diff --git a/examples/windows/torch_onnx/diffusers/README.md b/examples/windows/torch_onnx/diffusers/README.md
index 962bf4ec0f..856fcb63be 100644
--- a/examples/windows/torch_onnx/diffusers/README.md
+++ b/examples/windows/torch_onnx/diffusers/README.md
@@ -7,7 +7,7 @@ This repository provides relevant steps, script, and guidance for quantization o

- [Installation and Pre-requisites](#installation-and-pre-requisites)
- [Quantization of Backbone](#quantization-of-backbone)
- [Inference using ONNX runtime](#inference-using-onnxruntime)
-- [Quantization Support Matrix](#quantization-support-matrix)
+- [Support Matrix](#support-matrix)
- [Validated Settings](#validated-settings)
- [Troubleshoot](#troubleshoot)

@@ -92,10 +92,10 @@ Place the quantized ONNX model file for the backbone inside its specific subdire

Optimum-ONNX Runtime provides pipelines such as ORTStableDiffusion3Pipeline and ORTFluxPipeline that can be used to run ONNX-exported diffusion models. These pipelines offer a convenient, high-level interface for loading the exported graph and performing inference. For a practical reference, see the stable diffusion inference [example script](https://github.com/microsoft/onnxruntime-inference-examples/tree/main/python/models/stable_difusion) in the ONNX Runtime inference examples repository.
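As a usage sketch for the Optimum pipelines mentioned above: the snippet below wires a pipeline call behind a small helper. The pipeline class name, the model directory, and the availability of `ORTStableDiffusion3Pipeline` in your installed `optimum` version are assumptions, not guarantees.

```python
# Hedged sketch: driving an ONNX-exported diffusion model through an
# Optimum-ONNX Runtime pipeline. The helper takes any callable pipeline,
# so the same code works with ORTStableDiffusion3Pipeline or ORTFluxPipeline.

def generate_image(pipeline, prompt, steps=30):
    """Run one text-to-image generation and return the first image."""
    result = pipeline(prompt, num_inference_steps=steps)
    return result.images[0]

# Typical usage (assumed API, matching the pipelines named in the text above):
#   from optimum.onnxruntime import ORTStableDiffusion3Pipeline
#   pipe = ORTStableDiffusion3Pipeline.from_pretrained("path/to/exported_model_dir")
#   generate_image(pipe, "a photo of a red fox in snow").save("out.png")
```

Keeping the pipeline as a parameter also makes the helper easy to exercise with a stub during testing, independent of any GPU or model files.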
-## Quantization Support Matrix +## Support Matrix | Model | fp8 | nvfp41 | -| :---: | :---: | +| :---: | :---: | :---: | | SD3-Medium-Diffusers | ❌ | ✅ | | SD3.5-Medium | ✅ | ✅ | | Flux.1.Dev2 | ✅ | ✅ | @@ -106,7 +106,7 @@ Optimum-ONNX Runtime provides pipelines such as ORTStableDiffusion3Pipeline and > *The accuracy loss after PTQ may vary depending on the actual model and the quantization method. Different models may have different accuracy loss and usually the accuracy loss is more significant when the base model is small. If the accuracy after PTQ is not meeting the requirement, please try disabling with KV-Cache or MHA quantization, try out different calibration settings (like calibration samples data, samples size, diffusion steps etc.) or perform QAT / QAD (not yet supported / validated on Windows RTX).* -Please refer to [support matrix](https://nvidia.github.io/Model-Optimizer/guides/0_support_matrix.html) for a full list of supported features and models. +> *There are some known performance issues with NVFP4 model execution using TRTRTX EP. Stay tuned for further updates!* ## Validated Settings