|
1 | 1 | # Torch Quantization to ONNX Export |
2 | 2 |
|
3 | | -This example demonstrates how to quantize PyTorch models (vision and LLM) followed by export to ONNX format. The scripts leverage the ModelOpt toolkit for both quantization and ONNX export. |
| 3 | +This example demonstrates how to quantize PyTorch models and then export them to ONNX format. The scripts leverage the ModelOpt toolkit for both quantization and ONNX export.
| 4 | + |
| 5 | +For **vision models**, the `torch_quant_to_onnx.py` script in this directory handles quantization and ONNX export directly. |
| 6 | + |
| 7 | +For **LLMs and VLMs**, use [TensorRT-Edge-LLM](https://github.com/NVIDIA/TensorRT-Edge-LLM), which provides a complete pipeline for quantizing models with ModelOpt and exporting them to optimized ONNX for deployment on edge platforms (Jetson, DRIVE).
4 | 8 |
|
5 | 9 | <div align="center"> |
6 | 10 |
|
7 | 11 | | **Section** | **Description** | **Link** | |
8 | 12 | | :------------: | :------------: | :------------: | |
9 | 13 | | Pre-Requisites | Required packages to use this example | [Link](#pre-requisites) | |
10 | 14 | | Vision Models | Quantize timm models and export to ONNX | [Link](#vision-models) | |
11 | | -| LLM Export | Export LLMs to quantized ONNX | [Link](#llm-export) | |
| 15 | +| LLM Quantization and Export | Quantize and export LLMs/VLMs via TensorRT-Edge-LLM | [Link](#llm-quantization-and-export-with-tensorrt-edge-llm) | |
| 16 | +| Supported Models | LLMs and VLMs supported by TensorRT-Edge-LLM | [Link](#supported-models) |
12 | 17 | | Mixed Precision | Auto mode for optimal per-layer quantization | [Link](#mixed-precision-quantization-auto-mode) | |
13 | | -| Support Matrix | View the ONNX export supported LLM models | [Link](#onnx-export-supported-llm-models) | |
14 | 18 | | Resources | Extra links to relevant resources | [Link](#resources) | |
15 | 19 |
|
16 | 20 | </div> |
@@ -78,67 +82,180 @@ python ../onnx_ptq/evaluate.py \ |
78 | 82 | --model_name=vit_base_patch16_224 |
79 | 83 | ``` |
80 | 84 |
|
81 | | -## LLM Export |
| 85 | +## LLM Quantization and Export with TensorRT-Edge-LLM |
82 | 86 |
|
83 | | -The `llm_export.py` script exports LLM models to ONNX with optional quantization. |
| 87 | +[TensorRT-Edge-LLM](https://github.com/NVIDIA/TensorRT-Edge-LLM) provides a complete pipeline for quantizing LLMs and VLMs using NVIDIA ModelOpt and exporting them to optimized ONNX for deployment on edge platforms such as NVIDIA Jetson and DRIVE. |
84 | 88 |
|
85 | | -### What it does |
| 89 | +### Overview |
86 | 90 |
|
87 | | -- Loads a HuggingFace LLM model (local path or model name). |
88 | | -- Optionally quantizes the model to FP8, INT4_AWQ, or NVFP4. |
89 | | -- Exports the model to ONNX format. |
90 | | -- Post-processes the ONNX graph for TensorRT compatibility. |
| 91 | +The pipeline follows these stages: |
91 | 92 |
|
92 | | -### Usage |
| 93 | +1. **Quantize** (x86 host with GPU) — Reduce model precision using ModelOpt (FP8, INT4 AWQ, NVFP4) |
| 94 | +2. **Export** (x86 host with GPU) — Convert quantized model to ONNX |
| 95 | +3. **Build** (edge device) — Compile ONNX into TensorRT engines |
| 96 | +4. **Inference** (edge device) — Run the compiled engines |
| 97 | + |
| 98 | +### Installation |
93 | 99 |
|
94 | 100 | ```bash |
95 | | -python llm_export.py \ |
96 | | - --hf_model_path=<HuggingFace model name or local path> \ |
97 | | - --dtype=<fp16|fp8|int4_awq|nvfp4> \ |
98 | | - --output_dir=<directory to save ONNX model> |
| 101 | +# Use the PyTorch Docker image (recommended) |
| 102 | +docker pull nvcr.io/nvidia/pytorch:25.12-py3 |
| 103 | +docker run --gpus all -it --rm -v $(pwd):/workspace -w /workspace nvcr.io/nvidia/pytorch:25.12-py3 bash |
| 104 | + |
| 105 | +# Clone and install TensorRT-Edge-LLM |
| 106 | +git clone https://github.com/NVIDIA/TensorRT-Edge-LLM.git |
| 107 | +cd TensorRT-Edge-LLM |
| 108 | +git submodule update --init --recursive |
| 109 | +python3 -m venv venv |
| 110 | +source venv/bin/activate |
| 111 | +pip3 install . |
| 112 | + |
| 113 | +# Verify installation |
| 114 | +tensorrt-edgellm-quantize-llm --help |
| 115 | +tensorrt-edgellm-export-llm --help |
99 | 116 | ``` |
100 | 117 |
|
101 | | -### Examples |
| 118 | +**System requirements:** |
| 119 | + |
| 120 | +- x86-64 Linux (Ubuntu 22.04 or 24.04 recommended) |
| 121 | +- NVIDIA GPU with Compute Capability 8.0+ (Ampere or newer) |
| 122 | +- CUDA 12.x or 13.x, Python 3.10+ |
| 123 | +- GPU VRAM: 16 GB for models up to 3B, 40 GB for models up to 4B, 80 GB for models up to 8B |
| 124 | + |
| 125 | +### CLI Tools |
| 126 | + |
| 127 | +| Tool | Purpose | |
| 128 | +| :--- | :--- | |
| 129 | +| `tensorrt-edgellm-quantize-llm` | Quantize LLMs using ModelOpt (FP8, INT4 AWQ, NVFP4) |
| 130 | +| `tensorrt-edgellm-export-llm` | Export LLM to ONNX with precision-specific optimizations | |
| 131 | +| `tensorrt-edgellm-export-visual` | Export visual encoders for multimodal VLM models | |
| 132 | +| `tensorrt-edgellm-quantize-draft` | Quantize EAGLE draft models for speculative decoding | |
| 133 | +| `tensorrt-edgellm-export-draft` | Export EAGLE draft models to ONNX | |
| 134 | +| `tensorrt-edgellm-insert-lora` | Insert LoRA patterns into existing ONNX models | |
| 135 | +| `tensorrt-edgellm-process-lora` | Process LoRA adapter weights for runtime loading | |
102 | 136 |
|
103 | | -Export Qwen2 to FP16 ONNX: |
| 137 | +### Example: Quantize and Export an LLM |
104 | 138 |
|
105 | 139 | ```bash |
106 | | -python llm_export.py \ |
107 | | - --hf_model_path=Qwen/Qwen2-0.5B-Instruct \ |
108 | | - --dtype=fp16 \ |
109 | | - --output_dir=./qwen2_fp16 |
| 140 | +# Step 1: Quantize with ModelOpt |
| 141 | +tensorrt-edgellm-quantize-llm \ |
| 142 | + --model_dir Qwen/Qwen2.5-3B-Instruct \ |
| 143 | + --quantization fp8 \ |
| 144 | + --output_dir quantized/qwen2.5-3b-fp8 |
| 145 | + |
| 146 | +# Step 2: Export to ONNX |
| 147 | +tensorrt-edgellm-export-llm \ |
| 148 | + --model_dir quantized/qwen2.5-3b-fp8 \ |
| 149 | + --output_dir onnx_models/qwen2.5-3b |
110 | 150 | ``` |
111 | 151 |
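To sanity-check an export, the resulting graph can be inspected with the `onnx` Python package. This is a minimal sketch; the file name `model.onnx` is an assumption, so point it at whatever file the exporter actually wrote into the output directory:

```python
# Sketch: list the inputs/outputs of an exported ONNX graph without loading weights.
# The path is an assumption; use the file produced by tensorrt-edgellm-export-llm.
import onnx

model = onnx.load("onnx_models/qwen2.5-3b/model.onnx", load_external_data=False)

print("Opsets:", [f"{o.domain or 'ai.onnx'}:{o.version}" for o in model.opset_import])
for tensor in list(model.graph.input) + list(model.graph.output):
    dims = [d.dim_param or d.dim_value for d in tensor.type.tensor_type.shape.dim]
    print(tensor.name, dims)
```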
|
112 | | -Export Qwen2 to FP8 ONNX with quantization: |
| 152 | +### Example: Quantize and Export a VLM |
113 | 153 |
|
114 | 154 | ```bash |
115 | | -python llm_export.py \ |
116 | | - --hf_model_path=Qwen/Qwen2-0.5B-Instruct \ |
117 | | - --dtype=fp8 \ |
118 | | - --output_dir=./qwen2_fp8 |
| 155 | +# Quantize the language model component |
| 156 | +tensorrt-edgellm-quantize-llm \ |
| 157 | + --model_dir Qwen/Qwen2.5-VL-3B-Instruct \ |
| 158 | + --quantization fp8 \ |
| 159 | + --output_dir quantized/qwen2.5-vl-3b |
| 160 | + |
| 161 | +# Export the language model |
| 162 | +tensorrt-edgellm-export-llm \ |
| 163 | + --model_dir quantized/qwen2.5-vl-3b \ |
| 164 | + --output_dir onnx_models/qwen2.5-vl-3b/llm |
| 165 | + |
| 166 | +# Export the visual encoder |
| 167 | +tensorrt-edgellm-export-visual \ |
| 168 | + --model_dir Qwen/Qwen2.5-VL-3B-Instruct \ |
| 169 | + --output_dir onnx_models/qwen2.5-vl-3b/visual |
119 | 170 | ``` |
120 | 171 |
|
121 | | -Export to NVFP4 with custom calibration: |
| 172 | +### Example: EAGLE Speculative Decoding |
122 | 173 |
|
123 | 174 | ```bash |
124 | | -python llm_export.py \ |
125 | | - --hf_model_path=Qwen/Qwen3-0.6B \ |
126 | | - --dtype=nvfp4 \ |
127 | | - --calib_size=512 \ |
128 | | - --output_dir=./qwen3_nvfp4 |
| 175 | +# Quantize base model |
| 176 | +tensorrt-edgellm-quantize-llm \ |
| 177 | + --model_dir meta-llama/Llama-3.1-8B-Instruct \ |
| 178 | + --quantization fp8 \ |
| 179 | + --output_dir quantized/llama3.1-8b-base |
| 180 | + |
| 181 | +# Export base model with EAGLE flag |
| 182 | +tensorrt-edgellm-export-llm \ |
| 183 | + --model_dir quantized/llama3.1-8b-base \ |
| 184 | + --output_dir onnx_models/llama3.1-8b/base \ |
| 185 | + --is_eagle_base |
| 186 | + |
| 187 | +# Quantize EAGLE draft model |
| 188 | +tensorrt-edgellm-quantize-draft \ |
| 189 | + --base_model_dir meta-llama/Llama-3.1-8B-Instruct \ |
| 190 | + --draft_model_dir EAGLE3-LLaMA3.1-Instruct-8B \ |
| 191 | + --quantization fp8 \ |
| 192 | + --output_dir quantized/llama3.1-8b-draft |
| 193 | + |
| 194 | +# Export draft model |
| 195 | +tensorrt-edgellm-export-draft \ |
| 196 | + --draft_model_dir quantized/llama3.1-8b-draft \ |
| 197 | + --base_model_dir meta-llama/Llama-3.1-8B-Instruct \ |
| 198 | + --output_dir onnx_models/llama3.1-8b/draft |
129 | 199 | ``` |
130 | 200 |
|
131 | | -### Key Parameters |
| 201 | +### Quantization Methods |
132 | 202 |
|
133 | | -| Parameter | Description | |
| 203 | +| Method | Description | |
134 | 204 | | :--- | :--- | |
135 | | -| `--hf_model_path` | HuggingFace model name (e.g., `Qwen/Qwen2-0.5B-Instruct`) or local model path | |
136 | | -| `--dtype` | Export precision: `fp16`, `fp8`, `int4_awq`, or `nvfp4` | |
137 | | -| `--output_dir` | Directory to save the exported ONNX model | |
138 | | -| `--calib_size` | Number of calibration samples for quantization (default: 512) | |
139 | | -| `--lm_head` | Precision of lm_head layer (default: `fp16`) | |
140 | | -| `--save_original` | Save the raw ONNX before post-processing | |
141 | | -| `--trust_remote_code` | Trust remote code when loading from HuggingFace Hub | |
| 205 | +| FP8 | Best accuracy-to-memory balance on SM89+ hardware (Hopper, Ada) | |
| 206 | +| INT4 AWQ | Weight-only quantization; effective for memory-constrained platforms and low-batch inference | |
| 207 | +| NVFP4 | 4-bit format for NVIDIA Blackwell and Thor hardware; applies to both weights and activations | |
| 208 | +| MXFP8 | Experimental; Microscaling FP8 format for SM89+ hardware | |
| 209 | +| INT8 SmoothQuant | Experimental; INT8 weight and activation quantization with SmoothQuant | |
| 210 | +| INT4 GPTQ | Can be loaded directly from HuggingFace Hub (no additional quantization needed) | |
| 211 | + |
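These CLI tools wrap ModelOpt's PyTorch quantization API. As a rough illustration of the underlying mechanism, not the exact code behind `tensorrt-edgellm-quantize-llm`, FP8 post-training quantization with ModelOpt looks like the sketch below; the model name and calibration prompts are placeholders:

```python
# Sketch: FP8 post-training quantization with ModelOpt (placeholder model and prompts).
import torch
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).cuda()

calib_texts = ["Quantization reduces model precision.", "Edge inference benefits from FP8."]

def forward_loop(m):
    # Run a few calibration samples so ModelOpt can collect activation statistics.
    with torch.no_grad():
        for text in calib_texts:
            inputs = tokenizer(text, return_tensors="pt").to(m.device)
            m(**inputs)

# FP8_DEFAULT_CFG quantizes weights and activations to FP8; mtq.INT4_AWQ_CFG and
# mtq.NVFP4_DEFAULT_CFG correspond to the other methods in the table above.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
```
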
| 212 | +### Supported Models |
| 213 | + |
| 214 | +For the latest support matrix, see the [TensorRT-Edge-LLM Supported Models](https://nvidia.github.io/TensorRT-Edge-LLM/developer_guide/getting-started/supported-models.html) page. |
| 215 | + |
| 216 | +#### LLMs |
| 217 | + |
| 218 | +| Model | FP16 | FP8 | INT4 | NVFP4 | |
| 219 | +| :--- | :---: | :---: | :---: | :---: | |
| 220 | +| [Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) | ✅ | ✅ | ✅ | ✅ | |
| 221 | +| [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) | ✅ | ✅ | ✅ | ✅ | |
| 222 | +| [Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) | ✅ | ✅ | ✅ | ✅ | |
| 223 | +| [Qwen2-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) | ✅ | ✅ | ✅ | ✅ | |
| 224 | +| [Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct) | ✅ | ✅ | ✅ | ✅ | |
| 225 | +| [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) | ✅ | ✅ | ✅ | ✅ | |
| 226 | +| [Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) | ✅ | ✅ | ✅ | ✅ | |
| 227 | +| [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) | ✅ | ✅ | ✅ | ✅ | |
| 228 | +| [Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct) | ✅ | ✅ | ✅ | ✅ | |
| 229 | +| [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) | ✅ | ✅ | ✅ | ✅ | |
| 230 | +| [Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) | ✅ | ✅ | ✅ | ✅ | |
| 231 | +| [Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) | ✅ | ✅ | ✅ | ✅ | |
| 232 | +| [Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507) | ✅ | ✅ | ✅ | ✅ | |
| 233 | +| [Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) | ✅ | ✅ | ✅ | ✅ | |
| 234 | +| [DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) | ✅ | ✅ | ✅ | ✅ | |
| 235 | +| [DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B) | ✅ | ✅ | ✅ | ✅ | |
| 236 | + |
| 237 | +#### VLMs |
| 238 | + |
| 239 | +| Model | FP16 | FP8 | INT4 | NVFP4 | |
| 240 | +| :--- | :---: | :---: | :---: | :---: | |
| 241 | +| [Qwen2-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) | ✅ | ✅ | ✅ | ✅ | |
| 242 | +| [Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) | ✅ | ✅ | ✅ | ✅ | |
| 243 | +| [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) | ✅ | ✅ | ✅ | ✅ | |
| 244 | +| [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) | ✅ | ✅ | ✅ | ✅ | |
| 245 | +| [Qwen3-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct) | ✅ | ✅ | ✅ | ✅ | |
| 246 | +| [Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) | ✅ | ✅ | ✅ | ✅ | |
| 247 | +| [Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct) | ✅ | ✅ | ✅ | ✅ | |
| 248 | +| [InternVL3-1B](https://huggingface.co/OpenGVLab/InternVL3-1B) | ✅ | ✅ | ✅ | ✅ | |
| 249 | +| [InternVL3-2B](https://huggingface.co/OpenGVLab/InternVL3-2B) | ✅ | ✅ | ✅ | ✅ | |
| 250 | +| [Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) | ✅ | ✅ | ✅ | ✅ | |
| 251 | + |
| 252 | +### Troubleshooting |
| 253 | + |
| 254 | +- **GPU out of memory**: Use a larger GPU (40 GB for models up to 4B, 80 GB for models up to 8B) or try `--device cpu` (limited precision support). |
| 255 | +- **Calibration dataset issues**: Download the dataset manually and pass the local path with `--calib_dataset ./path/to/dataset`. |
| 256 | +- **Accuracy degradation**: Try FP8 instead of INT4/NVFP4, or increase calibration sample size. |
| 257 | + |
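For offline calibration, one option is to pre-download a Hugging Face dataset and pass its local path. This is a sketch only; both the dataset name (`cnn_dailymail`) and the on-disk format are assumptions, so check the TensorRT-Edge-LLM documentation for what `--calib_dataset` actually expects:

```python
# Sketch: cache a calibration dataset locally for offline quantization.
# The dataset name and save format are assumptions; verify against the Edge-LLM docs.
from datasets import load_dataset

dataset = load_dataset("cnn_dailymail", "3.0.0", split="train")
dataset.save_to_disk("./path/to/dataset")
```
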
| 258 | +For full documentation, see the [TensorRT-Edge-LLM Developer Guide](https://nvidia.github.io/TensorRT-Edge-LLM/). |
142 | 259 |
|
143 | 260 | ## Mixed Precision Quantization (Auto Mode) |
144 | 261 |
|
@@ -180,21 +297,6 @@ python torch_quant_to_onnx.py \ |
180 | 297 | | NVFP4 Quantized | 84.558% | 97.36% | |
181 | 298 | | Auto Quantized (FP8 + NVFP4, 4.78 effective bits) | 84.726% | 97.434% | |
182 | 299 |
|
183 | | -## ONNX Export Supported LLM Models |
184 | | - |
185 | | -| Model | FP16 | INT4 | FP8 | NVFP4 | |
186 | | -| :---: | :---: | :---: | :---: | :---: | |
187 | | -| [Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) | ✅ | ✅ | ✅ | ✅ | |
188 | | -| [Llama3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) | ✅ | ✅ | ✅ | ✅ | |
189 | | -| [Llama3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B) | ✅ | ✅ | ✅ | ✅ | |
190 | | -| [Qwen2-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) | ✅ | ✅ | ✅ | ✅ | |
191 | | -| [Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct) | ✅ | ✅ | ✅ | ✅ | |
192 | | -| [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) | ✅ | ✅ | ✅ | ✅ | |
193 | | -| [Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) | ✅ | ✅ | ✅ | ✅ | |
194 | | -| [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) | ✅ | ✅ | ✅ | ✅ | |
195 | | -| [Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct) | ✅ | ✅ | ✅ | ✅ | |
196 | | -| [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) | ✅ | ✅ | ✅ | ✅ | |
197 | | - |
198 | 300 | ## Resources |
199 | 301 |
|
200 | 302 | - 📅 [Roadmap](https://github.com/NVIDIA/Model-Optimizer/issues/146) |
|