Commit e861961

ajrasane and claude authored and committed
Replace in-repo LLM ONNX export with TensorRT-Edge-LLM (#1210)
### What does this PR do?

**Type of change:** Documentation, Cleanup

This PR removes the in-repo LLM ONNX export pipeline (`llm_export.py` and the `modelopt.onnx.llm_export_utils` package) and updates `examples/torch_onnx/README.md` to direct users to [TensorRT-Edge-LLM](https://github.com/NVIDIA/TensorRT-Edge-LLM), which provides a more complete and actively maintained pipeline for quantizing LLMs/VLMs with ModelOpt and exporting them to optimized ONNX for edge deployment (Jetson, DRIVE).

**Removed:**

- `examples/torch_onnx/llm_export.py` — standalone LLM export script
- `modelopt/onnx/llm_export_utils/` — supporting package (`export_utils.py`, `quantization_utils.py`, `surgeon_utils.py`)
- `tests/examples/torch_onnx/test_llm_export.py` — associated tests

**Updated:**

- `examples/torch_onnx/README.md` — rewrote the LLM section with TensorRT-Edge-LLM installation, CLI tools, usage examples (LLM, VLM, EAGLE speculative decoding), supported model matrix, quantization methods, and troubleshooting guidance

### Usage

Users should now use the TensorRT-Edge-LLM CLI tools instead of the removed `llm_export.py`:

```bash
# Quantize a model with ModelOpt
tensorrt-edgellm-quantize-llm \
    --model_dir Qwen/Qwen2.5-3B-Instruct \
    --quantization fp8 \
    --output_dir quantized/qwen2.5-3b-fp8

# Export to ONNX
tensorrt-edgellm-export-llm \
    --model_dir quantized/qwen2.5-3b-fp8 \
    --output_dir onnx_models/qwen2.5-3b
```

### Testing

- No functional code changes — this is a removal of deprecated code and a documentation update.
- Vision model quantization/export (`torch_quant_to_onnx.py`) and the mixed precision examples are unaffected.

### Before your PR is "*Ready for review*"

Make sure you read and follow the [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`). Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ❌ — Removes `llm_export.py` and the `modelopt.onnx.llm_export_utils` package. Users should migrate to [TensorRT-Edge-LLM](https://github.com/NVIDIA/TensorRT-Edge-LLM).
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`?: N/A
- Did you write any new necessary tests?: N/A — This removes code and tests; no new functionality added.
- Did you update the [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ❌ — Should be updated to note the removal of `llm_export_utils` and migration to TensorRT-Edge-LLM.
### Additional Information

- Successor tool: [TensorRT-Edge-LLM](https://github.com/NVIDIA/TensorRT-Edge-LLM)

## Summary by CodeRabbit

- **Documentation**
  - Restructured LLM ONNX export documentation to reflect the new quantization → export → build → inference workflow with updated tooling guidance
  - Added comprehensive installation and setup instructions with system requirements verification and supported model information
  - Expanded examples to cover LLMs, VLMs, and speculative decoding techniques with step-by-step command guidance, parameters, and troubleshooting resources

---------

Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 8815853 commit e861961

7 files changed

Lines changed: 158 additions & 960 deletions

File tree

examples/torch_onnx/README.md

Lines changed: 158 additions & 56 deletions
@@ -1,16 +1,20 @@
 # Torch Quantization to ONNX Export

-This example demonstrates how to quantize PyTorch models (vision and LLM) followed by export to ONNX format. The scripts leverage the ModelOpt toolkit for both quantization and ONNX export.
+This example demonstrates how to quantize PyTorch models followed by export to ONNX format. The scripts leverage the ModelOpt toolkit for quantization and ONNX export.
+
+For **vision models**, the `torch_quant_to_onnx.py` script in this directory handles quantization and ONNX export directly.
+
+For **LLMs and VLMs**, use [TensorRT-Edge-LLM](https://github.com/NVIDIA/TensorRT-Edge-LLM) which provides a complete pipeline for quantizing models with ModelOpt and exporting them to optimized ONNX for deployment on edge platforms (Jetson, DRIVE).

 <div align="center">

 | **Section** | **Description** | **Link** |
 | :------------: | :------------: | :------------: |
 | Pre-Requisites | Required packages to use this example | [Link](#pre-requisites) |
 | Vision Models | Quantize timm models and export to ONNX | [Link](#vision-models) |
-| LLM Export | Export LLMs to quantized ONNX | [Link](#llm-export) |
+| LLM Quantization and Export | Quantize and export LLMs/VLMs via TensorRT-Edge-LLM | [Link](#llm-quantization-and-export-with-tensorrt-edge-llm) |
+| Supported Models | LLM and VLM models supported by TensorRT-Edge-LLM | [Link](#supported-models) |
 | Mixed Precision | Auto mode for optimal per-layer quantization | [Link](#mixed-precision-quantization-auto-mode) |
-| Support Matrix | View the ONNX export supported LLM models | [Link](#onnx-export-supported-llm-models) |
 | Resources | Extra links to relevant resources | [Link](#resources) |

 </div>
@@ -78,67 +82,180 @@ python ../onnx_ptq/evaluate.py \
     --model_name=vit_base_patch16_224
 ```

-## LLM Export
+## LLM Quantization and Export with TensorRT-Edge-LLM

-The `llm_export.py` script exports LLM models to ONNX with optional quantization.
+[TensorRT-Edge-LLM](https://github.com/NVIDIA/TensorRT-Edge-LLM) provides a complete pipeline for quantizing LLMs and VLMs using NVIDIA ModelOpt and exporting them to optimized ONNX for deployment on edge platforms such as NVIDIA Jetson and DRIVE.

-### What it does
+### Overview

-- Loads a HuggingFace LLM model (local path or model name).
-- Optionally quantizes the model to FP8, INT4_AWQ, or NVFP4.
-- Exports the model to ONNX format.
-- Post-processes the ONNX graph for TensorRT compatibility.
+The pipeline follows these stages:

-### Usage
+1. **Quantize** (x86 host with GPU) — Reduce model precision using ModelOpt (FP8, INT4 AWQ, NVFP4)
+2. **Export** (x86 host with GPU) — Convert quantized model to ONNX
+3. **Build** (edge device) — Compile ONNX into TensorRT engines
+4. **Inference** (edge device) — Run the compiled engines
+
+### Installation

 ```bash
-python llm_export.py \
-    --hf_model_path=<HuggingFace model name or local path> \
-    --dtype=<fp16|fp8|int4_awq|nvfp4> \
-    --output_dir=<directory to save ONNX model>
+# Use the PyTorch Docker image (recommended)
+docker pull nvcr.io/nvidia/pytorch:25.12-py3
+docker run --gpus all -it --rm -v $(pwd):/workspace -w /workspace nvcr.io/nvidia/pytorch:25.12-py3 bash
+
+# Clone and install TensorRT-Edge-LLM
+git clone https://github.com/NVIDIA/TensorRT-Edge-LLM.git
+cd TensorRT-Edge-LLM
+git submodule update --init --recursive
+python3 -m venv venv
+source venv/bin/activate
+pip3 install .
+
+# Verify installation
+tensorrt-edgellm-quantize-llm --help
+tensorrt-edgellm-export-llm --help
 ```

-### Examples
+**System requirements:**
+
+- x86-64 Linux (Ubuntu 22.04 or 24.04 recommended)
+- NVIDIA GPU with Compute Capability 8.0+ (Ampere or newer)
+- CUDA 12.x or 13.x, Python 3.10+
+- GPU VRAM: 16 GB for models up to 3B, 40 GB for models up to 4B, 80 GB for models up to 8B
+
+### CLI Tools
+
+| Tool | Purpose |
+| :--- | :--- |
+| `tensorrt-edgellm-quantize-llm` | Quantize LLM models using ModelOpt (FP8, INT4 AWQ, NVFP4) |
+| `tensorrt-edgellm-export-llm` | Export LLM to ONNX with precision-specific optimizations |
+| `tensorrt-edgellm-export-visual` | Export visual encoders for multimodal VLM models |
+| `tensorrt-edgellm-quantize-draft` | Quantize EAGLE draft models for speculative decoding |
+| `tensorrt-edgellm-export-draft` | Export EAGLE draft models to ONNX |
+| `tensorrt-edgellm-insert-lora` | Insert LoRA patterns into existing ONNX models |
+| `tensorrt-edgellm-process-lora` | Process LoRA adapter weights for runtime loading |

-Export Qwen2 to FP16 ONNX:
+### Example: Quantize and Export an LLM

 ```bash
-python llm_export.py \
-    --hf_model_path=Qwen/Qwen2-0.5B-Instruct \
-    --dtype=fp16 \
-    --output_dir=./qwen2_fp16
+# Step 1: Quantize with ModelOpt
+tensorrt-edgellm-quantize-llm \
+    --model_dir Qwen/Qwen2.5-3B-Instruct \
+    --quantization fp8 \
+    --output_dir quantized/qwen2.5-3b-fp8
+
+# Step 2: Export to ONNX
+tensorrt-edgellm-export-llm \
+    --model_dir quantized/qwen2.5-3b-fp8 \
+    --output_dir onnx_models/qwen2.5-3b
 ```

-Export Qwen2 to FP8 ONNX with quantization:
+### Example: Quantize and Export a VLM

 ```bash
-python llm_export.py \
-    --hf_model_path=Qwen/Qwen2-0.5B-Instruct \
-    --dtype=fp8 \
-    --output_dir=./qwen2_fp8
+# Quantize the language model component
+tensorrt-edgellm-quantize-llm \
+    --model_dir Qwen/Qwen2.5-VL-3B-Instruct \
+    --quantization fp8 \
+    --output_dir quantized/qwen2.5-vl-3b
+
+# Export the language model
+tensorrt-edgellm-export-llm \
+    --model_dir quantized/qwen2.5-vl-3b \
+    --output_dir onnx_models/qwen2.5-vl-3b/llm
+
+# Export the visual encoder
+tensorrt-edgellm-export-visual \
+    --model_dir Qwen/Qwen2.5-VL-3B-Instruct \
+    --output_dir onnx_models/qwen2.5-vl-3b/visual
 ```

-Export to NVFP4 with custom calibration:
+### Example: EAGLE Speculative Decoding

 ```bash
-python llm_export.py \
-    --hf_model_path=Qwen/Qwen3-0.6B \
-    --dtype=nvfp4 \
-    --calib_size=512 \
-    --output_dir=./qwen3_nvfp4
+# Quantize base model
+tensorrt-edgellm-quantize-llm \
+    --model_dir meta-llama/Llama-3.1-8B-Instruct \
+    --quantization fp8 \
+    --output_dir quantized/llama3.1-8b-base
+
+# Export base model with EAGLE flag
+tensorrt-edgellm-export-llm \
+    --model_dir quantized/llama3.1-8b-base \
+    --output_dir onnx_models/llama3.1-8b/base \
+    --is_eagle_base
+
+# Quantize EAGLE draft model
+tensorrt-edgellm-quantize-draft \
+    --base_model_dir meta-llama/Llama-3.1-8B-Instruct \
+    --draft_model_dir EAGLE3-LLaMA3.1-Instruct-8B \
+    --quantization fp8 \
+    --output_dir quantized/llama3.1-8b-draft
+
+# Export draft model
+tensorrt-edgellm-export-draft \
+    --draft_model_dir quantized/llama3.1-8b-draft \
+    --base_model_dir meta-llama/Llama-3.1-8B-Instruct \
+    --output_dir onnx_models/llama3.1-8b/draft
 ```

-### Key Parameters
+### Quantization Methods

-| Parameter | Description |
+| Method | Description |
 | :--- | :--- |
-| `--hf_model_path` | HuggingFace model name (e.g., `Qwen/Qwen2-0.5B-Instruct`) or local model path |
-| `--dtype` | Export precision: `fp16`, `fp8`, `int4_awq`, or `nvfp4` |
-| `--output_dir` | Directory to save the exported ONNX model |
-| `--calib_size` | Number of calibration samples for quantization (default: 512) |
-| `--lm_head` | Precision of lm_head layer (default: `fp16`) |
-| `--save_original` | Save the raw ONNX before post-processing |
-| `--trust_remote_code` | Trust remote code when loading from HuggingFace Hub |
+| FP8 | Best accuracy-to-memory balance on SM89+ hardware (Hopper, Ada) |
+| INT4 AWQ | Weight-only quantization; effective for memory-constrained platforms and low-batch inference |
+| NVFP4 | 4-bit format for NVIDIA Blackwell and Thor hardware; applies to both weights and activations |
+| MXFP8 | Experimental; Microscaling FP8 format for SM89+ hardware |
+| INT8 SmoothQuant | Experimental; INT8 weight and activation quantization with SmoothQuant |
+| INT4 GPTQ | Can be loaded directly from HuggingFace Hub (no additional quantization needed) |

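For methods other than the FP8 shown in the examples above, the same quantize-then-export flow applies. A minimal sketch follows; the exact value strings accepted by `--quantization` (for example `int4_awq` or `nvfp4`) are assumptions here, so confirm them with `tensorrt-edgellm-quantize-llm --help`:

```bash
# Sketch only: the --quantization value strings below are assumed, not verified.
tensorrt-edgellm-quantize-llm \
    --model_dir Qwen/Qwen2.5-3B-Instruct \
    --quantization nvfp4 \
    --output_dir quantized/qwen2.5-3b-nvfp4

tensorrt-edgellm-export-llm \
    --model_dir quantized/qwen2.5-3b-nvfp4 \
    --output_dir onnx_models/qwen2.5-3b-nvfp4
```
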
+### Supported Models
+
+For the latest support matrix, see the [TensorRT-Edge-LLM Supported Models](https://nvidia.github.io/TensorRT-Edge-LLM/developer_guide/getting-started/supported-models.html) page.
+
+#### LLMs
+
+| Model | FP16 | FP8 | INT4 | NVFP4 |
+| :--- | :---: | :---: | :---: | :---: |
+| [Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) |||||
+| [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) |||||
+| [Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) |||||
+| [Qwen2-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) |||||
+| [Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct) |||||
+| [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) |||||
+| [Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) |||||
+| [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) |||||
+| [Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct) |||||
+| [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) |||||
+| [Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) |||||
+| [Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) |||||
+| [Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507) |||||
+| [Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) |||||
+| [DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) |||||
+| [DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B) |||||
+
+#### VLMs
+
+| Model | FP16 | FP8 | INT4 | NVFP4 |
+| :--- | :---: | :---: | :---: | :---: |
+| [Qwen2-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) |||||
+| [Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) |||||
+| [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) |||||
+| [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) |||||
+| [Qwen3-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct) |||||
+| [Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct) |||||
+| [Qwen3-VL-8B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct) |||||
+| [InternVL3-1B](https://huggingface.co/OpenGVLab/InternVL3-1B) |||||
+| [InternVL3-2B](https://huggingface.co/OpenGVLab/InternVL3-2B) |||||
+| [Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) |||||
+
+### Troubleshooting
+
+- **GPU out of memory**: Use a larger GPU (40 GB for models up to 4B, 80 GB for models up to 8B) or try `--device cpu` (limited precision support).
+- **Calibration dataset issues**: Download the dataset manually and pass the local path with `--calib_dataset ./path/to/dataset`.
+- **Accuracy degradation**: Try FP8 instead of INT4/NVFP4, or increase calibration sample size.

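As a concrete illustration of the calibration workaround above, a minimal sketch; `./local_calib_data` is a placeholder for a manually downloaded dataset, and the exact `--calib_dataset` behavior should be confirmed against the TensorRT-Edge-LLM documentation:

```bash
# Sketch only: the dataset path is a placeholder for a locally downloaded copy.
tensorrt-edgellm-quantize-llm \
    --model_dir Qwen/Qwen2.5-3B-Instruct \
    --quantization fp8 \
    --calib_dataset ./local_calib_data \
    --output_dir quantized/qwen2.5-3b-fp8
```
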
+For full documentation, see the [TensorRT-Edge-LLM Developer Guide](https://nvidia.github.io/TensorRT-Edge-LLM/).

 ## Mixed Precision Quantization (Auto Mode)

@@ -180,21 +297,6 @@ python torch_quant_to_onnx.py \
 | NVFP4 Quantized | 84.558% | 97.36% |
 | Auto Quantized (FP8 + NVFP4, 4.78 effective bits) | 84.726% | 97.434% |
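
For context on the quantization behind these numbers, a minimal ModelOpt post-training quantization sketch for a timm vision model; it assumes the standard `modelopt.torch.quantization` entry point (`mtq.quantize` with a calibration forward loop) and placeholder random calibration data, and it is not the `torch_quant_to_onnx.py` implementation or its auto-mode search (exposed separately as `mtq.auto_quantize`):

```python
# Hedged sketch, not the torch_quant_to_onnx.py script itself.
import timm
import torch
import modelopt.torch.quantization as mtq

model = timm.create_model("vit_base_patch16_224", pretrained=True).cuda().eval()

def forward_loop(m):
    # Placeholder calibration: a few random batches; a real run feeds ImageNet samples.
    with torch.no_grad():
        for _ in range(8):
            m(torch.randn(8, 3, 224, 224, device="cuda"))

# FP8 post-training quantization; NVFP4 and auto (mixed precision) modes use other
# configs or entry points.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop=forward_loop)
```
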

-## ONNX Export Supported LLM Models
-
-| Model | FP16 | INT4 | FP8 | NVFP4 |
-| :---: | :---: | :---: | :---: | :---: |
-| [Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) |||||
-| [Llama3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) |||||
-| [Llama3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B) |||||
-| [Qwen2-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) |||||
-| [Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct) |||||
-| [Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) |||||
-| [Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) |||||
-| [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) |||||
-| [Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct) |||||
-| [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) |||||
-

 ## Resources

 - 📅 [Roadmap](https://github.com/NVIDIA/Model-Optimizer/issues/146)
