Commit cda5eb9: resolve comments

Signed-off-by: Will Guo <willg@nvidia.com>
1 parent c06a405 commit cda5eb9

File tree: 3 files changed, +329 −265 lines changed

examples/onnx/autoqdq/README.md

Lines changed: 299 additions & 0 deletions
@@ -0,0 +1,299 @@
# QDQ Placement Optimization Example

This example demonstrates automated Q/DQ (Quantize/Dequantize) node placement optimization for ONNX models using TensorRT performance measurements.

## Table of Contents

- [Prerequisites](#prerequisites)
- [Get the Model](#get-the-model)
- [Set Fixed Batch Size](#set-fixed-batch-size)
- [What's in This Directory](#whats-in-this-directory)
- [Quick Start](#quick-start)
- [Basic Usage](#basic-usage)
- [FP8 Quantization](#fp8-quantization)
- [Faster Exploration](#faster-exploration)
- [Output Structure](#output-structure)
- [Region Inspection](#region-inspection)
- [Using the Optimized Model](#using-the-optimized-model)
- [Pattern Cache](#pattern-cache)
- [Optimize from Existing QDQ Model](#optimize-from-existing-qdq-model)
- [Remote Autotuning with TensorRT](#remote-autotuning-with-tensorrt)
- [Programmatic API Usage](#programmatic-api-usage)
- [Documentation](#documentation)
## Prerequisites

### Get the Model

Download the ResNet50 model from the ONNX Model Zoo:

```bash
# Download ResNet50 from ONNX Model Zoo
curl -L -o resnet50_Opset17.onnx https://github.com/onnx/models/raw/main/Computer_Vision/resnet50_Opset17_torch_hub/resnet50_Opset17.onnx
```

### Set Fixed Batch Size

The downloaded model has a dynamic batch size. For best performance with TensorRT benchmarking, set a fixed batch size:

```bash
# Set batch size to 128 using the provided script
python3 set_batch_size.py resnet50_Opset17.onnx --batch-size 128 --output resnet50.bs128.onnx

# Or for other batch sizes
python3 set_batch_size.py resnet50_Opset17.onnx --batch-size 1 --output resnet50.bs1.onnx
```

This creates `resnet50.bs128.onnx` with a fixed batch size of 128, which is optimal for TensorRT performance benchmarking.

**Note:** The script requires the `onnx` package.
### What's in This Directory

- `set_batch_size.py` - Script to convert dynamic batch size models to fixed batch size
- `README.md` - This guide

**Note:** ONNX model files are not included in the repository (excluded via `.gitignore`). Download and prepare them using the instructions above.
## Quick Start

### Basic Usage

Optimize the ResNet50 model with INT8 quantization:

```bash
# Using the fixed batch size model
python3 -m modelopt.onnx.quantization.autotune \
    --onnx_path resnet50.bs128.onnx \
    --output_dir ./resnet50_results \
    --quant_type int8 \
    --schemes_per_region 30

# Or use the original dynamic batch size model; the batch size is set to 1 during benchmarking
python3 -m modelopt.onnx.quantization.autotune \
    --onnx_path resnet50_Opset17.onnx \
    --output_dir ./resnet50_results \
    --quant_type int8 \
    --schemes_per_region 30
```

Short options: `-m` for `--onnx_path`, `-o` for `--output_dir`, `-s` for `--schemes_per_region`. The default output directory is `./autotuner_output` if `--output_dir` is omitted.
This will:

1. Automatically discover optimization regions in the model
2. Test 30 different Q/DQ placement schemes per region pattern
3. Measure TensorRT performance for each scheme
4. Export the best optimized model to `./resnet50_results/optimized_final.onnx`

### FP8 Quantization

For FP8 quantization:

```bash
python3 -m modelopt.onnx.quantization.autotune \
    --onnx_path resnet50.bs128.onnx \
    --output_dir ./resnet50_fp8_results \
    --quant_type fp8 \
    --schemes_per_region 50
```
### Faster Exploration

For quick experiments, reduce the number of schemes:

```bash
python3 -m modelopt.onnx.quantization.autotune \
    --onnx_path resnet50.bs128.onnx \
    --output_dir ./resnet50_quick \
    --schemes_per_region 15
```
## Output Structure

After running, the output workspace will be:

```log
resnet50_results/
├── optimized_final.onnx                 # Optimized model
├── baseline.onnx                        # Baseline for comparison
├── autotuner_state.yaml                 # Resume checkpoint
├── autotuner_state_pattern_cache.yaml   # Reusable pattern cache
├── logs/
│   ├── baseline.log                     # TensorRT baseline log
│   ├── region_*_scheme_*.log            # Per-scheme logs
│   └── final.log                        # Final model log
└── region_models/                       # Best model per region (intermediate)
    └── region_*_level_*.onnx
```
## Region Inspection

To debug how the autotuner discovers and partitions regions in your model, use the `region_inspect` tool. It runs the same region search as the autotuner and prints the region hierarchy, node counts, and summary statistics (without running benchmarks).

```bash
# Basic inspection (regions with quantizable ops only)
python3 -m modelopt.onnx.quantization.autotune.region_inspect --model resnet50.bs128.onnx

# Verbose mode for detailed debug logging
python3 -m modelopt.onnx.quantization.autotune.region_inspect --model resnet50.bs128.onnx --verbose

# Custom maximum sequence region size
python3 -m modelopt.onnx.quantization.autotune.region_inspect --model resnet50.bs128.onnx --max-sequence-size 20

# Include all regions (including those without Conv/MatMul etc.)
python3 -m modelopt.onnx.quantization.autotune.region_inspect --model resnet50.bs128.onnx --include-all-regions
```

Short options: `-m` for `--model`, `-v` for `--verbose`. Use this to verify region boundaries and counts before or during autotuning.
## Using the Optimized Model

Deploy with TensorRT:

```bash
trtexec --onnx=resnet50_results/optimized_final.onnx \
    --saveEngine=resnet50.engine \
    --stronglyTyped
```
## Pattern Cache

Reuse learned patterns on similar models (warm-start):

```bash
# First optimization on ResNet50
python3 -m modelopt.onnx.quantization.autotune \
    --onnx_path resnet50.bs128.onnx \
    --output_dir ./resnet50_run

# Download and prepare ResNet101 (or any similar model)
curl -L -o resnet101_Opset17.onnx https://github.com/onnx/models/raw/main/Computer_Vision/resnet101_Opset17_torch_hub/resnet101_Opset17.onnx
python3 set_batch_size.py resnet101_Opset17.onnx --batch-size 128 --output resnet101.bs128.onnx

# Reuse patterns from ResNet50 on ResNet101
python3 -m modelopt.onnx.quantization.autotune \
    --onnx_path resnet101.bs128.onnx \
    --output_dir ./resnet101_run \
    --pattern_cache ./resnet50_run/autotuner_state_pattern_cache.yaml
```
## Optimize from Existing QDQ Model

If you already have a quantized model, you can use it as a starting point to potentially find even better Q/DQ placements:

```bash
# Use an existing QDQ model as baseline (imports quantization patterns)
python3 -m modelopt.onnx.quantization.autotune \
    --onnx_path resnet50.bs128.onnx \
    --output_dir ./resnet50_improved \
    --qdq_baseline resnet50_quantized.onnx \
    --schemes_per_region 40
```

This will:

1. Extract Q/DQ insertion points from the baseline model
2. Import them into the pattern cache as seed schemes
3. Generate and test variations to find better placements
4. Compare against the baseline performance
**Use cases:**

- **Improve existing quantization**: Fine-tune manually quantized models
- **Compare tools**: Test if the autotuner can beat other quantization methods
- **Bootstrap optimization**: Start from expert-tuned schemes

**Example workflow:**

```bash
# Step 1: Create initial quantized model with modelopt
# For example, using modelopt's quantize function:
python3 -c "
import numpy as np
from modelopt.onnx.quantization import quantize

# Create dummy calibration data (replace with real data for production)
dummy_input = np.random.randn(128, 3, 224, 224).astype(np.float32)
quantize(
    'resnet50.bs128.onnx',
    calibration_data=dummy_input,
    calibration_method='entropy',
    output_path='resnet50_quantized.onnx'
)
"

# Step 2: Use the quantized baseline for autotuning
# The autotuner will try to find better Q/DQ placements than the initial quantization
python3 -m modelopt.onnx.quantization.autotune \
    --onnx_path resnet50.bs128.onnx \
    --output_dir ./resnet50_autotuned \
    --qdq_baseline resnet50_quantized.onnx \
    --schemes_per_region 50
```

**Note:** This example uses dummy calibration data. For production use, provide real calibration data representative of the inference workload.
## Remote Autotuning with TensorRT

TensorRT 10.16+ supports remote autotuning, which allows TensorRT's optimization process to be offloaded to remote hardware. This is useful when optimizing models for different target GPUs without having direct access to them.

To use remote autotuning during Q/DQ placement optimization, run with `trtexec` and pass extra args:

```bash
python3 -m modelopt.onnx.quantization.autotune \
    --onnx_path resnet50.bs128.onnx \
    --output_dir ./resnet50_remote_autotuned \
    --schemes_per_region 50 \
    --use_trtexec \
    --trtexec_benchmark_args "--remoteAutoTuningConfig=\"<remote autotuning config>\""
```

**Requirements:**

- TensorRT 10.16 or later
- Valid remote autotuning configuration
- `--use_trtexec` must be set (benchmarking uses `trtexec` instead of the TensorRT Python API)

Replace `<remote autotuning config>` with your actual remote autotuning configuration string. Other TensorRT benchmark options (e.g. `--timing_cache`, `--warmup_runs`, `--timing_runs`, `--plugin_libraries`) are also available; run the module with `--help` for details.
## Programmatic API Usage

All examples above use the command-line interface. For **low-level programmatic control** in Python code, use the Python API directly. This allows you to:

- Integrate autotuning into custom pipelines
- Implement custom evaluation functions
- Control state management and checkpointing
- Build custom optimization workflows

**See the API Reference documentation for low-level usage:**

- [`docs/source/reference/2_qdq_placement.rst`](../../docs/source/reference/2_qdq_placement.rst)

The API docs include detailed examples of:

- Using the `QDQAutotuner` class and `region_pattern_autotuning_workflow`
- Customizing region discovery and scheme generation
- Managing optimization state and pattern cache programmatically
- Implementing custom performance evaluators (e.g. via `init_benchmark_instance` and `benchmark_onnx_model`)
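To give a feel for the custom-evaluator idea, here is a stdlib-only sketch of the callback pattern: a callable that takes a model path and returns a latency score the autotuner could minimize. The `make_timed_evaluator` helper and its signature are my illustration, not the actual `init_benchmark_instance` / `benchmark_onnx_model` API (see the API reference for that):

```python
import time
from typing import Callable


def make_timed_evaluator(
    run_inference: Callable[[str], None], repeats: int = 5
) -> Callable[[str], float]:
    """Wrap any inference function into a median-latency evaluator (milliseconds)."""

    def evaluate(model_path: str) -> float:
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            run_inference(model_path)  # stand-in for a real TensorRT run
            timings.append((time.perf_counter() - start) * 1000.0)
        timings.sort()
        return timings[len(timings) // 2]  # median is robust to warmup jitter

    return evaluate


# Usage with a dummy inference function that sleeps for ~1 ms:
evaluator = make_timed_evaluator(lambda path: time.sleep(0.001))
latency_ms = evaluator("model.onnx")
print(f"median latency: {latency_ms:.2f} ms")
```

Any callable with this shape (path in, score out) can serve as a drop-in performance metric in a custom optimization loop.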
## Documentation

For comprehensive documentation on QDQ placement optimization, see:

- **User Guide**: [`docs/source/guides/9_qdq_placement.rst`](../../docs/source/guides/9_qdq_placement.rst)
  - Detailed explanations of how the autotuner works
  - Advanced usage patterns and best practices
  - Configuration options and performance tuning
  - Troubleshooting common issues

- **API Reference**: [`docs/source/reference/2_qdq_placement.rst`](../../docs/source/reference/2_qdq_placement.rst)
  - Complete API documentation for all classes and functions
  - Low-level usage examples
  - State management and pattern cache details

For command-line help and all options (e.g. `--state_file`, `--node_filter_list`, `--default_dq_dtype`, `--verbose`):

```bash
python3 -m modelopt.onnx.quantization.autotune --help
```
Lines changed: 30 additions & 9 deletions
@@ -1,5 +1,5 @@
 #!/usr/bin/env python3
-# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
@@ -25,9 +25,25 @@
 """

 import argparse
+import sys

 import onnx
-from onnx import shape_inference
+
+from modelopt.onnx.utils import check_model, infer_shapes, save_onnx
+
+
+def _validate_onnx_model_path(path: str) -> None:
+    """Ensure the model path has a .onnx extension for consistent output path generation."""
+    if not path.lower().endswith(".onnx"):
+        print(f"Error: Model path must end with '.onnx', got: {path}", file=sys.stderr)
+        sys.exit(1)
+
+
+def _validate_batch_size(batch_size: int) -> None:
+    """Ensure batch size is a positive integer to prevent invalid model configurations."""
+    if batch_size < 1:
+        print(f"Error: Batch size must be a positive integer, got: {batch_size}", file=sys.stderr)
+        sys.exit(1)


 def set_batch_size(model_path: str, batch_size: int, output_path: str) -> None:
@@ -68,20 +84,21 @@ def set_batch_size(model_path: str, batch_size: int, output_path: str) -> None:
     )

     # Run shape inference to propagate the batch size through the model
+    # Use modelopt's infer_shapes to support models with external data and large models
     print("Running shape inference...")
     try:
-        model = shape_inference.infer_shapes(model)
+        model = infer_shapes(model)
     except Exception as e:
         print(f"Warning: Shape inference failed: {e}")
         print("Continuing without shape inference...")

-    # Save the modified model
+    # Save the modified model (handles external data and IR > max ORT supported)
     print(f"Saving modified model to {output_path}...")
-    onnx.save(model, output_path)
+    save_onnx(model, output_path)

-    # Verify the saved model
+    # Verify the saved model (handles external data and large models)
     print("Verifying model...")
-    onnx.checker.check_model(output_path)
+    check_model(model)
     print("✓ Model saved and verified successfully!")
@@ -109,9 +126,13 @@ def main():

     args = parser.parse_args()

-    # Generate output path if not provided
+    _validate_onnx_model_path(args.model)
+    _validate_batch_size(args.batch_size)
+
+    # Generate output path if not provided (requires .onnx extension, validated above)
     if args.output is None:
-        base_name = args.model.rsplit(".", 1)[0]
+        parts = args.model.rsplit(".", 1)
+        base_name = parts[0] if len(parts) == 2 else args.model
         args.output = f"{base_name}.bs{args.batch_size}.onnx"

     set_batch_size(args.model, args.batch_size, args.output)
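The output-path logic in this hunk can be exercised in isolation. This stdlib-only sketch mirrors that logic; the `default_output_path` function name is mine, not the script's:

```python
def default_output_path(model_path: str, batch_size: int) -> str:
    """Derive the default output filename, mirroring the diff above."""
    parts = model_path.rsplit(".", 1)
    # Fall back to the full path when there is no extension to strip.
    base_name = parts[0] if len(parts) == 2 else model_path
    return f"{base_name}.bs{batch_size}.onnx"


print(default_output_path("resnet50_Opset17.onnx", 128))  # resnet50_Opset17.bs128.onnx
```

With the new `.onnx` validation in place the extensionless fallback should never trigger, but it keeps the helper total.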
