
Commit 26cad67

[OMNIML-3252][ONNX] Add real Q/DQ scales in Autotune (#951)
## What does this PR do?

**Type of change:** New feature

**Overview:** ONNX Autotune (also called Auto Q/DQ) is currently a standalone ModelOpt feature that automatically adds Q/DQ nodes where relevant, based on information obtained from TensorRT inference. One issue is that the scales in those Q/DQ nodes are random. This PR does two major things:

1. Integrates Auto Q/DQ into the ONNX quantization workflow; and
2. Enables calibration data to be used to obtain the correct scales for the Q/DQ nodes.

## Usage

```bash
$ python -m modelopt.onnx.quantization --onnx_path=model.onnx --autotune={quick,default,extensive}
```

> Please see `__main__.py` for other args.

## Testing

1. Added a unit test for Q/DQ node placement validation: `tests/gpu/onnx/quantization/test_autotune_quantization_integration.py`
2. Verified that accuracy was recovered by integrating MOQ with Autotune. Results on an RTX 3090 with TRT 10.12.0.36 (`--stronglyTyped`) for ViT, as per `examples/onnx_ptq`:

| Model                     | Top-1 acc | Top-5 acc |
|---------------------------|-----------|-----------|
| FP32                      | 85.1%     | 97.5%     |
| FP16 (FP32 with `--fp16`) | 85.1%     | 97.5%     |
| Quant (MOQ)               | 82.4%     | 96.4%     |
| Quant (Autotune)          | 0.1%      | 0.5%      |
| Quant (MOQ + Autotune)    | 79.6%     | 95.0%     |

Note that accuracy is mostly recovered when moving from standalone Autotune to MOQ + Autotune (real Q/DQ scales). The remaining drop between MOQ and MOQ + Autotune is likely due to some sensitive nodes being quantized, such as `BiasAdd` (see bug 5916898).

## Before your PR is "*Ready for review*"

- **Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)** and your commits are signed.
- **Is this change backward compatible?**: Yes
- **Did you write any new necessary tests?**: Yes
- **Did you add or update any necessary documentation?**: No (will be done in a different PR)
- **Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?**: No

## Summary by CodeRabbit

* **New Features**
  * Autotuning added to ONNX quantization: CLI flags, presets, per-region tuning, and FP8/INT8 support; accepts in-memory models and optional output dirs; node-filter loading and explicit-flag CLI behavior.
  * Activation-operation accessor exposed and autotune helpers added to the package API.
* **Bug Fixes**
  * Safer graph rewiring to avoid corrupting quantized graphs when targets are absent.
* **Tests**
  * New integration test and model helper validating autotune quantization consistency.

## Additional information

To reproduce the accuracy numbers with ViT, call `download_example_onnx.py` and `image_prep.py` without `--fp16`. If `--fp16` is used here, quantizing this model with `--autotune` results in the following error:

```
[modelopt][onnx] - ERROR - Benchmark failed: Converting dtype('float16') to a ctypes type
```

This is fixed in #978.

---------

Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
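The CLI invocation shown under Usage maps onto the Python `quantize` API. The sketch below is illustrative only: the positional model path and the `autotune*` keyword arguments are taken from the `quantize(...)` call added in `modelopt/onnx/quantization/__main__.py` in this commit, and the preset lookup mirrors what the CLI does for `--autotune=default`; it is not an officially documented example.

```python
from modelopt.onnx.quantization.autotune import MODE_PRESETS
from modelopt.onnx.quantization.quantize import quantize

# Mirror the CLI's --autotune=default behavior (sketch; see the diff of
# modelopt/onnx/quantization/__main__.py below for the full set of autotune kwargs).
preset = MODE_PRESETS["default"]

quantize(
    "model.onnx",                 # path to the ONNX model to quantize
    quantize_mode="int8",
    autotune=True,                # enable Auto Q/DQ placement with real scales
    autotune_num_schemes_per_region=preset["schemes_per_region"],
    autotune_warmup_runs=preset["warmup_runs"],
    autotune_timing_runs=preset["timing_runs"],
)
```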
1 parent fe83270 commit 26cad67

16 files changed

Lines changed: 788 additions & 121 deletions


modelopt/onnx/op_types.py

Lines changed: 22 additions & 0 deletions
```diff
@@ -386,3 +386,25 @@ def get_symmetric_ops():
         "BitwiseOr",
         "BitwiseXor",
     }
+
+
+def get_activation_ops():
+    """Returns set of activation operations."""
+    return {
+        "Relu",
+        "LeakyRelu",
+        "PRelu",
+        "Elu",
+        "Selu",
+        "ThresholdedRelu",
+        "Sigmoid",
+        "Tanh",
+        "HardSigmoid",
+        "Softmax",
+        "LogSoftmax",
+        "Clip",
+        "Softplus",
+        "Softsign",
+        "Swish",
+        "HardSwish",
+    }
```

modelopt/onnx/quantization/__main__.py

Lines changed: 145 additions & 0 deletions
```diff
@@ -20,6 +20,11 @@
 
 import numpy as np
 
+from modelopt.onnx.quantization.autotune import (
+    MODE_PRESETS,
+    StoreWithExplicitFlag,
+    get_node_filter_list,
+)
 from modelopt.onnx.quantization.quantize import quantize
 
 __all__ = ["main"]
@@ -295,9 +300,128 @@ def get_parser() -> argparse.ArgumentParser:
             "if certain operations require a higher version."
         ),
     )
+    argparser.add_argument(
+        "--autotune",
+        nargs="?",
+        const="default",
+        default=None,
+        choices=["quick", "default", "extensive"],
+        help=(
+            "If set, enable Autotune to detect optimal Q/DQ node placements according to TensorRT runtimes. "
+            "Available modes (presets 'schemes_per_region', 'warmup_runs', and 'timing_runs' values): "
+            " - 'quick': fewer schemes and benchmark runs for quick exploration; "
+            " - 'default': balanced, recommended for most cases; "
+            " - 'extensive': more schemes and runs for extensive search and thorough tuning. "
+            "Explicit --autotune_schemes_per_region/warmup_runs/timing_runs override the preset."
+        ),
+    )
+
+    autotune_group = argparser.add_argument_group(
+        "Autotune (only applicable when --autotune is set)"
+    )
+    autotune_group.add_argument(
+        "--autotune_output_dir",
+        type=str,
+        default=None,
+        help="Output directory for autotune results (state file, logs). Default: temp directory.",
+    )
+    autotune_group.add_argument(
+        "--autotune_schemes_per_region",
+        type=int,
+        default=MODE_PRESETS["default"]["schemes_per_region"],
+        help="Number of Q/DQ schemes to test per region.",
+        action=StoreWithExplicitFlag,
+        explicit_attr="_explicit_autotune_schemes_per_region",
+    )
+    autotune_group.add_argument(
+        "--autotune_pattern_cache",
+        type=str,
+        default=None,
+        dest="autotune_pattern_cache_file",
+        help="Path to pattern cache YAML for warm-start.",
+    )
+    autotune_group.add_argument(
+        "--autotune_qdq_baseline",
+        type=str,
+        default=None,
+        help="Path to a pre-quantized ONNX model to import Q/DQ patterns as warm-start.",
+    )
+    autotune_group.add_argument(
+        "--autotune_state_file",
+        type=str,
+        default=None,
+        help="State file path for crash recovery and resume capability (default: <output_dir>/autotuner_state.yaml).",
+    )
+    autotune_group.add_argument(
+        "--autotune_node_filter_list",
+        type=str,
+        default=None,
+        help=(
+            "Path to a file containing wildcard patterns to filter ONNX nodes (one pattern per line). "
+            "Regions without any matching nodes are skipped during autotuning."
+        ),
+    )
+    autotune_group.add_argument(
+        "--autotune_verbose",
+        action="store_true",
+        help="Enable verbose logging in the autotuner.",
+    )
+    autotune_group.add_argument(
+        "--autotune_use_trtexec",
+        action="store_true",
+        help="Use trtexec for benchmarking instead of the TensorRT Python API.",
+    )
+    autotune_group.add_argument(
+        "--autotune_timing_cache",
+        type=str,
+        default=None,
+        help="TensorRT timing cache file for faster engine builds.",
+    )
+    autotune_group.add_argument(
+        "--autotune_warmup_runs",
+        type=int,
+        default=MODE_PRESETS["default"]["warmup_runs"],
+        help="Number of warmup runs before timing.",
+        action=StoreWithExplicitFlag,
+        explicit_attr="_explicit_autotune_warmup_runs",
+    )
+    autotune_group.add_argument(
+        "--autotune_timing_runs",
+        type=int,
+        default=MODE_PRESETS["default"]["timing_runs"],
+        help="Number of timed runs for latency measurement.",
+        action=StoreWithExplicitFlag,
+        explicit_attr="_explicit_autotune_timing_runs",
+    )
+    autotune_group.add_argument(
+        "--autotune_trtexec_args",
+        type=str,
+        default=None,
+        help=(
+            "Additional trtexec arguments as a single quoted string. "
+            "Example: --autotune_trtexec_args '--fp16 --workspace=4096'"
+        ),
+    )
     return argparser
 
 
+def apply_mode_presets(args) -> None:
+    """Apply --autotune=mode preset to schemes_per_region, warmup_runs, timing_runs.
+
+    Only applies preset for an option when that option was not explicitly set on the
+    command line (explicit flags override the preset).
+    """
+    if args.autotune not in MODE_PRESETS:
+        return
+    preset = MODE_PRESETS[args.autotune]
+    if not getattr(args, "_explicit_autotune_schemes_per_region", False):
+        args.autotune_schemes_per_region = preset["schemes_per_region"]
+    if not getattr(args, "_explicit_autotune_warmup_runs", False):
+        args.autotune_warmup_runs = preset["warmup_runs"]
+    if not getattr(args, "_explicit_autotune_timing_runs", False):
+        args.autotune_timing_runs = preset["timing_runs"]
+
+
 def main():
     """Command-line entrypoint for ONNX PTQ."""
     args = get_parser().parse_args()
@@ -331,6 +455,14 @@ def main():
     else:
         raise
 
+    # Autotune configs
+    autotune_enabled = args.autotune is not None
+    if autotune_enabled:
+        apply_mode_presets(args)
+    autotune_node_filter_list = (
+        get_node_filter_list(args.autotune_node_filter_list) if autotune_enabled else None
+    )
+
     quantize(
         args.onnx_path,
         quantize_mode=args.quantize_mode,
@@ -362,6 +494,19 @@ def main():
         calibrate_per_node=args.calibrate_per_node,
         direct_io_types=args.direct_io_types,
         opset=args.opset,
+        autotune=autotune_enabled,
+        autotune_output_dir=args.autotune_output_dir,
+        autotune_num_schemes_per_region=args.autotune_schemes_per_region,
+        autotune_pattern_cache_file=args.autotune_pattern_cache_file,
+        autotune_state_file=args.autotune_state_file,
+        autotune_qdq_baseline=args.autotune_qdq_baseline,
+        autotune_node_filter_list=autotune_node_filter_list,
+        autotune_verbose=args.autotune_verbose,
+        autotune_use_trtexec=args.autotune_use_trtexec,
+        autotune_timing_cache=args.autotune_timing_cache,
+        autotune_warmup_runs=args.autotune_warmup_runs,
+        autotune_timing_runs=args.autotune_timing_runs,
+        autotune_trtexec_args=args.autotune_trtexec_args,
     )
 
 
```
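The preset logic in `apply_mode_presets` only works because every overridable flag is registered with `StoreWithExplicitFlag`, the argparse action this PR moves into `autotune.utils` (its previous body is visible in the `autotune/__main__.py` diff further down). Below is a minimal standalone sketch of the mechanism, with placeholder numeric values rather than the real `MODE_PRESETS` numbers:

```python
import argparse


class StoreWithExplicitFlag(argparse.Action):
    """Store the value and mark it as explicitly set so presets do not override it."""

    def __init__(self, explicit_attr, *args, **kwargs):
        self._explicit_attr = explicit_attr
        super().__init__(*args, **kwargs)

    def __call__(self, parser, namespace, values, option_string=None):
        setattr(namespace, self.dest, values)
        setattr(namespace, self._explicit_attr, True)


parser = argparse.ArgumentParser()
parser.add_argument(
    "--autotune_warmup_runs",
    type=int,
    default=5,  # placeholder; the real default is MODE_PRESETS["default"]["warmup_runs"]
    action=StoreWithExplicitFlag,
    explicit_attr="_explicit_autotune_warmup_runs",
)

args = parser.parse_args(["--autotune_warmup_runs", "20"])
# apply_mode_presets() only touches options that were NOT passed explicitly:
if not getattr(args, "_explicit_autotune_warmup_runs", False):
    args.autotune_warmup_runs = 3  # placeholder value a preset such as "quick" might set
print(args.autotune_warmup_runs)  # -> 20, the explicit CLI value wins
```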

modelopt/onnx/quantization/autotune/__init__.py

Lines changed: 7 additions & 0 deletions
```diff
@@ -20,6 +20,9 @@
 region analysis to efficiently explore and optimize Q/DQ insertion strategies.
 """
 
+# Expose Autotune modes
+from .__main__ import MODE_PRESETS
+
 # Core data structures
 from .autotuner import QDQAutotuner
 from .benchmark import TensorRTPyBenchmark, TrtExecBenchmark
@@ -42,8 +45,10 @@
 )
 from .region_pattern import RegionPattern
 from .region_search import CombinedRegionSearch
+from .utils import StoreWithExplicitFlag, get_node_filter_list
 
 __all__ = [
+    "MODE_PRESETS",
     "AutotunerError",
     "AutotunerNotInitializedError",
     "ChildRegionInputInsertionPoint",
@@ -60,6 +65,8 @@
     "RegionPattern",
     "RegionType",
     "ResolvedInsertionPoint",
+    "StoreWithExplicitFlag",
     "TensorRTPyBenchmark",
     "TrtExecBenchmark",
+    "get_node_filter_list",
 ]
```
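`MODE_PRESETS` itself is defined in `autotune/__main__.py` and only re-exported here; its contents are not shown in this commit excerpt. Based on the keys referenced by the CLI (`schemes_per_region`, `warmup_runs`, `timing_runs`) and the three modes, its shape is presumably something like the sketch below; the numbers are placeholders, not the real values:

```python
# Hypothetical shape of MODE_PRESETS; actual values live in autotune/__main__.py.
MODE_PRESETS = {
    "quick":     {"schemes_per_region": 2, "warmup_runs": 2,  "timing_runs": 5},
    "default":   {"schemes_per_region": 4, "warmup_runs": 5,  "timing_runs": 10},
    "extensive": {"schemes_per_region": 8, "warmup_runs": 10, "timing_runs": 20},
}
```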

modelopt/onnx/quantization/autotune/__main__.py

Lines changed: 10 additions & 52 deletions
```diff
@@ -21,6 +21,11 @@
 from pathlib import Path
 
 from modelopt.onnx.logging_config import logger
+from modelopt.onnx.quantization.autotune.utils import (
+    StoreWithExplicitFlag,
+    get_node_filter_list,
+    validate_file_path,
+)
 from modelopt.onnx.quantization.autotune.workflows import (
     init_benchmark_instance,
     region_pattern_autotuning_workflow,
@@ -44,18 +49,6 @@
 }
 
 
-class _StoreWithExplicitFlag(argparse.Action):
-    """Store the value and set an 'explicit' flag on the namespace so mode presets do not override."""
-
-    def __init__(self, explicit_attr: str, *args, **kwargs):
-        self._explicit_attr = explicit_attr
-        super().__init__(*args, **kwargs)
-
-    def __call__(self, parser, namespace, values, option_string=None):
-        setattr(namespace, self.dest, values)
-        setattr(namespace, self._explicit_attr, True)
-
-
 def apply_mode_presets(args) -> None:
     """Apply --mode preset to schemes_per_region, warmup_runs, timing_runs.
 
@@ -73,30 +66,6 @@ def apply_mode_presets(args) -> None:
         args.timing_runs = preset["timing_runs"]
 
 
-def validate_file_path(path: str | None, description: str) -> Path | None:
-    """Validate that a file path exists.
-
-    Args:
-        path: Path string to validate (can be None)
-        description: Description of the file for error messages
-
-    Returns:
-        Path object if valid, None if path is None
-
-    Raises:
-        SystemExit: If path is provided but doesn't exist
-    """
-    if path is None:
-        return None
-
-    path_obj = Path(path)
-    if not path_obj.exists():
-        logger.error(f"{description} not found: {path_obj}")
-        sys.exit(1)
-
-    return path_obj
-
-
 def log_benchmark_config(args):
     """Log TensorRT benchmark configuration for transparency.
 
@@ -155,20 +124,9 @@ def run_autotune() -> int:
         return 1
 
     try:
-        node_filter_list = None
-        if args.node_filter_list:
-            filter_file = validate_file_path(args.node_filter_list, "Node filter list file")
-            if filter_file:
-                with open(filter_file) as f:
-                    node_filter_list = [
-                        line.strip()
-                        for line in f
-                        if line.strip() and not line.strip().startswith("#")
-                    ]
-                logger.info(f"Loaded {len(node_filter_list)} filter patterns from {filter_file}")
-
+        node_filter_list = get_node_filter_list(args.node_filter_list)
         region_pattern_autotuning_workflow(
-            model_path=str(model_path),
+            model_or_path=str(model_path),
             output_dir=output_dir,
             num_schemes_per_region=args.num_schemes,
             pattern_cache_file=args.pattern_cache_file,
@@ -265,7 +223,7 @@ def get_parser() -> argparse.ArgumentParser:
         type=int,
         default=DEFAULT_NUM_SCHEMES,
         dest="num_schemes",
-        action=_StoreWithExplicitFlag,
+        action=StoreWithExplicitFlag,
         explicit_attr="_explicit_num_schemes",
         help=f"Schemes per region (default: {DEFAULT_NUM_SCHEMES}; preset from --mode if not set)",
     )
@@ -331,15 +289,15 @@
         "--warmup_runs",
         type=int,
         default=DEFAULT_WARMUP_RUNS,
-        action=_StoreWithExplicitFlag,
+        action=StoreWithExplicitFlag,
         explicit_attr="_explicit_warmup_runs",
         help=f"Number of warmup runs (default: {DEFAULT_WARMUP_RUNS}; preset from --mode applies if not set)",
     )
     trt_group.add_argument(
         "--timing_runs",
         type=int,
         default=DEFAULT_TIMING_RUNS,
-        action=_StoreWithExplicitFlag,
+        action=StoreWithExplicitFlag,
         explicit_attr="_explicit_timing_runs",
         help=f"Number of timing runs (default: {DEFAULT_TIMING_RUNS}; preset from --mode applies if not set)",
     )
```
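The inline filter-file parsing deleted above is what `get_node_filter_list` now encapsulates in `autotune.utils` (that file is not part of this excerpt). The sketch below is reconstructed from the deleted code; the real helper may differ in logging and error handling:

```python
from pathlib import Path


def get_node_filter_list(path: str | None) -> list[str] | None:
    """Load wildcard node-filter patterns, one per line, skipping blanks and '#' comments.

    Reconstructed from the inline code removed above; validation and logging details
    of the actual modelopt.onnx.quantization.autotune.utils helper may differ.
    """
    if path is None:
        return None
    filter_file = Path(path)
    if not filter_file.exists():
        raise FileNotFoundError(f"Node filter list file not found: {filter_file}")
    with open(filter_file) as f:
        return [
            line.strip()
            for line in f
            if line.strip() and not line.strip().startswith("#")
        ]
```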
