diff --git a/docs/source/reference/2_qdq_placement.rst b/docs/source/reference/2_qdq_placement.rst new file mode 100644 index 0000000000..491b200ca0 --- /dev/null +++ b/docs/source/reference/2_qdq_placement.rst @@ -0,0 +1,1117 @@ +==================================================== +Automatic ONNX Q/DQ Placement Optimizer Architecture +==================================================== + +.. contents:: Table of Contents + :local: + :depth: 3 + +Overview +======== + +The ``modelopt.onnx.quantization.autotune`` module provides an automatic optimization framework for Quantize/Dequantize (Q/DQ) node placement in ONNX models. The system partitions ONNX computation graphs into smaller regions and systematically searches for optimal Q/DQ insertion points to minimize TensorRT inference latency. + +**Key Capabilities:** + +* **Automatic Region Discovery**: Identifies optimization regions around compute-intensive operations +* **Pattern-Based Optimization**: Groups structurally similar regions and applies learned schemes across all instances +* **Performance-Driven Search**: Uses TensorRT profiling to measure actual inference latency and guide optimization +* **Incremental State Management**: Supports crash recovery and resumption of optimization sessions +* **Pattern Cache**: Enables warm-start optimization by reusing known-good schemes from previous runs +* **Baseline Import**: Transfers quantization patterns from existing QDQ models + +Architecture Overview +===================== + +Core Design Principles +---------------------- + +1. **Hierarchical Region Partitioning**: This module decomposes ONNX graphs into a hierarchical tree of regions, enabling focused optimization at different granularity levels. + +2. **Pattern-Based Scheme Sharing**: Regions with identical topological structure share the same pattern signature. Optimization schemes are portable across all regions matching a pattern, reducing the search space significantly. + +3. 
**Performance-Driven Selection**: Every insertion scheme is evaluated through actual TensorRT engine compilation and profiling, ensuring real-world performance gains. + +4. **Incremental Optimization**: Regions are optimized sequentially with the best scheme committed before proceeding to the next region, allowing progressive refinement. + +Module Structure +---------------- + +.. code-block:: text + + autotune/ + ├── Core API + │ ├── autotuner.py # QDQAutotuner (automatic region discovery) + │ ├── autotuner_base.py # QDQAutotunerBase (core optimization logic) + │ ├── workflows.py # High-level workflow and benchmark helpers + │ └── common.py # Data structures (Region, Config, PatternCache, etc.) + │ + ├── Region Management + │ ├── region_search.py # CombinedRegionSearch (region discovery) + │ ├── region_pattern.py # RegionPattern (structural pattern matching) + │ └── region_inspect.py # CLI to inspect region search (debugging) + │ + ├── Q/DQ Insertion & Export + │ ├── insertion_points.py # Insertion point types and resolution + │ └── export_utils.py # Q/DQ node creation and ONNX export + │ + ├── Benchmarking + │ └── benchmark.py # TensorRTPyBenchmark, TrtExecBenchmark + │ + └── Entry Points + │ ├── __init__.py # Public API exports + │ └── __main__.py # Command-line interface + │ + Q/DQ analysis (used for baseline import) lives in the parent package: + modelopt.onnx.quantization.qdq_utils (e.g. get_quantized_tensors). + + +Key Components +============== + +1. Autotuner (autotuner.py, autotuner_base.py) +------------------------------------------------ + +The autotuner is the central orchestrator of the Q/DQ optimization process. 
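The incremental, performance-driven loop these classes implement can be sketched with a toy stand-in for TensorRT profiling. All names below (``optimize_regions``, the scheme labels, the latency model) are hypothetical illustrations, not the module's API; the real autotuner measures latency by building TensorRT engines.

```python
def optimize_regions(regions, candidates, measure, threshold=1.02):
    """Toy greedy per-region loop: a candidate scheme is accepted only if
    the committed baseline divided by its latency meets `threshold`
    (1.02 = at least 2% faster); the best scheme is committed before
    moving to the next region."""
    committed = {}
    baseline = measure(committed)  # latency with nothing committed
    for region in regions:
        best_scheme, best_latency = None, baseline
        for scheme in candidates[region]:
            latency = measure({**committed, region: scheme})
            if baseline / latency >= threshold and latency < best_latency:
                best_scheme, best_latency = scheme, latency
        if best_scheme is not None:  # commit before the next region
            committed[region] = best_scheme
            baseline = best_latency
    return committed, baseline


# Toy latency model (hypothetical scheme names, effects in ms).
effects = {"A_good": -2.0, "A_bad": 1.0, "B_good": -0.5, "B_tiny": -0.05}


def measure(committed):
    return 10.0 + sum(effects[s] for s in committed.values())


committed, final_latency = optimize_regions(
    ["A", "B"],
    {"A": ["A_bad", "A_good"], "B": ["B_tiny", "B_good"]},
    measure,
)
```

Note how ``B_tiny`` is rejected: it improves latency, but not by the 2% required by the threshold, mirroring the ``performance_threshold`` behavior described later in the configuration section.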
+ +QDQAutotunerBase +~~~~~~~~~~~~~~~~ + +Base class (in ``autotuner_base.py``) providing core optimization functionality: + +* **Scheme Generation**: Creates candidate Q/DQ insertion schemes for regions +* **Model Export**: Generates ONNX models with specified Q/DQ insertions applied +* **Performance Tracking**: Records and ranks schemes by measured latency +* **State Persistence**: Saves/loads optimization progress for crash recovery + +**Key Attributes:** + +* ``graph``: Clean ONNX GraphSurgeon representation of the model +* ``regions``: List of regions to optimize (populated by subclass) +* ``profiled_patterns``: Pattern-based scheme results +* ``current_profile_region``: Region currently being optimized +* ``config``: Configuration parameters +* ``pattern_cache``: Seed schemes from previous optimization runs + +**Workflow Methods:** + +* ``initialize(config, pattern_cache)``: Configure autotuner and prepare for profiling +* ``set_profile_region(region, commit)``: Select region to profile and commit previous results +* ``generate()``: Generate a new insertion scheme for current region +* ``export_onnx(path_or_none, insert_qdq, best=False)``: Export model with Q/DQ nodes. If path is ``None``, returns serialized model bytes (for in-memory benchmarking). When ``best=True``, exports using the current region's best scheme so far. +* ``submit(latency, success)``: Record performance measurement for current scheme +* ``save_state(path)`` / ``load_state(path)``: Persist/restore optimization state + +QDQAutotuner +~~~~~~~~~~~~ + +Concrete implementation with automatic region discovery: + +* Inherits from ``QDQAutotunerBase`` +* Automatically discovers regions during initialization using ``CombinedRegionSearch`` +* For custom partitioning, users can implement their own region search by subclassing ``RegionSearchBase`` and overriding ``_search_regions()`` in a subclass of ``QDQAutotuner`` to use it. + +**Initialization Process:** + +1. 
Constructs root region encompassing entire graph +2. Runs combined region search to identify optimization candidates +3. Prepares region hierarchy for sequential optimization + +2. Region Management +-------------------- + +Region Partitioning (region_search.py) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The region search module implements hierarchical partitioning strategies to decompose ONNX graphs into optimization regions. + +**CombinedRegionSearch** + +Multi-strategy region discovery combining: + +* **Pattern-Based Search**: Identifies common subgraph patterns (Conv+BN+Relu, etc.) +* **Operation-Centered Search**: Creates regions around major quantizable operations (Conv, MatMul, Gemm) +* **Sequence Merging**: Combines adjacent linear operations into single regions +* **Hierarchical Composition**: Builds multi-level region trees + +**Region Discovery Algorithm:** + +1. **Bottom-Up Search**: Start from individual operations +2. **Local Expansion**: Expand forward/backward from seed nodes within step limits +3. **Pattern Recognition**: Identify and merge common computational patterns +4. **Hierarchy Construction**: Build parent-child relationships between regions + +**Key Classes:** + +* ``RegionSearchBase``: Base class with graph traversal utilities +* ``CombinedRegionSearch``: Main region discovery implementation + +**Region Types:** + +* ``LEAF``: Atomic regions containing only direct nodes +* ``COMPOSITE``: Hierarchical regions containing child regions + +Region Pattern Analysis (region_pattern.py) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Provides structural pattern matching for regions, enabling scheme portability. 
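The core idea of pattern matching is that structurally identical regions hash to the same signature. A simplified, self-contained sketch of such a signature (``toy_pattern_signature`` is hypothetical; the real signature additionally folds in child-region patterns and symmetry normalization):

```python
import hashlib


def toy_pattern_signature(nodes, edges):
    """Simplified structural signature: op types plus connectivity.

    `nodes` maps a region-local node index to its op type; `edges` is a
    set of (producer_index, consumer_index) pairs.
    """
    ops = ",".join(f"{i}:{op}" for i, op in sorted(nodes.items()))
    conn = ",".join(f"{a}->{b}" for a, b in sorted(edges))
    return hashlib.sha256(f"{ops}|{conn}".encode()).hexdigest()[:16]


# Two structurally identical Conv+Relu regions share one signature,
# so a scheme tuned on one applies to the other.
sig_a = toy_pattern_signature({0: "Conv", 1: "Relu"}, {(0, 1)})
sig_b = toy_pattern_signature({0: "Conv", 1: "Relu"}, {(0, 1)})
```

Because the signature depends only on region-local structure, not tensor names, the same key is produced for every region matching the pattern anywhere in the graph.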
+ +**RegionPattern Class** + +Represents the topological signature of a region: + +* **Signature Generation**: Creates deterministic hash from region structure + + - Node operation types + - Connectivity patterns (inputs/outputs per node) + - Child region structures (for composite regions) + - Handles symmetric operations (Add, Mul) order-invariantly + +* **Pattern Matching**: Groups regions by structural similarity +* **Insertion Point Resolution**: Resolves pattern-relative addresses to actual tensor names + +**Signature Components:** + +.. code-block:: text + + Pattern Signature = hash( + node_types_sorted + + connectivity_structure + + child_region_patterns + + symmetry_normalization + ) + +**Key Methods:** + +* ``from_region(region, graph)``: Generate pattern from region + +Region Inspection (region_inspect.py) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +CLI and helper for debugging region discovery without running benchmarks: + +* **Entry point**: ``python -m modelopt.onnx.quantization.autotune.region_inspect --model model.onnx`` +* **inspect_region_search(onnx_path, max_sequence_size=10, include_all_regions=False)**: Loads the model, runs ``CombinedRegionSearch``, and prints region hierarchy, node counts, and summary statistics. Returns the list of discovered regions. +* **Options**: ``--verbose`` / ``-v`` for debug logging; ``--max-sequence-size`` for sequence region size; ``--include-all-regions`` to include regions without major quantizable ops (Conv, MatMul, etc.). + +Use this to verify how the autotuner partitions the graph before or during tuning. + +3. Q/DQ Insertion Points +------------------------ + +Insertion Point Types (insertion_points.py) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Defines three types of Q/DQ insertion locations: + +**NodeInputInsertionPoint** + +Inserts Q/DQ at a specific node input: + +* Pattern-relative node index +* Input tensor index (0, 1, 2, ...) 
+* Most common insertion type for quantizing operation inputs + +**RegionOutputInsertionPoint** + +Inserts Q/DQ at region output, only used for composite regions: + +* Pattern-relative child region index +* Output tensor index from that region +* Used for composite regions with child boundaries + +**ChildRegionInputInsertionPoint** + +Inserts Q/DQ at a child region input boundary: + +* Pattern-relative child region index +* Input tensor index to that region +* Enables quantization of data flowing into subregions + +**InsertionScheme** + +Collection of insertion points with performance metadata: + +* Set of insertion points (pattern-relative) +* Measured latency (ms) +* Success/failure status +* Unique fingerprint for deduplication + +**Resolution process** + +Insertion points and schemes are pattern-relative (node/region indices within the pattern), so the same scheme applies to every region that matches the pattern. Before adding Q/DQ nodes to the ONNX graph, the autotuner resolves them to concrete tensor names in the current model: + +1. Take pattern-relative insertion points from the scheme +2. Map node/region indices to actual graph elements for the target region +3. Resolve to concrete tensor names (producer/consumer) +4. Merge and deduplicate so each tensor gets at most one Q/DQ pair +5. Create and insert Q/DQ nodes at the resolved locations (see ``export_utils``) + +Q/DQ Analysis (modelopt.onnx.quantization.qdq_utils) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The autotune package uses the parent quantization package for Q/DQ analysis: + +* **get_quantized_tensors(onnx_model)** — Returns the set of tensor names that have Q/DQ nodes in the given ONNX model. Used by the workflow when ``qdq_baseline_model`` is provided to import insertion patterns from an existing quantized model for warm-start. + +4. Workflows (workflows.py) +--------------------------- + +High-level workflow functions orchestrating the complete optimization process. 
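Before walking through the workflow, the pattern-relative resolution described in section 3 can be made concrete with a small sketch. It uses plain dictionaries and hypothetical tensor names, not the module's API:

```python
def resolve_scheme(scheme, region_nodes):
    """Toy resolution: map pattern-relative (node_index, input_index)
    points to concrete tensor names for one region, deduplicated so
    each tensor receives at most one Q/DQ pair.

    `region_nodes` lists, per pattern-relative node index, that node's
    input tensor names in this particular region.
    """
    tensors = []
    for node_index, input_index in scheme:
        name = region_nodes[node_index][input_index]
        if name not in tensors:  # merge duplicate insertion points
            tensors.append(name)
    return tensors


# One scheme, two regions matching the same pattern (hypothetical names).
scheme = [(0, 0), (1, 0), (1, 0)]  # the duplicate point is merged
region_a = [["input"], ["conv1_out"]]
region_b = [["mid"], ["conv5_out"]]
```

The same pattern-relative scheme resolves to different concrete tensors per region, which is what makes one optimization result portable across all regions sharing a pattern.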
+ +region_pattern_autotuning_workflow +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Main workflow for pattern-based Q/DQ optimization: + +**Workflow Steps:** + +1. **Initialization** + + * Load ONNX model + * Create autotuner with automatic region discovery + * Load pattern cache (if provided) + * Import patterns from QDQ baseline (if provided) + +2. **Baseline Measurement** + + * Export model without Q/DQ nodes + * Benchmark with TensorRT to establish baseline latency + +3. **Region Profiling Loop** + + For each discovered region: + + * Set as current profile region + * Generate N insertion schemes (default: 30) + * For each scheme: + + - Export ONNX model with Q/DQ nodes applied + - Build TensorRT engine and measure latency + - Submit result to autotuner + + * Commit best scheme for region + * Save incremental state (crash recovery) + +4. **Finalization** + + * Export final optimized model with all best schemes + * Measure final latency and compute speedup + * Save complete state and pattern cache + +**Key Features:** + +* **Automatic Resume**: Detects existing state file and continues from last checkpoint +* **Pattern Cache Warm-Start**: Seeds scheme generation with known-good patterns +* **Baseline Import**: Extracts quantization patterns from existing QDQ models +* **Progressive Saving**: State saved after each region for crash recovery + +Benchmarking Functions (workflows.py) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +* ``benchmark_onnx_model(model_path, log_file=None, flush_timing_cache=False)``: Run global benchmark on ONNX model (path or bytes); returns median latency in ms or ``float('inf')`` on failure. +* ``init_benchmark_instance(use_trtexec=False, plugin_libraries=None, timing_cache_file=None, warmup_runs=5, timing_runs=20, trtexec_args=None)``: Initialize the global TensorRT benchmark used by the workflow (must be called before ``benchmark_onnx_model``). + +5. 
Benchmarking (benchmark.py) +------------------------------- + +TensorRT integration for performance measurement. + +Benchmark Classes +~~~~~~~~~~~~~~~~~ + +**Abstract interface (Benchmark):** + +* ``run(model_path, log_file=None, flush_timing_cache=False)``: Benchmark model (file path or bytes) and return median latency (ms). + +**TensorRTPyBenchmark** (default) + +Uses TensorRT Python API: + +* Direct Python bindings to TensorRT +* Persistent Builder/Runtime/Logger instances +* Efficient for repeated benchmarking +* Timing cache support for faster engine builds +* Optional plugin library paths (list of ``.so`` paths) + +**TrtExecBenchmark** (optional, ``--use_trtexec``) + +Uses ``trtexec`` command-line tool: + +* Spawns subprocess per benchmark +* Useful when Python API is unavailable or for remote autotuning (e.g. ``--trtexec_benchmark_args "--remoteAutoTuningConfig=..."``) +* Supports same timing cache, warmup/timing runs, and plugin libraries +* ``trtexec_args``: optional list of extra arguments passed to trtexec + +**Benchmarking process:** + +1. Parse ONNX model (from path or bytes) +2. Build TensorRT engine with optimization +3. Load timing cache (if available) +4. Warmup iterations (default: 5) +5. Timing iterations (default: 20) +6. Median latency reported; timing cache updated + +**Configuration:** + +* ``timing_cache_file``: Path to TensorRT timing cache (default: system temp ``trtexec_timing.cache``) +* ``warmup_runs``: Warmup iterations (default: 5) +* ``timing_runs``: Timed iterations (default: 20) +* ``plugin_libraries``: List of TensorRT plugin ``.so`` paths (optional) +* ``trtexec_args``: Extra arguments for trtexec (optional; only when ``use_trtexec=True``) + +6. Configuration (common.py) +----------------------------- + +Config Class +~~~~~~~~~~~~ + +Central configuration for autotuning behavior. 
Controls the autotuning process including +performance requirements, quantization parameters, region building, scheme generation, and +pattern cache behavior. + +**Logging:** + +* ``verbose`` (bool): Enable detailed logging of autotuning progress (default: False) + +**Performance:** + +* ``performance_threshold`` (float): Minimum speedup ratio to accept a scheme; 1.0 = no improvement required, 1.02 = 2% (default: 1.02) + +**Quantization Parameters:** + +* ``default_q_scale`` (float): Default scale parameter for Q/DQ nodes. Typical range: 0.01-0.1 (default: 0.1) +* ``default_q_zero_point`` (int): Default zero-point for Q/DQ nodes; 0 for signed int8, 128 for uint8 (default: 0) +* ``default_quant_type`` (str): Quantization type for Q/DQ nodes: "int8" (default), "fp8" +* ``default_dq_dtype`` (str): Dtype for DequantizeLinear output when not inferred: "float32" (default), "float16", "bfloat16" + +**Region Builder Settings:** + +* ``maximum_sequence_region_size`` (int): Maximum number of nodes in a sequence region during + top-down refinement. Prevents overly large merged regions (default: 10) +* ``minimum_topdown_search_size`` (int): Minimum number of nodes in a region to trigger + top-down search during region building (default: 10) + +**Scheme Generation Settings:** + +* ``top_percent_to_mutate`` (float): Top percentage of best schemes to use as mutation seeds + during scheme generation. Range: 0.0-1.0 (default: 0.1 = top 10%) +* ``minimum_schemes_to_mutate`` (int): Minimum number of schemes to keep as mutation seeds, + even if top_percent_to_mutate results in fewer (default: 10) +* ``maximum_mutations`` (int): Maximum number of mutations to apply to a single scheme + during generation (default: 3) +* ``maximum_generation_attempts`` (int): Maximum attempts to generate a unique new scheme + before giving up (default: 100) + +**Pattern Cache Settings:** + +* ``pattern_cache_minimum_distance`` (int): Minimum edit distance required between schemes in cache. 
+ When adding schemes, if a scheme is too similar (distance < minimum_distance) to an existing + scheme, only the better-performing one is kept (default: 4) +* ``pattern_cache_max_entries_per_pattern`` (int): Maximum number of schemes to keep per pattern + in pattern cache. Only the top N best-performing schemes are kept for each pattern. + Use 0 to keep all schemes (default: 32) + +**Example:** + +.. code-block:: python + + from modelopt.onnx.quantization.autotune import Config + + config = Config( + default_quant_type="fp8", + default_dq_dtype="float16", + default_q_scale=0.05, + top_percent_to_mutate=0.2, + maximum_mutations=5, + pattern_cache_minimum_distance=2, + pattern_cache_max_entries_per_pattern=64, + verbose=True, + ) + +PatternCache Class +~~~~~~~~~~~~~~~~~~ + +Stores top-performing schemes for pattern-based warm-start: + +* Maps pattern signatures to ``PatternSchemes`` +* Maintains diversity through distance-based filtering +* Limits entries per pattern to avoid bloat +* Serializable to YAML for persistence + +**Cache Operations:** + +* ``add_pattern_schemes(pattern_schemes)``: Add a ``PatternSchemes`` instance (with diversity filtering) +* ``get_pattern_schemes(pattern_signature)``: Return ``PatternSchemes`` for a pattern signature, or ``None`` +* ``save(path)`` / ``load(path)``: Persist cache to YAML / load from YAML + +Region Class +~~~~~~~~~~~~ + +Hierarchical subgraph representation: + +**Attributes:** + +* ``id``: Unique identifier +* ``level``: Hierarchical level (0=leaf, higher=composite) +* ``type``: RegionType (LEAF/COMPOSITE) +* ``parent``: Parent region reference +* ``children``: List of child regions +* ``nodes``: Set of direct node indices +* ``inputs``: Input tensor names +* ``outputs``: Output tensor names + +**Methods:** + +* Hierarchy navigation (parent/children access) +* Node management (direct vs recursive nodes) +* Boundary computation (inputs/outputs) +* Metadata storage + + +Autotuning Workflow +=================== + +Complete 
Optimization Process +------------------------------ + +.. code-block:: text + + ┌─────────────────────────────────────────────────────────────┐ + │ 1. Model Loading & Initialization │ + │ • Load ONNX model │ + │ • Create QDQAutotuner instance │ + │ • Run automatic region discovery │ + │ • Load pattern cache (warm-start) │ + │ • Import patterns from QDQ baseline (optional) │ + └────────────────────┬────────────────────────────────────────┘ + │ + ▼ + ┌─────────────────────────────────────────────────────────────┐ + │ 2. Baseline Measurement │ + │ • Export model without Q/DQ nodes │ + │ • Build TensorRT engine │ + │ • Measure baseline latency │ + │ • Submit to autotuner │ + └────────────────────┬────────────────────────────────────────┘ + │ + ▼ + ┌─────────────────────────────────────────────────────────────┐ + │ 3. Pattern-Based Region Profiling │ + │ ┌───────────────────────────────────────────┐ │ + │ │ For each region: │ │ + │ │ • Set as current profile region │ │ + │ │ • Check if pattern already profiled │ │ + │ │ • Generate N insertion schemes │ │ + │ │ ┌─────────────────────────────┐ │ │ + │ │ │ For each scheme: │ │ │ + │ │ │ • Generate unique scheme │ │ │ + │ │ │ • Export model with Q/DQ │ │ │ + │ │ │ • Build TRT engine │ │ │ + │ │ │ • Measure latency │ │ │ + │ │ │ • Submit result │ │ │ + │ │ └─────────────────────────────┘ │ │ + │ │ • Select best scheme for pattern │ │ + │ │ • Commit scheme (applies to all │ │ + │ │ regions with this pattern) │ │ + │ │ • Save incremental state │ │ + │ └───────────────────────────────────────────┘ │ + └────────────────────┬────────────────────────────────────────┘ + │ + ▼ + ┌─────────────────────────────────────────────────────────────┐ + │ 4. 
Finalization │ + │ • Commit final region │ + │ • Export optimized model with all best schemes │ + │ • Measure final latency │ + │ • Compute speedup ratio │ + │ • Save complete state file │ + │ • Save pattern cache for future runs │ + └─────────────────────────────────────────────────────────────┘ + +Scheme Generation Process +-------------------------- + +For each region being profiled: + +1. **Pattern Identification**: Compute structural pattern signature +2. **Pattern Schemes Initialization**: Create or retrieve ``PatternSchemes`` for pattern +3. **Cache Seeding**: Add schemes from pattern cache (warm-start) +4. **Iterative Generation**: Generate new schemes up to configured limit + + * Random selection of insertion points + * Diversity filtering (avoid duplicates) + * Pattern-relative addressing + +5. **Evaluation**: Each scheme is exported, benchmarked, and ranked +6. **Best Selection**: Scheme with lowest latency becomes pattern's best scheme + +Pattern-Relative Addressing +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Schemes are defined using pattern-relative indices: + +.. code-block:: python + + # Pattern-relative insertion point + NodeInputInsertionPoint(node_index=2, input_index=0) + + # Resolved to actual tensor for Region A + "conv1_output" # Node 2 in Region A's pattern + + # Resolved to actual tensor for Region B (same pattern) + "conv5_output" # Node 2 in Region B's pattern + +This portability enables: + +* One optimization per pattern instead of per region +* Transfer learning across similar models +* Significant reduction in search space + +State Management +---------------- + +Incremental State Saving +~~~~~~~~~~~~~~~~~~~~~~~~~ + +State is saved after each region optimization: + +**State File Contents (YAML):** + +State is saved to ``autotuner_state.yaml`` (or ``--state_file``). The pattern cache is saved alongside as ``_pattern_cache.yaml`` (e.g. ``autotuner_state_pattern_cache.yaml``). + +.. 
code-block:: yaml + + baseline_latency_ms: 12.5 + current_profile_pattern_schemes_signature: null # or pattern sig if interrupted mid-region + config: { ... } + patterns: + - pattern_signature: "abc123..." + schemes: [...] + best_scheme_index: 0 + - pattern_signature: "def456..." + schemes: [...] + best_scheme_index: 1 + +**Crash Recovery:** + +If optimization is interrupted: + +1. Rerun workflow with same output directory +2. State file is automatically detected and loaded +3. Already-profiled patterns are skipped +4. Optimization continues from next unprofiled region + +Pattern Cache +------------- + +Warm-Start Optimization +~~~~~~~~~~~~~~~~~~~~~~~ + +Pattern cache files store top-performing schemes: + +**Cache File Structure (YAML):** + +.. code-block:: yaml + + patterns: + pattern_def456: + signature: "def456..." + schemes: + - insertion_points: [...] + latency_ms: 9.8 + distance: 5 + - insertion_points: [...] + latency_ms: 10.1 + distance: 7 + max_entries: 16 + +**Usage:** + +1. After first optimization, pattern cache saved automatically +2. For similar models, load cache at initialization +3. Cache schemes tested first before random generation +4. Enables faster convergence to optimal solutions + +**Diversity Filtering:** + +* Schemes are filtered by minimum Hamming distance +* Ensures cache contains diverse candidates +* Prevents redundant similar schemes + +Region Discovery Details +======================== + +Hierarchical Partitioning Strategy +----------------------------------- + +The region search algorithm builds a hierarchical tree of regions: + +Level 0: Leaf Regions +~~~~~~~~~~~~~~~~~~~~~~ + +* Individual operations or small operation sequences +* Conv, MatMul, Gemm, Add, etc. 
+* Forward/backward expansion around seed nodes +* Direct boundary computation + +Level 1+: Composite Regions +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +* Merging of related leaf regions +* Pattern-based combination (Conv+BN+Relu) +* Sequence merging (Linear→Linear→Linear) +* Hierarchical boundaries (child inputs/outputs) + +Region Boundaries +----------------- + +Input Tensors +~~~~~~~~~~~~~ + +Tensors consumed by region nodes but produced outside: + +* From model inputs +* From nodes in other regions +* Used to determine Q/DQ insertion at region entry + +Output Tensors +~~~~~~~~~~~~~~ + +Tensors produced by region nodes and consumed outside: + +* By nodes in other regions +* As model outputs +* Used to determine Q/DQ insertion at region exit + +Boundary Computation Algorithm +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +1. Collect all tensors consumed by region nodes +2. Filter out tensors produced within region +3. Remaining = input boundary tensors +4. Collect all tensors produced by region nodes +5. Filter out tensors only consumed within region +6. Remaining = output boundary tensors + +Insertion Point Selection +========================= + +The autotuner uses the same three insertion point types described in **3. Q/DQ Insertion Points** (``NodeInputInsertionPoint``, ``RegionOutputInsertionPoint``, ``ChildRegionInputInsertionPoint``). In practice: **node input** quantization is the most common (e.g. at Conv/MatMul inputs); **region output** and **child region input** quantization apply at composite-region boundaries (e.g. residual connections) and enable hierarchical strategies. 
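The boundary computation algorithm above reduces to set arithmetic. A minimal sketch with hypothetical tensor names (the real implementation additionally accounts for model outputs and for tensors consumed both inside and outside the region):

```python
def region_boundaries(region_nodes):
    """Toy boundary computation for a region.

    `region_nodes` is a list of (inputs, outputs) tensor-name pairs for
    the nodes inside the region. Tensors consumed but not produced
    inside form the input boundary; tensors produced but not consumed
    inside form the output boundary (a simplification of step 5).
    """
    consumed = {t for ins, _ in region_nodes for t in ins}
    produced = {t for _, outs in region_nodes for t in outs}
    return consumed - produced, produced - consumed


# Conv -> Relu region (hypothetical tensor names).
nodes = [
    (["x", "w"], ["conv_out"]),    # Conv
    (["conv_out"], ["relu_out"]),  # Relu
]
ins, outs = region_boundaries(nodes)
```

Here ``conv_out`` is internal to the region, so only ``x``/``w`` and ``relu_out`` are candidate boundary locations for Q/DQ insertion.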
+
+Scheme Generation Strategies
+-----------------------------
+
+Random Sampling
+~~~~~~~~~~~~~~~
+
+* Randomly select subset of available insertion points
+* Probability-based selection (configurable)
+* Generates diverse candidate schemes
+* Default strategy for exploration
+
+Cache-Guided Sampling
+~~~~~~~~~~~~~~~~~~~~~
+
+* When pattern cache available, test cached schemes first
+* Provides warm-start for faster convergence
+* Falls back to random sampling after cache exhausted
+
+Diversity Filtering
+~~~~~~~~~~~~~~~~~~~
+
+* Compute Hamming distance between schemes
+* Reject schemes too similar to already-tested ones
+* Ensures exploration of diverse configurations
+* Minimum distance threshold configurable
+
+Performance Evaluation
+======================
+
+TensorRT Engine Building
+-------------------------
+
+For each scheme:
+
+1. **ONNX Export**: Generate model with Q/DQ nodes applied
+2. **Parser**: TensorRT parses ONNX graph
+3. **Optimization**: TensorRT layer fusion, kernel selection
+4. **Timing Cache**: Reuse measured kernel timings
+5. **Engine Build**: Generate optimized engine binary
+
+Latency Measurement
+-------------------
+
+Benchmarking Protocol:
+
+1. **Engine Loading**: Load built engine to GPU
+2. **Warmup Phase**: Run N iterations (default: 5)
+
+   * Eliminate cold-start effects
+   * Prime GPU caches
+
+3. **Timing Phase**: Run M iterations (default: 20)
+
+   * Measure end-to-end latency per iteration
+   * Synchronize GPU after each iteration
+
+4. **Aggregation**: Compute median latency (robust to outliers)
+
+Best Scheme Selection
+----------------------
+
+For each pattern:
+
+* Track all successfully benchmarked schemes
+* Rank by measured latency (lower is better)
+* Select scheme with minimum latency
+* Apply to all regions matching pattern
+
+Usage Patterns
+==============
+
+Command-Line Interface
+----------------------
+
+**Prog:** ``modelopt.onnx.quantization.autotune``. Argument names use underscores. 
Short options: ``-m`` (onnx_path), ``-o`` (output_dir), ``-s`` (schemes_per_region), ``-v`` (verbose). Run ``python -m modelopt.onnx.quantization.autotune --help`` for full help. + +Command-line arguments +^^^^^^^^^^^^^^^^^^^^^^ + +**Model and output** + +* ``--onnx_path``, ``-m`` (required) — Path to ONNX model file. +* ``--output_dir``, ``-o`` — Output directory for results. Default: ``./autotuner_output``. + +**Autotuning strategy** + +* ``--schemes_per_region``, ``-s`` — Number of schemes to test per region. Default: ``30``. +* ``--pattern_cache`` — Path to pattern cache YAML for warm-start. Default: ``None``. +* ``--qdq_baseline`` — Path to QDQ baseline ONNX model to import quantization patterns. Default: ``None``. +* ``--state_file`` — State file path for resume. Default: ``/autotuner_state.yaml``. +* ``--node_filter_list`` — Path to file of wildcard patterns (one per line); regions with no matching nodes are skipped. Default: ``None``. + +**Quantization** + +* ``--quant_type`` — Quantization data type. Choices: ``int8``, ``fp8``. Default: ``int8``. +* ``--default_dq_dtype`` — Default DQ output dtype when not deduced. Choices: ``float16``, ``float32``, ``bfloat16``. Default: ``float32``. + +**TensorRT benchmark** + +* ``--use_trtexec`` — Use trtexec for benchmarking instead of TensorRT Python API. Default: ``False``. +* ``--timing_cache`` — TensorRT timing cache file path. Default: system temp ``trtexec_timing.cache``. +* ``--warmup_runs`` — Number of warmup runs. Default: ``5``. +* ``--timing_runs`` — Number of timing runs. Default: ``20``. +* ``--plugin_libraries``, ``--plugins`` — TensorRT plugin libraries (``.so``), space-separated. Default: ``None``. +* ``--trtexec_benchmark_args`` — Extra arguments to trtexec as a single quoted string (e.g. ``'--fp16 --workspace=4096'`` or ``'--remoteAutoTuningConfig=...'``). Default: ``None``. + +**Logging** + +* ``--verbose``, ``-v`` — Enable verbose DEBUG logging. + +Basic Usage +~~~~~~~~~~~ + +.. 
code-block:: bash + + # Default INT8 quantization (output dir default: ./autotuner_output) + python -m modelopt.onnx.quantization.autotune --onnx_path model.onnx + + # Specify output and FP8 with more schemes + python -m modelopt.onnx.quantization.autotune \ + --onnx_path model.onnx \ + --output_dir ./output \ + --quant_type fp8 \ + --schemes_per_region 50 + +Advanced Usage +~~~~~~~~~~~~~~ + +.. code-block:: bash + + # Pattern cache warm-start + python -m modelopt.onnx.quantization.autotune \ + --onnx_path model.onnx \ + --output_dir ./output \ + --pattern_cache ./previous_run/autotuner_state_pattern_cache.yaml + + # Import patterns from existing QDQ model + python -m modelopt.onnx.quantization.autotune \ + --onnx_path model.onnx \ + --output_dir ./output \ + --qdq_baseline quantized_baseline.onnx + + # Custom state file and node filter (skip regions with no matching nodes) + python -m modelopt.onnx.quantization.autotune \ + --onnx_path model.onnx \ + --output_dir ./output \ + --state_file ./output/custom_state.yaml \ + --node_filter_list nodes_to_include.txt + + # Resume after interruption: rerun with same output_dir; state is auto-loaded + python -m modelopt.onnx.quantization.autotune \ + --onnx_path model.onnx \ + --output_dir ./output + + # Use trtexec and pass extra args (e.g. remote autotuning) + python -m modelopt.onnx.quantization.autotune \ + --onnx_path model.onnx \ + --output_dir ./output \ + --use_trtexec \ + --trtexec_benchmark_args "--remoteAutoTuningConfig=..." + + # Custom timing cache and DQ dtype + python -m modelopt.onnx.quantization.autotune \ + --onnx_path model.onnx \ + --output_dir ./output \ + --timing_cache /path/to/cache \ + --default_dq_dtype float16 + +Python API +---------- + +High-Level Workflow +~~~~~~~~~~~~~~~~~~~ + +.. code-block:: python + + from pathlib import Path + from modelopt.onnx.quantization.autotune.workflows import ( + region_pattern_autotuning_workflow, + ) + + # Pattern-based optimization (recommended). 
Call init_benchmark_instance first if not using CLI. + autotuner = region_pattern_autotuning_workflow( + model_path="model.onnx", + output_dir=Path("./output"), + num_schemes_per_region=30, + pattern_cache_file=None, + state_file=None, + quant_type="int8", + default_dq_dtype="float32", + qdq_baseline_model=None, + node_filter_list=None, + verbose=False, + ) + +Low-Level API +~~~~~~~~~~~~~ + +Initialize the global benchmark with ``init_benchmark_instance``, then use ``benchmark_onnx_model`` for measurements. The workflow uses this same global; when calling the workflow from Python you do not need to call ``init_benchmark_instance`` yourself (the CLI does it). + +.. code-block:: python + + import onnx + from modelopt.onnx.quantization.autotune import QDQAutotuner, Config + from modelopt.onnx.quantization.autotune.workflows import ( + init_benchmark_instance, + benchmark_onnx_model, + ) + + # Initialize benchmark (required before benchmark_onnx_model) + init_benchmark_instance( + use_trtexec=False, + timing_cache_file="/tmp/timing.cache", + warmup_runs=5, + timing_runs=20, + ) + + # Load and initialize autotuner + model = onnx.load("model.onnx") + autotuner = QDQAutotuner(model) + config = Config(default_quant_type="fp8") + autotuner.initialize(config) + + # Measure baseline + autotuner.export_onnx("baseline.onnx", insert_qdq=False) + baseline_latency = benchmark_onnx_model("baseline.onnx") + autotuner.submit(baseline_latency) + + # Profile each region + for region in autotuner.regions: + autotuner.set_profile_region(region, commit=True) + for _ in range(30): + scheme_idx = autotuner.generate() + if scheme_idx == -1: + break + model_bytes = autotuner.export_onnx(None, insert_qdq=True) + latency = benchmark_onnx_model(model_bytes) + autotuner.submit(latency, success=(latency != float("inf"))) + + # Finalize and export + autotuner.set_profile_region(None, commit=True) + autotuner.export_onnx("optimized.onnx", insert_qdq=True) + +Design Rationale +================ + 

Pattern-Based Optimization
--------------------------

The autotuner uses a pattern-based optimization approach:

**How It Works:**

* Regions with identical structural patterns are grouped together
* Each unique pattern is optimized once, with N schemes tested
* The best scheme for a pattern is automatically applied to all regions matching that pattern
* This dramatically reduces the number of benchmarks required

**Benefits:**

* **Efficiency**: Optimize each unique pattern once instead of every region independently
* **Consistency**: All structurally similar regions use the same quantization strategy
* **Scalability**: Optimization time scales with the number of unique patterns, not the total number of regions
* **Transfer Learning**: The pattern cache enables warm-starting on similar models

**Trade-offs:**

* Assumes structural similarity implies performance similarity
* May not capture performance variations due to different input/output contexts
* Models with many unique patterns see less benefit

**Best For:**

* Models with repeated structures (transformers, ResNets, etc.)
* Most production models where consistent quantization is desirable
* Scenarios where optimization time is constrained

Forward-Only Region Search
--------------------------

The current implementation focuses on forward (downstream) region expansion:

* Simpler boundary computation
* Aligns with typical dataflow (inputs → outputs)
* Sufficient for most optimization scenarios
* Backward expansion can be added if needed

Hierarchical vs Flat Regions
----------------------------

The hierarchical region structure provides:

* **Multi-Granularity Optimization**: Can optimize at different abstraction levels
* **Composability**: Child regions can be optimized independently
* **Scalability**: Handles large models by partitioning them into manageable pieces
* **Pattern Reuse**: Patterns can be defined at multiple levels

Incremental State Saving
------------------------

State is saved after each region instead of only at the end:

* **Crash Recovery**: Long optimizations (hours or days) can be resumed
* **Early Access**: Partial results are available before completion
* **Debugging**: Intermediate state can be inspected
* **Resource Management**: Optimization can be paused and resumed as needed

Limitations and Future Work
===========================

Current Limitations
-------------------

1. **Search Space Exploration**

   * Random sampling may miss optimal configurations
   * No gradient-based or learned search strategies
   * The number of schemes per region is fixed

2. **Pattern Matching**

   * Assumes structural similarity implies performance similarity
   * May miss performance variations due to input data or context

3. **Quantization Types**

   * Uniform quantization for all Q/DQ nodes in a scheme
   * No mixed precision within schemes

4. **Benchmarking Overhead**

   * TensorRT engine build time dominates (even with a timing cache)
   * Each scheme requires a full engine rebuild

5.
   **Input Sensitivity**

   * Performance is measured on default/dummy inputs
   * May not generalize to all input distributions

Future Enhancements
-------------------

1. **Advanced Search Strategies**

   * Reinforcement learning-based exploration
   * Bayesian optimization for scheme selection
   * Evolutionary algorithms for population-based search

2. **Mixed-Precision Support**

   * Different quantization types per insertion point
   * Learnable precision selection
   * Per-layer quantization bit-width

3. **Accuracy Constraints**

   * Optimize for latency while maintaining an accuracy threshold
   * Multi-objective optimization (latency + accuracy)
   * Accuracy-aware scheme selection and evaluation
   * Integration with calibration and validation datasets
   * Pareto frontier exploration for latency/accuracy trade-offs

Glossary
========

.. glossary::

   Q/DQ Nodes
      QuantizeLinear (Q) and DequantizeLinear (DQ) nodes in ONNX that convert between
      floating-point and quantized integer representations.

   Region
      A hierarchical subgraph in an ONNX computation graph with well-defined input and
      output boundaries. Can be LEAF (atomic), COMPOSITE (containing child regions), or
      ROOT (the entire graph).

   Pattern
      A structural signature representing the topology and operation types in a region.
      Regions with identical patterns can share insertion schemes.

   Insertion Scheme
      A collection of insertion points specifying where to insert Q/DQ nodes within a
      region. Schemes use pattern-relative addressing for portability.

   Insertion Point
      A specific location where Q/DQ nodes can be inserted: at a node input, region
      output, or region boundary.

   Pattern-Relative Addressing
      Addressing scheme using indices relative to pattern structure rather than absolute
      graph positions, enabling scheme portability across regions with matching patterns.

   Pattern Cache
      Collection of top-performing insertion schemes for multiple patterns, used to
      warm-start optimization on similar models.

   Baseline Latency
      Inference latency of the original model without any Q/DQ nodes inserted, used as
      the reference for measuring optimization improvement.

   TensorRT Timing Cache
      Persistent cache of kernel performance measurements maintained by TensorRT to
      accelerate engine building by reusing previously measured timings.

   Scheme Diversity
      Measure of how different two insertion schemes are, typically computed as the
      Hamming distance between their insertion point sets.

References
==========

* **ONNX**: https://onnx.ai/
* **ONNX Technical Details** (numeric types, quantization-related): https://onnx.ai/onnx/technical/index.html
* **TensorRT Documentation**: https://docs.nvidia.com/deeplearning/tensorrt/
* **NVIDIA Model Optimizer (ModelOpt)**: https://github.com/NVIDIA/Model-Optimizer
* **ONNX GraphSurgeon**: https://github.com/NVIDIA/TensorRT/tree/main/tools/onnx-graphsurgeon
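As a small illustration of the Scheme Diversity entry in the glossary, the Hamming distance between two schemes' insertion point sets is the size of their symmetric difference. The point names below are hypothetical stand-ins; the module represents insertion points with richer objects than plain strings.

```python
def scheme_diversity(scheme_a: set, scheme_b: set) -> int:
    """Hamming distance: insertion points present in exactly one scheme."""
    return len(scheme_a ^ scheme_b)


# Hypothetical pattern-relative insertion point names.
a = {"node0.input0", "node2.input1", "region.output0"}
b = {"node0.input0", "node1.input0"}

print(scheme_diversity(a, b))  # → 3 (two points only in a, one only in b)
print(scheme_diversity(a, a))  # → 0 (identical schemes)
```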