
Add ResNet50 support for torch_onnx quantization workflow#1263

Open
ajrasane wants to merge 4 commits into main from ajarasane/resnet_support

Conversation


@ajrasane ajrasane commented Apr 14, 2026

Summary

  • Add end-to-end ResNet50 support in the torch_onnx quantization → ONNX export → TRT engine pipeline
  • Fix multiple Conv2d-related export issues that blocked Conv2d-heavy models from working with FP8/INT8/MXFP8/NVFP4/auto quantization modes
  • Fix configure_linear_module_onnx_quantizers to handle all modules with block quantization (not just nn.Linear), fixing NVFP4/MXFP8 export for models with quantized non-Linear modules
  • Add --trt_build flag to torch_quant_to_onnx.py and simplify test infrastructure

Files Changed

  • modelopt/torch/_deploy/utils/torch_onnx.py — Disable FP8 Conv2d weight quantizers and autocast during ONNX export
  • modelopt/torch/quantization/export_onnx.py — Fix configure_linear_module_onnx_quantizers for all module types with block quantization
  • examples/torch_onnx/torch_quant_to_onnx.py — Add --trt_build flag, calibration for FP8 override quantizers, Conv2d→FP8 override for auto mode, filter_func updates
  • examples/torch_onnx/README.md — Add ResNet50 to supported models table
  • tests/examples/torch_onnx/test_torch_quant_to_onnx.py — Add ResNet50 test entry, simplify using --trt_build
  • tests/_test_utils/torch/vision_models.py — Add ResNet50 to timm model registry

Quantization modes passing

  • ✅ FP8, INT8, MXFP8, NVFP4, Auto (all 5 modes pass export + TRT build)
  • INT4_AWQ excluded (pre-existing limitation for all models)

Test plan

  • All 5 resnet50 test modes pass: pytest tests/examples/torch_onnx/test_torch_quant_to_onnx.py -k resnet50 (5/5 passed)
  • Full regression: 18 passed, 2 failed (pre-existing swinv2_tiny fp8/int8 failures)

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • New Features
    • Added ResNet50 to supported ONNX export vision models with FP8, INT8, MXFP8, and NVFP4 quantization support.
    • Introduced TensorRT engine build capability via new --trt_build command-line flag for model optimization.
    • Enhanced quantization calibration workflow with improved FP8 model handling and expanded block-quantization support across module types.
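The engine-build step behind the new --trt_build flag can be sketched as follows. Per the review walkthrough, build_trt_engine() invokes trtexec through a list-based subprocess.run() call (no shell=True); the exact command line in torch_quant_to_onnx.py is not shown in this PR page, so the flags below are a plausible minimal set rather than the script's actual arguments.

```python
# Hedged sketch of a build_trt_engine() helper wired to --trt_build.
# --onnx and --saveEngine are standard trtexec flags; everything else here
# (function names, error handling) is illustrative, not copied from the PR.
import subprocess


def trtexec_command(onnx_path: str, engine_path: str) -> list[str]:
    """Build the trtexec argv; a list (rather than a shell string) avoids injection."""
    return [
        "trtexec",
        f"--onnx={onnx_path}",
        f"--saveEngine={engine_path}",
    ]


def build_trt_engine(onnx_path: str, engine_path: str) -> None:
    """Compile the exported ONNX model into a TensorRT engine via trtexec."""
    result = subprocess.run(
        trtexec_command(onnx_path, engine_path),
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        raise RuntimeError(f"trtexec failed:\n{result.stderr}")
```

The list-based command structure matches the practice noted in the security pre-merge check below.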

Add end-to-end support for ResNet50 (Conv2d-heavy model) in the
torch_onnx quantization → ONNX export → TRT engine pipeline.

Key fixes for Conv2d-heavy models:
- Disable FP8 Conv2d weight quantizers during ONNX export to avoid
  TorchScript exporter's "kernel of unknown shape" error (FP8
  DequantizeLinear produces dynamic-shape outputs incompatible with
  Conv2d's static kernel requirement)
- Disable autocast for FP8/INT8 quantized models during export (prevents
  dynamic-shape kernels from autocast-induced FP16 casting)
- Fix configure_linear_module_onnx_quantizers to handle all modules with
  block quantization (not just nn.Linear), fixing NVFP4/MXFP8 export for
  models with quantized non-Linear modules like MaxPool2d
- Add calibration step for FP8 override quantizers that aren't calibrated
  by mtq.quantize() in MXFP8/NVFP4 modes
- Override Conv2d block quantizers to FP8 in auto mode for TRT compat
- Add maxpool and global_pool to filter_func (TRT DynamicQuantize
  requires 2D/3D input, but pooling layers operate on 4D tensors)
- Always load calibration data (MXFP8 Conv2d FP8 overrides need it)

Signed-off-by: ajrasane <arasane@nvidia.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
@ajrasane ajrasane requested review from a team as code owners April 14, 2026 21:28

coderabbitai bot commented Apr 14, 2026

📝 Walkthrough

Walkthrough

Adds resnet50 to tests/docs; enhances the quantization/export pipeline to calibrate previously uncalibrated quantizers, override Conv2d block quantizers to FP8-like per-tensor settings for calibration, and temporarily disable FP8 Conv weight quantizers during ONNX export; adds a TensorRT build flag/step.

Changes

Cohort / File(s) — Summary

  • Docs & Tests (examples/torch_onnx/README.md, tests/_test_utils/torch/vision_models.py, tests/examples/torch_onnx/test_torch_quant_to_onnx.py): Added resnet50 to the README and test model list; the test no longer runs inline TensorRT verification and instead passes --trt_build to the example script.
  • Quantization CLI / Helpers (examples/torch_onnx/torch_quant_to_onnx.py): Extended the exclusion regex to skip downsample, maxpool, and global_pool; added _calibrate_uncalibrated_quantizers() and _override_conv2d_to_fp8() and wired them into quantize_model()/auto_quantize_model() to perform forward-pass calibration when a data_loader is provided; added --trt_build and build_trt_engine() to run trtexec.
  • ONNX Export Runtime (modelopt/torch/_deploy/utils/torch_onnx.py): Added a _disable_fp8_conv_weight_quantizers() context manager and included it in the export context when FP8 is present; INT8 models are now treated as requiring autocast disabled, and BF16/FP16 weight-conversion gating is blocked.
  • ONNX Quantizer Configuration (modelopt/torch/quantization/export_onnx.py): configure_linear_module_onnx_quantizers now sets _onnx_quantizer_type ("dynamic" vs "static") based on the presence of block_sizes on input/weight quantizers (block-quantization capability) rather than only for nn.Linear.

Sequence Diagram(s)

sequenceDiagram
    participant Script as Quantization Script
    participant Model as Model
    participant DL as DataLoader
    participant Quant as Quantizers
    participant Calib as Calibrator
    participant Export as ONNX Export

    Script->>Model: load model
    Script->>Model: quantize_model()/auto_quantize_model(data_loader)
    Model->>Quant: enumerate enabled quantizers
    Quant-->>Model: identify uncalibrated & Conv2d block quantizers
    Model->>Calib: _calibrate_uncalibrated_quantizers(data_loader)
    activate Calib
    Calib->>DL: iterate samples for calibration
    Calib->>Calib: enable/disable calib, load calib_amax
    deactivate Calib
    Model->>Model: _override_conv2d_to_fp8(data_loader) if Conv2d block quantizers found
    Model->>Calib: calibrate overridden Conv2d quantizers (forward loop)
    Script->>Export: get_onnx_bytes_and_metadata(model)
    Export->>Export: enter inference/autocast/quantizer contexts
    Export->>Export: enter _disable_fp8_conv_weight_quantizers() if FP8 present
    Export->>Model: export to ONNX
    Export-->>Script: return ONNX bytes + metadata

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
  • Description Check — ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed. The title accurately describes the main objective of the pull request (adding ResNet50 support to the torch_onnx quantization workflow) and is directly aligned with the changes across multiple files.
  • Docstring Coverage — ✅ Passed. Docstring coverage is 86.67%, above the required threshold of 80.00%.
  • Security Anti-Patterns — ✅ Passed. All modified Python files were reviewed against SECURITY.md practices. No critical security anti-patterns were found: no unsafe torch.load() or numpy.load(), no hardcoded trust_remote_code=True, no eval()/exec() on user input, and no nosec bypasses. The subprocess.run() call correctly uses a list-based command without shell=True, preventing command injection.



github-actions bot commented Apr 14, 2026

PR Preview Action v1.8.1


🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1263/

Built to branch gh-pages at 2026-04-14 22:07 UTC.
Preview will be ready when the GitHub Pages deployment is complete.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (3)
examples/torch_onnx/torch_quant_to_onnx.py (2)

164-164: Consider using the public amax property instead of checking internal attribute.

The code uses hasattr(quantizer, "_amax") to detect uncalibrated quantizers. Based on the TensorQuantizer implementation, the public amax property returns None when _amax is not set. Using quantizer.amax is None would be more aligned with the public API.

♻️ Suggested fix
-            if quantizer.is_enabled and not quantizer.block_sizes and not hasattr(quantizer, "_amax"):
+            if quantizer.is_enabled and not quantizer.block_sizes and quantizer.amax is None:

239-241: Prefer public property setters over internal attribute assignment.

The code directly assigns to _num_bits and _axis, which are internal attributes. Based on the relevant code snippet from tensor_quantizer.py, these have public property setters (num_bits and axis) that also update the calibrator state. Using the internal attributes may leave the calibrator out of sync.

♻️ Suggested fix
                 # Override to FP8 per-tensor
                 quantizer.block_sizes = None
-                quantizer._num_bits = (4, 3)
-                quantizer._axis = None
+                quantizer.num_bits = (4, 3)
+                quantizer.axis = None
                 quantizer.enable_calib()
modelopt/torch/quantization/export_onnx.py (1)

657-674: Consider restoring quantizer state after export (optional).

The context manager sets _onnx_quantizer_type but doesn't restore the original state in a finally block. While this is likely fine since models aren't typically reused after ONNX export, adding state restoration would make the context manager more robust for edge cases.

♻️ Optional: Add state restoration
 `@contextlib.contextmanager`
 def configure_linear_module_onnx_quantizers(model):
     """Sets the onnx export attributes for the given model.
     ...
     """
+    original_states = []
     for _, module in model.named_modules():
         if hasattr(module, "input_quantizer") and module.input_quantizer.block_sizes:
+            original_states.append((module.input_quantizer, "_onnx_quantizer_type", getattr(module.input_quantizer, "_onnx_quantizer_type", None)))
             module.input_quantizer._onnx_quantizer_type = "dynamic"
         if hasattr(module, "weight_quantizer") and module.weight_quantizer.block_sizes:
+            original_states.append((module.weight_quantizer, "_onnx_quantizer_type", getattr(module.weight_quantizer, "_onnx_quantizer_type", None)))
             module.weight_quantizer._onnx_quantizer_type = "static"
-    yield
+    try:
+        yield
+    finally:
+        for obj, attr, orig_val in original_states:
+            if orig_val is None:
+                if hasattr(obj, attr):
+                    delattr(obj, attr)
+            else:
+                setattr(obj, attr, orig_val)
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/torch_onnx/torch_quant_to_onnx.py`:
- Around line 161-165: Reformat the if-block to satisfy ruff: keep the same
logic but use consistent spacing and line breaks (wrap the long conditional in
parentheses across lines) and standardize quotes; specifically, keep the initial
guard "if not hasattr(module, attr_name): continue", assign quantizer =
getattr(module, attr_name), then write the conditional as e.g. "if
(quantizer.is_enabled and not quantizer.block_sizes and not hasattr(quantizer,
'_amax')): quantizer.enable_calib()" (or break into multiple indented lines) so
spacing, parentheses, and quotation are ruff-compliant while preserving behavior
of module, attr_name, quantizer, .is_enabled, .block_sizes, ._amax and
enable_calib().

In `@modelopt/torch/_deploy/utils/torch_onnx.py`:
- Around line 587-593: The conditional assignment to conv_wq_context is
misformatted for the project's linter; reformat the expression using a single
parenthesized conditional expression: set conv_wq_context =
(_disable_fp8_conv_weight_quantizers(model) if is_fp8_quantized(model) else
nullcontext()), referencing the existing symbols
_disable_fp8_conv_weight_quantizers, is_fp8_quantized, nullcontext, and model so
the ternary is on one line inside parentheses and passes ruff formatting.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: cb1433e9-ae94-4122-80a6-bc7f9a108588

📥 Commits

Reviewing files that changed from the base of the PR and between 73be810 and 56a1d21.

📒 Files selected for processing (6)
  • examples/torch_onnx/README.md
  • examples/torch_onnx/torch_quant_to_onnx.py
  • modelopt/torch/_deploy/utils/torch_onnx.py
  • modelopt/torch/quantization/export_onnx.py
  • tests/_test_utils/torch/vision_models.py
  • tests/examples/torch_onnx/test_torch_quant_to_onnx.py


codecov bot commented Apr 14, 2026

Codecov Report

❌ Patch coverage is 27.77778% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.40%. Comparing base (b6c6ec3) to head (ba020d5).
⚠️ Report is 1 commits behind head on main.

Files with missing lines:
  • modelopt/torch/_deploy/utils/torch_onnx.py — patch 31.25%, 11 lines missing ⚠️
  • modelopt/torch/quantization/export_onnx.py — patch 0.00%, 2 lines missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1263      +/-   ##
==========================================
- Coverage   76.90%   75.40%   -1.50%     
==========================================
  Files         350      350              
  Lines       40524    40539      +15     
==========================================
- Hits        31166    30570     -596     
- Misses       9358     9969     +611     
Flag coverage:
  • examples — 37.71% <16.66%> (-4.92%) ⬇️
  • gpu — 57.41% <27.77%> (-0.12%) ⬇️
  • unit — 55.60% <27.77%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

Reflect all supported quantization modes and Conv2d override behavior.

Signed-off-by: ajrasane <arasane@nvidia.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2


📥 Commits

Reviewing files that changed from the base of the PR and between 56a1d21 and 9810e1d.

📒 Files selected for processing (1)
  • examples/torch_onnx/torch_quant_to_onnx.py

Comment on lines +38 to +46

    Quantize a timm vision model and export to ONNX for TensorRT deployment.

    Supports FP8, INT8, MXFP8, NVFP4, INT4_AWQ, and AUTO (mixed-precision) quantization modes.

    The script will:
    -1. Given the model name, create a timm torch model.
    -2. Quantize the torch model in MXFP8, NVFP4, INT4_AWQ, or AUTO mode.
    -3. Export the quantized torch model to ONNX format.
    +1. Load a pretrained timm model (e.g., ViT, Swin, ResNet).
    +2. Quantize the model using the specified mode. For models with Conv2d layers,
    +   Conv2d quantization is automatically overridden for TensorRT compatibility
    +   (FP8 for MXFP8/NVFP4, INT8 for INT4_AWQ).

⚠️ Potential issue | 🟡 Minor

Don’t advertise INT4_AWQ as supported end-to-end here.

The PR objectives still call out INT4_AWQ as a known limitation, but this docstring now groups it with the working modes. Please caveat or remove it here so users do not assume this example is expected to succeed in that mode.

✏️ Suggested wording
-Supports FP8, INT8, MXFP8, NVFP4, INT4_AWQ, and AUTO (mixed-precision) quantization modes.
+Supports FP8, INT8, MXFP8, NVFP4, and AUTO (mixed-precision) quantization modes.
+`INT4_AWQ` remains a known limitation for this example.

Comment on lines +227 to +256
    def _override_conv2d_to_fp8(model, data_loader):
        """Override Conv2d layers with NVFP4/MXFP8 block quantization to FP8.

        TRT DynamicQuantize requires 2D/3D input, but Conv2d operates on 4D tensors.
        This overrides Conv2d block quantizers to FP8 per-tensor and calibrates them.
        """
        overridden = []
        for _, module in model.named_modules():
            if not isinstance(module, torch.nn.Conv2d):
                continue
            for attr_name in ("input_quantizer", "weight_quantizer"):
                if not hasattr(module, attr_name):
                    continue
                quantizer = getattr(module, attr_name)
                if quantizer.is_enabled and quantizer.block_sizes:
                    # Override to FP8 per-tensor
                    quantizer.block_sizes = None
                    quantizer._num_bits = (4, 3)
                    quantizer._axis = None
                    quantizer.enable_calib()
                    overridden.append(quantizer)

        if overridden:
            model.eval()
            with torch.no_grad():
                for batch in data_loader:
                    model(batch["image"])
            for quantizer in overridden:
                quantizer.disable_calib()
                quantizer.load_calib_amax()

⚠️ Potential issue | 🟠 Major

Apply the Conv2d TRT restriction before the auto-quant search.

mtq.auto_quantize() optimizes under the requested effective_bits budget and returns a search_state for the formats it actually evaluated. Rewriting any block-quantized Conv2d to FP8 afterwards means the exported model no longer matches that search space: Conv layers selected as NVFP4/MXFP8/INT4 now pay FP8 cost instead, so the final model can overshoot the target budget and the returned search_state becomes stale. Please push the Conv2d compatibility rule into the candidate configs before calling mtq.auto_quantize(), or recompute/validate the post-override budget before returning.

Also applies to: 307-309


Move TRT engine build logic into the script as a --trt_build flag,
removing the duplicate trtexec invocation from the test file.

Signed-off-by: ajrasane <arasane@nvidia.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (2)
examples/torch_onnx/torch_quant_to_onnx.py (2)

39-47: ⚠️ Potential issue | 🟡 Minor

Remove INT4_AWQ from the supported-modes docstring.

This still reads like end-to-end support, but the PR summary calls out INT4_AWQ as a known limitation for this example.

✏️ Suggested wording
-Supports FP8, INT8, MXFP8, NVFP4, INT4_AWQ, and AUTO (mixed-precision) quantization modes.
+Supports FP8, INT8, MXFP8, NVFP4, and AUTO (mixed-precision) quantization modes.
+`INT4_AWQ` remains a known limitation for this example.

228-257: ⚠️ Potential issue | 🟠 Major

Apply the Conv2d TRT restriction inside the auto-quant search.

mtq.auto_quantize() returns a search_state for the formats it actually evaluated. Rewriting selected Conv2d layers to FP8 afterwards means the exported model no longer matches that search space or its effective_bits budget, so the returned search_state is stale. Push this compatibility rule into the candidate configs before mtq.auto_quantize(), or recompute/validate the budget after the override.

Also applies to: 308-310

🧹 Nitpick comments (1)
examples/torch_onnx/torch_quant_to_onnx.py (1)

198-202: Avoid forcing calibration data for quantizers that are disabled later.

This now downloads calibration data for every non-auto run, then does the extra calibration pass before mtq.disable_quantizer(). For models whose Conv2d quantizers are later filtered out, that makes mxfp8/nvfp4 pay the Tiny-ImageNet + calibration cost even though no active override survives to export. Consider applying the filter before _calibrate_uncalibrated_quantizers(), or loading calibration data lazily only when an enabled Conv2d override remains.

Also applies to: 462-473

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/torch_onnx/torch_quant_to_onnx.py` around lines 198 - 202, The
current flow calls _calibrate_uncalibrated_quantizers(quantized_model,
data_loader) before mtq.disable_quantizer(...), causing calibration data to be
downloaded even for quantizers that will be disabled; change the order or gate
calibration so only enabled overrides are calibrated: either call
mtq.disable_quantizer(quantized_model, filter_func) before invoking
_calibrate_uncalibrated_quantizers, or add a pre-check that filters
quantized_model (using filter_func) to determine if any Conv2d override remains
enabled and only then load/process data_loader and call
_calibrate_uncalibrated_quantizers; reference functions:
_calibrate_uncalibrated_quantizers, mtq.disable_quantizer, filter_func,
quantized_model, and data_loader.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@examples/torch_onnx/torch_quant_to_onnx.py`:
- Around line 39-47: Update the module-level docstring that lists supported
quantization modes by removing "INT4_AWQ" from the supported-modes sentence (the
top-level docstring/description in torch_quant_to_onnx.py); ensure any other
occurrences in the same docstring (e.g., the second paragraph that enumerates
FP8, INT8, MXFP8, NVFP4, INT4_AWQ, and AUTO) are updated so INT4_AWQ is no
longer mentioned, leaving the remaining modes and explanatory text intact.
- Around line 228-257: The current post-hoc override in _override_conv2d_to_fp8
mutates Conv2d quantizers after mtq.auto_quantize has already produced a
search_state and effective_bits budget, making search_state stale; move the
Conv2d -> FP8 compatibility rule into the candidate configuration generation
that mtq.auto_quantize consumes (or run a budget/validation pass to recompute
search_state/effective_bits after performing the override). Specifically,
incorporate the TRT Conv2d restriction into the candidate configs or pre-filter
candidates before calling mtq.auto_quantize, or if you keep
_override_conv2d_to_fp8, call a function to recompute/validate search_state and
effective_bits (the returned object from mtq.auto_quantize) immediately after
performing the override so the exported model and reported budget remain
consistent.
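The staleness concern can be illustrated with a minimal, self-contained sketch. The layer names, parameter counts, and the `effective_bits` helper below are all hypothetical stand-ins (the real `search_state` returned by `mtq.auto_quantize` is richer); the point is only that a parameter-weighted bit budget computed before a post-hoc override no longer describes the exported model.

```python
def effective_bits(layer_bits, layer_params):
    """Parameter-weighted average bit width across layers (toy version)."""
    total = sum(layer_params.values())
    return sum(layer_bits[name] * n for name, n in layer_params.items()) / total

# Hypothetical per-layer parameter counts and the bit widths the search chose.
layer_params = {"conv1": 1_000, "fc": 9_000}
layer_bits = {"conv1": 4, "fc": 4}
budget_reported = effective_bits(layer_bits, layer_params)  # 4.0 at search time

# A post-hoc Conv2d -> FP8 override bumps conv1 to 8 bits, so the
# previously reported budget is stale until it is recomputed.
layer_bits["conv1"] = 8
budget_actual = effective_bits(layer_bits, layer_params)  # 4.4 after the override
```

Either folding the Conv2d restriction into the candidate configs or recomputing the budget after the override closes this gap.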

---

Nitpick comments:
In `@examples/torch_onnx/torch_quant_to_onnx.py`:
- Around line 198-202: The current flow calls
_calibrate_uncalibrated_quantizers(quantized_model, data_loader) before
mtq.disable_quantizer(...), causing calibration data to be downloaded even for
quantizers that will be disabled; change the order or gate calibration so only
enabled overrides are calibrated: either call
mtq.disable_quantizer(quantized_model, filter_func) before invoking
_calibrate_uncalibrated_quantizers, or add a pre-check that filters
quantized_model (using filter_func) to determine if any Conv2d override remains
enabled and only then load/process data_loader and call
_calibrate_uncalibrated_quantizers; reference functions:
_calibrate_uncalibrated_quantizers, mtq.disable_quantizer, filter_func,
quantized_model, and data_loader.
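The pre-check variant of this suggestion can be sketched as follows. The `Quantizer` dataclass, registry dict, and module names are illustrative stand-ins for ModelOpt's internals, not its actual API; the shape of the gate is what matters — evaluate the disable predicate first, and only pay for the data download when at least one override survives it.

```python
from dataclasses import dataclass


@dataclass
class Quantizer:
    enabled: bool = True


def any_enabled_after_filter(quantizers, filter_func):
    """True if at least one quantizer stays enabled once filter_func's matches are disabled."""
    return any(q.enabled and not filter_func(name) for name, q in quantizers.items())


# Hypothetical quantizer registry; names mimic module paths.
quantizers = {
    "layer1.conv.weight_quantizer": Quantizer(),
    "fc.weight_quantizer": Quantizer(),
}


def disable_conv(name):  # stand-in for the example's filter_func
    return ".conv." in name


if any_enabled_after_filter(quantizers, disable_conv):
    # Only now load data_loader and run _calibrate_uncalibrated_quantizers.
    pass
```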

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 54e5c119-bb09-448b-ba29-0ad682692eb3

📥 Commits

Reviewing files that changed from the base of the PR and between 9810e1d and 54cefb2.

📒 Files selected for processing (2)
  • examples/torch_onnx/torch_quant_to_onnx.py
  • tests/examples/torch_onnx/test_torch_quant_to_onnx.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/examples/torch_onnx/test_torch_quant_to_onnx.py

Signed-off-by: ajrasane <arasane@nvidia.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/torch_onnx/torch_quant_to_onnx.py`:
- Around line 248-250: Replace the direct internal assignments to
quantizer._num_bits and quantizer._axis with the public property setters so the
calibrator stays synchronized: use quantizer.num_bits = (4, 3) and
quantizer.axis = None instead of writing to _num_bits and _axis (keeping
quantizer.block_sizes assignment as-is); update references in the same snippet
to call the num_bits and axis properties on the quantizer object to ensure
proper calibrator/internal state updates.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 703357c3-97d1-492b-a949-8b09ff915c3c

📥 Commits

Reviewing files that changed from the base of the PR and between 54cefb2 and ba020d5.

📒 Files selected for processing (2)
  • examples/torch_onnx/torch_quant_to_onnx.py
  • modelopt/torch/_deploy/utils/torch_onnx.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • modelopt/torch/_deploy/utils/torch_onnx.py

Comment on lines +248 to +250
quantizer.block_sizes = None
quantizer._num_bits = (4, 3)
quantizer._axis = None

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if there's a public API for setting quantizer num_bits/axis
rg -n "def.*num_bits|def.*axis" --type=py modelopt/torch/quantization/

Repository: NVIDIA/Model-Optimizer

Length of output: 3259


🏁 Script executed:

#!/bin/bash
# Get the file and examine the problematic lines with context
cat -n examples/torch_onnx/torch_quant_to_onnx.py | sed -n '235,260p'

Repository: NVIDIA/Model-Optimizer

Length of output: 1289


🏁 Script executed:

#!/bin/bash
# Examine the public num_bits and axis property implementations
cat -n modelopt/torch/quantization/nn/modules/tensor_quantizer.py | sed -n '272,285p'
cat -n modelopt/torch/quantization/nn/modules/tensor_quantizer.py | sed -n '372,385p'

Repository: NVIDIA/Model-Optimizer

Length of output: 967


Use public property setters for num_bits and axis to maintain calibrator synchronization.

At lines 249-250, directly assigning to _num_bits and _axis bypasses the property setters, which synchronize these values with the calibrator's internal state. This creates an inconsistency where the quantizer and calibrator have mismatched configuration.

Change:

quantizer._num_bits = (4, 3)
quantizer._axis = None

To:

quantizer.num_bits = (4, 3)
quantizer.axis = None

Line 248 correctly uses the public block_sizes property; apply the same pattern here.
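Why the public setter matters can be shown with a small stand-in class (this is a sketch, not ModelOpt's `TensorQuantizer`): the property setter propagates the new value to a dependent calibrator field, while writing to the underscore attribute silently leaves the calibrator stale.

```python
class Calibrator:
    """Toy calibrator that caches the quantizer's bit configuration."""

    def __init__(self, num_bits):
        self.num_bits = num_bits


class ToyQuantizer:
    """Stand-in quantizer: the num_bits setter keeps the calibrator in sync."""

    def __init__(self, num_bits=8):
        self._num_bits = num_bits
        self._calibrator = Calibrator(num_bits)

    @property
    def num_bits(self):
        return self._num_bits

    @num_bits.setter
    def num_bits(self, value):
        self._num_bits = value
        self._calibrator.num_bits = value  # synchronization happens here


q = ToyQuantizer()
q._num_bits = (4, 3)  # bypasses the setter: calibrator still thinks 8 bits
assert q._calibrator.num_bits == 8

q.num_bits = (4, 3)  # public setter: quantizer and calibrator agree
assert q._calibrator.num_bits == (4, 3)
```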

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/torch_onnx/torch_quant_to_onnx.py` around lines 248 - 250, Replace
the direct internal assignments to quantizer._num_bits and quantizer._axis with
the public property setters so the calibrator stays synchronized: use
quantizer.num_bits = (4, 3) and quantizer.axis = None instead of writing to
_num_bits and _axis (keeping quantizer.block_sizes assignment as-is); update
references in the same snippet to call the num_bits and axis properties on the
quantizer object to ensure proper calibrator/internal state updates.
