
[Cherry-pick] PRs #1393 #1389 #1268 #1397 #1402 #1411 #1410 #1419 #1408 #1416 #1426

Open
kevalmorabia97 wants to merge 11 commits into release/0.44.0 from cherry-picks/release-0.44.0

Conversation

@kevalmorabia97
Collaborator

@kevalmorabia97 kevalmorabia97 commented May 11, 2026

Cherry-picked PRs

Summary by CodeRabbit

  • New Features

    • SPEEDBench now uses stratified sampling for deterministic, balanced dataset selection.
    • Added legacy quantization conversion shims for INT4, MXFP8 and FP4→2DQ workflows.
    • AWQ Lite: fallback handling for uncalibrated per-expert quantizers during export.
  • Bug Fixes

    • Clamp FP8 scales in NVFP4 quantization to avoid NaNs.
    • Fixed warmup steps formatting in finetune launch script.
  • Improvements

    • LM-Eval integration updated for v0.4.10+ compatibility.
    • TensorRT execution routed through a dedicated trtexec helper.
  • Tests

    • Added regression tests covering quantization shims, FP8 scale behavior, export fallbacks, and LM eval.

Review Change Stack

kevalmorabia97 and others added 9 commits May 11, 2026 04:12
…compat (#1393)

Fix for NVBug 6120631:

```
finetune.py: error: argument --warmup_steps/--warmup-steps: invalid int value: '0.0'
```

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Chores**
* Corrected parameter format in finetuning example script for
consistency.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
When the dataset has a `category` column with >1 distinct categories and
``--num_requests N`` is below the dataset size, take ceil(N /
num_categories) rows from each category and round-robin interleave them
so any prefix is balanced. Falls back to the existing ``range(N)`` slice
when category metadata is absent or there's only one category.

Fixes a sampling bug where SPEED-Bench parquet files are sorted by
category, so e.g. ``--num_requests 64`` on throughput_8k pulls 64
high_entropy prompts and zero from low_entropy / mixed.
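For illustration, a minimal, self-contained sketch of the intended round-robin selection (hypothetical standalone helper, not the exact code in `speed.py`):

```python
def stratified_indices(categories: list[str], n: int) -> list[int]:
    """Round-robin row indices across categories so any prefix stays balanced."""
    buckets: dict[str, list[int]] = {}
    for i, cat in enumerate(categories):
        buckets.setdefault(cat, []).append(i)
    if len(buckets) <= 1:
        return list(range(n))  # fall back to the plain prefix slice
    picked: list[int] = []
    for round_idx in range(max(len(rows) for rows in buckets.values())):
        for rows in buckets.values():
            if round_idx < len(rows):
                picked.append(rows[round_idx])
                if len(picked) == n:
                    return picked
    return picked

# On a file sorted by category (3 rows each of hi/lo/mix), 4 requests now span
# all three categories instead of taking 4 rows of "hi":
# stratified_indices(["hi"] * 3 + ["lo"] * 3 + ["mix"] * 3, 4) -> [0, 3, 6, 1]
```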

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* SPEEDBench dataset truncation now performs deterministic, balanced
stratified sampling across categories (round-robin) when multiple
categories exist and a smaller sample size is requested.
* Falls back to simple prefix selection when no category or only one
category is present, preserving prior behavior in that case.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Alexandre Milesi <milesial@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…1268)

### What does this PR do?

Type of change: code improvement

`subprocess` is needed to run `trtexec` commands, but such calls require `nosec`
approval. This PR centralizes them into a single function for the ONNX
workflow to avoid future approval triggers. The torch workflow already
centralizes this in `run_command` (renamed to `_run_trtexec_streamed`).

Original request:
#1259 (comment)

### Usage

```python
results = _run_trtexec(cmd)
```
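For context, a minimal sketch of what such a centralized helper can look like (name and signature are assumptions, not the exact code):

```python
import subprocess

def _run_trtexec(args: list[str], timeout: int | None = None) -> subprocess.CompletedProcess:
    """Run trtexec with an argv list (no shell), capturing output.

    Sketch only: the real helper in ort_utils.py may differ in signature
    and error handling.
    """
    cmd = ["trtexec", *args]
    try:
        return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except FileNotFoundError as e:
        raise RuntimeError("trtexec not found; ensure it is available in PATH") from e
```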

### Testing
N/A

### Before your PR is "*Ready for review*"

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A <!---
Mandatory -->
- Did you write any new necessary tests?: N/A <!--- Mandatory for new
features or examples. -->
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
N/A <!--- Only for new features, API changes, critical bug fixes or
backward incompatible changes. -->

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Refactor**
* Standardized how the app invokes and logs the external
conversion/benchmarking tool across build, profile, and benchmark flows
for consistent execution and clearer logs.
* **Behavior Change**
* The tool is now invoked by name from PATH (configurable path option
removed); execution and logging were centralized.
* **Bug Fixes**
* Improved missing-tool error messaging: failures now instruct to ensure
the external tool is available in PATH.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
## Summary

- Saturates `per_block_scale * 448 / per_block_scale_max` to ≤ 448
before the `to(torch.float8_e4m3fn)` cast in
`NVFP4QTensor.get_weights_scaling_factor_from_quantizer`.
- Adds a regression test that reproduces the NaN byte without the clamp.

## Why

When `_amax` contains a zero entry (e.g. an all-zero weight block left
untouched by max calibration), the existing
`per_block_scale[per_block_scale == 0] = 1.0` safety net drives the
pre-cast value to `1.0 * 448 / (global_amax / 6)`. `fp8_e4m3fn` has no
Inf — anything `≥ 480` rounds to NaN — so a 0x7F byte slips into the
exported `weight_scale`.

This was observed in a saved Kimi-K2.6-NVFP4-MSE checkpoint at
`language_model.model.layers.1.mlp.experts.21.down_proj.weight_scale[4001,
18]`. The MSE FP8 sweep itself never produces zero per-block amax (it
always emits at least `c[0] * global_amax`), but any export path where
`_amax` ends up zero — including pure max calibration — hits the bug.
With the clamp the byte saturates to `0x7E` (= 448, fp8 max finite) and
dequantization is unaffected: the FP4 nibbles for an all-zero block are
all 0, so `0 × 448 × weight_scale_2 = 0` regardless of the stored fp8
scale. For non-degenerate blocks the clamp is a no-op since
`per_block_amax ≤ global_amax` already bounds the pre-cast value at 448.
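The clamp itself is a one-liner before the fp8 cast; a minimal sketch of the idea (tensor values are illustrative, not the exact code):

```python
import torch

# 1.0 * 448 / (global_amax / 6) with global_amax = 3.0 gives 896; anything
# >= 480 would encode as NaN (0x7F) in float8_e4m3fn without the clamp.
per_block_scale = torch.tensor([896.0, 100.0, 448.0])
clamped = per_block_scale.clamp(max=448.0).to(torch.float8_e4m3fn)
assert not torch.isnan(clamped.float()).any()  # saturates to 448 (0x7E) instead
```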

## Test plan

- [x] New regression test
`test_export_fp8_scale_no_nan_for_zero_amax_block` fails on `main`'s
export code (reproduces the 0x7F NaN byte) and passes with the clamp.
- [x] Existing tests in
`tests/gpu/torch/quantization/test_nvfp4_static_quantizer_cuda.py` still
pass (10/10).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Improved numerical stability in FP8 quantization scaling by preventing
overflow and NaN conditions
* Enhanced handling of edge cases in quantization processing for
zero-weight blocks

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…tom configs (#1402)

The export step's collect_shared_input_modules() runs a dummy forward
pass to trace shared input modules for resmoothing/requantization. This
forward call hits the same AttributeError as PR #1324 on configs that
don't assign use_cache (e.g., stepfun-ai/Step-3.5-Flash's
Step3p5Config):

  AttributeError: 'Step3p5Config' object has no attribute 'use_cache'

Wrap the dummy forward with the _disable_use_cache context manager so it
sets/restores config.use_cache around the call, mirroring the
calibration fix from #1324. Covers all callsites since both
unified_export_hf and plugins/vllm_fakequant_hf funnel through
collect_shared_input_modules.
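A minimal sketch of the context-manager pattern, assuming a simplified signature (the real `_disable_use_cache` may differ):

```python
from contextlib import contextmanager

@contextmanager
def _disable_use_cache(config):
    """Force config.use_cache = False around a dummy forward, then restore
    the prior state, including the 'attribute absent' case that triggers
    the AttributeError above."""
    had_attr = hasattr(config, "use_cache")
    original = getattr(config, "use_cache", None)
    config.use_cache = False
    try:
        yield
    finally:
        if had_attr:
            config.use_cache = original
        else:
            del config.use_cache
```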


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Improved model export process by optimizing cache handling during
profiling operations.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…le_2 consistent (#1411)

## Summary
- Pre-fill the source `gate_up_proj_weight_quantizers[idx]._amax` from
the **fused** `gate_up[idx]` tensor when it is uncalibrated, **before**
the per-projection deepcopy in `_export_fused_experts`. Both gate's
clone and up's clone then inherit the same scalar amax, so
`weight_scale_2 = amax / (6 · 448)` matches across the W1/W3 fusion that
vLLM expects at load time.
- Add a unit regression test that builds a fused-experts module with
intentionally mismatched gate vs. up weight magnitudes, leaves every
expert uncalibrated, and asserts the gate and up wrappers carry the same
amax into the FP4 quantization step. Fails on `main`, passes with this
fix.

## Why

NVbug 6142360: Qwen3.5-MoE / Qwen3-Next NVFP4 checkpoints produced by
ModelOpt yielded garbled output under vLLM (`1\n1\n1\n…` style
degenerate token loops). The vLLM log showed `w1_weight_scale_2 must
match w3_weight_scale_2. Accuracy may be affected.`.

Root cause: in `_export_fused_experts`, when an expert receives no
calibration tokens (common for low-frequency experts even with
`--calib_size 512` on 128-expert MoEs), the per-projection fallback
computed `amax` independently from each split slice:

```python
w_quantizer.amax = weight_slice.abs().amax().to(torch.float32)
```

`weight_slice` is the gate-only or up-only 2-D half of the fused
`gate_up_proj`. Since gate and up have different magnitudes, gate and up
end up with different scalar amax values — and therefore different
`weight_scale_2`. vLLM fuses W1 and W3 into a single weight at load time
and asserts a single shared scale; the half whose scale was discarded is
now off by the gate/up magnitude ratio (~10× was typical), which
catastrophically corrupts the MoE output.

The fix derives the fallback amax once from the whole `gate_up[idx]`
tensor before the deepcopies, so gate and up share the same amax —
exactly what calibration would have produced if any token had hit the
expert. Calibrated experts are unaffected (the new code path is gated on
`_amax` being missing or zero). `down_proj` keeps its existing
per-projection fallback because it has its own quantizer with no fusion
partner.
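A minimal sketch of the fallback, using a toy quantizer object (the helper name and stand-in object are hypothetical; the real change is inside `_export_fused_experts`):

```python
import torch
from types import SimpleNamespace

def prefill_fused_amax(quantizer, fused_weight: torch.Tensor) -> None:
    """Derive one scalar amax from the whole fused gate_up tensor when the
    quantizer is uncalibrated, so the gate and up deepcopies inherit the
    same weight_scale_2 = amax / (6 * 448)."""
    amax = getattr(quantizer, "_amax", None)
    if amax is None or not torch.count_nonzero(amax):
        quantizer.amax = fused_weight.abs().amax().to(torch.float32)

# Gate half roughly 10x smaller than the up half, as in the repro:
q = SimpleNamespace(_amax=None, amax=None)
fused = torch.cat([torch.full((4, 8), 0.02), torch.full((4, 8), 0.2)], dim=1)
prefill_fused_amax(q, fused)
print(q.amax)  # tensor(0.2000): one shared amax instead of 0.02 vs 0.2
```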

## End-to-end vLLM verification

Used `vllm/vllm-openai:nightly` Docker on RTX 6000 Ada, Marlin NVFP4
backend, with a real Qwen3-30B-A3B-Instruct NVFP4 checkpoint. The bug
requires the export-time scale skew, so I A/B-tested by mutating
`gate_proj.weight_scale_2` directly:

| Variant | gate vs up `weight_scale_2` | Output for "Write an article
about AI." |
|---|---|---|
| Baseline (original checkpoint) | matched (calibrated) | coherent: "The
Rise of Artificial Intelligence: Transforming the World One Algorithm at
a Time…" |
| Bug-simulated (gate = up / 10 for all 6143 expert pairs) | mismatched
~10× | `__':\n__':\n__':\n…` (degenerate loop, same shape as user's
`1\n1\n1\n…`) |
| Fix-simulated (gate restored to up) | matched | coherent: same article
as baseline |

The 10× skew matches what the bug actually emits for uncalibrated
experts (e.g. `gate=2.77e-5, up=2.30e-4` from a synthetic repro), and
reproduces the same garbled-loop pattern the user reported.

## Test plan
- [x] New unit test
`tests/unit/torch/quantization/plugins/test_fused_experts.py::TestExportFusedExperts::test_uncalibrated_expert_gate_up_share_amax`
passes.
- [x] Full
`tests/unit/torch/quantization/plugins/test_fused_experts.py`: 28/28
pass.
- [x] Broader `tests/unit/torch/export/` +
`tests/unit/torch/quantization/`: 609 passed, 9 skipped.
- [x] vLLM end-to-end A/B (table above): bug reproduces with mismatched
scales, fix produces coherent output identical to baseline.


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Enhanced handling of uncalibrated mixture-of-experts weight quantizers
during export with improved consistency checks and informative warnings.

* **Tests**
  * Added regression test for uncalibrated expert quantization behavior.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: weimingc <weimingc@nvidia.com>
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…zer disabled (#1410)

## Summary

`awq_lite.setup()` disables `module.input_quantizer` at the start of
search. The calibrated branch re-enables it inside `postprocess()`, but
the uncalibrated branch (no cache-pass tokens, e.g. an MoE expert that
never gets routed) never did. Worse, for experts that had cache hits but
missed the search pass, the per-channel `_amax` left over from
`max_calibrate` during cache mode tripped `preprocess_linear_fusion`'s
`numel == 1` assertion and prevented `_export_quantized_weight` from
emitting a per-tensor `input_scale`.

Result: per-expert `input_scale` was missing in the exported HF
checkpoint, and TRT-LLM `CutlassFusedMoE` crashed on load with
`KeyError: '<idx>.w1.input_scale'` for any expert that did not see
enough calibration tokens (e.g. Qwen3-30B-A3B + `nvfp4_awq` from the bug
report).

## Fix

Mirror the calibrated `postprocess()` path in
`modelopt/torch/quantization/model_calib.py`: collapse any per-channel
`_amax` to scalar (axis=None) and re-enable the `input_quantizer`.
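A minimal sketch of that postprocess step (hypothetical helper name; the real logic lives in `model_calib.py` and uses TensorQuantizer internals):

```python
import torch

def finalize_uncalibrated_input_quantizer(input_quantizer) -> None:
    """Collapse a per-channel amax left over from cache-mode max calibration
    to a per-tensor scalar (axis=None), then re-enable the quantizer,
    mirroring the calibrated postprocess() path."""
    amax = getattr(input_quantizer, "_amax", None)
    if amax is not None and amax.numel() > 1:
        input_quantizer.axis = None
        input_quantizer.amax = amax.amax()  # per-channel -> scalar
    input_quantizer.enable()
```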

## Test plan

- [x] Added regression test
`test_awq_lite_uncalibrated_linear_keeps_input_quantizer_enabled` using
`NVFP4_AWQ_LITE_CFG` with a two-branch model where only one branch is
exercised; verifies the uncalibrated linear's `input_quantizer` remains
enabled after `mtq.quantize`.
- [x] End-to-end pipeline test on a tiny synthetic Qwen3-MoE (8 experts,
top-1 routing) confirms all 48 expert `input_scale` keys are present in
the exported state_dict (vs. multiple missing pre-fix).
- [x] Manual repro of the bug command (Qwen3-30B-A3B, `--qformat
nvfp4_awq`) confirmed the missing `input_scale` keys before the fix.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Bug Fixes**
* Fixed AWQ-Lite quantization for uncalibrated modules so
export/preprocessing invariants are preserved even when
calibration/parameter updates are skipped.

* **Tests**
* Added regression tests: one verifies uncalibrated linear modules keep
their input quantizer enabled after quantization; another verifies
weight-only AWQ-Lite cases keep the input quantizer disabled as
expected.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Updated compatibility handling to support Transformers 5.0 while
maintaining backward compatibility with earlier versions.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
## What does this PR do?

**Type of change:** Bug fix (backward-compat restoration), follow-up to
#1356.

#1356 restored `modelopt.onnx.llm_export_utils` as a deprecation shim so
that TensorRT-Edge-LLM 0.6.1 could resume importing
`fold_fp8_qdq_to_dq`. QA reports that the next line of
`tensorrt_edgellm/onnx_export/onnx_utils.py` still aborts at module
load:

```python
from modelopt.onnx.llm_export_utils.surgeon_utils import fold_fp8_qdq_to_dq   # OK
from modelopt.onnx.quantization.qdq_utils import (fp4qdq_to_2dq,
                                                  quantize_weights_to_int4,
                                                  quantize_weights_to_mxfp8)
# ImportError: cannot import name 'fp4qdq_to_2dq' from 'modelopt.onnx.quantization.qdq_utils'
```

Those three symbols were removed when the LLM ONNX export pipeline was
refactored into the staged `modelopt.onnx.export` exporters
(`INT4QuantExporter`, `NVFP4QuantExporter`, `MXFP8QuantExporter`). The
new exporters still implement the same conversions, but edgellm 0.6.1
imports the old top-level functions unconditionally, so every
`tensorrt-edgellm-*` CLI fails immediately — same 5/5 (100%) failure
rate as before, just one line further in.

## What this PR does

- Restores `fp4qdq_to_2dq`, `quantize_weights_to_int4`, and
`quantize_weights_to_mxfp8` on `modelopt.onnx.quantization.qdq_utils` as
deprecation shims that:
- emit a `DeprecationWarning` pointing readers at the new
`modelopt.onnx.export` exporters and at TensorRT-Edge-LLM as the
long-term migration target,
- reuse the existing in-tree helpers (`_cast_fp8` here, `_cast_fp4` and
`_replace_fp4qdq_with_2dq` from `modelopt.onnx.export.nvfp4_exporter`
via a lazy import to avoid a circular import — `nvfp4_exporter` already
imports `onnx_dtype_map` from this module),
- reuse `pack_weights_to_int4`, `get_amax`, `compute_e8m0`,
`get_weights_scaling_factor[_2]`, `quantize`, `get_attribute`,
`has_attribute`, and `read_f16_tensor_as_fp32` from their canonical
locations rather than duplicating them.
- Adds the matching imports at the top of `qdq_utils.py`. No new files,
no public API removals, no behavior changes for any existing caller.

I also audited every other `modelopt` import edgellm 0.6.1 makes
(`modelopt.torch.quantization`, `modelopt.torch.opt`,
`modelopt.torch.export.quant_utils`,
`modelopt.torch.opt.plugins.huggingface`,
`modelopt.onnx.quantization.gs_patching`,
`modelopt.torch.quantization.nn.{QuantLinear,TensorQuantizer}`,
`modelopt.torch.quantization.utils.{is_quantized,is_quantized_linear}`)
and confirmed each still resolves on current `main`; only the three
`qdq_utils` symbols above were missing.

## Testing

- All four edgellm-side import lines from the failing traceback now
succeed against this branch (verified by importing them from the v0.6.1
source tree).
- `tests/unit/onnx/quantization/test_qdq_utils.py` +
`tests/unit/onnx/test_onnx_utils.py`: 36 passed in 5s.
- Each restored shim emits `DeprecationWarning` when called and runs to
completion on an empty `ModelProto` smoke input.
- `pre-commit run --files modelopt/onnx/quantization/qdq_utils.py`:
ruff, ruff-format, mypy, bandit, license header all pass.

## Notes for reviewers

- The shim functions are intentionally near-verbatim copies of the
pre-removal implementations rather than thin adapters over the new
exporters: the new exporters split work across `compute_scales` /
`compress_weights` / `post_process` stages, while edgellm calls the
legacy single-shot entry points and depends on side effects (e.g.
casting bias to fp16 inside `quantize_weights_to_int4`) that the staged
exporters do not all reproduce in one call. Wrapping them up was
higher-risk than restoring the legacy paths verbatim with a deprecation
warning.
- Removal target: same as the existing `llm_export_utils` shim — drop
together once edgellm releases a version that targets
`modelopt.onnx.export` directly.
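To make the shim pattern concrete, a minimal sketch of one restored entry point (body and call shape are illustrative only; the real shims reuse the in-tree helpers listed above):

```python
import warnings

def fp4qdq_to_2dq(onnx_model):
    """Deprecated legacy entry point kept for TensorRT-Edge-LLM 0.6.1."""
    warnings.warn(
        "fp4qdq_to_2dq is deprecated; migrate to the modelopt.onnx.export "
        "exporters (NVFP4QuantExporter) or TensorRT-Edge-LLM.",
        DeprecationWarning,
        stacklevel=2,
    )
    # Lazy import avoids the circular import: nvfp4_exporter already imports
    # onnx_dtype_map from qdq_utils. (Call shape assumed for illustration.)
    from modelopt.onnx.export.nvfp4_exporter import _replace_fp4qdq_with_2dq

    return _replace_fp4qdq_with_2dq(onnx_model)
```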

## Before your PR is "*Ready for review*"

- [x] Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed.
- [x] Is this change backward compatible?: **Yes** — purely additive; no
existing API changes.
- [x] Did you write any new necessary tests?: relied on existing
qdq_utils unit tests + a manual repro of the failing edgellm import
path.
- [x] Did you add or update any necessary documentation?: deprecation
message in each shim points to the migration target.
- [x] Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
not applicable for a compat shim.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Chores**
* Added deprecated legacy quantization conversion utilities to preserve
compatibility; invoking them now emits deprecation warnings.
* **Documentation**
  * Updated helper docstrings to note compatibility/shim usage.
* **Tests**
* Added end-to-end smoke tests verifying the deprecated shims emit
warnings and produce expected transformed/quantized models.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@kevalmorabia97 kevalmorabia97 requested review from a team as code owners May 11, 2026 11:13
@kevalmorabia97 kevalmorabia97 requested review from RalphMao, cjluo-nv, realAsma, sugunav14 and yeyu-nvidia and removed request for a team May 11, 2026 11:13
@coderabbitai
Contributor

coderabbitai Bot commented May 11, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0787cc16-3ffd-417b-9408-e17966495872

📥 Commits

Reviewing files that changed from the base of the PR and between 3f28248 and 5b16509.

📒 Files selected for processing (1)
  • pyproject.toml
✅ Files skipped from review due to trivial changes (1)
  • pyproject.toml

📝 Walkthrough

Walkthrough

This PR consolidates TensorRT subprocess execution into shared helpers, adds legacy ONNX quantization shims for backwards compatibility, fixes FP8 scale overflow handling, improves quantizer state consistency for fused experts, extends model export compatibility for newer dependency versions, and updates evaluation examples to integrate with lm-eval 0.4.10+.

Changes

TensorRT Subprocess Execution Refactoring

- **Shared trtexec Helper** (`modelopt/onnx/quantization/ort_utils.py`, `modelopt/onnx/quantization/autotune/benchmark.py`): Introduces `_run_trtexec()` with subprocess wrapping, timeout support, and clearer error handling; updates `_check_for_trtexec()` and replaces direct subprocess usage in benchmark.
- **TensorRT Constants Cleanup** (`modelopt/torch/_deploy/_runtime/tensorrt/constants.py`): Removes `TRTEXEC` and `TRTEXEC_PATH` constants no longer needed after subprocess refactoring.
- **Engine Builder Integration** (`modelopt/torch/_deploy/_runtime/tensorrt/engine_builder.py`): Adds `_run_trtexec_with_logging()` wrapper for hardcoded trtexec invocation; refactors `build_engine` and `profile_engine` to build command args without executable path and use the new wrapper.
- **Test Mocking Updates** (`tests/unit/torch/deploy/_runtime/tensorrt/test_engine_builder.py`): Updates test fixture to mock refactored `_run_trtexec_with_logging` instead of `_run_command`.

Legacy ONNX Quantization Deprecation Shims

- **Helper Documentation** (`modelopt/onnx/export/nvfp4_exporter.py`): Adds docstring notes documenting that `_cast_fp4` and `_replace_fp4qdq_with_2dq` are reused by deprecated shims and must maintain stable signatures.
- **Legacy Conversion Functions** (`modelopt/onnx/quantization/qdq_utils.py`): Adds three deprecation-wrapped conversion shims: `quantize_weights_to_int4`, `quantize_weights_to_mxfp8`, and `fp4qdq_to_2dq`; expands quantization helper imports and adds deprecation messaging.
- **Shim Test Coverage** (`tests/unit/onnx/quantization/test_qdq_utils.py`): Adds `TestLegacyEdgeLLMShims` with end-to-end regression tests verifying deprecation warnings and correct graph/initializer transformations for all three shims.

Quantizer State and FP8 Scale Handling

- **FP8 Scale Overflow Prevention** (`modelopt/torch/quantization/qtensor/nvfp4_tensor.py`, `tests/gpu/torch/quantization/test_nvfp4_static_quantizer_cuda.py`): Clamps FP8 scale values to 448.0 before casting to float8_e4m3fn to prevent NaN encoding; adds regression test checking no NaN-encoded bytes and expected saturation.
- **Fused Expert amax Fallback & AWQ-Lite Postprocessing** (`modelopt/torch/export/moe_utils.py`, `modelopt/torch/quantization/model_calib.py`, `tests/unit/torch/quantization/plugins/test_fused_experts.py`, `tests/unit/torch/quantization/test_calib.py`): Adds pre-split amax fallback for fused gate/up quantizers and extends AWQ-Lite disabled-expert postprocess to collapse per-channel amax and restore input_quantizer state; includes regression tests for both behaviors.

Model Export and Dependency Compatibility

- **use_cache Handling** (`modelopt/torch/export/unified_export_hf.py`): Wraps dummy forward pass in `collect_shared_input_modules` with `_disable_use_cache` to support models where `use_cache` affects probe forward.
- **Transformers Version Gating** (`modelopt/torch/opt/plugins/transformers.py`): Detects Transformers version at runtime and conditionally passes `load_config` to cached zero3 loader only for versions >= 5.0.

LLM Evaluation Framework Integration

- **lm_eval Version and CLI** (`examples/llm_eval/lm_eval_hf.py`): Enforces lm_eval >= 0.4.10 (ImportError if older); switches to `HarnessCLI`/`setup_logging`; adds `_add_modelopt_args` parser-extension hook and `_inject_modelopt_args_into_model_args`; refactors `__main__` to call `cli.execute(args)`.
- **Requirements Update** (`examples/llm_eval/requirements.txt`, `examples/puzzletron/requirements.txt`): Updates llm_eval requirement to `lm_eval[api,ifeval]>=0.4.10` and removes `lm-eval==0.4.8` from puzzletron example.
- **Test Suite** (`tests/examples/llm_eval/test_llm_eval.py`): Adds `test_lm_eval_hf()` for HF eval and replaces previous FP8 llama test with Qwen3-based `test_qwen3_eval_fp8()`; updates imports and tiny-model creation helpers.

Example Scripts and Dataset Utilities

- **Parameter Updates** (`examples/llm_sparsity/weight_sparsity/launch_finetune.sh`): Changes `--warmup_steps` from 0.0 to 0.
- **Dataset Sampling** (`examples/specdec_bench/specdec_bench/datasets/speed.py`): Replaces simple prefix truncation with category-stratified round-robin sampling via new `_stratified_select()` helper when `num_samples` is specified.
- **pyproject hf Extra Formatting** (`pyproject.toml`): Adjusts whitespace/alignment in the hf optional dependency block without changing dependencies.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Suggested reviewers

  • cjluo-nv
  • sugunav14

Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Security Anti-Patterns | ❌ Error | PR contains 5 `# nosec` comments in modified files, violating SECURITY.md, which prohibits this pattern. | Remove `# nosec` comments from ort_utils.py, engine_builder.py, benchmark.py. Add inline safety justifications. Request @NVIDIA/modelopt-setup-codeowners review with justification in PR description per SECURITY.md. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 37.50%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title directly and accurately reflects the core purpose: cherry-picking multiple numbered PRs into a release branch. It is concise, specific, and clearly communicates the main change. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

…1416)

### What does this PR do?

Type of change: Bug fix

Related: NVbug 6153721. To make `lm_eval_hf.py` and the `vllm_causallms`
adapter compatible with vLLM 0.20+, **lm_eval must be upgraded to
0.4.11**.

`examples/llm_eval/lm_eval_hf.py` failed to import on lm-eval >= 0.4.10:

```
ImportError: cannot import name 'parse_eval_args' from 'lm_eval.__main__'
```

lm-eval 0.4.10 replaced `lm_eval.__main__.{setup_parser,
parse_eval_args, cli_evaluate}` with a `HarnessCLI`-based interface in
`lm_eval._cli`. This PR drops the legacy code path and drives
`HarnessCLI` directly:

- Attach the ModelOpt arguments (`--quant_cfg`, `--auto_quantize_*`,
`--calib_*`, `--compress`, `--sparse_cfg`) to the new `run` subparser.
- After parsing, move those keys out of the argparse namespace and into
`args.model_args` (now a dict, courtesy of `MergeDictAction`), so
`EvaluatorConfig.from_cli` doesn't reject them as unknown kwargs and so
they reach our `HFLM.create_from_arg_obj` override.
- Hard-require lm-eval >= 0.4.10 via `packaging.version.Version` and
bump `examples/llm_eval/requirements.txt` and
`examples/puzzletron/requirements.txt` accordingly.

### Usage

```bash
# Same CLI as before — HarnessCLI auto-inserts the `run` subcommand for legacy-style invocations.
python examples/llm_eval/lm_eval_hf.py \
    --model hf \
    --model_args pretrained=<HF model> \
    --tasks hellaswag \
    --quant_cfg FP8_DEFAULT_CFG \
    --batch_size 4
```
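A minimal sketch of the namespace-to-`model_args` hand-off described above (abbreviated key list; the real `_MODELOPT_ARG_KEYS` covers all flags):

```python
_MODELOPT_ARG_KEYS = ["quant_cfg", "calib_size", "compress", "sparse_cfg"]  # abbreviated

def _inject_modelopt_args_into_model_args(args):
    """Move ModelOpt flags out of the argparse namespace into args.model_args
    (a dict under lm-eval >= 0.4.10), so EvaluatorConfig.from_cli does not
    reject them and they reach the HFLM.create_from_arg_obj override."""
    model_args = args.model_args or {}
    for key in _MODELOPT_ARG_KEYS:
        if hasattr(args, key):
            model_args[key] = getattr(args, key)
            delattr(args, key)
    args.model_args = model_args
    return args
```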

### Testing

Added `tests/examples/llm_eval/test_llm_eval.py::test_lm_eval_hf` — an
end-to-end test that builds a tiny qwen3 (no GPU required) and runs
`lm_eval_hf.py` against MMLU with `--limit 0.1`. Verifies both the new
HarnessCLI integration and the `HFLM.create_from_arg_obj` override
actually execute. Existing FP8 PTQ test (`test_qwen3_eval_fp8`) was
retargeted to the same tiny qwen3 helper so it no longer pulls down the
1.1B TinyLlama checkpoint.

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ❌ — drops support for lm-eval <
0.4.10. Requirements pins are bumped in the same PR; existing CLI
invocations continue to work because HarnessCLI auto-inserts the `run`
subcommand for legacy-style argv.
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: ✅
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
N/A
- Did you get Claude approval on this PR?: ❌

### Additional Information

NVbug 6153721 — vLLM 0.20+ compatibility requires lm_eval 0.4.11. This
PR pins to >= 0.4.10 to fix the import error; bumping the floor to
0.4.11 (or letting users opt in via vLLM extras) unblocks vLLM 0.20+
usage.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Chores**
* Require lm_eval >= 0.4.10 for evaluation examples and remove an
lm-eval pin from a puzzletron example.
* Improve LM-eval CLI integration so model-related CLI options
(including trust_remote_code) are recognized and forwarded correctly.

* **Tests**
  * Added an end-to-end smoke test for the LM evaluation example.
* Updated evaluation tests to use the new tiny model setup for more
robust validation.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@kevalmorabia97 kevalmorabia97 requested review from a team as code owners May 11, 2026 11:15
@kevalmorabia97 kevalmorabia97 changed the title from "[Cherry-pick] PRs #1393 #1389 #1268 #1397 #1402 #1411 #1410 #1419 #1408" to "[Cherry-pick] PRs #1393 #1389 #1268 #1397 #1402 #1411 #1410 #1419 #1408 #1416" May 11, 2026
@github-actions
Contributor

github-actions Bot commented May 11, 2026

PR Preview Action v1.8.1


🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1426/

Built to branch gh-pages at 2026-05-11 13:26 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tests/unit/torch/deploy/_runtime/tensorrt/test_engine_builder.py (1)

67-67: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Match mocked subprocess output type to runtime contract (bytes).

Line 67 returns "success" as str, but _run_trtexec_with_logging returns bytes. Keeping the mock as bytes avoids type-skew in tests.

Proposed fix
```diff
-        mock_run.return_value = (0, "success")
+        mock_run.return_value = (0, b"success")
```

(Then update string assertions in this test file to b"success".)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/torch/deploy/_runtime/tensorrt/test_engine_builder.py` at line 67,
The test mocks subprocess output as a str but the runtime function
_run_trtexec_with_logging returns bytes; change mock_run.return_value in
test_engine_builder.py from (0, "success") to (0, b"success") and update any
assertions comparing the captured output (e.g., equality checks against
"success") to use b"success" so types match the runtime contract; look for uses
of mock_run, _run_trtexec_with_logging-related assertions, and string
comparisons in that test file and replace them with the bytes literal.
examples/llm_eval/lm_eval_hf.py (1)

193-217: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't overwrite user --model_args with parser defaults.

Lines 248-259 unconditionally copy every ModelOpt parser field into args.model_args, but several of those fields have defaults on Lines 193, 206, 217, plus the implicit False from Line 231. That means a call like --model_args calib_size=1024,compress=True gets silently rewritten back to 512 / False unless the same values are also passed as top-level flags.

create_from_arg_obj() already supplies these defaults, so the safer pattern is to leave parser defaults unset and only inject keys the user explicitly provided.

Proposed fix
```diff
     parser.add_argument(
-        "--calib_size", type=int, help="Calibration size for quantization", default=512
+        "--calib_size", type=int, help="Calibration size for quantization"
     )
@@
     parser.add_argument(
         "--auto_quantize_method",
         type=str,
-        default="gradient",
         choices=["gradient", "kl_div"],
         help=(
             "Method for auto_quantize sensitivity analysis. 'gradient' uses gradient-based method "
@@
     parser.add_argument(
         "--auto_quantize_score_size",
         type=int,
-        default=128,
         help=(
             "Number of samples to use for auto_quantize scoring. Most of auto_quantize time is spent on "
             "sensitivity score estimation, so reducing this speeds it up while only minimally affecting "
@@
     parser.add_argument(
         "--compress",
         action="store_true",
+        default=None,
         help="Compress the model after quantization",
     )
@@
     for key in _MODELOPT_ARG_KEYS:
         if hasattr(args, key):
-            model_args[key] = getattr(args, key)
+            value = getattr(args, key)
+            if value is not None:
+                model_args[key] = value
             delattr(args, key)
```

Also applies to: 229-232, 241-259

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/llm_eval/lm_eval_hf.py` around lines 193 - 217, The code is
overwriting user-provided args.model_args with parser default values for
ModelOpt fields (e.g., calib_size, auto_quantize_bits, auto_quantize_method,
auto_quantize_score_size and the implicit booleans); instead, only merge keys
that the user explicitly passed at the command line into args.model_args so
create_from_arg_obj() can supply defaults. Modify the merge logic that currently
copies every ModelOpt parser field into args.model_args to: detect which flags
were explicitly set (via argparse's actions or by comparing against
parser.get_default for each option) and only inject those keys into
args.model_args (leave absent fields untouched so create_from_arg_obj() provides
defaults). Ensure this change applies to the same merging blocks that handle
calib_size/auto_quantize_* and any other ModelOpt fields referenced earlier.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/specdec_bench/specdec_bench/datasets/speed.py`:
- Around line 745-773: Add an early guard in _stratified_select to handle
non-positive n: if n <= 0 return dataset.select(range(n)) before checking
"category" presence or building cat_to_rows, so the interleaving code never runs
for n<=0 and the function returns an empty selection deterministically.

In `@modelopt/onnx/quantization/ort_utils.py`:
- Line 66: Remove the inline Bandit suppression by deleting the trailing "#
nosec B603" on the subprocess.run call so the security scan can validate it;
ensure the call remains using the argv list variable `cmd` (as already used) and
keeps the existing parameters (capture_output=True, text=True, timeout=timeout)
or replace with an explicit safer pattern if desired (e.g., subprocess.run(cmd,
capture_output=True, text=True, timeout=timeout, check=True))—locate the return
statement that invokes subprocess.run(cmd, ...) in ort_utils.py and remove only
the "# nosec B603" comment.

In `@modelopt/onnx/quantization/qdq_utils.py`:
- Around line 1264-1272: The current filter builds weight_dq_nodes by selecting
all DequantizeLinear nodes, causing index errors when activation DQ nodes (whose
inputs aren't in initializer_map) are processed; update the selection logic to
only include DequantizeLinear nodes whose input[0] is present in initializer_map
(i.e., backed by an initializer) before iterating. Locate the list comprehension
that defines weight_dq_nodes and change it to check initializer_map membership
of node.input[0] (or validate via get_tensor_producer_nodes/initializer_map) so
only weight-backed DequantizeLinear nodes are processed.
- Around line 1348-1359: The branch that replaces "Constant_output_0" with
"scale" and unconditionally appends a new initializer can create duplicate
initializer names; update the logic in qdq_utils.py around
scale_name/initializer_map/tensor_producer_map so that before calling
graph.initializer.append(...) you check whether an initializer with the target
name already exists (e.g., consult initializer_map or graph.initializer names)
and if it does, reuse that existing initializer and do not append a new one;
alternatively generate a unique name when needed and set node.input[1] to that
unique or existing initializer name to avoid duplicate initializers.

In `@modelopt/torch/_deploy/_runtime/tensorrt/engine_builder.py`:
- Around line 64-70: Remove the `# nosec` suppressions around the subprocess
invocation in engine_builder.py and harden the call instead: keep using
subprocess.run with a list (no shell=True), remove the comments,
validate/sanitize the `args` list before building `cmd = ["trtexec", *args]`
(reject or escape unsafe characters and ensure each entry is a string), verify
the executable exists (e.g., use shutil.which("trtexec") and raise a clear
error), and wrap subprocess.run in try/except to log and rethrow the caught
exception from the subprocess.run invocation so no security bypass comment is
needed; update references to subprocess.run, cmd, and args accordingly.

---

Outside diff comments:
In `@examples/llm_eval/lm_eval_hf.py`:
- Around line 193-217: The code is overwriting user-provided args.model_args
with parser default values for ModelOpt fields (e.g., calib_size,
auto_quantize_bits, auto_quantize_method, auto_quantize_score_size and the
implicit booleans); instead, only merge keys that the user explicitly passed at
the command line into args.model_args so create_from_arg_obj() can supply
defaults. Modify the merge logic that currently copies every ModelOpt parser
field into args.model_args to: detect which flags were explicitly set (via
argparse's actions or by comparing against parser.get_default for each option)
and only inject those keys into args.model_args (leave absent fields untouched
so create_from_arg_obj() provides defaults). Ensure this change applies to the
same merging blocks that handle calib_size/auto_quantize_* and any other
ModelOpt fields referenced earlier.

In `@tests/unit/torch/deploy/_runtime/tensorrt/test_engine_builder.py`:
- Line 67: The test mocks subprocess output as a str but the runtime function
_run_trtexec_with_logging returns bytes; change mock_run.return_value in
test_engine_builder.py from (0, "success") to (0, b"success") and update any
assertions comparing the captured output (e.g., equality checks against
"success") to use b"success" so types match the runtime contract; look for uses
of mock_run, _run_trtexec_with_logging-related assertions, and string
comparisons in that test file and replace them with the bytes literal.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: caa499f6-ea17-46a7-a8dd-11708cb15dd5

📥 Commits

Reviewing files that changed from the base of the PR and between cc06062 and 3f28248.

📒 Files selected for processing (22)
  • examples/llm_eval/lm_eval_hf.py
  • examples/llm_eval/requirements.txt
  • examples/llm_sparsity/weight_sparsity/launch_finetune.sh
  • examples/puzzletron/requirements.txt
  • examples/specdec_bench/specdec_bench/datasets/speed.py
  • modelopt/onnx/export/nvfp4_exporter.py
  • modelopt/onnx/quantization/autotune/benchmark.py
  • modelopt/onnx/quantization/ort_utils.py
  • modelopt/onnx/quantization/qdq_utils.py
  • modelopt/torch/_deploy/_runtime/tensorrt/constants.py
  • modelopt/torch/_deploy/_runtime/tensorrt/engine_builder.py
  • modelopt/torch/export/moe_utils.py
  • modelopt/torch/export/unified_export_hf.py
  • modelopt/torch/opt/plugins/transformers.py
  • modelopt/torch/quantization/model_calib.py
  • modelopt/torch/quantization/qtensor/nvfp4_tensor.py
  • tests/examples/llm_eval/test_llm_eval.py
  • tests/gpu/torch/quantization/test_nvfp4_static_quantizer_cuda.py
  • tests/unit/onnx/quantization/test_qdq_utils.py
  • tests/unit/torch/deploy/_runtime/tensorrt/test_engine_builder.py
  • tests/unit/torch/quantization/plugins/test_fused_experts.py
  • tests/unit/torch/quantization/test_calib.py
💤 Files with no reviewable changes (2)
  • examples/puzzletron/requirements.txt
  • modelopt/torch/_deploy/_runtime/tensorrt/constants.py

Comment on lines +745 to +773
```python
def _stratified_select(dataset: "Dataset", n: int) -> "Dataset":
    """Select ``n`` samples uniformly across the ``category`` column.

    Round-robin across categories until ``n`` rows are collected. The
    resulting prefix is balanced; once a smaller category is exhausted
    the remaining categories continue contributing, so exactly ``n``
    rows are returned whenever ``n`` does not exceed the dataset size.
    Falls back to ``range(n)`` when ``category`` is absent or there is
    only one category. Indices come from ``range(category_size)`` (not
    random) so behavior is deterministic.
    """
    if "category" not in dataset.column_names:
        return dataset.select(range(n))
    cat_to_rows: dict[str, list[int]] = {}
    for i, c in enumerate(dataset["category"]):
        cat_to_rows.setdefault(c, []).append(i)
    if len(cat_to_rows) <= 1:
        return dataset.select(range(n))
    cat_lists = list(cat_to_rows.values())
    interleaved: list[int] = []
    max_len = max(len(c) for c in cat_lists)
    for i in range(max_len):
        for c in cat_lists:
            if i < len(c):
                interleaved.append(c[i])
                if len(interleaved) == n:
                    return dataset.select(interleaved)
    return dataset.select(interleaved)
```


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard non-positive sample counts before interleaving.

When n <= 0, this method can still return non-empty results for multi-category datasets (because the round-robin loop appends before any len(interleaved) == n check can succeed). That makes num_samples=0 behave incorrectly.

Suggested fix
```diff
 @staticmethod
 def _stratified_select(dataset: "Dataset", n: int) -> "Dataset":
+    if n <= 0:
+        return dataset.select([])
+
     if "category" not in dataset.column_names:
         return dataset.select(range(n))
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/specdec_bench/specdec_bench/datasets/speed.py` around lines 745 -
773, Add an early guard in _stratified_select to handle non-positive n: if n <=
0 return dataset.select(range(n)) before checking "category" presence or
building cat_to_rows, so the interleaving code never runs for n<=0 and the
function returns an empty selection deterministically.

"""
cmd = ["trtexec", *(args or [])]
try:
return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout) # nosec B603

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Remove the inline Bandit suppression.

This subprocess call is already using an argv list, so the new # nosec B603 is just a policy violation on a security-sensitive path and should not be merged as-is.

Suggested fix
```diff
-        return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)  # nosec B603
+        return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
```

As per coding guidelines, "Bandit security checks must pass without exceptions. # nosec comments are not allowed as a bypass for security checks."

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
```diff
-return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)  # nosec B603
+return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/onnx/quantization/ort_utils.py` at line 66, Remove the inline Bandit
suppression by deleting the trailing "# nosec B603" on the subprocess.run call
so the security scan can validate it; ensure the call remains using the argv
list variable `cmd` (as already used) and keeps the existing parameters
(capture_output=True, text=True, timeout=timeout) or replace with an explicit
safer pattern if desired (e.g., subprocess.run(cmd, capture_output=True,
text=True, timeout=timeout, check=True))—locate the return statement that
invokes subprocess.run(cmd, ...) in ort_utils.py and remove only the "# nosec
B603" comment.

Comment on lines +1264 to +1272
```python
    weight_dq_nodes = [node for node in graph.node if node.op_type == "DequantizeLinear"]
    tensor_producer_map = get_tensor_producer_nodes(graph)

    nodes_to_remove = []
    for node in weight_dq_nodes:
        weight_name = node.input[0]
        scale_name = node.input[1]
        logger.debug(f"Processing INT4 conversion for weight {weight_name}")
        weight = numpy_helper.to_array(initializer_map[weight_name])
```

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Filter this shim to weight-backed DequantizeLinear nodes only.

Right now weight_dq_nodes includes every DequantizeLinear in the graph. The first activation DQ will fail at initializer_map[weight_name] because its input is not an initializer. That makes the compatibility shim unusable on mixed QDQ graphs even though only weight DQs should be rewritten here.

Suggested fix
-    weight_dq_nodes = [node for node in graph.node if node.op_type == "DequantizeLinear"]
+    weight_dq_nodes = [
+        node
+        for node in graph.node
+        if node.op_type == "DequantizeLinear"
+        and node.input
+        and node.input[0] in initializer_map
+    ]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/onnx/quantization/qdq_utils.py` around lines 1264 - 1272, The
current filter builds weight_dq_nodes by selecting all DequantizeLinear nodes,
causing index errors when activation DQ nodes (whose inputs aren't in
initializer_map) are processed; update the selection logic to only include
DequantizeLinear nodes whose input[0] is present in initializer_map (i.e.,
backed by an initializer) before iterating. Locate the list comprehension that
defines weight_dq_nodes and change it to check initializer_map membership of
node.input[0] (or validate via get_tensor_producer_nodes/initializer_map) so
only weight-backed DequantizeLinear nodes are processed.

Comment on lines +1348 to +1359
```python
        if scale_name not in initializer_map:
            # Remove scale producer if it's a Constant node
            scale_name = node.input[1]
            scale_producer = tensor_producer_map[scale_name]
            if scale_producer.op_type == "Constant":
                graph.node.remove(scale_producer)

            # Create a new scale tensor
            scale_name = scale_name.replace("Constant_output_0", "scale")
            scale_tensor = onnx.numpy_helper.from_array(scale, scale_name)
            graph.initializer.append(scale_tensor)
            node.input[1] = scale_name
```

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't append a second initializer with the same replacement name.

In the constant-scale path, Constant_output_0 is renamed to scale and then always appended as a new initializer. The fixture added in this PR already carries an unused scale initializer, so this branch produces duplicate initializer names, which makes the ONNX graph ambiguous/invalid.

Suggested fix
```diff
-            scale_tensor = onnx.numpy_helper.from_array(scale, scale_name)
-            graph.initializer.append(scale_tensor)
-            node.input[1] = scale_name
+            scale_tensor = onnx.numpy_helper.from_array(scale, scale_name)
+            if scale_name in initializer_map:
+                initializer_map[scale_name].CopyFrom(scale_tensor)
+            else:
+                graph.initializer.append(scale_tensor)
+                initializer_map[scale_name] = graph.initializer[-1]
+            node.input[1] = scale_name
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/onnx/quantization/qdq_utils.py` around lines 1348 - 1359, The branch
that replaces "Constant_output_0" with "scale" and unconditionally appends a new
initializer can create duplicate initializer names; update the logic in
qdq_utils.py around scale_name/initializer_map/tensor_producer_map so that
before calling graph.initializer.append(...) you check whether an initializer
with the target name already exists (e.g., consult initializer_map or
graph.initializer names) and if it does, reuse that existing initializer and do
not append a new one; alternatively generate a unique name when needed and set
node.input[1] to that unique or existing initializer name to avoid duplicate
initializers.

Comment on lines +64 to +70
```python
    import subprocess  # nosec

    cmd = ["trtexec", *args]
    logging.info(" ".join(cmd))
    with NamedTemporaryFile("w+b") as log:
        p = subprocess.Popen(cmd, stdout=log, stderr=log, cwd=str(cwd) if cwd else None)  # nosec
        p.wait()
        log.seek(0)
        output = log.read()
        if p.returncode != 0:
            logging.error(output.decode(errors="ignore"))
        return p.returncode, output
    try:
        result = subprocess.run(  # nosec B603 - cmd[0] is hardcoded "trtexec"
            cmd,
```

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Remove # nosec suppressions in subprocess invocation.

Line 64 and Line 69 use # nosec bypasses. This violates repo security policy and must be removed (or escalated with explicit approved exception flow in PR metadata).

Proposed fix
```diff
-    import subprocess  # nosec
+    import subprocess
@@
-        result = subprocess.run(  # nosec B603 - cmd[0] is hardcoded "trtexec"
+        result = subprocess.run(
             cmd,
             stdout=subprocess.PIPE,
             stderr=subprocess.STDOUT,
             cwd=str(cwd) if cwd else None,
             timeout=3600,
         )
```

As per coding guidelines: "Any use of '# nosec' comments to bypass Bandit security checks is not allowed."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/_deploy/_runtime/tensorrt/engine_builder.py` around lines 64 -
70, Remove the `# nosec` suppressions around the subprocess invocation in
engine_builder.py and harden the call instead: keep using subprocess.run with a
list (no shell=True), remove the comments, validate/sanitize the `args` list
before building `cmd = ["trtexec", *args]` (reject or escape unsafe characters
and ensure each entry is a string), verify the executable exists (e.g., use
shutil.which("trtexec") and raise a clear error), and wrap subprocess.run in
try/except to log and rethrow the caught exception from the subprocess.run
invocation so no security bypass comment is needed; update references to
subprocess.run, cmd, and args accordingly.

@codecov

codecov Bot commented May 11, 2026

Codecov Report

❌ Patch coverage is 83.84615% with 42 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.69%. Comparing base (cc06062) to head (5b16509).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| modelopt/onnx/quantization/qdq_utils.py | 86.12% | 29 Missing ⚠️ |
| .../torch/_deploy/_runtime/tensorrt/engine_builder.py | 66.66% | 6 Missing ⚠️ |
| modelopt/torch/quantization/model_calib.py | 37.50% | 5 Missing ⚠️ |
| modelopt/onnx/quantization/ort_utils.py | 85.71% | 1 Missing ⚠️ |
| modelopt/torch/opt/plugins/transformers.py | 85.71% | 1 Missing ⚠️ |
Additional details and impacted files
@@                Coverage Diff                 @@
##           release/0.44.0    #1426      +/-   ##
==================================================
- Coverage           77.19%   76.69%   -0.51%     
==================================================
  Files                 467      466       -1     
  Lines               50581    50778     +197     
==================================================
- Hits                39048    38944     -104     
- Misses              11533    11834     +301     
| Flag | Coverage Δ |
|---|---|
| examples | 41.27% <10.76%> (-0.32%) ⬇️ |
| gpu | 59.19% <10.38%> (-0.83%) ⬇️ |
| regression | 14.63% <1.65%> (-0.26%) ⬇️ |
| unit | 52.79% <75.76%> (+0.13%) ⬆️ |

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@kevalmorabia97 kevalmorabia97 removed the request for review from RalphMao May 11, 2026 13:19
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>