
[Cherry-pick] PRs #1393 #1389 #1268 #1397 #1402 #1411 #1410 #1419 #1408 #1416 #1426

Open
kevalmorabia97 wants to merge 11 commits into release/0.44.0 from cherry-picks/release-0.44.0

Conversation

@kevalmorabia97
Collaborator

@kevalmorabia97 kevalmorabia97 commented May 11, 2026

Cherry-picked PRs

Summary by CodeRabbit

  • New Features

    • SPEEDBench now uses stratified sampling for deterministic, balanced dataset selection.
    • Added legacy quantization conversion shims for INT4, MXFP8 and FP4→2DQ workflows.
    • AWQ Lite: fallback handling for uncalibrated per-expert quantizers during export.
  • Bug Fixes

    • Clamp FP8 scales in NVFP4 quantization to avoid NaNs.
    • Fixed warmup steps formatting in finetune launch script.
  • Improvements

    • LM-Eval integration updated for v0.4.10+ compatibility.
    • TensorRT execution routed through a dedicated trtexec helper.
  • Tests

    • Added regression tests covering quantization shims, FP8 scale behavior, export fallbacks, and LM eval.

Review Change Stack

kevalmorabia97 and others added 9 commits May 11, 2026 04:12
…compat (#1393)

Fix for NVBug 6120631:

```
finetune.py: error: argument --warmup_steps/--warmup-steps: invalid int value: '0.0'
```

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Chores**
* Corrected parameter format in finetuning example script for
consistency.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
When the dataset has a `category` column with >1 distinct categories and
``--num_requests N`` is below the dataset size, take ceil(N /
num_categories) rows from each category and round-robin interleave them
so any prefix is balanced. Falls back to the existing ``range(N)`` slice
when category metadata is absent or there's only one category.

Fixes a sampling bug where SPEED-Bench parquet files are sorted by
category, so e.g. ``--num_requests 64`` on throughput_8k pulls 64
high_entropy prompts and zero from low_entropy / mixed.
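For illustration, a minimal, self-contained sketch of the intended round-robin selection (hypothetical standalone helper, not the exact code in `speed.py`):

```python
def stratified_indices(categories: list[str], n: int) -> list[int]:
    """Round-robin row indices across categories so any prefix stays balanced."""
    buckets: dict[str, list[int]] = {}
    for i, cat in enumerate(categories):
        buckets.setdefault(cat, []).append(i)
    if len(buckets) <= 1:
        return list(range(n))  # fall back to the plain prefix slice
    picked: list[int] = []
    for round_idx in range(max(len(rows) for rows in buckets.values())):
        for rows in buckets.values():
            if round_idx < len(rows):
                picked.append(rows[round_idx])
                if len(picked) == n:
                    return picked
    return picked

# On a file sorted by category (3 rows each of hi/lo/mix), 4 requests now span
# all three categories instead of taking 4 rows of "hi":
# stratified_indices(["hi"] * 3 + ["lo"] * 3 + ["mix"] * 3, 4) -> [0, 3, 6, 1]
```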

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* SPEEDBench dataset truncation now performs deterministic, balanced
stratified sampling across categories (round-robin) when multiple
categories exist and a smaller sample size is requested.
* Falls back to simple prefix selection when no category or only one
category is present, preserving prior behavior in that case.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Alexandre Milesi <milesial@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…1268)

### What does this PR do?

Type of change: code improvement

`subprocess` is needed to run `trtexec` commands, but such calls require `nosec`
approval. This PR centralizes them into a single function for the ONNX
workflow to avoid future approval triggers. The torch workflow already
centralizes this in `run_command` (renamed to `_run_trtexec_streamed`).

Original request:
#1259 (comment)

### Usage

```python
results = _run_trtexec(cmd)
```
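For context, a minimal sketch of what such a centralized helper can look like (name and signature are assumptions, not the exact code):

```python
import subprocess

def _run_trtexec(args: list[str], timeout: int | None = None) -> subprocess.CompletedProcess:
    """Run trtexec with an argv list (no shell), capturing output.

    Sketch only: the real helper in ort_utils.py may differ in signature
    and error handling.
    """
    cmd = ["trtexec", *args]
    try:
        return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except FileNotFoundError as e:
        raise RuntimeError("trtexec not found; ensure it is available in PATH") from e
```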

### Testing
N/A

### Before your PR is "*Ready for review*"

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A <!---
Mandatory -->
- Did you write any new necessary tests?: N/A <!--- Mandatory for new
features or examples. -->
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
N/A <!--- Only for new features, API changes, critical bug fixes or
backward incompatible changes. -->

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Refactor**
* Standardized how the app invokes and logs the external
conversion/benchmarking tool across build, profile, and benchmark flows
for consistent execution and clearer logs.
* **Behavior Change**
* The tool is now invoked by name from PATH (configurable path option
removed); execution and logging were centralized.
* **Bug Fixes**
* Improved missing-tool error messaging: failures now instruct to ensure
the external tool is available in PATH.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
## Summary

- Saturates `per_block_scale * 448 / per_block_scale_max` to ≤ 448
before the `to(torch.float8_e4m3fn)` cast in
`NVFP4QTensor.get_weights_scaling_factor_from_quantizer`.
- Adds a regression test that reproduces the NaN byte without the clamp.

## Why

When `_amax` contains a zero entry (e.g. an all-zero weight block left
untouched by max calibration), the existing
`per_block_scale[per_block_scale == 0] = 1.0` safety net drives the
pre-cast value to `1.0 * 448 / (global_amax / 6)`. `fp8_e4m3fn` has no
Inf — anything `≥ 480` rounds to NaN — so a 0x7F byte slips into the
exported `weight_scale`.

This was observed in a saved Kimi-K2.6-NVFP4-MSE checkpoint at
`language_model.model.layers.1.mlp.experts.21.down_proj.weight_scale[4001,
18]`. The MSE FP8 sweep itself never produces zero per-block amax (it
always emits at least `c[0] * global_amax`), but any export path where
`_amax` ends up zero — including pure max calibration — hits the bug.
With the clamp the byte saturates to `0x7E` (= 448, fp8 max finite) and
dequantization is unaffected: the FP4 nibbles for an all-zero block are
all 0, so `0 × 448 × weight_scale_2 = 0` regardless of the stored fp8
scale. For non-degenerate blocks the clamp is a no-op since
`per_block_amax ≤ global_amax` already bounds the pre-cast value at 448.
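The clamp itself is a one-liner before the fp8 cast; a minimal sketch of the idea (tensor values are illustrative, not the exact code):

```python
import torch

# 1.0 * 448 / (global_amax / 6) with global_amax = 3.0 gives 896; anything
# >= 480 would encode as NaN (0x7F) in float8_e4m3fn without the clamp.
per_block_scale = torch.tensor([896.0, 100.0, 448.0])
clamped = per_block_scale.clamp(max=448.0).to(torch.float8_e4m3fn)
assert not torch.isnan(clamped.float()).any()  # saturates to 448 (0x7E) instead
```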

## Test plan

- [x] New regression test
`test_export_fp8_scale_no_nan_for_zero_amax_block` fails on `main`'s
export code (reproduces the 0x7F NaN byte) and passes with the clamp.
- [x] Existing tests in
`tests/gpu/torch/quantization/test_nvfp4_static_quantizer_cuda.py` still
pass (10/10).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Improved numerical stability in FP8 quantization scaling by preventing
overflow and NaN conditions
* Enhanced handling of edge cases in quantization processing for
zero-weight blocks

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…tom configs (#1402)

The export step's collect_shared_input_modules() runs a dummy forward
pass to trace shared input modules for resmoothing/requantization. This
forward call hits the same AttributeError as PR #1324 on configs that
don't assign use_cache (e.g., stepfun-ai/Step-3.5-Flash's
Step3p5Config):

  AttributeError: 'Step3p5Config' object has no attribute 'use_cache'

Wrap the dummy forward with the _disable_use_cache context manager so it
sets/restores config.use_cache around the call, mirroring the
calibration fix from #1324. Covers all callsites since both
unified_export_hf and plugins/vllm_fakequant_hf funnel through
collect_shared_input_modules.
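A minimal sketch of the context-manager pattern, assuming a simplified signature (the real `_disable_use_cache` may differ):

```python
from contextlib import contextmanager

@contextmanager
def _disable_use_cache(config):
    """Force config.use_cache = False around a dummy forward, then restore
    the prior state, including the 'attribute absent' case that triggers
    the AttributeError above."""
    had_attr = hasattr(config, "use_cache")
    original = getattr(config, "use_cache", None)
    config.use_cache = False
    try:
        yield
    finally:
        if had_attr:
            config.use_cache = original
        else:
            del config.use_cache
```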


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Improved model export process by optimizing cache handling during
profiling operations.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…le_2 consistent (#1411)

## Summary
- Pre-fill the source `gate_up_proj_weight_quantizers[idx]._amax` from
the **fused** `gate_up[idx]` tensor when it is uncalibrated, **before**
the per-projection deepcopy in `_export_fused_experts`. Both gate's
clone and up's clone then inherit the same scalar amax, so
`weight_scale_2 = amax / (6 · 448)` matches across the W1/W3 fusion that
vLLM expects at load time.
- Add a unit regression test that builds a fused-experts module with
intentionally mismatched gate vs. up weight magnitudes, leaves every
expert uncalibrated, and asserts the gate and up wrappers carry the same
amax into the FP4 quantization step. Fails on `main`, passes with this
fix.

## Why

NVbug 6142360: Qwen3.5-MoE / Qwen3-Next NVFP4 checkpoints produced by
ModelOpt yielded garbled output under vLLM (`1\n1\n1\n…` style
degenerate token loops). The vLLM log showed `w1_weight_scale_2 must
match w3_weight_scale_2. Accuracy may be affected.`.

Root cause: in `_export_fused_experts`, when an expert receives no
calibration tokens (common for low-frequency experts even with
`--calib_size 512` on 128-expert MoEs), the per-projection fallback
computed `amax` independently from each split slice:

```python
w_quantizer.amax = weight_slice.abs().amax().to(torch.float32)
```

`weight_slice` is the gate-only or up-only 2-D half of the fused
`gate_up_proj`. Since gate and up have different magnitudes, gate and up
end up with different scalar amax values — and therefore different
`weight_scale_2`. vLLM fuses W1 and W3 into a single weight at load time
and asserts a single shared scale; the half whose scale was discarded is
now off by the gate/up magnitude ratio (~10× was typical), which
catastrophically corrupts the MoE output.

The fix derives the fallback amax once from the whole `gate_up[idx]`
tensor before the deepcopies, so gate and up share the same amax —
exactly what calibration would have produced if any token had hit the
expert. Calibrated experts are unaffected (the new code path is gated on
`_amax` being missing or zero). `down_proj` keeps its existing
per-projection fallback because it has its own quantizer with no fusion
partner.
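A minimal sketch of the fallback, using a toy quantizer object (the helper name and stand-in object are hypothetical; the real change is inside `_export_fused_experts`):

```python
import torch
from types import SimpleNamespace

def prefill_fused_amax(quantizer, fused_weight: torch.Tensor) -> None:
    """Derive one scalar amax from the whole fused gate_up tensor when the
    quantizer is uncalibrated, so the gate and up deepcopies inherit the
    same weight_scale_2 = amax / (6 * 448)."""
    amax = getattr(quantizer, "_amax", None)
    if amax is None or not torch.count_nonzero(amax):
        quantizer.amax = fused_weight.abs().amax().to(torch.float32)

# Gate half roughly 10x smaller than the up half, as in the repro:
q = SimpleNamespace(_amax=None, amax=None)
fused = torch.cat([torch.full((4, 8), 0.02), torch.full((4, 8), 0.2)], dim=1)
prefill_fused_amax(q, fused)
print(q.amax)  # tensor(0.2000): one shared amax instead of 0.02 vs 0.2
```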

## End-to-end vLLM verification

Used `vllm/vllm-openai:nightly` Docker on RTX 6000 Ada, Marlin NVFP4
backend, with a real Qwen3-30B-A3B-Instruct NVFP4 checkpoint. The bug
requires the export-time scale skew, so I A/B-tested by mutating
`gate_proj.weight_scale_2` directly:

| Variant | gate vs up `weight_scale_2` | Output for "Write an article
about AI." |
|---|---|---|
| Baseline (original checkpoint) | matched (calibrated) | coherent: "The
Rise of Artificial Intelligence: Transforming the World One Algorithm at
a Time…" |
| Bug-simulated (gate = up / 10 for all 6143 expert pairs) | mismatched
~10× | `__':\n__':\n__':\n…` (degenerate loop, same shape as user's
`1\n1\n1\n…`) |
| Fix-simulated (gate restored to up) | matched | coherent: same article
as baseline |

The 10× skew matches what the bug actually emits for uncalibrated
experts (e.g. `gate=2.77e-5, up=2.30e-4` from a synthetic repro), and
reproduces the same garbled-loop pattern the user reported.

## Test plan
- [x] New unit test
`tests/unit/torch/quantization/plugins/test_fused_experts.py::TestExportFusedExperts::test_uncalibrated_expert_gate_up_share_amax`
passes.
- [x] Full
`tests/unit/torch/quantization/plugins/test_fused_experts.py`: 28/28
pass.
- [x] Broader `tests/unit/torch/export/` +
`tests/unit/torch/quantization/`: 609 passed, 9 skipped.
- [x] vLLM end-to-end A/B (table above): bug reproduces with mismatched
scales, fix produces coherent output identical to baseline.


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Enhanced handling of uncalibrated mixture-of-experts weight quantizers
during export with improved consistency checks and informative warnings.

* **Tests**
  * Added regression test for uncalibrated expert quantization behavior.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: weimingc <weimingc@nvidia.com>
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
…zer disabled (#1410)

## Summary

`awq_lite.setup()` disables `module.input_quantizer` at the start of
search. The calibrated branch re-enables it inside `postprocess()`, but
the uncalibrated branch (no cache-pass tokens, e.g. an MoE expert that
never gets routed) never did. Worse, for experts that had cache hits but
missed the search pass, the per-channel `_amax` left over from
`max_calibrate` during cache mode tripped `preprocess_linear_fusion`'s
`numel == 1` assertion and prevented `_export_quantized_weight` from
emitting a per-tensor `input_scale`.

Result: per-expert `input_scale` was missing in the exported HF
checkpoint, and TRT-LLM `CutlassFusedMoE` crashed on load with
`KeyError: '<idx>.w1.input_scale'` for any expert that did not see
enough calibration tokens (e.g. Qwen3-30B-A3B + `nvfp4_awq` from the bug
report).

## Fix

Mirror the calibrated `postprocess()` path in
`modelopt/torch/quantization/model_calib.py`: collapse any per-channel
`_amax` to scalar (axis=None) and re-enable the `input_quantizer`.
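A minimal sketch of that postprocess step (hypothetical helper name; the real logic lives in `model_calib.py` and uses TensorQuantizer internals):

```python
import torch

def finalize_uncalibrated_input_quantizer(input_quantizer) -> None:
    """Collapse a per-channel amax left over from cache-mode max calibration
    to a per-tensor scalar (axis=None), then re-enable the quantizer,
    mirroring the calibrated postprocess() path."""
    amax = getattr(input_quantizer, "_amax", None)
    if amax is not None and amax.numel() > 1:
        input_quantizer.axis = None
        input_quantizer.amax = amax.amax()  # per-channel -> scalar
    input_quantizer.enable()
```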

## Test plan

- [x] Added regression test
`test_awq_lite_uncalibrated_linear_keeps_input_quantizer_enabled` using
`NVFP4_AWQ_LITE_CFG` with a two-branch model where only one branch is
exercised; verifies the uncalibrated linear's `input_quantizer` remains
enabled after `mtq.quantize`.
- [x] End-to-end pipeline test on a tiny synthetic Qwen3-MoE (8 experts,
top-1 routing) confirms all 48 expert `input_scale` keys are present in
the exported state_dict (vs. multiple missing pre-fix).
- [x] Manual repro of the bug command (Qwen3-30B-A3B, `--qformat
nvfp4_awq`) confirmed the missing `input_scale` keys before the fix.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Bug Fixes**
* Fixed AWQ-Lite quantization for uncalibrated modules so
export/preprocessing invariants are preserved even when
calibration/parameter updates are skipped.

* **Tests**
* Added regression tests: one verifies uncalibrated linear modules keep
their input quantizer enabled after quantization; another verifies
weight-only AWQ-Lite cases keep the input quantizer disabled as
expected.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Updated compatibility handling to support Transformers 5.0 while
maintaining backward compatibility with earlier versions.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Suguna Velury <178320438+sugunav14@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
## What does this PR do?

**Type of change:** Bug fix (backward-compat restoration), follow-up to
#1356.

#1356 restored `modelopt.onnx.llm_export_utils` as a deprecation shim so
that TensorRT-Edge-LLM 0.6.1 could resume importing
`fold_fp8_qdq_to_dq`. QA reports that the next line of
`tensorrt_edgellm/onnx_export/onnx_utils.py` still aborts at module
load:

```python
from modelopt.onnx.llm_export_utils.surgeon_utils import fold_fp8_qdq_to_dq   # OK
from modelopt.onnx.quantization.qdq_utils import (fp4qdq_to_2dq,
                                                  quantize_weights_to_int4,
                                                  quantize_weights_to_mxfp8)
# ImportError: cannot import name 'fp4qdq_to_2dq' from 'modelopt.onnx.quantization.qdq_utils'
```

Those three symbols were removed when the LLM ONNX export pipeline was
refactored into the staged `modelopt.onnx.export` exporters
(`INT4QuantExporter`, `NVFP4QuantExporter`, `MXFP8QuantExporter`). The
new exporters still implement the same conversions, but edgellm 0.6.1
imports the old top-level functions unconditionally, so every
`tensorrt-edgellm-*` CLI fails immediately — same 5/5 (100%) failure
rate as before, just one line further in.

## What this PR does

- Restores `fp4qdq_to_2dq`, `quantize_weights_to_int4`, and
`quantize_weights_to_mxfp8` on `modelopt.onnx.quantization.qdq_utils` as
deprecation shims that:
- emit a `DeprecationWarning` pointing readers at the new
`modelopt.onnx.export` exporters and at TensorRT-Edge-LLM as the
long-term migration target,
- reuse the existing in-tree helpers (`_cast_fp8` here, `_cast_fp4` and
`_replace_fp4qdq_with_2dq` from `modelopt.onnx.export.nvfp4_exporter`
via a lazy import to avoid a circular import — `nvfp4_exporter` already
imports `onnx_dtype_map` from this module),
- reuse `pack_weights_to_int4`, `get_amax`, `compute_e8m0`,
`get_weights_scaling_factor[_2]`, `quantize`, `get_attribute`,
`has_attribute`, and `read_f16_tensor_as_fp32` from their canonical
locations rather than duplicating them.
- Adds the matching imports at the top of `qdq_utils.py`. No new files,
no public API removals, no behavior changes for any existing caller.

I also audited every other `modelopt` import edgellm 0.6.1 makes
(`modelopt.torch.quantization`, `modelopt.torch.opt`,
`modelopt.torch.export.quant_utils`,
`modelopt.torch.opt.plugins.huggingface`,
`modelopt.onnx.quantization.gs_patching`,
`modelopt.torch.quantization.nn.{QuantLinear,TensorQuantizer}`,
`modelopt.torch.quantization.utils.{is_quantized,is_quantized_linear}`)
and confirmed each still resolves on current `main`; only the three
`qdq_utils` symbols above were missing.

## Testing

- All four edgellm-side import lines from the failing traceback now
succeed against this branch (verified by importing them from the v0.6.1
source tree).
- `tests/unit/onnx/quantization/test_qdq_utils.py` +
`tests/unit/onnx/test_onnx_utils.py`: 36 passed in 5s.
- Each restored shim emits `DeprecationWarning` when called and runs to
completion on an empty `ModelProto` smoke input.
- `pre-commit run --files modelopt/onnx/quantization/qdq_utils.py`:
ruff, ruff-format, mypy, bandit, license header all pass.

## Notes for reviewers

- The shim functions are intentionally near-verbatim copies of the
pre-removal implementations rather than thin adapters over the new
exporters: the new exporters split work across `compute_scales` /
`compress_weights` / `post_process` stages, while edgellm calls the
legacy single-shot entry points and depends on side effects (e.g.
casting bias to fp16 inside `quantize_weights_to_int4`) that the staged
exporters do not all reproduce in one call. Wrapping them up was
higher-risk than restoring the legacy paths verbatim with a deprecation
warning.
- Removal target: same as the existing `llm_export_utils` shim — drop
together once edgellm releases a version that targets
`modelopt.onnx.export` directly.
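To make the shim pattern concrete, a minimal sketch of one restored entry point (body and call shape are illustrative only; the real shims reuse the in-tree helpers listed above):

```python
import warnings

def fp4qdq_to_2dq(onnx_model):
    """Deprecated legacy entry point kept for TensorRT-Edge-LLM 0.6.1."""
    warnings.warn(
        "fp4qdq_to_2dq is deprecated; migrate to the modelopt.onnx.export "
        "exporters (NVFP4QuantExporter) or TensorRT-Edge-LLM.",
        DeprecationWarning,
        stacklevel=2,
    )
    # Lazy import avoids the circular import: nvfp4_exporter already imports
    # onnx_dtype_map from qdq_utils. (Call shape assumed for illustration.)
    from modelopt.onnx.export.nvfp4_exporter import _replace_fp4qdq_with_2dq

    return _replace_fp4qdq_with_2dq(onnx_model)
```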

## Before your PR is "*Ready for review*"

- [x] Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed.
- [x] Is this change backward compatible?: **Yes** — purely additive; no
existing API changes.
- [x] Did you write any new necessary tests?: relied on existing
qdq_utils unit tests + a manual repro of the failing edgellm import
path.
- [x] Did you add or update any necessary documentation?: deprecation
message in each shim points to the migration target.
- [x] Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
not applicable for a compat shim.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Chores**
* Added deprecated legacy quantization conversion utilities to preserve
compatibility; invoking them now emits deprecation warnings.
* **Documentation**
  * Updated helper docstrings to note compatibility/shim usage.
* **Tests**
* Added end-to-end smoke tests verifying the deprecated shims emit
warnings and produce expected transformed/quantized models.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@kevalmorabia97 kevalmorabia97 requested review from a team as code owners May 11, 2026 11:13
@kevalmorabia97 kevalmorabia97 requested review from RalphMao, cjluo-nv, realAsma, sugunav14 and yeyu-nvidia and removed request for a team May 11, 2026 11:13
@coderabbitai
Contributor

coderabbitai Bot commented May 11, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0787cc16-3ffd-417b-9408-e17966495872

📥 Commits

Reviewing files that changed from the base of the PR and between 3f28248 and 5b16509.

📒 Files selected for processing (1)
  • pyproject.toml
✅ Files skipped from review due to trivial changes (1)
  • pyproject.toml

📝 Walkthrough

Walkthrough

This PR consolidates TensorRT subprocess execution into shared helpers, adds legacy ONNX quantization shims for backwards compatibility, fixes FP8 scale overflow handling, improves quantizer state consistency for fused experts, extends model export compatibility for newer dependency versions, and updates evaluation examples to integrate with lm-eval 0.4.10+.

Changes

TensorRT Subprocess Execution Refactoring

- **Shared trtexec Helper** (`modelopt/onnx/quantization/ort_utils.py`, `modelopt/onnx/quantization/autotune/benchmark.py`): Introduces `_run_trtexec()` with subprocess wrapping, timeout support, and clearer error handling; updates `_check_for_trtexec()` and replaces direct subprocess usage in benchmark.
- **TensorRT Constants Cleanup** (`modelopt/torch/_deploy/_runtime/tensorrt/constants.py`): Removes `TRTEXEC` and `TRTEXEC_PATH` constants no longer needed after subprocess refactoring.
- **Engine Builder Integration** (`modelopt/torch/_deploy/_runtime/tensorrt/engine_builder.py`): Adds `_run_trtexec_with_logging()` wrapper for hardcoded trtexec invocation; refactors `build_engine` and `profile_engine` to build command args without executable path and use the new wrapper.
- **Test Mocking Updates** (`tests/unit/torch/deploy/_runtime/tensorrt/test_engine_builder.py`): Updates test fixture to mock refactored `_run_trtexec_with_logging` instead of `_run_command`.

Legacy ONNX Quantization Deprecation Shims

- **Helper Documentation** (`modelopt/onnx/export/nvfp4_exporter.py`): Adds docstring notes documenting that `_cast_fp4` and `_replace_fp4qdq_with_2dq` are reused by deprecated shims and must maintain stable signatures.
- **Legacy Conversion Functions** (`modelopt/onnx/quantization/qdq_utils.py`): Adds three deprecation-wrapped conversion shims: `quantize_weights_to_int4`, `quantize_weights_to_mxfp8`, and `fp4qdq_to_2dq`; expands quantization helper imports and adds deprecation messaging.
- **Shim Test Coverage** (`tests/unit/onnx/quantization/test_qdq_utils.py`): Adds `TestLegacyEdgeLLMShims` with end-to-end regression tests verifying deprecation warnings and correct graph/initializer transformations for all three shims.

Quantizer State and FP8 Scale Handling

- **FP8 Scale Overflow Prevention** (`modelopt/torch/quantization/qtensor/nvfp4_tensor.py`, `tests/gpu/torch/quantization/test_nvfp4_static_quantizer_cuda.py`): Clamps FP8 scale values to 448.0 before casting to float8_e4m3fn to prevent NaN encoding; adds regression test checking no NaN-encoded bytes and expected saturation.
- **Fused Expert amax Fallback & AWQ-Lite Postprocessing** (`modelopt/torch/export/moe_utils.py`, `modelopt/torch/quantization/model_calib.py`, `tests/unit/torch/quantization/plugins/test_fused_experts.py`, `tests/unit/torch/quantization/test_calib.py`): Adds pre-split amax fallback for fused gate/up quantizers and extends AWQ-Lite disabled-expert postprocess to collapse per-channel amax and restore input_quantizer state; includes regression tests for both behaviors.

Model Export and Dependency Compatibility

- **use_cache Handling** (`modelopt/torch/export/unified_export_hf.py`): Wraps dummy forward pass in `collect_shared_input_modules` with `_disable_use_cache` to support models where `use_cache` affects probe forward.
- **Transformers Version Gating** (`modelopt/torch/opt/plugins/transformers.py`): Detects Transformers version at runtime and conditionally passes `load_config` to cached zero3 loader only for versions >= 5.0.

LLM Evaluation Framework Integration

- **lm_eval Version and CLI** (`examples/llm_eval/lm_eval_hf.py`): Enforces lm_eval >= 0.4.10 (ImportError if older); switches to `HarnessCLI`/`setup_logging`; adds `_add_modelopt_args` parser-extension hook and `_inject_modelopt_args_into_model_args`; refactors `__main__` to call `cli.execute(args)`.
- **Requirements Update** (`examples/llm_eval/requirements.txt`, `examples/puzzletron/requirements.txt`): Updates llm_eval requirement to `lm_eval[api,ifeval]>=0.4.10` and removes `lm-eval==0.4.8` from puzzletron example.
- **Test Suite** (`tests/examples/llm_eval/test_llm_eval.py`): Adds `test_lm_eval_hf()` for HF eval and replaces previous FP8 llama test with Qwen3-based `test_qwen3_eval_fp8()`; updates imports and tiny-model creation helpers.

Example Scripts and Dataset Utilities

- **Parameter Updates** (`examples/llm_sparsity/weight_sparsity/launch_finetune.sh`): Changes `--warmup_steps` from 0.0 to 0.
- **Dataset Sampling** (`examples/specdec_bench/specdec_bench/datasets/speed.py`): Replaces simple prefix truncation with category-stratified round-robin sampling via new `_stratified_select()` helper when `num_samples` is specified.
- **pyproject hf Extra Formatting** (`pyproject.toml`): Adjusts whitespace/alignment in the hf optional dependency block without changing dependencies.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Suggested reviewers

  • cjluo-nv
  • sugunav14

Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Security Anti-Patterns | ❌ Error | PR contains 5 `# nosec` comments in modified files, violating SECURITY.md, which prohibits this pattern. | Remove `# nosec` comments from ort_utils.py, engine_builder.py, benchmark.py. Add inline safety justifications. Request @NVIDIA/modelopt-setup-codeowners review with justification in PR description per SECURITY.md. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 37.50%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title directly and accurately reflects the core purpose: cherry-picking multiple numbered PRs into a release branch. It is concise, specific, and clearly communicates the main change. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

…1416)

### What does this PR do?

Type of change: Bug fix

Related: NVbug 6153721. To make `lm_eval_hf.py` and the `vllm_causallms`
adapter compatible with vLLM 0.20+, **lm_eval must be upgraded to
0.4.11**.

`examples/llm_eval/lm_eval_hf.py` failed to import on lm-eval >= 0.4.10:

```
ImportError: cannot import name 'parse_eval_args' from 'lm_eval.__main__'
```

lm-eval 0.4.10 replaced `lm_eval.__main__.{setup_parser,
parse_eval_args, cli_evaluate}` with a `HarnessCLI`-based interface in
`lm_eval._cli`. This PR drops the legacy code path and drives
`HarnessCLI` directly:

- Attach the ModelOpt arguments (`--quant_cfg`, `--auto_quantize_*`,
`--calib_*`, `--compress`, `--sparse_cfg`) to the new `run` subparser.
- After parsing, move those keys out of the argparse namespace and into
`args.model_args` (now a dict, courtesy of `MergeDictAction`), so
`EvaluatorConfig.from_cli` doesn't reject them as unknown kwargs and so
they reach our `HFLM.create_from_arg_obj` override.
- Hard-require lm-eval >= 0.4.10 via `packaging.version.Version` and
bump `examples/llm_eval/requirements.txt` and
`examples/puzzletron/requirements.txt` accordingly.

### Usage

```bash
# Same CLI as before — HarnessCLI auto-inserts the `run` subcommand for legacy-style invocations.
python examples/llm_eval/lm_eval_hf.py \
    --model hf \
    --model_args pretrained=<HF model> \
    --tasks hellaswag \
    --quant_cfg FP8_DEFAULT_CFG \
    --batch_size 4
```
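A minimal sketch of the namespace-to-`model_args` hand-off described above (abbreviated key list; the real `_MODELOPT_ARG_KEYS` covers all flags):

```python
_MODELOPT_ARG_KEYS = ["quant_cfg", "calib_size", "compress", "sparse_cfg"]  # abbreviated

def _inject_modelopt_args_into_model_args(args):
    """Move ModelOpt flags out of the argparse namespace into args.model_args
    (a dict under lm-eval >= 0.4.10), so EvaluatorConfig.from_cli does not
    reject them and they reach the HFLM.create_from_arg_obj override."""
    model_args = args.model_args or {}
    for key in _MODELOPT_ARG_KEYS:
        if hasattr(args, key):
            model_args[key] = getattr(args, key)
            delattr(args, key)
    args.model_args = model_args
    return args
```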

### Testing

Added `tests/examples/llm_eval/test_llm_eval.py::test_lm_eval_hf` — an
end-to-end test that builds a tiny qwen3 (no GPU required) and runs
`lm_eval_hf.py` against MMLU with `--limit 0.1`. Verifies both the new
HarnessCLI integration and the `HFLM.create_from_arg_obj` override
actually execute. Existing FP8 PTQ test (`test_qwen3_eval_fp8`) was
retargeted to the same tiny qwen3 helper so it no longer pulls down the
1.1B TinyLlama checkpoint.

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ❌ — drops support for lm-eval <
0.4.10. Requirements pins are bumped in the same PR; existing CLI
invocations continue to work because HarnessCLI auto-inserts the `run`
subcommand for legacy-style argv.
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: ✅
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
N/A
- Did you get Claude approval on this PR?: ❌

### Additional Information

NVbug 6153721 — vLLM 0.20+ compatibility requires lm_eval 0.4.11. This
PR pins to >= 0.4.10 to fix the import error; bumping the floor to
0.4.11 (or letting users opt in via vLLM extras) unblocks vLLM 0.20+
usage.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Chores**
* Require lm_eval >= 0.4.10 for evaluation examples and remove an
lm-eval pin from a puzzletron example.
* Improve LM-eval CLI integration so model-related CLI options
(including trust_remote_code) are recognized and forwarded correctly.

* **Tests**
  * Added an end-to-end smoke test for the LM evaluation example.
* Updated evaluation tests to use the new tiny model setup for more
robust validation.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@kevalmorabia97 kevalmorabia97 requested review from a team as code owners May 11, 2026 11:15
@kevalmorabia97 kevalmorabia97 changed the title from "[Cherry-pick] PRs #1393 #1389 #1268 #1397 #1402 #1411 #1410 #1419 #1408" to "[Cherry-pick] PRs #1393 #1389 #1268 #1397 #1402 #1411 #1410 #1419 #1408 #1416" May 11, 2026
@github-actions
Contributor

github-actions Bot commented May 11, 2026

PR Preview Action v1.8.1


🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1426/

Built to branch gh-pages at 2026-05-11 13:26 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tests/unit/torch/deploy/_runtime/tensorrt/test_engine_builder.py (1)

67-67: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Match mocked subprocess output type to runtime contract (bytes).

Line 67 returns "success" as str, but _run_trtexec_with_logging returns bytes. Keeping the mock as bytes avoids type-skew in tests.

Proposed fix
```diff
-        mock_run.return_value = (0, "success")
+        mock_run.return_value = (0, b"success")
```

(Then update string assertions in this test file to b"success".)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/torch/deploy/_runtime/tensorrt/test_engine_builder.py` at line 67,
The test mocks subprocess output as a str but the runtime function
_run_trtexec_with_logging returns bytes; change mock_run.return_value in
test_engine_builder.py from (0, "success") to (0, b"success") and update any
assertions comparing the captured output (e.g., equality checks against
"success") to use b"success" so types match the runtime contract; look for uses
of mock_run, _run_trtexec_with_logging-related assertions, and string
comparisons in that test file and replace them with the bytes literal.
examples/llm_eval/lm_eval_hf.py (1)

193-217: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't overwrite user --model_args with parser defaults.

Lines 248-259 unconditionally copy every ModelOpt parser field into args.model_args, but several of those fields have defaults on Lines 193, 206, 217, plus the implicit False from Line 231. That means a call like --model_args calib_size=1024,compress=True gets silently rewritten back to 512 / False unless the same values are also passed as top-level flags.

create_from_arg_obj() already supplies these defaults, so the safer pattern is to leave parser defaults unset and only inject keys the user explicitly provided.

Proposed fix
```diff
     parser.add_argument(
-        "--calib_size", type=int, help="Calibration size for quantization", default=512
+        "--calib_size", type=int, help="Calibration size for quantization"
     )
@@
     parser.add_argument(
         "--auto_quantize_method",
         type=str,
-        default="gradient",
         choices=["gradient", "kl_div"],
         help=(
             "Method for auto_quantize sensitivity analysis. 'gradient' uses gradient-based method "
@@
     parser.add_argument(
         "--auto_quantize_score_size",
         type=int,
-        default=128,
         help=(
             "Number of samples to use for auto_quantize scoring. Most of auto_quantize time is spent on "
             "sensitivity score estimation, so reducing this speeds it up while only minimally affecting "
@@
     parser.add_argument(
         "--compress",
         action="store_true",
+        default=None,
         help="Compress the model after quantization",
     )
@@
     for key in _MODELOPT_ARG_KEYS:
         if hasattr(args, key):
-            model_args[key] = getattr(args, key)
+            value = getattr(args, key)
+            if value is not None:
+                model_args[key] = value
             delattr(args, key)
```

Also applies to: 229-232, 241-259

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/llm_eval/lm_eval_hf.py` around lines 193 - 217, The code is
overwriting user-provided args.model_args with parser default values for
ModelOpt fields (e.g., calib_size, auto_quantize_bits, auto_quantize_method,
auto_quantize_score_size and the implicit booleans); instead, only merge keys
that the user explicitly passed at the command line into args.model_args so
create_from_arg_obj() can supply defaults. Modify the merge logic that currently
copies every ModelOpt parser field into args.model_args to: detect which flags
were explicitly set (via argparse's actions or by comparing against
parser.get_default for each option) and only inject those keys into
args.model_args (leave absent fields untouched so create_from_arg_obj() provides
defaults). Ensure this change applies to the same merging blocks that handle
calib_size/auto_quantize_* and any other ModelOpt fields referenced earlier.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/specdec_bench/specdec_bench/datasets/speed.py`:
- Around line 745-773: Add an early guard in _stratified_select to handle
non-positive n: if n <= 0 return dataset.select(range(n)) before checking
"category" presence or building cat_to_rows, so the interleaving code never runs
for n<=0 and the function returns an empty selection deterministically.

In `@modelopt/onnx/quantization/ort_utils.py`:
- Line 66: Remove the inline Bandit suppression by deleting the trailing "#
nosec B603" on the subprocess.run call so the security scan can validate it;
ensure the call remains using the argv list variable `cmd` (as already used) and
keeps the existing parameters (capture_output=True, text=True, timeout=timeout)
or replace with an explicit safer pattern if desired (e.g., subprocess.run(cmd,
capture_output=True, text=True, timeout=timeout, check=True))—locate the return
statement that invokes subprocess.run(cmd, ...) in ort_utils.py and remove only
the "# nosec B603" comment.

In `@modelopt/onnx/quantization/qdq_utils.py`:
- Around line 1264-1272: The current filter builds weight_dq_nodes by selecting
all DequantizeLinear nodes, causing index errors when activation DQ nodes (whose
inputs aren't in initializer_map) are processed; update the selection logic to
only include DequantizeLinear nodes whose input[0] is present in initializer_map
(i.e., backed by an initializer) before iterating. Locate the list comprehension
that defines weight_dq_nodes and change it to check initializer_map membership
of node.input[0] (or validate via get_tensor_producer_nodes/initializer_map) so
only weight-backed DequantizeLinear nodes are processed.
- Around line 1348-1359: The branch that replaces "Constant_output_0" with
"scale" and unconditionally appends a new initializer can create duplicate
initializer names; update the logic in qdq_utils.py around
scale_name/initializer_map/tensor_producer_map so that before calling
graph.initializer.append(...) you check whether an initializer with the target
name already exists (e.g., consult initializer_map or graph.initializer names)
and if it does, reuse that existing initializer and do not append a new one;
alternatively generate a unique name when needed and set node.input[1] to that
unique or existing initializer name to avoid duplicate initializers.

In `@modelopt/torch/_deploy/_runtime/tensorrt/engine_builder.py`:
- Around line 64-70: Remove the `# nosec` suppressions around the subprocess
invocation in engine_builder.py and harden the call instead: keep using
subprocess.run with a list (no shell=True), remove the comments,
validate/sanitize the `args` list before building `cmd = ["trtexec", *args]`
(reject or escape unsafe characters and ensure each entry is a string), verify
the executable exists (e.g., use shutil.which("trtexec") and raise a clear
error), and wrap subprocess.run in try/except to log and rethrow the caught
exception from the subprocess.run invocation so no security bypass comment is
needed; update references to subprocess.run, cmd, and args accordingly.

---

Outside diff comments:
In `@examples/llm_eval/lm_eval_hf.py`:
- Around line 193-217: The code is overwriting user-provided args.model_args
with parser default values for ModelOpt fields (e.g., calib_size,
auto_quantize_bits, auto_quantize_method, auto_quantize_score_size and the
implicit booleans); instead, only merge keys that the user explicitly passed at
the command line into args.model_args so create_from_arg_obj() can supply
defaults. Modify the merge logic that currently copies every ModelOpt parser
field into args.model_args to: detect which flags were explicitly set (via
argparse's actions or by comparing against parser.get_default for each option)
and only inject those keys into args.model_args (leave absent fields untouched
so create_from_arg_obj() provides defaults). Ensure this change applies to the
same merging blocks that handle calib_size/auto_quantize_* and any other
ModelOpt fields referenced earlier.

In `@tests/unit/torch/deploy/_runtime/tensorrt/test_engine_builder.py`:
- Line 67: The test mocks subprocess output as a str but the runtime function
_run_trtexec_with_logging returns bytes; change mock_run.return_value in
test_engine_builder.py from (0, "success") to (0, b"success") and update any
assertions comparing the captured output (e.g., equality checks against
"success") to use b"success" so types match the runtime contract; look for uses
of mock_run, _run_trtexec_with_logging-related assertions, and string
comparisons in that test file and replace them with the bytes literal.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: caa499f6-ea17-46a7-a8dd-11708cb15dd5

📥 Commits

Reviewing files that changed from the base of the PR and between cc06062 and 3f28248.

📒 Files selected for processing (22)
  • examples/llm_eval/lm_eval_hf.py
  • examples/llm_eval/requirements.txt
  • examples/llm_sparsity/weight_sparsity/launch_finetune.sh
  • examples/puzzletron/requirements.txt
  • examples/specdec_bench/specdec_bench/datasets/speed.py
  • modelopt/onnx/export/nvfp4_exporter.py
  • modelopt/onnx/quantization/autotune/benchmark.py
  • modelopt/onnx/quantization/ort_utils.py
  • modelopt/onnx/quantization/qdq_utils.py
  • modelopt/torch/_deploy/_runtime/tensorrt/constants.py
  • modelopt/torch/_deploy/_runtime/tensorrt/engine_builder.py
  • modelopt/torch/export/moe_utils.py
  • modelopt/torch/export/unified_export_hf.py
  • modelopt/torch/opt/plugins/transformers.py
  • modelopt/torch/quantization/model_calib.py
  • modelopt/torch/quantization/qtensor/nvfp4_tensor.py
  • tests/examples/llm_eval/test_llm_eval.py
  • tests/gpu/torch/quantization/test_nvfp4_static_quantizer_cuda.py
  • tests/unit/onnx/quantization/test_qdq_utils.py
  • tests/unit/torch/deploy/_runtime/tensorrt/test_engine_builder.py
  • tests/unit/torch/quantization/plugins/test_fused_experts.py
  • tests/unit/torch/quantization/test_calib.py
💤 Files with no reviewable changes (2)
  • examples/puzzletron/requirements.txt
  • modelopt/torch/_deploy/_runtime/tensorrt/constants.py

Comment on lines +745 to +773
```python
def _stratified_select(dataset: "Dataset", n: int) -> "Dataset":
    """Select ``n`` samples uniformly across the ``category`` column.

    Round-robin across categories until ``n`` rows are collected. The
    resulting prefix is balanced; once a smaller category is exhausted
    the remaining categories continue contributing, so exactly ``n``
    rows are returned whenever ``n`` does not exceed the dataset size.
    Falls back to ``range(n)`` when ``category`` is absent or there is
    only one category. Indices come from ``range(category_size)`` (not
    random) so behavior is deterministic.
    """
    if "category" not in dataset.column_names:
        return dataset.select(range(n))
    cat_to_rows: dict[str, list[int]] = {}
    for i, c in enumerate(dataset["category"]):
        cat_to_rows.setdefault(c, []).append(i)
    if len(cat_to_rows) <= 1:
        return dataset.select(range(n))
    cat_lists = list(cat_to_rows.values())
    interleaved: list[int] = []
    max_len = max(len(c) for c in cat_lists)
    for i in range(max_len):
        for c in cat_lists:
            if i < len(c):
                interleaved.append(c[i])
                if len(interleaved) == n:
                    return dataset.select(interleaved)
    return dataset.select(interleaved)
```


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard non-positive sample counts before interleaving.

When n <= 0, this method can still return non-empty results for multi-category datasets (because the round-robin loop appends before any len(interleaved) == n check can succeed). That makes num_samples=0 behave incorrectly.

Suggested fix
```diff
 @staticmethod
 def _stratified_select(dataset: "Dataset", n: int) -> "Dataset":
+    if n <= 0:
+        return dataset.select([])
+
     if "category" not in dataset.column_names:
         return dataset.select(range(n))
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/specdec_bench/specdec_bench/datasets/speed.py` around lines 745 -
773, Add an early guard in _stratified_select to handle non-positive n: if n <=
0 return dataset.select(range(n)) before checking "category" presence or
building cat_to_rows, so the interleaving code never runs for n<=0 and the
function returns an empty selection deterministically.

"""
cmd = ["trtexec", *(args or [])]
try:
return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout) # nosec B603

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Remove the inline Bandit suppression.

This subprocess call is already using an argv list, so the new # nosec B603 is just a policy violation on a security-sensitive path and should not be merged as-is.

Suggested fix
```diff
-        return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)  # nosec B603
+        return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
```

As per coding guidelines, "Bandit security checks must pass without exceptions. # nosec comments are not allowed as a bypass for security checks."

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
```diff
-return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)  # nosec B603
+return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/onnx/quantization/ort_utils.py` at line 66, Remove the inline Bandit
suppression by deleting the trailing "# nosec B603" on the subprocess.run call
so the security scan can validate it; ensure the call remains using the argv
list variable `cmd` (as already used) and keeps the existing parameters
(capture_output=True, text=True, timeout=timeout) or replace with an explicit
safer pattern if desired (e.g., subprocess.run(cmd, capture_output=True,
text=True, timeout=timeout, check=True))—locate the return statement that
invokes subprocess.run(cmd, ...) in ort_utils.py and remove only the "# nosec
B603" comment.

Comment on lines +1264 to +1272
```python
    weight_dq_nodes = [node for node in graph.node if node.op_type == "DequantizeLinear"]
    tensor_producer_map = get_tensor_producer_nodes(graph)

    nodes_to_remove = []
    for node in weight_dq_nodes:
        weight_name = node.input[0]
        scale_name = node.input[1]
        logger.debug(f"Processing INT4 conversion for weight {weight_name}")
        weight = numpy_helper.to_array(initializer_map[weight_name])
```

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Filter this shim to weight-backed DequantizeLinear nodes only.

Right now weight_dq_nodes includes every DequantizeLinear in the graph. The first activation DQ will fail at initializer_map[weight_name] because its input is not an initializer. That makes the compatibility shim unusable on mixed QDQ graphs even though only weight DQs should be rewritten here.

Suggested fix
-    weight_dq_nodes = [node for node in graph.node if node.op_type == "DequantizeLinear"]
+    weight_dq_nodes = [
+        node
+        for node in graph.node
+        if node.op_type == "DequantizeLinear"
+        and node.input
+        and node.input[0] in initializer_map
+    ]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/onnx/quantization/qdq_utils.py` around lines 1264 - 1272, The
current filter builds weight_dq_nodes by selecting all DequantizeLinear nodes,
causing index errors when activation DQ nodes (whose inputs aren't in
initializer_map) are processed; update the selection logic to only include
DequantizeLinear nodes whose input[0] is present in initializer_map (i.e.,
backed by an initializer) before iterating. Locate the list comprehension that
defines weight_dq_nodes and change it to check initializer_map membership of
node.input[0] (or validate via get_tensor_producer_nodes/initializer_map) so
only weight-backed DequantizeLinear nodes are processed.

Comment on lines +1348 to +1359
```python
        if scale_name not in initializer_map:
            # Remove scale producer if it's a Constant node
            scale_name = node.input[1]
            scale_producer = tensor_producer_map[scale_name]
            if scale_producer.op_type == "Constant":
                graph.node.remove(scale_producer)

            # Create a new scale tensor
            scale_name = scale_name.replace("Constant_output_0", "scale")
            scale_tensor = onnx.numpy_helper.from_array(scale, scale_name)
            graph.initializer.append(scale_tensor)
            node.input[1] = scale_name
```

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't append a second initializer with the same replacement name.

In the constant-scale path, Constant_output_0 is renamed to scale and then always appended as a new initializer. The fixture added in this PR already carries an unused scale initializer, so this branch produces duplicate initializer names, which makes the ONNX graph ambiguous/invalid.

Suggested fix
```diff
-            scale_tensor = onnx.numpy_helper.from_array(scale, scale_name)
-            graph.initializer.append(scale_tensor)
-            node.input[1] = scale_name
+            scale_tensor = onnx.numpy_helper.from_array(scale, scale_name)
+            if scale_name in initializer_map:
+                initializer_map[scale_name].CopyFrom(scale_tensor)
+            else:
+                graph.initializer.append(scale_tensor)
+                initializer_map[scale_name] = graph.initializer[-1]
+            node.input[1] = scale_name
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/onnx/quantization/qdq_utils.py` around lines 1348 - 1359, The branch
that replaces "Constant_output_0" with "scale" and unconditionally appends a new
initializer can create duplicate initializer names; update the logic in
qdq_utils.py around scale_name/initializer_map/tensor_producer_map so that
before calling graph.initializer.append(...) you check whether an initializer
with the target name already exists (e.g., consult initializer_map or
graph.initializer names) and if it does, reuse that existing initializer and do
not append a new one; alternatively generate a unique name when needed and set
node.input[1] to that unique or existing initializer name to avoid duplicate
initializers.

Comment on lines +64 to +70
```python
    import subprocess  # nosec

    cmd = ["trtexec", *args]
    logging.info(" ".join(cmd))
    with NamedTemporaryFile("w+b") as log:
        p = subprocess.Popen(cmd, stdout=log, stderr=log, cwd=str(cwd) if cwd else None)  # nosec
        p.wait()
        log.seek(0)
        output = log.read()
        if p.returncode != 0:
            logging.error(output.decode(errors="ignore"))
        return p.returncode, output
    try:
        result = subprocess.run(  # nosec B603 - cmd[0] is hardcoded "trtexec"
            cmd,
```

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Remove # nosec suppressions in subprocess invocation.

Line 64 and Line 69 use # nosec bypasses. This violates repo security policy and must be removed (or escalated with explicit approved exception flow in PR metadata).

Proposed fix
```diff
-    import subprocess  # nosec
+    import subprocess
@@
-        result = subprocess.run(  # nosec B603 - cmd[0] is hardcoded "trtexec"
+        result = subprocess.run(
             cmd,
             stdout=subprocess.PIPE,
             stderr=subprocess.STDOUT,
             cwd=str(cwd) if cwd else None,
             timeout=3600,
         )
```

As per coding guidelines: "Any use of '# nosec' comments to bypass Bandit security checks is not allowed."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/_deploy/_runtime/tensorrt/engine_builder.py` around lines 64 -
70, Remove the `# nosec` suppressions around the subprocess invocation in
engine_builder.py and harden the call instead: keep using subprocess.run with a
list (no shell=True), remove the comments, validate/sanitize the `args` list
before building `cmd = ["trtexec", *args]` (reject or escape unsafe characters
and ensure each entry is a string), verify the executable exists (e.g., use
shutil.which("trtexec") and raise a clear error), and wrap subprocess.run in
try/except to log and rethrow the caught exception from the subprocess.run
invocation so no security bypass comment is needed; update references to
subprocess.run, cmd, and args accordingly.

@codecov

codecov Bot commented May 11, 2026

Codecov Report

❌ Patch coverage is 83.84615% with 42 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.69%. Comparing base (cc06062) to head (5b16509).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| modelopt/onnx/quantization/qdq_utils.py | 86.12% | 29 Missing ⚠️ |
| .../torch/_deploy/_runtime/tensorrt/engine_builder.py | 66.66% | 6 Missing ⚠️ |
| modelopt/torch/quantization/model_calib.py | 37.50% | 5 Missing ⚠️ |
| modelopt/onnx/quantization/ort_utils.py | 85.71% | 1 Missing ⚠️ |
| modelopt/torch/opt/plugins/transformers.py | 85.71% | 1 Missing ⚠️ |
Additional details and impacted files
@@                Coverage Diff                 @@
##           release/0.44.0    #1426      +/-   ##
==================================================
- Coverage           77.19%   76.69%   -0.51%     
==================================================
  Files                 467      466       -1     
  Lines               50581    50778     +197     
==================================================
- Hits                39048    38944     -104     
- Misses              11533    11834     +301     
| Flag | Coverage Δ |
|---|---|
| examples | 41.27% <10.76%> (-0.32%) ⬇️ |
| gpu | 59.19% <10.38%> (-0.83%) ⬇️ |
| regression | 14.63% <1.65%> (-0.26%) ⬇️ |
| unit | 52.79% <75.76%> (+0.13%) ⬆️ |

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@kevalmorabia97 kevalmorabia97 removed the request for review from RalphMao May 11, 2026 13:19
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>