
add example script for depth importance estimation #1016

Open
chochowski wants to merge 8 commits into main from depth_importance_example

Conversation

@chochowski
Contributor

@chochowski chochowski commented Mar 10, 2026

What does this PR do?

Type of change: new feature

The script walks through the GPTModel transformer, patching each block as a no-op one at a time and estimating how much removing that block affects the final output representation.
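
For illustration, a minimal sketch of that outer loop (hypothetical names and a plain MSE distance; the actual script uses Megatron-specific patching utilities, a normalized per-sample distance, and distributed calibration):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def rank_layers_by_importance(model, blocks, calib_batches, reference_hidden):
    """Score each block by how much replacing it with a no-op perturbs the final hidden states."""
    scores = {}
    for block_id, block in enumerate(blocks):
        original_forward = block.forward
        # no-op patch: pass the residual stream through unchanged
        block.forward = lambda hidden_states, *args, **kwargs: hidden_states
        distances = []
        for batch, reference in zip(calib_batches, reference_hidden):
            hidden = model(**batch)  # final output representation with this block disabled
            distances.append(F.mse_loss(hidden, reference))
        block.forward = original_forward  # restore the block before scoring the next one
        scores[block_id] = torch.stack(distances).mean().item()
    return scores  # larger distance -> removing the block hurts more -> more important
```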

Usage

PYTHONPATH=`pwd`:$PYTHONPATH python  -m torch.distributed.run --nproc_per_node=2 examples/pruning/rank_layer_importance.py --hf_model_name_or_path /.../nvidia/NVIDIA-Nemotron-Nano-12B-v2 --trust_remote_code --calib_dataset_name wikitext  --num_layers_in_first_pipeline_stage 31 --num_layers_in_last_pipeline_stage 31 --num_layers 62 --drop_layers 3 4 5 6 7 8 9 10 11

Testing

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

  • New Features
    • Added example script for layer-wise importance estimation in Megatron models, enabling analysis and ranking of critical layers using calibration data.

@chochowski chochowski requested a review from a team as a code owner March 10, 2026 15:17
@chochowski chochowski requested a review from jenchen13 March 10, 2026 15:17
@copy-pr-bot

copy-pr-bot Bot commented Mar 10, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai Bot commented Mar 10, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough


A new script rank_layer_importance.py is added to implement layer-wise importance estimation for Megatron-based models. The implementation includes forward-pass patching utilities to disable layer computations, loss functions for normalized MSE calculation, hooks to collect hidden-state statistics, and an end-to-end calibration pipeline that patches layers iteratively, runs calibration passes, and ranks layers by importance metrics.
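
As a rough illustration of the normalized-MSE distance the walkthrough refers to (a sketch only; the function in the PR may differ in naming and shape handling):

```python
import torch

def normalized_mse_per_sample(pred: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
    """Per-sample MSE between hidden states, normalized by the reference magnitude.

    Expects tensors shaped [batch, seq, dim]; returns one distance per sample.
    """
    err = (pred - ref).pow(2).flatten(start_dim=1).mean(dim=1)
    scale = ref.pow(2).flatten(start_dim=1).mean(dim=1).clamp_min(1e-8)
    return err / scale
```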

Changes

Cohort / File(s): Layer Importance Estimation Module (examples/megatron_bridge/rank_layer_importance.py)
Summary: Introduces the LastHiddenImportanceHook class for collecting per-sample hidden-state statistics and computing normalized MSE distances. Adds no-op forward patches for MLP, attention, Mamba, transformer, and GPT blocks. Includes loss utilities, gate setup, calibration orchestration, and an end-to-end importance estimation pipeline with distributed rank gathering and score persistence.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Model as Megatron Model
    participant Hook as LastHiddenImportanceHook
    participant Calibration as Calibration Loop
    participant Metrics as Metrics Collector

    User->>Model: setup_gates(unwrapped_model)
    Model->>Hook: Attach to final LayerNorm/RMSNorm layers
    
    User->>Calibration: estimate_layer_importance(args)
    Calibration->>Model: load_reference(calibration_data)
    Hook->>Metrics: load_reference_hidden_states()
    
    loop For each layer in model
        Calibration->>Model: patch_model(layer_id)
        Note over Model: Disable layer computation via noop
        
        Calibration->>Calibration: forward_loop(data_loader)
        Calibration->>Model: Forward pass with patched layer
        Model->>Hook: hook_fn collects hidden states
        Hook->>Metrics: Compute normalized MSE vs reference
        
        Calibration->>Model: unpatch_model(layer_id)
        Note over Model: Restore original forward
    end
    
    Calibration->>Metrics: gather_across_dp(rankings)
    Note over Metrics: Aggregate metrics across DP ranks
    
    Calibration->>Metrics: collect_scores(aggregation)
    Metrics->>User: Output ranked layer importance scores

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Docstring Coverage: ⚠️ Warning. Docstring coverage is 60.38%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
Description Check: ✅ Passed. Check skipped; CodeRabbit’s high-level summary is enabled.
Security Anti-Patterns: ✅ Passed. No security anti-patterns found: no torch.load(), numpy.load(), eval(), exec(), # nosec comments, or hardcoded trust_remote_code=True in the new file.
Title Check: ✅ Passed. The title concisely summarizes the main change, adding an example script for depth importance estimation, which matches the primary purpose of the changeset.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/pruning/depth_ranking.py`:
- Around line 303-320: In collect_scores(), avoid hardcoding a length-10 CUDA
tensor and metric assumptions: use the provided use_metric argument when reading
stats, infer device and per-sample vector lengths dynamically, and replace
torch.zeros((10,)).cuda() with a padding strategy (e.g., pad_sequence or
creating zeros using the same device and shape as stat) so torch.stack(res)
won't fail for different DP/batch sizes; ensure aggregation (median/mean)
computes over the correct dim and that drop_group/drop_blocks are applied to the
sorted_indices from the selected metric; apply the same fixes in the related
block around the other occurrence (lines ~510-514) so both places respect
variable sample counts, device, and the caller's metric/drop settings.
- Line 1: Add the standard NVIDIA Apache 2.0 license header to the top of
depth_ranking.py (replace the solitary copyright line), using the full Apache
2.0 block required by the repository policy so the file begins with the complete
license header comment instead of just the copyright notice.
- Around line 16-17: The example imports test-only utilities
(get_mcore_gpt_model, set_seed and the inference helpers) from _test_utils which
couples the example to the test tree; replace those imports with supported
runtime equivalents or relocate the helper implementations into the examples
package. Concretely, remove references to
_test_utils.torch.megatron.models.get_mcore_gpt_model and
_test_utils.torch.misc.set_seed and either import model-building and seed
utilities from the public API (or a stable examples.helpers module), and copy
any inference helper functions used on lines ~93-96 into
examples/pruning/helpers.py and import them from there so the example no longer
depends on tests.
- Around line 78-80: Wrap the transformer_engine imports in a try/except and set
a flag (e.g., HAS_TE) so importing the example won't fail when
transformer_engine is not installed, and import/alias torch.nn.LayerNorm as the
fallback norm; then update setup_gates() to detect and handle both TE norm
classes (RMSNorm/LayerNorm from transformer_engine when HAS_TE) and standard
PyTorch norms (torch.nn.LayerNorm and any other supported norm types) by
creating the same logits_gate_list entries for either case; additionally, add a
defensive check before using model.logits_gate_list[0] to raise a clear error or
construct a default gate when logits_gate_list is empty, and optionally ensure
main() pins transformer_impl to the TE backend when intended (or document that
transformer_engine must be installed). (A sketch of this import-guard pattern appears after this list.)
- Around line 353-384: The layer-to-rank mapping is wrong because
layer_id_in_this_rank() assumes equal split via num_layers_per_pp and offset;
update it to use Megatron's true pipeline offsets (e.g., call the utility used
in the distill plugin like TransformerLayer._get_layer_offset() or compute
start/end using config.num_layers_in_first_pipeline_stage and
config.num_layers_in_last_pipeline_stage together with
get_pipeline_model_parallel_rank()/get_pipeline_model_parallel_world_size()) so
that global layer IDs map to local indices correctly; adjust
layer_id_in_this_rank() to return the local index or -1 based on that computed
start/end range, and ensure patch_model/unpatch_model still use the returned
local index consistently.
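
For the transformer_engine item above, a minimal sketch of the suggested import guard (assuming transformer_engine.pytorch exposes LayerNorm and RMSNorm; adapt to whatever norm types the script actually matches):

```python
import torch

try:
    from transformer_engine.pytorch import LayerNorm as TELayerNorm
    from transformer_engine.pytorch import RMSNorm as TERMSNorm

    HAS_TE = True
except ImportError:
    TELayerNorm = TERMSNorm = None
    HAS_TE = False

# norm classes that setup_gates() could hook, with a pure-PyTorch fallback
FINAL_NORM_TYPES = tuple(t for t in (torch.nn.LayerNorm, TELayerNorm, TERMSNorm) if t is not None)

def is_final_norm(module: torch.nn.Module) -> bool:
    return isinstance(module, FINAL_NORM_TYPES)
```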

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f0060447-cc9c-4f31-b6c7-4a5eebc9255f

📥 Commits

Reviewing files that changed from the base of the PR and between cbab377 and 7a29122.

📒 Files selected for processing (1)
  • examples/pruning/depth_ranking.py

Comment threads on examples/pruning/depth_ranking.py (5, outdated)
@chochowski chochowski changed the title [DRAFT] add example script for depth importance estimation add example script for depth importance estimation Mar 13, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/pruning/rank_layer_importance.py`:
- Around line 245-248: The topology flags (--num_layers,
--num_layers_in_first_pipeline_stage, --num_layers_in_last_pipeline_stage) are
treated as required but are optional and the current layer_id_in_this_rank()
math assumes uniform per-rank spans causing incorrect global IDs; make these
args required via parser.add_argument(..., required=True) or add validation
after parsing to compute per-rank layer spans from the loaded model (e.g.,
derive total num_layers from model.num_layers or model.config and compute
per-rank start/end counts using world_rank/world_size and the provided
first/last stage sizes), then replace the existing ternary/offset logic with
computed per-rank start_index/end_index used by layer_id_in_this_rank() and the
loops over range(args.num_layers) so that patches target the correct global
layer IDs. (A sketch of the per-rank span computation appears after this list.)
- Around line 166-183: The hook_fn currently appends full GPU tensors to
self.reference_hidden in the reference-loading phase (self.reference_load),
which causes OOM; instead either (a) avoid caching full tensors by performing
the comparison immediately per calibration batch (compute
normalized_mse_loss_per_sample between hidden_out and the corresponding
reference and append results to self.hidden_distance/self.logits_distance), or
(b) if caching is required, move/slice the activation off device before storing
(e.g., hidden_out = hidden_out.detach().cpu() or store a reduced
summary/selected slices) and for logits use teacher_logits =
self.lm_head(...).detach().cpu() so memory stays on CPU/disk; update references
to self.reference_hidden, hook_fn, normalized_mse_loss_per_sample,
self.hidden_distance, self.logits_distance, and self.reference_load accordingly.
- Around line 81-82: Remove the unused module-level CUDA allocations kl_loss and
mse_loss (the torch.nn.KLDivLoss(...) and torch.nn.MSELoss(...) calls) which run
at import time and call .cuda(); delete these lines entirely since they are
never referenced and cause unwanted GPU initialization before dist.setup(); if
loss objects are needed later, instantiate them locally after distributed setup
(e.g., inside main or training functions) and avoid calling .cuda() at import
time.

In `@examples/pruning/README.md`:
- Around line 233-240: Update the two command fenced blocks around the usage of
rank_layer_importance.py to include a blank line before and after each fence and
add the language tag "bash" after the opening ```; also correct the typo
"interative" to "iterative" in the explanatory sentence that references
iterative estimation (search for rank_layer_importance.py and the paragraph
mentioning interative/iterative).
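
For the topology item above, a hedged sketch of how per-rank layer spans could be derived from the first/last pipeline stage sizes (illustrative arithmetic only, not Megatron's actual offset utility):

```python
def pp_layer_span(pp_rank, pp_world_size, num_layers, first_stage_layers=None, last_stage_layers=None):
    """Return the (start, end) global layer indices owned by this pipeline rank."""
    if pp_world_size == 1:
        return 0, num_layers
    first = first_stage_layers if first_stage_layers is not None else num_layers // pp_world_size
    last = last_stage_layers if last_stage_layers is not None else num_layers // pp_world_size
    middle_ranks = pp_world_size - 2
    per_middle = (num_layers - first - last) // middle_ranks if middle_ranks else 0
    if pp_rank == 0:
        return 0, first
    if pp_rank == pp_world_size - 1:
        return num_layers - last, num_layers
    start = first + (pp_rank - 1) * per_middle
    return start, start + per_middle

def layer_id_in_this_rank(global_layer_id, pp_rank, pp_world_size, num_layers, first=None, last=None):
    """Map a global layer id to a local index on this rank, or -1 if it lives elsewhere."""
    start, end = pp_layer_span(pp_rank, pp_world_size, num_layers, first, last)
    return global_layer_id - start if start <= global_layer_id < end else -1
```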

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a5d03e5d-3a39-4448-a24c-be180264fe5e

📥 Commits

Reviewing files that changed from the base of the PR and between 7a29122 and c92db19.

📒 Files selected for processing (2)
  • examples/pruning/README.md
  • examples/pruning/rank_layer_importance.py

Comment on lines +81 to +82
kl_loss = torch.nn.KLDivLoss(reduction='batchmean', log_target=True).cuda()
mse_loss = torch.nn.MSELoss(reduce=True, reduction='mean').cuda()
Contributor


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Module-scope CUDA allocations:"
rg -n '^\s*[A-Za-z_][A-Za-z0-9_]*\s*=\s*torch\.nn\..*\.cuda\(\)' examples/pruning/rank_layer_importance.py

echo
echo "Distributed/device setup points in the same file:"
rg -n -C2 'dist\.setup\(|torch\.cuda\.set_device|LOCAL_RANK|\.cuda\(' examples/pruning/rank_layer_importance.py

Repository: NVIDIA/Model-Optimizer

Length of output: 914


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Usage of kl_loss and mse_loss variables ==="
rg -n 'kl_loss|mse_loss' examples/pruning/rank_layer_importance.py

echo
echo "=== File structure around lines 81-82 to confirm module scope ==="
sed -n '75,95p' examples/pruning/rank_layer_importance.py | cat -n

echo
echo "=== Verify __main__ block and entry point ==="
sed -n '410,419p' examples/pruning/rank_layer_importance.py | cat -n

Repository: NVIDIA/Model-Optimizer

Length of output: 2114


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Total lines in file ==="
wc -l examples/pruning/rank_layer_importance.py

echo
echo "=== Exact match for 'kl_loss' references (case-sensitive, whole word) ==="
rg -w 'kl_loss' examples/pruning/rank_layer_importance.py

echo
echo "=== Exact match for 'mse_loss' references (case-sensitive, whole word) ==="
rg -w 'mse_loss' examples/pruning/rank_layer_importance.py

Repository: NVIDIA/Model-Optimizer

Length of output: 581


Remove unused module-scope CUDA allocations.

Lines 81–82 allocate kl_loss and mse_loss at module import time, before dist.setup() runs. This causes every worker to touch the default GPU during import, making the script fragile under torch.distributed.run. Additionally, these variables are never referenced anywhere in the file—the code uses F.mse_loss() instead—so they should be removed entirely rather than moved.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/pruning/rank_layer_importance.py` around lines 81 - 82, Remove the
unused module-level CUDA allocations kl_loss and mse_loss (the
torch.nn.KLDivLoss(...) and torch.nn.MSELoss(...) calls) which run at import
time and call .cuda(); delete these lines entirely since they are never
referenced and cause unwanted GPU initialization before dist.setup(); if loss
objects are needed later, instantiate them locally after distributed setup
(e.g., inside main or training functions) and avoid calling .cuda() at import
time.

Comment on lines +166 to +183
def hook_fn(self, module, input, output):
    # seq x batch x dim
    hidden_out = output.detach().permute(1, 0, 2)  # batch x seq x dim

    # if loading the reference form teacher
    if self.reference_load:
        self.reference_hidden.append(hidden_out)
        return

    # if computing the distance to the reference
    sample_id = len(self.hidden_distance)
    # MSE
    self.hidden_distance.append(normalized_mse_loss_per_sample(hidden_out, self.reference_hidden[sample_id]).mean())
    # if computing the distance to the teacher's logits
    if self.lm_head:
        teacher_logits = self.lm_head(self.reference_hidden[sample_id].permute(1, 0, 2))[0].detach()
        logits = self.lm_head(hidden_out.permute(1, 0, 2))[0].detach()
        self.logits_distance.append(normalized_mse_loss_per_sample(logits, teacher_logits).mean())
Contributor


⚠️ Potential issue | 🔴 Critical

Caching every reference hidden state on-device will OOM quickly.

self.reference_hidden.append(hidden_out) keeps the full final hidden tensor for every calibration batch. With the defaults in this file (calib_num_samples=1024, seq_length=4096) and the 12B example documented in the README, that is on the order of tens of GB on the last PP rank before the ranking loop even starts. Please stream the comparison batch-by-batch, or at least slice/offload the reference activations before caching them.
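
A minimal sketch of the offloading option, caching reference activations on CPU inside the hook (names mirror the snippet above; this is not the PR's final implementation and omits the lm_head branch):

```python
def hook_fn(self, module, input, output):
    hidden_out = output.detach().permute(1, 0, 2)  # batch x seq x dim

    if self.reference_load:
        # keep the reference off-device so 1000+ cached batches don't exhaust GPU memory
        self.reference_hidden.append(hidden_out.to("cpu", non_blocking=True))
        return

    sample_id = len(self.hidden_distance)
    reference = self.reference_hidden[sample_id].to(hidden_out.device, non_blocking=True)
    self.hidden_distance.append(normalized_mse_loss_per_sample(hidden_out, reference).mean())
```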

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/pruning/rank_layer_importance.py` around lines 166 - 183, The
hook_fn currently appends full GPU tensors to self.reference_hidden in the
reference-loading phase (self.reference_load), which causes OOM; instead either
(a) avoid caching full tensors by performing the comparison immediately per
calibration batch (compute normalized_mse_loss_per_sample between hidden_out and
the corresponding reference and append results to
self.hidden_distance/self.logits_distance), or (b) if caching is required,
move/slice the activation off device before storing (e.g., hidden_out =
hidden_out.detach().cpu() or store a reduced summary/selected slices) and for
logits use teacher_logits = self.lm_head(...).detach().cpu() so memory stays on
CPU/disk; update references to self.reference_hidden, hook_fn,
normalized_mse_loss_per_sample, self.hidden_distance, self.logits_distance, and
self.reference_load accordingly.

Comment thread examples/megatron_bridge/rank_layer_importance.py
Comment thread examples/pruning/README.md Outdated
Comment on lines +233 to +240
- To estimate importance of each layer one can run `rank_layer_importance.py` script. This script computes importance of each layer by comparing the MSE between the final hidden representation with and without that layer.

```
python -m torch.distributed.run --nproc_per_node=2 /path/to/modelopt/examples/pruning/rank_layer_importance.py --hf_model_name_or_path /path/to/hf-checkpoint/nvidia/NVIDIA-Nemotron-Nano-12B-v2 --trust_remote_code --calib_dataset_name wikitext --num_layers_in_first_pipeline_stage 31 --num_layers_in_last_pipeline_stage 31 --num_layers 62
```
- One can also pass indices of layers that should be dropped always. This allows running an interative estimation e.g. in first iteration score all layers, pick 5 least important, and in the next iteration pass these 5 layers to be dropped, so that it ranks the rest of the layers assuming these 5 are dropped.
```
python -m torch.distributed.run --nproc_per_node=2 /path/to/modelopt/examples/pruning/rank_layer_importance.py --hf_model_name_or_path /path/to/hf-checkpoint/nvidia/NVIDIA-Nemotron-Nano-12B-v2 --trust_remote_code --calib_dataset_name wikitext --num_layers_in_first_pipeline_stage 31 --num_layers_in_last_pipeline_stage 31 --num_layers 62 --drop_layers 6 7 9 32 41
Contributor


⚠️ Potential issue | 🟡 Minor

Fix the new command fences and typo.

Both command blocks need a bash language and blank lines around the fences, and Line 238 still says interative.

🧰 Tools
🪛 markdownlint-cli2 (0.21.0)

[warning] 235-235: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


[warning] 237-237: Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


[warning] 239-239: Fenced code blocks should be surrounded by blank lines

(MD031, blanks-around-fences)


[warning] 239-239: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/pruning/README.md` around lines 233 - 240, Update the two command
fenced blocks around the usage of rank_layer_importance.py to include a blank
line before and after each fence and add the language tag "bash" after the
opening ```; also correct the typo "interative" to "iterative" in the
explanatory sentence that references iterative estimation (search for
rank_layer_importance.py and the paragraph mentioning interative/iterative).

@@ -0,0 +1,419 @@
# Adapted from https://github.com/EleutherAI/lm-evaluation-harness/tree/aa457edc3d64d81530159cd3a182932320c78f8c

# MIT License
Contributor


please use the NVIDIA Apache license and change the year to 2026

from megatron.core.packed_seq_params import PackedSeqParams
from megatron.core.rerun_state_machine import RerunMode, get_rerun_state_machine

from megatron.bridge.models.mamba.mamba_provider import MambaModelProvider
Contributor


does this example belong in the Megatron-Bridge repo? @kevalmorabia97
we probably want to keep MBridge examples in the MBridge repo, to reduce circular dependencies between the two repos

Collaborator


We can move to ModelOpt/examples/megatron_bridge where we have pruning and distillation for M-Bridge already

@kevalmorabia97
Collaborator

@chochowski commits seem to have been messed up. PTAL

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 8

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
modelopt/torch/sparsity/attention_sparsity/calibration/calibrator.py (1)

123-149: ⚠️ Potential issue | 🟡 Minor

Good refactoring to single-pass calibration.

The consolidation from multiple forward passes to a single pass with all thresholds is a solid performance improvement. However, there's a potential silent data loss if sparsity_list length doesn't match self.threshold_trials:

Line 140: zip(self.threshold_trials, sparsity_list) will silently truncate to the shorter sequence. While the assertion at lines 320-322 ensures all modules return consistent lengths, it doesn't verify this matches len(self.threshold_trials).

🛡️ Proposed fix to add length verification
         for sample_stat in per_sample_stats:
             length = sample_stat["sample_length"]
             sparsity_list = sample_stat["sparsity"]
+            if len(sparsity_list) != len(self.threshold_trials):
+                raise ValueError(
+                    f"Sparsity list length ({len(sparsity_list)}) doesn't match "
+                    f"threshold_trials length ({len(self.threshold_trials)})"
+                )
             for threshold, sparsity in zip(self.threshold_trials, sparsity_list):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/sparsity/attention_sparsity/calibration/calibrator.py` around
lines 123 - 149, The loop building all_data_points can silently drop thresholds
because zip(self.threshold_trials, sparsity_list) truncates to the shorter
sequence; add an explicit length check right after per_sample_stats is obtained
(after calling self._extract_calibration_stats) to assert or raise if any
sample_stat["sparsity"] length != len(self.threshold_trials), including clear
context in the error message (e.g., referencing the sample's "sample_length" or
module id) so the mismatch is detected immediately and not silently ignored when
populating all_data_points.
modelopt/torch/sparsity/attention_sparsity/model_sparsify.py (1)

82-90: ⚠️ Potential issue | 🟡 Minor

Update the calibration example to the new target_sparse_ratio shape.

This snippet still shows a scalar target_sparse_ratio, but the current calibration config expects a per-phase dict. Copying the example is inconsistent with the new schema.

Suggested fix
                         "*attention*": {
                             "method": "flash_skip_softmax",
                             "backend": "pytorch",
                             "enable": True,
                             "calibration": {  # Enables automatic threshold calibration
-                                "target_sparse_ratio": 0.5,
+                                "target_sparse_ratio": {"prefill": 0.5, "decode": 0.5},
                                 "samples": 48,
                                 "max_seqlen": 8192,
                             },
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/sparsity/attention_sparsity/model_sparsify.py` around lines 82
- 90, The example calibration block for
config["sparse_cfg"]["*attention*"]["calibration"] uses a scalar
target_sparse_ratio but the schema now requires a per-phase dict; update the
example to replace "target_sparse_ratio": 0.5 with a mapping of phases to ratios
(e.g. "target_sparse_ratio": {"train": 0.5, "eval": 0.5}) so that the config key
"calibration" matches the new per-phase shape expected by the calibration logic.
modelopt/torch/sparsity/attention_sparsity/config.py (1)

125-159: ⚠️ Potential issue | 🟡 Minor

Reject empty thresholds maps.

{} currently passes validation even though the field contract says at least one of prefill/decode must be present. That turns a typoed config into a silent runtime fallback instead of a validation error.

Suggested fix
         valid_phases = {"prefill", "decode"}
+        if not v:
+            raise ValueError(
+                "thresholds must include at least one phase: 'prefill' and/or 'decode'"
+            )
         invalid_keys = set(v.keys()) - valid_phases
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/sparsity/attention_sparsity/config.py` around lines 125 - 159,
The validate_thresholds field validator allows an empty dict for thresholds, so
add a check in validate_thresholds to reject an empty mapping and raise a
ValueError when no phase keys are present; specifically, in the
validate_thresholds(cls, v) function (the validator for the thresholds field)
verify that v is non-empty and contains at least one of the valid phases
('prefill' or 'decode') and raise a clear ValueError if v is {} or otherwise
missing both keys before proceeding with the existing per-phase list and length
checks.
examples/llm_ptq/hf_ptq.py (1)

361-373: ⚠️ Potential issue | 🟠 Major

--low_memory_mode still initializes from --qformat, not --recipe.

This branch always feeds QUANT_CFG_CHOICES[args.qformat] into init_quantized_weights(). Later, calibration-only mode only calls mtq.calibrate() and never reapplies recipe.ptq_cfg, so a low-memory recipe run can end up with the wrong quantizer layout and algorithm.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/hf_ptq.py` around lines 361 - 373, The code is always
seeding quant_cfg from QUANT_CFG_CHOICES[args.qformat], which breaks low-memory
runs that should use the recipe PTQ layout; change the initialization so
quant_cfg is derived from the recipe (recipe.ptq_cfg) when a recipe is provided
(or when --low_memory_mode is active) instead of args.qformat, then apply the
same KV adjustments (mtq.utils.update_quant_cfg_with_kv_cache_quant and
_set_kv_cache_constant_amax) to that quant_cfg before calling
init_quantized_weights; ensure references to QUANT_CFG_CHOICES, args.qformat,
args.kv_cache_qformat, _KV_CAST_FORMATS, init_quantized_weights, recipe.ptq_cfg
and mtq.calibrate are updated so calibration-only flows use the recipe-derived
quant_cfg.
♻️ Duplicate comments (3)
examples/pruning/rank_layer_importance.py (3)

219-221: ⚠️ Potential issue | 🟠 Major

Make topology inputs required (or derive them) before layer iteration.

args.num_layers is optional but used in range(args.num_layers), and the offset/rank mapping assumes uniform stage spans. This can fail when args are omitted and can patch incorrect global layers in uneven PP layouts.

Also applies to: 303-307, 334-342, 381-381

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/pruning/rank_layer_importance.py` around lines 219 - 221, The code
calls range(args.num_layers) and computes an offset/rank mapping assuming
uniform pipeline-stage spans while args.num_layers and stage-size args
(args.num_layers_in_first_pipeline_stage,
args.num_layers_in_last_pipeline_stage) are optional; make these topology inputs
required or derive them before any layer iteration: ensure args.num_layers is
populated by either making parser.add_argument(..., required=True) or computing
it from the model/config (or by summing stage sizes if provided), and
compute/validate args.num_layers_in_first_pipeline_stage and
args.num_layers_in_last_pipeline_stage (or derive them from pipeline layout or
by dividing layers across stages with explicit checks) before any use in
range(args.num_layers) or the offset/rank mapping logic so uneven PP layouts are
handled correctly and raise a clear error if topology cannot be inferred.

152-153: ⚠️ Potential issue | 🔴 Critical

Avoid retaining full reference activations on GPU for all samples.

Line 153 stores full hidden states for every calibration step in self.reference_hidden, which can OOM with long sequences and large sample counts. Stream comparisons or offload cached references to CPU.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/pruning/rank_layer_importance.py` around lines 152 - 153, The code
currently appends full GPU-resident hidden states (hidden_out) to
self.reference_hidden in the calibration path (self.reference_load), which can
cause OOM; instead, convert and detach the tensor before storing off-GPU (e.g.,
hidden_out.detach().cpu() or .to('cpu')) or avoid storing all samples by
performing streaming comparisons immediately and only keeping aggregated
statistics; update the logic around self.reference_load/self.reference_hidden in
the calibration routine (where hidden_out is produced) to either offload tensors
to CPU/memory-mapped storage or replace append with an online comparison step so
GPU memory usage stays bounded.

44-45: ⚠️ Potential issue | 🟠 Major

Remove import-time CUDA allocations.

Line 44 and Line 45 allocate CUDA modules at import time and these variables are unused. This can touch the wrong device before distributed setup.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/pruning/rank_layer_importance.py` around lines 44 - 45, The two
module instances kl_loss and mse_loss are being created and moved to CUDA at
import time (symbols kl_loss and mse_loss), which touches the wrong device and
they are unused; remove these import-time allocations by deleting these lines or
by instantiating the losses where they are actually used (e.g., inside the
training/evaluation function) and without calling .cuda() at module import; also
fix the MSELoss creation to use the correct signature (remove the
deprecated/redundant reduce=True and use reduction="mean") when constructing it
at usage time.
🟠 Major comments (18)
examples/diffusers/quantization/utils.py-84-86 (1)

84-86: ⚠️ Potential issue | 🟠 Major

proj_out should match anywhere in the name, not just at the start.

The current pattern proj_out.* only matches when the name starts with proj_out because re.match() anchors to the string beginning. This is inconsistent with the other terms, which use .*(...).* to match anywhere. Nested module names like block.proj_out.weight won't match. Use the same "anywhere" style as the other terms for consistency.

Suggested regex fix
-    pattern = re.compile(
-        r"(proj_out.*|.*(time_text_embed|context_embedder|x_embedder|norm_out|time_guidance_embed|stream_modulation).*)"
-    )
+    pattern = re.compile(
+        r".*(proj_out|time_text_embed|context_embedder|x_embedder|norm_out|time_guidance_embed|stream_modulation).*"
+    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/diffusers/quantization/utils.py` around lines 84 - 86, The regex
assigned to the variable pattern currently uses "proj_out.*", which only matches
at the start when used with re.match; change the pattern so "proj_out" is
matched anywhere in the name like the other terms (for example by using
".*proj_out.*" or including proj_out inside the inner .*(...).* alternation),
updating the compiled pattern expression so nested names like
"block.proj_out.weight" are matched by pattern.
modelopt/torch/sparsity/attention_sparsity/methods/flash_skip_softmax.py-59-71 (1)

59-71: ⚠️ Potential issue | 🟠 Major

Avoid phase-order dependent threshold reuse.

_update_thresholds() falls back to self.thresholds, so a run that only configures one phase can reuse the last phase’s list for the next phase. After a decode call, the next prefill can silently inherit decode thresholds.

Suggested fix
-        self.thresholds = self.thresholds_config.get("prefill", [1e-3])
+        self.thresholds = list(
+            self.thresholds_config.get("prefill")
+            or self.thresholds_config.get("decode")
+            or [1e-3]
+        )
@@
     def _update_thresholds(self, phase: str):
         """Update thresholds list based on phase."""
-        self.thresholds = self.thresholds_config.get(phase, self.thresholds)
+        self.thresholds = list(
+            self.thresholds_config.get(phase)
+            or self.thresholds_config.get("prefill")
+            or self.thresholds_config.get("decode")
+            or [1e-3]
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/sparsity/attention_sparsity/methods/flash_skip_softmax.py`
around lines 59 - 71, _update_thresholds currently falls back to the existing
self.thresholds which allows thresholds from a previous phase to leak into a new
phase; change _update_thresholds to read from self.thresholds_config and fall
back to a deterministic default (e.g., self.thresholds_config.get("prefill",
[1e-3])) instead of self.thresholds, and ensure you assign a shallow copy to
self.thresholds to avoid shared-list mutation; update the method named
_update_thresholds (and any callers relying on thresholds) so phase transitions
always load explicit config for the requested phase rather than reusing the
last-set list.
modelopt/torch/sparsity/attention_sparsity/stats_manager.py-69-74 (1)

69-74: ⚠️ Potential issue | 🟠 Major

Guard sparse_blocks aggregation against threshold-count changes.

This merge path assumes every forward reports the same list length. That breaks once the same manager sees calibration’s single-threshold stats and static multi-threshold stats: shorter lists leave stale totals, and a longer list will raise on Line 74.

Suggested fix
         incoming = stats["sparse_blocks"]
-        if "sparse_blocks" not in self.aggregated_stats:
-            self.aggregated_stats["sparse_blocks"] = list(incoming)
-        else:
-            for i, val in enumerate(incoming):
-                self.aggregated_stats["sparse_blocks"][i] += val
+        aggregated = self.aggregated_stats.get("sparse_blocks")
+        if aggregated is None:
+            self.aggregated_stats["sparse_blocks"] = list(incoming)
+        elif len(aggregated) != len(incoming):
+            raise ValueError(
+                f"Inconsistent sparse_blocks length: expected {len(aggregated)}, got {len(incoming)}"
+            )
+        else:
+            for i, val in enumerate(incoming):
+                aggregated[i] += val
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/sparsity/attention_sparsity/stats_manager.py` around lines 69
- 74, The aggregation assumes aggregated_stats["sparse_blocks"] and the incoming
list have the same length; update the logic in the method that handles incoming
= stats["sparse_blocks"] so it first compares lengths and handles mismatches: if
aggregated_stats has no "sparse_blocks" key or lengths differ, reinitialize
aggregated_stats["sparse_blocks"] to a copy of incoming (or to a zero-filled
list matching the longer length) to avoid stale totals or IndexError when
enumerating; otherwise, safely sum element-wise by iterating to
min(len(incoming), len(existing)) and copy any remaining tail from the longer
list into aggregated_stats["sparse_blocks"] so both shorter and longer incoming
lists are handled without errors.
experimental/conv/test_implicit_gemm.py-25-49 (1)

25-49: ⚠️ Potential issue | 🟠 Major

Skip on unsupported SMs, not just when CUDA exists.

The runtime in experimental/conv/implicit_gemm_cuda.py enforces SM80+ via _MIN_SM_MAJOR = 8 and raises RuntimeError on older GPUs. The test skip marker currently only checks torch.cuda.is_available(), so on SM70 hardware with CUDA installed, tests will error during execution instead of skipping. The Triton FP4 marker has the same problem: it only checks attribute availability, not SM8.9+, despite its own reason string documenting that requirement.

Suggested fix
+def _sm_at_least(major: int, minor: int = 0) -> bool:
+    if not torch.cuda.is_available():
+        return False
+    return torch.cuda.get_device_capability() >= (major, minor)
+
+
-pytestmark = pytest.mark.skipif(not torch.cuda.is_available(), reason="CUDA not available")
+pytestmark = pytest.mark.skipif(not _sm_at_least(8, 0), reason="Requires CUDA SM80+")

 requires_triton_fp4 = pytest.mark.skipif(
-    not _triton_fp4_available(),
+    not (_sm_at_least(8, 9) and _triton_fp4_available()),
     reason="Triton fp4_fake_quant_block not available (requires compute >= 8.9)",
 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@experimental/conv/test_implicit_gemm.py` around lines 25 - 49, The skip logic
must check GPU SM version, not just CUDA presence: update the module-level
pytestmark to skip when CUDA is unavailable OR the current device compute
capability is below major 8 (the runtime constant _MIN_SM_MAJOR = 8 enforced by
experimental.conv.implicit_gemm_cuda), and update
_triton_fp4_available()/requires_triton_fp4 to additionally verify device
compute capability >= 8.9 (major >= 8 and minor >= 9) before claiming Triton fp4
support; use torch.cuda.is_available() and
torch.cuda.get_device_properties(0).major/minor (with safe guards if no device)
to implement these checks and keep the existing import checks
(conv3d_implicit_gemm_cuda and triton kernel attribute) intact.
experimental/conv/implicit_gemm_cuda.py-217-228 (1)

217-228: ⚠️ Potential issue | 🟠 Major

Replace assert with proper runtime validation for preconditions.

The divisibility assertion at line 224 disappears under python -O, leaving the function vulnerable to runtime errors. The downstream CUDA kernel computes num_blocks = N / block_size with integer division, which will silently truncate work if N is not divisible by block_size, violating the documented precondition. Additionally, the function lacks validation for device placement, scalar global_amax, and positive block_size.

Replace the assertion with explicit checks for:

  • block_size > 0
  • x and global_amax are CUDA tensors
  • x and global_amax are on the same device
  • global_amax is a scalar (numel == 1)
  • x_f32.numel() % block_size == 0
Suggested fix
    cuda_mod = _get_cuda_module()

+   if block_size <= 0:
+       raise ValueError(f"block_size must be > 0, got {block_size}")
+   if not x.is_cuda or not global_amax.is_cuda:
+       raise ValueError("x and global_amax must be CUDA tensors")
+   if x.device != global_amax.device:
+       raise ValueError(
+           f"x and global_amax must be on the same device, got {x.device} and {global_amax.device}"
+       )
+   if global_amax.numel() != 1:
+       raise ValueError("global_amax must be a scalar tensor")
    orig_shape = x.shape
    orig_dtype = x.dtype
    x_f32 = x.float().contiguous()
    amax_f32 = global_amax.float().contiguous()

-   assert x_f32.numel() % block_size == 0, (
-       f"numel ({x_f32.numel()}) must be divisible by block_size ({block_size})"
-   )
+   if x_f32.numel() % block_size != 0:
+       raise ValueError(
+           f"numel ({x_f32.numel()}) must be divisible by block_size ({block_size})"
+       )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@experimental/conv/implicit_gemm_cuda.py` around lines 217 - 228, Replace the
fragile assert with explicit runtime validation before calling
cuda_mod.fp4_fake_quant_cuda: check that block_size is an int > 0 and raise
ValueError if not; ensure x.is_cuda and global_amax.is_cuda and raise
TypeError/ValueError if either is not CUDA; ensure x.device ==
global_amax.device and raise ValueError if they differ; ensure
global_amax.numel() == 1 and raise ValueError if not; compute N =
x.float().contiguous().numel() (use x_f32.numel()) and verify N % block_size ==
0, raising ValueError with a clear message if it fails; keep the rest (x_f32,
amax_f32, cuda_mod.fp4_fake_quant_cuda) unchanged.
examples/llm_sparsity/attention_sparsity/hf_sa.py-150-155 (1)

150-155: ⚠️ Potential issue | 🟠 Major

Move inputs to the model's entry device, not the default GPU.

With device_map="auto", the model may place different shards on different devices (including CPU for offloading). Hardcoding inputs.cuda() at Line 118 assumes the first shard (input embedding) is on the default GPU, which may not hold. This causes errors or unnecessary device transfers for offloaded or multi-GPU models.

Instead, move inputs to the model's entry device:

dev = model.get_input_embeddings().weight.device
inputs = {k: v.to(dev) for k, v in inputs.items()}

This ensures inputs match wherever Accelerate's auto-dispatch places the first model layer.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_sparsity/attention_sparsity/hf_sa.py` around lines 150 - 155,
The code currently assumes inputs are on the default GPU by calling
inputs.cuda(); instead, after loading the model with
AutoModelForCausalLM.from_pretrained(..., device_map="auto"), determine the
model's entry device via model.get_input_embeddings().weight.device and move the
inputs there (e.g., set inputs = {k: v.to(dev) for k,v in inputs.items()}) so
input tensors land on the same device/shard as the model's input embedding and
avoid incorrect hard-coded GPU transfers.
tests/gpu/torch/quantization/test_sequential_calibrate.py-325-331 (1)

325-331: ⚠️ Potential issue | 🟠 Major

This assertion doesn't prove run-mode replay reached layer 1.

actual = model.layers[0](x) only shows that weight_doubling_calib() changed layer 0. The test still passes if the next calibration step never replays those updated activations into layer 1. Record the inputs observed by model.layers[1] during its calibration step and assert those against activations_before_weight_update * 2.0.
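
A hedged sketch of recording what layer 1 receives during its calibration step with a forward pre-hook and asserting against the doubled activations (hook name and the calibration call are illustrative placeholders):

```python
recorded_inputs = []

def record_layer1_input(module, args):
    # capture the activation layer 1 actually receives during its calibration step
    recorded_inputs.append(args[0].detach().clone())

handle = model.layers[1].register_forward_pre_hook(record_layer1_input)
run_layer1_calibration_step()  # placeholder for the test's sequential calibration call
handle.remove()

assert torch.allclose(torch.cat(recorded_inputs), activations_before_weight_update * 2.0)
```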

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/gpu/torch/quantization/test_sequential_calibrate.py` around lines 325 -
331, The test currently only checks that weight_doubling_calib updated
model.layers[0] by running actual = model.layers[0](x) but doesn't verify that
the replay used those updated outputs as inputs to model.layers[1]; modify the
test to capture the inputs observed by model.layers[1] during its calibration
step (e.g., attach a short forward hook or wrapper on model.layers[1] to record
the incoming tensor when the calibration for layer 1 runs) and then assert that
the recorded inputs equal activations_before_weight_update * 2.0, keeping the
existing check that layer 0's weights were updated (model.layers[0](x)) and
using the recorded_inputs variable in the new assertion.
modelopt/torch/_deploy/utils/torch_onnx.py-595-624 (1)

595-624: ⚠️ Potential issue | 🟠 Major

Create model_metadata from the final ONNX graph.

model_metadata is built a few lines above this block, before quantize_weights(), FP16 conversion, cast retargeting, and remove_redundant_casts(). That leaves onnx_node_names out of sync with the model bytes returned from this function. Please move create_model_metadata() below the last graph mutation.

📦 Suggested move
-    model_metadata = create_model_metadata(
-        tree_spec_input, tree_spec_output, input_none_names, onnx_opt_graph, model
-    )
-
     onnx_opt_graph = quantize_weights(model, onnx_opt_graph)
@@
     # TODO: Remove manual ir_version change once ORT supports ir_version 11
     # Must be set after all gs.export_onnx() calls as graphsurgeon resets ir_version
     onnx_opt_graph.ir_version = 10
+
+    model_metadata = create_model_metadata(
+        tree_spec_input, tree_spec_output, input_none_names, onnx_opt_graph, model
+    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/_deploy/utils/torch_onnx.py` around lines 595 - 624, The
model_metadata is being created before the final ONNX graph mutations, so
onnx_node_names become out-of-sync with the returned model bytes; move the
create_model_metadata() call (and any usage that builds model_metadata /
onnx_node_names) to after all graph transformations such as quantize_weights(),
qdq_to_dq(), convert_float_to_float16()/convert_to_f16(),
change_casts_to_fp16(), remove_redundant_casts(),
replace_zero_scale_with_smallest_nonzero(), and the final ir_version assignment
on onnx_opt_graph so model_metadata is generated from the final onnx_opt_graph.
modelopt/onnx/utils.py-1374-1388 (1)

1374-1388: ⚠️ Potential issue | 🟠 Major

Handle graph-input → graph-output casts before bypassing them.

If input_tensor is a graph input, get_producer_nodes() is empty and this branch never reconnects output_tensor. The subsequent node removal leaves a disconnected graph output; keep the cast in that case or replace it with an Identity.
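
One hedged way to handle that case with onnx_graphsurgeon, bridging the graph input to the graph output through an Identity so the output stays connected (sketch only; keeping the Cast is equally valid and also preserves the dtype conversion):

```python
import onnx_graphsurgeon as gs

def bypass_input_to_output_cast(graph: gs.Graph, cast_node: gs.Node) -> None:
    input_tensor, output_tensor = cast_node.inputs[0], cast_node.outputs[0]
    if input_tensor in graph.inputs and output_tensor in graph.outputs:
        # no producer node to rewire: bridge graph input to graph output explicitly
        graph.nodes.append(gs.Node(op="Identity", inputs=[input_tensor], outputs=[output_tensor]))
    # detach the cast so cleanup() can drop it without orphaning the graph output
    cast_node.inputs.clear()
    cast_node.outputs.clear()
    graph.cleanup()
```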

examples/llm_ptq/example_utils.py-617-623 (1)

617-623: ⚠️ Potential issue | 🟠 Major

Reuse the caller’s device_map and model kwargs in the pack-quant path.

This branch hardcodes device_map="auto" and rebuilds kwargs, so device="cpu", use_seq_device_map, max_memory, and attn_implementation are silently ignored for pack-quantized checkpoints.

Suggested fix
             with patch_compressed_linear_loading():
                 model = AutoModelForCausalLM.from_pretrained(
                     ckpt_path,
-                    device_map="auto",
-                    trust_remote_code=trust_remote_code,
-                    torch_dtype="auto",
+                    device_map=device_map,
+                    **model_kwargs,
                 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/example_utils.py` around lines 617 - 623, The pack-quant
branch hardcodes device_map="auto" and rebuilds kwargs, causing caller-supplied
device_map and model kwargs (like device, use_seq_device_map, max_memory,
attn_implementation) to be ignored; update the block that calls
AutoModelForCausalLM.from_pretrained inside patch_compressed_linear_loading() to
merge and reuse the original caller kwargs (pass through the incoming device_map
and any model kwargs) instead of overwriting them with device_map="auto" or
rebuilding defaults, ensuring ckpt_path, trust_remote_code and other provided
kwargs are forwarded to AutoModelForCausalLM.from_pretrained so pack-quant
loading respects the caller’s settings.
modelopt/onnx/export/fp8_exporter.py-125-136 (1)

125-136: ⚠️ Potential issue | 🟠 Major

Derive the synthesized FP8 zero-point shape from the scale tensor.

The hardcoded [1] zero-point shape violates ONNX QuantizeLinear specification. Per ONNX spec, zero_point must exactly match the shape of y_scale: for per-tensor quantization both must be scalars, and for per-axis quantization both must be identical 1-D tensors. This hardcoded approach only works for scalar or length-1 scales and will produce invalid ONNX graphs for per-axis/per-channel FP8 quantization (e.g., when scale is a 1-D tensor of length N for N channels). Inspect node.inputs[1] (the scale tensor) and create the zero-point with matching shape.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/onnx/export/fp8_exporter.py` around lines 125 - 136, The zero-point
creation for FP8 uses a hardcoded 1-element shape which breaks per-axis
quantization; instead read the scale tensor at node.inputs[1] (the y_scale for
QuantizeLinear), copy its shape/dims into zp_tensor.dims so the zero_point
matches per-tensor vs per-axis shape, and create raw_data that encodes the
appropriate number of FP8 zeros for that shape/type. Update the block that
constructs zp_tensor/LazyValues/zero_point (referencing zp_tensor, LazyValues,
zero_point, and node.inputs[1]) to derive dims from the scale tensor and ensure
the raw_data byte length and element encoding match FLOAT8E4M3FN and the element
count implied by those dims.
modelopt/torch/quantization/model_calib.py-1882-1883 (1)

1882-1883: ⚠️ Potential issue | 🟠 Major

Guard CUDA cache clearing for CPU-only execution paths.

Line 1883 calls torch.cuda.empty_cache() unconditionally. This breaks CPU-only runs and tests. Other calls in this file are guarded with if torch.cuda.is_available(): (lines 408, 646, 682, 687); apply the same pattern here.

Proposed fix
             del layer_inputs
-            torch.cuda.empty_cache()
+            if torch.cuda.is_available():
+                torch.cuda.empty_cache()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/model_calib.py` around lines 1882 - 1883, The
unconditional call to torch.cuda.empty_cache() after deleting layer_inputs can
fail in CPU-only environments; update the block in model_calib.py (near where
layer_inputs is deleted) to call torch.cuda.empty_cache() only if
torch.cuda.is_available(), mirroring the existing guards used elsewhere in this
file (see analogous checks around lines with torch.cuda.is_available()); ensure
you wrap only the torch.cuda.empty_cache() call (leave del layer_inputs
unchanged) so CPU-only tests no longer hit CUDA APIs.
modelopt/torch/quantization/config.py-1049-1056 (1)

1049-1056: ⚠️ Potential issue | 🟠 Major

use_constant_amax does not suppress calibration yet.

This flag only changes _get_amax(). need_calibration() still treats these quantizers as ordinary static quantizers, so _cast KV modes still trigger a calibration loop in plain PTQ flows (for example, int8_wo + fp8_cast), which contradicts the documented “skip calibration” behavior.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/config.py` around lines 1049 - 1056, The
use_constant_amax flag only alters _get_amax() but does not prevent calibration;
update the need_calibration() logic to return False when self.use_constant_amax
is True for KV/cast modes so PTQ flows (e.g., int8_wo + fp8_cast) skip
calibration as documented: modify need_calibration() to check the
use_constant_amax attribute and the quantizer's _cast/kv mode (or equivalent
mode flag) and short-circuit calibration for those instances, ensuring existing
non-KV quantizers keep their original behavior.
modelopt/torch/quantization/config.py-1000-1008 (1)

1000-1008: ⚠️ Potential issue | 🟠 Major

rotate={"enable": True} is still rejected.

The new doc says this field accepts a RotateConfig-shaped dict, but validate_config() still recursively rejects any nested dict equal to {"enable": True} before Pydantic can coerce it. So rotate=True works, while the minimal equivalent dict form currently raises.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/config.py` around lines 1000 - 1008,
validate_config() is currently rejecting nested dicts like {"enable": True} for
the rotate field before Pydantic coercion; update validate_config (the function
named validate_config) to permit Mapping/dict values when the annotated/expected
type for that field is RotateConfig (or a Union that includes RotateConfig) so
Pydantic can coerce the dict into a RotateConfig instance instead of treating it
as an invalid nested config. Concretely, change the early-rejection logic to
detect the field's expected_type (for the rotate symbol or any field whose
expected_type is RotateConfig/BaseModel or a Union containing it) and skip the
recursive "reject nested dict" check for Mapping values, deferring
validation/coercion to Pydantic; keep existing rejection behavior for plain
dicts when the expected_type is not a model-like type.
examples/llm_ptq/hf_ptq.py-923-957 (1)

923-957: ⚠️ Potential issue | 🟠 Major

--recipe skips the KV-cache merge.

update_quant_cfg_with_kv_cache_quant() only runs in the non-recipe branch, so recipe-driven runs never add the *[kv]_bmm_quantizer entries. The later _set_kv_cache_constant_amax() call only tweaks existing KV settings, so --recipe ... --kv_cache_qformat fp8_cast silently produces no KV-cache quantization unless the recipe already hardcodes it.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/hf_ptq.py` around lines 923 - 957, When a recipe is supplied
(load_recipe -> ModelOptPTQRecipe -> quant_cfg = recipe.ptq_cfg) the code
currently skips applying KV-cache quant updates, so --recipe ...
--kv_cache_qformat ... does not add the kv_bmm_quantizer entries; to fix, after
setting quant_cfg from the recipe compute enable_quant_kv_cache the same way as
the non-recipe branch (enable_quant_kv_cache = args.kv_cache_qformat != "none"),
print the Enable/Disable message, and if enable_quant_kv_cache call
mtq.update_quant_cfg_with_kv_cache_quant(...) using getattr(mtq,
KV_QUANT_CFG_CHOICES[args.kv_cache_qformat])["quant_cfg"] to merge KV settings
into the recipe-derived quant_cfg (same update used in the non-recipe branch) so
subsequent _set_kv_cache_constant_amax() can modify those entries.
modelopt/torch/quantization/utils/activation_collector.py-259-269 (1)

259-269: ⚠️ Potential issue | 🟠 Major

Rollback the previous layer’s replay state when forward_loop() fails.

_set_layer_states() moves layer i-1 from collected_inputs into cached_inputs and clears the original list before forward_loop() runs. If forward_loop() then raises, the except block only resets the current layer, so retrying the same layer can no longer rebuild prev.cached_inputs once that queue was consumed.

🔁 Suggested rollback shape
         info = layer._seq_calib
+        prev_info = self._decoder_layers[layer_idx - 1]._seq_calib if layer_idx > 0 else None
+        prev_state = None
+        if prev_info is not None:
+            prev_state = (
+                prev_info.mode,
+                deque(prev_info.cached_inputs),
+                list(prev_info.collected_inputs),
+            )
         try:
             forward_loop(self.model)
         except Exception:
             # Reset the current layer so subsequent calls don't see stale state.
             info.mode = "original"
             info.collected_inputs = []
+            if prev_info is not None and prev_state is not None:
+                prev_info.mode, prev_info.cached_inputs, prev_info.collected_inputs = prev_state
             raise

Also applies to: 314-320

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/utils/activation_collector.py` around lines 259 -
269, The code moves the previous layer's collected_inputs into
prev.cached_inputs and clears collected_inputs before forward_loop() but doesn't
restore them if forward_loop() fails; update _set_layer_states()/the surrounding
try/except so that on any exception you rollback the previous layer's state by
restoring prev.mode, resetting prev.cached_inputs back to empty (or None) and
repopulating prev.collected_inputs from the saved list (or restoring the
original list) so retries can rebuild caches; apply the same rollback logic for
the other occurrence that manipulates self._decoder_layers[..]._seq_calib (the
block around the second occurrence noted at 314-320) and ensure you reference
prev.collected_inputs, prev.cached_inputs and prev.mode when implementing the
rollback.
modelopt/torch/quantization/plugins/huggingface.py-1024-1055 (1)

1024-1055: ⚠️ Potential issue | 🟠 Major

unpack_weight() leaves module half-unpacked for non-int32 weight_packed tensors.

The early return (line 1030) bypasses the entire cleanup logic when weight_packed is a non-int32 tensor (e.g., vision modules in BF16). Although forward() correctly treats such tensors as the effective weight, unpack_weight() must still register self.weight, delete compression attributes, and transition to FROZEN status to complete the unpacking operation.

Materialize non-int32 tensors as the decompressed weight and continue through cleanup, rather than returning early:

Suggested fix
         if self.quantization_status == QuantizationStatus.COMPRESSED:
             compressed_data, quant_args = self._build_compressed_data()

             # Skip non-pack-quantized weights (e.g., vision modules stored as BF16)
-            if isinstance(compressed_data["weight_packed"], torch.Tensor):
-                if compressed_data["weight_packed"].dtype != torch.int32:
-                    return
-
-            decompressed = self.compressor.decompress_weight(
-                compressed_data=compressed_data,
-                quantization_args=quant_args,
-            )
+            packed = compressed_data["weight_packed"]
+            if isinstance(packed, torch.Tensor) and packed.dtype != torch.int32:
+                decompressed = packed
+            else:
+                decompressed = self.compressor.decompress_weight(
+                    compressed_data=compressed_data,
+                    quantization_args=quant_args,
+                )
             # Clear any placeholder before registering the real parameter
             self._parameters.pop("weight", None)
             self._buffers.pop("weight", None)
             if "weight" in self.__dict__:
                 del self.__dict__["weight"]
@@
         if hasattr(self, "weight_scale"):
             del self.weight_scale
+        if hasattr(self, "weight_zero_point"):
+            del self.weight_zero_point
         if hasattr(self, "weight_shape"):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 1024 - 1055,
The unpack_weight() path currently returns early when
compressed_data["weight_packed"] exists but is not torch.int32, skipping cleanup
and registration; instead, replace the early return with logic that materializes
the non-int32 packed tensor as the decompressed weight (e.g., assign
decompressed = compressed_data["weight_packed"] and cast to a floating dtype if
needed or clone/detach it), then proceed to clear placeholders
(_parameters/_buffers and self.__dict__ entries), register the new
nn.Parameter(decompressed, requires_grad=False) as
self._parameters["weight"]/self.__dict__["weight"], delete compression
attributes (weight_packed, weight_scale, weight_shape) and finally set
self.quantization_status = QuantizationStatus.FROZEN so the module is fully
unpacked.
modelopt/torch/quantization/plugins/huggingface.py-914-953 (1)

914-953: ⚠️ Potential issue | 🟠 Major

Make the CompressedLinear monkey-patch re-entrant.

The boolean flag _modelopt_init_patched is insufficient for tracking nested contexts. When two callers overlap, the first to exit restores CompressedLinear.__getattr__ globally, leaving the second caller's context with a broken patch mid-execution. Replace the boolean with a depth counter protected by a lock, and only restore the original method when the last active context exits.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 914 - 953,
The patch must become re-entrant: replace the boolean
CompressedLinear._modelopt_init_patched with an integer counter (e.g.
_modelopt_patch_count) and a threading.Lock (e.g. _modelopt_patch_lock), acquire
the lock at context entry to increment the counter and store original_getattr
only when counter goes from 0→1, and on context exit acquire the lock to
decrement the counter and restore/delete CompressedLinear.__getattr__ only when
the counter reaches 0; keep the same patched_getattr behavior and ensure all
accesses/updates to _modelopt_patch_count and original_getattr are protected by
the lock.
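
A hedged sketch of the depth-counter pattern; CompressedLinear and _patched_getattr are the existing symbols in the plugin module, while the attribute names _modelopt_patch_count and _modelopt_original_getattr are suggestions rather than the current ones:

import contextlib
import threading

_patch_lock = threading.Lock()

@contextlib.contextmanager
def patch_compressed_linear_loading():
    with _patch_lock:
        count = getattr(CompressedLinear, "_modelopt_patch_count", 0)
        if count == 0:
            # First entrant remembers the original method and installs the patch.
            CompressedLinear._modelopt_original_getattr = CompressedLinear.__getattr__
            CompressedLinear.__getattr__ = _patched_getattr
        CompressedLinear._modelopt_patch_count = count + 1
    try:
        yield
    finally:
        with _patch_lock:
            CompressedLinear._modelopt_patch_count -= 1
            if CompressedLinear._modelopt_patch_count == 0:
                # Last entrant restores the original method.
                CompressedLinear.__getattr__ = CompressedLinear._modelopt_original_getattr
                del CompressedLinear._modelopt_original_getattr

With this shape, overlapping contexts only restore __getattr__ once the outermost one exits.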
🟡 Minor comments (9)
modelopt/torch/export/unified_export_hf.py-785-788 (1)

785-788: ⚠️ Potential issue | 🟡 Minor

Warning text is misleading about weight_scale_2.

At Line 785, this branch checks for gate/up quantizer amax mismatches; weight_scale_2 is only derived later, during quantized weight export. The current wording can be confusing when debugging.

✏️ Suggested wording fix
-            f"Found {synced} MoE expert gate/up projection pair(s) with mismatched "
-            f"weight_scale_2 after requantize_resmooth_fused_llm_layers. "
+            f"Found {synced} MoE expert gate/up projection pair(s) with mismatched "
+            f"weight quantizer amax values after requantize_resmooth_fused_llm_layers. "
             f"This typically means the dummy forward did not activate these experts. "
             f"Taking element-wise max of amaxes for serving-engine fusion."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/export/unified_export_hf.py` around lines 785 - 788, The
warning message incorrectly mentions weight_scale_2 but this branch is detecting
mismatched gate/up quantizer amax values after calling
requantize_resmooth_fused_llm_layers; update the f-string that currently
references "weight_scale_2" to clearly state "gate/up quantizer amax" (or
similar wording) and explain that this typically means the dummy forward did not
activate these experts and that element-wise max of amaxes will be taken for
serving-engine fusion; modify the string where variable synced is used so the
message accurately reflects the checked quantity and avoids referencing
weight_scale_2 which is computed later.
modelopt/torch/sparsity/attention_sparsity/methods/flash_skip_softmax.py-259-262 (1)

259-262: ⚠️ Potential issue | 🟡 Minor

Don’t floor sparse-block totals before summarizing.

Because total_blocks includes the batch_size * num_heads factor, s * total_blocks is fractional in the common case. Truncating it here biases average_sparsity downward in SparseAttentionStatsManager.get_summary().

Suggested fix
-        sparse_blocks_out = [int(s * total_blocks) for s in sparsity_list]
+        sparse_blocks_out = [s * total_blocks for s in sparsity_list]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/sparsity/attention_sparsity/methods/flash_skip_softmax.py`
around lines 259 - 262, The code truncates per-example sparse block counts by
doing sparse_blocks_out = [int(s * total_blocks) for s in sparsity_list], which
biases summaries like SparseAttentionStatsManager.get_summary() and
average_sparsity; change sparse_blocks_out to preserve fractional counts (e.g.,
sparse_blocks_out = [s * total_blocks for s in sparsity_list]) and defer any
integer rounding to the final reporting/aggregation logic (or explicitly use
proper rounding there) so summaries compute correct average sparsity from the
true fractional block counts; update any consumers that expect ints accordingly.
tests/_test_utils/torch/diffusers_models.py-30-33 (1)

30-33: ⚠️ Potential issue | 🟡 Minor

Use narrower exception handling for optional model imports.

except Exception can mask unexpected runtime errors in the diffusers import path. Restrict this to the missing-symbol cases you actually want to tolerate, consistent with the pattern established elsewhere in the codebase (e.g., unified_export_hf.py, diffusers_utils.py).

Suggested narrowing
 try:
     from diffusers.models.transformers import Flux2Transformer2DModel
-except Exception:  # pragma: no cover - optional diffusers models
+except (ImportError, AttributeError):  # pragma: no cover - optional diffusers models
     Flux2Transformer2DModel = None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/_test_utils/torch/diffusers_models.py` around lines 30 - 33, The broad
except Exception around the optional import of Flux2Transformer2DModel should be
narrowed to only the expected "missing symbol" errors; change the except clause
that currently swallows all exceptions for the statement "from
diffusers.models.transformers import Flux2Transformer2DModel" to catch only
ImportError (and optionally AttributeError) so genuine runtime errors aren't
masked, while preserving the fallback assignment Flux2Transformer2DModel = None
and the pragma comment.
modelopt_recipes/general/ptq/nvfp4_omlp_only-fp8_kv.yml-18-18 (1)

18-18: ⚠️ Potential issue | 🟡 Minor

Fix the description to match the actual quantizer scope.

Line 18 says this recipe covers “all linear layers,” but the config only re-enables *mlp*, *block_sparse_moe*, and *o_proj* quantizers after default.enable: false. That will mislead users choosing between the default, MLP-only, and OMLP-only presets.

Suggested wording
-  description: NVFP4 static weight and dynamic activation for all linear layers including output projections, FP8 KV cache, max calibration.
+  description: NVFP4 static weight and dynamic activation for MLP, MoE, and output projection linear layers only, FP8 KV cache, max calibration.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt_recipes/general/ptq/nvfp4_omlp_only-fp8_kv.yml` at line 18, Update
the description string to accurately reflect the quantizer scope: instead of
saying “all linear layers,” state that this recipe enables static weight and
dynamic activation quantization only for MLP layers, block-sparse MoE layers,
and output projections (re-enabled via default.enable: false then re-enabling
"*mlp*", "*block_sparse_moe*", and "*o_proj*"), and mention FP8 KV cache and max
calibration as before so the description matches the actual config.
modelopt_recipes/general/ptq/nvfp4_default-fp8_kv.yml-18-35 (1)

18-35: ⚠️ Potential issue | 🟡 Minor

Description doesn't match the quantizers this recipe enables.

*input_quantizer is enabled globally at Lines 29-35, so this is not a weight-only W4A16 recipe for matching layers. Either update the description or disable activation quantization here.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt_recipes/general/ptq/nvfp4_default-fp8_kv.yml` around lines 18 - 35,
The recipe's description claims "weight only (W4A16)" but the ptq_cfg enables
'*input_quantizer' (activation quantization) alongside '*weight_quantizer', so
update the recipe to match intent: either remove or set enable: false for
'*input_quantizer' in ptq_cfg to keep it weight-only, or change the top-level
description to reflect that both weight and activation quantizers are enabled;
locate the entries named '*input_quantizer' and '*weight_quantizer' in the
ptq_cfg and make the corresponding enable flag or description change
accordingly.
modelopt_recipes/general/ptq/nvfp4_mlp_only-fp8_kv.yml-18-49 (1)

18-49: ⚠️ Potential issue | 🟡 Minor

Description says "all linear layers", but the matchers are MLP/MoE-only.

The only enabled quantizers here are *mlp*... and *block_sparse_moe*... at Lines 22-49. Please align the description with the actual scope, or widen the matcher list.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt_recipes/general/ptq/nvfp4_mlp_only-fp8_kv.yml` around lines 18 - 49,
The description claims "all linear layers" but the config only enables
quantizers for the '*mlp*weight_quantizer', '*mlp*input_quantizer',
'*block_sparse_moe*weight_quantizer', and '*block_sparse_moe*input_quantizer';
update either the description to reflect MLP/MoE-only scope or expand the
matcher list to include other linear layer patterns (e.g., add entries like
'*linear*weight_quantizer' and '*linear*input_quantizer' or '*dense*...' as
appropriate) so the stated scope matches the actual enabled quantizers.
tests/gpu/torch/quantization/test_sequential_calibrate.py-25-331 (1)

25-331: ⚠️ Potential issue | 🟡 Minor

Move these CPU-only checks into tests/unit.

Nothing in this module allocates CUDA tensors or touches GPU-only kernels, so keeping it under tests/gpu pushes the new coverage into the wrong CI lane. As per coding guidelines, "tests/**/*.py: Write tests using pytest for all new features and examples; organize tests into tests/unit (fast CPU-based), tests/gpu (fast GPU-based), tests/gpu_megatron (Megatron-Core), tests/gpu_trtllm (TensorRT-LLM), and tests/examples (integration tests)".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/gpu/torch/quantization/test_sequential_calibrate.py` around lines 25 -
331, This test module contains only CPU-based tests (exercises
sequential_calibrate, LayerActivationCollector and functions like
test_seq_calib_func_called_per_layer,
test_mode_transitions_across_calibration_steps, etc.) but is placed under the
GPU-only test suite; move the test module out of the GPU CI lane into the
CPU/unit test suite so it runs as a fast CPU test. Concretely: relocate the file
to the project’s unit-tests area (so it is discovered under unit tests rather
than GPU), ensure no CUDA-specific fixtures or markers remain, and update any
CI/test-suite configuration that referenced this file so the tests exercising
sequential_calibrate and LayerActivationCollector run in the CPU/unit pipeline.
Ensure imports and monkeypatch usage remain unchanged after the move.
modelopt/recipe/config.py-36-39 (1)

36-39: ⚠️ Potential issue | 🟡 Minor

Fix the ModelOptRecipeBase docstring content.

Line 38-Line 39 describe output-layer replacement behavior that is unrelated to recipe configuration and is misleading.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/recipe/config.py` around lines 36 - 39, The docstring for
ModelOptRecipeBase incorrectly describes output-layer replacement behavior;
update the ModelOptRecipeBase class docstring to accurately summarize purpose
and usage of the recipe base (e.g., what configuration fields it provides,
intended subclassing/validation behavior) and remove or relocate the misleading
sentence about matching "*output_layer*" and replacing attributes with
{"enable": False}; ensure the docstring references ModelOptRecipeBase and
describes its role in model optimization recipes rather than unrelated
layer-replacement behavior.
modelopt/recipe/loader.py-30-49 (1)

30-49: ⚠️ Potential issue | 🟡 Minor

Absolute Path inputs skip suffix probing.

load_recipe(Path("/tmp/foo")) returns the path unchanged, so /tmp/foo.yml and /tmp/foo.yaml are never discovered. The docstring says omitted suffixes are probed for absolute and relative filesystem paths, so str and Path inputs currently behave differently.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/recipe/loader.py` around lines 30 - 49, The function
_resolve_recipe_path treats absolute Path inputs as already-resolved and skips
suffix probing, so Path("/tmp/foo") won't find /tmp/foo.yml; change the early
conditional to treat both str and Path uniformly (convert recipe_path to rp_str
for both absolute and relative Paths) and run the existing BUILTIN_RECIPES_LIB
and filesystem suffix probing logic (using candidate and fs_candidate) for those
types, while preserving the Traversable passthrough; if no probe matches, return
Path(rp_str) as before.
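
A hedged sketch of uniform suffix probing for both str and Path inputs (candidate/fs_candidate echo the names in the prompt above; the real _resolve_recipe_path also consults BUILTIN_RECIPES_LIB, which this sketch leaves out):

from pathlib import Path

def _probe_filesystem_suffixes(recipe_path) -> Path:
    rp_str = str(recipe_path)  # treat str and Path inputs the same way
    for candidate in (rp_str, f"{rp_str}.yml", f"{rp_str}.yaml"):
        fs_candidate = Path(candidate)
        if fs_candidate.is_file():
            return fs_candidate
    # Fall back to the original behavior when no probe matches.
    return Path(rp_str)

With this in place, load_recipe(Path("/tmp/foo")) and load_recipe("/tmp/foo") both discover /tmp/foo.yml or /tmp/foo.yaml.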

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c329587c-7f30-4de1-8dce-1b051d78451d

📥 Commits

Reviewing files that changed from the base of the PR and between c92db19 and 5ee6bfb5453b5a9b2dd7c8b235b899661b3841f4.

📒 Files selected for processing (80)
  • .github/workflows/unit_tests.yml
  • CHANGELOG.rst
  • docs/source/getting_started/_installation_for_Linux.rst
  • examples/diffusers/quantization/models_utils.py
  • examples/diffusers/quantization/utils.py
  • examples/llm_ptq/example_utils.py
  • examples/llm_ptq/hf_ptq.py
  • examples/llm_ptq/multinode_ptq.py
  • examples/llm_sparsity/attention_sparsity/hf_sa.py
  • examples/pruning/rank_layer_importance.py
  • experimental/conv/README.md
  • experimental/conv/bench_implicit_gemm.py
  • experimental/conv/implicit_gemm_binding.cpp
  • experimental/conv/implicit_gemm_cuda.py
  • experimental/conv/implicit_gemm_kernel.cu
  • experimental/conv/test_implicit_gemm.py
  • modelopt/onnx/autocast/graphsanitizer.py
  • modelopt/onnx/autocast/precisionconverter.py
  • modelopt/onnx/autocast/utils.py
  • modelopt/onnx/export/fp8_exporter.py
  • modelopt/onnx/export/nvfp4_exporter.py
  • modelopt/onnx/utils.py
  • modelopt/recipe/__init__.py
  • modelopt/recipe/_config_loader.py
  • modelopt/recipe/config.py
  • modelopt/recipe/loader.py
  • modelopt/torch/_deploy/utils/torch_onnx.py
  • modelopt/torch/export/diffusers_utils.py
  • modelopt/torch/export/layer_utils.py
  • modelopt/torch/export/quant_utils.py
  • modelopt/torch/export/unified_export_hf.py
  • modelopt/torch/opt/config.py
  • modelopt/torch/quantization/config.py
  • modelopt/torch/quantization/model_calib.py
  • modelopt/torch/quantization/nn/functional.py
  • modelopt/torch/quantization/nn/modules/tensor_quantizer.py
  • modelopt/torch/quantization/plugins/diffusion/diffusers.py
  • modelopt/torch/quantization/plugins/huggingface.py
  • modelopt/torch/quantization/utils/__init__.py
  • modelopt/torch/quantization/utils/activation_collector.py
  • modelopt/torch/quantization/utils/core_utils.py
  • modelopt/torch/sparsity/attention_sparsity/calibration/calibrator.py
  • modelopt/torch/sparsity/attention_sparsity/config.py
  • modelopt/torch/sparsity/attention_sparsity/methods/flash_skip_softmax.py
  • modelopt/torch/sparsity/attention_sparsity/model_sparsify.py
  • modelopt/torch/sparsity/attention_sparsity/stats_manager.py
  • modelopt/torch/utils/network.py
  • modelopt_recipes/__init__.py
  • modelopt_recipes/configs/README
  • modelopt_recipes/general/README
  • modelopt_recipes/general/ptq/fp8_default-fp8_kv.yml
  • modelopt_recipes/general/ptq/nvfp4_default-fp8_kv.yml
  • modelopt_recipes/general/ptq/nvfp4_mlp_only-fp8_kv.yml
  • modelopt_recipes/general/ptq/nvfp4_omlp_only-fp8_kv.yml
  • pyproject.toml
  • tests/_test_utils/torch/diffusers_models.py
  • tests/_test_utils/torch/quantization/quantize_common.py
  • tests/_test_utils/torch/quantization/tensor_quantizer_common.py
  • tests/_test_utils/torch/sparsity/sparse_attention_common.py
  • tests/examples/llm_ptq/test_llm_ptq.py
  • tests/gpu/torch/export/test_export.py
  • tests/gpu/torch/quantization/test_hadamard.py
  • tests/gpu/torch/quantization/test_sequential_calibrate.py
  • tests/gpu/torch/sparsity/attention_sparsity/test_calibration_gpu.py
  • tests/gpu/torch/sparsity/attention_sparsity/test_integration_gpu.py
  • tests/unit/onnx/autocast/test_precisionconverter.py
  • tests/unit/recipe/__init__.py
  • tests/unit/recipe/test_loader.py
  • tests/unit/torch/export/test_export_diffusers.py
  • tests/unit/torch/quantization/plugins/test_huggingface.py
  • tests/unit/torch/quantization/test_calib.py
  • tests/unit/torch/quantization/test_sequential_calibrate.py
  • tests/unit/torch/quantization/test_utils.py
  • tests/unit/torch/sparsity/attention_sparsity/test_flash_skip_softmax.py
  • tests/unit/torch/sparsity/attention_sparsity/test_sparse_attention_calibration.py
  • tests/unit/torch/sparsity/attention_sparsity/test_sparse_attention_config.py
  • tests/unit/torch/sparsity/attention_sparsity/test_sparse_attention_conversion.py
  • tests/unit/torch/sparsity/attention_sparsity/test_stats_manager.py
  • tests/unit/torch/sparsity/attention_sparsity/test_threshold_info.py
  • tox.ini
💤 Files with no reviewable changes (1)
  • modelopt/torch/utils/network.py
✅ Files skipped from review due to trivial changes (4)
  • modelopt_recipes/general/README
  • experimental/conv/README.md
  • tests/unit/recipe/__init__.py
  • modelopt_recipes/configs/README

Comment on lines +500 to +514
# CASE B: Expert (stay compressed, fix metadata)
elif f"{name}.weight_shape" in checkpoint_weights:
ws = checkpoint_weights[f"{name}.weight_shape"]
if f"{name}.weight_packed" in checkpoint_weights:
module.weight_packed = checkpoint_weights[f"{name}.weight_packed"].to(
torch.int32
)
module._parameters.pop("weight", None)
module._buffers.pop("weight", None)
module.__dict__.pop("weight", None)
shape_param = torch.nn.Parameter(ws.to(torch.int32), requires_grad=False)
module._parameters.pop("weight_shape", None)
module.__dict__.pop("weight_shape", None)
module._parameters["weight_shape"] = shape_param
module.__dict__["weight_shape"] = shape_param

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

find . -name "example_utils.py" -path "*/llm_ptq/*" | head -20

Repository: NVIDIA/Model-Optimizer

Length of output: 101


🏁 Script executed:

cat -n examples/llm_ptq/example_utils.py | sed -n '490,530p'

Repository: NVIDIA/Model-Optimizer

Length of output: 2305


🏁 Script executed:

cat -n examples/llm_ptq/example_utils.py | sed -n '480,550p'

Repository: NVIDIA/Model-Optimizer

Length of output: 3534


Move restored expert tensors onto target_device.

The BF16 branch at line 490 uses .to(target_device), but the expert branch at lines 504–506 and 510 only applies dtype conversion without specifying the device. This causes weight_packed and weight_shape to remain on CPU, creating device mismatches when expert modules execute on GPU.

Suggested fix
             elif f"{name}.weight_shape" in checkpoint_weights:
                 ws = checkpoint_weights[f"{name}.weight_shape"]
                 if f"{name}.weight_packed" in checkpoint_weights:
                     module.weight_packed = checkpoint_weights[f"{name}.weight_packed"].to(
-                        torch.int32
+                        device=target_device, dtype=torch.int32
                     )
                 module._parameters.pop("weight", None)
                 module._buffers.pop("weight", None)
                 module.__dict__.pop("weight", None)
-                shape_param = torch.nn.Parameter(ws.to(torch.int32), requires_grad=False)
+                shape_param = torch.nn.Parameter(
+                    ws.to(device=target_device, dtype=torch.int32), requires_grad=False
+                )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/example_utils.py` around lines 500 - 514, The expert branch
fails to move restored tensors to the target device: when setting
module.weight_packed and weight_shape you currently only convert dtype
(torch.int32) but not device, causing CPU/GPU mismatches; update the assignments
that create module.weight_packed and shape_param so they are sent to
target_device as well (e.g., use .to(device=target_device, dtype=torch.int32) or
.to(target_device) after dtype conversion) and then store those device-aware
tensors into module._parameters and module.__dict__ (affecting the symbols
module.weight_packed and weight_shape).

Comment on lines +123 to +165
cuda_mod = _get_cuda_module()

if x.ndim != 5 or w.ndim != 5:
raise ValueError(f"Expected 5D tensors, got x.ndim={x.ndim}, w.ndim={w.ndim}")
n_batch, cin, d, h, w_in = x.shape
cout, cin_w, kd, kh, kw = w.shape
if cin_w != cin:
raise ValueError(
f"Grouped convolution is not supported (x has {cin} input channels, "
f"w has {cin_w}). This kernel requires groups=1."
)

sd, sh, sw = _triple(stride)
dd, dh, dw = _triple(dilation)
pad_wl, pad_wr, pad_hl, pad_hr, pad_dl, pad_dr = _pad6(padding)

x_pad = F.pad(x, (pad_wl, pad_wr, pad_hl, pad_hr, pad_dl, pad_dr))
dp = d + pad_dl + pad_dr
hp = h + pad_hl + pad_hr
wp = w_in + pad_wl + pad_wr

od = (dp - (dd * (kd - 1) + 1)) // sd + 1
oh = (hp - (dh * (kh - 1) + 1)) // sh + 1
ow = (wp - (dw * (kw - 1) + 1)) // sw + 1

m = n_batch * od * oh * ow
k = cin * kd * kh * kw

w_flat = w.reshape(cout, k).transpose(0, 1).contiguous()

x_pad = x_pad.float().contiguous()
w_flat = w_flat.float().contiguous()

has_bias = bias is not None
bias_t = bias.float().contiguous() if has_bias else torch.empty(0, device=x.device) # type: ignore[union-attr]

if quant_act and act_amax is None:
raise ValueError("act_amax is required when quant_act=True")

do_quant = quant_act
amax_t = act_amax.float().contiguous() if do_quant else torch.empty(0, device=x.device) # type: ignore[union-attr]

y_flat = cuda_mod.conv3d_implicit_gemm_cuda(

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Current conv3d wrapper validation:"
sed -n '123,165p' experimental/conv/implicit_gemm_cuda.py

echo
echo "Existing device/scalar guards in this file:"
rg -n '\.is_cuda|device\s*!=|numel\(\)\s*!=\s*1' experimental/conv/implicit_gemm_cuda.py || true

Repository: NVIDIA/Model-Optimizer

Length of output: 1674


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Get the full function definition including its signature
echo "=== Full conv3d_implicit_gemm_cuda function ==="
sed -n '100,200p' experimental/conv/implicit_gemm_cuda.py

echo
echo "=== Check for any device/is_cuda validation elsewhere in the file ==="
rg -n 'is_cuda|\.device|assert' experimental/conv/implicit_gemm_cuda.py | head -30

echo
echo "=== Check fp4_fake_quant function mentioned in scratchpad ==="
rg -n -A 20 'def fp4_fake_quant' experimental/conv/implicit_gemm_cuda.py

Repository: NVIDIA/Model-Optimizer

Length of output: 4311


Add device and scalar validation before CUDA kernel dispatch.

The function casts x, w, bias, and act_amax to float with .float().contiguous(), but this does not move CPU tensors to CUDA. When these potentially mixed-device or CPU tensors are passed directly to cuda_mod.conv3d_implicit_gemm_cuda(...), the raw CUDA kernel receives invalid pointers, causing undefined behavior or crashes. Additionally, act_amax is never validated to be a scalar (single-element tensor).

Add explicit checks:

  • x and w must be CUDA tensors
  • All of x, w, bias (if provided), and act_amax (if provided) must be on the same device
  • act_amax must be a scalar tensor
Suggested fix
     cuda_mod = _get_cuda_module()
 
     if x.ndim != 5 or w.ndim != 5:
         raise ValueError(f"Expected 5D tensors, got x.ndim={x.ndim}, w.ndim={w.ndim}")
+    if not x.is_cuda or not w.is_cuda:
+        raise ValueError("x and w must be CUDA tensors")
+    if x.device != w.device:
+        raise ValueError(f"x and w must be on the same device, got {x.device} and {w.device}")
     n_batch, cin, d, h, w_in = x.shape
     cout, cin_w, kd, kh, kw = w.shape
     if cin_w != cin:
         raise ValueError(
             f"Grouped convolution is not supported (x has {cin} input channels, "
             f"w has {cin_w}). This kernel requires groups=1."
         )
+    if bias is not None and bias.device != x.device:
+        raise ValueError(f"bias must be on {x.device}, got {bias.device}")
+    if act_amax is not None and act_amax.device != x.device:
+        raise ValueError(f"act_amax must be on {x.device}, got {act_amax.device}")
+    if act_amax is not None and act_amax.numel() != 1:
+        raise ValueError("act_amax must be a scalar tensor")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@experimental/conv/implicit_gemm_cuda.py` around lines 123 - 165, The CUDA
kernel can receive CPU pointers or mismatched-device tensors; before calling
cuda_mod.conv3d_implicit_gemm_cuda ensure tensors are CUDA and on the same
device: verify x.is_cuda and w.is_cuda, check that bias (if has_bias) and
act_amax (if do_quant) are on x.device, and move/raise if not; also validate
act_amax is a scalar tensor (numel()==1) when do_quant is True; apply these
checks near the preparation section before creating x_pad/w_flat and before
invoking conv3d_implicit_gemm_cuda so you avoid sending CPU pointers to the
kernel.

Comment on lines +517 to +534
torch::Tensor fp4_fake_quant_cuda(torch::Tensor x, torch::Tensor global_amax, int block_size) {
// x: [num_blocks, block_size] or flat [N] where N % block_size == 0
auto x_flat = x.contiguous().view({-1});
int N = x_flat.numel();
int num_blocks = N / block_size;

auto y = torch::empty_like(x_flat);

// Launch: one warp (32 threads) per block
int threads_per_block = 256; // 8 warps per CUDA block
int warps_per_block = threads_per_block / 32;
int num_cuda_blocks = (num_blocks + warps_per_block - 1) / warps_per_block;

fp4_fake_quant_kernel<<<num_cuda_blocks, threads_per_block>>>(
x_flat.data_ptr<float>(), y.data_ptr<float>(), global_amax.data_ptr<float>(), num_blocks,
block_size);

return y.view_as(x);

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Host-side launch code:"
sed -n '517,627p' experimental/conv/implicit_gemm_kernel.cu

echo
echo "Existing device/stream guards or launch checks in this file:"
rg -n 'CUDAGuard|getCurrentCUDAStream|cudaGetLastError|TORCH_CHECK\(.*cuda' experimental/conv/implicit_gemm_kernel.cu || true

Repository: NVIDIA/Model-Optimizer

Length of output: 7558


Guard kernel launches with device and stream semantics.

Both fp4_fake_quant_cuda (line 517) and conv3d_implicit_gemm_cuda (line 541) use bare <<<>>> launches that ignore the tensor's device and PyTorch's current stream. When tensors on alternate devices (e.g., cuda:1) are passed, the kernel launches on the current device with the default stream instead, causing device pointer mismatches and reordering bugs with surrounding operations.

Wrap each launch sequence with c10::cuda::CUDAGuard to ensure the correct device, use at::cuda::getCurrentCUDAStream() for kernel launch, and check cudaFuncSetAttribute and launch errors with cudaGetLastError().

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@experimental/conv/implicit_gemm_kernel.cu` around lines 517 - 534, Wrap the
raw kernel launches in fp4_fake_quant_cuda and conv3d_implicit_gemm_cuda with
proper device and stream scoping: create a c10::cuda::CUDAGuard using the input
tensor device (e.g. x.device()) and get the current stream via
at::cuda::getCurrentCUDAStream(); then launch the kernel using the stream
parameter (kernel<<<grid, block, 0, stream.stream()>>>(...)). After the launch
check cudaGetLastError() and propagate/report any error; if your kernel requires
specific attributes, set them via cudaFuncSetAttribute before launching. Ensure
you apply this pattern to both fp4_fake_quant_cuda and conv3d_implicit_gemm_cuda
so launches use the tensor device and PyTorch stream.

Comment on lines +1341 to +1363
# Cast to high precision -> cast to low precision, first cast has no impact and can be safely removed
# Cast to low precision -> cast to high precision affects precision and should not be removed
precision_order = [
onnx.TensorProto.DOUBLE,
onnx.TensorProto.FLOAT,
onnx.TensorProto.FLOAT16,
onnx.TensorProto.BFLOAT16,
]
consumers = [n for n in get_consumer_nodes(model, node.output[0]) if n.op_type == "Cast"]

# If the first cast has additional consumers, we should not remove it
if len(consumers) != 1:
return False

next_node = consumers[0]
first_cast_type = output_type
second_cast_type = get_cast_to_type(next_node)

return (
first_cast_type in precision_order
and second_cast_type in precision_order
and precision_order.index(first_cast_type) <= precision_order.index(second_cast_type)
)

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, check if the file exists and get context around the specified lines
cat -n modelopt/onnx/utils.py | sed -n '1320,1380p'

Repository: NVIDIA/Model-Optimizer

Length of output: 2899


🏁 Script executed:

# Also look at the broader context to understand the _is_sequential_cast function
cat -n modelopt/onnx/utils.py | sed -n '1300,1400p'

Repository: NVIDIA/Model-Optimizer

Length of output: 5013


🏁 Script executed:

# Search for _is_sequential_cast function definition
rg -n "_is_sequential_cast" modelopt/onnx/utils.py -B 5 -A 30

Repository: NVIDIA/Model-Optimizer

Length of output: 3559


🏁 Script executed:

# Also check for any references to source types or input handling
rg -n "get_node_input_type\|node_input_type\|input_type" modelopt/onnx/utils.py -B 2 -A 2

Repository: NVIDIA/Model-Optimizer

Length of output: 48


🏁 Script executed:

# Get input type information - check if _is_sequential_cast looks at source type
cat -n modelopt/onnx/utils.py | sed -n '1337,1365p' | grep -E "input_type|get_.*type|node.input"

Repository: NVIDIA/Model-Optimizer

Length of output: 272


🏁 Script executed:

# Check how _is_sequential_cast is called and what _bypass_cast_node does
cat -n modelopt/onnx/utils.py | sed -n '1456,1470p'

Repository: NVIDIA/Model-Optimizer

Length of output: 914


🏁 Script executed:

# Look for any tests or examples that use _is_sequential_cast
fd "test" -e py | head -20 | xargs grep -l "_is_sequential_cast" 2>/dev/null || echo "No tests found for this function"

Repository: NVIDIA/Model-Optimizer

Length of output: 98


🏁 Script executed:

# Check ONNX Cast operation semantics - look at any documentation or references
rg -n "FLOAT16\|BFLOAT16\|precision" modelopt/onnx/utils.py -B 1 -A 1 | head -40

Repository: NVIDIA/Model-Optimizer

Length of output: 48


🌐 Web query:

ONNX Cast operation FLOAT16 BFLOAT16 precision rounding semantics

💡 Result:

  • ONNX Cast does not define a configurable rounding mode for casts to FLOAT16 or BFLOAT16. The only rounding controls standardized in the Cast operator today are the round_mode (and saturate) attributes, and they apply only when casting to FLOAT8E8M0 (and saturate applies to float8 conversions more broadly). For other target types (including FLOAT16/BFLOAT16), the spec does not spell out rounding semantics in the operator definition. [1]

  • What most implementations do in practice

    • FLOAT16 (IEEE binary16): conversion from FLOAT/DOUBLE is typically the platform’s IEEE-754 narrowing conversion (commonly round-to-nearest, ties-to-even (RNE)).
    • BFLOAT16: many ecosystems also use RNE for FP32→BF16. For example:
      • The ONNX Python helper used to build tensors documents FP32→BF16 conversion as “rounds-to-nearest-even” by default (with an optional truncation mode). [2]
      • The OpenCL cl_intel_bfloat16_conversions extension states the only supported rounding mode is round-to-nearest-even. [3]
      • NVIDIA’s CUDA bfloat16 conversion APIs describe constructors/casts using the default round-to-nearest-even behavior. [4]

Implication: if you need bit-exact, cross-backend reproducibility for FP32→(FP16/BF16) casting, don’t rely on ONNX Cast alone to mandate rounding; validate the target runtime/hardware behavior (or emulate rounding explicitly before/after casting). [1]

Sources: ONNX Cast operator docs [1], ONNX helper BF16 conversion implementation/docs [2], OpenCL Intel BF16 conversions extension [3], NVIDIA CUDA BF16 conversion docs [4].


The source type is needed to determine cast removability: FLOAT16 and BFLOAT16 are incomparable formats.

_is_sequential_cast only inspects the output types of sequential casts but lacks the input type of the first cast. It treats FLOAT16 and BFLOAT16 as a linear precision order, but they are different formats with different exponent/mantissa distributions. A chain like FLOAT → FLOAT16 → BFLOAT16 cannot safely have the first cast removed, since removing it changes the rounding path and result. The function must fetch the source type of the first cast and handle FLOAT16/BFLOAT16 conversions as non-removable to prevent silent numeric corruption.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/onnx/utils.py` around lines 1341 - 1363, The current
_is_sequential_cast logic uses only the first cast's output_type and treats
FLOAT16 and BFLOAT16 as ordered; change it to fetch the source/input type of the
first Cast node (e.g., inspect the first Cast node's input value/type via the
node.input[0] and model value_info/graph input/initializer) instead of relying
on output_type, then compare that source_type (first_cast_src) to the second
cast type from get_cast_to_type(next_node); if the conversion path involves
FLOAT16↔BFLOAT16 in either direction treat it as non-removable (return False),
otherwise keep the precision_order comparison but use first_cast_src as the
first element for precision_order.index() checks so FLOAT16 and BFLOAT16 are not
incorrectly ordered.
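
A hedged sketch of the stricter check; precision_order and get_cast_to_type come from the existing helper, while looking up the first cast's input dtype (from value_info, graph inputs, or initializers) is left to the caller and only appears here as src_type:

import onnx

_HALF_TYPES = {onnx.TensorProto.FLOAT16, onnx.TensorProto.BFLOAT16}

def _is_removable_cast_pair(src_type, first_cast_type, second_cast_type, precision_order) -> bool:
    # FP16 and BF16 are incomparable formats: if the chain touches both, removing the
    # first cast changes the rounding path, so keep it.
    if len({src_type, first_cast_type, second_cast_type} & _HALF_TYPES) == 2:
        return False
    # Otherwise keep the existing rule: cast-to-high followed by cast-to-low makes the
    # first cast a no-op.
    return (
        first_cast_type in precision_order
        and second_cast_type in precision_order
        and precision_order.index(first_cast_type) <= precision_order.index(second_cast_type)
    )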

Comment on lines +896 to +906
Usage::

from modelopt.torch.quantization.plugins.huggingface import patch_compressed_linear_loading

with patch_compressed_linear_loading():
model = AutoModelForCausalLM.from_pretrained(
ckpt_path,
device_map="auto",
trust_remote_code=True,
torch_dtype="auto",
)

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# Locate and examine the huggingface.py file around lines 896-906
fd -t f huggingface.py | grep -E "quantization/plugins" | head -1

Repository: NVIDIA/Model-Optimizer

Length of output: 116


🏁 Script executed:

# Check if the file exists and get its line count
wc -l modelopt/torch/quantization/plugins/huggingface.py

Repository: NVIDIA/Model-Optimizer

Length of output: 121


🏁 Script executed:

# Read the relevant section including context around lines 896-906
sed -n '870,930p' modelopt/torch/quantization/plugins/huggingface.py

Repository: NVIDIA/Model-Optimizer

Length of output: 2010


🏁 Script executed:

# Search for the patch_compressed_linear_loading function definition and full docstring
sed -n '880,950p' modelopt/torch/quantization/plugins/huggingface.py | cat -n

Repository: NVIDIA/Model-Optimizer

Length of output: 3093


Fix the docstring example: make trust_remote_code caller-controlled.

The usage example in the docstring (lines 896–906) hardcodes trust_remote_code=True, which contradicts NVIDIA's security coding practices. Since this is a copy-pasteable code snippet for end-users, it should demonstrate safe defaults. Either:

  • Omit the trust_remote_code argument (letting transformers default to False), or
  • Use a caller-provided variable (e.g., trust_remote_code=trust_remote_code)

Per SECURITY.md: "Do not hardcode trust_remote_code=True when loading Hugging Face Transformers models. Let the caller decide via a parameter; default to False."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 896 - 906,
The docstring usage example for patch_compressed_linear_loading currently
hardcodes trust_remote_code=True; update the snippet so trust_remote_code is
caller-controlled by either omitting the trust_remote_code argument (to use
transformers' default False) or replacing the literal with a variable (e.g.,
trust_remote_code=trust_remote_code) before calling
AutoModelForCausalLM.from_pretrained in the example; ensure the docstring shows
the safer pattern and mentions the variable so callers can opt-in explicitly.
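
One way the docstring example could read once the flag is caller-controlled (patch_compressed_linear_loading and the from_pretrained kwargs mirror the snippet above; the wrapper function is only illustrative):

from transformers import AutoModelForCausalLM

from modelopt.torch.quantization.plugins.huggingface import patch_compressed_linear_loading

def load_compressed_model(ckpt_path: str, trust_remote_code: bool = False):
    # The caller opts into remote code explicitly; transformers defaults to False.
    with patch_compressed_linear_loading():
        return AutoModelForCausalLM.from_pretrained(
            ckpt_path,
            device_map="auto",
            trust_remote_code=trust_remote_code,
            torch_dtype="auto",
        )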

@chochowski chochowski requested a review from a team as a code owner March 18, 2026 14:37

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
examples/megatron_bridge/rank_layer_importance.py (1)

346-354: ⚠️ Potential issue | 🟠 Major

Broken layer ID mapping when --num_layers_in_first_pipeline_stage is omitted.

When args.num_layers_in_first_pipeline_stage is None, the trailing ternary makes the whole condition evaluate to 0, so it is always falsy. As a result, layer_id_in_this_rank() returns -1 for every layer and no block is ever patched.

Either make --num_layers_in_first_pipeline_stage required or infer the per-rank layer count from the model.

Proposed fix - make the argument required
     # Uneven Pipeline Parallelism parameters
     parser.add_argument("--num_layers", type=int, required=True)
-    parser.add_argument("--num_layers_in_first_pipeline_stage", type=int, default=None)
-    parser.add_argument("--num_layers_in_last_pipeline_stage", type=int, default=None)
+    parser.add_argument("--num_layers_in_first_pipeline_stage", type=int, required=True)
+    parser.add_argument("--num_layers_in_last_pipeline_stage", type=int, required=True)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/megatron_bridge/rank_layer_importance.py` around lines 346 - 354,
The condition in layer_id_in_this_rank incorrectly uses a ternary that yields 0
when args.num_layers_in_first_pipeline_stage is None, causing every check to
fail; fix by computing an explicit per-rank layer count first (e.g. num_in_first
= args.num_layers_in_first_pipeline_stage if
args.num_layers_in_first_pipeline_stage is not None else args.num_layers //
args.world_size) and then use that variable in the comparison (use layer_id >=
offset and layer_id < offset + num_in_first) so the function returns layer_id -
offset for correct ranks and -1 otherwise; reference layer_id_in_this_rank,
args.num_layers_in_first_pipeline_stage, args.num_layers, args.world_size, and
offset.
🧹 Nitpick comments (5)
examples/megatron_bridge/rank_layer_importance.py (5)

387-389: Unused patch_register variable in dropped layers loop.

The patch_register is assigned but never used since dropped layers remain patched. Consider using _ to indicate the return value is intentionally ignored.

Proposed fix
     for layer_id in args.drop_layers:
-        patch_register = patch_model(layer_id_in_this_rank(layer_id))
+        _ = patch_model(layer_id_in_this_rank(layer_id))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/megatron_bridge/rank_layer_importance.py` around lines 387 - 389,
The loop assigns patch_register from
patch_model(layer_id_in_this_rank(layer_id)) but never uses it; update the loop
over args.drop_layers to explicitly ignore the return value (e.g., replace the
variable with "_" or remove the assignment) so it's clear the result of
patch_model is intentionally unused; locate the loop using patch_register,
patch_model, layer_id_in_this_rank, and args.drop_layers and change the
assignment accordingly.

66-74: Inconsistent no-op behavior for Mamba layers.

noop_mamba_forward_patch returns the input unchanged (pass-through), while other no-op patches return zeros or clones. If this is intentional (e.g., Mamba layers require identity rather than zeroing for stability), consider adding a brief comment explaining why.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/megatron_bridge/rank_layer_importance.py` around lines 66 - 74, The
noop_mamba_forward_patch function currently returns hidden_states (identity)
while other no-op patches in this module return zeros or clones, causing
inconsistent behavior; either make noop_mamba_forward_patch match the other
no-op semantics (e.g., return zeros_like(hidden_states) or a clone) by changing
its return to the same pattern, or if identity is intentional for stability, add
a short explanatory comment above noop_mamba_forward_patch describing why Mamba
layers must be passthrough (reference noop_mamba_forward_patch and the other
no-op patch patterns) so readers know the divergence is deliberate.
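
A small sketch of the two options (the signature is a simplification of the real patch in the script; only one variant would remain, with a comment recording the choice):

import torch

def noop_mamba_forward_patch(self, hidden_states, *args, **kwargs):
    # Deliberate pass-through: if identity is required for stability, state that
    # reason here so the divergence from the zero-returning no-op patches is
    # clearly intentional.
    return hidden_states

def noop_mamba_forward_patch_zeroed(self, hidden_states, *args, **kwargs):
    # Alternative that matches the other no-op patches by returning zeros.
    return torch.zeros_like(hidden_states)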

406-409: Use context manager for file operations.

Same issue as collect_scores - the open() without context manager can leak file handles.

Proposed fix
     if is_pipeline_last_stage() and get_data_parallel_rank() == 0:
         scores = collect_scores(unwrapped_model)
         assert scores is not None
-        pickle.dump(scores, open(f"scores_{get_pipeline_model_parallel_rank()}.p", "wb"))
+        with open(f"scores_{get_pipeline_model_parallel_rank()}.p", "wb") as f:
+            pickle.dump(scores, f)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/megatron_bridge/rank_layer_importance.py` around lines 406 - 409,
The file write currently uses open(...) without a context manager which can leak
file handles; in the block guarded by is_pipeline_last_stage() and
get_data_parallel_rank() == 0 (where you call collect_scores(unwrapped_model)
and assert scores is not None), replace pickle.dump(scores,
open(f"scores_{get_pipeline_model_parallel_rank()}.p", "wb")) with a with
open(f"scores_{get_pipeline_model_parallel_rank()}.p", "wb") as f:
pickle.dump(scores, f) pattern so the file is properly closed after writing.

286-291: Use context manager for file operations and clean up debug output.

The open() call without a context manager can leak file handles. Also, the debug prints relying on the f-string {var=} syntax are too verbose for production code.

Proposed fix
-    print(f"{scores=}")
-    pickle.dump(scores, open("scores.p", "wb"))
+    with open("scores.p", "wb") as f:
+        pickle.dump(scores, f)
     print("Layers ordered by <MSE> importance:")
-    print(
-        f"{sorted([(k,v['mse'].mean()) for k,v in scores.items() if v['mse'].numel() > 0], key=lambda x:(x[1]))=}"
-    )
+    ranked = sorted(
+        [(k, v['mse'].mean().item()) for k, v in scores.items() if v['mse'].numel() > 0],
+        key=lambda x: x[1]
+    )
+    print(ranked)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/megatron_bridge/rank_layer_importance.py` around lines 286 - 291,
Replace the raw debug prints and the bare open() with a context-managed file
write and cleaner output: remove or simplify the prints that use the f"{var=}"
debug syntax (the occurrences printing scores and the sorted MSE output), and
write scores using with open("scores.p", "wb") as f: pickle.dump(scores, f).
Keep a single concise log/print that summarises the result (e.g., “Saved scores
to scores.p” and/or a short human-readable listing from the sorted MSE
computation) instead of the verbose debug f-string; update any references to
scores and the sorted MSE expression accordingly.

44-45: Deprecated reduce parameter in MSELoss.

The reduce parameter has been deprecated in favor of reduction. Using both is redundant and triggers a deprecation warning.

Proposed fix
 kl_loss = torch.nn.KLDivLoss(reduction="batchmean", log_target=True).cuda()
-mse_loss = torch.nn.MSELoss(reduce=True, reduction="mean").cuda()
+mse_loss = torch.nn.MSELoss(reduction="mean").cuda()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/megatron_bridge/rank_layer_importance.py` around lines 44 - 45, The
MSELoss instantiation uses the deprecated `reduce` parameter alongside
`reduction`; update the MSE loss creation (variable mse_loss using
torch.nn.MSELoss) to remove the `reduce=True` argument and rely only on
`reduction="mean"` and then call .cuda() as before so the line becomes a single
call to torch.nn.MSELoss(reduction="mean").cuda().
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/megatron_bridge/rank_layer_importance.py`:
- Line 181: The log message passed to print_rank_0 contains a typo: change the
string in the call to print_rank_0("> Computing distances to stored refernces")
to correct "refernces" to "references" so it reads "> Computing distances to
stored references"; update the call site where print_rank_0 is invoked in
rank_layer_importance.py accordingly.

---

Duplicate comments:
In `@examples/megatron_bridge/rank_layer_importance.py`:
- Around line 346-354: The condition in layer_id_in_this_rank incorrectly uses a
ternary that yields 0 when args.num_layers_in_first_pipeline_stage is None,
causing every check to fail; fix by computing an explicit per-rank layer count
first (e.g. num_in_first = args.num_layers_in_first_pipeline_stage if
args.num_layers_in_first_pipeline_stage is not None else args.num_layers //
args.world_size) and then use that variable in the comparison (use layer_id >=
offset and layer_id < offset + num_in_first) so the function returns layer_id -
offset for correct ranks and -1 otherwise; reference layer_id_in_this_rank,
args.num_layers_in_first_pipeline_stage, args.num_layers, args.world_size, and
offset.

---

Nitpick comments:
In `@examples/megatron_bridge/rank_layer_importance.py`:
- Around line 387-389: The loop assigns patch_register from
patch_model(layer_id_in_this_rank(layer_id)) but never uses it; update the loop
over args.drop_layers to explicitly ignore the return value (e.g., replace the
variable with "_" or remove the assignment) so it's clear the result of
patch_model is intentionally unused; locate the loop using patch_register,
patch_model, layer_id_in_this_rank, and args.drop_layers and change the
assignment accordingly.
- Around line 66-74: The noop_mamba_forward_patch function currently returns
hidden_states (identity) while other no-op patches in this module return zeros
or clones, causing inconsistent behavior; either make noop_mamba_forward_patch
match the other no-op semantics (e.g., return zeros_like(hidden_states) or a
clone) by changing its return to the same pattern, or if identity is intentional
for stability, add a short explanatory comment above noop_mamba_forward_patch
describing why Mamba layers must be passthrough (reference
noop_mamba_forward_patch and the other no-op patch patterns) so readers know the
divergence is deliberate.
- Around line 406-409: The file write currently uses open(...) without a context
manager which can leak file handles; in the block guarded by
is_pipeline_last_stage() and get_data_parallel_rank() == 0 (where you call
collect_scores(unwrapped_model) and assert scores is not None), replace
pickle.dump(scores, open(f"scores_{get_pipeline_model_parallel_rank()}.p",
"wb")) with a with open(f"scores_{get_pipeline_model_parallel_rank()}.p", "wb")
as f: pickle.dump(scores, f) pattern so the file is properly closed after
writing.
- Around line 286-291: Replace the raw debug prints and the bare open() with a
context-managed file write and cleaner output: remove or simplify the prints
that use the f"{var=}" debug syntax (the occurrences printing scores and the
sorted MSE output), and write scores using with open("scores.p", "wb") as f:
pickle.dump(scores, f). Keep a single concise log/print that summarises the
result (e.g., “Saved scores to scores.p” and/or a short human-readable listing
from the sorted MSE computation) instead of the verbose debug f-string; update
any references to scores and the sorted MSE expression accordingly.
- Around line 44-45: The MSELoss instantiation uses the deprecated `reduce`
parameter alongside `reduction`; update the MSE loss creation (variable mse_loss
using torch.nn.MSELoss) to remove the `reduce=True` argument and rely only on
`reduction="mean"` and then call .cuda() as before so the line becomes a single
call to torch.nn.MSELoss(reduction="mean").cuda().

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 9ddfa505-ef7f-4d1e-be6d-1fbf70fe9b56

📥 Commits

Reviewing files that changed from the base of the PR and between 5ee6bfb5453b5a9b2dd7c8b235b899661b3841f4 and e5904e5b813b684dad923abe0cbba6381c469527.

📒 Files selected for processing (1)
  • examples/megatron_bridge/rank_layer_importance.py

@chochowski chochowski changed the title add example script for depth importance estimation [DRAFT] add example script for depth importance estimation Mar 18, 2026
@chochowski chochowski closed this Mar 18, 2026
@chochowski chochowski force-pushed the depth_importance_example branch from e5904e5 to 7e2e85a Compare March 18, 2026 20:25
@chochowski chochowski reopened this Mar 18, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
examples/megatron_bridge/rank_layer_importance.py (2)

179-179: ⚠️ Potential issue | 🟡 Minor

Typo in log message is still present.

Line 179 has "refernces"; should be "references".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/megatron_bridge/rank_layer_importance.py` at line 179, Fix the typo
in the log message passed to print_rank_0 in rank_layer_importance.py: change
the string "refernces" to "references" so the call print_rank_0("> Computing
distances to stored refernces") becomes print_rank_0("> Computing distances to
stored references"); no other changes required.

297-306: ⚠️ Potential issue | 🔴 Critical

Layer-to-rank mapping is incorrect for uneven/multi-stage PP and can skip or mis-patch layers.

Line 303’s assertion is mathematically invalid for pp_size > 2, and layer_id_in_this_rank() (Line 362 onward) assumes every rank owns num_layers_in_first_pipeline_stage. This causes wrong global→local mapping and potential out-of-bounds patching on uneven topologies.

🐛 Proposed fix (compute per-rank spans, then map with start/end)
 def estimate_layer_importance(args: argparse.Namespace):
     pp_size = dist.size()
-    if args.num_layers_in_first_pipeline_stage is None:
-        args.num_layers_in_first_pipeline_stage = args.num_layers // pp_size
-    if args.num_layers_in_last_pipeline_stage is None:
-        args.num_layers_in_last_pipeline_stage = args.num_layers - (
-            args.num_layers_in_first_pipeline_stage * (pp_size - 1)
-        )
-    assert (
-        args.num_layers_in_first_pipeline_stage + args.num_layers_in_last_pipeline_stage
-        == args.num_layers
-    ), "Number of layers in first and last pipeline stages must sum to the total number of layers"
+    if args.num_layers_in_first_pipeline_stage is None and args.num_layers_in_last_pipeline_stage is None:
+        # Balanced fallback when topology flags are omitted.
+        base, rem = divmod(args.num_layers, pp_size)
+        pp_layer_counts = [base + (1 if r < rem else 0) for r in range(pp_size)]
+    else:
+        if (
+            args.num_layers_in_first_pipeline_stage is None
+            or args.num_layers_in_last_pipeline_stage is None
+        ):
+            raise ValueError(
+                "Provide both --num_layers_in_first_pipeline_stage and "
+                "--num_layers_in_last_pipeline_stage, or omit both."
+            )
+        first = args.num_layers_in_first_pipeline_stage
+        last = args.num_layers_in_last_pipeline_stage
+        if pp_size == 1:
+            if args.num_layers != first:
+                raise ValueError("For pp_size=1, num_layers must equal first-stage layer count.")
+            pp_layer_counts = [args.num_layers]
+        elif pp_size == 2:
+            if first + last != args.num_layers:
+                raise ValueError("For pp_size=2, first+last must equal --num_layers.")
+            pp_layer_counts = [first, last]
+        else:
+            middle_total = args.num_layers - first - last
+            if middle_total < 0 or middle_total % (pp_size - 2) != 0:
+                raise ValueError(
+                    "Invalid topology: remaining layers must be non-negative and divisible by middle stages."
+                )
+            middle = middle_total // (pp_size - 2)
+            pp_layer_counts = [first] + [middle] * (pp_size - 2) + [last]
 
     print_rank_0(f"Setting pipeline_model_parallel_size to {pp_size}")
@@
     pp_rank = get_pipeline_model_parallel_rank()
-    offset = (
-        pp_rank * args.num_layers_in_first_pipeline_stage
-        if args.num_layers_in_first_pipeline_stage
-        else 0
-    )
+    offset = sum(pp_layer_counts[:pp_rank])
+    local_layer_count = pp_layer_counts[pp_rank]
@@
     def layer_id_in_this_rank(layer_id):
-        if (
-            layer_id >= offset and layer_id < offset + args.num_layers_in_first_pipeline_stage
-            if args.num_layers_in_first_pipeline_stage
-            else 0
-        ):
+        if offset <= layer_id < offset + local_layer_count:
             return layer_id - offset
         else:
             return -1

Also applies to: 331-335, 362-370

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/megatron_bridge/rank_layer_importance.py` around lines 297 - 306,
The current logic assumes only two-stage pipeline and that every rank has
args.num_layers_in_first_pipeline_stage, which breaks for pp_size > 2 and causes
layer_id_in_this_rank() to map incorrectly; replace the simplistic computation
and assertion by computing per-rank layer spans (e.g., build an array of
start/end layer indices for each pipeline rank using args.num_layers, pp_size
and any per-stage defaults), then change layer_id_in_this_rank() to map a global
layer id to a rank-local id by finding the rank whose span contains that layer
and subtracting that span's start to get the local index; remove the invalid sum
assertion and instead validate that the union of all per-rank spans covers
0..args.num_layers-1 without overlap or gaps.
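A compact sketch of the span-based mapping the prompt describes, assuming per-rank layer counts have already been derived (the names below are illustrative, not identifiers from the script):

```python
def build_layer_spans(pp_layer_counts: list[int]) -> list[tuple[int, int]]:
    """Turn per-rank layer counts into (start, end) global layer index ranges."""
    spans, start = [], 0
    for count in pp_layer_counts:
        spans.append((start, start + count))
        start += count
    return spans

def layer_id_in_rank(layer_id: int, spans: list[tuple[int, int]], pp_rank: int) -> int:
    """Map a global layer id to a rank-local id, or -1 if this rank does not own it."""
    start, end = spans[pp_rank]
    return layer_id - start if start <= layer_id < end else -1

def validate_spans(spans: list[tuple[int, int]], num_layers: int) -> None:
    """Spans must tile 0..num_layers-1 with no gaps or overlap."""
    covered = [i for start, end in spans for i in range(start, end)]
    assert covered == list(range(num_layers)), "per-rank spans must cover every layer exactly once"

# Example: 10 layers split 4/3/3 over three pipeline stages
spans = build_layer_spans([4, 3, 3])
validate_spans(spans, 10)
assert layer_id_in_rank(5, spans, pp_rank=1) == 1   # global layer 5 is local layer 1 on rank 1
assert layer_id_in_rank(5, spans, pp_rank=2) == -1  # rank 2 does not own layer 5
```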
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/megatron_bridge/rank_layer_importance.py`:
- Around line 129-131: The code stores full-sequence activations in
self.reference_hidden (via lines like reference_hidden.append(hidden_out)) which
can OOM and ignores nlast_tokens; change each append to only keep the last
nlast_tokens tokens (or at least the final token if nlast_tokens <= 0) and offload
them to CPU/detached tensors before storing. Concretely, where you currently
append hidden_out (in __init__ flow and the other blocks around the 147-152 and
155-168 regions), replace it with something like: slice = hidden_out[:,
-nlast_tokens:, ...] if nlast_tokens>0 else hidden_out[:, -1:, ...]; then call
slice = slice.detach().cpu() and append that to self.reference_hidden so you
retain only the tail tokens on CPU and avoid keeping full-GPU histories. Ensure
any later code that reads self.reference_hidden expects the shortened sequence
length.
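A self-contained sketch of that change, assuming the [batch, seq, hidden] layout used in the prompt above (the class name is illustrative; the actual hook in the script differs):

```python
import torch

class TailHiddenBuffer:
    """Keep only the tail of each hidden state, detached and moved to CPU."""

    def __init__(self, nlast_tokens: int):
        self.nlast_tokens = nlast_tokens
        self.reference_hidden: list[torch.Tensor] = []

    def store(self, hidden_out: torch.Tensor) -> None:
        n = self.nlast_tokens
        # Fall back to the final token when nlast_tokens <= 0; adjust the sliced
        # dimension if the model emits [seq, batch, hidden] instead.
        tail = hidden_out[:, -n:, ...] if n > 0 else hidden_out[:, -1:, ...]
        self.reference_hidden.append(tail.detach().cpu())
```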

---

Duplicate comments:
In `@examples/megatron_bridge/rank_layer_importance.py`:
- Line 179: Fix the typo in the log message passed to print_rank_0 in
rank_layer_importance.py: change the string "refernces" to "references" so the
call print_rank_0("> Computing distances to stored refernces") becomes
print_rank_0("> Computing distances to stored references"); no other changes
required.
- Around line 297-306: The current logic assumes only two-stage pipeline and
that every rank has args.num_layers_in_first_pipeline_stage, which breaks for
pp_size > 2 and causes layer_id_in_this_rank() to map incorrectly; replace the
simplistic computation and assertion by computing per-rank layer spans (e.g.,
build an array of start/end layer indices for each pipeline rank using
args.num_layers, pp_size and any per-stage defaults), then change
layer_id_in_this_rank() to map a global layer id to a rank-local id by finding
the rank whose span contains that layer and subtracting that span's start to get
the local index; remove the invalid sum assertion and instead validate that the
union of all per-rank spans covers 0..args.num_layers-1 without overlap or gaps.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 5fdf2490-afcc-49ae-8b1b-2ff301b67d16

📥 Commits

Reviewing files that changed from the base of the PR and between e5904e5b813b684dad923abe0cbba6381c469527 and b17ccac.

📒 Files selected for processing (1)
  • examples/megatron_bridge/rank_layer_importance.py

Comment thread examples/megatron_bridge/rank_layer_importance.py
@chochowski chochowski changed the title [DRAFT] add example script for depth importance estimation add example script for depth importance estimation Mar 19, 2026
gcunhase and others added 4 commits March 25, 2026 01:54
… available in safety (#1067)

### What does this PR do?

**Type of change**: Bug fix

**Overview**: Remote autotuning via `--autotune` is only available with
`--safe` enabled.

See `trtexec --help`:
```sh
  --remoteAutoTuningConfig           Set the remote auto tuning config. Must be specified with --safe.
                                     Format: protocol://username[:password]@hostname[:port]?param1=value1&param2=value2
                                     Example: ssh://user:pass@192.0.2.100:22?remote_exec_path=/opt/tensorrt/bin&remote_lib_path=/opt/tensorrt/lib
```

### Usage

```sh
$ python -m modelopt.onnx.quantization.autotune \
    --onnx_path resnet50_Opset17_bs128.onnx \
    --use_trtexec \
    --trtexec_benchmark_args "--remoteAutoTuningConfig=\"<remote autotuning config>\" --safe"
```

### Testing
See `examples/onnx_ptq/autotune`.

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
N/A

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Remote autotuning now automatically enforces safety mode by including
the --safe flag if not already present, with informative warning
messages when the flag is automatically added.

* **Documentation**
* Updated remote autotuning guide to clarify that safety mode (--safe
flag) must be enabled through the trtexec benchmark arguments for proper
configuration.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
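
A hypothetical sketch of the flag enforcement described in the bug-fix note above; the function name and argument handling are illustrative only and do not reflect the actual code in modelopt.onnx.quantization.autotune:

```python
import warnings

def ensure_safe_flag(trtexec_args: list[str]) -> list[str]:
    """Append --safe when remote autotuning is requested without it."""
    uses_remote = any(arg.startswith("--remoteAutoTuningConfig") for arg in trtexec_args)
    if uses_remote and "--safe" not in trtexec_args:
        warnings.warn("--remoteAutoTuningConfig requires --safe; adding it automatically.")
        return [*trtexec_args, "--safe"]
    return trtexec_args
```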

Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
Signed-off-by: mchochowski <mchochowski@nvidia.com>
Signed-off-by: mchochowski <mchochowski@nvidia.com>
Signed-off-by: mchochowski <mchochowski@nvidia.com>
Signed-off-by: mchochowski <mchochowski@nvidia.com>
@chochowski chochowski force-pushed the depth_importance_example branch from c24c68a to 7f43534 Compare March 25, 2026 08:54
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Mar 25, 2026

PR Preview Action v1.8.1

QR code for preview link

🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1016/

Built to branch gh-pages at 2026-03-25 13:25 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 25, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.22%. Comparing base (ae965a9) to head (e23df71).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1016      +/-   ##
==========================================
+ Coverage   70.18%   70.22%   +0.03%     
==========================================
  Files         228      228              
  Lines       25952    25952              
==========================================
+ Hits        18215    18224       +9     
+ Misses       7737     7728       -9     

☔ View full report in Codecov by Sentry.

Signed-off-by: mchochowski <mchochowski@nvidia.com>
Signed-off-by: mchochowski <mchochowski@nvidia.com>
Signed-off-by: mchochowski <mchochowski@nvidia.com>
