add example script for depth importance estimation#1016
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the CodeRabbit settings, or use the CodeRabbit review commands and checkboxes to manage reviews manually.
📝 Walkthrough
A new script for estimating layer (depth) importance is added under examples/pruning.
Changes
Sequence Diagram(s)
sequenceDiagram
participant User
participant Model as Megatron Model
participant Hook as LastHiddenImportanceHook
participant Calibration as Calibration Loop
participant Metrics as Metrics Collector
User->>Model: setup_gates(unwrapped_model)
Model->>Hook: Attach to final LayerNorm/RMSNorm layers
User->>Calibration: estimate_layer_importance(args)
Calibration->>Model: load_reference(calibration_data)
Hook->>Metrics: load_reference_hidden_states()
loop For each layer in model
Calibration->>Model: patch_model(layer_id)
Note over Model: Disable layer computation via noop
Calibration->>Calibration: forward_loop(data_loader)
Calibration->>Model: Forward pass with patched layer
Model->>Hook: hook_fn collects hidden states
Hook->>Metrics: Compute normalized MSE vs reference
Calibration->>Model: unpatch_model(layer_id)
Note over Model: Restore original forward
end
Calibration->>Metrics: gather_across_dp(rankings)
Note over Metrics: Aggregate metrics across DP ranks
Calibration->>Metrics: collect_scores(aggregation)
Metrics->>User: Output ranked layer importance scores
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed
❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/pruning/depth_ranking.py`:
- Around line 303-320: In collect_scores(), avoid hardcoding a length-10 CUDA
tensor and metric assumptions: use the provided use_metric argument when reading
stats, infer device and per-sample vector lengths dynamically, and replace
torch.zeros((10,)).cuda() with a padding strategy (e.g., pad_sequence or
creating zeros using the same device and shape as stat) so torch.stack(res)
won't fail for different DP/batch sizes; ensure aggregation (median/mean)
computes over the correct dim and that drop_group/drop_blocks are applied to the
sorted_indices from the selected metric; apply the same fixes in the related
block around the other occurrence (lines ~510-514) so both places respect
variable sample counts, device, and the caller's metric/drop settings.
- Line 1: Add the standard NVIDIA Apache 2.0 license header to the top of
depth_ranking.py (replace the solitary copyright line), using the full Apache
2.0 block required by the repository policy so the file begins with the complete
license header comment instead of just the copyright notice.
- Around line 16-17: The example imports test-only utilities
(get_mcore_gpt_model, set_seed and the inference helpers) from _test_utils which
couples the example to the test tree; replace those imports with supported
runtime equivalents or relocate the helper implementations into the examples
package. Concretely, remove references to
_test_utils.torch.megatron.models.get_mcore_gpt_model and
_test_utils.torch.misc.set_seed and either import model-building and seed
utilities from the public API (or a stable examples.helpers module), and copy
any inference helper functions used on lines ~93-96 into
examples/pruning/helpers.py and import them from there so the example no longer
depends on tests.
- Around line 78-80: Wrap the transformer_engine imports in a try/except and set
a flag (e.g., HAS_TE) so importing the example won't fail when
transformer_engine is not installed, and import/alias torch.nn.LayerNorm as the
fallback norm; then update setup_gates() to detect and handle both TE norm
classes (RMSNorm/LayerNorm from transformer_engine when HAS_TE) and standard
PyTorch norms (torch.nn.LayerNorm and any other supported norm types) by
creating the same logits_gate_list entries for either case; additionally, add a
defensive check before using model.logits_gate_list[0] to raise a clear error or
construct a default gate when logits_gate_list is empty, and optionally ensure
main() pins transformer_impl to the TE backend when intended (or document that
transformer_engine must be installed).
- Around line 353-384: The layer-to-rank mapping is wrong because
layer_id_in_this_rank() assumes equal split via num_layers_per_pp and offset;
update it to use Megatron's true pipeline offsets (e.g., call the utility used
in the distill plugin like TransformerLayer._get_layer_offset() or compute
start/end using config.num_layers_in_first_pipeline_stage and
config.num_layers_in_last_pipeline_stage together with
get_pipeline_model_parallel_rank()/get_pipeline_model_parallel_world_size()) so
that global layer IDs map to local indices correctly; adjust
layer_id_in_this_rank() to return the local index or -1 based on that computed
start/end range, and ensure patch_model/unpatch_model still use the returned
local index consistently.
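For the transformer_engine import issue flagged above (lines 78-80 of depth_ranking.py), the optional-dependency guard could look roughly like this. This is a hedged sketch rather than the PR's code; `HAS_TE`, the fallback alias, and the `is_final_norm` helper are illustrative assumptions:

```python
import torch

# Guarded import: the example should still be importable without transformer_engine.
try:
    from transformer_engine.pytorch import LayerNorm as TELayerNorm
    from transformer_engine.pytorch import RMSNorm as TERMSNorm

    HAS_TE = True
except ImportError:  # transformer_engine not installed
    TELayerNorm = TERMSNorm = None
    HAS_TE = False

# Norm classes that setup_gates() would treat as the final norm to hook.
_FINAL_NORM_TYPES: tuple[type, ...] = (torch.nn.LayerNorm,)
if HAS_TE:
    _FINAL_NORM_TYPES = (torch.nn.LayerNorm, TELayerNorm, TERMSNorm)


def is_final_norm(module: torch.nn.Module) -> bool:
    """Return True if this module is a norm type the importance hook can attach to."""
    return isinstance(module, _FINAL_NORM_TYPES)
```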
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: f0060447-cc9c-4f31-b6c7-4a5eebc9255f
📒 Files selected for processing (1)
examples/pruning/depth_ranking.py
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/pruning/rank_layer_importance.py`:
- Around line 245-248: The topology flags (--num_layers,
--num_layers_in_first_pipeline_stage, --num_layers_in_last_pipeline_stage) are
treated as required but are optional and the current layer_id_in_this_rank()
math assumes uniform per-rank spans causing incorrect global IDs; make these
args required via parser.add_argument(..., required=True) or add validation
after parsing to compute per-rank layer spans from the loaded model (e.g.,
derive total num_layers from model.num_layers or model.config and compute
per-rank start/end counts using world_rank/world_size and the provided
first/last stage sizes), then replace the existing ternary/offset logic with
computed per-rank start_index/end_index used by layer_id_in_this_rank() and the
loops over range(args.num_layers) so that patches target the correct global
layer IDs.
- Around line 166-183: The hook_fn currently appends full GPU tensors to
self.reference_hidden in the reference-loading phase (self.reference_load),
which causes OOM; instead either (a) avoid caching full tensors by performing
the comparison immediately per calibration batch (compute
normalized_mse_loss_per_sample between hidden_out and the corresponding
reference and append results to self.hidden_distance/self.logits_distance), or
(b) if caching is required, move/slice the activation off device before storing
(e.g., hidden_out = hidden_out.detach().cpu() or store a reduced
summary/selected slices) and for logits use teacher_logits =
self.lm_head(...).detach().cpu() so memory stays on CPU/disk; update references
to self.reference_hidden, hook_fn, normalized_mse_loss_per_sample,
self.hidden_distance, self.logits_distance, and self.reference_load accordingly.
- Around line 81-82: Remove the unused module-level CUDA allocations kl_loss and
mse_loss (the torch.nn.KLDivLoss(...) and torch.nn.MSELoss(...) calls) which run
at import time and call .cuda(); delete these lines entirely since they are
never referenced and cause unwanted GPU initialization before dist.setup(); if
loss objects are needed later, instantiate them locally after distributed setup
(e.g., inside main or training functions) and avoid calling .cuda() at import
time.
In `@examples/pruning/README.md`:
- Around line 233-240: Update the two command fenced blocks around the usage of
rank_layer_importance.py to include a blank line before and after each fence and
add the language tag "bash" after the opening ```; also correct the typo
"interative" to "iterative" in the explanatory sentence that references
iterative estimation (search for rank_layer_importance.py and the paragraph
mentioning interative/iterative).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a5d03e5d-3a39-4448-a24c-be180264fe5e
📒 Files selected for processing (2)
examples/pruning/README.md
examples/pruning/rank_layer_importance.py
kl_loss = torch.nn.KLDivLoss(reduction='batchmean', log_target=True).cuda()
mse_loss = torch.nn.MSELoss(reduce=True, reduction='mean').cuda()
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "Module-scope CUDA allocations:"
rg -n '^\s*[A-Za-z_][A-Za-z0-9_]*\s*=\s*torch\.nn\..*\.cuda\(\)' examples/pruning/rank_layer_importance.py
echo
echo "Distributed/device setup points in the same file:"
rg -n -C2 'dist\.setup\(|torch\.cuda\.set_device|LOCAL_RANK|\.cuda\(' examples/pruning/rank_layer_importance.py
Repository: NVIDIA/Model-Optimizer
Length of output: 914
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Usage of kl_loss and mse_loss variables ==="
rg -n 'kl_loss|mse_loss' examples/pruning/rank_layer_importance.py
echo
echo "=== File structure around lines 81-82 to confirm module scope ==="
sed -n '75,95p' examples/pruning/rank_layer_importance.py | cat -n
echo
echo "=== Verify __main__ block and entry point ==="
sed -n '410,419p' examples/pruning/rank_layer_importance.py | cat -n
Repository: NVIDIA/Model-Optimizer
Length of output: 2114
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Total lines in file ==="
wc -l examples/pruning/rank_layer_importance.py
echo
echo "=== Exact match for 'kl_loss' references (case-sensitive, whole word) ==="
rg -w 'kl_loss' examples/pruning/rank_layer_importance.py
echo
echo "=== Exact match for 'mse_loss' references (case-sensitive, whole word) ==="
rg -w 'mse_loss' examples/pruning/rank_layer_importance.py
Repository: NVIDIA/Model-Optimizer
Length of output: 581
Remove unused module-scope CUDA allocations.
Lines 81–82 allocate kl_loss and mse_loss at module import time, before dist.setup() runs. This causes every worker to touch the default GPU during import, making the script fragile under torch.distributed.run. Additionally, these variables are never referenced anywhere in the file—the code uses F.mse_loss() instead—so they should be removed entirely rather than moved.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/pruning/rank_layer_importance.py` around lines 81 - 82, Remove the
unused module-level CUDA allocations kl_loss and mse_loss (the
torch.nn.KLDivLoss(...) and torch.nn.MSELoss(...) calls) which run at import
time and call .cuda(); delete these lines entirely since they are never
referenced and cause unwanted GPU initialization before dist.setup(); if loss
objects are needed later, instantiate them locally after distributed setup
(e.g., inside main or training functions) and avoid calling .cuda() at import
time.
def hook_fn(self, module, input, output):
    # seq x batch x dim
    hidden_out = output.detach().permute(1, 0, 2)  # batch x seq x dim

    # if loading the reference form teacher
    if self.reference_load:
        self.reference_hidden.append(hidden_out)
        return

    # if computing the distance to the reference
    sample_id = len(self.hidden_distance)
    # MSE
    self.hidden_distance.append( normalized_mse_loss_per_sample(hidden_out, self.reference_hidden[sample_id]).mean() )
    # if computing the distance to the teacher's logits
    if self.lm_head:
        teacher_logits = self.lm_head(self.reference_hidden[sample_id].permute(1, 0, 2))[0].detach()
        logits = self.lm_head(hidden_out.permute(1, 0, 2))[0].detach()
        self.logits_distance.append( normalized_mse_loss_per_sample(logits, teacher_logits).mean() )
Caching every reference hidden state on-device will OOM quickly.
self.reference_hidden.append(hidden_out) keeps the full final hidden tensor for every calibration batch. With the defaults in this file (calib_num_samples=1024, seq_length=4096) and the 12B example documented in the README, that is on the order of tens of GB on the last PP rank before the ranking loop even starts. Please stream the comparison batch-by-batch, or at least slice/offload the reference activations before caching them.
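One way to keep memory bounded is to cache the reference activations on the CPU and bring each sample back to the GPU only for its comparison. Below is a hedged sketch reusing the attribute names quoted above; the `non_blocking` transfer details are assumptions, not the PR's code, and the logits branch is omitted for brevity:

```python
def hook_fn(self, module, input, output):
    hidden_out = output.detach().permute(1, 0, 2)  # batch x seq x dim

    if self.reference_load:
        # Offload to CPU so 1024 samples x 4096 tokens never accumulate on the GPU.
        self.reference_hidden.append(hidden_out.to("cpu", non_blocking=True))
        return

    sample_id = len(self.hidden_distance)
    # Bring only this sample's reference back to the GPU for the comparison.
    reference = self.reference_hidden[sample_id].to(hidden_out.device, non_blocking=True)
    self.hidden_distance.append(
        normalized_mse_loss_per_sample(hidden_out, reference).mean()
    )
```

Streaming the comparison batch-by-batch (never caching at all) would avoid even the CPU copy, at the cost of running the reference and the patched forward in lockstep.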
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/pruning/rank_layer_importance.py` around lines 166 - 183, The
hook_fn currently appends full GPU tensors to self.reference_hidden in the
reference-loading phase (self.reference_load), which causes OOM; instead either
(a) avoid caching full tensors by performing the comparison immediately per
calibration batch (compute normalized_mse_loss_per_sample between hidden_out and
the corresponding reference and append results to
self.hidden_distance/self.logits_distance), or (b) if caching is required,
move/slice the activation off device before storing (e.g., hidden_out =
hidden_out.detach().cpu() or store a reduced summary/selected slices) and for
logits use teacher_logits = self.lm_head(...).detach().cpu() so memory stays on
CPU/disk; update references to self.reference_hidden, hook_fn,
normalized_mse_loss_per_sample, self.hidden_distance, self.logits_distance, and
self.reference_load accordingly.
- To estimate importance of each layer one can run `rank_layer_importance.py` script. This script computes importance of each layer by comparing the MSE between the final hidden representation with and without that layer.
```
python -m torch.distributed.run --nproc_per_node=2 /path/to/modelopt/examples/pruning/rank_layer_importance.py --hf_model_name_or_path /path/to/hf-checkpoint/nvidia/NVIDIA-Nemotron-Nano-12B-v2 --trust_remote_code --calib_dataset_name wikitext --num_layers_in_first_pipeline_stage 31 --num_layers_in_last_pipeline_stage 31 --num_layers 62
```
- One can also pass indices of layers that should be dropped always. This allows running an interative estimation e.g. in first iteration score all layers, pick 5 least important, and in the next iteration pass these 5 layers to be dropped, so that it ranks the rest of the layers assuming these 5 are dropped.
```
python -m torch.distributed.run --nproc_per_node=2 /path/to/modelopt/examples/pruning/rank_layer_importance.py --hf_model_name_or_path /path/to/hf-checkpoint/nvidia/NVIDIA-Nemotron-Nano-12B-v2 --trust_remote_code --calib_dataset_name wikitext --num_layers_in_first_pipeline_stage 31 --num_layers_in_last_pipeline_stage 31 --num_layers 62 --drop_layers 6 7 9 32 41
```
Fix the new command fences and typo.
Both command blocks need a `bash` language tag and blank lines around the fences, and Line 238 still says interative.
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)
[warning] 235-235: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
[warning] 237-237: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
[warning] 239-239: Fenced code blocks should be surrounded by blank lines
(MD031, blanks-around-fences)
[warning] 239-239: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/pruning/README.md` around lines 233 - 240, Update the two command
fenced blocks around the usage of rank_layer_importance.py to include a blank
line before and after each fence and add the language tag "bash" after the
opening ```; also correct the typo "interative" to "iterative" in the
explanatory sentence that references iterative estimation (search for
rank_layer_importance.py and the paragraph mentioning interative/iterative).
@@ -0,0 +1,419 @@
# Adapted from https://github.com/EleutherAI/lm-evaluation-harness/tree/aa457edc3d64d81530159cd3a182932320c78f8c

# MIT License
please use the NVIDIA Apache license and change the year to 2026
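For reference, the header usually looks like the block below; this is only a sketch, and the exact SPDX lines and wording should be copied from an existing file in the repository:

```python
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
```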
from megatron.core.packed_seq_params import PackedSeqParams
from megatron.core.rerun_state_machine import RerunMode, get_rerun_state_machine

from megatron.bridge.models.mamba.mamba_provider import MambaModelProvider
does this example belong in the Megatron-Bridge repo? @kevalmorabia97
we probably want to keep MBridge examples in the MBridge repo, to reduce circular dependencies between the two repos
We can move to ModelOpt/examples/megatron_bridge where we have pruning and distillation for M-Bridge already
@chochowski commits seem to have been messed up. PTAL
Actionable comments posted: 8
Note
Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
modelopt/torch/sparsity/attention_sparsity/calibration/calibrator.py (1)
123-149: ⚠️ Potential issue | 🟡 Minor — Good refactoring to single-pass calibration.
The consolidation from multiple forward passes to a single pass with all thresholds is a solid performance improvement. However, there's a potential silent data loss if `sparsity_list` length doesn't match `self.threshold_trials`: on Line 140, `zip(self.threshold_trials, sparsity_list)` will silently truncate to the shorter sequence. While the assertion at lines 320-322 ensures all modules return consistent lengths, it doesn't verify this matches `len(self.threshold_trials)`.
🛡️ Proposed fix to add length verification
  for sample_stat in per_sample_stats:
      length = sample_stat["sample_length"]
      sparsity_list = sample_stat["sparsity"]
+     if len(sparsity_list) != len(self.threshold_trials):
+         raise ValueError(
+             f"Sparsity list length ({len(sparsity_list)}) doesn't match "
+             f"threshold_trials length ({len(self.threshold_trials)})"
+         )
      for threshold, sparsity in zip(self.threshold_trials, sparsity_list):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/sparsity/attention_sparsity/calibration/calibrator.py` around lines 123 - 149, The loop building all_data_points can silently drop thresholds because zip(self.threshold_trials, sparsity_list) truncates to the shorter sequence; add an explicit length check right after per_sample_stats is obtained (after calling self._extract_calibration_stats) to assert or raise if any sample_stat["sparsity"] length != len(self.threshold_trials), including clear context in the error message (e.g., referencing the sample's "sample_length" or module id) so the mismatch is detected immediately and not silently ignored when populating all_data_points.modelopt/torch/sparsity/attention_sparsity/model_sparsify.py (1)
82-90: ⚠️ Potential issue | 🟡 Minor — Update the calibration example to the new `target_sparse_ratio` shape.
This snippet still shows a scalar `target_sparse_ratio`, but the current calibration config expects a per-phase dict. Copying the example is inconsistent with the new schema.
Suggested fix
"*attention*": { "method": "flash_skip_softmax", "backend": "pytorch", "enable": True, "calibration": { # Enables automatic threshold calibration - "target_sparse_ratio": 0.5, + "target_sparse_ratio": {"prefill": 0.5, "decode": 0.5}, "samples": 48, "max_seqlen": 8192, },🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/sparsity/attention_sparsity/model_sparsify.py` around lines 82 - 90, The example calibration block for config["sparse_cfg"]["*attention*"]["calibration"] uses a scalar target_sparse_ratio but the schema now requires a per-phase dict; update the example to replace "target_sparse_ratio": 0.5 with a mapping of phases to ratios (e.g. "target_sparse_ratio": {"train": 0.5, "eval": 0.5}) so that the config key "calibration" matches the new per-phase shape expected by the calibration logic.modelopt/torch/sparsity/attention_sparsity/config.py (1)
125-159: ⚠️ Potential issue | 🟡 Minor — Reject empty `thresholds` maps.
`{}` currently passes validation even though the field contract says at least one of `prefill`/`decode` must be present. That turns a typoed config into a silent runtime fallback instead of a validation error.
Suggested fix
  valid_phases = {"prefill", "decode"}
+ if not v:
+     raise ValueError(
+         "thresholds must include at least one phase: 'prefill' and/or 'decode'"
+     )
  invalid_keys = set(v.keys()) - valid_phases
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/sparsity/attention_sparsity/config.py` around lines 125 - 159, The validate_thresholds field validator allows an empty dict for thresholds, so add a check in validate_thresholds to reject an empty mapping and raise a ValueError when no phase keys are present; specifically, in the validate_thresholds(cls, v) function (the validator for the thresholds field) verify that v is non-empty and contains at least one of the valid phases ('prefill' or 'decode') and raise a clear ValueError if v is {} or otherwise missing both keys before proceeding with the existing per-phase list and length checks.examples/llm_ptq/hf_ptq.py (1)
361-373: ⚠️ Potential issue | 🟠 Major — `--low_memory_mode` still initializes from `--qformat`, not `--recipe`.
This branch always feeds `QUANT_CFG_CHOICES[args.qformat]` into `init_quantized_weights()`. Later, calibration-only mode only calls `mtq.calibrate()` and never reapplies `recipe.ptq_cfg`, so a low-memory recipe run can end up with the wrong quantizer layout and algorithm.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/llm_ptq/hf_ptq.py` around lines 361 - 373, The code is always seeding quant_cfg from QUANT_CFG_CHOICES[args.qformat], which breaks low-memory runs that should use the recipe PTQ layout; change the initialization so quant_cfg is derived from the recipe (recipe.ptq_cfg) when a recipe is provided (or when --low_memory_mode is active) instead of args.qformat, then apply the same KV adjustments (mtq.utils.update_quant_cfg_with_kv_cache_quant and _set_kv_cache_constant_amax) to that quant_cfg before calling init_quantized_weights; ensure references to QUANT_CFG_CHOICES, args.qformat, args.kv_cache_qformat, _KV_CAST_FORMATS, init_quantized_weights, recipe.ptq_cfg and mtq.calibrate are updated so calibration-only flows use the recipe-derived quant_cfg.
♻️ Duplicate comments (3)
examples/pruning/rank_layer_importance.py (3)
219-221: ⚠️ Potential issue | 🟠 Major — Make topology inputs required (or derive them) before layer iteration.
`args.num_layers` is optional but used in `range(args.num_layers)`, and the offset/rank mapping assumes uniform stage spans. This can fail when args are omitted and can patch incorrect global layers in uneven PP layouts (see the sketch below). Also applies to: 303-307, 334-342, 381-381
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/pruning/rank_layer_importance.py` around lines 219 - 221, The code calls range(args.num_layers) and computes an offset/rank mapping assuming uniform pipeline-stage spans while args.num_layers and stage-size args (args.num_layers_in_first_pipeline_stage, args.num_layers_in_last_pipeline_stage) are optional; make these topology inputs required or derive them before any layer iteration: ensure args.num_layers is populated by either making parser.add_argument(..., required=True) or computing it from the model/config (or by summing stage sizes if provided), and compute/validate args.num_layers_in_first_pipeline_stage and args.num_layers_in_last_pipeline_stage (or derive them from pipeline layout or by dividing layers across stages with explicit checks) before any use in range(args.num_layers) or the offset/rank mapping logic so uneven PP layouts are handled correctly and raise a clear error if topology cannot be inferred.
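A hedged sketch of deriving per-rank layer spans instead of assuming uniform stages; the Megatron `parallel_state` helpers are the ones named in this review, while the rest is illustrative and assumes at least two pipeline stages:

```python
from megatron.core import parallel_state


def local_layer_span(args) -> tuple[int, int]:
    """Return the [start, end) range of global layer IDs owned by this PP rank (sketch)."""
    pp_rank = parallel_state.get_pipeline_model_parallel_rank()
    pp_size = parallel_state.get_pipeline_model_parallel_world_size()
    first = args.num_layers_in_first_pipeline_stage
    last = args.num_layers_in_last_pipeline_stage
    middle = args.num_layers - first - last
    per_middle = middle // max(pp_size - 2, 1)  # assumes uniform middle stages

    if pp_rank == 0:
        return 0, first
    if pp_rank == pp_size - 1:
        return args.num_layers - last, args.num_layers
    start = first + (pp_rank - 1) * per_middle
    return start, start + per_middle


def layer_id_in_this_rank(global_layer_id: int, args) -> int:
    """Map a global layer ID to a local index, or -1 if the layer lives on another rank."""
    start, end = local_layer_span(args)
    return global_layer_id - start if start <= global_layer_id < end else -1
```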
152-153: ⚠️ Potential issue | 🔴 Critical — Avoid retaining full reference activations on GPU for all samples.
Line 153 stores full hidden states for every calibration step in `self.reference_hidden`, which can OOM with long sequences and large sample counts. Stream comparisons or offload cached references to CPU.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/pruning/rank_layer_importance.py` around lines 152 - 153, The code currently appends full GPU-resident hidden states (hidden_out) to self.reference_hidden in the calibration path (self.reference_load), which can cause OOM; instead, convert and detach the tensor before storing off-GPU (e.g., hidden_out.detach().cpu() or .to('cpu')) or avoid storing all samples by performing streaming comparisons immediately and only keeping aggregated statistics; update the logic around self.reference_load/self.reference_hidden in the calibration routine (where hidden_out is produced) to either offload tensors to CPU/memory-mapped storage or replace append with an online comparison step so GPU memory usage stays bounded.
44-45: ⚠️ Potential issue | 🟠 Major — Remove import-time CUDA allocations.
Line 44 and Line 45 allocate CUDA modules at import time and these variables are unused. This can touch the wrong device before distributed setup.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/pruning/rank_layer_importance.py` around lines 44 - 45, The two module instances kl_loss and mse_loss are being created and moved to CUDA at import time (symbols kl_loss and mse_loss), which touches the wrong device and they are unused; remove these import-time allocations by deleting these lines or by instantiating the losses where they are actually used (e.g., inside the training/evaluation function) and without calling .cuda() at module import; also fix the MSELoss creation to use the correct signature (remove the deprecated/redundant reduce=True and use reduction="mean") when constructing it at usage time.
🟠 Major comments (18)
examples/diffusers/quantization/utils.py-84-86 (1)
84-86:⚠️ Potential issue | 🟠 Major
`proj_out` should match anywhere in the name, not just at the start.
The current pattern `proj_out.*` only matches when the name starts with `proj_out` because `re.match()` anchors to the string beginning. This is inconsistent with the other terms, which use `.*(...).*` to match anywhere. Nested module names like `block.proj_out.weight` won't match. Use the same "anywhere" style as the other terms for consistency.
Suggested regex fix
- pattern = re.compile(
-     r"(proj_out.*|.*(time_text_embed|context_embedder|x_embedder|norm_out|time_guidance_embed|stream_modulation).*)"
- )
+ pattern = re.compile(
+     r".*(proj_out|time_text_embed|context_embedder|x_embedder|norm_out|time_guidance_embed|stream_modulation).*"
+ )
Verify each finding against the current code and only fix it if needed. In `@examples/diffusers/quantization/utils.py` around lines 84 - 86, The regex assigned to the variable pattern currently uses "proj_out.*", which only matches at the start when used with re.match; change the pattern so "proj_out" is matched anywhere in the name like the other terms (for example by using ".*proj_out.*" or including proj_out inside the inner .*(...).* alternation), updating the compiled pattern expression so nested names like "block.proj_out.weight" are matched by pattern.modelopt/torch/sparsity/attention_sparsity/methods/flash_skip_softmax.py-59-71 (1)
59-71:⚠️ Potential issue | 🟠 MajorAvoid phase-order dependent threshold reuse.
_update_thresholds()falls back toself.thresholds, so a run that only configures one phase can reuse the last phase’s list for the next phase. After a decode call, the next prefill can silently inherit decode thresholds.Suggested fix
- self.thresholds = self.thresholds_config.get("prefill", [1e-3]) + self.thresholds = list( + self.thresholds_config.get("prefill") + or self.thresholds_config.get("decode") + or [1e-3] + ) @@ def _update_thresholds(self, phase: str): """Update thresholds list based on phase.""" - self.thresholds = self.thresholds_config.get(phase, self.thresholds) + self.thresholds = list( + self.thresholds_config.get(phase) + or self.thresholds_config.get("prefill") + or self.thresholds_config.get("decode") + or [1e-3] + )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/sparsity/attention_sparsity/methods/flash_skip_softmax.py` around lines 59 - 71, _update_thresholds currently falls back to the existing self.thresholds which allows thresholds from a previous phase to leak into a new phase; change _update_thresholds to read from self.thresholds_config and fall back to a deterministic default (e.g., self.thresholds_config.get("prefill", [1e-3])) instead of self.thresholds, and ensure you assign a shallow copy to self.thresholds to avoid shared-list mutation; update the method named _update_thresholds (and any callers relying on thresholds) so phase transitions always load explicit config for the requested phase rather than reusing the last-set list.modelopt/torch/sparsity/attention_sparsity/stats_manager.py-69-74 (1)
69-74: ⚠️ Potential issue | 🟠 Major — Guard `sparse_blocks` aggregation against threshold-count changes.
This merge path assumes every forward reports the same list length. That breaks once the same manager sees calibration's single-threshold stats and static multi-threshold stats: shorter lists leave stale totals, and a longer list will raise on Line 74.
Suggested fix
incoming = stats["sparse_blocks"] - if "sparse_blocks" not in self.aggregated_stats: - self.aggregated_stats["sparse_blocks"] = list(incoming) - else: - for i, val in enumerate(incoming): - self.aggregated_stats["sparse_blocks"][i] += val + aggregated = self.aggregated_stats.get("sparse_blocks") + if aggregated is None: + self.aggregated_stats["sparse_blocks"] = list(incoming) + elif len(aggregated) != len(incoming): + raise ValueError( + f"Inconsistent sparse_blocks length: expected {len(aggregated)}, got {len(incoming)}" + ) + else: + for i, val in enumerate(incoming): + aggregated[i] += val🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/sparsity/attention_sparsity/stats_manager.py` around lines 69 - 74, The aggregation assumes aggregated_stats["sparse_blocks"] and the incoming list have the same length; update the logic in the method that handles incoming = stats["sparse_blocks"] so it first compares lengths and handles mismatches: if aggregated_stats has no "sparse_blocks" key or lengths differ, reinitialize aggregated_stats["sparse_blocks"] to a copy of incoming (or to a zero-filled list matching the longer length) to avoid stale totals or IndexError when enumerating; otherwise, safely sum element-wise by iterating to min(len(incoming), len(existing)) and copy any remaining tail from the longer list into aggregated_stats["sparse_blocks"] so both shorter and longer incoming lists are handled without errors.experimental/conv/test_implicit_gemm.py-25-49 (1)
25-49: ⚠️ Potential issue | 🟠 Major — Skip on unsupported SMs, not just when CUDA exists.
The runtime in `experimental/conv/implicit_gemm_cuda.py` enforces SM80+ via `_MIN_SM_MAJOR = 8` and raises `RuntimeError` on older GPUs. The test skip marker currently only checks `torch.cuda.is_available()`, so on SM70 hardware with CUDA installed, tests will error during execution instead of skipping. The Triton FP4 marker has the same problem: it only checks attribute availability, not SM8.9+, despite its own reason string documenting that requirement.
Suggested fix
+def _sm_at_least(major: int, minor: int = 0) -> bool: + if not torch.cuda.is_available(): + return False + return torch.cuda.get_device_capability() >= (major, minor) + + -pytestmark = pytest.mark.skipif(not torch.cuda.is_available(), reason="CUDA not available") +pytestmark = pytest.mark.skipif(not _sm_at_least(8, 0), reason="Requires CUDA SM80+") requires_triton_fp4 = pytest.mark.skipif( - not _triton_fp4_available(), + not (_sm_at_least(8, 9) and _triton_fp4_available()), reason="Triton fp4_fake_quant_block not available (requires compute >= 8.9)", )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@experimental/conv/test_implicit_gemm.py` around lines 25 - 49, The skip logic must check GPU SM version, not just CUDA presence: update the module-level pytestmark to skip when CUDA is unavailable OR the current device compute capability is below major 8 (the runtime constant _MIN_SM_MAJOR = 8 enforced by experimental.conv.implicit_gemm_cuda), and update _triton_fp4_available()/requires_triton_fp4 to additionally verify device compute capability >= 8.9 (major >= 8 and minor >= 9) before claiming Triton fp4 support; use torch.cuda.is_available() and torch.cuda.get_device_properties(0).major/minor (with safe guards if no device) to implement these checks and keep the existing import checks (conv3d_implicit_gemm_cuda and triton kernel attribute) intact.experimental/conv/implicit_gemm_cuda.py-217-228 (1)
217-228: ⚠️ Potential issue | 🟠 Major — Replace `assert` with proper runtime validation for preconditions.
The divisibility assertion at line 224 disappears under `python -O`, leaving the function vulnerable to runtime errors. The downstream CUDA kernel computes `num_blocks = N / block_size` with integer division, which will silently truncate work if N is not divisible by block_size, violating the documented precondition. Additionally, the function lacks validation for device placement, scalar `global_amax`, and positive `block_size`.
Replace the assertion with explicit checks for:
- `block_size > 0`
- `x` and `global_amax` are CUDA tensors
- `x` and `global_amax` are on the same device
- `global_amax` is a scalar (numel == 1)
- `x_f32.numel() % block_size == 0`
Verify each finding against the current code and only fix it if needed. In `@experimental/conv/implicit_gemm_cuda.py` around lines 217 - 228, Replace the fragile assert with explicit runtime validation before calling cuda_mod.fp4_fake_quant_cuda: check that block_size is an int > 0 and raise ValueError if not; ensure x.is_cuda and global_amax.is_cuda and raise TypeError/ValueError if either is not CUDA; ensure x.device == global_amax.device and raise ValueError if they differ; ensure global_amax.numel() == 1 and raise ValueError if not; compute N = x.float().contiguous().numel() (use x_f32.numel()) and verify N % block_size == 0, raising ValueError with a clear message if it fails; keep the rest (x_f32, amax_f32, cuda_mod.fp4_fake_quant_cuda) unchanged.examples/llm_sparsity/attention_sparsity/hf_sa.py-150-155 (1)
150-155:⚠️ Potential issue | 🟠 MajorMove inputs to the model's entry device, not the default GPU.
With
device_map="auto", the model may place different shards on different devices (including CPU for offloading). Hardcodinginputs.cuda()at Line 118 assumes the first shard (input embedding) is on the default GPU, which may not hold. This causes errors or unnecessary device transfers for offloaded or multi-GPU models.Instead, move inputs to the model's entry device:
dev = model.get_input_embeddings().weight.device inputs = {k: v.to(dev) for k, v in inputs.items()}This ensures inputs match wherever Accelerate's auto-dispatch places the first model layer.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/llm_sparsity/attention_sparsity/hf_sa.py` around lines 150 - 155, The code currently assumes inputs are on the default GPU by calling inputs.cuda(); instead, after loading the model with AutoModelForCausalLM.from_pretrained(..., device_map="auto"), determine the model's entry device via model.get_input_embeddings().weight.device and move the inputs there (e.g., set inputs = {k: v.to(dev) for k,v in inputs.items()}) so input tensors land on the same device/shard as the model's input embedding and avoid incorrect hard-coded GPU transfers.tests/gpu/torch/quantization/test_sequential_calibrate.py-325-331 (1)
325-331: ⚠️ Potential issue | 🟠 Major — This assertion doesn't prove run-mode replay reached layer 1.
`actual = model.layers[0](x)` only shows that `weight_doubling_calib()` changed layer 0. The test still passes if the next calibration step never replays those updated activations into layer 1. Record the inputs observed by `model.layers[1]` during its calibration step and assert those against `activations_before_weight_update * 2.0`.
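A hedged sketch of the suggested check; `model`, `x`, and `activations_before_weight_update` are the test's own names, while `run_layer1_calibration_step()` is a hypothetical stand-in for whatever triggers layer 1's calibration pass:

```python
import torch

# Record the positional input that model.layers[1] receives during its calibration
# step, so the test can assert the replay really used layer 0's doubled outputs.
recorded_inputs: list[torch.Tensor] = []


def _record_input(module, args):
    recorded_inputs.append(args[0].detach().clone())


handle = model.layers[1].register_forward_pre_hook(_record_input)
try:
    run_layer1_calibration_step()  # hypothetical trigger for layer 1's calibration
finally:
    handle.remove()

assert torch.allclose(recorded_inputs[0], activations_before_weight_update * 2.0)
```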
Verify each finding against the current code and only fix it if needed. In `@tests/gpu/torch/quantization/test_sequential_calibrate.py` around lines 325 - 331, The test currently only checks that weight_doubling_calib updated model.layers[0] by running actual = model.layers[0](x) but doesn't verify that the replay used those updated outputs as inputs to model.layers[1]; modify the test to capture the inputs observed by model.layers[1] during its calibration step (e.g., attach a short forward hook or wrapper on model.layers[1] to record the incoming tensor when the calibration for layer 1 runs) and then assert that the recorded inputs equal activations_before_weight_update * 2.0, keeping the existing check that layer 0's weights were updated (model.layers[0](x)) and using the recorded_inputs variable in the new assertion.modelopt/torch/_deploy/utils/torch_onnx.py-595-624 (1)
595-624:⚠️ Potential issue | 🟠 MajorCreate
model_metadatafrom the final ONNX graph.
model_metadatais built a few lines above this block, beforequantize_weights(), FP16 conversion, cast retargeting, andremove_redundant_casts(). That leavesonnx_node_namesout of sync with the model bytes returned from this function. Please movecreate_model_metadata()below the last graph mutation.📦 Suggested move
- model_metadata = create_model_metadata( - tree_spec_input, tree_spec_output, input_none_names, onnx_opt_graph, model - ) - onnx_opt_graph = quantize_weights(model, onnx_opt_graph) @@ # TODO: Remove manual ir_version change once ORT supports ir_version 11 # Must be set after all gs.export_onnx() calls as graphsurgeon resets ir_version onnx_opt_graph.ir_version = 10 + + model_metadata = create_model_metadata( + tree_spec_input, tree_spec_output, input_none_names, onnx_opt_graph, model + )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/_deploy/utils/torch_onnx.py` around lines 595 - 624, The model_metadata is being created before the final ONNX graph mutations, so onnx_node_names become out-of-sync with the returned model bytes; move the create_model_metadata() call (and any usage that builds model_metadata / onnx_node_names) to after all graph transformations such as quantize_weights(), qdq_to_dq(), convert_float_to_float16()/convert_to_f16(), change_casts_to_fp16(), remove_redundant_casts(), replace_zero_scale_with_smallest_nonzero(), and the final ir_version assignment on onnx_opt_graph so model_metadata is generated from the final onnx_opt_graph.modelopt/onnx/utils.py-1374-1388 (1)
1374-1388:⚠️ Potential issue | 🟠 MajorHandle graph-input → graph-output casts before bypassing them.
If
input_tensoris a graph input,get_producer_nodes()is empty and this branch never reconnectsoutput_tensor. The subsequent node removal leaves a disconnected graph output; keep the cast in that case or replace it with anIdentity.examples/llm_ptq/example_utils.py-617-623 (1)
617-623:⚠️ Potential issue | 🟠 MajorReuse the caller’s
device_mapand model kwargs in the pack-quant path.This branch hardcodes
device_map="auto"and rebuilds kwargs, sodevice="cpu",use_seq_device_map,max_memory, andattn_implementationare silently ignored for pack-quantized checkpoints.Suggested fix
with patch_compressed_linear_loading(): model = AutoModelForCausalLM.from_pretrained( ckpt_path, - device_map="auto", - trust_remote_code=trust_remote_code, - torch_dtype="auto", + device_map=device_map, + **model_kwargs, )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/llm_ptq/example_utils.py` around lines 617 - 623, The pack-quant branch hardcodes device_map="auto" and rebuilds kwargs, causing caller-supplied device_map and model kwargs (like device, use_seq_device_map, max_memory, attn_implementation) to be ignored; update the block that calls AutoModelForCausalLM.from_pretrained inside patch_compressed_linear_loading() to merge and reuse the original caller kwargs (pass through the incoming device_map and any model kwargs) instead of overwriting them with device_map="auto" or rebuilding defaults, ensuring ckpt_path, trust_remote_code and other provided kwargs are forwarded to AutoModelForCausalLM.from_pretrained so pack-quant loading respects the caller’s settings.modelopt/onnx/export/fp8_exporter.py-125-136 (1)
125-136:⚠️ Potential issue | 🟠 MajorDerive the synthesized FP8 zero-point shape from the scale tensor.
The hardcoded
[1]zero-point shape violates ONNXQuantizeLinearspecification. Per ONNX spec,zero_pointmust exactly match the shape ofy_scale: for per-tensor quantization both must be scalars, and for per-axis quantization both must be identical 1-D tensors. This hardcoded approach only works for scalar or length-1 scales and will produce invalid ONNX graphs for per-axis/per-channel FP8 quantization (e.g., when scale is a 1-D tensor of length N for N channels). Inspectnode.inputs[1](the scale tensor) and create the zero-point with matching shape.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/export/fp8_exporter.py` around lines 125 - 136, The zero-point creation for FP8 uses a hardcoded 1-element shape which breaks per-axis quantization; instead read the scale tensor at node.inputs[1] (the y_scale for QuantizeLinear), copy its shape/dims into zp_tensor.dims so the zero_point matches per-tensor vs per-axis shape, and create raw_data that encodes the appropriate number of FP8 zeros for that shape/type. Update the block that constructs zp_tensor/LazyValues/zero_point (referencing zp_tensor, LazyValues, zero_point, and node.inputs[1]) to derive dims from the scale tensor and ensure the raw_data byte length and element encoding match FLOAT8E4M3FN and the element count implied by those dims.modelopt/torch/quantization/model_calib.py-1882-1883 (1)
1882-1883:⚠️ Potential issue | 🟠 MajorGuard CUDA cache clearing for CPU-only execution paths.
Line 1883 calls
torch.cuda.empty_cache()unconditionally. This breaks CPU-only runs and tests. Other calls in this file are guarded withif torch.cuda.is_available():(lines 408, 646, 682, 687); apply the same pattern here.Proposed fix
del layer_inputs - torch.cuda.empty_cache() + if torch.cuda.is_available(): + torch.cuda.empty_cache()🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/model_calib.py` around lines 1882 - 1883, The unconditional call to torch.cuda.empty_cache() after deleting layer_inputs can fail in CPU-only environments; update the block in model_calib.py (near where layer_inputs is deleted) to call torch.cuda.empty_cache() only if torch.cuda.is_available(), mirroring the existing guards used elsewhere in this file (see analogous checks around lines with torch.cuda.is_available()); ensure you wrap only the torch.cuda.empty_cache() call (leave del layer_inputs unchanged) so CPU-only tests no longer hit CUDA APIs.modelopt/torch/quantization/config.py-1049-1056 (1)
1049-1056:⚠️ Potential issue | 🟠 Major
use_constant_amaxdoes not suppress calibration yet.This flag only changes
_get_amax().need_calibration()still treats these quantizers as ordinary static quantizers, so_castKV modes still trigger a calibration loop in plain PTQ flows (for example,int8_wo + fp8_cast), which contradicts the documented “skip calibration” behavior.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/config.py` around lines 1049 - 1056, The use_constant_amax flag only alters _get_amax() but does not prevent calibration; update the need_calibration() logic to return False when self.use_constant_amax is True for KV/cast modes so PTQ flows (e.g., int8_wo + fp8_cast) skip calibration as documented: modify need_calibration() to check the use_constant_amax attribute and the quantizer's _cast/kv mode (or equivalent mode flag) and short-circuit calibration for those instances, ensuring existing non-KV quantizers keep their original behavior.modelopt/torch/quantization/config.py-1000-1008 (1)
1000-1008:⚠️ Potential issue | 🟠 Major
rotate={"enable": True}is still rejected.The new doc says this field accepts a
RotateConfig-shaped dict, butvalidate_config()still recursively rejects any nested dict equal to{"enable": True}before Pydantic can coerce it. Sorotate=Trueworks, while the minimal equivalent dict form currently raises.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/config.py` around lines 1000 - 1008, validate_config() is currently rejecting nested dicts like {"enable": True} for the rotate field before Pydantic coercion; update validate_config (the function named validate_config) to permit Mapping/dict values when the annotated/expected type for that field is RotateConfig (or a Union that includes RotateConfig) so Pydantic can coerce the dict into a RotateConfig instance instead of treating it as an invalid nested config. Concretely, change the early-rejection logic to detect the field's expected_type (for the rotate symbol or any field whose expected_type is RotateConfig/BaseModel or a Union containing it) and skip the recursive "reject nested dict" check for Mapping values, deferring validation/coercion to Pydantic; keep existing rejection behavior for plain dicts when the expected_type is not a model-like type.examples/llm_ptq/hf_ptq.py-923-957 (1)
923-957:⚠️ Potential issue | 🟠 Major
--recipeskips the KV-cache merge.
update_quant_cfg_with_kv_cache_quant()only runs in the non-recipe branch, so recipe-driven runs never add the*[kv]_bmm_quantizerentries. The later_set_kv_cache_constant_amax()call only tweaks existing KV settings, so--recipe ... --kv_cache_qformat fp8_castsilently produces no KV-cache quantization unless the recipe already hardcodes it.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/llm_ptq/hf_ptq.py` around lines 923 - 957, When a recipe is supplied (load_recipe -> ModelOptPTQRecipe -> quant_cfg = recipe.ptq_cfg) the code currently skips applying KV-cache quant updates, so --recipe ... --kv_cache_qformat ... does not add the kv_bmm_quantizer entries; to fix, after setting quant_cfg from the recipe compute enable_quant_kv_cache the same way as the non-recipe branch (enable_quant_kv_cache = args.kv_cache_qformat != "none"), print the Enable/Disable message, and if enable_quant_kv_cache call mtq.update_quant_cfg_with_kv_cache_quant(...) using getattr(mtq, KV_QUANT_CFG_CHOICES[args.kv_cache_qformat])["quant_cfg"] to merge KV settings into the recipe-derived quant_cfg (same update used in the non-recipe branch) so subsequent _set_kv_cache_constant_amax() can modify those entries.modelopt/torch/quantization/utils/activation_collector.py-259-269 (1)
259-269:⚠️ Potential issue | 🟠 MajorRollback the previous layer’s replay state when
forward_loop()fails.
_set_layer_states()moves layeri-1fromcollected_inputsintocached_inputsand clears the original list beforeforward_loop()runs. Ifforward_loop()then raises, theexceptblock only resets the current layer, so retrying the same layer can no longer rebuildprev.cached_inputsonce that queue was consumed.🔁 Suggested rollback shape
info = layer._seq_calib + prev_info = self._decoder_layers[layer_idx - 1]._seq_calib if layer_idx > 0 else None + prev_state = None + if prev_info is not None: + prev_state = ( + prev_info.mode, + deque(prev_info.cached_inputs), + list(prev_info.collected_inputs), + ) try: forward_loop(self.model) except Exception: # Reset the current layer so subsequent calls don't see stale state. info.mode = "original" info.collected_inputs = [] + if prev_info is not None and prev_state is not None: + prev_info.mode, prev_info.cached_inputs, prev_info.collected_inputs = prev_state raiseAlso applies to: 314-320
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/utils/activation_collector.py` around lines 259 - 269, The code moves the previous layer's collected_inputs into prev.cached_inputs and clears collected_inputs before forward_loop() but doesn't restore them if forward_loop() fails; update _set_layer_states()/the surrounding try/except so that on any exception you rollback the previous layer's state by restoring prev.mode, resetting prev.cached_inputs back to empty (or None) and repopulating prev.collected_inputs from the saved list (or restoring the original list) so retries can rebuild caches; apply the same rollback logic for the other occurrence that manipulates self._decoder_layers[..]._seq_calib (the block around the second occurrence noted at 314-320) and ensure you reference prev.collected_inputs, prev.cached_inputs and prev.mode when implementing the rollback.modelopt/torch/quantization/plugins/huggingface.py-1024-1055 (1)
1024-1055:⚠️ Potential issue | 🟠 Major
unpack_weight()leaves module half-unpacked for non-int32weight_packedtensors.The early return (line 1030) bypasses the entire cleanup logic when
weight_packedis a non-int32 tensor (e.g., vision modules in BF16). Althoughforward()correctly treats such tensors as the effective weight,unpack_weight()must still registerself.weight, delete compression attributes, and transition toFROZENstatus to complete the unpacking operation.Materialize non-int32 tensors as the decompressed weight and continue through cleanup, rather than returning early:
Suggested fix
if self.quantization_status == QuantizationStatus.COMPRESSED: compressed_data, quant_args = self._build_compressed_data() # Skip non-pack-quantized weights (e.g., vision modules stored as BF16) - if isinstance(compressed_data["weight_packed"], torch.Tensor): - if compressed_data["weight_packed"].dtype != torch.int32: - return - - decompressed = self.compressor.decompress_weight( - compressed_data=compressed_data, - quantization_args=quant_args, - ) + packed = compressed_data["weight_packed"] + if isinstance(packed, torch.Tensor) and packed.dtype != torch.int32: + decompressed = packed + else: + decompressed = self.compressor.decompress_weight( + compressed_data=compressed_data, + quantization_args=quant_args, + ) # Clear any placeholder before registering the real parameter self._parameters.pop("weight", None) self._buffers.pop("weight", None) if "weight" in self.__dict__: del self.__dict__["weight"] @@ if hasattr(self, "weight_scale"): del self.weight_scale + if hasattr(self, "weight_zero_point"): + del self.weight_zero_point if hasattr(self, "weight_shape"):🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 1024 - 1055, The unpack_weight() path currently returns early when compressed_data["weight_packed"] exists but is not torch.int32, skipping cleanup and registration; instead, replace the early return with logic that materializes the non-int32 packed tensor as the decompressed weight (e.g., assign decompressed = compressed_data["weight_packed"] and cast to a floating dtype if needed or clone/detach it), then proceed to clear placeholders (_parameters/_buffers and self.__dict__ entries), register the new nn.Parameter(decompressed, requires_grad=False) as self._parameters["weight"]/self.__dict__["weight"], delete compression attributes (weight_packed, weight_scale, weight_shape) and finally set self.quantization_status = QuantizationStatus.FROZEN so the module is fully unpacked.modelopt/torch/quantization/plugins/huggingface.py-914-953 (1)
914-953: ⚠️ Potential issue | 🟠 Major — Make the `CompressedLinear` monkey-patch re-entrant.
The boolean flag `_modelopt_init_patched` is insufficient for tracking nested contexts. When two callers overlap, the first to exit restores `CompressedLinear.__getattr__` globally, leaving the second caller's context with a broken patch mid-execution. Replace the boolean with a depth counter protected by a lock, and only restore the original method when the last active context exits.
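A hedged sketch of the depth-counter approach; `CompressedLinear` and `patched_getattr` refer to the names used in the PR, everything else here is illustrative:

```python
import threading
from contextlib import contextmanager

_patch_lock = threading.Lock()
_patch_depth = 0
_original_getattr = None


@contextmanager
def patch_compressed_linear_loading():
    """Re-entrant monkey-patch: only the outermost context restores __getattr__."""
    global _patch_depth, _original_getattr
    with _patch_lock:
        if _patch_depth == 0:
            _original_getattr = CompressedLinear.__getattr__
            CompressedLinear.__getattr__ = patched_getattr
        _patch_depth += 1
    try:
        yield
    finally:
        with _patch_lock:
            _patch_depth -= 1
            if _patch_depth == 0:
                CompressedLinear.__getattr__ = _original_getattr
```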
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 914 - 953, The patch must become re-entrant: replace the boolean CompressedLinear._modelopt_init_patched with an integer counter (e.g. _modelopt_patch_count) and a threading.Lock (e.g. _modelopt_patch_lock), acquire the lock at context entry to increment the counter and store original_getattr only when counter goes from 0→1, and on context exit acquire the lock to decrement the counter and restore/delete CompressedLinear.__getattr__ only when the counter reaches 0; keep the same patched_getattr behavior and ensure all accesses/updates to _modelopt_patch_count and original_getattr are protected by the lock.
🟡 Minor comments (9)
modelopt/torch/export/unified_export_hf.py-785-788 (1)
785-788: ⚠️ Potential issue | 🟡 Minor — Warning text is misleading about `weight_scale_2`.
At Line 785, this branch is checking gate/up quantizer `amax` mismatches; `weight_scale_2` is derived later during quantized weight export. The wording can confuse debugging.
✏️ Suggested wording fix
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/export/unified_export_hf.py` around lines 785 - 788, The warning message incorrectly mentions weight_scale_2 but this branch is detecting mismatched gate/up quantizer amax values after calling requantize_resmooth_fused_llm_layers; update the f-string that currently references "weight_scale_2" to clearly state "gate/up quantizer amax" (or similar wording) and explain that this typically means the dummy forward did not activate these experts and that element-wise max of amaxes will be taken for serving-engine fusion; modify the string where variable synced is used so the message accurately reflects the checked quantity and avoids referencing weight_scale_2 which is computed later.modelopt/torch/sparsity/attention_sparsity/methods/flash_skip_softmax.py-259-262 (1)
259-262:⚠️ Potential issue | 🟡 MinorDon’t floor sparse-block totals before summarizing.
batch_size * num_headsmakess * total_blocksfractional in the common case. Truncating it here biasesaverage_sparsitydownward inSparseAttentionStatsManager.get_summary().Suggested fix
- sparse_blocks_out = [int(s * total_blocks) for s in sparsity_list] + sparse_blocks_out = [s * total_blocks for s in sparsity_list]🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/sparsity/attention_sparsity/methods/flash_skip_softmax.py` around lines 259 - 262, The code truncates per-example sparse block counts by doing sparse_blocks_out = [int(s * total_blocks) for s in sparsity_list], which biases summaries like SparseAttentionStatsManager.get_summary() and average_sparsity; change sparse_blocks_out to preserve fractional counts (e.g., sparse_blocks_out = [s * total_blocks for s in sparsity_list]) and defer any integer rounding to the final reporting/aggregation logic (or explicitly use proper rounding there) so summaries compute correct average sparsity from the true fractional block counts; update any consumers that expect ints accordingly.
tests/_test_utils/torch/diffusers_models.py-30-33 (1)
30-33: ⚠️ Potential issue | 🟡 Minor
Use narrower exception handling for optional model imports.
`except Exception` can mask unexpected runtime errors in the diffusers import path. Restrict this to the missing-symbol cases you actually want to tolerate, consistent with the pattern established elsewhere in the codebase (e.g., `unified_export_hf.py`, `diffusers_utils.py`).
Suggested narrowing
 try:
     from diffusers.models.transformers import Flux2Transformer2DModel
-except Exception:  # pragma: no cover - optional diffusers models
+except (ImportError, AttributeError):  # pragma: no cover - optional diffusers models
     Flux2Transformer2DModel = None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/_test_utils/torch/diffusers_models.py` around lines 30 - 33, The broad except Exception around the optional import of Flux2Transformer2DModel should be narrowed to only the expected "missing symbol" errors; change the except clause that currently swallows all exceptions for the statement "from diffusers.models.transformers import Flux2Transformer2DModel" to catch only ImportError (and optionally AttributeError) so genuine runtime errors aren't masked, while preserving the fallback assignment Flux2Transformer2DModel = None and the pragma comment.
modelopt_recipes/general/ptq/nvfp4_omlp_only-fp8_kv.yml-18-18 (1)
18-18: ⚠️ Potential issue | 🟡 Minor
Fix the description to match the actual quantizer scope.
Line 18 says this recipe covers "all linear layers," but the config only re-enables `*mlp*`, `*block_sparse_moe*`, and `*o_proj*` quantizers after `default.enable: false`. That will mislead users choosing between the default, MLP-only, and OMLP-only presets.
Suggested wording
-description: NVFP4 static weight and dynamic activation for all linear layers including output projections, FP8 KV cache, max calibration.
+description: NVFP4 static weight and dynamic activation for MLP, MoE, and output projection linear layers only, FP8 KV cache, max calibration.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt_recipes/general/ptq/nvfp4_omlp_only-fp8_kv.yml` at line 18, Update the description string to accurately reflect the quantizer scope: instead of saying "all linear layers," state that this recipe enables static weight and dynamic activation quantization only for MLP layers, block-sparse MoE layers, and output projections (re-enabled via default.enable: false then re-enabling "*mlp*", "*block_sparse_moe*", and "*o_proj*"), and mention FP8 KV cache and max calibration as before so the description matches the actual config.
modelopt_recipes/general/ptq/nvfp4_default-fp8_kv.yml-18-35 (1)
18-35: ⚠️ Potential issue | 🟡 Minor
Description doesn't match the quantizers this recipe enables.
`*input_quantizer` is enabled globally at Lines 29-35, so this is not a weight-only W4A16 recipe for matching layers. Either update the description or disable activation quantization here.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt_recipes/general/ptq/nvfp4_default-fp8_kv.yml` around lines 18 - 35, The recipe's description claims "weight only (W4A16)" but the ptq_cfg enables '*input_quantizer' (activation quantization) alongside '*weight_quantizer', so update the recipe to match intent: either remove or set enable: false for '*input_quantizer' in ptq_cfg to keep it weight-only, or change the top-level description to reflect that both weight and activation quantizers are enabled; locate the entries named '*input_quantizer' and '*weight_quantizer' in the ptq_cfg and make the corresponding enable flag or description change accordingly.
modelopt_recipes/general/ptq/nvfp4_mlp_only-fp8_kv.yml-18-49 (1)
18-49: ⚠️ Potential issue | 🟡 Minor
Description says "all linear layers", but the matchers are MLP/MoE-only.
The only enabled quantizers here are `*mlp*...` and `*block_sparse_moe*...` at Lines 22-49. Please align the description with the actual scope, or widen the matcher list.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt_recipes/general/ptq/nvfp4_mlp_only-fp8_kv.yml` around lines 18 - 49, The description claims "all linear layers" but the config only enables quantizers for the '*mlp*weight_quantizer', '*mlp*input_quantizer', '*block_sparse_moe*weight_quantizer', and '*block_sparse_moe*input_quantizer'; update either the description to reflect MLP/MoE-only scope or expand the matcher list to include other linear layer patterns (e.g., add entries like '*linear*weight_quantizer' and '*linear*input_quantizer' or '*dense*...' as appropriate) so the stated scope matches the actual enabled quantizers.
tests/gpu/torch/quantization/test_sequential_calibrate.py-25-331 (1)
25-331:⚠️ Potential issue | 🟡 MinorMove these CPU-only checks into
tests/unit.Nothing in this module allocates CUDA tensors or touches GPU-only kernels, so keeping it under
tests/gpupushes the new coverage into the wrong CI lane. As per coding guidelines, "tests/**/*.py: Write tests using pytest for all new features and examples; organize tests intotests/unit(fast CPU-based),tests/gpu(fast GPU-based),tests/gpu_megatron(Megatron-Core),tests/gpu_trtllm(TensorRT-LLM), andtests/examples(integration tests)".🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/gpu/torch/quantization/test_sequential_calibrate.py` around lines 25 - 331, This test module contains only CPU-based tests (exercises sequential_calibrate, LayerActivationCollector and functions like test_seq_calib_func_called_per_layer, test_mode_transitions_across_calibration_steps, etc.) but is placed under the GPU-only test suite; move the test module out of the GPU CI lane into the CPU/unit test suite so it runs as a fast CPU test. Concretely: relocate the file to the project’s unit-tests area (so it is discovered under unit tests rather than GPU), ensure no CUDA-specific fixtures or markers remain, and update any CI/test-suite configuration that referenced this file so the tests exercising sequential_calibrate and LayerActivationCollector run in the CPU/unit pipeline. Ensure imports and monkeypatch usage remain unchanged after the move.modelopt/recipe/config.py-36-39 (1)
36-39:⚠️ Potential issue | 🟡 MinorFix the
ModelOptRecipeBasedocstring content.Line 38-Line 39 describe output-layer replacement behavior that is unrelated to recipe configuration and is misleading.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/recipe/config.py` around lines 36 - 39, The docstring for ModelOptRecipeBase incorrectly describes output-layer replacement behavior; update the ModelOptRecipeBase class docstring to accurately summarize purpose and usage of the recipe base (e.g., what configuration fields it provides, intended subclassing/validation behavior) and remove or relocate the misleading sentence about matching "*output_layer*" and replacing attributes with {"enable": False}; ensure the docstring references ModelOptRecipeBase and describes its role in model optimization recipes rather than unrelated layer-replacement behavior.modelopt/recipe/loader.py-30-49 (1)
30-49:⚠️ Potential issue | 🟡 MinorAbsolute
Pathinputs skip suffix probing.
load_recipe(Path("/tmp/foo"))returns the path unchanged, so/tmp/foo.ymland/tmp/foo.yamlare never discovered. The docstring says omitted suffixes are probed for absolute and relative filesystem paths, sostrandPathinputs currently behave differently.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/recipe/loader.py` around lines 30 - 49, The function _resolve_recipe_path treats absolute Path inputs as already-resolved and skips suffix probing, so Path("/tmp/foo") won't find /tmp/foo.yml; change the early conditional to treat both str and Path uniformly (convert recipe_path to rp_str for both absolute and relative Paths) and run the existing BUILTIN_RECIPES_LIB and filesystem suffix probing logic (using candidate and fs_candidate) for those types, while preserving the Traversable passthrough; if no probe matches, return Path(rp_str) as before.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: c329587c-7f30-4de1-8dce-1b051d78451d
📥 Commits
Reviewing files that changed from the base of the PR and between c92db19 and 5ee6bfb5453b5a9b2dd7c8b235b899661b3841f4.
📒 Files selected for processing (80)
.github/workflows/unit_tests.yml
CHANGELOG.rst
docs/source/getting_started/_installation_for_Linux.rst
examples/diffusers/quantization/models_utils.py
examples/diffusers/quantization/utils.py
examples/llm_ptq/example_utils.py
examples/llm_ptq/hf_ptq.py
examples/llm_ptq/multinode_ptq.py
examples/llm_sparsity/attention_sparsity/hf_sa.py
examples/pruning/rank_layer_importance.py
experimental/conv/README.md
experimental/conv/bench_implicit_gemm.py
experimental/conv/implicit_gemm_binding.cpp
experimental/conv/implicit_gemm_cuda.py
experimental/conv/implicit_gemm_kernel.cu
experimental/conv/test_implicit_gemm.py
modelopt/onnx/autocast/graphsanitizer.py
modelopt/onnx/autocast/precisionconverter.py
modelopt/onnx/autocast/utils.py
modelopt/onnx/export/fp8_exporter.py
modelopt/onnx/export/nvfp4_exporter.py
modelopt/onnx/utils.py
modelopt/recipe/__init__.py
modelopt/recipe/_config_loader.py
modelopt/recipe/config.py
modelopt/recipe/loader.py
modelopt/torch/_deploy/utils/torch_onnx.py
modelopt/torch/export/diffusers_utils.py
modelopt/torch/export/layer_utils.py
modelopt/torch/export/quant_utils.py
modelopt/torch/export/unified_export_hf.py
modelopt/torch/opt/config.py
modelopt/torch/quantization/config.py
modelopt/torch/quantization/model_calib.py
modelopt/torch/quantization/nn/functional.py
modelopt/torch/quantization/nn/modules/tensor_quantizer.py
modelopt/torch/quantization/plugins/diffusion/diffusers.py
modelopt/torch/quantization/plugins/huggingface.py
modelopt/torch/quantization/utils/__init__.py
modelopt/torch/quantization/utils/activation_collector.py
modelopt/torch/quantization/utils/core_utils.py
modelopt/torch/sparsity/attention_sparsity/calibration/calibrator.py
modelopt/torch/sparsity/attention_sparsity/config.py
modelopt/torch/sparsity/attention_sparsity/methods/flash_skip_softmax.py
modelopt/torch/sparsity/attention_sparsity/model_sparsify.py
modelopt/torch/sparsity/attention_sparsity/stats_manager.py
modelopt/torch/utils/network.py
modelopt_recipes/__init__.py
modelopt_recipes/configs/README
modelopt_recipes/general/README
modelopt_recipes/general/ptq/fp8_default-fp8_kv.yml
modelopt_recipes/general/ptq/nvfp4_default-fp8_kv.yml
modelopt_recipes/general/ptq/nvfp4_mlp_only-fp8_kv.yml
modelopt_recipes/general/ptq/nvfp4_omlp_only-fp8_kv.yml
pyproject.toml
tests/_test_utils/torch/diffusers_models.py
tests/_test_utils/torch/quantization/quantize_common.py
tests/_test_utils/torch/quantization/tensor_quantizer_common.py
tests/_test_utils/torch/sparsity/sparse_attention_common.py
tests/examples/llm_ptq/test_llm_ptq.py
tests/gpu/torch/export/test_export.py
tests/gpu/torch/quantization/test_hadamard.py
tests/gpu/torch/quantization/test_sequential_calibrate.py
tests/gpu/torch/sparsity/attention_sparsity/test_calibration_gpu.py
tests/gpu/torch/sparsity/attention_sparsity/test_integration_gpu.py
tests/unit/onnx/autocast/test_precisionconverter.py
tests/unit/recipe/__init__.py
tests/unit/recipe/test_loader.py
tests/unit/torch/export/test_export_diffusers.py
tests/unit/torch/quantization/plugins/test_huggingface.py
tests/unit/torch/quantization/test_calib.py
tests/unit/torch/quantization/test_sequential_calibrate.py
tests/unit/torch/quantization/test_utils.py
tests/unit/torch/sparsity/attention_sparsity/test_flash_skip_softmax.py
tests/unit/torch/sparsity/attention_sparsity/test_sparse_attention_calibration.py
tests/unit/torch/sparsity/attention_sparsity/test_sparse_attention_config.py
tests/unit/torch/sparsity/attention_sparsity/test_sparse_attention_conversion.py
tests/unit/torch/sparsity/attention_sparsity/test_stats_manager.py
tests/unit/torch/sparsity/attention_sparsity/test_threshold_info.py
tox.ini
💤 Files with no reviewable changes (1)
- modelopt/torch/utils/network.py
✅ Files skipped from review due to trivial changes (4)
- modelopt_recipes/general/README
- experimental/conv/README.md
- tests/unit/recipe/__init__.py
- modelopt_recipes/configs/README
# CASE B: Expert (stay compressed, fix metadata)
elif f"{name}.weight_shape" in checkpoint_weights:
    ws = checkpoint_weights[f"{name}.weight_shape"]
    if f"{name}.weight_packed" in checkpoint_weights:
        module.weight_packed = checkpoint_weights[f"{name}.weight_packed"].to(
            torch.int32
        )
        module._parameters.pop("weight", None)
        module._buffers.pop("weight", None)
        module.__dict__.pop("weight", None)
        shape_param = torch.nn.Parameter(ws.to(torch.int32), requires_grad=False)
        module._parameters.pop("weight_shape", None)
        module.__dict__.pop("weight_shape", None)
        module._parameters["weight_shape"] = shape_param
        module.__dict__["weight_shape"] = shape_param
🧩 Analysis chain
🏁 Script executed:
find . -name "example_utils.py" -path "*/llm_ptq/*" | head -20
Repository: NVIDIA/Model-Optimizer
Length of output: 101
🏁 Script executed:
cat -n examples/llm_ptq/example_utils.py | sed -n '490,530p'
Repository: NVIDIA/Model-Optimizer
Length of output: 2305
🏁 Script executed:
cat -n examples/llm_ptq/example_utils.py | sed -n '480,550p'
Repository: NVIDIA/Model-Optimizer
Length of output: 3534
Move restored expert tensors onto target_device.
The BF16 branch at line 490 uses .to(target_device), but the expert branch at lines 504–506 and 510 only applies dtype conversion without specifying the device. This causes weight_packed and weight_shape to remain on CPU, creating device mismatches when expert modules execute on GPU.
Suggested fix
elif f"{name}.weight_shape" in checkpoint_weights:
ws = checkpoint_weights[f"{name}.weight_shape"]
if f"{name}.weight_packed" in checkpoint_weights:
module.weight_packed = checkpoint_weights[f"{name}.weight_packed"].to(
- torch.int32
+ device=target_device, dtype=torch.int32
)
module._parameters.pop("weight", None)
module._buffers.pop("weight", None)
module.__dict__.pop("weight", None)
- shape_param = torch.nn.Parameter(ws.to(torch.int32), requires_grad=False)
+ shape_param = torch.nn.Parameter(
+ ws.to(device=target_device, dtype=torch.int32), requires_grad=False
+    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/llm_ptq/example_utils.py` around lines 500 - 514, The expert branch
fails to move restored tensors to the target device: when setting
module.weight_packed and weight_shape you currently only convert dtype
(torch.int32) but not device, causing CPU/GPU mismatches; update the assignments
that create module.weight_packed and shape_param so they are sent to
target_device as well (e.g., use .to(device=target_device, dtype=torch.int32) or
.to(target_device) after dtype conversion) and then store those device-aware
tensors into module._parameters and module.__dict__ (affecting the symbols
module.weight_packed and weight_shape).
cuda_mod = _get_cuda_module()

if x.ndim != 5 or w.ndim != 5:
    raise ValueError(f"Expected 5D tensors, got x.ndim={x.ndim}, w.ndim={w.ndim}")
n_batch, cin, d, h, w_in = x.shape
cout, cin_w, kd, kh, kw = w.shape
if cin_w != cin:
    raise ValueError(
        f"Grouped convolution is not supported (x has {cin} input channels, "
        f"w has {cin_w}). This kernel requires groups=1."
    )

sd, sh, sw = _triple(stride)
dd, dh, dw = _triple(dilation)
pad_wl, pad_wr, pad_hl, pad_hr, pad_dl, pad_dr = _pad6(padding)

x_pad = F.pad(x, (pad_wl, pad_wr, pad_hl, pad_hr, pad_dl, pad_dr))
dp = d + pad_dl + pad_dr
hp = h + pad_hl + pad_hr
wp = w_in + pad_wl + pad_wr

od = (dp - (dd * (kd - 1) + 1)) // sd + 1
oh = (hp - (dh * (kh - 1) + 1)) // sh + 1
ow = (wp - (dw * (kw - 1) + 1)) // sw + 1

m = n_batch * od * oh * ow
k = cin * kd * kh * kw

w_flat = w.reshape(cout, k).transpose(0, 1).contiguous()

x_pad = x_pad.float().contiguous()
w_flat = w_flat.float().contiguous()

has_bias = bias is not None
bias_t = bias.float().contiguous() if has_bias else torch.empty(0, device=x.device)  # type: ignore[union-attr]

if quant_act and act_amax is None:
    raise ValueError("act_amax is required when quant_act=True")

do_quant = quant_act
amax_t = act_amax.float().contiguous() if do_quant else torch.empty(0, device=x.device)  # type: ignore[union-attr]

y_flat = cuda_mod.conv3d_implicit_gemm_cuda(
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "Current conv3d wrapper validation:"
sed -n '123,165p' experimental/conv/implicit_gemm_cuda.py
echo
echo "Existing device/scalar guards in this file:"
rg -n '\.is_cuda|device\s*!=|numel\(\)\s*!=\s*1' experimental/conv/implicit_gemm_cuda.py || true
Repository: NVIDIA/Model-Optimizer
Length of output: 1674
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Get the full function definition including its signature
echo "=== Full conv3d_implicit_gemm_cuda function ==="
sed -n '100,200p' experimental/conv/implicit_gemm_cuda.py
echo
echo "=== Check for any device/is_cuda validation elsewhere in the file ==="
rg -n 'is_cuda|\.device|assert' experimental/conv/implicit_gemm_cuda.py | head -30
echo
echo "=== Check fp4_fake_quant function mentioned in scratchpad ==="
rg -n -A 20 'def fp4_fake_quant' experimental/conv/implicit_gemm_cuda.py
Repository: NVIDIA/Model-Optimizer
Length of output: 4311
Add device and scalar validation before CUDA kernel dispatch.
The function casts x, w, bias, and act_amax to float with .float().contiguous(), but this does not move CPU tensors to CUDA. When these potentially mixed-device or CPU tensors are passed directly to cuda_mod.conv3d_implicit_gemm_cuda(...), the raw CUDA kernel receives invalid pointers, causing undefined behavior or crashes. Additionally, act_amax is never validated to be a scalar (single-element tensor).
Add explicit checks:
- `x` and `w` must be CUDA tensors
- All of `x`, `w`, `bias` (if provided), and `act_amax` (if provided) must be on the same device
- `act_amax` must be a scalar tensor
Suggested fix
cuda_mod = _get_cuda_module()
if x.ndim != 5 or w.ndim != 5:
raise ValueError(f"Expected 5D tensors, got x.ndim={x.ndim}, w.ndim={w.ndim}")
+ if not x.is_cuda or not w.is_cuda:
+ raise ValueError("x and w must be CUDA tensors")
+ if x.device != w.device:
+ raise ValueError(f"x and w must be on the same device, got {x.device} and {w.device}")
n_batch, cin, d, h, w_in = x.shape
cout, cin_w, kd, kh, kw = w.shape
if cin_w != cin:
raise ValueError(
f"Grouped convolution is not supported (x has {cin} input channels, "
f"w has {cin_w}). This kernel requires groups=1."
)
+ if bias is not None and bias.device != x.device:
+ raise ValueError(f"bias must be on {x.device}, got {bias.device}")
+ if act_amax is not None and act_amax.device != x.device:
+ raise ValueError(f"act_amax must be on {x.device}, got {act_amax.device}")
+ if act_amax is not None and act_amax.numel() != 1:
+ raise ValueError("act_amax must be a scalar tensor")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@experimental/conv/implicit_gemm_cuda.py` around lines 123 - 165, The CUDA
kernel can receive CPU pointers or mismatched-device tensors; before calling
cuda_mod.conv3d_implicit_gemm_cuda ensure tensors are CUDA and on the same
device: verify x.is_cuda and w.is_cuda, check that bias (if has_bias) and
act_amax (if do_quant) are on x.device, and move/raise if not; also validate
act_amax is a scalar tensor (numel()==1) when do_quant is True; apply these
checks near the preparation section before creating x_pad/w_flat and before
invoking conv3d_implicit_gemm_cuda so you avoid sending CPU pointers to the
kernel.
torch::Tensor fp4_fake_quant_cuda(torch::Tensor x, torch::Tensor global_amax, int block_size) {
  // x: [num_blocks, block_size] or flat [N] where N % block_size == 0
  auto x_flat = x.contiguous().view({-1});
  int N = x_flat.numel();
  int num_blocks = N / block_size;

  auto y = torch::empty_like(x_flat);

  // Launch: one warp (32 threads) per block
  int threads_per_block = 256;  // 8 warps per CUDA block
  int warps_per_block = threads_per_block / 32;
  int num_cuda_blocks = (num_blocks + warps_per_block - 1) / warps_per_block;

  fp4_fake_quant_kernel<<<num_cuda_blocks, threads_per_block>>>(
      x_flat.data_ptr<float>(), y.data_ptr<float>(), global_amax.data_ptr<float>(), num_blocks,
      block_size);

  return y.view_as(x);
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "Host-side launch code:"
sed -n '517,627p' experimental/conv/implicit_gemm_kernel.cu
echo
echo "Existing device/stream guards or launch checks in this file:"
rg -n 'CUDAGuard|getCurrentCUDAStream|cudaGetLastError|TORCH_CHECK\(.*cuda' experimental/conv/implicit_gemm_kernel.cu || true
Repository: NVIDIA/Model-Optimizer
Length of output: 7558
Guard kernel launches with device and stream semantics.
Both fp4_fake_quant_cuda (line 517) and conv3d_implicit_gemm_cuda (line 541) use bare <<<>>> launches that ignore the tensor's device and PyTorch's current stream. When tensors on alternate devices (e.g., cuda:1) are passed, the kernel launches on the current device with the default stream instead, causing device pointer mismatches and reordering bugs with surrounding operations.
Wrap each launch sequence with c10::cuda::CUDAGuard to ensure the correct device, use at::cuda::getCurrentCUDAStream() for kernel launch, and check cudaFuncSetAttribute and launch errors with cudaGetLastError().
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@experimental/conv/implicit_gemm_kernel.cu` around lines 517 - 534, Wrap the
raw kernel launches in fp4_fake_quant_cuda and conv3d_implicit_gemm_cuda with
proper device and stream scoping: create a c10::cuda::CUDAGuard using the input
tensor device (e.g. x.device()) and get the current stream via
at::cuda::getCurrentCUDAStream(); then launch the kernel using the stream
parameter (kernel<<<grid, block, 0, stream.stream()>>>(...)). After the launch
check cudaGetLastError() and propagate/report any error; if your kernel requires
specific attributes, set them via cudaFuncSetAttribute before launching. Ensure
you apply this pattern to both fp4_fake_quant_cuda and conv3d_implicit_gemm_cuda
so launches use the tensor device and PyTorch stream.
# Cast to high precision -> cast to low precision, first cast has no impact and can be safely removed
# Cast to low precision -> cast to high precision affects precision and should not be removed
precision_order = [
    onnx.TensorProto.DOUBLE,
    onnx.TensorProto.FLOAT,
    onnx.TensorProto.FLOAT16,
    onnx.TensorProto.BFLOAT16,
]
consumers = [n for n in get_consumer_nodes(model, node.output[0]) if n.op_type == "Cast"]

# If the first cast has additional consumers, we should not remove it
if len(consumers) != 1:
    return False

next_node = consumers[0]
first_cast_type = output_type
second_cast_type = get_cast_to_type(next_node)

return (
    first_cast_type in precision_order
    and second_cast_type in precision_order
    and precision_order.index(first_cast_type) <= precision_order.index(second_cast_type)
)
🧩 Analysis chain
🏁 Script executed:
# First, check if the file exists and get context around the specified lines
cat -n modelopt/onnx/utils.py | sed -n '1320,1380p'
Repository: NVIDIA/Model-Optimizer
Length of output: 2899
🏁 Script executed:
# Also look at the broader context to understand the _is_sequential_cast function
cat -n modelopt/onnx/utils.py | sed -n '1300,1400p'
Repository: NVIDIA/Model-Optimizer
Length of output: 5013
🏁 Script executed:
# Search for _is_sequential_cast function definition
rg -n "_is_sequential_cast" modelopt/onnx/utils.py -B 5 -A 30Repository: NVIDIA/Model-Optimizer
Length of output: 3559
🏁 Script executed:
# Also check for any references to source types or input handling
rg -n "get_node_input_type\|node_input_type\|input_type" modelopt/onnx/utils.py -B 2 -A 2Repository: NVIDIA/Model-Optimizer
Length of output: 48
🏁 Script executed:
# Get input type information - check if _is_sequential_cast looks at source type
cat -n modelopt/onnx/utils.py | sed -n '1337,1365p' | grep -E "input_type|get_.*type|node.input"
Repository: NVIDIA/Model-Optimizer
Length of output: 272
🏁 Script executed:
# Check how _is_sequential_cast is called and what _bypass_cast_node does
cat -n modelopt/onnx/utils.py | sed -n '1456,1470p'
Repository: NVIDIA/Model-Optimizer
Length of output: 914
🏁 Script executed:
# Look for any tests or examples that use _is_sequential_cast
fd "test" -e py | head -20 | xargs grep -l "_is_sequential_cast" 2>/dev/null || echo "No tests found for this function"Repository: NVIDIA/Model-Optimizer
Length of output: 98
🏁 Script executed:
# Check ONNX Cast operation semantics - look at any documentation or references
rg -n "FLOAT16\|BFLOAT16\|precision" modelopt/onnx/utils.py -B 1 -A 1 | head -40Repository: NVIDIA/Model-Optimizer
Length of output: 48
🌐 Web query:
ONNX Cast operation FLOAT16 BFLOAT16 precision rounding semantics
💡 Result:
- ONNX `Cast` does not define a configurable rounding mode for casts to `FLOAT16` or `BFLOAT16`. The only rounding controls standardized in the `Cast` operator today are the `round_mode` (and `saturate`) attributes, and they apply only when casting to `FLOAT8E8M0` (`saturate` applies to float8 conversions more broadly). For other target types (including `FLOAT16`/`BFLOAT16`), the spec does not spell out rounding semantics in the operator definition. [1]
- What most implementations do in practice:
  - `FLOAT16` (IEEE binary16): conversion from `FLOAT`/`DOUBLE` is typically the platform's IEEE-754 narrowing conversion (commonly round-to-nearest, ties-to-even (RNE)).
  - `BFLOAT16`: many ecosystems also use RNE for FP32→BF16. For example:
    - The ONNX Python helper used to build tensors documents FP32→BF16 conversion as "rounds-to-nearest-even" by default (with an optional truncation mode). [2]
    - The OpenCL `cl_intel_bfloat16_conversions` extension states the only supported rounding mode is round-to-nearest-even. [3]
    - NVIDIA's CUDA bfloat16 conversion APIs describe constructors/casts using the default round-to-nearest-even behavior. [4]

Implication: if you need bit-exact, cross-backend reproducibility for FP32→(FP16/BF16) casting, don't rely on ONNX Cast alone to mandate rounding; validate the target runtime/hardware behavior (or emulate rounding explicitly before/after casting). [1]

Sources: ONNX Cast operator docs [1], ONNX helper BF16 conversion implementation/docs [2], OpenCL Intel BF16 conversions extension [3], NVIDIA CUDA BF16 conversion docs [4].
Source type needed to determine cast removability—FLOAT16 and BFLOAT16 are incomparable formats.
_is_sequential_cast only inspects the output types of sequential casts but lacks the input type of the first cast. It treats FLOAT16 and BFLOAT16 as a linear precision order, but they are different formats with different exponent/mantissa distributions. A chain like FLOAT → FLOAT16 → BFLOAT16 cannot safely have the first cast removed, since removing it changes the rounding path and result. The function must fetch the source type of the first cast and handle FLOAT16/BFLOAT16 conversions as non-removable to prevent silent numeric corruption.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/onnx/utils.py` around lines 1341 - 1363, The current
_is_sequential_cast logic uses only the first cast's output_type and treats
FLOAT16 and BFLOAT16 as ordered; change it to fetch the source/input type of the
first Cast node (e.g., inspect the first Cast node's input value/type via the
node.input[0] and model value_info/graph input/initializer) instead of relying
on output_type, then compare that source_type (first_cast_src) to the second
cast type from get_cast_to_type(next_node); if the conversion path involves
FLOAT16↔BFLOAT16 in either direction treat it as non-removable (return False),
otherwise keep the precision_order comparison but use first_cast_src as the
first element for precision_order.index() checks so FLOAT16 and BFLOAT16 are not
incorrectly ordered.
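As an illustration only (not the actual `modelopt/onnx/utils.py` helpers), a hedged sketch of a check that refuses to drop the first cast whenever the chain mixes FLOAT16 and BFLOAT16, and otherwise keeps the precision-order comparison:

```python
import onnx

# Formats that nest cleanly: each later entry is exactly representable in every earlier one.
PRECISION_ORDER = [onnx.TensorProto.DOUBLE, onnx.TensorProto.FLOAT, onnx.TensorProto.FLOAT16]
HALF_LIKE = {onnx.TensorProto.FLOAT16, onnx.TensorProto.BFLOAT16}


def first_cast_is_removable(first_cast_type: int, second_cast_type: int) -> bool:
    """Return True only when dropping the first of two back-to-back Casts cannot change values."""
    # FLOAT16 and BFLOAT16 trade mantissa for exponent differently, so a chain that mixes
    # them (e.g. FLOAT -> FLOAT16 -> BFLOAT16) is a double-rounding hazard: never removable.
    if (
        first_cast_type in HALF_LIKE
        and second_cast_type in HALF_LIKE
        and first_cast_type != second_cast_type
    ):
        return False
    if first_cast_type not in PRECISION_ORDER or second_cast_type not in PRECISION_ORDER:
        return False
    # Cast-to-high followed by cast-to-low: the first cast has no effect on the result.
    return PRECISION_ORDER.index(first_cast_type) <= PRECISION_ORDER.index(second_cast_type)
```

The caller would still need to resolve the first cast's true source type from the graph's value_info, as the prompt above suggests; this sketch only covers the FLOAT16/BFLOAT16 hazard.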
Usage::

    from modelopt.torch.quantization.plugins.huggingface import patch_compressed_linear_loading

    with patch_compressed_linear_loading():
        model = AutoModelForCausalLM.from_pretrained(
            ckpt_path,
            device_map="auto",
            trust_remote_code=True,
            torch_dtype="auto",
        )
🧩 Analysis chain
🏁 Script executed:
# Locate and examine the huggingface.py file around lines 896-906
fd -t f huggingface.py | grep -E "quantization/plugins" | head -1
Repository: NVIDIA/Model-Optimizer
Length of output: 116
🏁 Script executed:
# Check if the file exists and get its line count
wc -l modelopt/torch/quantization/plugins/huggingface.py
Repository: NVIDIA/Model-Optimizer
Length of output: 121
🏁 Script executed:
# Read the relevant section including context around lines 896-906
sed -n '870,930p' modelopt/torch/quantization/plugins/huggingface.py
Repository: NVIDIA/Model-Optimizer
Length of output: 2010
🏁 Script executed:
# Search for the patch_compressed_linear_loading function definition and full docstring
sed -n '880,950p' modelopt/torch/quantization/plugins/huggingface.py | cat -n
Repository: NVIDIA/Model-Optimizer
Length of output: 3093
Fix the docstring example: make trust_remote_code caller-controlled.
The usage example in the docstring (lines 896–906) hardcodes trust_remote_code=True, which contradicts NVIDIA's security coding practices. Since this is a copy-pasteable code snippet for end-users, it should demonstrate safe defaults. Either:
- Omit the `trust_remote_code` argument (letting transformers default to `False`), or
- Use a caller-provided variable (e.g., `trust_remote_code=trust_remote_code`)
Per SECURITY.md: "Do not hardcode trust_remote_code=True when loading Hugging Face Transformers models. Let the caller decide via a parameter; default to False."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 896 - 906,
The docstring usage example for patch_compressed_linear_loading currently
hardcodes trust_remote_code=True; update the snippet so trust_remote_code is
caller-controlled by either omitting the trust_remote_code argument (to use
transformers' default False) or replacing the literal with a variable (e.g.,
trust_remote_code=trust_remote_code) before calling
AutoModelForCausalLM.from_pretrained in the example; ensure the docstring shows
the safer pattern and mentions the variable so callers can opt-in explicitly.
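A sketch of the safer pattern the comment asks the docstring to show (the wrapper function and its defaults are illustrative, not part of the modelopt API):

```python
from transformers import AutoModelForCausalLM

from modelopt.torch.quantization.plugins.huggingface import patch_compressed_linear_loading


def load_compressed_checkpoint(ckpt_path: str, trust_remote_code: bool = False):
    """Let the caller opt in to remote code instead of hardcoding it in the example."""
    with patch_compressed_linear_loading():
        return AutoModelForCausalLM.from_pretrained(
            ckpt_path,
            device_map="auto",
            trust_remote_code=trust_remote_code,  # defaults to False, caller-controlled
            torch_dtype="auto",
        )
```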
Actionable comments posted: 1
♻️ Duplicate comments (1)
examples/megatron_bridge/rank_layer_importance.py (1)
346-354: ⚠️ Potential issue | 🟠 Major
Broken layer ID mapping when `--num_layers_in_first_pipeline_stage` is omitted.
When `args.num_layers_in_first_pipeline_stage` is `None`, the ternary evaluates to `0`, making the condition `layer_id >= offset and layer_id < offset` always `False`. This causes all layers to return `-1`, meaning no block is ever patched.
Either make `--num_layers_in_first_pipeline_stage` required or infer the per-rank layer count from the model.
Proposed fix - make the argument required
 # Uneven Pipeline Parallelism parameters
 parser.add_argument("--num_layers", type=int, required=True)
-parser.add_argument("--num_layers_in_first_pipeline_stage", type=int, default=None)
-parser.add_argument("--num_layers_in_last_pipeline_stage", type=int, default=None)
+parser.add_argument("--num_layers_in_first_pipeline_stage", type=int, required=True)
+parser.add_argument("--num_layers_in_last_pipeline_stage", type=int, required=True)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/megatron_bridge/rank_layer_importance.py` around lines 346 - 354, The condition in layer_id_in_this_rank incorrectly uses a ternary that yields 0 when args.num_layers_in_first_pipeline_stage is None, causing every check to fail; fix by computing an explicit per-rank layer count first (e.g. num_in_first = args.num_layers_in_first_pipeline_stage if args.num_layers_in_first_pipeline_stage is not None else args.num_layers // args.world_size) and then use that variable in the comparison (use layer_id >= offset and layer_id < offset + num_in_first) so the function returns layer_id - offset for correct ranks and -1 otherwise; reference layer_id_in_this_rank, args.num_layers_in_first_pipeline_stage, args.num_layers, args.world_size, and offset.
🧹 Nitpick comments (5)
examples/megatron_bridge/rank_layer_importance.py (5)
387-389: Unused `patch_register` variable in dropped layers loop.
The `patch_register` is assigned but never used since dropped layers remain patched. Consider using `_` to indicate the return value is intentionally ignored.
Proposed fix
 for layer_id in args.drop_layers:
-    patch_register = patch_model(layer_id_in_this_rank(layer_id))
+    _ = patch_model(layer_id_in_this_rank(layer_id))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/megatron_bridge/rank_layer_importance.py` around lines 387 - 389, The loop assigns patch_register from patch_model(layer_id_in_this_rank(layer_id)) but never uses it; update the loop over args.drop_layers to explicitly ignore the return value (e.g., replace the variable with "_" or remove the assignment) so it's clear the result of patch_model is intentionally unused; locate the loop using patch_register, patch_model, layer_id_in_this_rank, and args.drop_layers and change the assignment accordingly.
66-74: Inconsistent no-op behavior for Mamba layers.
`noop_mamba_forward_patch` returns the input unchanged (pass-through), while other no-op patches return zeros or clones. If this is intentional (e.g., Mamba layers require identity rather than zeroing for stability), consider adding a brief comment explaining why.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/megatron_bridge/rank_layer_importance.py` around lines 66 - 74, The noop_mamba_forward_patch function currently returns hidden_states (identity) while other no-op patches in this module return zeros or clones, causing inconsistent behavior; either make noop_mamba_forward_patch match the other no-op semantics (e.g., return zeros_like(hidden_states) or a clone) by changing its return to the same pattern, or if identity is intentional for stability, add a short explanatory comment above noop_mamba_forward_patch describing why Mamba layers must be passthrough (reference noop_mamba_forward_patch and the other no-op patch patterns) so readers know the divergence is deliberate.
406-409: Use context manager for file operations.
Same issue as `collect_scores` - the `open()` without a context manager can leak file handles.
Proposed fix
 if is_pipeline_last_stage() and get_data_parallel_rank() == 0:
     scores = collect_scores(unwrapped_model)
     assert scores is not None
-    pickle.dump(scores, open(f"scores_{get_pipeline_model_parallel_rank()}.p", "wb"))
+    with open(f"scores_{get_pipeline_model_parallel_rank()}.p", "wb") as f:
+        pickle.dump(scores, f)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/megatron_bridge/rank_layer_importance.py` around lines 406 - 409, The file write currently uses open(...) without a context manager which can leak file handles; in the block guarded by is_pipeline_last_stage() and get_data_parallel_rank() == 0 (where you call collect_scores(unwrapped_model) and assert scores is not None), replace pickle.dump(scores, open(f"scores_{get_pipeline_model_parallel_rank()}.p", "wb")) with a with open(f"scores_{get_pipeline_model_parallel_rank()}.p", "wb") as f: pickle.dump(scores, f) pattern so the file is properly closed after writing.
286-291: Use context manager for file operations and clean up debug output.
The `open()` call without a context manager can leak file handles. Also, the debug prints with raw f-string `=}` syntax are verbose for production code.
Proposed fix
-print(f"{scores=}")
-pickle.dump(scores, open("scores.p", "wb"))
+with open("scores.p", "wb") as f:
+    pickle.dump(scores, f)
 print("Layers ordered by <MSE> importance:")
-print(
-    f"{sorted([(k,v['mse'].mean()) for k,v in scores.items() if v['mse'].numel() > 0], key=lambda x:(x[1]))=}"
-)
+ranked = sorted(
+    [(k, v['mse'].mean().item()) for k, v in scores.items() if v['mse'].numel() > 0],
+    key=lambda x: x[1]
+)
+print(ranked)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/megatron_bridge/rank_layer_importance.py` around lines 286 - 291, Replace the raw debug prints and the bare open() with a context-managed file write and cleaner output: remove or simplify the prints that use the f"{var=}" debug syntax (the occurrences printing scores and the sorted MSE output), and write scores using with open("scores.p", "wb") as f: pickle.dump(scores, f). Keep a single concise log/print that summarises the result (e.g., “Saved scores to scores.p” and/or a short human-readable listing from the sorted MSE computation) instead of the verbose debug f-string; update any references to scores and the sorted MSE expression accordingly.
44-45: Deprecated `reduce` parameter in `MSELoss`.
The `reduce` parameter has been deprecated in favor of `reduction`. Using both is redundant and triggers a deprecation warning.
Proposed fix
 kl_loss = torch.nn.KLDivLoss(reduction="batchmean", log_target=True).cuda()
-mse_loss = torch.nn.MSELoss(reduce=True, reduction="mean").cuda()
+mse_loss = torch.nn.MSELoss(reduction="mean").cuda()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/megatron_bridge/rank_layer_importance.py` around lines 44 - 45, The MSELoss instantiation uses the deprecated `reduce` parameter alongside `reduction`; update the MSE loss creation (variable mse_loss using torch.nn.MSELoss) to remove the `reduce=True` argument and rely only on `reduction="mean"` and then call .cuda() as before so the line becomes a single call to torch.nn.MSELoss(reduction="mean").cuda().
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/megatron_bridge/rank_layer_importance.py`:
- Line 181: The log message passed to print_rank_0 contains a typo: change the
string in the call to print_rank_0("> Computing distances to stored refernces")
to correct "refernces" to "references" so it reads "> Computing distances to
stored references"; update the call site where print_rank_0 is invoked in
rank_layer_importance.py accordingly.
---
Duplicate comments:
In `@examples/megatron_bridge/rank_layer_importance.py`:
- Around line 346-354: The condition in layer_id_in_this_rank incorrectly uses a
ternary that yields 0 when args.num_layers_in_first_pipeline_stage is None,
causing every check to fail; fix by computing an explicit per-rank layer count
first (e.g. num_in_first = args.num_layers_in_first_pipeline_stage if
args.num_layers_in_first_pipeline_stage is not None else args.num_layers //
args.world_size) and then use that variable in the comparison (use layer_id >=
offset and layer_id < offset + num_in_first) so the function returns layer_id -
offset for correct ranks and -1 otherwise; reference layer_id_in_this_rank,
args.num_layers_in_first_pipeline_stage, args.num_layers, args.world_size, and
offset.
---
Nitpick comments:
In `@examples/megatron_bridge/rank_layer_importance.py`:
- Around line 387-389: The loop assigns patch_register from
patch_model(layer_id_in_this_rank(layer_id)) but never uses it; update the loop
over args.drop_layers to explicitly ignore the return value (e.g., replace the
variable with "_" or remove the assignment) so it's clear the result of
patch_model is intentionally unused; locate the loop using patch_register,
patch_model, layer_id_in_this_rank, and args.drop_layers and change the
assignment accordingly.
- Around line 66-74: The noop_mamba_forward_patch function currently returns
hidden_states (identity) while other no-op patches in this module return zeros
or clones, causing inconsistent behavior; either make noop_mamba_forward_patch
match the other no-op semantics (e.g., return zeros_like(hidden_states) or a
clone) by changing its return to the same pattern, or if identity is intentional
for stability, add a short explanatory comment above noop_mamba_forward_patch
describing why Mamba layers must be passthrough (reference
noop_mamba_forward_patch and the other no-op patch patterns) so readers know the
divergence is deliberate.
- Around line 406-409: The file write currently uses open(...) without a context
manager which can leak file handles; in the block guarded by
is_pipeline_last_stage() and get_data_parallel_rank() == 0 (where you call
collect_scores(unwrapped_model) and assert scores is not None), replace
pickle.dump(scores, open(f"scores_{get_pipeline_model_parallel_rank()}.p",
"wb")) with a with open(f"scores_{get_pipeline_model_parallel_rank()}.p", "wb")
as f: pickle.dump(scores, f) pattern so the file is properly closed after
writing.
- Around line 286-291: Replace the raw debug prints and the bare open() with a
context-managed file write and cleaner output: remove or simplify the prints
that use the f"{var=}" debug syntax (the occurrences printing scores and the
sorted MSE output), and write scores using with open("scores.p", "wb") as f:
pickle.dump(scores, f). Keep a single concise log/print that summarises the
result (e.g., “Saved scores to scores.p” and/or a short human-readable listing
from the sorted MSE computation) instead of the verbose debug f-string; update
any references to scores and the sorted MSE expression accordingly.
- Around line 44-45: The MSELoss instantiation uses the deprecated `reduce`
parameter alongside `reduction`; update the MSE loss creation (variable mse_loss
using torch.nn.MSELoss) to remove the `reduce=True` argument and rely only on
`reduction="mean"` and then call .cuda() as before so the line becomes a single
call to torch.nn.MSELoss(reduction="mean").cuda().
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 9ddfa505-ef7f-4d1e-be6d-1fbf70fe9b56
📥 Commits
Reviewing files that changed from the base of the PR and between 5ee6bfb5453b5a9b2dd7c8b235b899661b3841f4 and e5904e5b813b684dad923abe0cbba6381c469527.
📒 Files selected for processing (1)
examples/megatron_bridge/rank_layer_importance.py
e5904e5 to 7e2e85a
Actionable comments posted: 1
♻️ Duplicate comments (2)
examples/megatron_bridge/rank_layer_importance.py (2)
179-179: ⚠️ Potential issue | 🟡 Minor
Typo in log message is still present.
Line 179 has "refernces"; it should be "references".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/megatron_bridge/rank_layer_importance.py` at line 179, Fix the typo in the log message passed to print_rank_0 in rank_layer_importance.py: change the string "refernces" to "references" so the call print_rank_0("> Computing distances to stored refernces") becomes print_rank_0("> Computing distances to stored references"); no other changes required.
297-306: ⚠️ Potential issue | 🔴 Critical
Layer-to-rank mapping is incorrect for uneven/multi-stage PP and can skip or mis-patch layers.
Line 303's assertion is mathematically invalid for `pp_size > 2`, and `layer_id_in_this_rank()` (Line 362 onward) assumes every rank owns `num_layers_in_first_pipeline_stage` layers. This causes wrong global→local mapping and potential out-of-bounds patching on uneven topologies.
🐛 Proposed fix (compute per-rank spans, then map with start/end)
 def estimate_layer_importance(args: argparse.Namespace):
     pp_size = dist.size()
-    if args.num_layers_in_first_pipeline_stage is None:
-        args.num_layers_in_first_pipeline_stage = args.num_layers // pp_size
-    if args.num_layers_in_last_pipeline_stage is None:
-        args.num_layers_in_last_pipeline_stage = args.num_layers - (
-            args.num_layers_in_first_pipeline_stage * (pp_size - 1)
-        )
-    assert (
-        args.num_layers_in_first_pipeline_stage + args.num_layers_in_last_pipeline_stage
-        == args.num_layers
-    ), "Number of layers in first and last pipeline stages must sum to the total number of layers"
+    if args.num_layers_in_first_pipeline_stage is None and args.num_layers_in_last_pipeline_stage is None:
+        # Balanced fallback when topology flags are omitted.
+        base, rem = divmod(args.num_layers, pp_size)
+        pp_layer_counts = [base + (1 if r < rem else 0) for r in range(pp_size)]
+    else:
+        if (
+            args.num_layers_in_first_pipeline_stage is None
+            or args.num_layers_in_last_pipeline_stage is None
+        ):
+            raise ValueError(
+                "Provide both --num_layers_in_first_pipeline_stage and "
+                "--num_layers_in_last_pipeline_stage, or omit both."
+            )
+        first = args.num_layers_in_first_pipeline_stage
+        last = args.num_layers_in_last_pipeline_stage
+        if pp_size == 1:
+            if args.num_layers != first:
+                raise ValueError("For pp_size=1, num_layers must equal first-stage layer count.")
+            pp_layer_counts = [args.num_layers]
+        elif pp_size == 2:
+            if first + last != args.num_layers:
+                raise ValueError("For pp_size=2, first+last must equal --num_layers.")
+            pp_layer_counts = [first, last]
+        else:
+            middle_total = args.num_layers - first - last
+            if middle_total < 0 or middle_total % (pp_size - 2) != 0:
+                raise ValueError(
+                    "Invalid topology: remaining layers must be non-negative and divisible by middle stages."
+                )
+            middle = middle_total // (pp_size - 2)
+            pp_layer_counts = [first] + [middle] * (pp_size - 2) + [last]
     print_rank_0(f"Setting pipeline_model_parallel_size to {pp_size}")
 @@
     pp_rank = get_pipeline_model_parallel_rank()
-    offset = (
-        pp_rank * args.num_layers_in_first_pipeline_stage
-        if args.num_layers_in_first_pipeline_stage
-        else 0
-    )
+    offset = sum(pp_layer_counts[:pp_rank])
+    local_layer_count = pp_layer_counts[pp_rank]
 @@
     def layer_id_in_this_rank(layer_id):
-        if (
-            layer_id >= offset and layer_id < offset + args.num_layers_in_first_pipeline_stage
-            if args.num_layers_in_first_pipeline_stage
-            else 0
-        ):
+        if offset <= layer_id < offset + local_layer_count:
             return layer_id - offset
         else:
             return -1

Also applies to: 331-335, 362-370
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/megatron_bridge/rank_layer_importance.py` around lines 297 - 306, The current logic assumes only two-stage pipeline and that every rank has args.num_layers_in_first_pipeline_stage, which breaks for pp_size > 2 and causes layer_id_in_this_rank() to map incorrectly; replace the simplistic computation and assertion by computing per-rank layer spans (e.g., build an array of start/end layer indices for each pipeline rank using args.num_layers, pp_size and any per-stage defaults), then change layer_id_in_this_rank() to map a global layer id to a rank-local id by finding the rank whose span contains that layer and subtracting that span's start to get the local index; remove the invalid sum assertion and instead validate that the union of all per-rank spans covers 0..args.num_layers-1 without overlap or gaps.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/megatron_bridge/rank_layer_importance.py`:
- Around line 129-131: The code stores full-sequence activations in
self.reference_hidden (via lines like reference_hidden.append(hidden_out)) which
can OOM and ignores nlast_tokens; change each append to only keep the last
nlast_tokens frames (or at least the final token if nlast_tokens<=0) and offload
them to CPU/detached tensors before storing. Concretely, where you currently
append hidden_out (in __init__ flow and the other blocks around the 147-152 and
155-168 regions), replace it with something like: slice = hidden_out[:,
-nlast_tokens:, ...] if nlast_tokens>0 else hidden_out[:, -1:, ...]; then call
slice = slice.detach().cpu() and append that to self.reference_hidden so you
retain only the tail tokens on CPU and avoid keeping full-GPU histories. Ensure
any later code that reads self.reference_hidden expects the shortened sequence
length.
---
Duplicate comments:
In `@examples/megatron_bridge/rank_layer_importance.py`:
- Line 179: Fix the typo in the log message passed to print_rank_0 in
rank_layer_importance.py: change the string "refernces" to "references" so the
call print_rank_0("> Computing distances to stored refernces") becomes
print_rank_0("> Computing distances to stored references"); no other changes
required.
- Around line 297-306: The current logic assumes only two-stage pipeline and
that every rank has args.num_layers_in_first_pipeline_stage, which breaks for
pp_size > 2 and causes layer_id_in_this_rank() to map incorrectly; replace the
simplistic computation and assertion by computing per-rank layer spans (e.g.,
build an array of start/end layer indices for each pipeline rank using
args.num_layers, pp_size and any per-stage defaults), then change
layer_id_in_this_rank() to map a global layer id to a rank-local id by finding
the rank whose span contains that layer and subtracting that span's start to get
the local index; remove the invalid sum assertion and instead validate that the
union of all per-rank spans covers 0..args.num_layers-1 without overlap or gaps.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 5fdf2490-afcc-49ae-8b1b-2ff301b67d16
📥 Commits
Reviewing files that changed from the base of the PR and between e5904e5b813b684dad923abe0cbba6381c469527 and b17ccac.
📒 Files selected for processing (1)
examples/megatron_bridge/rank_layer_importance.py
… available in safety (#1067)

### What does this PR do?
**Type of change**: Bug fix
**Overview**: Remote autotuning via `--autotune` is only available with `--safe` enabled. See `trtexec --help`:
```sh
--remoteAutoTuningConfig  Set the remote auto tuning config. Must be specified with --safe.
                          Format: protocol://username[:password]@hostname[:port]?param1=value1&param2=value2
                          Example: ssh://user:pass@192.0.2.100:22?remote_exec_path=/opt/tensorrt/bin&remote_lib_path=/opt/tensorrt/lib
```
### Usage
```python
$ python -m modelopt.onnx.quantization.autotune \
    --onnx_path resnet50_Opset17_bs128.onnx \
    --use_trtexec \
    --trtexec_benchmark_args "--remoteAutoTuningConfig=\"<remote autotuning config>\" --safe"
```
### Testing
See `examples/onnx_ptq/autotune`.
### Before your PR is "*Ready for review*"
Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`). Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit
* **Bug Fixes**
  * Remote autotuning now automatically enforces safety mode by including the --safe flag if not already present, with informative warning messages when the flag is automatically added.
* **Documentation**
  * Updated remote autotuning guide to clarify that safety mode (--safe flag) must be enabled through the trtexec benchmark arguments for proper configuration.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
Signed-off-by: mchochowski <mchochowski@nvidia.com>
Signed-off-by: mchochowski <mchochowski@nvidia.com>
Signed-off-by: mchochowski <mchochowski@nvidia.com>
Signed-off-by: mchochowski <mchochowski@nvidia.com>
c24c68a to 7f43534
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1016      +/-   ##
==========================================
+ Coverage   70.18%   70.22%   +0.03%
==========================================
  Files         228      228
  Lines       25952    25952
==========================================
+ Hits        18215    18224       +9
+ Misses       7737     7728       -9
☔ View full report in Codecov by Sentry.
…timizer into depth_importance_example
Signed-off-by: mchochowski <mchochowski@nvidia.com>
Signed-off-by: mchochowski <mchochowski@nvidia.com>
What does this PR do?
Type of change: new-feature
The script walks through the GPTModel transformer, patching each block as a no-op one at a time and estimating how much removing that block affects the final output representation.
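In rough form (a simplified sketch, not the actual script; the attribute names and loader objects such as `model.decoder.layers`, `calib_loader`, and `reference_hidden` are placeholders):

```python
import torch


def score_layers(model, calib_loader, reference_hidden):
    """Score each transformer block by the output MSE observed when it is skipped."""
    mse = torch.nn.MSELoss(reduction="mean")
    scores = {}
    for layer_id, layer in enumerate(model.decoder.layers):
        original_forward = layer.forward
        # Patch the block as a no-op: pass hidden states straight through.
        layer.forward = lambda hidden_states, *args, **kwargs: hidden_states
        with torch.no_grad():
            errors = [
                mse(model(batch), ref) for batch, ref in zip(calib_loader, reference_hidden)
            ]
        scores[layer_id] = torch.stack(errors).mean().item()
        layer.forward = original_forward  # restore before scoring the next block
    return scores
```

Blocks whose removal barely moves the output representation score low and become candidates for depth pruning.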
Usage
Testing
Before your PR is "Ready for review"
Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
CONTRIBUTING.md: ✅ / ❌ / N/A
Additional Information
Summary by CodeRabbit