Releases · NVIDIA/Model-Optimizer

13 May 20:37

kevalmorabia97

0.44.0

c897fbe

ModelOpt 0.44.0 Release Latest

New Features

Support the full Transformer Engine spec for Minitron pruning (mcore_minitron); a custom ModelOpt spec is no longer needed. Usage of the pruning workflow is unchanged, but pruning is slightly faster and may produce a slightly different pruned model because of different kernels and numerics.
Add end-to-end tutorial for Minitron pruning + distillation + quantization + evaluation + vLLM deployment for Nemotron-Nano-9B-v2 → Pruned 7B along with data blend preparation steps (and ablation study). See examples/pruning/minitron/README.md for details.
Add Puzzletron - a new algorithm for heterogeneous pruning of LLM and VLM models. See examples/puzzletron/README.md for more details.
Add an iterator interface using CalibrationDataReader to the ONNX quantization workflow.
Add N:M sparse softmax support to the Triton flash attention kernel (modelopt.torch.kernels.common.attention.triton_fa). See examples/llm_sparsity/attention_sparsity/README.md for usage.
Add skip-softmax support to the Triton flash attention kernel (modelopt.torch.kernels.common.attention.triton_fa). See examples/llm_sparsity/attention_sparsity/README.md for usage.
Add Video Sparse Attention (VSA) method for video diffusion models (modelopt.torch.sparsity.attention_sparsity). VSA uses 3D block tiling with a two-branch architecture for attention speedup.
Enable PTQ workflow for the Step3.5-Flash MoE model with NVFP4 W4A4 + FP8 KV cache quantization. See modelopt_recipes/models/Step3.5-Flash/nvfp4-mlp-only.yaml for more details.
Add support for vLLM fakequant reload using ModelOpt state for HF models. See examples/vllm_serve/README.md for more details.
[Early Testing] Add Claude Code PTQ skill (.claude/skills/ptq/) for agent-assisted post-training quantization. The skill guides the agent through environment detection, model support checking, format selection, and execution via the launcher or manual SLURM/Docker/bare GPU paths. Includes handling for unlisted models with custom module patching. This feature is in early testing — use with caution.
[Early Testing] Polish Claude Code evaluation skill (.claude/skills/evaluation/) for agent-assisted LLM accuracy benchmarking via NeMo Evaluator Launcher. Adds two companion skills vendored verbatim from NVIDIA-NeMo/Evaluator: launching-evals (run/check/debug/analyze NEL evaluations) and accessing-mlflow (query MLflow runs, compare metrics, fetch artifacts). Re-sync at a pinned upstream SHA via .claude/scripts/sync-upstream-skills.sh. Also adds a shared skills/common/credentials.md covering HF / NGC / Docker token setup referenced by multiple skills. This feature is in early testing — use with caution.
Add performant layerwise calibration for large models that don't fit on GPU (e.g. DeepSeek-R1, Kimi-K2). See modelopt_recipes/general/ptq/nvfp4_experts_only-kv_fp8_layerwise.yaml for usage. Layerwise calibration also supports PTQ with intermediate progress saving — useful when long PTQ runs get hit with Slurm timeouts. See modelopt_recipes/general/ptq/nvfp4_default-kv_none-gptq.yaml for usage.
Add implicit GEMM CUDA kernel for Conv3D with fused NVFP4 fake quantization (modelopt.torch.quantization.src.conv). When NVFP4 quantization is applied to an nn.Conv3d layer via ModelOpt PTQ, the implicit GEMM path is used automatically instead of cuDNN. Uses BF16 WMMA tensor cores (SM80+) with FP32 accumulation and in-kernel FP4 (E2M1) activation quantization. Grouped convolution (groups > 1) falls back to the default cuDNN path. Inference only — training mode falls back to cuDNN with a warning.
Add FP8 MHA quantization support for vision transformers. Adds an attention-aware ONNX post-processing pass (scale Mul / K-transpose move before Q, Q→DQ insertion on softmax output) in FP8QuantExporter (modelopt.onnx.export.fp8_exporter.FP8QuantExporter), per-instance nested-attention-wrapper skipping in the HF plugin, and nn.LayerNorm registration in QuantModuleRegistry so BMM input quantizers and LayerNorm output quantizers defined in FP8_DEFAULT_CFG are honored end-to-end. See examples/torch_onnx/torch_quant_to_onnx.py for the general timm-model quantize→ONNX workflow.
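The FP4 (E2M1) fake quantization mentioned in the Conv3D entry above can be illustrated in plain Python. This is only a rough sketch of the numerics, not the CUDA kernel: it assumes 16-element blocks, keeps the per-block scale in full precision (real NVFP4 also quantizes the scale), and resolves ties toward the smaller magnitude. The function names are hypothetical.

```python
# Magnitudes representable in FP4 E2M1 (the sign bit is handled separately).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # assumed block size for this sketch


def fake_quant_e2m1_block(block):
    """Quantize-dequantize one block with a single scale mapping amax to 6.0."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return list(block)
    scale = amax / E2M1_VALUES[-1]  # kept in full precision here
    out = []
    for v in block:
        mag = abs(v) / scale
        # Nearest representable magnitude; ties go to the smaller value.
        nearest = min(E2M1_VALUES, key=lambda q: abs(q - mag))
        out.append((nearest if v >= 0 else -nearest) * scale)
    return out


def fake_quant_e2m1(values, block_size=BLOCK_SIZE):
    """Apply block-wise fake quantization to a flat list of floats."""
    out = []
    for start in range(0, len(values), block_size):
        out.extend(fake_quant_e2m1_block(values[start:start + block_size]))
    return out


print(fake_quant_e2m1([6.0, 3.0, 1.2, -0.2]))
```

Because each block gets its own scale, one large outlier only degrades the precision of the other values in its own block.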

Backward Breaking Changes

The quant_cfg field in quantization configs is now an ordered list of QuantizerCfgEntry dicts instead of a flat dictionary. Each entry specifies a quantizer_name wildcard, an optional parent_class filter, a cfg dict of quantizer attributes, and/or an enable flag. Entries are applied in list order with later entries overriding earlier ones. The old dict-based format is still accepted and automatically converted via normalize_quant_cfg_list(), but now emits a DeprecationWarning; new code should use the list format. All built-in configs (e.g. FP8_DEFAULT_CFG, INT4_AWQ_CFG, NVFP4_DEFAULT_CFG), examples, and YAML recipes have been updated. See the quant-cfg documentation for the new format reference and migration guide.
Deprecated Mllama (Llama 3.2 Vision) support in the llm_ptq and vlm_ptq examples. The model_type == "mllama" branches and MllamaImageProcessor usage have been removed from hf_ptq.py and example_utils.py. For image-text calibration of VLMs, use --calib_with_images with a supported VLM (see Nemotron VL section in examples/llm_ptq/README.md).
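The list-based quant_cfg format described above can be sketched as follows. The key names (quantizer_name, parent_class, cfg, enable) come from the release note; the attribute values inside cfg and the resolve_quantizer helper are hypothetical and only illustrate the "later entries override earlier ones" rule. ModelOpt's actual matching and merge logic may differ.

```python
from fnmatch import fnmatchcase

# Illustrative entries in the new ordered-list format.
quant_cfg = [
    {"quantizer_name": "*", "enable": False},
    {"quantizer_name": "*weight_quantizer", "cfg": {"num_bits": 8, "axis": 0}, "enable": True},
    {"quantizer_name": "*input_quantizer", "cfg": {"num_bits": 8}, "enable": True},
    {"quantizer_name": "*lm_head*", "enable": False},
]


def resolve_quantizer(name, entries, parent_class=None):
    """Toy resolver: walk entries in order so later matches override earlier ones."""
    resolved = {"cfg": {}, "enable": False}
    for entry in entries:
        if not fnmatchcase(name, entry["quantizer_name"]):
            continue
        wanted = entry.get("parent_class")
        if wanted is not None and wanted != parent_class:
            continue  # entry is restricted to a different parent module class
        resolved["cfg"].update(entry.get("cfg", {}))
        if "enable" in entry:
            resolved["enable"] = entry["enable"]
    return resolved


print(resolve_quantizer("model.layers.0.mlp.up_proj.weight_quantizer", quant_cfg))
print(resolve_quantizer("lm_head.weight_quantizer", quant_cfg))
```

In this sketch the lm_head weight quantizer matches the weight entry first and is then disabled by the final entry, which shows why list order matters in a way a flat dictionary could not express reliably.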

Bug Fixes

Fix Megatron utility functions for generation (with pipeline parallelism) and speed up MMLU score evaluation by ~10x (by batching prefill passes).
Fix Minitron pruning (mcore_minitron) for MoE models. Previously, importance estimation hooks were incorrectly registered for MoE modules and the NAS step would hang.
Fix the TRT version requirement for remote autotuning in ONNX Autotune (10.15+ rather than 10.16+), and check the trtexec version instead of the TRT Python API version when using the trtexec backend.
Exclude MatMul/Gemm nodes with K or N < 16 from ONNX INT8 and FP8 quantization. Such small-dimension GEMMs cannot efficiently use INT8/FP8 Tensor Cores and the added Q/DQ layers cause perf regressions in TensorRT. Honors Gemm transB when deriving K.
Fix nvfp4_awq export AssertionError: Modules have different quantization formats for MoE models (e.g. Qwen3-30B-A3B) when some experts are not exercised by the calibration data. awq_lite now applies a neutral all-ones pre_quant_scale to any expert that ends up disabled (no cache-pass tokens, NaN scales, or no search-pass tokens) so its format remains nvfp4_awq, consistent with the rest of the MoE block. A warning is emitted whenever this fallback fires.
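The small-GEMM exclusion rule from the MatMul/Gemm entry above can be sketched as a shape check. This is a simplified illustration, not the ModelOpt implementation: it assumes the second operand B has a known static shape, and the helper names are hypothetical. For Gemm, B is laid out as (K, N), or as (N, K) when transB is set.

```python
MIN_DIM = 16  # threshold named in the release note


def gemm_k_n(b_shape, trans_b=False):
    """Derive (K, N) from the shape of operand B, honoring transB."""
    rows, cols = b_shape[-2], b_shape[-1]
    # With transB the stored layout is (N, K), so swap to get (K, N).
    return (cols, rows) if trans_b else (rows, cols)


def should_quantize_gemm(b_shape, trans_b=False, min_dim=MIN_DIM):
    """Return False for GEMMs whose K or N is below the threshold."""
    k, n = gemm_k_n(b_shape, trans_b)
    return k >= min_dim and n >= min_dim


print(should_quantize_gemm((4096, 8)))                # N = 8, excluded
print(should_quantize_gemm((32, 8), trans_b=True))    # K = 8, excluded
print(should_quantize_gemm((1024, 1024)))             # large enough
```

Ignoring transB would misread the second case as K = 32, N = 8, which still fails here but for the wrong reason; with other shapes the swap changes which dimension is treated as the reduction axis.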

Misc

[Security] Changed the default of weights_only to True in torch.load for secure checkpoint loading. If you need to load a checkpoint that requires unpickling arbitrary objects, register the class with torch.serialization.add_safe_globals([cls]) before loading. Added safe_save (modelopt.torch.utils.serialization.safe_save) and safe_load (modelopt.torch.utils.serialization.safe_load) APIs to save and load checkpoints securely.
Bump minimum required PyTorch version to 2.8.
[Experimental] Add support for transformers>=5.0, including generic PTQ and unified HF checkpoint export for fused MoE expert modules (Mixtral, Qwen2-MoE, Qwen3-MoE, Qwen3.5-MoE, DeepSeek-V3, Jamba, OLMoE, etc.).
Improve megatron_preprocess_data: add --reasoning_content support for Nemotron v3 datasets, eliminate intermediate JSONL for HuggingFace datasets, return output file prefixes from the Python API, add gzip input support (.jsonl.gz), add --strip_newlines flag for plain-text pretraining data, add --hf_streaming for very large datasets (only consumed rows downloaded), and auto-shuffle when --hf_max_samples_per_split is set to avoid biased sampling.
Add installation support for Python 3.14. Only basic unit tests are verified for now. Production usage still defaults to Python 3.12. Python 3.10 support will be dropped in the next release.
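The reasoning behind the weights_only change in the [Security] entry above can be illustrated with the standard library alone. The sketch below is not how torch.load or safe_load is implemented; it uses pickle's documented find_class hook to build an allowlist-based unpickler, and SAFE_GLOBALS, register_safe_global, and restricted_loads are hypothetical names.

```python
import io
import pickle
from fractions import Fraction

# Allowlist of (module, qualified name) pairs that may be reconstructed.
SAFE_GLOBALS = set()


def register_safe_global(cls):
    """Explicitly opt a class in, similar in spirit to add_safe_globals."""
    SAFE_GLOBALS.add((cls.__module__, cls.__qualname__))


class AllowlistUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the pickle stream references a global object.
        if (module, name) in SAFE_GLOBALS:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is not allowlisted")


def restricted_loads(data):
    """Load pickled bytes, rejecting any class that was not registered."""
    return AllowlistUnpickler(io.BytesIO(data)).load()


# Plain containers and numbers reference no globals, so they load as-is.
plain = pickle.dumps({"step": 100, "weights": [0.1, 0.2]})
print(restricted_loads(plain))

# A payload containing a class instance is rejected until the class is registered.
payload = pickle.dumps({"lr": Fraction(1, 3)})
try:
    restricted_loads(payload)
except pickle.UnpicklingError as err:
    print("rejected:", err)
```

Calling register_safe_global(Fraction) before restricted_loads(payload) lets the second payload load, which mirrors the register-then-load order the release note describes.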

13 May 17:32

kevalmorabia97

0.44.0rc5

c897fbe

0.44.0rc5 Pre-release

fix(te-plugin): handle TE 2.15+ tuple return from `_Linear` / `_Group…

12 May 20:12

kevalmorabia97

0.44.0rc4

50e112e

0.44.0rc4 Pre-release

fix(te-plugin): make _Linear arg indexing robust to TE signature chan…

11 May 16:43

AAnoosheh

0.44.0rc3

1b5d448

0.44.0rc3 Pre-release

0.44.0rc3

05 May 04:59

kevalmorabia97

0.44.0rc2

cc06062

0.44.0rc2 Pre-release

Install the 0.44.0rc2 pre-release version using

pip install nvidia-modelopt==0.44.0rc2 --extra-index-url https://pypi.nvidia.com

20 Apr 19:14

kevalmorabia97

0.44.0rc1

8d2f99f

0.44.0rc1 Pre-release

Install the 0.44.0rc1 pre-release version using

pip install nvidia-modelopt==0.44.0rc1 --extra-index-url https://pypi.nvidia.com

19 Apr 15:10

kevalmorabia97

0.44.0rc0

26ae8da

0.44.0rc0 Pre-release

Install the 0.44.0rc0 pre-release version using

pip install nvidia-modelopt==0.44.0rc0 --extra-index-url https://pypi.nvidia.com

16 Apr 19:22

kevalmorabia97

0.43.0

ccabb95

ModelOpt 0.43.0 Release

Bug Fixes

ONNX Runtime dependency upgraded to 1.24 to solve missing graph outputs when using the TensorRT Execution Provider.

Backward Breaking Changes

Default --kv_cache_qformat in hf_ptq.py changed from fp8 to fp8_cast. Existing scripts that rely on the default will now skip KV cache calibration and use a constant amax instead. To restore the previous calibrated behavior, explicitly pass --kv_cache_qformat fp8.
Removed KV cache scale clamping (clamp_(min=1.0)) in the HF checkpoint export path. Calibrated KV cache scales below 1.0 are now exported as-is. If you observe accuracy degradation with calibrated KV cache (--kv_cache_qformat fp8 or nvfp4), consider using the casting methods (fp8_cast or nvfp4_cast) instead.
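The effect of the two KV cache changes above can be sketched numerically. The formula used here (scale = amax / FP8 E4M3 max) is an assumption of this sketch, since the release notes do not state how the exported scale is computed; kv_scale and its clamp_min parameter are hypothetical names.

```python
FP8_E4M3_MAX = 448.0  # constant amax used by the cast modes, per the release note


def kv_scale(amax, clamp_min=None):
    """Assumed scale formula: amax divided by the FP8 E4M3 maximum."""
    scale = amax / FP8_E4M3_MAX
    if clamp_min is not None:
        # Mimics the removed clamp_(min=1.0) behavior.
        scale = max(scale, clamp_min)
    return scale


# fp8_cast: constant amax, so the scale is exactly 1.0 without calibration.
print(kv_scale(FP8_E4M3_MAX))

# Calibrated fp8 with a small amax: previously clamped up, now exported as-is.
print(kv_scale(112.0, clamp_min=1.0))
print(kv_scale(112.0))
```

Under this assumption, a calibrated amax below 448.0 now yields a scale below 1.0, which is the case the note flags as a possible source of accuracy differences.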

New Features

Add fp8_cast and nvfp4_cast modes for --kv_cache_qformat in hf_ptq.py. These use a constant amax (FP8 E4M3 max, 448.0) without data-driven calibration, since the downstream engine uses FP8 attention math for both FP8 and NVFP4 quantization. A new use_constant_amax field in QuantizerAttributeConfig controls this behavior.
Users no longer need to manually register MoE modules to ensure expert calibration coverage in the PTQ workflow.
hf_ptq.py now saves the quantization summary and MoE expert token count table to the export directory.
Add --moe_calib_experts_ratio flag in hf_ptq.py to specify the ratio of experts to calibrate during the forward pass, improving expert coverage during calibration. Defaults to None (not enabled).
Add sparse attention optimization for transformer models (modelopt.torch.sparsity.attention_sparsity). This reduces computational cost by skipping attention computation. Supports calibration for threshold selection on HuggingFace models. See examples/llm_sparsity/attention_sparsity/README.md for usage.
Add support for rotating the input before quantization for RHT.
Add support for advanced weight scale search for NVFP4 quantization and its export path.
Enable PTQ workflow for Qwen3.5 MoE models.
Enable PTQ workflow for the Kimi-K2.5 model.
Add nvfp4_omlp_only quantization format for NVFP4 quantization. This is similar to nvfp4_mlp_only but also quantizes the output projection layer in attention.
Add nvfp4_experts_only quantization config that targets only MoE routed expert layers (excluding shared) with NVFP4 quantization.
pass_through_bwd in the quantization config now defaults to True. Set it to False if you want to use STE with zeroed outlier gradients for potentially better QAT accuracy.
Add compute_quantization_mse API to measure per-quantizer mean-squared quantization error, with flexible wildcard and callable filtering.
Autotune: New tool for automated Q/DQ (Quantize/Dequantize) placement optimization for ONNX models. Uses TensorRT latency measurements to choose insertion schemes that minimize inference time. Discovers regions automatically, groups them by structural pattern, and tests multiple Q/DQ schemes per pattern. Supports INT8 and FP8 quantization, pattern cache for warm-start on similar models, checkpoint/resume, and importing patterns from an existing QDQ baseline. CLI: python -m modelopt.onnx.quantization.autotune. See the Autotune guide in the documentation.
Add get_auto_quantize_config API to extract a flat quantization config from auto_quantize search results, enabling re-quantization at different effective bit targets without re-running calibration.
Improve auto_quantize checkpoint/resume: calibration state is now saved and restored across runs, avoiding redundant calibration when resuming a search.
Add support for Nemotron-3 (NemotronHForCausalLM) model quantization and for NemotronH MoE experts in auto_quantize grouping and scoring rules.
Add support for block-granular RHT for non-power-of-2 dimensions.
Replace modelopt FP8 QDQ nodes with native ONNX QDQ nodes.
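The idea behind per-quantizer quantization MSE, as in the compute_quantization_mse entry above, can be sketched without ModelOpt. This example does not call that API, whose signature is not given in these notes; it uses a symmetric INT8 fake quantizer as a stand-in, and all function names are hypothetical.

```python
def fake_quant_int8(values, amax):
    """Symmetric INT8 quantize-dequantize with a single scale."""
    if amax == 0.0:
        return list(values)
    scale = amax / 127.0
    out = []
    for v in values:
        q = max(-127, min(127, round(v / scale)))
        out.append(q * scale)
    return out


def quantization_mse(values, quantized):
    """Mean-squared error between original and fake-quantized values."""
    return sum((a - b) ** 2 for a, b in zip(values, quantized)) / len(values)


def mse_by_quantizer(tensors, name_filter=lambda name: True):
    """Report MSE per named tensor, keeping only names accepted by the filter."""
    report = {}
    for name, values in tensors.items():
        if not name_filter(name):
            continue
        amax = max(abs(v) for v in values)
        report[name] = quantization_mse(values, fake_quant_int8(values, amax))
    return report


tensors = {
    "layer0.weight_quantizer": [1.0, 0.3, -0.7],
    "layer0.input_quantizer": [2.5, -1.1],
}
print(mse_by_quantizer(tensors, name_filter=lambda n: n.endswith("weight_quantizer")))
```

The callable filter plays the role of the "flexible wildcard and callable filtering" mentioned in the note; a wildcard variant could use fnmatch in the same place.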

Deprecations

Removed MT-Bench (FastChat) support from examples/llm_eval. The run_fastchat.sh and gen_model_answer.py scripts have been deleted, and the mtbench task has been removed from the llm_ptq example scripts.
Removed deprecated NeMo-2.0 Framework references.

Misc

Migrated project metadata from setup.py to a fully declarative pyproject.toml.
Enable experimental Python 3.13 wheel support and unit tests in CI/CD.

13 Apr 18:25

kevalmorabia97

0.43.0rc4

ccabb95

0.43.0rc4 Pre-release

Install the 0.43.0rc4 pre-release version using

pip install nvidia-modelopt==0.43.0rc4 --extra-index-url https://pypi.nvidia.com

06 Apr 15:35

kevalmorabia97

0.43.0rc3

f3151d2

0.43.0rc3 Pre-release

Install the 0.43.0rc3 pre-release version using

pip install nvidia-modelopt==0.43.0rc3 --extra-index-url https://pypi.nvidia.com
