CHANGELOG.rst (23 additions, 2 deletions)
@@ -1,7 +1,19 @@
 NVIDIA Model Optimizer Changelog
 ================================

-0.43 (2026-03-xx)
+0.44 (2026-05-xx)
+^^^^^^^^^^^^^^^^^
+
+**New Features**
+
+- Add an iterator interface using ``CalibrationDataReader`` in the ONNX quantization workflow.
+- Support the full Transformer Engine spec for Minitron pruning (``mcore_minitron``), removing the need for the custom ModelOpt spec. This does not change how the pruning workflow is used, but it makes pruning slightly faster and may yield a slightly different pruned model due to different kernels and numerics.
+
+**Bug Fixes**
+
+- Fix Minitron pruning (``mcore_minitron``) for MoE models. Importance-estimation hooks were previously registered incorrectly for MoE modules, which caused the NAS step to hang.
+
+0.43 (2026-04-09)
 ^^^^^^^^^^^^^^^^^

 **Bug Fixes**
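
The ``CalibrationDataReader`` bullet above describes an iterator-style calibration feed for ONNX quantization. Below is a minimal sketch of such a reader, built on ONNX Runtime's ``CalibrationDataReader`` base class; the ``calibration_data_reader`` keyword in the trailing comment is an assumed argument name, not a confirmed ModelOpt signature.

```python
# Minimal sketch (not the official example): an iterator-style calibration
# reader for the ONNX quantization workflow, built on ONNX Runtime's
# CalibrationDataReader base class.
import numpy as np
from onnxruntime.quantization import CalibrationDataReader


class RandomCalibrationReader(CalibrationDataReader):
    """Feeds {input_name: array} batches one at a time via get_next()."""

    def __init__(self, input_name: str, num_batches: int = 8):
        self.input_name = input_name
        self._batches = (
            np.random.rand(1, 3, 224, 224).astype(np.float32)
            for _ in range(num_batches)
        )

    def get_next(self):
        # Return None when exhausted, per the CalibrationDataReader contract.
        batch = next(self._batches, None)
        return None if batch is None else {self.input_name: batch}


# Hypothetical usage -- the keyword name for the reader in
# modelopt.onnx.quantization.quantize is an assumption:
# from modelopt.onnx.quantization import quantize
# quantize("model.onnx", calibration_data_reader=RandomCalibrationReader("input"))
```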
@@ -25,6 +41,7 @@ NVIDIA Model Optimizer Changelog
 - Enable PTQ workflow for Qwen3.5 MoE models.
 - Enable PTQ workflow for the Kimi-K2.5 model.
 - Add ``nvfp4_omlp_only`` quantization format for NVFP4 quantization. This is similar to ``nvfp4_mlp_only`` but also quantizes the output projection layer in attention.
+- Add ``nvfp4_experts_only`` quantization config that targets only MoE routed expert layers (excluding shared experts) with NVFP4 quantization.
 - ``pass_through_bwd`` in the quantization config now defaults to ``True``. Set it to ``False`` to use STE with zeroed outlier gradients for potentially better QAT accuracy.
 - Add :meth:`compute_quantization_mse <modelopt.torch.quantization.model_quant.compute_quantization_mse>` API to measure per-quantizer mean-squared quantization error, with flexible wildcard and callable filtering.
 - **Autotune**: New tool for automated Q/DQ (Quantize/Dequantize) placement optimization for ONNX models. It uses TensorRT latency measurements to choose insertion schemes that minimize inference time: it discovers regions automatically, groups them by structural pattern, and tests multiple Q/DQ schemes per pattern. Supports INT8 and FP8 quantization, a pattern cache for warm-starting on similar models, checkpoint/resume, and importing patterns from an existing Q/DQ baseline. CLI: ``python -m modelopt.onnx.quantization.autotune``. See the Autotune guide in the documentation.
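
A hedged sketch of the NVFP4 PTQ flow these bullets sit in, using ModelOpt's documented ``mtq.quantize`` entry point on a toy module. ``NVFP4_DEFAULT_CFG`` stands in for the new ``nvfp4_experts_only`` config (whose Python-side constant name is not given here), and the ``compute_quantization_mse`` call shape in the final comment is an assumption.

```python
# Sketch of NVFP4 post-training quantization with ModelOpt, followed by the
# new per-quantizer MSE check (call shape assumed, see comment below).
import torch
import torch.nn as nn
import modelopt.torch.quantization as mtq

# Toy stand-in model; in practice this would be your LLM / MoE network.
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))


def forward_loop(m):
    # A few calibration batches are enough for this sketch.
    for _ in range(8):
        m(torch.randn(4, 64))


# Documented PTQ entry point: calibrates and inserts quantizers in place.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)

# Assumed call shape for the new MSE API -- argument names unverified:
# from modelopt.torch.quantization.model_quant import compute_quantization_mse
# mse = compute_quantization_mse(model, forward_loop)
```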
@@ -34,12 +51,16 @@ NVIDIA Model Optimizer Changelog
 - Add support for block-granular RHT for non-power-of-2 dimensions.

@@ -94,6 +94,6 @@
 | Post Training Quantization | Compress model size by 2x-4x, speeding up inference while preserving model quality! |\[[LLMs](./examples/llm_ptq/)\]\[[diffusers](./examples/diffusers/)\]\[[VLMs](./examples/vlm_ptq/)\]\[[onnx](./examples/onnx_ptq/)\]\[[windows](./examples/windows/)\]|\[[docs](https://nvidia.github.io/Model-Optimizer/guides/1_quantization.html)\]|
-| Quantization Aware Training | Refine accuracy even further with a few training steps! |\[[NeMo](./examples/llm_qat#nemo-qatqad-simplified-flow-example)\]\[[Hugging Face](./examples/llm_qat/)\]|\[[docs](https://nvidia.github.io/Model-Optimizer/guides/1_quantization.html)\]|
-| Pruning | Reduce your model size and accelerate inference by removing unnecessary weights! |\[[PyTorch](./examples/pruning/)\]|\[[docs](https://nvidia.github.io/Model-Optimizer/guides/3_pruning.html)\]|
-| Distillation | Reduce deployment model size by teaching small models to behave like larger models! |\[[NeMo](./examples/llm_distill#knowledge-distillation-kd-for-nvidia-nemo-models)\]\[[Hugging Face](./examples/llm_distill/)\]|\[[docs](https://nvidia.github.io/Model-Optimizer/guides/4_distillation.html)\]|
+| Quantization Aware Training | Refine accuracy even further with a few training steps! |\[[Hugging Face](./examples/llm_qat/)\]|\[[docs](https://nvidia.github.io/Model-Optimizer/guides/1_quantization.html)\]|
+| Pruning | Reduce your model size and accelerate inference by removing unnecessary weights! |\[[General](./examples/pruning/)\]\[[Megatron-Bridge](./examples/megatron_bridge/README.md#pruning)\]||
+| Distillation | Reduce deployment model size by teaching small models to behave like larger models! |\[[Megatron-Bridge](./examples/llm_distill/README.md#knowledge-distillation-kd-in-nvidia-megatron-bridge-framework)\]\[[Megatron-LM](./examples/llm_distill/README.md#knowledge-distillation-kd-in-nvidia-megatron-lm-framework)\]\[[Hugging Face](./examples/llm_distill/)\]|\[[docs](https://nvidia.github.io/Model-Optimizer/guides/4_distillation.html)\]|
 | Speculative Decoding | Train draft modules to predict extra tokens during inference! |\[[Megatron](./examples/speculative_decoding#mlm-example)\]\[[Hugging Face](./examples/speculative_decoding/)\]|\[[docs](https://nvidia.github.io/Model-Optimizer/guides/5_speculative_decoding.html)\]|
 | Sparsity | Efficiently compress your model by storing only its non-zero parameter values and their locations |\[[PyTorch](./examples/llm_sparsity/)\]|\[[docs](https://nvidia.github.io/Model-Optimizer/guides/6_sparsity.html)\]|
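
The pruning row above and the Minitron items in the 0.44 notes funnel into the same pruning entry point. Here is a hedged sketch of ``mcore_minitron`` pruning, assuming the ``mtp.prune`` signature and ``export_config`` constraint pattern from the ModelOpt docs; the constraint keys, the pre-built ``model``, and ``calib_dataloader`` are placeholders/assumptions, not a verified recipe.

```python
# Hedged sketch of Minitron pruning (mcore_minitron). The constraint keys,
# return shape, and config plumbing follow the documented pattern but are
# assumptions, not a verified recipe.
import modelopt.torch.prune as mtp

model = ...             # placeholder: a Megatron-Core GPT/MoE model, already built
calib_dataloader = ...  # placeholder: a small calibration dataloader


def forward_loop(m):
    # Importance estimation ranks heads/layers/channels from these activations.
    for batch in calib_dataloader:
        m(batch)


model, _ = mtp.prune(
    model,
    mode="mcore_minitron",
    constraints={"export_config": {"hidden_size": 3072, "num_attention_heads": 24}},
    dummy_input=None,  # assumed unnecessary when a forward_loop is provided
    config={"forward_loop": forward_loop},
)
```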