CHANGELOG.rst (17 lines changed: 17 additions & 0 deletions)
@@ -1,6 +1,22 @@
 NVIDIA Model Optimizer Changelog (Linux)
 ========================================
 
+0.43 (2026-03-xx)
+^^^^^^^^^^^^^^^^^
+
+**Bug Fixes**
+
+- ONNX Runtime dependency upgraded to 1.24 to fix missing graph outputs when using the TensorRT Execution Provider.
+
+**New Features**
+
+- Users no longer need to manually register MoE modules to ensure expert calibration coverage in the PTQ workflow.
+- ``hf_ptq.py`` now saves the quantization summary and the MoE expert token count table to the export directory.
+- Add the ``--moe_calib_experts_ratio`` flag to ``hf_ptq.py`` to specify the ratio of experts to calibrate during the forward pass, improving expert coverage during calibration. Defaults to all experts.
+- Add sparse attention optimization for transformer models (``modelopt.torch.sparsity.attention_sparsity``). This reduces computational cost by skipping attention computation. Supports calibration for threshold selection on HuggingFace models. See `examples/llm_sparsity/attention_sparsity/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_sparsity/attention_sparsity>`_ for usage.
+- Add support for rotating the input before quantization for RHT.
+- Add support for advanced weight scale search for NVFP4 quantization and its export path.
+
 0.42 (2026-02-xx)
 ^^^^^^^^^^^^^^^^^
 
@@ -21,6 +37,7 @@ NVIDIA Model Optimizer Changelog (Linux)
 - Add LTX-2 and Wan2.2 (T2V) support in the diffusers quantization workflow.
 - Add PTQ support for GLM-4.7, including loading MTP layer weights from a separate ``mtp.safetensors`` file and export as-is.
 - Add support for image-text data calibration in PTQ for Nemotron VL models.
+- Add support for advanced weight scale search for NVFP4 quantization and its export path.
 - Add PTQ support for Nemotron Parse.
 - Add distillation support for LTX-2. See `examples/diffusers/distillation/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/diffusers/distillation>`_ for more details.
 - Add Megatron Core export/import mapping for Qwen3-VL (``Qwen3VLForConditionalGeneration``) vision-language models. The mapping handles the ``model.language_model.`` weight prefix used by Qwen3-VL.
 * By default, ``cupy-cuda12x`` is installed for INT4 ONNX quantization. If you have CUDA 13, you need to run ``pip uninstall -y cupy-cuda12x`` and ``pip install cupy-cuda13x`` after installing ``nvidia-modelopt[onnx]``.
+
 **Accelerated Quantization with Triton Kernels**
 
 ModelOpt includes optimized quantization kernels implemented with Triton language that accelerate quantization
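Note on the CuPy context line in the diff: the swap it describes for CUDA 13 can be sketched as a small helper. This is only an illustration under stated assumptions. ``cupy_swap_commands`` is a hypothetical name and is not part of ModelOpt. The package names and pip commands are taken verbatim from the changed text, and treating any CUDA major version other than 12 or 13 as an error is an assumption of this sketch.

```python
def cupy_swap_commands(cuda_major: int) -> list[str]:
    """Return the pip commands to run after installing nvidia-modelopt[onnx].

    Hypothetical helper for illustration only; it is not a ModelOpt API.
    """
    if cuda_major == 12:
        # cupy-cuda12x is installed by default, so nothing needs to change.
        return []
    if cuda_major == 13:
        # Swap the default CUDA 12 build of CuPy for the CUDA 13 build.
        return [
            "pip uninstall -y cupy-cuda12x",
            "pip install cupy-cuda13x",
        ]
    # Assumption: the note only covers CUDA 12 and CUDA 13.
    raise ValueError(f"CUDA {cuda_major} is not covered by the note")


if __name__ == "__main__":
    for command in cupy_swap_commands(13):
        print(command)
```

The helper only builds the command strings and does not run them, so the caller decides whether and where to run the swap.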