
Commit 052e360

Merge remote-tracking branch 'origin/main' into weight_only_te_fix

2 parents: 45ab8ca + 7c33d85

File tree: 209 files changed, +19958 / -3754 lines


.coderabbit.yaml

Lines changed: 3 additions & 0 deletions
@@ -20,6 +20,9 @@ reviews:
   5. Any use of "# nosec" comments to bypass Bandit security checks is not allowed.
      If a security-sensitive pattern is genuinely necessary, the PR must be reviewed and approved
      by @NVIDIA/modelopt-setup-codeowners with an explicit justification in the PR description.
+  6. Any addition of new PIP dependencies in pyproject.toml or requirements.txt that do not have
+     permissive licenses (e.g. MIT, Apache 2) must be reviewed and approved by
+     @NVIDIA/modelopt-setup-codeowners with an explicit justification in the PR description.
   - path: "examples/**/*.py"
     instructions: *security_instructions
 auto_review:

.github/CODEOWNERS

Lines changed: 1 addition & 0 deletions
@@ -10,6 +10,7 @@ LICENSE_HEADER @NVIDIA/modelopt-setup-codeowners
 pyproject.toml @NVIDIA/modelopt-setup-codeowners
 SECURITY.md @NVIDIA/modelopt-setup-codeowners
 tox.ini @NVIDIA/modelopt-setup-codeowners
+uv.lock @NVIDIA/modelopt-setup-codeowners

 # Library
 modelopt/deploy @NVIDIA/modelopt-deploy-codeowners

.github/PULL_REQUEST_TEMPLATE.md

Lines changed: 2 additions & 2 deletions
@@ -17,10 +17,10 @@ Type of change: ? <!-- Use one of the following: Bug fix, new feature, new examp

 Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`).

-Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, using `torch.load(..., weights_only=True)`, avoiding `pickle`, etc.).
+Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

 - Is this change backward compatible?: ✅ / ❌ / N/A <!--- If ❌, explain why. -->
-- If you copied code from any other source, did you follow IP policy in [CONTRIBUTING.md](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md#-copying-code-from-other-sources)?: ✅ / ❌ / N/A <!--- Mandatory -->
+- If you copied code from any other sources or added a new PIP dependency, did you follow the guidance in `CONTRIBUTING.md`?: ✅ / ❌ / N/A <!--- Mandatory -->
 - Did you write any new necessary tests?: ✅ / ❌ / N/A <!--- Mandatory for new features or examples. -->
 - Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅ / ❌ / N/A <!--- Only for new features, API changes, critical bug fixes or backward incompatible changes. -->
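The security practices referenced in this template exist because Python pickle deserialization (which `torch.load` uses under the hood unless `weights_only=True`) can execute arbitrary code chosen by the byte stream. A minimal stdlib-only sketch of the failure mode, using a harmless `eval` payload in place of a real exploit:

```python
import json
import pickle

class Payload:
    """Why `pickle.loads` on untrusted bytes is unsafe: __reduce__ lets the
    serialized stream pick an arbitrary callable to execute at load time."""

    def __reduce__(self):
        # A real attack would return something like (os.system, ("...",)).
        return (eval, ("21 * 2",))

blob = pickle.dumps(Payload())
result = pickle.loads(blob)  # runs eval("21 * 2"); no Payload is restored

# Safer pattern for plain data: JSON cannot encode executable objects.
restored = json.loads(json.dumps({"lr": 1e-4, "epochs": 3}))
```

This is why checkpoints from untrusted sources should be loaded with `weights_only=True`, and plain configuration data should prefer formats like JSON.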

.github/workflows/bump_uv_lock.yml

Lines changed: 1 addition & 1 deletion
@@ -50,7 +50,7 @@ jobs:
           git add uv.lock
           git config user.name "github-actions[bot]"
           git config user.email "github-actions[bot]@users.noreply.github.com"
-          git commit -m "[chore]: bump uv.lock"
+          git commit -s -m "[chore]: bump uv.lock"
           git push origin "$BRANCH"
           gh pr create \
             --title "[chore]: weekly bump of uv.lock on ${BASE} ($(date +%Y-%m-%d))" \

.github/workflows/unit_tests.yml

Lines changed: 1 addition & 1 deletion
@@ -64,7 +64,7 @@ jobs:
     timeout-minutes: 30
     strategy:
       matrix:
-        py: [10, 11]
+        py: [10, 11, 13]
     steps:
       - uses: actions/checkout@v6
       - uses: ./.github/actions/ubuntu-setup

CHANGELOG.rst

Lines changed: 17 additions & 0 deletions
@@ -8,19 +8,36 @@ NVIDIA Model Optimizer Changelog

 - ONNX Runtime dependency upgraded to 1.24 to solve missing graph outputs when using the TensorRT Execution Provider.

+**Backward Breaking Changes**
+
+- Default ``--kv_cache_qformat`` in ``hf_ptq.py`` changed from ``fp8`` to ``fp8_cast``. Existing scripts that rely on the default will now skip KV cache calibration and use a constant amax instead. To restore the previous calibrated behavior, explicitly pass ``--kv_cache_qformat fp8``.
+- Removed KV cache scale clamping (``clamp_(min=1.0)``) in the HF checkpoint export path. Calibrated KV cache scales below 1.0 are now exported as-is. If you observe accuracy degradation with calibrated KV cache (``--kv_cache_qformat fp8`` or ``nvfp4``), consider using the casting methods (``fp8_cast`` or ``nvfp4_cast``) instead.
+
 **New Features**

+- Add ``fp8_cast`` and ``nvfp4_cast`` modes for ``--kv_cache_qformat`` in ``hf_ptq.py``. These use a constant amax (FP8 E4M3 max, 448.0) without data-driven calibration, since the downstream engine uses FP8 attention math for both FP8 and NVFP4 quantization. A new ``use_constant_amax`` field in :class:`QuantizerAttributeConfig <modelopt.torch.quantization.config.QuantizerAttributeConfig>` controls this behavior.
 - Users no longer need to manually register MoE modules to ensure expert calibration coverage in the PTQ workflow.
 - ``hf_ptq.py`` now saves the quantization summary and MoE expert token count table to the export directory.
 - Add ``--moe_calib_experts_ratio`` flag in ``hf_ptq.py`` to specify the ratio of experts to calibrate during the forward pass, improving expert coverage during calibration. Defaults to all experts.
 - Add sparse attention optimization for transformer models (``modelopt.torch.sparsity.attention_sparsity``). This reduces computational cost by skipping attention computation. Supports calibration for threshold selection on HuggingFace models. See `examples/llm_sparsity/attention_sparsity/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_sparsity/attention_sparsity>`_ for usage.
 - Add support for rotating the input before quantization for RHT.
 - Add support for advanced weight scale search for NVFP4 quantization and its export path.
 - Enable PTQ workflow for Qwen3.5 MoE models.
+- Enable PTQ workflow for the Kimi-K2.5 model.
+- Add ``nvfp4_omlp_only`` quantization format for NVFP4 quantization. This is similar to ``nvfp4_mlp_only`` but also quantizes the output projection layer in attention.
+- ``pass_through_bwd`` in the quantization config now defaults to True. Set it to False to use STE with zeroed outlier gradients for potentially better QAT accuracy.
+- Add :meth:`compute_quantization_mse <modelopt.torch.quantization.model_quant.compute_quantization_mse>` API to measure per-quantizer mean-squared quantization error, with flexible wildcard and callable filtering.
+- **Autotune**: new tool for automated Q/DQ (Quantize/Dequantize) placement optimization for ONNX models. Uses TensorRT latency measurements to choose insertion schemes that minimize inference time. Discovers regions automatically, groups them by structural pattern, and tests multiple Q/DQ schemes per pattern. Supports INT8 and FP8 quantization, a pattern cache for warm-starting on similar models, checkpoint/resume, and importing patterns from an existing QDQ baseline. CLI: ``python -m modelopt.onnx.quantization.autotune``. See the Autotune guide in the documentation.
+- Add ``get_auto_quantize_config`` API to extract a flat quantization config from ``auto_quantize`` search results, enabling re-quantization at different effective bit targets without re-running calibration.
+- Improve ``auto_quantize`` checkpoint/resume: calibration state is now saved and restored across runs, avoiding redundant calibration when resuming a search.
+- Add support for Nemotron-3 (NemotronHForCausalLM) model quantization and for NemotronH MoE experts in ``auto_quantize`` grouping and scoring rules.
+- Add support for block-granular RHT for non-power-of-2 dimensions.
+- Replace ModelOpt FP8 QDQ nodes with native ONNX QDQ nodes.

 **Misc**

 - Migrated project metadata from ``setup.py`` to a fully declarative ``pyproject.toml``.
+- Enable experimental Python 3.13 wheel support and unit tests in CI/CD.

 0.42 (2026-02-xx)
 ^^^^^^^^^^^^^^^^^
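As context for the ``compute_quantization_mse`` entry above, the underlying metric (mean-squared error between a tensor and its fake-quantized reconstruction) can be sketched generically. The helper names below are illustrative, and this is not ModelOpt's implementation:

```python
def fake_quantize(xs, scale, qmin=-127, qmax=127):
    """Symmetric uniform quantize-dequantize (generic sketch, not ModelOpt's kernel)."""
    return [max(qmin, min(qmax, round(x / scale))) * scale for x in xs]

def quantization_mse(xs, scale):
    """Mean-squared error between a tensor and its quantized reconstruction."""
    xq = fake_quantize(xs, scale)
    return sum((a - b) ** 2 for a, b in zip(xs, xq)) / len(xs)

weights = [0.11, -0.52, 2.03, -3.30, 0.004]
scale = max(abs(w) for w in weights) / 127  # amax-based scale, as in PTQ
err = quantization_mse(weights, scale)
```

With no clipping, the per-element error is at most half a quantization step, so the MSE is bounded by ``(scale / 2) ** 2``; comparing this number across quantizers is what lets the API rank which layers suffer most from quantization.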

CLAUDE.md

Lines changed: 106 additions & 0 deletions
@@ -0,0 +1,106 @@
# CLAUDE.md

NVIDIA Model Optimizer (ModelOpt): open-source library for model optimization techniques including
quantization, pruning, distillation, sparsity, and speculative decoding to accelerate inference.
Primarily Python codebase with optional C++/CUDA extensions supporting PyTorch, ONNX, and Hugging Face/Megatron models.

> If a `CLAUDE.local.md` file exists alongside this file, read and respect it — it contains
> developer-specific overrides that supplement this shared guidance.

## Rules (Read First)

**CRITICAL (YOU MUST):**

- NVIDIA Apache 2.0 license header on ALL new Python/C++/CUDA files (see `LICENSE_HEADER`)
- `git commit -s -S` (DCO sign-off + cryptographic signing required). Never attribute AI tools in sign-off line
- `pre-commit` hooks run on commit — if files are modified by hooks, re-stage and commit again
- PRs require CODEOWNERS review (auto-assigned based on `.github/CODEOWNERS`)
- After rebasing, always re-run tests locally before pushing
- All code must follow the security guidelines in `SECURITY.md` — violations are blocked as pre-merge errors
- For contribution guidelines, commit conventions, and PR requirements, see `CONTRIBUTING.md`

## Common Commands

| Task | Command |
|------|---------|
| Install (editable + dev) | `pip install -e ".[dev]"` |
| CPU unit tests | `python -m pytest tests/unit` |
| GPU unit tests | `python -m pytest tests/gpu` |
| Megatron GPU tests | `python -m pytest tests/gpu_megatron` |
| TRT-LLM GPU tests | `python -m pytest tests/gpu_trtllm` |
| Pattern match | `pytest tests/unit -k "test_quantize"` |
| Lint + format (all files) | `pre-commit run --all-files` |
| Lint (diff only) | `pre-commit run --from-ref origin/main --to-ref HEAD` |
| Run via tox (CPU unit) | `tox -e py312-torch210-tf_latest-unit` |
| Build docs | `tox -e build-docs` |
| Build wheel | `tox -e build-wheel` |

## Architecture

ModelOpt is organized into three top-level namespaces:

| Namespace | Path | Role |
|-----------|------|------|
| `modelopt.torch` | `modelopt/torch/` | Core PyTorch optimization library |
| `modelopt.onnx` | `modelopt/onnx/` | ONNX model quantization and export |
| `modelopt.deploy` | `modelopt/deploy/` | Deployment utilities for LLMs |

### `modelopt.torch` Sub-packages

| Sub-package | Path | Role |
|-------------|------|------|
| `opt` | `modelopt/torch/opt/` | Core optimization infrastructure (modes, config, state dicts) |
| `quantization` | `modelopt/torch/quantization/` | PTQ, QAT, and quantization-aware algorithms |
| `prune` | `modelopt/torch/prune/` | Structured and unstructured pruning |
| `distill` | `modelopt/torch/distill/` | Knowledge distillation |
| `sparsity` | `modelopt/torch/sparsity/` | Weight and activation sparsity |
| `speculative` | `modelopt/torch/speculative/` | Speculative decoding (Medusa, EAGLE, etc.) |
| `nas` | `modelopt/torch/nas/` | Neural architecture search |
| `export` | `modelopt/torch/export/` | Checkpoint export for TRT-LLM / Megatron |
| `peft` | `modelopt/torch/peft/` | QLoRA and PEFT integration |
| `_deploy` | `modelopt/torch/_deploy/` | Internal deployment utilities |
| `utils` | `modelopt/torch/utils/` | Shared utilities and plugin infrastructure |

### Core Abstraction: Modes

A **mode** is the unit of model optimization in ModelOpt. Each algorithm (quantization, pruning,
etc.) is implemented as one or more modes. Modes are recorded in the model's `modelopt_state` so
optimization workflows can be composed, saved, and restored.
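The mode abstraction described above can be sketched in a few lines; the class and function names here are hypothetical illustrations, not ModelOpt's actual API:

```python
class Model:
    """Stand-in for an optimizable model (illustrative only)."""

    def __init__(self):
        self.modelopt_state = []  # ordered record of applied modes

def apply_mode(model, mode, config=None):
    """Record the mode so the optimization history can be saved and restored."""
    model.modelopt_state.append({"mode": mode, "config": config or {}})
    return model

# Compose two modes; the state records the order they were applied in.
m = apply_mode(Model(), "quantize", {"format": "nvfp4"})
m = apply_mode(m, "prune")
```

Because `modelopt_state` is an ordered record, replaying it on a fresh model restores the same optimization pipeline, which is the point of the save/restore design.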
## Key Files

| File | Role |
|------|------|
| `modelopt/torch/opt/mode.py` | Base class for all optimization modes |
| `modelopt/torch/opt/config.py` | Configuration system for modes |
| `modelopt/torch/opt/conversion.py` | `apply_mode()` / `restore()` entry points |
| `modelopt/torch/quantization/__init__.py` | PTQ/QAT public API |
| `modelopt/torch/export/unified_export_hf.py` | Unified HF checkpoint export |
| `modelopt/torch/export/model_config_export.py` | TRT-LLM model config export |
| `modelopt/deploy/llm/` | LLM deployment utilities |
| `pyproject.toml` | Optional dependency groups (`[onnx]`, `[hf]`, `[all]`, `[dev]`); ruff, mypy, pytest, bandit, and coverage config |
| `.pre-commit-config.yaml` | Pre-commit hooks (ruff, mypy, clang-format, license headers) |
| `tox.ini` | Test environment definitions |

## Design Patterns

| Pattern | Key Points |
|---------|------------|
| **Mode composition** | Optimization algorithms are composed as sequences of modes, each recorded in `modelopt_state` |
| **Plugin system** | Optional integrations (HuggingFace, Megatron, etc.) loaded lazily via `import_plugin()` |
| **Optional dependencies** | Features gated by install extras (`[onnx]`, `[hf]`, `[all]`); avoid hard imports at module level |
| **Config dataclasses** | Each mode has a typed config; use Pydantic or dataclass conventions |
| **State dict** | Models carry `modelopt_state` for checkpoint save/restore across optimization steps |
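The lazy plugin pattern in the table above can be sketched as follows. This is a generic illustration of gating a feature on an optional dependency; ModelOpt's real `import_plugin()` signature may differ:

```python
import importlib

def import_plugin(module_name, register):
    """Load an optional integration lazily: call register(module) only if
    the dependency is installed, so core imports never hard-fail."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False  # optional dependency absent: feature stays off
    register(module)
    return True

loaded = []
import_plugin("json", lambda m: loaded.append(m.__name__))  # stdlib: present
import_plugin("definitely_not_installed_pkg", lambda m: loaded.append(m.__name__))
```

The key property is that the absent package costs nothing at import time and disables only its own integration, which is what "avoid hard imports at module level" asks for.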
## CI / Testing

| Layer | Location | Notes |
|-------|----------|-------|
| CPU unit tests | `tests/unit/` | Fast, no GPU needed; run in pre-merge CI |
| GPU unit tests | `tests/gpu/` | Requires CUDA GPU |
| Megatron GPU tests | `tests/gpu_megatron/` | Requires Megatron-Core + GPU |
| TRT-LLM GPU tests | `tests/gpu_trtllm/` | Requires TensorRT-LLM + GPU |
| Example/integration tests | `tests/examples/` | Integration tests for examples; see `tests/examples/README.md` |
| Pre-commit / lint | `.pre-commit-config.yaml` | ruff, mypy, clang-format, license headers, bandit |
| Coverage | `pyproject.toml` | 70% minimum on `modelopt/*` |

CONTRIBUTING.md

Lines changed: 6 additions & 0 deletions
@@ -39,6 +39,12 @@ To run the pre-commit hooks without committing, use:
 pre-commit run --all-files
 ```

+## Adding a new PIP dependency
+
+We currently declare pip dependencies in two places: [pyproject.toml](./pyproject.toml) for dependencies required by the ModelOpt library, and `examples/<example-name>/requirements.txt` for dependencies required by specific examples.
+
+If you add a new PIP dependency to either of these, verify the LICENSE of the dependency. If it is not a permissive license (e.g. MIT, Apache 2), you must provide a justification for the dependency in the PR and check with `@NVIDIA/modelopt-setup-codeowners` whether it is allowed.
+
 ## 🔒 Security coding practices

 All contributors must follow the security coding practices documented in *Security Coding Practices for

README.md

Lines changed: 2 additions & 0 deletions
@@ -26,6 +26,8 @@ Model Optimizer is also integrated with [NVIDIA Megatron-Bridge](https://github.

 ## Latest News

+- [2026/03/11] Model Optimizer quantized Nemotron-3-Super checkpoints are available on Hugging Face for download: [FP8](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8), [NVFP4](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4). Learn more in the [Nemotron 3 Super release blog](https://blogs.nvidia.com/blog/nemotron-3-super-agentic-ai/). Check out how to quantize Nemotron 3 models for deployment acceleration [here](./examples/llm_ptq/README.md).
+- [2026/03/11] [NeMo Megatron Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge) now supports Nemotron-3-Super quantization (PTQ and QAT) and export workflows using the Model Optimizer library. See the [Quantization (PTQ and QAT) guide](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/super-v3/docs/models/llm/nemotron3-super.md#quantization-ptq-and-qat) for FP8/NVFP4 quantization and HF export instructions.
 - [2025/12/11] [BLOG: Top 5 AI Model Optimization Techniques for Faster, Smarter Inference](https://developer.nvidia.com/blog/top-5-ai-model-optimization-techniques-for-faster-smarter-inference/)
 - [2025/12/08] NVIDIA TensorRT Model Optimizer is now officially rebranded as NVIDIA Model Optimizer.
 - [2025/10/07] [BLOG: Pruning and Distilling LLMs Using NVIDIA Model Optimizer](https://developer.nvidia.com/blog/pruning-and-distilling-llms-using-nvidia-tensorrt-model-optimizer/)

docs/source/getting_started/_installation_for_Linux.rst

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ Latest Model Optimizer (``nvidia-modelopt``) currently has the following system
 +-------------------------+-----------------------------+
 | Architecture            | x86_64, aarch64 (SBSA)      |
 +-------------------------+-----------------------------+
-| Python                  | >=3.10,<3.13                |
+| Python                  | >=3.10,<3.14                |
 +-------------------------+-----------------------------+
 | CUDA                    | 12.x, 13.x                  |
 +-------------------------+-----------------------------+
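The updated constraint (Python >=3.10,<3.14 after this commit) can be checked programmatically; a minimal sketch using version tuples:

```python
import sys

def modelopt_python_supported(version_info=sys.version_info):
    """True if the interpreter falls in the supported range >=3.10,<3.14."""
    return (3, 10) <= tuple(version_info[:2]) < (3, 14)

# Python 3.13 is newly in range; 3.14 remains out of range.
ok_313 = modelopt_python_supported((3, 13, 0))
ok_314 = modelopt_python_supported((3, 14, 0))
```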
