
Commit 052e360

Merge remote-tracking branch 'origin/main' into weight_only_te_fix

2 parents: 45ab8ca + 7c33d85

File tree: 209 files changed, +19958 / -3754 lines


.coderabbit.yaml

Lines changed: 3 additions & 0 deletions
@@ -20,6 +20,9 @@ reviews:
   5. Any use of "# nosec" comments to bypass Bandit security checks is not allowed.
      If a security-sensitive pattern is genuinely necessary, the PR must be reviewed and approved
      by @NVIDIA/modelopt-setup-codeowners with an explicit justification in the PR description.
+  6. Any addition of new PIP dependencies in pyproject.toml or requirements.txt that do not have
+     permissive licenses (e.g. MIT, Apache 2) must be reviewed and approved by
+     @NVIDIA/modelopt-setup-codeowners with an explicit justification in the PR description.
   - path: "examples/**/*.py"
     instructions: *security_instructions
 auto_review:

.github/CODEOWNERS

Lines changed: 1 addition & 0 deletions
@@ -10,6 +10,7 @@ LICENSE_HEADER @NVIDIA/modelopt-setup-codeowners
 pyproject.toml @NVIDIA/modelopt-setup-codeowners
 SECURITY.md @NVIDIA/modelopt-setup-codeowners
 tox.ini @NVIDIA/modelopt-setup-codeowners
+uv.lock @NVIDIA/modelopt-setup-codeowners

 # Library
 modelopt/deploy @NVIDIA/modelopt-deploy-codeowners

.github/PULL_REQUEST_TEMPLATE.md

Lines changed: 2 additions & 2 deletions
@@ -17,10 +17,10 @@ Type of change: ? <!-- Use one of the following: Bug fix, new feature, new examp

 Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`).

-Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, using `torch.load(..., weights_only=True)`, avoiding `pickle`, etc.).
+Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

 - Is this change backward compatible?: ✅ / ❌ / N/A <!--- If ❌, explain why. -->
-- If you copied code from any other source, did you follow IP policy in [CONTRIBUTING.md](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md#-copying-code-from-other-sources)?: ✅ / ❌ / N/A <!--- Mandatory -->
+- If you copied code from any other sources or added a new PIP dependency, did you follow the guidance in `CONTRIBUTING.md`?: ✅ / ❌ / N/A <!--- Mandatory -->
 - Did you write any new necessary tests?: ✅ / ❌ / N/A <!--- Mandatory for new features or examples. -->
 - Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅ / ❌ / N/A <!--- Only for new features, API changes, critical bug fixes or backward incompatible changes. -->
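The security practices referenced in this template exist because Python pickle deserialization (which `torch.load` uses under the hood unless `weights_only=True`) can execute arbitrary code chosen by the byte stream. A minimal stdlib-only sketch of the failure mode, using a harmless `eval` payload in place of a real exploit:

```python
import json
import pickle

class Payload:
    """Why `pickle.loads` on untrusted bytes is unsafe: __reduce__ lets the
    serialized stream pick an arbitrary callable to execute at load time."""

    def __reduce__(self):
        # A real attack would return something like (os.system, ("...",)).
        return (eval, ("21 * 2",))

blob = pickle.dumps(Payload())
result = pickle.loads(blob)  # runs eval("21 * 2"); no Payload is restored

# Safer pattern for plain data: JSON cannot encode executable objects.
restored = json.loads(json.dumps({"lr": 1e-4, "epochs": 3}))
```

This is why checkpoints from untrusted sources should be loaded with `weights_only=True`, and plain configuration data should prefer formats like JSON.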

.github/workflows/bump_uv_lock.yml

Lines changed: 1 addition & 1 deletion
@@ -50,7 +50,7 @@ jobs:
           git add uv.lock
           git config user.name "github-actions[bot]"
           git config user.email "github-actions[bot]@users.noreply.github.com"
-          git commit -m "[chore]: bump uv.lock"
+          git commit -s -m "[chore]: bump uv.lock"
           git push origin "$BRANCH"
           gh pr create \
             --title "[chore]: weekly bump of uv.lock on ${BASE} ($(date +%Y-%m-%d))" \

.github/workflows/unit_tests.yml

Lines changed: 1 addition & 1 deletion
@@ -64,7 +64,7 @@ jobs:
     timeout-minutes: 30
     strategy:
       matrix:
-        py: [10, 11]
+        py: [10, 11, 13]
     steps:
       - uses: actions/checkout@v6
       - uses: ./.github/actions/ubuntu-setup

CHANGELOG.rst

Lines changed: 17 additions & 0 deletions
@@ -8,19 +8,36 @@ NVIDIA Model Optimizer Changelog

 - ONNX Runtime dependency upgraded to 1.24 to solve missing graph outputs when using the TensorRT Execution Provider.

+**Backward Breaking Changes**
+
+- Default ``--kv_cache_qformat`` in ``hf_ptq.py`` changed from ``fp8`` to ``fp8_cast``. Existing scripts that rely on the default will now skip KV cache calibration and use a constant amax instead. To restore the previous calibrated behavior, explicitly pass ``--kv_cache_qformat fp8``.
+- Removed KV cache scale clamping (``clamp_(min=1.0)``) in the HF checkpoint export path. Calibrated KV cache scales below 1.0 are now exported as-is. If you observe accuracy degradation with calibrated KV cache (``--kv_cache_qformat fp8`` or ``nvfp4``), consider using the casting methods (``fp8_cast`` or ``nvfp4_cast``) instead.
+
 **New Features**

+- Add ``fp8_cast`` and ``nvfp4_cast`` modes for ``--kv_cache_qformat`` in ``hf_ptq.py``. These use a constant amax (FP8 E4M3 max, 448.0) without data-driven calibration, since the downstream engine uses FP8 attention math for both FP8 and NVFP4 quantization. A new ``use_constant_amax`` field in :class:`QuantizerAttributeConfig <modelopt.torch.quantization.config.QuantizerAttributeConfig>` controls this behavior.
 - Users no longer need to manually register MoE modules to ensure expert calibration coverage in the PTQ workflow.
 - ``hf_ptq.py`` now saves the quantization summary and MoE expert token count table to the export directory.
 - Add ``--moe_calib_experts_ratio`` flag in ``hf_ptq.py`` to specify the ratio of experts to calibrate during the forward pass, improving expert coverage during calibration. Defaults to all experts.
 - Add sparse attention optimization for transformer models (``modelopt.torch.sparsity.attention_sparsity``). This reduces computational cost by skipping attention computation. Supports calibration for threshold selection on HuggingFace models. See `examples/llm_sparsity/attention_sparsity/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/llm_sparsity/attention_sparsity>`_ for usage.
 - Add support for rotating the input before quantization for RHT.
 - Add support for advanced weight scale search for NVFP4 quantization and its export path.
 - Enable PTQ workflow for Qwen3.5 MoE models.
+- Enable PTQ workflow for the Kimi-K2.5 model.
+- Add ``nvfp4_omlp_only`` quantization format for NVFP4 quantization. This is similar to ``nvfp4_mlp_only`` but also quantizes the output projection layer in attention.
+- ``pass_through_bwd`` in the quantization config now defaults to True. Set it to False to use STE with zeroed outlier gradients for potentially better QAT accuracy.
+- Add :meth:`compute_quantization_mse <modelopt.torch.quantization.model_quant.compute_quantization_mse>` API to measure per-quantizer mean-squared quantization error, with flexible wildcard and callable filtering.
+- **Autotune**: new tool for automated Q/DQ (Quantize/Dequantize) placement optimization for ONNX models. Uses TensorRT latency measurements to choose insertion schemes that minimize inference time. Discovers regions automatically, groups them by structural pattern, and tests multiple Q/DQ schemes per pattern. Supports INT8 and FP8 quantization, a pattern cache for warm-starting on similar models, checkpoint/resume, and importing patterns from an existing QDQ baseline. CLI: ``python -m modelopt.onnx.quantization.autotune``. See the Autotune guide in the documentation.
+- Add ``get_auto_quantize_config`` API to extract a flat quantization config from ``auto_quantize`` search results, enabling re-quantization at different effective bit targets without re-running calibration.
+- Improve ``auto_quantize`` checkpoint/resume: calibration state is now saved and restored across runs, avoiding redundant calibration when resuming a search.
+- Add support for Nemotron-3 (NemotronHForCausalLM) model quantization and for NemotronH MoE experts in ``auto_quantize`` grouping and scoring rules.
+- Add support for block-granular RHT for non-power-of-2 dimensions.
+- Replace ModelOpt FP8 QDQ nodes with native ONNX QDQ nodes.

 **Misc**

 - Migrated project metadata from ``setup.py`` to a fully declarative ``pyproject.toml``.
+- Enable experimental Python 3.13 wheel support and unit tests in CI/CD.

 0.42 (2026-02-xx)
 ^^^^^^^^^^^^^^^^^
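As context for the ``compute_quantization_mse`` entry above, the underlying metric (mean-squared error between a tensor and its fake-quantized reconstruction) can be sketched generically. The helper names below are illustrative, and this is not ModelOpt's implementation:

```python
def fake_quantize(xs, scale, qmin=-127, qmax=127):
    """Symmetric uniform quantize-dequantize (generic sketch, not ModelOpt's kernel)."""
    return [max(qmin, min(qmax, round(x / scale))) * scale for x in xs]

def quantization_mse(xs, scale):
    """Mean-squared error between a tensor and its quantized reconstruction."""
    xq = fake_quantize(xs, scale)
    return sum((a - b) ** 2 for a, b in zip(xs, xq)) / len(xs)

weights = [0.11, -0.52, 2.03, -3.30, 0.004]
scale = max(abs(w) for w in weights) / 127  # amax-based scale, as in PTQ
err = quantization_mse(weights, scale)
```

With no clipping, the per-element error is at most half a quantization step, so the MSE is bounded by ``(scale / 2) ** 2``; comparing this number across quantizers is what lets the API rank which layers suffer most from quantization.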

CLAUDE.md

Lines changed: 106 additions & 0 deletions
@@ -0,0 +1,106 @@
# CLAUDE.md

NVIDIA Model Optimizer (ModelOpt): open-source library for model optimization techniques including
quantization, pruning, distillation, sparsity, and speculative decoding to accelerate inference.
Primarily Python codebase with optional C++/CUDA extensions supporting PyTorch, ONNX, and Hugging Face/Megatron models.

> If a `CLAUDE.local.md` file exists alongside this file, read and respect it — it contains
> developer-specific overrides that supplement this shared guidance.

## Rules (Read First)

**CRITICAL (YOU MUST):**

- NVIDIA Apache 2.0 license header on ALL new Python/C++/CUDA files (see `LICENSE_HEADER`)
- `git commit -s -S` (DCO sign-off + cryptographic signing required). Never attribute AI tools in sign-off line
- `pre-commit` hooks run on commit — if files are modified by hooks, re-stage and commit again
- PRs require CODEOWNERS review (auto-assigned based on `.github/CODEOWNERS`)
- After rebasing, always re-run tests locally before pushing
- All code must follow the security guidelines in `SECURITY.md` — violations are blocked as pre-merge errors
- For contribution guidelines, commit conventions, and PR requirements, see `CONTRIBUTING.md`

## Common Commands

| Task | Command |
|------|---------|
| Install (editable + dev) | `pip install -e ".[dev]"` |
| CPU unit tests | `python -m pytest tests/unit` |
| GPU unit tests | `python -m pytest tests/gpu` |
| Megatron GPU tests | `python -m pytest tests/gpu_megatron` |
| TRT-LLM GPU tests | `python -m pytest tests/gpu_trtllm` |
| Pattern match | `pytest tests/unit -k "test_quantize"` |
| Lint + format (all files) | `pre-commit run --all-files` |
| Lint (diff only) | `pre-commit run --from-ref origin/main --to-ref HEAD` |
| Run via tox (CPU unit) | `tox -e py312-torch210-tf_latest-unit` |
| Build docs | `tox -e build-docs` |
| Build wheel | `tox -e build-wheel` |

## Architecture

ModelOpt is organized into three top-level namespaces:

| Namespace | Path | Role |
|-----------|------|------|
| `modelopt.torch` | `modelopt/torch/` | Core PyTorch optimization library |
| `modelopt.onnx` | `modelopt/onnx/` | ONNX model quantization and export |
| `modelopt.deploy` | `modelopt/deploy/` | Deployment utilities for LLMs |

### `modelopt.torch` Sub-packages

| Sub-package | Path | Role |
|-------------|------|------|
| `opt` | `modelopt/torch/opt/` | Core optimization infrastructure (modes, config, state dicts) |
| `quantization` | `modelopt/torch/quantization/` | PTQ, QAT, and quantization-aware algorithms |
| `prune` | `modelopt/torch/prune/` | Structured and unstructured pruning |
| `distill` | `modelopt/torch/distill/` | Knowledge distillation |
| `sparsity` | `modelopt/torch/sparsity/` | Weight and activation sparsity |
| `speculative` | `modelopt/torch/speculative/` | Speculative decoding (Medusa, EAGLE, etc.) |
| `nas` | `modelopt/torch/nas/` | Neural architecture search |
| `export` | `modelopt/torch/export/` | Checkpoint export for TRT-LLM / Megatron |
| `peft` | `modelopt/torch/peft/` | QLoRA and PEFT integration |
| `_deploy` | `modelopt/torch/_deploy/` | Internal deployment utilities |
| `utils` | `modelopt/torch/utils/` | Shared utilities and plugin infrastructure |

### Core Abstraction: Modes

A **mode** is the unit of model optimization in ModelOpt. Each algorithm (quantization, pruning,
etc.) is implemented as one or more modes. Modes are recorded in the model's `modelopt_state` so
optimization workflows can be composed, saved, and restored.
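The mode abstraction described above can be sketched in a few lines; the class and function names here are hypothetical illustrations, not ModelOpt's actual API:

```python
class Model:
    """Stand-in for an optimizable model (illustrative only)."""

    def __init__(self):
        self.modelopt_state = []  # ordered record of applied modes

def apply_mode(model, mode, config=None):
    """Record the mode so the optimization history can be saved and restored."""
    model.modelopt_state.append({"mode": mode, "config": config or {}})
    return model

# Compose two modes; the state records the order they were applied in.
m = apply_mode(Model(), "quantize", {"format": "nvfp4"})
m = apply_mode(m, "prune")
```

Because `modelopt_state` is an ordered record, replaying it on a fresh model restores the same optimization pipeline, which is the point of the save/restore design.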
## Key Files

| File | Role |
|------|------|
| `modelopt/torch/opt/mode.py` | Base class for all optimization modes |
| `modelopt/torch/opt/config.py` | Configuration system for modes |
| `modelopt/torch/opt/conversion.py` | `apply_mode()` / `restore()` entry points |
| `modelopt/torch/quantization/__init__.py` | PTQ/QAT public API |
| `modelopt/torch/export/unified_export_hf.py` | Unified HF checkpoint export |
| `modelopt/torch/export/model_config_export.py` | TRT-LLM model config export |
| `modelopt/deploy/llm/` | LLM deployment utilities |
| `pyproject.toml` | Optional dependency groups (`[onnx]`, `[hf]`, `[all]`, `[dev]`); ruff, mypy, pytest, bandit, and coverage config |
| `.pre-commit-config.yaml` | Pre-commit hooks (ruff, mypy, clang-format, license headers) |
| `tox.ini` | Test environment definitions |

## Design Patterns

| Pattern | Key Points |
|---------|------------|
| **Mode composition** | Optimization algorithms are composed as sequences of modes, each recorded in `modelopt_state` |
| **Plugin system** | Optional integrations (HuggingFace, Megatron, etc.) loaded lazily via `import_plugin()` |
| **Optional dependencies** | Features gated by install extras (`[onnx]`, `[hf]`, `[all]`); avoid hard imports at module level |
| **Config dataclasses** | Each mode has a typed config; use Pydantic or dataclass conventions |
| **State dict** | Models carry `modelopt_state` for checkpoint save/restore across optimization steps |
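The lazy plugin pattern in the table above can be sketched as follows. This is a generic illustration of gating a feature on an optional dependency; ModelOpt's real `import_plugin()` signature may differ:

```python
import importlib

def import_plugin(module_name, register):
    """Load an optional integration lazily: call register(module) only if
    the dependency is installed, so core imports never hard-fail."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False  # optional dependency absent: feature stays off
    register(module)
    return True

loaded = []
import_plugin("json", lambda m: loaded.append(m.__name__))  # stdlib: present
import_plugin("definitely_not_installed_pkg", lambda m: loaded.append(m.__name__))
```

The key property is that the absent package costs nothing at import time and disables only its own integration, which is what "avoid hard imports at module level" asks for.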
## CI / Testing

| Layer | Location | Notes |
|-------|----------|-------|
| CPU unit tests | `tests/unit/` | Fast, no GPU needed; run in pre-merge CI |
| GPU unit tests | `tests/gpu/` | Requires CUDA GPU |
| Megatron GPU tests | `tests/gpu_megatron/` | Requires Megatron-Core + GPU |
| TRT-LLM GPU tests | `tests/gpu_trtllm/` | Requires TensorRT-LLM + GPU |
| Example/integration tests | `tests/examples/` | Integration tests for examples; see `tests/examples/README.md` |
| Pre-commit / lint | `.pre-commit-config.yaml` | ruff, mypy, clang-format, license headers, bandit |
| Coverage | `pyproject.toml` | 70% minimum on `modelopt/*` |

CONTRIBUTING.md

Lines changed: 6 additions & 0 deletions
@@ -39,6 +39,12 @@ To run the pre-commit hooks without committing, use:
 pre-commit run --all-files
 ```

+## Adding a new PIP dependency
+
+We currently declare pip dependencies in two places: [pyproject.toml](./pyproject.toml) for dependencies required by the ModelOpt library, and `examples/<example-name>/requirements.txt` for dependencies required by specific examples.
+
+If you add a new PIP dependency to either of these, verify the LICENSE of the dependency. If it is not a permissive license (e.g. MIT, Apache 2), you must provide a justification for the dependency in the PR and check with `@NVIDIA/modelopt-setup-codeowners` whether it is allowed.
+
 ## 🔒 Security coding practices

 All contributors must follow the security coding practices documented in *Security Coding Practices for

README.md

Lines changed: 2 additions & 0 deletions
@@ -26,6 +26,8 @@ Model Optimizer is also integrated with [NVIDIA Megatron-Bridge](https://github.

 ## Latest News

+- [2026/03/11] Model Optimizer quantized Nemotron-3-Super checkpoints are available on Hugging Face for download: [FP8](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8), [NVFP4](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4). Learn more in the [Nemotron 3 Super release blog](https://blogs.nvidia.com/blog/nemotron-3-super-agentic-ai/). Check out how to quantize Nemotron 3 models for deployment acceleration [here](./examples/llm_ptq/README.md).
+- [2026/03/11] [NeMo Megatron Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge) now supports Nemotron-3-Super quantization (PTQ and QAT) and export workflows using the Model Optimizer library. See the [Quantization (PTQ and QAT) guide](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/super-v3/docs/models/llm/nemotron3-super.md#quantization-ptq-and-qat) for FP8/NVFP4 quantization and HF export instructions.
 - [2025/12/11] [BLOG: Top 5 AI Model Optimization Techniques for Faster, Smarter Inference](https://developer.nvidia.com/blog/top-5-ai-model-optimization-techniques-for-faster-smarter-inference/)
 - [2025/12/08] NVIDIA TensorRT Model Optimizer is now officially rebranded as NVIDIA Model Optimizer.
 - [2025/10/07] [BLOG: Pruning and Distilling LLMs Using NVIDIA Model Optimizer](https://developer.nvidia.com/blog/pruning-and-distilling-llms-using-nvidia-tensorrt-model-optimizer/)

docs/source/getting_started/_installation_for_Linux.rst

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ Latest Model Optimizer (``nvidia-modelopt``) currently has the following system
 +-------------------------+-----------------------------+
 | Architecture            | x86_64, aarch64 (SBSA)      |
 +-------------------------+-----------------------------+
-| Python                  | >=3.10,<3.13                |
+| Python                  | >=3.10,<3.14                |
 +-------------------------+-----------------------------+
 | CUDA                    | 12.x, 13.x                  |
 +-------------------------+-----------------------------+
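The updated constraint (Python >=3.10,<3.14 after this commit) can be checked programmatically; a minimal sketch using version tuples:

```python
import sys

def modelopt_python_supported(version_info=sys.version_info):
    """True if the interpreter falls in the supported range >=3.10,<3.14."""
    return (3, 10) <= tuple(version_info[:2]) < (3, 14)

# Python 3.13 is newly in range; 3.14 remains out of range.
ok_313 = modelopt_python_supported((3, 13, 0))
ok_314 = modelopt_python_supported((3, 14, 0))
```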
