
Commit 6c0fa6b

Merge main
Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
2 parents 8bb048e + 8a8c250 commit 6c0fa6b

51 files changed

Lines changed: 5483 additions & 749 deletions


.github/CODEOWNERS

Lines changed: 1 addition & 0 deletions
```diff
@@ -44,6 +44,7 @@ modelopt/torch/utils @NVIDIA/modelopt-torch-utils-codeowners
 /examples/llm_ptq @NVIDIA/modelopt-examples-llm_ptq-codeowners
 /examples/llm_qat @NVIDIA/modelopt-examples-llm_qat-codeowners
 /examples/llm_sparsity @NVIDIA/modelopt-torch-sparsity-codeowners
+/examples/megatron_bridge @NVIDIA/modelopt-examples-megatron-codeowners
 /examples/model_hub @NVIDIA/modelopt-examples-model_hub-codeowners
 /examples/nemo_run @NVIDIA/modelopt-examples-megatron-codeowners
 /examples/onnx_ptq @NVIDIA/modelopt-onnx-codeowners
```

.github/workflows/gpu_tests.yml

Lines changed: 2 additions & 2 deletions
```diff
@@ -1,4 +1,4 @@
-# NOTE: Make sure this file is consistent with .gitlab/tests.yml
+# TODO: Optimize gpu tests runtime!
 name: GPU tests
 
 on:
@@ -78,7 +78,7 @@ jobs:
   gpu-tests-non-pr:
     if: ${{ !startsWith(github.ref, 'refs/heads/pull-request/') }}
     runs-on: linux-amd64-gpu-h100-latest-2
-    timeout-minutes: 120
+    timeout-minutes: 150
    container: *gpu_container
    steps: *gpu_steps
  gpu-pr-required-check:
```

CHANGELOG.rst

Lines changed: 1 addition & 0 deletions
```diff
@@ -13,6 +13,7 @@ NVIDIA Model Optimizer Changelog (Linux)
 - Add standalone type inference option (``--use_standalone_type_inference``) in ONNX AutoCast as an alternative to ONNX's ``infer_shapes``. This experimental feature performs type-only inference without shape inference, useful as a workaround when shape inference fails or to avoid unnecessary shape inference overhead.
 - Add support for Kimi K2 Thinking model quantization from the original int4 checkpoint.
 - Add support for ``params`` constraint based automatic neural architecture search in Minitron pruning (``mcore_minitron``) as an alternative to manual pruning (using ``export_config``). See `examples/pruning/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/pruning>`_ for more details on its usage.
+- New example for Minitron pruning with Megatron-Bridge framework along with advanced pruning usage with new ``params`` constraint based pruning. Check `examples/megatron_bridge/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/megatron_bridge>`_ for example scripts.
 - Add support for calibration data with multiple samples in ``npz`` format in the ONNX Autocast workflow.
 - Add ``--opset`` option to ONNX quantization CLI to specify the target opset version for the quantized model.
 - Add support for context parallelism in Eagle speculative decoding for huggingface and megatron core models.
```

README.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -20,7 +20,7 @@ ______________________________________________________________________
 **[Input]** Model Optimizer currently supports inputs of a [Hugging Face](https://huggingface.co/), [PyTorch](https://github.com/pytorch/pytorch) or [ONNX](https://github.com/onnx/onnx) model.
 
 **[Optimize]** Model Optimizer provides Python APIs for users to easily compose the above model optimization techniques and export an optimized quantized checkpoint.
-Model Optimizer is also integrated with [NVIDIA NeMo](https://github.com/NVIDIA-NeMo/NeMo), [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) and [Hugging Face Accelerate](https://github.com/huggingface/accelerate) for training required inference optimization techniques.
+Model Optimizer is also integrated with [NVIDIA Megatron-Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge), [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) and [Hugging Face Accelerate](https://github.com/huggingface/accelerate) for training required inference optimization techniques.
 
 **[Export for deployment]** Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized checkpoint generated from Model Optimizer is ready for deployment in downstream inference frameworks like [SGLang](https://github.com/sgl-project/sglang), [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization), [TensorRT](https://github.com/NVIDIA/TensorRT), or [vLLM](https://github.com/vllm-project/vllm). The unified Hugging Face export API now supports both transformers and diffusers models.
```

examples/diffusers/quantization/quantize.py

Lines changed: 5 additions & 0 deletions
```diff
@@ -454,6 +454,11 @@ def create_argument_parser() -> argparse.ArgumentParser:
 
 
 def main() -> None:
+    from diffusers.models.normalization import RMSNorm as DiffuserRMSNorm
+
+    torch.nn.RMSNorm = DiffuserRMSNorm
+    torch.nn.modules.normalization.RMSNorm = DiffuserRMSNorm
+
     parser = create_argument_parser()
     args, unknown_args = parser.parse_known_args()
 
```

examples/megatron_bridge/README.md

Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@

# Megatron Bridge

This directory contains examples of using Model Optimizer with the [NeMo Megatron-Bridge](https://github.com/NVIDIA-Nemo/Megatron-Bridge) framework for pruning, distillation, quantization, etc.

<div align="center">

| **Section** | **Description** | **Link** | **Docs** |
| :------------: | :------------: | :------------: | :------------: |
| Pre-Requisites | Development environment setup | \[[Link](#pre-requisites)\] | |
| Pruning | Examples of pruning a model using the Minitron algorithm | \[[Link](#pruning)\] | |
| Distillation | Examples of distilling a pruned or quantized model | \[[Link](#distillation)\] | |
| Quantization | Examples of quantizing a model | \[[Link](#quantization)\] | |
| Resources | Extra links to relevant resources | \[[Link](#resources)\] | |

</div>
## Pre-Requisites

Running these examples requires several additional dependencies (e.g., Megatron-Bridge, Megatron-Core), so we strongly recommend using the NeMo container (e.g., `nvcr.io/nvidia/nemo:26.02`), which has all of them pre-installed.

To get the latest ModelOpt features and examples, mount your up-to-date ModelOpt clone into the container at `/opt/Model-Optimizer`, or pull the latest changes once inside the container (`cd /opt/Model-Optimizer && git checkout main && git pull`); a sample launch command is sketched below.
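For reference, here is a minimal, illustrative `docker run` invocation for that setup; the host-side checkout path is a placeholder and the image tag is only the one suggested above, so adjust both to your environment:

```bash
# Illustrative only: start the NeMo container with GPUs and a local ModelOpt checkout mounted.
# Replace /path/to/Model-Optimizer with the location of your clone.
docker run --gpus all -it --rm \
    -v /path/to/Model-Optimizer:/opt/Model-Optimizer \
    nvcr.io/nvidia/nemo:26.02 bash
```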

## Pruning

This section shows how to prune a Hugging Face model using the Minitron algorithm in the Megatron-Bridge framework. Check out other available pruning algorithms, supported frameworks and models, and general getting-started guidance in the [pruning README](../pruning/README.md).

Example usage to prune Qwen3-8B to 6B on 2 GPUs (pipeline parallelism = 2) while skipping pruning of `num_attention_heads`, using the following defaults:
1024 samples from [`nemotron-post-training-dataset-v2`](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2) for calibration,
at most 20% depth (`num_layers`) and 40% width pruned per prunable hparam (`hidden_size`, `ffn_hidden_size`, ...),
and the top-10 candidates evaluated for MMLU score (on 5% sampled data) to select the best model.

```bash
torchrun --nproc_per_node 2 /opt/Model-Optimizer/examples/megatron_bridge/prune_minitron.py \
    --hf_model_name_or_path Qwen/Qwen3-8B \
    --prune_target_params 6e9 \
    --hparams_to_skip num_attention_heads \
    --output_hf_path /tmp/Qwen3-8B-Pruned-6B
```

To see the full usage for advanced configurations, run:

```bash
python /opt/Model-Optimizer/examples/megatron_bridge/prune_minitron.py --help
```

> [!TIP]
> If the number of layers in the model is not divisible by the number of GPUs, i.e. the pipeline parallel (PP) size, you can configure
> uneven PP by setting `--num_layers_in_first_pipeline_stage` and `--num_layers_in_last_pipeline_stage`, as sketched below.
> E.g., for Qwen3-8B with 36 layers and 8 GPUs, you can set both to 3 to get a 3-5-5-5-5-5-5-3 split of layers across GPUs.
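Below is an illustrative variant of the 2-GPU command above for this uneven-PP case; it assumes the same `prune_minitron.py` entry point and simply adds the two flags named in the tip:

```bash
# Illustrative sketch: prune Qwen3-8B on 8 GPUs with uneven pipeline parallelism
# (3-5-5-5-5-5-5-3 layers per stage); other flags mirror the 2-GPU example above.
torchrun --nproc_per_node 8 /opt/Model-Optimizer/examples/megatron_bridge/prune_minitron.py \
    --hf_model_name_or_path Qwen/Qwen3-8B \
    --prune_target_params 6e9 \
    --hparams_to_skip num_attention_heads \
    --num_layers_in_first_pipeline_stage 3 \
    --num_layers_in_last_pipeline_stage 3 \
    --output_hf_path /tmp/Qwen3-8B-Pruned-6B
```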

## Distillation

TODO

## Quantization

TODO

## Resources

- 📅 [Roadmap](https://github.com/NVIDIA/Model-Optimizer/issues/146)
- 📖 [Documentation](https://nvidia.github.io/Model-Optimizer)
- 💡 [Release Notes](https://nvidia.github.io/Model-Optimizer/reference/0_changelog.html)
- 🐛 [File a bug](https://github.com/NVIDIA/Model-Optimizer/issues/new?template=1_bug_report.md)
- [File a feature request](https://github.com/NVIDIA/Model-Optimizer/issues/new?template=2_feature_request.md)
