# Megatron Bridge

This directory contains examples of using Model Optimizer with the [NeMo Megatron-Bridge](https://github.com/NVIDIA-Nemo/Megatron-Bridge) framework for pruning, distillation, quantization, etc.

<div align="center">

| **Section** | **Description** | **Link** |
| :------------: | :------------: | :------------: |
| Pre-Requisites | Development environment setup | \[[Link](#pre-requisites)\] |
| Pruning | Pruning a model with the Minitron algorithm | \[[Link](#pruning)\] |
| Distillation | Distilling a pruned or quantized model | \[[Link](#distillation)\] |
| Quantization | Quantizing a model | \[[Link](#quantization)\] |
| Resources | Extra links to relevant resources | \[[Link](#resources)\] |

</div>

## Pre-Requisites

Running these examples requires several additional dependencies (e.g., Megatron-Bridge, Megatron-Core), so we strongly recommend using the NeMo container (e.g., `nvcr.io/nvidia/nemo:26.02`), which has all of them pre-installed.

To get the latest ModelOpt features and examples, mount your up-to-date ModelOpt clone into the container at `/opt/Model-Optimizer`, or pull the latest changes once inside the container (`cd /opt/Model-Optimizer && git checkout main && git pull`).

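As a sketch, a container launch with the repository mounted might look like the following. The image tag and mount target are from above; the GPU, workdir, and host-path flags are typical `docker run` options and `/path/to/Model-Optimizer` is a placeholder for your local clone, so adjust as needed:

```bash
# Launch the NeMo container with all GPUs visible and the local
# Model-Optimizer clone mounted over the in-container copy.
docker run --rm -it --gpus all \
    -v /path/to/Model-Optimizer:/opt/Model-Optimizer \
    -w /opt/Model-Optimizer \
    nvcr.io/nvidia/nemo:26.02 bash
```
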
## Pruning

This section shows how to prune a Hugging Face model with the Minitron algorithm in the Megatron-Bridge framework. Check out other available pruning algorithms, supported frameworks and models, and a general pruning getting-started guide in the [pruning README](../pruning/README.md).

Example usage: prune Qwen3-8B to 6B parameters on 2 GPUs (pipeline parallelism = 2) while skipping pruning of `num_attention_heads`, using the following defaults:

- 1024 samples from [`nemotron-post-training-dataset-v2`](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2) for calibration
- at most 20% of depth (`num_layers`) and 40% of width pruned per prunable hparam (`hidden_size`, `ffn_hidden_size`, ...)
- the top-10 candidate architectures evaluated on MMLU score (with 5% of the data sampled) to select the best model

```bash
torchrun --nproc_per_node 2 /opt/Model-Optimizer/examples/megatron_bridge/prune_minitron.py \
    --hf_model_name_or_path Qwen/Qwen3-8B \
    --prune_target_params 6e9 \
    --hparams_to_skip num_attention_heads \
    --output_hf_path /tmp/Qwen3-8B-Pruned-6B
```

To see the full usage and advanced configuration options, run:

```bash
python /opt/Model-Optimizer/examples/megatron_bridge/prune_minitron.py --help
```

> [!TIP]
> If the number of layers in the model is not divisible by the number of GPUs, i.e., the pipeline parallel (PP) size, you can configure
> uneven PP by setting `--num_layers_in_first_pipeline_stage` and `--num_layers_in_last_pipeline_stage`.
> E.g., for Qwen3-8B with 36 layers on 8 GPUs, setting both to 3 gives a 3-5-5-5-5-5-5-3 split of layers per GPU.

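The arithmetic behind that split can be sketched as follows. This assumes the convention implied above, where the first and last stages are pinned and the remaining layers are divided evenly across the middle stages; the variable names are illustrative and are not flags of the pruning script:

```bash
# Uneven PP: pin the first/last stages, split the rest evenly.
num_layers=36; pp_size=8; first=3; last=3
middle=$(( (num_layers - first - last) / (pp_size - 2) ))
echo "Each of the $(( pp_size - 2 )) middle stages gets $middle layers"
```
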
## Distillation

TODO

## Quantization

TODO

## Resources

- 📅 [Roadmap](https://github.com/NVIDIA/Model-Optimizer/issues/146)
- 📖 [Documentation](https://nvidia.github.io/Model-Optimizer)
- 💡 [Release Notes](https://nvidia.github.io/Model-Optimizer/reference/0_changelog.html)
- 🐛 [File a bug](https://github.com/NVIDIA/Model-Optimizer/issues/new?template=1_bug_report.md)
- ✨ [File a Feature Request](https://github.com/NVIDIA/Model-Optimizer/issues/new?template=2_feature_request.md)