Merged
2 changes: 1 addition & 1 deletion CHANGELOG.rst
@@ -13,7 +13,7 @@ NVIDIA Model Optimizer Changelog (Linux)
- Add standalone type inference option (``--use_standalone_type_inference``) in ONNX AutoCast as an alternative to ONNX's ``infer_shapes``. This experimental feature performs type-only inference without shape inference, useful as a workaround when shape inference fails or to avoid unnecessary shape inference overhead.
- Add support for Kimi K2 Thinking model quantization from the original int4 checkpoint.
- Add support for ``params`` constraint based automatic neural architecture search in Minitron pruning (``mcore_minitron``) as an alternative to manual pruning (using ``export_config``). See `examples/pruning/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/pruning>`_ for more details on its usage.
- New example for Minitron pruning with Megatron-Bridge framework along with advanced pruning usage with new ``params`` constraint based pruning. Check `examples/megatron_bridge/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/megatron_bridge>`_ for example scripts.
- New example for Minitron pruning with Megatron-Bridge framework along with advanced pruning usage with new ``params`` constraint based pruning. Also add example for distillation with Megatron-Bridge framework. Check `examples/megatron_bridge/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/megatron_bridge>`_ for example scripts.
- Add support for calibration data with multiple samples in ``npz`` format in the ONNX Autocast workflow.
- Add ``--opset`` option to ONNX quantization CLI to specify the target opset version for the quantized model.
- Add support for context parallelism in Eagle speculative decoding for Hugging Face and Megatron Core models.
147 changes: 135 additions & 12 deletions examples/megatron_bridge/README.md
@@ -4,21 +4,47 @@

This directory contains examples of using Model Optimizer with [NeMo Megatron-Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge).

<div align="center">

| **Section** | **Description** | **Link** | **Docs** |
| :------------: | :------------: | :------------: | :------------: |
| Pre-Requisites | Development environment setup | \[[Link](#pre-requisites)\] | |
| Pruning | Examples of pruning a model using Minitron algorithm | \[[Link](#pruning)\] | |
| Distillation | Examples of distillation a pruned or quantized model | \[[Link](#distillation)\] | |
| Quantization | Examples of quantizing a model | \[[Link](#quantization)\] | |
| Resources | Extra links to relevant resources | \[[Link](#resources)\] | |
| **Section** | **Description** | **Link** |
| :------------: | :------------: | :------------: |
| Pre-Requisites | Development environment setup | \[[Link](#pre-requisites)\] |
| Pruning | Examples of pruning a model using Minitron algorithm | \[[Link](#pruning)\] |
| Distillation | Examples of distilling a pruned or quantized model | \[[Link](#distillation)\] |
| Quantization | Examples of quantizing a model | \[[Link](#quantization)\] |
| Resources | Extra links to relevant resources | \[[Link](#resources)\] |

</div>

## Pre-Requisites

Running these examples requires several additional dependencies (e.g., Megatron-Bridge, Megatron-Core), so we strongly recommend using the NeMo container (e.g., `nvcr.io/nvidia/nemo:26.02`), which has all of them preinstalled.

To get the latest ModelOpt features and examples, you can mount your latest ModelOpt cloned repository to the container at `/opt/Megatron-Bridge/3rdparty/Model-Optimizer` or pull the latest changes once inside the docker container (`cd /opt/Megatron-Bridge/3rdparty/Model-Optimizer && git checkout main && git pull`).
To get the latest ModelOpt features and example scripts, mount your Model-Optimizer repo into the container.

```bash
export MODELOPT_DIR=${PWD}/Model-Optimizer # or set to your local Model-Optimizer repository path if you have cloned it
if [ ! -d "${MODELOPT_DIR}" ]; then
  git clone https://github.com/NVIDIA/Model-Optimizer.git ${MODELOPT_DIR}
fi

export DOCKER_IMAGE=nvcr.io/nvidia/nemo:26.02
# Mounting modelopt over the venv copy keeps the installed library in sync with the
# example scripts, so an older installed modelopt is never mixed with examples from main.
docker run \
  --gpus all \
  --shm-size=16GB \
  --net=host \
  --ulimit memlock=-1 \
  --rm -it \
  -v ${MODELOPT_DIR}:/opt/Model-Optimizer \
  -v ${MODELOPT_DIR}/modelopt:/opt/venv/lib/python3.12/site-packages/modelopt \
  -w /opt/Model-Optimizer/examples/megatron_bridge \
  ${DOCKER_IMAGE} bash
```

Once inside the container, log in with your Hugging Face token to download gated datasets and models.
Note that the default dataset for pruning and quantization is [`nemotron-post-training-dataset-v2`](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2), which is gated.

```bash
huggingface-cli login --token <your token>
```

## Pruning

@@ -30,7 +56,8 @@

Example usage to prune Qwen3-8B to 6B on 2 GPUs (Pipeline Parallelism = 2), where the
top-10 candidates are evaluated for MMLU score (5% sampled data) to select the best model.

```bash
torchrun --nproc_per_node 2 /opt/Megatron-Bridge/3rdparty/Model-Optimizer/examples/megatron_bridge/prune_minitron.py \
torchrun --nproc_per_node 2 prune_minitron.py \
  --pp_size 2 \
  --hf_model_name_or_path Qwen/Qwen3-8B \
  --prune_target_params 6e9 \
  --hparams_to_skip num_attention_heads \
```

@@ -41,7 +68,8 @@

Example usage for manually pruning to a specific architecture, using
1024 samples from [`nemotron-post-training-dataset-v2`](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2) for calibration.

```bash
torchrun --nproc_per_node 2 /opt/Megatron-Bridge/3rdparty/Model-Optimizer/examples/megatron_bridge/prune_minitron.py \
torchrun --nproc_per_node 2 prune_minitron.py \
  --pp_size 2 \
  --hf_model_name_or_path Qwen/Qwen3-8B \
  --prune_export_config '{"hidden_size": 3584, "ffn_hidden_size": 9216}' \
  --output_hf_path /tmp/Qwen3-8B-Pruned-6B-manual
```

@@ -50,7 +78,7 @@
To see the full usage for advanced configurations, run:

```bash
python /opt/Megatron-Bridge/3rdparty/Model-Optimizer/examples/megatron_bridge/prune_minitron.py --help
torchrun --nproc_per_node 1 prune_minitron.py --help
```
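As a rough sanity check on a manual `--prune_export_config`, you can estimate a dense transformer's parameter count from its dimensions. This simplified formula is an illustration only (not how ModelOpt counts params): it ignores GQA, norms, and biases, so it overestimates, and the layer and vocab numbers below are assumed values for Qwen3-8B.

```python
def approx_transformer_params(num_layers, hidden, ffn_hidden, vocab=151_936):
    """Very rough dense-transformer size: attention (4*h*h) plus gated MLP
    (3*h*ffn) per layer, plus input embedding and LM head. Ignores
    GQA/norms/biases, so it overestimates."""
    per_layer = 4 * hidden * hidden + 3 * hidden * ffn_hidden
    return num_layers * per_layer + 2 * vocab * hidden

# Assumed 36 layers and vocab 151,936, with the manually pruned dims
# hidden_size=3584, ffn_hidden_size=9216 from the example above.
print(f"{approx_transformer_params(36, 3584, 9216) / 1e9:.1f}B")  # → 6.5B
```

The estimate lands in the same ballpark as the `--prune_target_params 6e9` used by the automatic search.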

> [!TIP]
@@ -60,7 +88,102 @@

## Distillation

TODO
This section shows how to distill a student model from a teacher model in the Megatron-Bridge framework.

This can be used stand-alone, or after pruning (see [Pruning](#pruning)) or quantization (see [Quantization](#quantization)) to recover the model's accuracy by distilling from the original model (the teacher).

The [distill.py](distill.py) script loads student and teacher models from HuggingFace checkpoints and saves the distilled model to `<output_dir>/checkpoints` in Megatron distributed checkpoint format.

### Data Preparation

The distillation script expects pre-tokenized data in Megatron's binary format (`.bin` / `.idx` files).
You can tokenize your JSONL dataset using the following function:

```python
from modelopt.torch.utils.plugins import megatron_preprocess_data

megatron_preprocess_data(
    input_path="/path/to/your/data.jsonl",
    output_dir="/path/to/tokenized/data",
    tokenizer_name_or_path="Qwen/Qwen3-0.6B",
    json_keys=["text"],  # change to your JSON key if needed
    workers=32,
    log_interval=100000,
    max_sequence_length=256000,  # avoid rare OOM errors when a text is very long
)
```

If you have multiple JSONL files, you can tokenize them one by one and pass all the paths to the `--data_paths` argument.
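For instance, a tiny helper (hypothetical, not part of the repo) can assemble the weight-then-prefix pairs that `--data_paths` expects, as in the distillation command below; the shard names here are illustrative:

```python
def build_data_paths(prefixes, weight=1.0):
    """Interleave a blend weight with each tokenized-data prefix for --data_paths."""
    args = []
    for prefix in prefixes:
        args += [str(weight), str(prefix)]
    return args

print(" ".join(build_data_paths([
    "/data/tokenized/shard0_text_document",
    "/data/tokenized/shard1_text_document",
])))
# → 1.0 /data/tokenized/shard0_text_document 1.0 /data/tokenized/shard1_text_document
```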

### Distillation with Real Data

Example usage to distill a 4B student (HF) from an 8B teacher (HF) on 8 GPUs (TP=8, PP=1):

```bash
torchrun --nnodes 1 --nproc_per_node 8 distill.py \
  --tp_size 8 \
  --teacher_hf_path Qwen/Qwen3-8B \
  --student_hf_path Qwen/Qwen3-4B \
  --data_paths 1.0 /path/to/tokenized/data \
  --data_path_to_cache /path/to/cache/dataset_indices_qwen3 \
  --seq_length 8192 \
  --mbs 1 \
  --gbs 768 \
  --train_iters 15000 \
  --lr 1e-4 \
  --min_lr 1e-5 \
  --lr_warmup_iters 50 \
  --eval_interval 100 \
  --eval_iters 32 \
  --log_interval 10 \
  --output_dir /output/qwen3_8b_to_4b_distill
```
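The batch settings above fix the total token budget; a quick back-of-the-envelope check with the values from the command:

```python
seq_length, gbs, train_iters = 8192, 768, 15000

tokens_per_iter = seq_length * gbs          # tokens consumed per optimizer step
total_tokens = tokens_per_iter * train_iters

print(f"{tokens_per_iter:,} tokens/iter, {total_tokens / 1e9:.1f}B total")
# → 6,291,456 tokens/iter, 94.4B total
```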

Tensorboard logging is enabled by default and logs are saved to `<output_dir>/tensorboard` directory.
To use Weights & Biases for logging, set the `WANDB_API_KEY` environment variable and pass the `--wandb_project` argument.
Optionally, pass `--wandb_entity` and `--wandb_exp_name` to set the entity and experiment name for the run.

To see all available arguments:

```bash
torchrun --nproc_per_node 1 distill.py --help
```

### Quick Test with Mock Data

Example usage with mock data for quick testing (no pre-tokenized data needed):

```bash
torchrun --nproc_per_node 8 distill.py \
  --tp_size 8 \
  --teacher_hf_path Qwen/Qwen3-0.6B \
  --student_hf_path Qwen/Qwen3-0.6B \
  --use_mock_data \
  --seq_length 512 \
  --mbs 1 \
  --gbs 8 \
  --train_iters 100 \
  --eval_interval 10 \
  --eval_iters 4 \
  --output_dir /tmp/test_distill
```

### Slurm Usage

To run the distillation script on a Slurm cluster for multi-node training, use `python` instead of `torchrun` and set the number of nodes with `#SBATCH --nodes=<num_nodes>` in your Slurm script.

### Convert Megatron checkpoint to Hugging Face format

To convert the Megatron checkpoint from the last (or any intermediate) iteration to Hugging Face format, you need the pruned model config (the `--output_hf_path` from the `prune_minitron.py` script) and the distilled Megatron checkpoint directory (`<distill_output_dir>/checkpoints/iter_<iter_number>`):

```bash
# uv ships preinstalled in the NeMo container
uv run python /opt/Megatron-Bridge/examples/conversion/convert_checkpoints.py export \
  --hf-model <path_to_pruned_hf_ckpt> \
  --megatron-path <distill_output_dir>/checkpoints/iter_<iter_number> \
  --hf-path <path_to_save_distilled_hf_ckpt>
```

For more details, you can refer to the checkpoint conversion scripts in the [Megatron-Bridge README](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/examples/conversion).
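Since each saved iteration lives in its own `iter_<N>` directory, a small helper (hypothetical, not part of the repo) can locate the latest one to plug into `--megatron-path`:

```python
from pathlib import Path

def latest_iter_checkpoint(distill_output_dir):
    """Return the iter_<N> checkpoint dir with the highest N, or None if none exist."""
    ckpts = Path(distill_output_dir, "checkpoints").glob("iter_*")
    return max(ckpts, key=lambda p: int(p.name.split("_")[-1]), default=None)
```

For example, `latest_iter_checkpoint("/output/qwen3_8b_to_4b_distill")` would pick the final checkpoint from the distillation run above.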

## Quantization
