
Commit 8857025

Add Megatron-Bridge recipe-free distillation example script (#861)
## What does this PR do?

**Type of change:** New example script

- [x] M-Bridge recipe-free distillation script so it is easier to run and can support pruned models
- [x] Fix resuming a distillation run

## Usage

```bash
torchrun --nproc_per_node 8 distill.py \
    --teacher_hf_path Qwen/Qwen3-8B \
    --student_hf_path Qwen3-8B-NAS-Pruned-6B \
    --tp_size 8 \
    --data_paths <climbmix 25% tokenized (~90B tokens)> \
    --data_path_to_cache /path/to/cache/climbmix_dataset_indices_qwen3 \
    --seq_length 4096 \
    --mbs 8 \
    --gbs 768 \
    --train_iters 28500 \
    --lr 1e-4 \
    --min_lr 1e-5 \
    --lr_warmup_iters 100 \
    --eval_interval 500 \
    --eval_iters 32 \
    --log_interval 10 \
    --output_dir qwen3_8b_6b_mbridge_distill
```

## Testing

- [x] Re-ran Qwen3 8B -> 6B experiments and compared with NeMo 2 results from the blog

Best subnet from NAS: `{'num_layers': 30, 'hidden_size': 3584, 'ffn_hidden_size': 11776}` -> 5.99B params, 0.5718 score

| Model | MMLU | GSM8K (flexible, strict) | MBPP (coding) |
| ------- | ------ | ------- | ------- |
| Qwen3-8B | 74.9 | 87.5, 84.6 | 65.4 |
| Qwen3-8B-Pruned-6B | 57.6 | 11.6, 10.0 | 4.8 |
| Qwen3-8B-Pruned-6B (distilled for 16k steps, i.e. 50B tokens, ~3k GPU hours) | 71.6 | 78.0, 64.7 | 43.4 |
| Qwen3-8B-Pruned-6B (distilled for 28.5k steps, i.e. 90B tokens, ~5.2k GPU hours) | 71.9 | 78.1, 64.8 | 44.2 |
| Qwen3-4B | 70.0 | 81.1, 84.7 | 62.8 |

Previous NeMo 2 experiments on depth-pruned Qwen3 8B -> 6B (24 layers) had MMLU ~72.0, so results are roughly on par. No hyperparameter tuning was done for the current M-Bridge distillation run.

- [ ] (Separate PR) GitHub CI/CD test for the example script with the NeMo 26.02 container

## Before your PR is "*Ready for review*"

- **Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)** and your commits are signed.
- **Is this change backward compatible?**: Yes
- **Did you write any new necessary tests?**: N/A
- **Did you add or update any necessary documentation?**: Yes
- **Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?**: Yes

## Summary by CodeRabbit

* **New Features**
  * Added a complete distillation workflow and example for Megatron-Bridge optimization.
* **Documentation**
  * Enhanced setup guide with Docker workflows, data preparation steps, and detailed distillation instructions.
  * Improved usage documentation and help references.
* **Improvements**
  * Better data preprocessing output with human-readable formatting for metrics.

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
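For context on what the distillation run above optimizes: logit distillation trains the student to match the teacher's next-token distribution, typically via a KL-divergence term between the two softmax outputs. The sketch below is a minimal, framework-free illustration of that loss for a single token position; it is not the actual implementation in `distill.py` or Megatron-Bridge.

```python
import math

def softmax(logits, temperature=1.0):
    """Logits -> probabilities at a given softmax temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """Forward KL(teacher || student) over one token's vocabulary."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# A student that matches the teacher exactly incurs zero loss;
# a disagreeing student incurs a positive loss.
print(distillation_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1]))      # 0.0
print(distillation_loss([2.0, 1.0, 0.1], [0.1, 1.0, 2.0]) > 0)  # True
```

In practice frameworks compute this over every token position in the batch (and often mix in a standard cross-entropy term), but the per-position objective is the same.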
1 parent 6ec5135 commit 8857025

File tree

5 files changed: +408 −22 lines

CHANGELOG.rst

Lines changed: 1 addition & 1 deletion
```diff
@@ -13,7 +13,7 @@ NVIDIA Model Optimizer Changelog (Linux)
 - Add standalone type inference option (``--use_standalone_type_inference``) in ONNX AutoCast as an alternative to ONNX's ``infer_shapes``. This experimental feature performs type-only inference without shape inference, useful as a workaround when shape inference fails or to avoid unnecessary shape inference overhead.
 - Add support for Kimi K2 Thinking model quantization from the original int4 checkpoint.
 - Add support for ``params`` constraint based automatic neural architecture search in Minitron pruning (``mcore_minitron``) as an alternative to manual pruning (using ``export_config``). See `examples/pruning/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/pruning>`_ for more details on its usage.
-- New example for Minitron pruning with Megatron-Bridge framework along with advanced pruning usage with new ``params`` constraint based pruning. Check `examples/megatron_bridge/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/megatron_bridge>`_ for example scripts.
+- New example for Minitron pruning with Megatron-Bridge framework along with advanced pruning usage with new ``params`` constraint based pruning. Also add example for distillation with Megatron-Bridge framework. Check `examples/megatron_bridge/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/megatron_bridge>`_ for example scripts.
 - Add support for calibration data with multiple samples in ``npz`` format in the ONNX Autocast workflow.
 - Add ``--opset`` option to ONNX quantization CLI to specify the target opset version for the quantized model.
 - Add support for context parallelism in Eagle speculative decoding for huggingface and megatron core models.
```

examples/megatron_bridge/README.md

Lines changed: 135 additions & 12 deletions
````diff
@@ -4,21 +4,47 @@ This directory contains examples of using Model Optimizer with [NeMo Megatron-Br
 
 <div align="center">
 
-| **Section** | **Description** | **Link** | **Docs** |
-| :------------: | :------------: | :------------: | :------------: |
-| Pre-Requisites | Development environment setup | \[[Link](#pre-requisites)\] | |
-| Pruning | Examples of pruning a model using Minitron algorithm | \[[Link](#pruning)\] | |
-| Distillation | Examples of distillation a pruned or quantized model | \[[Link](#distillation)\] | |
-| Quantization | Examples of quantizing a model | \[[Link](#quantization)\] | |
-| Resources | Extra links to relevant resources | \[[Link](#resources)\] | |
+| **Section** | **Description** | **Link** |
+| :------------: | :------------: | :------------: |
+| Pre-Requisites | Development environment setup | \[[Link](#pre-requisites)\] |
+| Pruning | Examples of pruning a model using Minitron algorithm | \[[Link](#pruning)\] |
+| Distillation | Examples of distilling a pruned or quantized model | \[[Link](#distillation)\] |
+| Quantization | Examples of quantizing a model | \[[Link](#quantization)\] |
+| Resources | Extra links to relevant resources | \[[Link](#resources)\] |
 
 </div>
 
 ## Pre-Requisites
 
 Running these examples requires many additional dependencies to be installed (e.g., Megatron-Bridge, Megatron-core, etc.), hence we strongly recommend directly using the NeMo container (e.g., `nvcr.io/nvidia/nemo:26.02`) which has all the dependencies installed.
 
-To get the latest ModelOpt features and examples, you can mount your latest ModelOpt cloned repository to the container at `/opt/Megatron-Bridge/3rdparty/Model-Optimizer` or pull the latest changes once inside the docker container (`cd /opt/Megatron-Bridge/3rdparty/Model-Optimizer && git checkout main && git pull`).
+To get the latest ModelOpt features and example scripts, mount your Model-Optimizer repo to the container.
+
+```bash
+export MODELOPT_DIR=${PWD}/Model-Optimizer  # or set to your local Model-Optimizer repository path if you have cloned it
+if [ ! -d "${MODELOPT_DIR}" ]; then
+    git clone https://github.com/NVIDIA/Model-Optimizer.git ${MODELOPT_DIR}
+fi
+
+export DOCKER_IMAGE=nvcr.io/nvidia/nemo:26.02
+docker run \
+    --gpus all \
+    --shm-size=16GB \
+    --net=host \
+    --ulimit memlock=-1 \
+    --rm -it \
+    -v ${MODELOPT_DIR}:/opt/Model-Optimizer \
+    -v ${MODELOPT_DIR}/modelopt:/opt/venv/lib/python3.12/site-packages/modelopt \
+    -w /opt/Model-Optimizer/examples/megatron_bridge \
+    ${DOCKER_IMAGE} bash
+```
+
+Once inside the container, you need to log in with your HuggingFace token to download gated datasets / models.
+Note that the default dataset for pruning and quantization is [`nemotron-post-training-dataset-v2`](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2), which is gated.
+
+```bash
+huggingface-cli login --token <your token>
+```
 
 ## Pruning
 
````
````diff
@@ -30,7 +56,8 @@ Example usage to prune Qwen3-8B to 6B on 2-GPUs (Pipeline Parallelism = 2) while
 top-10 candidates are evaluated for MMLU score (5% sampled data) to select the best model.
 
 ```bash
-torchrun --nproc_per_node 2 /opt/Megatron-Bridge/3rdparty/Model-Optimizer/examples/megatron_bridge/prune_minitron.py \
+torchrun --nproc_per_node 2 prune_minitron.py \
+    --pp_size 2 \
     --hf_model_name_or_path Qwen/Qwen3-8B \
     --prune_target_params 6e9 \
     --hparams_to_skip num_attention_heads \
````
````diff
@@ -41,7 +68,8 @@ Example usage for manually pruning to a specific architecture using following de
 1024 samples from [`nemotron-post-training-dataset-v2`](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2) for calibration.
 
 ```bash
-torchrun --nproc_per_node 2 /opt/Megatron-Bridge/3rdparty/Model-Optimizer/examples/megatron_bridge/prune_minitron.py \
+torchrun --nproc_per_node 2 prune_minitron.py \
+    --pp_size 2 \
     --hf_model_name_or_path Qwen/Qwen3-8B \
     --prune_export_config '{"hidden_size": 3584, "ffn_hidden_size": 9216}' \
     --output_hf_path /tmp/Qwen3-8B-Pruned-6B-manual
````
````diff
@@ -50,7 +78,7 @@ torchrun --nproc_per_node 2 /opt/Megatron-Bridge/3rdparty/Model-Optimizer/exampl
 To see the full usage for advanced configurations, run:
 
 ```bash
-python /opt/Megatron-Bridge/3rdparty/Model-Optimizer/examples/megatron_bridge/prune_minitron.py --help
+torchrun --nproc_per_node 1 prune_minitron.py --help
 ```
 
 > [!TIP]
````
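The pruning commands in the diff above target either a parameter budget (`--prune_target_params 6e9`) or an explicit architecture. As a sanity check for such targets, a candidate's parameter count can be estimated from its hyperparameters alone. The sketch below is a back-of-the-envelope estimator, not part of the example scripts; the attention shape (32 query / 8 KV heads, head dimension 128), the 151936-token vocabulary, and untied embeddings are assumptions carried over from the Qwen3-8B teacher, and norms/biases are ignored.

```python
# Rough transformer parameter count from architecture hyperparameters, e.g.
# for the NAS-selected subnet reported in this PR's description:
# {'num_layers': 30, 'hidden_size': 3584, 'ffn_hidden_size': 11776}.

def transformer_params(num_layers, hidden, ffn_hidden,
                       n_q_heads=32, n_kv_heads=8, head_dim=128,
                       vocab=151936, tied_embeddings=False):
    attn = hidden * head_dim * (2 * n_q_heads + 2 * n_kv_heads)  # Q, O and K, V projections
    mlp = 3 * hidden * ffn_hidden                                # gate, up, down projections
    embed = vocab * hidden * (1 if tied_embeddings else 2)       # embeddings (+ LM head)
    return num_layers * (attn + mlp) + embed                     # norms/biases omitted

total = transformer_params(30, 3584, 11776)
print(f"{total / 1e9:.2f}B")  # 5.99B, matching the subnet size reported in the PR
```

With these assumptions the estimate lands on ~5.99B, consistent with the PR's reported best-subnet size, which is why such a closed-form count is a useful quick filter before running a full NAS evaluation.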
````diff
@@ -60,7 +88,102 @@ python /opt/Megatron-Bridge/3rdparty/Model-Optimizer/examples/megatron_bridge/pr
 
 ## Distillation
 
-TODO
+This section shows how to distill a student model from a teacher model in the Megatron-Bridge framework.
+
+This can be used stand-alone or after pruning (see [Pruning](#pruning)) / quantization (see [Quantization](#quantization)) to recover the accuracy of the model by distilling from the original model (teacher).
+
+The [distill.py](distill.py) script loads student and teacher models from HuggingFace checkpoints and saves the distilled model to `<output_dir>/checkpoints` in Megatron distributed checkpoint format.
+
+### Data Preparation
+
+The distillation script expects pre-tokenized data in Megatron's binary format (`.bin` / `.idx` files).
+You can tokenize your JSONL dataset using the following function:
+
+```python
+from modelopt.torch.utils.plugins import megatron_preprocess_data
+
+megatron_preprocess_data(
+    input_path="/path/to/your/data.jsonl",
+    output_dir="/path/to/tokenized/data",
+    tokenizer_name_or_path="Qwen/Qwen3-0.6B",
+    json_keys=["text"],  # change to your JSON key if needed
+    workers=32,
+    log_interval=100000,
+    max_sequence_length=256000,  # to avoid rare OOM errors if a text is too long
+)
+```
+
+If you have multiple JSONL files, you can tokenize them one by one and pass all the paths to the `--data_paths` argument.
+
+### Distillation with Real Data
+
+Example usage to distill a 4B student (HF) from an 8B teacher (HF) on 8 GPUs (TP=8, PP=1):
+
+```bash
+torchrun --nnodes 1 --nproc_per_node 8 distill.py \
+    --tp_size 8 \
+    --teacher_hf_path Qwen/Qwen3-8B \
+    --student_hf_path Qwen/Qwen3-4B \
+    --data_paths 1.0 /path/to/tokenized/data \
+    --data_path_to_cache /path/to/cache/dataset_indices_qwen3 \
+    --seq_length 8192 \
+    --mbs 1 \
+    --gbs 768 \
+    --train_iters 15000 \
+    --lr 1e-4 \
+    --min_lr 1e-5 \
+    --lr_warmup_iters 50 \
+    --eval_interval 100 \
+    --eval_iters 32 \
+    --log_interval 10 \
+    --output_dir /output/qwen3_8b_to_4b_distill
+```
+
+TensorBoard logging is enabled by default and logs are saved to the `<output_dir>/tensorboard` directory.
+To use Weights & Biases for logging, set the `WANDB_API_KEY` environment variable and pass the `--wandb_project` argument.
+Optionally, you can also pass the `--wandb_entity` and `--wandb_exp_name` arguments to group runs under a project and experiment name.
+
+To see all available arguments:
+
+```bash
+torchrun --nproc_per_node 1 distill.py --help
+```
+
+### Quick Test with Mock Data
+
+Example usage with mock data for quick testing (no pre-tokenized data needed):
+
+```bash
+torchrun --nproc_per_node 8 distill.py \
+    --tp_size 8 \
+    --teacher_hf_path Qwen/Qwen3-0.6B \
+    --student_hf_path Qwen/Qwen3-0.6B \
+    --use_mock_data \
+    --seq_length 512 \
+    --mbs 1 \
+    --gbs 8 \
+    --train_iters 100 \
+    --eval_interval 10 \
+    --eval_iters 4 \
+    --output_dir /tmp/test_distill
+```
+
+### Slurm Usage
+
+To run the distillation script on a Slurm cluster for multi-node training, you just need to use `python` instead of `torchrun` and set the number of nodes with an `#SBATCH --nodes=<num_nodes>` directive in your Slurm script.
+
+### Convert Megatron checkpoint to Hugging Face format
+
+To convert the Megatron checkpoint from the last iteration (or any intermediate iteration) to Hugging Face format, you need the pruned model config (`--output_hf_path` from the `prune_minitron.py` script) and the distilled Megatron checkpoint directory (`<distill_output_dir>/checkpoints/iter_<iter_number>`) to run the following command:
+
+```bash
+uv run python /opt/Megatron-Bridge/examples/conversion/convert_checkpoints.py export \
+    --hf-model <path_to_pruned_hf_ckpt> \
+    --megatron-path <distill_output_dir>/checkpoints/iter_<iter_number> \
+    --hf-path <path_to_save_distilled_hf_ckpt>
+```
+
+For more details, you can refer to the checkpoint conversion scripts in the [Megatron-Bridge README](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/examples/conversion).
 
 ## Quantization
 
````
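The Data Preparation step in the README diff above feeds `megatron_preprocess_data` a JSONL file with text stored under the keys listed in `json_keys`. As a quick illustration of that input format, the sketch below writes a toy two-line JSONL file and reads it back; the sentences and temp path are placeholders, not real training data.

```python
import json
import os
import tempfile

# Hypothetical toy corpus: one JSON object per line, text under the "text" key,
# matching the default json_keys=["text"] used in the README's example.
samples = [
    {"text": "Knowledge distillation transfers behavior from a teacher to a student."},
    {"text": "Pruned models can recover accuracy by distilling from the original model."},
]

path = os.path.join(tempfile.mkdtemp(), "data.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")

# Sanity check: the file round-trips line by line with the expected key.
with open(path, encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]
print(len(rows), sorted(rows[0]))  # 2 ['text']
```

A file in this shape can then be passed as `input_path` to the preprocessing call; for multiple files, tokenize each one and list all resulting paths under `--data_paths`.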