Merged
2 changes: 1 addition & 1 deletion CHANGELOG.rst
@@ -13,7 +13,7 @@ NVIDIA Model Optimizer Changelog (Linux)
- Add standalone type inference option (``--use_standalone_type_inference``) in ONNX AutoCast as an alternative to ONNX's ``infer_shapes``. This experimental feature performs type-only inference without shape inference, useful as a workaround when shape inference fails or to avoid unnecessary shape inference overhead.
- Add support for Kimi K2 Thinking model quantization from the original int4 checkpoint.
- Add support for ``params`` constraint based automatic neural architecture search in Minitron pruning (``mcore_minitron``) as an alternative to manual pruning (using ``export_config``). See `examples/pruning/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/pruning>`_ for more details on its usage.
- New example for Minitron pruning with Megatron-Bridge framework along with advanced pruning usage with new ``params`` constraint based pruning. Check `examples/megatron_bridge/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/megatron_bridge>`_ for example scripts.
- New example for Minitron pruning with Megatron-Bridge framework along with advanced pruning usage with new ``params`` constraint based pruning. Also add example for distillation with Megatron-Bridge framework. Check `examples/megatron_bridge/README.md <https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/megatron_bridge>`_ for example scripts.
- Add support for calibration data with multiple samples in ``npz`` format in the ONNX Autocast workflow.
- Add ``--opset`` option to ONNX quantization CLI to specify the target opset version for the quantized model.
- Add support for context parallelism in Eagle speculative decoding for Hugging Face and Megatron Core models.
147 changes: 135 additions & 12 deletions examples/megatron_bridge/README.md
@@ -4,21 +4,47 @@

This directory contains examples of using Model Optimizer with [NeMo Megatron-Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge).

<div align="center">

| **Section** | **Description** | **Link** | **Docs** |
| :------------: | :------------: | :------------: | :------------: |
| Pre-Requisites | Development environment setup | \[[Link](#pre-requisites)\] | |
| Pruning | Examples of pruning a model using Minitron algorithm | \[[Link](#pruning)\] | |
| Distillation | Examples of distillation a pruned or quantized model | \[[Link](#distillation)\] | |
| Quantization | Examples of quantizing a model | \[[Link](#quantization)\] | |
| Resources | Extra links to relevant resources | \[[Link](#resources)\] | |
| **Section** | **Description** | **Link** |
| :------------: | :------------: | :------------: |
| Pre-Requisites | Development environment setup | \[[Link](#pre-requisites)\] |
| Pruning | Examples of pruning a model using Minitron algorithm | \[[Link](#pruning)\] |
| Distillation | Examples of distilling a pruned or quantized model | \[[Link](#distillation)\] |
| Quantization | Examples of quantizing a model | \[[Link](#quantization)\] |
| Resources | Extra links to relevant resources | \[[Link](#resources)\] |

</div>

## Pre-Requisites

Running these examples requires several additional dependencies (e.g., Megatron-Bridge, Megatron-Core), so we strongly recommend using the NeMo container (e.g., `nvcr.io/nvidia/nemo:26.02`), which has all of them preinstalled.

To get the latest ModelOpt features and examples, you can mount your latest ModelOpt cloned repository to the container at `/opt/Megatron-Bridge/3rdparty/Model-Optimizer` or pull the latest changes once inside the docker container (`cd /opt/Megatron-Bridge/3rdparty/Model-Optimizer && git checkout main && git pull`).
To get the latest ModelOpt features and example scripts, mount your Model-Optimizer repo into the container.

```bash
export MODELOPT_DIR=${PWD}/Model-Optimizer # or set to your local Model-Optimizer repository path if you have cloned it
if [ ! -d "${MODELOPT_DIR}" ]; then
  git clone https://github.com/NVIDIA/Model-Optimizer.git ${MODELOPT_DIR}
fi

export DOCKER_IMAGE=nvcr.io/nvidia/nemo:26.02
# Mounting modelopt over the venv copy keeps the installed library in sync with the
# example scripts, so an older installed modelopt is never mixed with examples from main.
docker run \
  --gpus all \
  --shm-size=16GB \
  --net=host \
  --ulimit memlock=-1 \
  --rm -it \
  -v ${MODELOPT_DIR}:/opt/Model-Optimizer \
  -v ${MODELOPT_DIR}/modelopt:/opt/venv/lib/python3.12/site-packages/modelopt \
  -w /opt/Model-Optimizer/examples/megatron_bridge \
  ${DOCKER_IMAGE} bash
```

Once inside the container, log in with your Hugging Face token to download gated datasets and models.
Note that the default dataset for pruning and quantization is [`nemotron-post-training-dataset-v2`](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2), which is gated.

```bash
huggingface-cli login --token <your token>
```

## Pruning

@@ -30,7 +56,8 @@

Example usage to prune Qwen3-8B to 6B on 2 GPUs (Pipeline Parallelism = 2), where the
top-10 candidates are evaluated for MMLU score (5% sampled data) to select the best model.

```bash
torchrun --nproc_per_node 2 /opt/Megatron-Bridge/3rdparty/Model-Optimizer/examples/megatron_bridge/prune_minitron.py \
torchrun --nproc_per_node 2 prune_minitron.py \
  --pp_size 2 \
  --hf_model_name_or_path Qwen/Qwen3-8B \
  --prune_target_params 6e9 \
  --hparams_to_skip num_attention_heads \
```

@@ -41,7 +68,8 @@

Example usage for manually pruning to a specific architecture, using
1024 samples from [`nemotron-post-training-dataset-v2`](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2) for calibration.

```bash
torchrun --nproc_per_node 2 /opt/Megatron-Bridge/3rdparty/Model-Optimizer/examples/megatron_bridge/prune_minitron.py \
torchrun --nproc_per_node 2 prune_minitron.py \
  --pp_size 2 \
  --hf_model_name_or_path Qwen/Qwen3-8B \
  --prune_export_config '{"hidden_size": 3584, "ffn_hidden_size": 9216}' \
  --output_hf_path /tmp/Qwen3-8B-Pruned-6B-manual
```

@@ -50,7 +78,7 @@
To see the full usage for advanced configurations, run:

```bash
python /opt/Megatron-Bridge/3rdparty/Model-Optimizer/examples/megatron_bridge/prune_minitron.py --help
torchrun --nproc_per_node 1 prune_minitron.py --help
```
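As a rough sanity check on a manual `--prune_export_config`, you can estimate a dense transformer's parameter count from its dimensions. This simplified formula is an illustration only (not how ModelOpt counts params): it ignores GQA, norms, and biases, so it overestimates, and the layer and vocab numbers below are assumed values for Qwen3-8B.

```python
def approx_transformer_params(num_layers, hidden, ffn_hidden, vocab=151_936):
    """Very rough dense-transformer size: attention (4*h*h) plus gated MLP
    (3*h*ffn) per layer, plus input embedding and LM head. Ignores
    GQA/norms/biases, so it overestimates."""
    per_layer = 4 * hidden * hidden + 3 * hidden * ffn_hidden
    return num_layers * per_layer + 2 * vocab * hidden

# Assumed 36 layers and vocab 151,936, with the manually pruned dims
# hidden_size=3584, ffn_hidden_size=9216 from the example above.
print(f"{approx_transformer_params(36, 3584, 9216) / 1e9:.1f}B")  # → 6.5B
```

The estimate lands in the same ballpark as the `--prune_target_params 6e9` used by the automatic search.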

> [!TIP]
@@ -60,7 +88,102 @@

## Distillation

TODO
This section shows how to distill a student model from a teacher model in the Megatron-Bridge framework.

This can be used stand-alone, or after pruning (see [Pruning](#pruning)) or quantization (see [Quantization](#quantization)) to recover the model's accuracy by distilling from the original model (the teacher).

The [distill.py](distill.py) script loads student and teacher models from HuggingFace checkpoints and saves the distilled model to `<output_dir>/checkpoints` in Megatron distributed checkpoint format.

### Data Preparation

The distillation script expects pre-tokenized data in Megatron's binary format (`.bin` / `.idx` files).
You can tokenize your JSONL dataset using the following function:

```python
from modelopt.torch.utils.plugins import megatron_preprocess_data

megatron_preprocess_data(
    input_path="/path/to/your/data.jsonl",
    output_dir="/path/to/tokenized/data",
    tokenizer_name_or_path="Qwen/Qwen3-0.6B",
    json_keys=["text"],  # change to your JSON key if needed
    workers=32,
    log_interval=100000,
    max_sequence_length=256000,  # avoid rare OOM errors when a text is very long
)
```

If you have multiple JSONL files, you can tokenize them one by one and pass all the paths to the `--data_paths` argument.
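For instance, a tiny helper (hypothetical, not part of the repo) can assemble the weight-then-prefix pairs that `--data_paths` expects, as in the distillation command below; the shard names here are illustrative:

```python
def build_data_paths(prefixes, weight=1.0):
    """Interleave a blend weight with each tokenized-data prefix for --data_paths."""
    args = []
    for prefix in prefixes:
        args += [str(weight), str(prefix)]
    return args

print(" ".join(build_data_paths([
    "/data/tokenized/shard0_text_document",
    "/data/tokenized/shard1_text_document",
])))
# → 1.0 /data/tokenized/shard0_text_document 1.0 /data/tokenized/shard1_text_document
```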

### Distillation with Real Data

Example usage to distill a 4B student (HF) from an 8B teacher (HF) on 8 GPUs (TP=8, PP=1):

```bash
torchrun --nnodes 1 --nproc_per_node 8 distill.py \
  --tp_size 8 \
  --teacher_hf_path Qwen/Qwen3-8B \
  --student_hf_path Qwen/Qwen3-4B \
  --data_paths 1.0 /path/to/tokenized/data \
  --data_path_to_cache /path/to/cache/dataset_indices_qwen3 \
  --seq_length 8192 \
  --mbs 1 \
  --gbs 768 \
  --train_iters 15000 \
  --lr 1e-4 \
  --min_lr 1e-5 \
  --lr_warmup_iters 50 \
  --eval_interval 100 \
  --eval_iters 32 \
  --log_interval 10 \
  --output_dir /output/qwen3_8b_to_4b_distill
```
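The batch settings above fix the total token budget; a quick back-of-the-envelope check with the values from the command:

```python
seq_length, gbs, train_iters = 8192, 768, 15000

tokens_per_iter = seq_length * gbs          # tokens consumed per optimizer step
total_tokens = tokens_per_iter * train_iters

print(f"{tokens_per_iter:,} tokens/iter, {total_tokens / 1e9:.1f}B total")
# → 6,291,456 tokens/iter, 94.4B total
```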

Tensorboard logging is enabled by default and logs are saved to `<output_dir>/tensorboard` directory.
To use Weights & Biases for logging, set the `WANDB_API_KEY` environment variable and pass the `--wandb_project` argument.
Optionally, pass `--wandb_entity` and `--wandb_exp_name` to set the entity and experiment name for the run.

To see all available arguments:

```bash
torchrun --nproc_per_node 1 distill.py --help
```

### Quick Test with Mock Data

Example usage with mock data for quick testing (no pre-tokenized data needed):

```bash
torchrun --nproc_per_node 8 distill.py \
  --tp_size 8 \
  --teacher_hf_path Qwen/Qwen3-0.6B \
  --student_hf_path Qwen/Qwen3-0.6B \
  --use_mock_data \
  --seq_length 512 \
  --mbs 1 \
  --gbs 8 \
  --train_iters 100 \
  --eval_interval 10 \
  --eval_iters 4 \
  --output_dir /tmp/test_distill
```

### Slurm Usage

To run the distillation script on a Slurm cluster for multi-node training, use `python` instead of `torchrun` and set the number of nodes with `#SBATCH --nodes=<num_nodes>` in your Slurm script.

### Convert Megatron checkpoint to Hugging Face format

To convert the Megatron checkpoint from the last (or any intermediate) iteration to Hugging Face format, you need the pruned model config (the `--output_hf_path` from the `prune_minitron.py` script) and the distilled Megatron checkpoint directory (`<distill_output_dir>/checkpoints/iter_<iter_number>`):

```bash
# uv ships preinstalled in the NeMo container
uv run python /opt/Megatron-Bridge/examples/conversion/convert_checkpoints.py export \
  --hf-model <path_to_pruned_hf_ckpt> \
  --megatron-path <distill_output_dir>/checkpoints/iter_<iter_number> \
  --hf-path <path_to_save_distilled_hf_ckpt>
```

For more details, you can refer to the checkpoint conversion scripts in the [Megatron-Bridge README](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/examples/conversion).
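Since each saved iteration lives in its own `iter_<N>` directory, a small helper (hypothetical, not part of the repo) can locate the latest one to plug into `--megatron-path`:

```python
from pathlib import Path

def latest_iter_checkpoint(distill_output_dir):
    """Return the iter_<N> checkpoint dir with the highest N, or None if none exist."""
    ckpts = Path(distill_output_dir, "checkpoints").glob("iter_*")
    return max(ckpts, key=lambda p: int(p.name.split("_")[-1]), default=None)
```

For example, `latest_iter_checkpoint("/output/qwen3_8b_to_4b_distill")` would pick the final checkpoint from the distillation run above.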

## Quantization
