# Knowledge Distillation with Megatron-Bridge

This guide shows how to perform knowledge distillation on Puzzletron-compressed AnyModel checkpoints using Megatron-Bridge.

## Overview

1. Set up the environment with Megatron-Bridge
2. Convert the AnyModel checkpoints (student and teacher) to Megatron-Bridge format
3. Run knowledge distillation training

## Setup

> **Temporary Setup:** This manual setup is only required until the NeMo docker container includes Megatron-Bridge by default; once the container is updated, this step can be skipped.

**Note:** Set `$WORKSPACE` to your project root directory before running these commands:

```bash
export WORKSPACE=/path/to/your/project
```

1. **Clone Megatron-Bridge:**

   Clone [Megatron-Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge) into your workspace:

   ```bash
   cd $WORKSPACE
   git clone https://github.com/NVIDIA-NeMo/Megatron-Bridge.git
   ```

2. **Initialize Megatron-Bridge submodules:**

   ```bash
   cd $WORKSPACE/Megatron-Bridge
   git submodule init
   git submodule update
   ```
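
   Equivalently, the two submodule commands can be combined into one standard git invocation (`--recursive` also covers any nested submodules):

   ```bash
   cd $WORKSPACE/Megatron-Bridge
   git submodule update --init --recursive
   ```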

3. **Start Docker container with mounts:**

   Use the [NeMo 25.11 container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo?version=25.11):

   ```bash
   docker run --gpus all -it --rm \
     -v $WORKSPACE:/workspace \
     -v $WORKSPACE/Megatron-Bridge/3rdparty/Megatron-LM:/opt/megatron-lm \
     nvcr.io/nvidia/nemo:25.11 \
     /bin/bash
   ```

   **Note:** The `/opt/megatron-lm` mount is required because Megatron-Bridge depends on the Megatron-LM submodule.
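
   As a quick sanity check once the container is running, confirm that both mounts are visible (plain `ls`, nothing specific to Megatron-Bridge):

   ```bash
   ls /workspace/Megatron-Bridge
   ls /opt/megatron-lm
   ```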

4. **Set up the environment inside the container:**

   ```bash
   export PYTHONPATH="/workspace/Megatron-Bridge/src:/workspace/Model-Optimizer:${PYTHONPATH}"
   ```
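
   To verify the paths are picked up, a minimal import check helps; this assumes the packages are importable as `megatron.bridge` and `modelopt` (adjust the names if your checkout differs):

   ```bash
   python -c "import megatron.bridge, modelopt; print('imports OK')"
   ```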

## Step 1: Convert Checkpoints to Megatron-Bridge Format

Convert both the student and teacher checkpoints:

```bash
# Convert student checkpoint
torchrun --nproc_per_node=1 examples/puzzletron/mbridge_distillation/import_anymodel_to_mbridge.py \
  --input-ckpt-path /path/to/student/anymodel/checkpoint \
  --output-ckpt-path /path/to/student/mbridge/checkpoint

# Convert teacher checkpoint
torchrun --nproc_per_node=1 examples/puzzletron/mbridge_distillation/import_anymodel_to_mbridge.py \
  --input-ckpt-path /path/to/teacher/anymodel/checkpoint \
  --output-ckpt-path /path/to/teacher/mbridge/checkpoint
```
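
If the conversion succeeds, each output path should contain an iteration subdirectory; Step 2 below points at `iter_0000000`. A quick listing confirms this before moving on:

```bash
# Expect an iteration directory such as iter_0000000 in each output path.
ls /path/to/student/mbridge/checkpoint
ls /path/to/teacher/mbridge/checkpoint
```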

## Step 2: Run Knowledge Distillation

Run distillation with a tokenized dataset:

```bash
torchrun --nproc_per_node=8 examples/puzzletron/mbridge_distillation/distill_anymodel.py \
  --student-mbridge-ckpt /path/to/student/mbridge/checkpoint/iter_0000000 \
  --teacher-mbridge-ckpt /path/to/teacher/mbridge/checkpoint/iter_0000000 \
  --data-path /path/to/tokenized/dataset \
  --output-dir ./distilled_output \
  dataset.sequence_length=8192 \
  model.tensor_model_parallel_size=8 \
  model.teacher.tensor_model_parallel_size=8 \
  train.global_batch_size=4 \
  train.micro_batch_size=1 \
  train.train_iters=5000 \
  logger.log_interval=1
```
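
With all 8 GPUs consumed by `tensor_model_parallel_size=8`, the data-parallel size is 1, so the global batch of 4 is reached through gradient accumulation. If you change the GPU count or parallelism, keep the batch settings consistent; the sketch below applies standard Megatron batch semantics and is not specific to this script:

```bash
# Data-parallel size and gradient-accumulation steps for the settings above.
GPUS=8 TP=8 GBS=4 MBS=1
DP=$((GPUS / TP))                               # -> 1
echo "grad-accum steps: $((GBS / (MBS * DP)))"  # -> 4; GBS must divide evenly
```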

The distilled checkpoint is saved to the directory given by `--output-dir`.