# Knowledge Distillation with Megatron-Bridge

This guide shows how to perform knowledge distillation on Puzzletron-compressed AnyModel checkpoints using Megatron-Bridge.

## Overview

1. Set up the environment with Megatron-Bridge
2. Convert the AnyModel checkpoints (student and teacher) to Megatron-Bridge format
3. Run knowledge distillation training

## Setup

> **Temporary Setup:** This manual setup is only required until the NeMo docker container includes Megatron-Bridge by default; once the container is updated, this step can be skipped.

**Note:** Set `$WORKSPACE` to your project root directory before running these commands:

```bash
export WORKSPACE=/path/to/your/project
```

1. **Clone Megatron-Bridge:**

   Clone [Megatron-Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge) into your workspace:

   ```bash
   cd $WORKSPACE
   git clone https://github.com/NVIDIA-NeMo/Megatron-Bridge.git
   ```

2. **Initialize Megatron-Bridge submodules:**

   ```bash
   cd $WORKSPACE/Megatron-Bridge
   git submodule init
   git submodule update
   ```
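
   Equivalently, the two submodule commands can be combined into one standard git invocation (`--recursive` also covers any nested submodules):

   ```bash
   cd $WORKSPACE/Megatron-Bridge
   git submodule update --init --recursive
   ```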

3. **Start Docker container with mounts:**

   Use the [NeMo 25.11 container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo?version=25.11):

   ```bash
   docker run --gpus all -it --rm \
     -v $WORKSPACE:/workspace \
     -v $WORKSPACE/Megatron-Bridge/3rdparty/Megatron-LM:/opt/megatron-lm \
     nvcr.io/nvidia/nemo:25.11 \
     /bin/bash
   ```

   **Note:** The `/opt/megatron-lm` mount is required because Megatron-Bridge depends on the Megatron-LM submodule.
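
   As a quick sanity check once the container is running, confirm that both mounts are visible (plain `ls`, nothing specific to Megatron-Bridge):

   ```bash
   ls /workspace/Megatron-Bridge
   ls /opt/megatron-lm
   ```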

4. **Set up the environment inside the container:**

   ```bash
   export PYTHONPATH="/workspace/Megatron-Bridge/src:/workspace/Model-Optimizer:${PYTHONPATH}"
   ```
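
   To verify the paths are picked up, a minimal import check helps; this assumes the packages are importable as `megatron.bridge` and `modelopt` (adjust the names if your checkout differs):

   ```bash
   python -c "import megatron.bridge, modelopt; print('imports OK')"
   ```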

## Step 1: Convert Checkpoints to Megatron-Bridge Format

Convert both the student and teacher checkpoints:

```bash
# Convert student checkpoint
torchrun --nproc_per_node=1 examples/puzzletron/mbridge_distillation/import_anymodel_to_mbridge.py \
  --input-ckpt-path /path/to/student/anymodel/checkpoint \
  --output-ckpt-path /path/to/student/mbridge/checkpoint

# Convert teacher checkpoint
torchrun --nproc_per_node=1 examples/puzzletron/mbridge_distillation/import_anymodel_to_mbridge.py \
  --input-ckpt-path /path/to/teacher/anymodel/checkpoint \
  --output-ckpt-path /path/to/teacher/mbridge/checkpoint
```
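
If the conversion succeeds, each output path should contain an iteration subdirectory; Step 2 below points at `iter_0000000`. A quick listing confirms this before moving on:

```bash
# Expect an iteration directory such as iter_0000000 in each output path.
ls /path/to/student/mbridge/checkpoint
ls /path/to/teacher/mbridge/checkpoint
```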

## Step 2: Run Knowledge Distillation

Run distillation with a tokenized dataset:

```bash
torchrun --nproc_per_node=8 examples/puzzletron/mbridge_distillation/distill_anymodel.py \
  --student-mbridge-ckpt /path/to/student/mbridge/checkpoint/iter_0000000 \
  --teacher-mbridge-ckpt /path/to/teacher/mbridge/checkpoint/iter_0000000 \
  --data-path /path/to/tokenized/dataset \
  --output-dir ./distilled_output \
  dataset.sequence_length=8192 \
  model.tensor_model_parallel_size=8 \
  model.teacher.tensor_model_parallel_size=8 \
  train.global_batch_size=4 \
  train.micro_batch_size=1 \
  train.train_iters=5000 \
  logger.log_interval=1
```
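
With all 8 GPUs consumed by `tensor_model_parallel_size=8`, the data-parallel size is 1, so the global batch of 4 is reached through gradient accumulation. If you change the GPU count or parallelism, keep the batch settings consistent; the sketch below applies standard Megatron batch semantics and is not specific to this script:

```bash
# Data-parallel size and gradient-accumulation steps for the settings above.
GPUS=8 TP=8 GBS=4 MBS=1
DP=$((GPUS / TP))                               # -> 1
echo "grad-accum steps: $((GBS / (MBS * DP)))"  # -> 4; GBS must divide evenly
```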

The distilled checkpoint is saved to the directory given by `--output-dir`.