examples/megatron_bridge/README.md
Lines changed: 62 additions & 4 deletions
@@ -92,7 +92,7 @@ This section shows how to distill a student model from a teacher model in the Me
This can be used stand-alone or after [Pruning](#pruning) / [Post-Training Quantization](#post-training-quantization) to recover accuracy of the model by distilling from the original model (teacher).
-
The [distill.py](distill.py) script loads student and teacher models from HuggingFace checkpoints and saves the distilled model to `<output_dir>/checkpoints` in Megatron distributed checkpoint format.
+
The [distill.py](distill.py) script supports both standard HuggingFace checkpoints and [Puzzletron AnyModel](../puzzletron/README.md) checkpoints as student/teacher inputs. Pass the checkpoint path via `--student_hf_path` / `--teacher_hf_path`. The distilled model is saved to `<output_dir>/checkpoints` in Megatron distributed checkpoint format.
To run the distillation script on a Slurm cluster for multi-node training, use `python` instead of `torchrun` and set the number of nodes with the `#SBATCH --nodes=<num_nodes>` directive in your Slurm script.
-
### Convert Megatron checkpoint to Hugging Face format
+
### Converting to Hugging Face format (optional)
-
To convert the Megatron checkpoint from last iteration (or any intermediate iteration) to Hugging Face format, you need the pruned model config (`--output_hf_path` from `prune_minitron.py` script) and the distilled megatron checkpoint dir (`<distill_output_dir>/checkpoints/iter_<iter_number>`) to run the following command:
+
The distilled checkpoint is saved in Megatron distributed format. If you need a HuggingFace checkpoint, there are two ways to convert it:
+
+
**Inline** -- add `--hf_export_path` and `--student_hf_model` to the `distill.py` command to automatically convert the final checkpoint after distillation:
`--student_hf_model` should match the base architecture of the student (used as a template for export).
+
+
**Separate conversion** -- convert any saved iteration using the Megatron-Bridge conversion script:
```bash
uv run python /opt/Megatron-Bridge/examples/conversion/convert_checkpoints.py export \
@@ -205,7 +218,52 @@ uv run python /opt/Megatron-Bridge/examples/conversion/convert_checkpoints.py ex
--hf-path <path_to_save_distilled_hf_ckpt>
```
-
For more details, you can refer to the checkpoint conversion scripts in the [Megatron-Bridge README](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/examples/conversion).
+
For more details, see the [Megatron-Bridge conversion README](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/examples/conversion).
+
+
> **Known limitation:** HF export does not yet work for Puzzletron AnyModel (heterogeneous) checkpoints -- Megatron-Bridge cannot reload heterogeneous configs from saved checkpoints. Standard models export correctly with both methods.
+
+
### Distillation Results
+
+
The following MMLU results demonstrate knowledge distillation on student models that were first compressed using [Puzzletron](../puzzletron/README.md). The original (uncompressed) model serves as the teacher, and distillation recovers accuracy lost during compression.
+
+
#### Qwen3-8B compressed to 80% of original
+
+
The student was created by compressing Qwen3-8B to 80% of its original size using Puzzletron.
+
+
| Model | MMLU | Humanities | Other | Social Sciences | STEM |
MMLU accuracy improved from 59.10% to 69.21% (+10.11 pp) after distillation with just 100 iterations on WikiText-103, recovering 64% of the gap to the teacher model.
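
The arithmetic behind these figures can be sketched in a few lines of Python. This is an illustrative sketch only, not part of the repository: `pp_change` and `gap_recovered` are hypothetical helper names, and `PLACEHOLDER_TEACHER_MMLU` is an assumed value that should be replaced with the teacher's MMLU score from the results table.

```python
def pp_change(before: float, after: float) -> float:
    """Change in accuracy, in percentage points (rounded to 2 decimals)."""
    return round(after - before, 2)


def gap_recovered(student: float, distilled: float, teacher: float) -> float:
    """Fraction of the student-to-teacher accuracy gap closed by distillation."""
    gap = teacher - student
    if gap <= 0:
        raise ValueError("teacher score must exceed the pre-distillation student score")
    return (distilled - student) / gap


# Student scores before and after distillation, as reported for Qwen3-8B.
print(pp_change(59.10, 69.21))  # 10.11

# Assumption: placeholder teacher score, for illustration only.
PLACEHOLDER_TEACHER_MMLU = 75.0
print(f"{gap_recovered(59.10, 69.21, PLACEHOLDER_TEACHER_MMLU):.0%}")
```

A negative `pp_change` indicates a regression, and `gap_recovered` then returns a negative fraction.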
+
+
#### Llama-3.1-8B-Instruct compressed to 50% of original
+
+
The student was created by compressing Llama-3.1-8B-Instruct to 50% of its original size using Puzzletron.
+
+
| Model | MMLU | Humanities | Other | Social Sciences | STEM |
#### Llama-3.1-8B-Instruct compressed to 69% of original (regression)
+
+
The student was created by compressing Llama-3.1-8B-Instruct to ~69% of its original size using Puzzletron. This example shows regression due to overfitting on the small WikiText-103 dataset (100 iterations). MMLU was evaluated on a subset of 100 samples per task:
+
+
| Model | MMLU | Humanities | Other | Social Sciences | STEM |
MMLU decreased from 66.26% to 64.96% (-1.30 pp) -- the model overfitted to WikiText-103. This highlights the importance of using larger, more diverse datasets for distillation.
+
+
#### Recommendations
+
+
- **Use larger datasets** for production distillation (e.g., [Nemotron-Pretraining-SFT-v1](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-SFT-v1)) to avoid overfitting as shown in the regression case above.
+
- **Train for more iterations** to ensure proper convergence.
To recover degradation in the quality of the compressed model, we can use knowledge distillation. This allows transferring the capabilities of the original model to the pruned one.
-
See [mbridge_distillation/README.md](./mbridge_distillation/README.md) for instructions on using Megatron-Bridge for knowledge distillation.
+
See [Megatron-Bridge distillation](../megatron_bridge/README.md#distillation) for instructions on using Megatron-Bridge for knowledge distillation. The distillation script supports both standard HuggingFace and Puzzletron AnyModel checkpoints.