This section shows how to distill a student model from a teacher model in the Megatron-Bridge framework.
This can be used stand-alone, or after pruning (see [Pruning](#pruning)) or quantization (see [Quantization](#quantization)) to recover the model's accuracy by distilling from the original model (the teacher).
The [distill.py](distill.py) script loads student and teacher models from HuggingFace checkpoints and saves the distilled model to `<output_dir>/checkpoints` in Megatron distributed checkpoint format.
### Data Preparation
The distillation script expects pre-tokenized data in Megatron's binary format (`.bin` / `.idx` files).
You can tokenize your JSONL dataset using the following function:
```python
from modelopt.torch.utils.plugins import megatron_preprocess_data

megatron_preprocess_data(
    input_path="/path/to/your/data.jsonl",
    output_dir="/path/to/tokenized/data",
    tokenizer_name_or_path="Qwen/Qwen3-0.6B",
    json_keys=["text"],  # change to your JSON key if needed
    workers=32,
    log_interval=100000,
    max_sequence_length=256000,  # avoids rare OOM errors on very long texts
)
```
If you have multiple JSONL files, you can tokenize them one by one and pass all the paths to the `--data_paths` argument.
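For several JSONL shards, one approach is a simple loop over the files — a sketch, reusing the `megatron_preprocess_data` call from above (the directory layout and tokenizer are placeholder assumptions):

```python
from pathlib import Path

from modelopt.torch.utils.plugins import megatron_preprocess_data

# Tokenize each JSONL shard separately; each call writes its own
# .bin/.idx pair, and all resulting prefixes are then passed to
# distill.py together via --data_paths.
for shard in sorted(Path("/path/to/raw").glob("*.jsonl")):
    megatron_preprocess_data(
        input_path=str(shard),
        output_dir="/path/to/tokenized/data",
        tokenizer_name_or_path="Qwen/Qwen3-0.6B",
        json_keys=["text"],
        workers=32,
    )
```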
### Distillation with Real Data
Example usage to distill a 4B student (HF) from an 8B teacher (HF) on 8 GPUs (TP=8, PP=1):
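A sketch of such a launch follows; the checkpoint names, data prefix, and training hyperparameters below are illustrative assumptions, not values from this repository:

```shell
torchrun --nproc_per_node 8 distill.py \
    --tp_size 8 \
    --teacher_hf_path Qwen/Qwen3-8B \
    --student_hf_path Qwen/Qwen3-4B \
    --data_paths /path/to/tokenized/data \
    --seq_length 4096 \
    --mbs 1 \
    --gbs 256 \
    --train_iters 1000 \
    --output_dir /path/to/output
```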
TensorBoard logging is enabled by default, and logs are saved to the `<output_dir>/tensorboard` directory.
To use Weights & Biases for logging, set the `WANDB_API_KEY` environment variable and pass the `--wandb_project` argument.
Optionally, you can also pass the `--wandb_entity` and `--wandb_exp_name` arguments to set the entity and experiment name under which runs are grouped.
To see all available arguments:
```bash
torchrun --nproc_per_node 1 distill.py --help
```
### Quick Test with Mock Data
Example usage with mock data for quick testing (no pre-tokenized data needed):
```bash
torchrun --nproc_per_node 8 distill.py \
    --tp_size 8 \
    --teacher_hf_path Qwen/Qwen3-0.6B \
    --student_hf_path Qwen/Qwen3-0.6B \
    --use_mock_data \
    --seq_length 512 \
    --mbs 1 \
    --gbs 8 \
    --train_iters 100 \
    --eval_interval 10 \
    --eval_iters 4 \
    --output_dir /tmp/test_distill
```
### Slurm Usage
To run the distillation script on a Slurm cluster for multi-node training, simply use `python` instead of `torchrun` and set the number of nodes with an `#SBATCH --nodes=<num_nodes>` directive in your Slurm script.
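A minimal batch-script sketch, assuming 8 GPUs per node and `srun` as the per-task launcher (the resource directives, `srun` usage, and all paths here are assumptions to adapt to your cluster):

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8

# Slurm starts one task per GPU, so the script is invoked with
# `python` rather than `torchrun`.
srun python distill.py \
    --tp_size 8 \
    --teacher_hf_path Qwen/Qwen3-0.6B \
    --student_hf_path Qwen/Qwen3-0.6B \
    --data_paths /path/to/tokenized/data \
    --output_dir /path/to/output
```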