# examples/megatron_bridge/README.md
For more details, see the [Megatron-Bridge conversion README](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/examples/conversion).
> **Known limitation:** HF export does not yet work for Puzzletron AnyModel (heterogeneous) checkpoints -- Megatron-Bridge cannot reload heterogeneous configs from saved checkpoints. Standard models export correctly with both methods.
### Distillation Results
See [results/puzzletron.md](results/puzzletron.md) for MMLU results demonstrating knowledge distillation on Puzzletron-compressed student models.

---

**results/puzzletron.md:**
The following MMLU results demonstrate knowledge distillation on student models that were first compressed using [Puzzletron](../../puzzletron/README.md). The original (uncompressed) model serves as the teacher, and distillation recovers accuracy lost during compression.
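The distillation objective itself is not shown in this document; a minimal sketch of the standard temperature-softened KL distillation loss (Hinton-style, a generic formulation rather than Megatron-Bridge's actual implementation) illustrates what "the original model serves as the teacher" means in practice:

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

def kd_loss(teacher_logits, student_logits, T=2.0):
    """Forward KL(teacher || student) on temperature-softened
    distributions, scaled by T^2 so gradients keep their magnitude.
    A generic sketch, not Megatron-Bridge's implementation."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return T * T * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Identical logits give zero loss; diverging logits give a positive loss.
print(kd_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1]))  # 0.0
print(kd_loss([2.0, 1.0, 0.1], [0.1, 1.0, 2.0]) > 0)  # True
```

Minimizing this loss pulls the compressed student's output distribution back toward the uncompressed teacher's, which is how the accuracy lost during compression is recovered.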
## Qwen3-8B compressed to 80% of original
The student was created by compressing Qwen3-8B to 80% of its original size using Puzzletron.
| Model | MMLU | Humanities | Other | Social Sciences | STEM |
|-------|------|------------|-------|-----------------|------|
MMLU accuracy improved from 59.10% to 69.21% (+10.11 pp) after distillation with just 100 iterations on WikiText-103, recovering 64% of the gap to the teacher model.
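The "64% of the gap" figure is gap-recovery arithmetic: the post-distillation gain divided by the teacher-student gap before distillation. The teacher's MMLU score is not listed above, so the value below is a hypothetical one chosen only to illustrate the calculation (it reproduces the reported ~64%):

```python
# Gap-recovery arithmetic for the Qwen3-8B run.
student_before = 59.10  # compressed student, pre-distillation (reported)
student_after = 69.21   # post-distillation (reported)
teacher = 74.90         # ASSUMED teacher MMLU, for illustration only

gain_pp = student_after - student_before
recovery = gain_pp / (teacher - student_before)
print(f"+{gain_pp:.2f} pp, {recovery:.0%} of the gap recovered")
```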
## Llama-3.1-8B-Instruct compressed to 50% of original
The student was created by compressing Llama-3.1-8B-Instruct to 50% of its original size using Puzzletron.
| Model | MMLU | Humanities | Other | Social Sciences | STEM |
|-------|------|------------|-------|-----------------|------|
## Llama-3.1-8B-Instruct compressed to 69% of original (regression)
The student was created by compressing Llama-3.1-8B-Instruct to ~69% of its original size using Puzzletron. This example shows regression due to overfitting on the small WikiText-103 dataset (100 iterations). MMLU was evaluated on a subset of 100 samples per task:
| Model | MMLU | Humanities | Other | Social Sciences | STEM |
|-------|------|------------|-------|-----------------|------|
MMLU decreased from 66.26% to 64.96% (-1.30 pp) -- the model overfitted to WikiText-103. This highlights the importance of using larger, more diverse datasets for distillation.
## Recommendations
- **Use larger datasets** for production distillation (e.g., [Nemotron-Pretraining-SFT-v1](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-SFT-v1)) to avoid overfitting, as shown in the regression case above.
- **Train for more iterations** to ensure proper convergence.