
Commit f3d90a3 (1 parent: 151b03d)

move distillation results to results/puzzletron.md; enable anymodel hf export test

Signed-off-by: jrausch <jrausch@nvidia.com>

File tree: 3 files changed (+54, -50 lines)

examples/megatron_bridge/README.md (1 addition, 42 deletions)

@@ -184,50 +184,9 @@ uv run python /opt/Megatron-Bridge/examples/conversion/convert_checkpoints.py ex
 
 For more details, see the [Megatron-Bridge conversion README](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/examples/conversion).
 
-> **Known limitation:** HF export does not yet work for Puzzletron AnyModel (heterogeneous) checkpoints -- Megatron-Bridge cannot reload heterogeneous configs from saved checkpoints. Standard models export correctly with both methods.
-
 ### Distillation Results
 
-The following MMLU results demonstrate knowledge distillation on student models that were first compressed using [Puzzletron](../puzzletron/README.md). The original (uncompressed) model serves as the teacher, and distillation recovers accuracy lost during compression.
-
-#### Qwen3-8B compressed to 80% of original
-
-The student was created by compressing Qwen3-8B to 80% of its original size using Puzzletron.
-
-| Model | MMLU | Humanities | Other | Social Sci | STEM |
-|-------|------|------------|-------|------------|------|
-| Student (before distillation) | 0.5910 | 0.5046 | 0.6363 | 0.6831 | 0.5855 |
-| Student (after distillation) | 0.6921 | 0.5906 | 0.7316 | 0.7975 | 0.7016 |
-| Teacher (original Qwen3-8B) | 0.7493 | 0.6648 | 0.7856 | 0.8385 | 0.7526 |
-
-MMLU accuracy improved from 59.10% to 69.21% (+10.11 pp) after distillation with just 100 iterations on WikiText-103, recovering 64% of the gap to the teacher model.
-
-#### Llama-3.1-8B-Instruct compressed to 50% of original
-
-The student was created by compressing Llama-3.1-8B-Instruct to 50% of its original size using Puzzletron.
-
-| Model | MMLU | Humanities | Other | Social Sciences | STEM |
-|-------|------|------------|-------|-----------------|------|
-| Student (before distillation) | 0.2316 | 0.2462 | 0.2292 | 0.2250 | 0.2274 |
-| Student (after distillation) | 0.2960 | 0.3146 | 0.3085 | 0.2925 | 0.2768 |
-| Teacher (original Llama-3.1-8B-Instruct) | 0.6839 | 0.7231 | 0.7038 | 0.7667 | 0.5911 |
-
-#### Llama-3.1-8B-Instruct compressed to 69% of original (regression)
-
-The student was created by compressing Llama-3.1-8B-Instruct to ~69% of its original size using Puzzletron. This example shows regression due to overfitting on the small WikiText-103 dataset (100 iterations). MMLU was evaluated on a subset of 100 samples per task:
-
-| Model | MMLU | Humanities | Other | Social Sciences | STEM |
-|-------|------|------------|-------|-----------------|------|
-| Student (before distillation) | 0.6626 | 0.7069 | 0.6892 | 0.7525 | 0.5574 |
-| Student (after distillation) | 0.6496 | 0.6862 | 0.6677 | 0.7433 | 0.5532 |
-| Teacher (original Llama-3.1-8B-Instruct) | 0.6839 | 0.7231 | 0.7038 | 0.7667 | 0.5911 |
-
-MMLU decreased from 66.26% to 64.96% (-1.30 pp) -- the model overfitted to WikiText-103. This highlights the importance of using larger, more diverse datasets for distillation.
-
-#### Recommendations
-
-- **Use larger datasets** for production distillation (e.g., [Nemotron-Pretraining-SFT-v1](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-SFT-v1)) to avoid overfitting as shown in the regression case above.
-- **Train for more iterations** to ensure proper convergence.
-
+See [results/puzzletron.md](results/puzzletron.md) for MMLU results demonstrating knowledge distillation on Puzzletron-compressed student models.
 
 ## Post-Training Quantization

examples/megatron_bridge/results/puzzletron.md (new file, 42 additions)

@@ -0,0 +1,42 @@
+# Puzzletron Distillation Results
+
+The following MMLU results demonstrate knowledge distillation on student models that were first compressed using [Puzzletron](../../puzzletron/README.md). The original (uncompressed) model serves as the teacher, and distillation recovers accuracy lost during compression.
+
+## Qwen3-8B compressed to 80% of original
+
+The student was created by compressing Qwen3-8B to 80% of its original size using Puzzletron.
+
+| Model | MMLU | Humanities | Other | Social Sci | STEM |
+|-------|------|------------|-------|------------|------|
+| Student (before distillation) | 0.5910 | 0.5046 | 0.6363 | 0.6831 | 0.5855 |
+| Student (after distillation) | 0.6921 | 0.5906 | 0.7316 | 0.7975 | 0.7016 |
+| Teacher (original Qwen3-8B) | 0.7493 | 0.6648 | 0.7856 | 0.8385 | 0.7526 |
+
+MMLU accuracy improved from 59.10% to 69.21% (+10.11 pp) after distillation with just 100 iterations on WikiText-103, recovering 64% of the gap to the teacher model.
+
+## Llama-3.1-8B-Instruct compressed to 50% of original
+
+The student was created by compressing Llama-3.1-8B-Instruct to 50% of its original size using Puzzletron.
+
+| Model | MMLU | Humanities | Other | Social Sciences | STEM |
+|-------|------|------------|-------|-----------------|------|
+| Student (before distillation) | 0.2316 | 0.2462 | 0.2292 | 0.2250 | 0.2274 |
+| Student (after distillation) | 0.2960 | 0.3146 | 0.3085 | 0.2925 | 0.2768 |
+| Teacher (original Llama-3.1-8B-Instruct) | 0.6839 | 0.7231 | 0.7038 | 0.7667 | 0.5911 |
+
+## Llama-3.1-8B-Instruct compressed to 69% of original (regression)
+
+The student was created by compressing Llama-3.1-8B-Instruct to ~69% of its original size using Puzzletron. This example shows regression due to overfitting on the small WikiText-103 dataset (100 iterations). MMLU was evaluated on a subset of 100 samples per task:
+
+| Model | MMLU | Humanities | Other | Social Sciences | STEM |
+|-------|------|------------|-------|-----------------|------|
+| Student (before distillation) | 0.6626 | 0.7069 | 0.6892 | 0.7525 | 0.5574 |
+| Student (after distillation) | 0.6496 | 0.6862 | 0.6677 | 0.7433 | 0.5532 |
+| Teacher (original Llama-3.1-8B-Instruct) | 0.6839 | 0.7231 | 0.7038 | 0.7667 | 0.5911 |
+
+MMLU decreased from 66.26% to 64.96% (-1.30 pp) -- the model overfitted to WikiText-103. This highlights the importance of using larger, more diverse datasets for distillation.
+
+## Recommendations
+
+- **Use larger datasets** for production distillation (e.g., [Nemotron-Pretraining-SFT-v1](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-SFT-v1)) to avoid overfitting as shown in the regression case above.
+- **Train for more iterations** to ensure proper convergence.
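The "recovering 64% of the gap" figure in the Qwen3-8B results above follows from a standard gap-recovery ratio. A minimal check of that arithmetic, using only the MMLU numbers quoted in the table (variable names are illustrative):

```python
# Gap recovery: fraction of the teacher-student accuracy gap closed by distillation.
# Scores are the Qwen3-8B MMLU numbers from the results table above.
student_before = 0.5910  # compressed student, before distillation
student_after = 0.6921   # student after 100 distillation iterations
teacher = 0.7493         # original (uncompressed) Qwen3-8B teacher

gain = student_after - student_before  # absolute accuracy improvement
gap = teacher - student_before         # gap to the teacher before distillation
recovery = gain / gap                  # fraction of the gap closed

print(f"gain: {gain * 100:+.2f} pp")     # +10.11 pp
print(f"gap recovered: {recovery:.0%}")  # 64%
```

The same ratio applied to the 69% regression case comes out negative, which is why that run is reported as a regression rather than a partial recovery.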

tests/examples/megatron_bridge/test_distill.py (11 additions, 8 deletions)

@@ -72,19 +72,16 @@ def test_distill_puzzletron_anymodel(tmp_path: Path, num_gpus):
     """Integration test for distill.py with Puzzletron AnyModel (heterogeneous) checkpoints.
 
     Creates Qwen3 models, converts the student to Puzzletron AnyModel format
-    (heterogeneous layer architectures), and runs mbridge distillation.
-
-    Note: HF export via --hf_export_path is NOT tested here because Megatron-Bridge's
-    export_ckpt cannot reload heterogeneous model configs from saved checkpoints
-    (heterogeneous_layers_config_encoded_json is None during __post_init__).
-    HF export for standard models is tested in test_distill_and_convert.
+    (heterogeneous layer architectures), runs mbridge distillation, and exports
+    the distilled checkpoint to HuggingFace format via --hf_export_path.
     """
-    _, student_anymodel_dir, teacher_hf_dir = _prepare_puzzletron_anymodel_student_and_teacher(
-        tmp_path
+    student_hf_dir, student_anymodel_dir, teacher_hf_dir = (
+        _prepare_puzzletron_anymodel_student_and_teacher(tmp_path)
     )
 
     train_iters = 5
     output_dir = tmp_path / "distill_output"
+    hf_export_path = tmp_path / "distilled_anymodel_hf"
     cmd_parts = extend_cmd_parts(
         ["torchrun", f"--nproc_per_node={num_gpus}", "distill.py", "--use_mock_data"],
         student_hf_path=student_anymodel_dir,
@@ -100,12 +97,18 @@ def test_distill_puzzletron_anymodel(tmp_path: Path, num_gpus):
         eval_interval=5,
         eval_iters=1,
         log_interval=1,
+        hf_export_path=hf_export_path,
+        student_hf_model=student_hf_dir,
     )
     run_example_command(cmd_parts, example_path="megatron_bridge")
 
     run_config_path = output_dir / "checkpoints" / f"iter_{train_iters:07d}" / "run_config.yaml"
     assert run_config_path.exists(), f"Expected run_config.yaml at: {run_config_path}"
 
+    assert (hf_export_path / "config.json").exists(), (
+        f"Expected HF export at: {hf_export_path}/config.json"
+    )
+
 
 def _prepare_puzzletron_anymodel_student_and_teacher(tmp_path: Path) -> tuple[Path, Path, Path]:
     """Create Qwen3 models and convert student to Puzzletron AnyModel format."""
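The new assertion treats the presence of `config.json` as the marker that a HuggingFace-format export completed. A self-contained sketch of that same check, with the export directory simulated via a temp dir (the directory name follows the diff; the config contents are made up for illustration):

```python
import json
import tempfile
from pathlib import Path

# Simulate an export directory the way the test expects distill.py to populate
# it, then apply the same presence check the test performs.
with tempfile.TemporaryDirectory() as tmp:
    hf_export_path = Path(tmp) / "distilled_anymodel_hf"
    hf_export_path.mkdir(parents=True)
    # A real HF export would also contain weights and tokenizer files;
    # config.json alone is what the test uses as the completion marker.
    (hf_export_path / "config.json").write_text(json.dumps({"model_type": "qwen3"}))

    assert (hf_export_path / "config.json").exists(), (
        f"Expected HF export at: {hf_export_path}/config.json"
    )
    print("export check passed")
```

Checking for `config.json` is a cheap smoke test; a stricter test could also load the config and compare architecture fields against the student model.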
