
Commit f3d90a3 (1 parent: 151b03d)

move distillation results to results/puzzletron.md; enable anymodel hf export test

Signed-off-by: jrausch <jrausch@nvidia.com>

File tree: 3 files changed (+54, -50 lines)

examples/megatron_bridge/README.md (1 addition, 42 deletions)

@@ -184,50 +184,9 @@ uv run python /opt/Megatron-Bridge/examples/conversion/convert_checkpoints.py ex
 
 For more details, see the [Megatron-Bridge conversion README](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/examples/conversion).
 
-> **Known limitation:** HF export does not yet work for Puzzletron AnyModel (heterogeneous) checkpoints -- Megatron-Bridge cannot reload heterogeneous configs from saved checkpoints. Standard models export correctly with both methods.
-
 ### Distillation Results
 
-The following MMLU results demonstrate knowledge distillation on student models that were first compressed using [Puzzletron](../puzzletron/README.md). The original (uncompressed) model serves as the teacher, and distillation recovers accuracy lost during compression.
-
-#### Qwen3-8B compressed to 80% of original
-
-The student was created by compressing Qwen3-8B to 80% of its original size using Puzzletron.
-
-| Model | MMLU | Humanities | Other | Social Sci | STEM |
-|-------|------|------------|-------|------------|------|
-| Student (before distillation) | 0.5910 | 0.5046 | 0.6363 | 0.6831 | 0.5855 |
-| Student (after distillation) | 0.6921 | 0.5906 | 0.7316 | 0.7975 | 0.7016 |
-| Teacher (original Qwen3-8B) | 0.7493 | 0.6648 | 0.7856 | 0.8385 | 0.7526 |
-
-MMLU accuracy improved from 59.10% to 69.21% (+10.11 pp) after distillation with just 100 iterations on WikiText-103, recovering 64% of the gap to the teacher model.
-
-#### Llama-3.1-8B-Instruct compressed to 50% of original
-
-The student was created by compressing Llama-3.1-8B-Instruct to 50% of its original size using Puzzletron.
-
-| Model | MMLU | Humanities | Other | Social Sciences | STEM |
-|-------|------|------------|-------|-----------------|------|
-| Student (before distillation) | 0.2316 | 0.2462 | 0.2292 | 0.2250 | 0.2274 |
-| Student (after distillation) | 0.2960 | 0.3146 | 0.3085 | 0.2925 | 0.2768 |
-| Teacher (original Llama-3.1-8B-Instruct) | 0.6839 | 0.7231 | 0.7038 | 0.7667 | 0.5911 |
-
-#### Llama-3.1-8B-Instruct compressed to 69% of original (regression)
-
-The student was created by compressing Llama-3.1-8B-Instruct to ~69% of its original size using Puzzletron. This example shows regression due to overfitting on the small WikiText-103 dataset (100 iterations). MMLU was evaluated on a subset of 100 samples per task:
-
-| Model | MMLU | Humanities | Other | Social Sciences | STEM |
-|-------|------|------------|-------|-----------------|------|
-| Student (before distillation) | 0.6626 | 0.7069 | 0.6892 | 0.7525 | 0.5574 |
-| Student (after distillation) | 0.6496 | 0.6862 | 0.6677 | 0.7433 | 0.5532 |
-| Teacher (original Llama-3.1-8B-Instruct) | 0.6839 | 0.7231 | 0.7038 | 0.7667 | 0.5911 |
-
-MMLU decreased from 66.26% to 64.96% (-1.30 pp) -- the model overfitted to WikiText-103. This highlights the importance of using larger, more diverse datasets for distillation.
-
-#### Recommendations
-
-- **Use larger datasets** for production distillation (e.g., [Nemotron-Pretraining-SFT-v1](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-SFT-v1)) to avoid overfitting as shown in the regression case above.
-- **Train for more iterations** to ensure proper convergence.
-
+See [results/puzzletron.md](results/puzzletron.md) for MMLU results demonstrating knowledge distillation on Puzzletron-compressed student models.
 
 ## Post-Training Quantization

examples/megatron_bridge/results/puzzletron.md (new file, 42 additions)

@@ -0,0 +1,42 @@
+# Puzzletron Distillation Results
+
+The following MMLU results demonstrate knowledge distillation on student models that were first compressed using [Puzzletron](../../puzzletron/README.md). The original (uncompressed) model serves as the teacher, and distillation recovers accuracy lost during compression.
+
+## Qwen3-8B compressed to 80% of original
+
+The student was created by compressing Qwen3-8B to 80% of its original size using Puzzletron.
+
+| Model | MMLU | Humanities | Other | Social Sci | STEM |
+|-------|------|------------|-------|------------|------|
+| Student (before distillation) | 0.5910 | 0.5046 | 0.6363 | 0.6831 | 0.5855 |
+| Student (after distillation) | 0.6921 | 0.5906 | 0.7316 | 0.7975 | 0.7016 |
+| Teacher (original Qwen3-8B) | 0.7493 | 0.6648 | 0.7856 | 0.8385 | 0.7526 |
+
+MMLU accuracy improved from 59.10% to 69.21% (+10.11 pp) after distillation with just 100 iterations on WikiText-103, recovering 64% of the gap to the teacher model.
+
+## Llama-3.1-8B-Instruct compressed to 50% of original
+
+The student was created by compressing Llama-3.1-8B-Instruct to 50% of its original size using Puzzletron.
+
+| Model | MMLU | Humanities | Other | Social Sciences | STEM |
+|-------|------|------------|-------|-----------------|------|
+| Student (before distillation) | 0.2316 | 0.2462 | 0.2292 | 0.2250 | 0.2274 |
+| Student (after distillation) | 0.2960 | 0.3146 | 0.3085 | 0.2925 | 0.2768 |
+| Teacher (original Llama-3.1-8B-Instruct) | 0.6839 | 0.7231 | 0.7038 | 0.7667 | 0.5911 |
+
+## Llama-3.1-8B-Instruct compressed to 69% of original (regression)
+
+The student was created by compressing Llama-3.1-8B-Instruct to ~69% of its original size using Puzzletron. This example shows regression due to overfitting on the small WikiText-103 dataset (100 iterations). MMLU was evaluated on a subset of 100 samples per task:
+
+| Model | MMLU | Humanities | Other | Social Sciences | STEM |
+|-------|------|------------|-------|-----------------|------|
+| Student (before distillation) | 0.6626 | 0.7069 | 0.6892 | 0.7525 | 0.5574 |
+| Student (after distillation) | 0.6496 | 0.6862 | 0.6677 | 0.7433 | 0.5532 |
+| Teacher (original Llama-3.1-8B-Instruct) | 0.6839 | 0.7231 | 0.7038 | 0.7667 | 0.5911 |
+
+MMLU decreased from 66.26% to 64.96% (-1.30 pp) -- the model overfitted to WikiText-103. This highlights the importance of using larger, more diverse datasets for distillation.
+
+## Recommendations
+
+- **Use larger datasets** for production distillation (e.g., [Nemotron-Pretraining-SFT-v1](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-SFT-v1)) to avoid overfitting as shown in the regression case above.
+- **Train for more iterations** to ensure proper convergence.
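The "recovering 64% of the gap" figure in the Qwen3-8B results above follows from a standard gap-recovery ratio. A minimal check of that arithmetic, using only the MMLU numbers quoted in the table (variable names are illustrative):

```python
# Gap recovery: fraction of the teacher-student accuracy gap closed by distillation.
# Scores are the Qwen3-8B MMLU numbers from the results table above.
student_before = 0.5910  # compressed student, before distillation
student_after = 0.6921   # student after 100 distillation iterations
teacher = 0.7493         # original (uncompressed) Qwen3-8B teacher

gain = student_after - student_before  # absolute accuracy improvement
gap = teacher - student_before         # gap to the teacher before distillation
recovery = gain / gap                  # fraction of the gap closed

print(f"gain: {gain * 100:+.2f} pp")     # +10.11 pp
print(f"gap recovered: {recovery:.0%}")  # 64%
```

The same ratio applied to the 69% regression case comes out negative, which is why that run is reported as a regression rather than a partial recovery.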

tests/examples/megatron_bridge/test_distill.py (11 additions, 8 deletions)

@@ -72,19 +72,16 @@ def test_distill_puzzletron_anymodel(tmp_path: Path, num_gpus):
     """Integration test for distill.py with Puzzletron AnyModel (heterogeneous) checkpoints.
 
     Creates Qwen3 models, converts the student to Puzzletron AnyModel format
-    (heterogeneous layer architectures), and runs mbridge distillation.
-
-    Note: HF export via --hf_export_path is NOT tested here because Megatron-Bridge's
-    export_ckpt cannot reload heterogeneous model configs from saved checkpoints
-    (heterogeneous_layers_config_encoded_json is None during __post_init__).
-    HF export for standard models is tested in test_distill_and_convert.
+    (heterogeneous layer architectures), runs mbridge distillation, and exports
+    the distilled checkpoint to HuggingFace format via --hf_export_path.
     """
-    _, student_anymodel_dir, teacher_hf_dir = _prepare_puzzletron_anymodel_student_and_teacher(
-        tmp_path
+    student_hf_dir, student_anymodel_dir, teacher_hf_dir = (
+        _prepare_puzzletron_anymodel_student_and_teacher(tmp_path)
     )
 
     train_iters = 5
     output_dir = tmp_path / "distill_output"
+    hf_export_path = tmp_path / "distilled_anymodel_hf"
     cmd_parts = extend_cmd_parts(
         ["torchrun", f"--nproc_per_node={num_gpus}", "distill.py", "--use_mock_data"],
         student_hf_path=student_anymodel_dir,
@@ -100,12 +97,18 @@ def test_distill_puzzletron_anymodel(tmp_path: Path, num_gpus):
         eval_interval=5,
         eval_iters=1,
         log_interval=1,
+        hf_export_path=hf_export_path,
+        student_hf_model=student_hf_dir,
     )
     run_example_command(cmd_parts, example_path="megatron_bridge")
 
     run_config_path = output_dir / "checkpoints" / f"iter_{train_iters:07d}" / "run_config.yaml"
     assert run_config_path.exists(), f"Expected run_config.yaml at: {run_config_path}"
 
+    assert (hf_export_path / "config.json").exists(), (
+        f"Expected HF export at: {hf_export_path}/config.json"
+    )
+
 
 def _prepare_puzzletron_anymodel_student_and_teacher(tmp_path: Path) -> tuple[Path, Path, Path]:
     """Create Qwen3 models and convert student to Puzzletron AnyModel format."""
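The new assertion treats the presence of `config.json` as the marker that a HuggingFace-format export completed. A self-contained sketch of that same check, with the export directory simulated via a temp dir (the directory name follows the diff; the config contents are made up for illustration):

```python
import json
import tempfile
from pathlib import Path

# Simulate an export directory the way the test expects distill.py to populate
# it, then apply the same presence check the test performs.
with tempfile.TemporaryDirectory() as tmp:
    hf_export_path = Path(tmp) / "distilled_anymodel_hf"
    hf_export_path.mkdir(parents=True)
    # A real HF export would also contain weights and tokenizer files;
    # config.json alone is what the test uses as the completion marker.
    (hf_export_path / "config.json").write_text(json.dumps({"model_type": "qwen3"}))

    assert (hf_export_path / "config.json").exists(), (
        f"Expected HF export at: {hf_export_path}/config.json"
    )
    print("export check passed")
```

Checking for `config.json` is a cheap smoke test; a stricter test could also load the config and compare architecture fields against the student model.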
