
Commit 50b6b7e

Update readme
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
1 parent c18315b commit 50b6b7e

3 files changed: 122 additions & 73 deletions

File tree

examples/megatron_bridge/README.md

Lines changed: 92 additions & 8 deletions
@@ -4,13 +4,13 @@ This directory contains examples of using Model Optimizer with [NeMo Megatron-Br

 <div align="center">

-| **Section** | **Description** | **Link** | **Docs** |
-| :------------: | :------------: | :------------: | :------------: |
-| Pre-Requisites | Development environment setup | \[[Link](#pre-requisites)\] | |
-| Pruning | Examples of pruning a model using Minitron algorithm | \[[Link](#pruning)\] | |
-| Distillation | Examples of distillation a pruned or quantized model | \[[Link](#distillation)\] | |
-| Quantization | Examples of quantizing a model | \[[Link](#quantization)\] | |
-| Resources | Extra links to relevant resources | \[[Link](#resources)\] | |
+| **Section** | **Description** | **Link** |
+| :------------: | :------------: | :------------: |
+| Pre-Requisites | Development environment setup | \[[Link](#pre-requisites)\] |
+| Pruning | Examples of pruning a model using the Minitron algorithm | \[[Link](#pruning)\] |
+| Distillation | Examples of distilling a pruned or quantized model | \[[Link](#distillation)\] |
+| Quantization | Examples of quantizing a model | \[[Link](#quantization)\] |
+| Resources | Extra links to relevant resources | \[[Link](#resources)\] |

 </div>

@@ -57,6 +57,7 @@ Example usage to prune Qwen3-8B to 6B on 2-GPUs (Pipeline Parallelism = 2) while

 ```bash
 torchrun --nproc_per_node 2 prune_minitron.py \
+    --pp_size 2 \
     --hf_model_name_or_path Qwen/Qwen3-8B \
     --prune_target_params 6e9 \
     --hparams_to_skip num_attention_heads \
@@ -68,6 +69,7 @@ Example usage for manually pruning to a specific architecture using following de

 ```bash
 torchrun --nproc_per_node 2 prune_minitron.py \
+    --pp_size 2 \
     --hf_model_name_or_path Qwen/Qwen3-8B \
     --prune_export_config '{"hidden_size": 3584, "ffn_hidden_size": 9216}' \
     --output_hf_path /tmp/Qwen3-8B-Pruned-6B-manual
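The value of `--prune_export_config` in the command above is plain JSON. As a minimal sketch (an editor's illustration, not part of the commit), the flag value can be composed with `json.dumps` to avoid shell-quoting mistakes:

```python
import json

# Target hyperparameters from the manual-pruning example above
export_config = {"hidden_size": 3584, "ffn_hidden_size": 9216}

# json.dumps yields the exact JSON string the flag expects
flag_value = json.dumps(export_config)
print(f"--prune_export_config '{flag_value}'")
```

Round-tripping the string through `json.loads` is a quick sanity check before launching a long pruning run.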
@@ -86,7 +88,89 @@ torchrun --nproc_per_node 1 prune_minitron.py --help

 ## Distillation

-TODO - Add info!
+This section shows how to distill a student model from a teacher model in the Megatron-Bridge framework.
+
+It can be used stand-alone or after pruning (see [Pruning](#pruning)) or quantization (see [Quantization](#quantization)) to recover model accuracy by distilling from the original (teacher) model.
+
+The [distill.py](distill.py) script loads the student and teacher models from HuggingFace checkpoints and saves the distilled model to `<output_dir>/checkpoints` in Megatron distributed checkpoint format.
+
+### Data Preparation
+
+The distillation script expects pre-tokenized data in Megatron's binary format (`.bin` / `.idx` files).
+You can tokenize your JSONL dataset using the following function:
+
+```python
+from modelopt.torch.utils.plugins import megatron_preprocess_data
+
+megatron_preprocess_data(
+    input_path="/path/to/your/data.jsonl",
+    output_dir="/path/to/tokenized/data",
+    tokenizer_name_or_path="Qwen/Qwen3-0.6B",
+    json_keys=["text"],  # change to your JSON key if needed
+    workers=32,
+    log_interval=100000,
+    max_sequence_length=256000,  # avoids rare OOM errors if a text is too long
+)
+```
+
+If you have multiple JSONL files, you can tokenize them one by one and pass all the resulting paths to the `--data_paths` argument.
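The example commands in this commit pass `--data_paths 1.0 /path/to/tokenized/data`, i.e. a weight followed by a path prefix. As a hedged sketch (shard paths are hypothetical, and it assumes the flag follows Megatron's alternating weight/prefix convention for blending multiple datasets), the argument for several shards can be assembled like this:

```python
# Hypothetical tokenized-data prefixes, one per JSONL shard
prefixes = ["/data/tok/shard0_text_document", "/data/tok/shard1_text_document"]

# Flatten into alternating weight/prefix pairs with equal weights
weight = 1.0 / len(prefixes)
tokens = []
for prefix in prefixes:
    tokens += [str(weight), prefix]

print("--data_paths " + " ".join(tokens))
```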
+
+### Distillation with Real Data
+
+Example usage to distill a 4B student (HF) from an 8B teacher (HF) on 8 GPUs (TP=8, PP=1):
+
+```bash
+torchrun --nnodes 1 --nproc_per_node 8 distill.py \
+    --tp_size 8 \
+    --teacher_hf_path Qwen/Qwen3-8B \
+    --student_hf_path Qwen/Qwen3-4B \
+    --data_paths 1.0 /path/to/tokenized/data \
+    --data_path_to_cache /path/to/cache/dataset_indices_qwen3 \
+    --seq_length 8192 \
+    --mbs 1 \
+    --gbs 768 \
+    --train_iters 15000 \
+    --lr 1e-4 \
+    --min_lr 1e-5 \
+    --lr_warmup_iters 50 \
+    --eval_interval 100 \
+    --eval_iters 32 \
+    --log_interval 10 \
+    --output_dir /output/qwen3_8b_to_4b_distill
+```
+
+TensorBoard logging is enabled by default; logs are saved to the `<output_dir>/tensorboard` directory.
+To use Weights & Biases for logging, set the `WANDB_API_KEY` environment variable and pass the `--wandb_project` argument.
+Optionally, you can also pass the `--wandb_entity` and `--wandb_exp_name` arguments to group runs under a project and experiment name.
+
+To see all available arguments:
+
+```bash
+torchrun --nproc_per_node 1 distill.py --help
+```
+
+### Quick Test with Mock Data
+
+Example usage with mock data for quick testing (no pre-tokenized data needed):
+
+```bash
+torchrun --nproc_per_node 8 distill.py \
+    --tp_size 8 \
+    --teacher_hf_path Qwen/Qwen3-0.6B \
+    --student_hf_path Qwen/Qwen3-0.6B \
+    --use_mock_data \
+    --seq_length 512 \
+    --mbs 1 \
+    --gbs 8 \
+    --train_iters 100 \
+    --eval_interval 10 \
+    --eval_iters 4 \
+    --output_dir /tmp/test_distill
+```
+
+### Slurm Usage
+
+To run the distillation script on a Slurm cluster for multi-node training, use `python` instead of `torchrun` and set the number of nodes with an `#SBATCH --nodes=<num_nodes>` directive in your Slurm script.
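That note can be sketched as a batch script. The following is an untested sketch, not part of the commit: the job name, resource directives, and all paths are hypothetical, and it assumes `srun` launches the distributed ranks so plain `python` replaces `torchrun`:

```shell
#!/bin/bash
#SBATCH --nodes=2                 # multi-node training
#SBATCH --ntasks-per-node=8       # hypothetical: one task per GPU
#SBATCH --gpus-per-node=8
#SBATCH --job-name=qwen3-distill  # hypothetical job name

# Plain `python` instead of `torchrun`; srun spawns one rank per task.
srun python distill.py \
    --tp_size 8 \
    --teacher_hf_path Qwen/Qwen3-8B \
    --student_hf_path Qwen/Qwen3-4B \
    --data_paths 1.0 /path/to/tokenized/data \
    --seq_length 8192 \
    --mbs 1 \
    --gbs 768 \
    --train_iters 15000 \
    --output_dir /output/qwen3_8b_to_4b_distill
```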

 ## Quantization

examples/megatron_bridge/distill.py

Lines changed: 11 additions & 59 deletions
@@ -15,62 +15,9 @@
 """Distillation script for Megatron-Bridge.

 Loads student and teacher models directly from HuggingFace checkpoints (local or remote) and saves the distilled model
-to <output_dir>/checkpoints in megatron distributed checkpoint format.
+to `<output_dir>/checkpoints` in megatron distributed checkpoint format.

-Example usage to distill a 4B student from an 8B teacher on 8 GPUs:
-
-.. code-block:: bash
-
-    torchrun --nproc_per_node 8 distill.py \
-        --teacher_hf_path Qwen/Qwen3-8B \
-        --student_hf_path Qwen/Qwen3-4B \
-        --tp_size 8 \
-        --data_paths 1.0 /path/to/tokenized/data \
-        --data_path_to_cache /path/to/cache/dataset_indices_qwen3 \
-        --seq_length 8192 \
-        --mbs 1 \
-        --gbs 768 \
-        --train_iters 15000 \
-        --lr 1e-4 \
-        --min_lr 1e-5 \
-        --lr_warmup_iters 50 \
-        --eval_interval 100 \
-        --eval_iters 32 \
-        --log_interval 10 \
-        --output_dir /output/qwen3_8b_to_4b_distill
-
-Example usage to use mock data for quick testing:
-
-.. code-block:: bash
-
-    torchrun --nproc_per_node 8 distill.py \
-        --teacher_hf_path Qwen/Qwen3-0.6B \
-        --student_hf_path Qwen/Qwen3-0.6B \
-        --tp_size 8 \
-        --use_mock_data \
-        --seq_length 512 \
-        --mbs 1 \
-        --gbs 8 \
-        --train_iters 100 \
-        --eval_interval 10 \
-        --eval_iters 4 \
-        --output_dir /tmp/test_distill
-
-If you want to tokenize your own data for a specific tokenizer, you can use the following command:
-
-.. code-block:: python
-
-    from modelopt.torch.utils.plugins import megatron_preprocess_data
-
-    megatron_preprocess_data(
-        input_path="/path/to/your/data.jsonl",
-        output_dir="/path/to/tokenized/data",
-        tokenizer_name_or_path="Qwen/Qwen3-0.6B",
-        json_keys=["text"],
-        workers=32,
-        log_interval=100000,
-        max_sequence_length=256000,
-    )
+See `README.md` in this directory for example usage and data preparation instructions.
 """

 import argparse
@@ -106,7 +53,7 @@
 def get_args():
     """Parse command-line arguments."""
     parser = argparse.ArgumentParser(description="Distillation for Megatron-Bridge.")
-    # Model arguments
+    # Model arguments (accepts HuggingFace input only at the moment)
     parser.add_argument(
         "--student_hf_path",
         type=str,
@@ -142,7 +89,10 @@ def get_args():
         "--output_dir", type=str, required=True, help="Folder for logging and checkpoint saving"
     )
     parser.add_argument(
-        "--seq_length", type=int, default=8192, help="Number of tokens per input sample"
+        "--seq_length",
+        type=int,
+        default=4096,
+        help="Number of tokens per input sample. Use 8192 if your dataset has longer sequences.",
     )
     parser.add_argument("--mbs", type=int, default=1, help="Micro-batch Size")
     parser.add_argument("--gbs", type=int, default=768, help="Global Batch Size")
@@ -187,16 +137,18 @@ def main(args: argparse.Namespace):
     def _build_model_provider(hf_path):
         bridge = AutoBridge.from_hf_pretrained(hf_path)
         provider = bridge.to_megatron_provider(load_weights=True)
+
+        # Override parallelism / training settings
         provider.tensor_model_parallel_size = args.tp_size
         provider.pipeline_model_parallel_size = args.pp_size
         provider.context_parallel_size = 1
         provider.sequence_parallel = args.tp_size > 1
         provider.seq_length = args.seq_length
         provider.pipeline_dtype = torch.bfloat16
-        provider.cross_entropy_fusion_impl = "te"
         return provider

-    # TODO: Support megatron-ckpt as an alternative to HF checkpoints
+    # TODO: Support megatron-ckpt as an alternative to HF checkpoints (e.g. /path/to/ckpt/iter_0000000)
+    # Still requires an HF model name or path to build provider correctly
     student_provider = _build_model_provider(args.student_hf_path)
     teacher_provider = _build_model_provider(args.teacher_hf_path)

examples/megatron_bridge/prune_minitron.py

Lines changed: 19 additions & 6 deletions
@@ -28,8 +28,11 @@

 To see the full usage for advanced configurations, run:
     torchrun --nproc_per_node 1 prune_minitron.py --help
+
+See `README.md` in this directory for more details.
 """

+# TODO: Test multi-node pruning
 import argparse
 import json
 import os
@@ -66,9 +69,20 @@ def get_args() -> argparse.Namespace:
         "--output_hf_path", type=str, help="Path to save the pruned model in HF checkpoint format"
     )

-    # Uneven Pipeline Parallelism parameters
-    parser.add_argument("--num_layers_in_first_pipeline_stage", type=int, default=None)
-    parser.add_argument("--num_layers_in_last_pipeline_stage", type=int, default=None)
+    # Parallelism arguments
+    parser.add_argument("--pp_size", type=int, default=1, help="Pipeline parallel size")
+    parser.add_argument(
+        "--num_layers_in_first_pipeline_stage",
+        type=int,
+        default=None,
+        help="Number of layers in the first pipeline stage (Uneven Pipeline Parallelism)",
+    )
+    parser.add_argument(
+        "--num_layers_in_last_pipeline_stage",
+        type=int,
+        default=None,
+        help="Number of layers in the last pipeline stage (Uneven Pipeline Parallelism)",
+    )

     # Calibration dataset parameters
     parser.add_argument(
@@ -201,8 +215,7 @@ def get_args() -> argparse.Namespace:


 def main(args: argparse.Namespace):
-    pp_size = dist.size()
-    print_rank_0(f"Setting pipeline_model_parallel_size to {pp_size}")
+    assert dist.size() == args.pp_size, "Only Pipeline parallelism is supported for pruning."

     if args.output_megatron_path and os.path.exists(
         f"{args.output_megatron_path}/latest_checkpointed_iteration.txt"
@@ -218,7 +231,7 @@ def main(args: argparse.Namespace):
         trust_remote_code=args.trust_remote_code,
         provider_overrides={
             "tensor_model_parallel_size": 1,
-            "pipeline_model_parallel_size": pp_size,
+            "pipeline_model_parallel_size": args.pp_size,
             "num_layers_in_first_pipeline_stage": args.num_layers_in_first_pipeline_stage,
             "num_layers_in_last_pipeline_stage": args.num_layers_in_last_pipeline_stage,
             "pipeline_dtype": torch.bfloat16,
