
Commit 9fda554

Improve megatron dataset preprocessing script and update docs (#918)
## What does this PR do?

Improve megatron dataset preprocessing script and update docs

## Usage

```bash
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
    --hf_dataset nvidia/Nemotron-Pretraining-SFT-v1 \
    --hf_name Nemotron-SFT-General \
    --hf_split train \
    --hf_max_samples_per_split 10_000_000 \
    --json_keys text \
    --tokenizer Qwen/Qwen3-0.6B \
    --output_dir /path/to/tokenized/data/qwen3 \
    --workers 32 \
    --max_sequence_length 256_000
```

```bash
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
    --jsonl_paths /path/to/data1.jsonl /path/to/data2.jsonl ... \
    --json_keys text \
    --tokenizer Qwen/Qwen3-0.6B \
    --output_dir /path/to/tokenized/data/qwen3 \
    --workers 32 \
    --max_sequence_length 256_000
```

## Testing

- Downloaded and tokenized Nemotron-Pretraining-SFT-v1 with the Nemotron-Nano-v2 tokenizer

## Summary by CodeRabbit

* **Documentation**
  * Updated data preparation guides with new CLI patterns and Hugging Face Hub integration instructions.
* **New Features**
  * Added batch tokenization via directory input and direct Hugging Face dataset downloads with flexible subset/split filtering.
* **Configuration Updates**
  * Optimized distillation settings: adjusted optimizer parameters and increased checkpoint retention.

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
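The `--hf_max_samples_per_split` flag above caps how many records are pulled from each split before tokenization. The capping logic can be sketched in a few lines; the function name and structure here are illustrative, not the actual modelopt implementation.

```python
from itertools import islice

def cap_samples(dataset_iter, max_samples=None):
    """Yield at most max_samples records from a (possibly streaming) split.

    Mirrors the idea behind --hf_max_samples_per_split: when the cap is
    unset, the whole split is consumed. Hypothetical helper, for illustration.
    """
    if max_samples is None:
        yield from dataset_iter
    else:
        yield from islice(dataset_iter, max_samples)

# Stand-in for a large streaming split; only 3 records are ever materialized.
split = ({"text": f"sample {i}"} for i in range(1_000_000))
capped = list(cap_samples(split, max_samples=3))
```

Because `islice` works on the iterator lazily, the cap also avoids downloading or tokenizing records beyond the limit.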
1 parent f08a65f commit 9fda554


6 files changed (+275 / -177 lines)


examples/megatron_bridge/README.md

Lines changed: 34 additions & 17 deletions
````diff
@@ -43,7 +43,7 @@ Once inside the container, you need to login with your HuggingFace token to down
 Note that the default dataset for pruning and quantization is [`nemotron-post-training-dataset-v2`](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2), which is gated.
 
 ```bash
-huggingface-cli login --token <your token>
+hf auth login --token <your token>
 ```
 
 ## Pruning
@@ -97,23 +97,40 @@ The [distill.py](distill.py) script loads student and teacher models from Huggin
 ### Data Preparation
 
 The distillation script expects pre-tokenized data in Megatron's binary format (`.bin` / `.idx` files).
-You can tokenize your JSONL dataset using the following function:
-
-```python
-from modelopt.torch.utils.plugins import megatron_preprocess_data
-
-megatron_preprocess_data(
-    input_path="/path/to/your/data.jsonl",
-    output_dir="/path/to/tokenized/data",
-    tokenizer_name_or_path="Qwen/Qwen3-0.6B",
-    json_keys=["text"],  # change to your JSON key if needed
-    workers=32,
-    log_interval=100000,
-    max_sequence_length=256000,  # To avoid rare OOM errors if text is too long
-)
+You can tokenize your JSONL datasets using the following command:
+
+```bash
+python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
+    --jsonl_paths /path/to/data1.jsonl /path/to/data2.jsonl ... \
+    --json_keys text \
+    --tokenizer Qwen/Qwen3-0.6B \
+    --output_dir /path/to/tokenized/data/qwen3 \
+    --workers 32 \
+    --max_sequence_length 256_000
+```
+
+Instead of `--jsonl_paths`, you can also pass a directory path to the `--input_dir` argument to tokenize all JSONL files in the directory.
+We are setting a maximum sequence length of 256k to avoid rare OOM errors in tokenization if text is too long.
+
+If you want to download and tokenize a dataset from Hugging Face Hub directly, you can use the following command:
+
+```bash
+python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
+    --hf_dataset nvidia/Nemotron-Pretraining-SFT-v1 \
+    --hf_name Nemotron-SFT-General \
+    --hf_split train \
+    --hf_max_samples_per_split 10_000_000 \
+    --json_keys text \
+    --tokenizer Qwen/Qwen3-0.6B \
+    --output_dir /path/to/tokenized/data/qwen3 \
+    --workers 32 \
+    --max_sequence_length 256_000
 ```
 
-If you have multiple JSONL files, you can tokenize them one by one and pass all the paths to the `--data_paths` argument.
+If you skip `--hf_name`, it will download and tokenize all subsets for the dataset.
+If you skip `--hf_split`, it will download and tokenize all splits for the subset.
+If you skip `--hf_max_samples_per_split`, it will download and tokenize all samples for the split.
 
 ### Distillation with Real Data
 
@@ -124,7 +141,7 @@ torchrun --nnodes 1 --nproc_per_node 8 distill.py \
     --tp_size 8 \
     --teacher_hf_path Qwen/Qwen3-8B \
    --student_hf_path Qwen/Qwen3-4B \
-    --data_paths 1.0 /path/to/tokenized/data \
+    --data_paths 1.0 /path/to/tokenized/data/qwen3 \
     --data_path_to_cache /path/to/cache/dataset_indices_qwen3 \
     --seq_length 8192 \
     --mbs 1 \
````
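The README's `--max_sequence_length 256_000` guard amounts to clipping any pathologically long tokenized document before it is written out. A minimal sketch of that behavior, assuming simple head truncation (the real script may handle overlong documents differently):

```python
def truncate_tokens(token_ids, max_sequence_length=256_000):
    """Clip a tokenized document to a maximum length.

    Illustrates why a cap avoids rare OOMs during tokenization: one huge
    document otherwise dominates memory. Hypothetical helper, not the
    actual megatron_preprocess_data code.
    """
    if max_sequence_length is not None and len(token_ids) > max_sequence_length:
        return token_ids[:max_sequence_length]
    return token_ids

untouched = truncate_tokens(list(range(10)))       # short docs pass through
clipped = truncate_tokens(list(range(300_000)))    # long docs are capped
```

Anything at or under the cap is returned unchanged, so the flag only affects outlier documents.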

examples/megatron_bridge/distill.py

Lines changed: 5 additions & 3 deletions
```diff
@@ -163,7 +163,7 @@ def _build_model_provider(hf_path):
         lr_warmup_iters=args.lr_warmup_iters,
         max_lr=args.lr,
         min_lr=args.min_lr,
-        adam_beta2=0.98,
+        adam_beta2=0.95,
     )
 
     # Build dataset config
@@ -227,7 +227,7 @@ def _build_model_provider(hf_path):
         save_interval=args.eval_interval,
         save=checkpoint_dir,
         load=checkpoint_dir,  # Resume from this directory (if exists)
-        most_recent_k=3,  # Keeps 3 most recent checkpoints (not metric-based)
+        most_recent_k=5,  # Keeps 5 most recent checkpoints (not metric-based)
         ckpt_format="torch_dist",
         async_save=True,
         fully_parallel_save=True,
@@ -238,7 +238,9 @@ def _build_model_provider(hf_path):
 
     print_rank_0("\nStarting distillation...")
     distill(config)
-    print_rank_0(f"\nDistillation done! Saved checkpoint to {checkpoint_dir}\n")
+    print_rank_0(
+        f"\nDistillation done! Saved checkpoint to {checkpoint_dir} in megatron distributed checkpoint format.\n"
+    )
 
 
 if __name__ == "__main__":
```
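The `adam_beta2` change from 0.98 to 0.95 shortens Adam's second-moment averaging horizon: as a rule of thumb, an exponential moving average with decay β₂ has an effective window of about 1 / (1 − β₂) steps. A quick illustration of that rule of thumb (this reasoning is general Adam folklore, not something stated in the PR):

```python
def ema_window(beta2):
    """Approximate averaging horizon of Adam's second-moment EMA, ~1 / (1 - beta2)."""
    return 1.0 / (1.0 - beta2)

old_window = ema_window(0.98)  # roughly 50 steps of gradient history
new_window = ema_window(0.95)  # roughly 20 steps of gradient history
```

A shorter window lets the optimizer's per-parameter scaling react faster to changes in gradient magnitude, at the cost of noisier second-moment estimates.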

examples/nemo_run/common/process_climbmix.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -67,7 +67,7 @@ def get_args():
     print("Tokenizing ClimbMix dataset...")
     input_paths = [raw_dir / name for name in subset_filenames]
     megatron_preprocess_data(
-        input_paths,
+        jsonl_paths=input_paths,
         output_dir=proc_dir,
         tokenizer_name_or_path=args.tokenizer,
         append_eod=True,
```
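The `json_keys` argument (and the CLI's `--json_keys`) selects which fields of each JSONL record get tokenized. The extraction step can be sketched as follows; the function name is hypothetical and the real preprocessor additionally tokenizes each string and, with `append_eod=True`, appends an end-of-document token.

```python
import io
import json

def iter_json_keys(jsonl_file, json_keys=("text",)):
    """Yield the configured fields from each JSONL record, in order.

    Illustrative sketch of the --json_keys selection step only; tokenization
    and .bin/.idx writing happen downstream in the actual script.
    """
    for line in jsonl_file:
        record = json.loads(line)
        for key in json_keys:
            yield record[key]

# Two-record JSONL stand-in; a real run would open the file from disk.
data = io.StringIO('{"text": "hello"}\n{"text": "world"}\n')
texts = list(iter_json_keys(data))
```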

modelopt/torch/prune/plugins/mcore_minitron.py

Lines changed: 1 addition & 0 deletions
```diff
@@ -317,6 +317,7 @@ def run_search(self) -> None:
         # Prune homogeneously
         self._prune(export_config, prune_depth=True)
 
+        # TODO: Rename to hybrid_layer_pattern after https://github.com/NVIDIA/Megatron-LM/pull/3377
         # Update hybrid_override_pattern if pruning is done on a hybrid model
         if isinstance(self.model, MambaModel):
             print_rank_0(f"Original hybrid_override_pattern: {self.model.hybrid_override_pattern}")
```
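For hybrid Mamba/attention models, `hybrid_override_pattern` encodes one layer type per character, so depth pruning must filter the pattern down to the surviving layers. A hypothetical sketch of that update (the helper name and selection rule are assumptions; the actual `mcore_minitron.py` logic is not shown in this diff):

```python
def prune_hybrid_pattern(pattern, kept_layer_indices):
    """Filter a per-layer hybrid pattern string down to the surviving layers.

    Each character marks one layer's type (e.g. 'M' for Mamba, '*' for
    attention), so depth pruning keeps the characters at the kept positions.
    Hypothetical helper for illustration.
    """
    return "".join(pattern[i] for i in sorted(kept_layer_indices))

# Keep layers 0, 2, 3, 5 of a 9-layer hybrid stack.
new_pattern = prune_hybrid_pattern("MM*MM*MM*", [0, 2, 3, 5])
```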
