
Commit 9fda554

Improve megatron dataset preprocessing script and update docs (#918)
## What does this PR do?

Improve megatron dataset preprocessing script and update docs

## Usage

```bash
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
    --hf_dataset nvidia/Nemotron-Pretraining-SFT-v1 \
    --hf_name Nemotron-SFT-General \
    --hf_split train \
    --hf_max_samples_per_split 10_000_000 \
    --json_keys text \
    --tokenizer Qwen/Qwen3-0.6B \
    --output_dir /path/to/tokenized/data/qwen3 \
    --workers 32 \
    --max_sequence_length 256_000
```

```bash
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
    --jsonl_paths /path/to/data1.jsonl /path/to/data2.jsonl ... \
    --json_keys text \
    --tokenizer Qwen/Qwen3-0.6B \
    --output_dir /path/to/tokenized/data/qwen3 \
    --workers 32 \
    --max_sequence_length 256_000
```

## Testing

- Downloaded and tokenized Nemotron-Pretraining-SFT-v1 with the Nemotron-Nano-v2 tokenizer

## Summary by CodeRabbit

* **Documentation**
  * Updated data preparation guides with new CLI patterns and Hugging Face Hub integration instructions.
* **New Features**
  * Added batch tokenization via directory input and direct Hugging Face dataset downloads with flexible subset/split filtering.
* **Configuration Updates**
  * Optimized distillation settings: adjusted optimizer parameters and increased checkpoint retention.

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
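The `--hf_max_samples_per_split` flag above caps how many records are pulled from each split before tokenization. The capping logic can be sketched in a few lines; the function name and structure here are illustrative, not the actual modelopt implementation.

```python
from itertools import islice

def cap_samples(dataset_iter, max_samples=None):
    """Yield at most max_samples records from a (possibly streaming) split.

    Mirrors the idea behind --hf_max_samples_per_split: when the cap is
    unset, the whole split is consumed. Hypothetical helper, for illustration.
    """
    if max_samples is None:
        yield from dataset_iter
    else:
        yield from islice(dataset_iter, max_samples)

# Stand-in for a large streaming split; only 3 records are ever materialized.
split = ({"text": f"sample {i}"} for i in range(1_000_000))
capped = list(cap_samples(split, max_samples=3))
```

Because `islice` works on the iterator lazily, the cap also avoids downloading or tokenizing records beyond the limit.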
1 parent f08a65f commit 9fda554


6 files changed (+275 / -177 lines)


examples/megatron_bridge/README.md

Lines changed: 34 additions & 17 deletions
````diff
@@ -43,7 +43,7 @@ Once inside the container, you need to login with your HuggingFace token to down
 Note that the default dataset for pruning and quantization is [`nemotron-post-training-dataset-v2`](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2), which is gated.
 
 ```bash
-huggingface-cli login --token <your token>
+hf auth login --token <your token>
 ```
 
 ## Pruning
@@ -97,23 +97,40 @@ The [distill.py](distill.py) script loads student and teacher models from Huggin
 ### Data Preparation
 
 The distillation script expects pre-tokenized data in Megatron's binary format (`.bin` / `.idx` files).
-You can tokenize your JSONL dataset using the following function:
-
-```python
-from modelopt.torch.utils.plugins import megatron_preprocess_data
-
-megatron_preprocess_data(
-    input_path="/path/to/your/data.jsonl",
-    output_dir="/path/to/tokenized/data",
-    tokenizer_name_or_path="Qwen/Qwen3-0.6B",
-    json_keys=["text"],  # change to your JSON key if needed
-    workers=32,
-    log_interval=100000,
-    max_sequence_length=256000,  # To avoid rare OOM errors if text is too long
-)
+You can tokenize your JSONL datasets using the following command:
+
+```bash
+python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
+    --jsonl_paths /path/to/data1.jsonl /path/to/data2.jsonl ... \
+    --json_keys text \
+    --tokenizer Qwen/Qwen3-0.6B \
+    --output_dir /path/to/tokenized/data/qwen3 \
+    --workers 32 \
+    --max_sequence_length 256_000
+```
+
+Instead of `--jsonl_paths`, you can also pass a directory path to the `--input_dir` argument to tokenize all JSONL files in the directory.
+We are setting a maximum sequence length of 256k to avoid rare OOM errors in tokenization if text is too long.
+
+If you want to download and tokenize a dataset from Hugging Face Hub directly, you can use the following command:
+
+```bash
+python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
+    --hf_dataset nvidia/Nemotron-Pretraining-SFT-v1 \
+    --hf_name Nemotron-SFT-General \
+    --hf_split train \
+    --hf_max_samples_per_split 10_000_000 \
+    --json_keys text \
+    --tokenizer Qwen/Qwen3-0.6B \
+    --output_dir /path/to/tokenized/data/qwen3 \
+    --workers 32 \
+    --max_sequence_length 256_000
 ```
 
-If you have multiple JSONL files, you can tokenize them one by one and pass all the paths to the `--data_paths` argument.
+If you skip `--hf_name`, it will download and tokenize all subsets for the dataset.
+If you skip `--hf_split`, it will download and tokenize all splits for the subset.
+If you skip `--hf_max_samples_per_split`, it will download and tokenize all samples for the split.
 
 ### Distillation with Real Data
 
@@ -124,7 +141,7 @@ torchrun --nnodes 1 --nproc_per_node 8 distill.py \
     --tp_size 8 \
     --teacher_hf_path Qwen/Qwen3-8B \
    --student_hf_path Qwen/Qwen3-4B \
-    --data_paths 1.0 /path/to/tokenized/data \
+    --data_paths 1.0 /path/to/tokenized/data/qwen3 \
     --data_path_to_cache /path/to/cache/dataset_indices_qwen3 \
     --seq_length 8192 \
     --mbs 1 \
````
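The README's `--max_sequence_length 256_000` guard amounts to clipping any pathologically long tokenized document before it is written out. A minimal sketch of that behavior, assuming simple head truncation (the real script may handle overlong documents differently):

```python
def truncate_tokens(token_ids, max_sequence_length=256_000):
    """Clip a tokenized document to a maximum length.

    Illustrates why a cap avoids rare OOMs during tokenization: one huge
    document otherwise dominates memory. Hypothetical helper, not the
    actual megatron_preprocess_data code.
    """
    if max_sequence_length is not None and len(token_ids) > max_sequence_length:
        return token_ids[:max_sequence_length]
    return token_ids

untouched = truncate_tokens(list(range(10)))       # short docs pass through
clipped = truncate_tokens(list(range(300_000)))    # long docs are capped
```

Anything at or under the cap is returned unchanged, so the flag only affects outlier documents.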

examples/megatron_bridge/distill.py

Lines changed: 5 additions & 3 deletions
```diff
@@ -163,7 +163,7 @@ def _build_model_provider(hf_path):
         lr_warmup_iters=args.lr_warmup_iters,
         max_lr=args.lr,
         min_lr=args.min_lr,
-        adam_beta2=0.98,
+        adam_beta2=0.95,
     )
 
     # Build dataset config
@@ -227,7 +227,7 @@ def _build_model_provider(hf_path):
         save_interval=args.eval_interval,
         save=checkpoint_dir,
         load=checkpoint_dir,  # Resume from this directory (if exists)
-        most_recent_k=3,  # Keeps 3 most recent checkpoints (not metric-based)
+        most_recent_k=5,  # Keeps 5 most recent checkpoints (not metric-based)
         ckpt_format="torch_dist",
         async_save=True,
         fully_parallel_save=True,
@@ -238,7 +238,9 @@ def _build_model_provider(hf_path):
 
     print_rank_0("\nStarting distillation...")
     distill(config)
-    print_rank_0(f"\nDistillation done! Saved checkpoint to {checkpoint_dir}\n")
+    print_rank_0(
+        f"\nDistillation done! Saved checkpoint to {checkpoint_dir} in megatron distributed checkpoint format.\n"
+    )
 
 
 if __name__ == "__main__":
```
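The `adam_beta2` change from 0.98 to 0.95 shortens Adam's second-moment averaging horizon: as a rule of thumb, an exponential moving average with decay β₂ has an effective window of about 1 / (1 − β₂) steps. A quick illustration of that rule of thumb (this reasoning is general Adam folklore, not something stated in the PR):

```python
def ema_window(beta2):
    """Approximate averaging horizon of Adam's second-moment EMA, ~1 / (1 - beta2)."""
    return 1.0 / (1.0 - beta2)

old_window = ema_window(0.98)  # roughly 50 steps of gradient history
new_window = ema_window(0.95)  # roughly 20 steps of gradient history
```

A shorter window lets the optimizer's per-parameter scaling react faster to changes in gradient magnitude, at the cost of noisier second-moment estimates.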

examples/nemo_run/common/process_climbmix.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -67,7 +67,7 @@ def get_args():
     print("Tokenizing ClimbMix dataset...")
     input_paths = [raw_dir / name for name in subset_filenames]
     megatron_preprocess_data(
-        input_paths,
+        jsonl_paths=input_paths,
         output_dir=proc_dir,
         tokenizer_name_or_path=args.tokenizer,
         append_eod=True,
```
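The `json_keys` argument (and the CLI's `--json_keys`) selects which fields of each JSONL record get tokenized. The extraction step can be sketched as follows; the function name is hypothetical and the real preprocessor additionally tokenizes each string and, with `append_eod=True`, appends an end-of-document token.

```python
import io
import json

def iter_json_keys(jsonl_file, json_keys=("text",)):
    """Yield the configured fields from each JSONL record, in order.

    Illustrative sketch of the --json_keys selection step only; tokenization
    and .bin/.idx writing happen downstream in the actual script.
    """
    for line in jsonl_file:
        record = json.loads(line)
        for key in json_keys:
            yield record[key]

# Two-record JSONL stand-in; a real run would open the file from disk.
data = io.StringIO('{"text": "hello"}\n{"text": "world"}\n')
texts = list(iter_json_keys(data))
```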

modelopt/torch/prune/plugins/mcore_minitron.py

Lines changed: 1 addition & 0 deletions
```diff
@@ -317,6 +317,7 @@ def run_search(self) -> None:
         # Prune homogeneously
         self._prune(export_config, prune_depth=True)
 
+        # TODO: Rename to hybrid_layer_pattern after https://github.com/NVIDIA/Megatron-LM/pull/3377
         # Update hybrid_override_pattern if pruning is done on a hybrid model
         if isinstance(self.model, MambaModel):
             print_rank_0(f"Original hybrid_override_pattern: {self.model.hybrid_override_pattern}")
```
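For hybrid Mamba/attention models, `hybrid_override_pattern` encodes one layer type per character, so depth pruning must filter the pattern down to the surviving layers. A hypothetical sketch of that update (the helper name and selection rule are assumptions; the actual `mcore_minitron.py` logic is not shown in this diff):

```python
def prune_hybrid_pattern(pattern, kept_layer_indices):
    """Filter a per-layer hybrid pattern string down to the surviving layers.

    Each character marks one layer's type (e.g. 'M' for Mamba, '*' for
    attention), so depth pruning keeps the characters at the kept positions.
    Hypothetical helper for illustration.
    """
    return "".join(pattern[i] for i in sorted(kept_layer_indices))

# Keep layers 0, 2, 3, 5 of a 9-layer hybrid stack.
new_pattern = prune_hybrid_pattern("MM*MM*MM*", [0, 2, 3, 5])
```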
