Commit 9fda554
Improve megatron dataset preprocessing script and update docs (#918)
## What does this PR do?
Improve megatron dataset preprocessing script and update docs
## Usage
Tokenize a dataset from the Hugging Face Hub:
```bash
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
--hf_dataset nvidia/Nemotron-Pretraining-SFT-v1 \
--hf_name Nemotron-SFT-General \
--hf_split train \
--hf_max_samples_per_split 10_000_000 \
--json_keys text \
--tokenizer Qwen/Qwen3-0.6B \
--output_dir /path/to/tokenized/data/qwen3 \
--workers 32 \
--max_sequence_length 256_000
```
Tokenize local JSONL files:
```bash
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
--jsonl_paths /path/to/data1.jsonl /path/to/data2.jsonl ... \
--json_keys text \
--tokenizer Qwen/Qwen3-0.6B \
--output_dir /path/to/tokenized/data/qwen3 \
--workers 32 \
--max_sequence_length 256_000
```
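The `--jsonl_paths` and `--json_keys text` flags above suggest that each input line is a JSON object with a `text` field. A minimal sketch for producing and checking such a file, assuming that layout (the helper names `write_jsonl` and `read_key` are hypothetical and not part of ModelOpt):

```python
import json
from pathlib import Path


def write_jsonl(records, path):
    """Write one JSON object per line and return the number of records written."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps non-ASCII text readable in the file.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


def read_key(path, key="text"):
    """Yield the value stored under `key` for each non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)[key]
```

Running `list(read_key("/path/to/data1.jsonl"))` before tokenizing is a cheap way to confirm that every record actually carries the key passed to `--json_keys`; a missing key surfaces as a `KeyError`.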
## Testing
- Downloaded and tokenized Nemotron-Pretraining-SFT-v1 with the Nemotron-Nano-v2 tokenizer.
## Summary by CodeRabbit
* **Documentation**
  * Updated data preparation guides with new CLI patterns and Hugging Face Hub integration instructions.
* **New Features**
  * Added batch tokenization via directory input and direct Hugging Face dataset downloads with flexible subset/split filtering.
* **Configuration Updates**
  * Optimized distillation settings: adjusted optimizer parameters and increased checkpoint retention.
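
The flag for the new directory input is not shown in this commit message, so the sketch below does not use it. It expands a directory into explicit `--jsonl_paths` arguments instead, using only the flags from the Usage section. The helper name `build_preprocess_command` and its defaults are assumptions for illustration:

```python
from pathlib import Path


def build_preprocess_command(data_dir, tokenizer, output_dir, workers=32,
                             max_sequence_length=256_000, json_key="text"):
    """Return an argv list that tokenizes every *.jsonl file directly under data_dir."""
    # Sort so the command is deterministic across runs and filesystems.
    jsonl_paths = sorted(str(p) for p in Path(data_dir).glob("*.jsonl"))
    if not jsonl_paths:
        raise FileNotFoundError(f"no .jsonl files found in {data_dir}")
    return [
        "python", "-m", "modelopt.torch.utils.plugins.megatron_preprocess_data",
        "--jsonl_paths", *jsonl_paths,
        "--json_keys", json_key,
        "--tokenizer", tokenizer,
        "--output_dir", str(output_dir),
        "--workers", str(workers),
        "--max_sequence_length", str(max_sequence_length),
    ]
```

The returned list can be handed to `subprocess.run(cmd, check=True)`; building it as a list rather than a shell string avoids quoting problems with paths that contain spaces.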
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

1 parent f08a65f commit 9fda554
File tree

6 files changed, +275 -177 lines changed

- examples
  - megatron_bridge
  - nemo_run/common
- modelopt/torch
  - prune/plugins
  - utils/plugins
- tests/gpu_megatron/torch/utils/plugins