# Tokenizing for Megatron Frameworks

| **Section** | **Description** | **Link** |
| :---: | :---: | :---: |
| From JSONL files | Tokenize local JSONL files | \[[Link](#from-jsonl-files)\] |
| From Hugging Face Hub | Stream or download HF datasets and tokenize | \[[Link](#from-hugging-face-hub)\] |
| `reasoning_content` for Post-Training v3 | Control how chain-of-thought traces are handled | \[[Link](#reasoning_content-for-post-training-v3-datasets)\] |
| Nemotron Pre/Post-Training Datasets | Ready-to-run commands for all Nemotron datasets | \[[Link](#ready-to-run-tokenization-commands)\] |

The distillation and pre-training scripts in Megatron-Bridge or Megatron-LM expect data pre-tokenized in Megatron's binary indexed format (`.bin` / `.idx`).
Use the `megatron_preprocess_data` utility to tokenize any JSONL or Hugging Face dataset.
The tokenization scripts below print the list of output prefixes (e.g. `tokenized_qwen3/data1_text`) that you can pass to the `data_paths` argument of Megatron training scripts, optionally with relative weights on different files.
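
As a reference, a minimal sketch of how such prefixes can be combined with relative weights (the `0.7` / `0.3` weights are illustrative, and the exact argument name depends on your training script; Megatron-LM's `--data-path`, for example, accepts `weight prefix` pairs):

```bash
# Illustrative blend of two tokenized prefixes with relative sampling weights 0.7 and 0.3.
DATA_PATHS="0.7 tokenized_qwen3/data1_text 0.3 tokenized_qwen3/data2_text"
echo "--data-path ${DATA_PATHS}"   # splice this into your Megatron launch command
```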

**Important Notes:**

- For pre-training / raw-text data (`text` key) — use `--append_eod` so Megatron can tell where documents end when concatenating them into long sequences. A sketch of both input schemas follows this list.
- For post-training chat data (`messages` key) — omit `--append_eod`; the chat template already appends EOS at the end of each conversation.
- Set `--max_sequence_length 256_000` to avoid rare out-of-memory errors when individual documents are very long.
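
A minimal sketch of the two input schemas (hypothetical file names and contents; each JSONL line is one document or one conversation):

```bash
# Hypothetical sample inputs, one JSON object per line.
cat > data1.jsonl << 'EOF'
{"text": "First raw-text document ..."}
{"text": "Second raw-text document ..."}
EOF

cat > sft_data.jsonl << 'EOF'
{"messages": [{"role": "user", "content": "What is 2 + 2?"}, {"role": "assistant", "content": "4"}]}
EOF
```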

## From JSONL files

```bash
# Pre-training / raw-text data: tokenize the "text" field and append an EOD token per document.
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
    --jsonl_paths /path/to/data1.jsonl /path/to/data2.jsonl ... \
    --json_keys text \
    --tokenizer Qwen/Qwen3-0.6B \
    --output_dir tokenized_qwen3 \
    --workers 32 \
    --append_eod
```

```bash
# Post-training chat data: tokenize the "messages" field; no --append_eod needed.
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
    --jsonl_paths /path/to/sft_data.jsonl \
    --json_keys messages \
    --tokenizer Qwen/Qwen3-0.6B \
    --output_dir tokenized_qwen3 \
    --workers 32
```

Instead of `--jsonl_paths`, pass `--input_dir /path/to/dir` to tokenize all JSONL files in a directory (both `.jsonl` and `.jsonl.gz` are supported).

## From Hugging Face Hub

To tokenize a dataset directly from the Hugging Face Hub:

```bash
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
    --hf_dataset nvidia/Nemotron-Pretraining-SFT-v1 \
    --hf_name Nemotron-SFT-Code \
    --hf_split train \
    --hf_max_samples_per_split 10_000_000 \
    --json_keys text \
    --tokenizer Qwen/Qwen3-0.6B \
    --output_dir tokenized_qwen3 \
    --workers 32 \
    --append_eod
```

Omit `--hf_name` to process all subsets, `--hf_split` to process all splits, or `--hf_max_samples_per_split` to process all samples.
To quickly test, use [nvidia/Nemotron-Pretraining-Dataset-sample](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-Dataset-sample).
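
For instance, a quick smoke test against the sample dataset might look like this (assuming it exposes a `train` split with a `text` column; the sample cap, worker count, and output directory name are illustrative):

```bash
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
    --hf_dataset nvidia/Nemotron-Pretraining-Dataset-sample \
    --hf_split train \
    --hf_max_samples_per_split 10_000 \
    --json_keys text \
    --tokenizer Qwen/Qwen3-0.6B \
    --output_dir tokenized_qwen3_sample \
    --workers 8 \
    --append_eod
```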

For very large datasets (tens of millions of documents), or for datasets with complex nested message schemas (e.g. `tool_calls`, `function_call` fields) that cause Arrow type-cast errors in non-streaming mode, add `--hf_streaming` to avoid downloading the full dataset: only the rows actually consumed are fetched. Optionally pair it with `--hf_max_samples_per_split <num_samples>` to cap the row count; without it, streaming still works but re-downloads on every run with no disk cache.

> **Performance note:** Non-streaming mode downloads all Parquet shards once and caches them as Arrow files on disk.
> Re-runs read from cache and are much faster.
> Streaming re-downloads on every run with no cache, so it is slower for full-dataset processing.

## `reasoning_content` for Post-Training v3 Datasets

v3 datasets include a `reasoning_content` field in assistant messages (chain-of-thought separate from the final answer). Use `--reasoning_content` to control how it is handled:

| Value | Behaviour |
| --- | --- |
| `strip` (default) | Field is discarded before `apply_chat_template`. Safe for any tokenizer. |
| `inline` | Wrapped as `<think>…</think>` and prepended to `content`. Preserves reasoning in a tokenizer-agnostic way. |
| `native` | Passed unchanged. Requires the tokenizer's chat template to handle the field (e.g. Qwen3). |
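
For example, with `--reasoning_content inline` an assistant turn is transformed roughly like this (illustrative content; the exact whitespace around the `<think>` block may differ):

```text
before: {"role": "assistant", "reasoning_content": "2 + 2 = 4", "content": "The answer is 4."}
after:  {"role": "assistant", "content": "<think>2 + 2 = 4</think>The answer is 4."}
```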

```bash
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
    --hf_dataset nvidia/Nemotron-Math-v2 \
    --hf_split high_part00 \
    --json_keys messages \
    --tokenizer nvidia/NVIDIA-Nemotron-Nano-9B-v2 \
    --output_dir tokenized_nemotron_v2 \
    --workers 32 \
    --reasoning_content inline
```

---

## Ready-to-run tokenization commands

This section collects tokenization commands for all Nemotron Pre-Training and Post-Training datasets used in Megatron-Bridge distillation experiments.

Two parameters vary by model — set them before running the commands below:

```bash
TOKENIZER=nvidia/NVIDIA-Nemotron-Nano-9B-v2   # Hugging Face tokenizer name (or local path)
OUTPUT_DIR=tokenized_nemotron_v2              # Output directory for tokenized files
```

> [!TIP]
> Token count for a `.bin` file = file size in bytes ÷ 4. This is also printed by the tokenization script on completion.

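A quick way to check this yourself, assuming the 4-byte token dtype stated above (the file name is one of the outputs listed at the end of this page):

```bash
BIN_FILE=${OUTPUT_DIR}/nvidia--Nemotron-Math-v2_default_high_part00_messages.bin
# GNU stat prints the size in bytes; on macOS use `stat -f%z` instead.
echo "$(( $(stat -c%s "${BIN_FILE}") / 4 )) tokens"
```
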
> [!NOTE]
> Tokenizing each of the datasets below takes anywhere from 10 minutes to a few hours. You can run the commands in parallel to speed things up.
>
> You can tokenize additional datasets, or skip some, depending on your needs.

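If you do run them in parallel, one simple pattern is to wrap each command below in its own script and launch the scripts in the background (the script names here are hypothetical):

```bash
# Hypothetical wrapper scripts, one tokenization command per file.
for SCRIPT in tokenize_pretraining_sft_v1.sh tokenize_math_v2.sh tokenize_science_v1.sh; do
  bash "${SCRIPT}" > "${SCRIPT%.sh}.log" 2>&1 &   # run in the background, keep one log per job
done
wait   # block until every tokenization job has finished
```
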
### Nemotron Pretraining dataset

**[nvidia/Nemotron-Pretraining-SFT-v1](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-SFT-v1)** — raw text; omitting `--hf_name` tokenizes all 3 subsets (Code, General, MATH) in one command, producing a separate output file named after each subset:

```bash
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
    --hf_dataset nvidia/Nemotron-Pretraining-SFT-v1 \
    --hf_split train \
    --hf_streaming \
    --hf_max_samples_per_split 10_000_000 \
    --json_keys text \
    --tokenizer ${TOKENIZER} \
    --output_dir ${OUTPUT_DIR} \
    --workers 96 \
    --max_sequence_length 256_000 \
    --append_eod \
    --strip_newlines
```

---

### Nemotron Post-training v1 dataset

**[nvidia/Nemotron-Post-Training-Dataset-v1](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v1)** — `stem` split, capped at 5M samples. v1 data does not contain reasoning traces:

```bash
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
    --hf_dataset nvidia/Nemotron-Post-Training-Dataset-v1 \
    --hf_name default \
    --hf_split stem \
    --hf_streaming \
    --hf_max_samples_per_split 5_000_000 \
    --json_keys messages \
    --tokenizer ${TOKENIZER} \
    --output_dir ${OUTPUT_DIR} \
    --workers 96 \
    --max_sequence_length 256_000
```

---

### Nemotron Post-training v3 collection

The datasets below are from the [Nemotron Post-Training v3 collection](https://huggingface.co/collections/nvidia/nemotron-post-training-v3). All use `--reasoning_content inline` to preserve `<think>…</think>` traces. The collection contains many more datasets — if you care about benchmarks not covered here (e.g. multilingual, agentic/tool use, SWE, safety), pick the relevant datasets from the collection and tokenize them the same way.

**[nvidia/Nemotron-Math-v2](https://huggingface.co/datasets/nvidia/Nemotron-Math-v2)** — tokenize `high_part00` and `high_part01` separately:

```bash
for SPLIT in high_part00 high_part01; do
  python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
    --hf_dataset nvidia/Nemotron-Math-v2 \
    --hf_split ${SPLIT} \
    --json_keys messages \
    --tokenizer ${TOKENIZER} \
    --output_dir ${OUTPUT_DIR} \
    --workers 96 \
    --max_sequence_length 256_000 \
    --reasoning_content inline
done
```

**[nvidia/Nemotron-SFT-Competitive-Programming-v2](https://huggingface.co/datasets/nvidia/Nemotron-SFT-Competitive-Programming-v2)** — stored as raw JSONL files on the Hugging Face Hub; download them before tokenizing:

```bash
hf download nvidia/Nemotron-SFT-Competitive-Programming-v2 \
    --repo-type dataset \
    --local-dir datasets/Nemotron-SFT-Competitive-Programming-v2/
for FILE in competitive_programming_python_00 competitive_programming_cpp_00; do
  python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
    --jsonl_paths datasets/Nemotron-SFT-Competitive-Programming-v2/data/${FILE}.jsonl \
    --json_keys messages \
    --tokenizer ${TOKENIZER} \
    --output_dir ${OUTPUT_DIR} \
    --workers 96 \
    --max_sequence_length 256_000 \
    --reasoning_content inline
done
```

**[nvidia/Nemotron-Science-v1](https://huggingface.co/datasets/nvidia/Nemotron-Science-v1)** — stored as raw JSONL files on the Hugging Face Hub; download them before tokenizing:

```bash
hf download nvidia/Nemotron-Science-v1 \
    --repo-type dataset \
    --local-dir datasets/Nemotron-Science-v1/
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
    --input_dir datasets/Nemotron-Science-v1/data/ \
    --json_keys messages \
    --tokenizer ${TOKENIZER} \
    --output_dir ${OUTPUT_DIR} \
    --workers 96 \
    --max_sequence_length 256_000 \
    --reasoning_content inline
```

**[nvidia/Nemotron-SFT-Instruction-Following-Chat-v2](https://huggingface.co/datasets/nvidia/Nemotron-SFT-Instruction-Following-Chat-v2)** — stored as raw JSONL files on the Hugging Face Hub; download them before tokenizing:

```bash
hf download nvidia/Nemotron-SFT-Instruction-Following-Chat-v2 \
    --repo-type dataset \
    --local-dir datasets/Nemotron-SFT-Instruction-Following-Chat-v2/
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
    --input_dir datasets/Nemotron-SFT-Instruction-Following-Chat-v2/data/ \
    --json_keys messages \
    --tokenizer ${TOKENIZER} \
    --output_dir ${OUTPUT_DIR} \
    --workers 96 \
    --max_sequence_length 256_000 \
    --reasoning_content inline
```

---

### Expected output

After running all commands above, `${OUTPUT_DIR}/` should contain the following `.bin` / `.idx` file pairs:

```text
nvidia--Nemotron-Pretraining-SFT-v1_Nemotron-SFT-Code_train_text_max10000000.{bin,idx}
nvidia--Nemotron-Pretraining-SFT-v1_Nemotron-SFT-General_train_text_max10000000.{bin,idx}
nvidia--Nemotron-Pretraining-SFT-v1_Nemotron-SFT-MATH_train_text_max10000000.{bin,idx}
nvidia--Nemotron-Post-Training-Dataset-v1_default_stem_messages_max5000000.{bin,idx}
nvidia--Nemotron-Math-v2_default_high_part00_messages.{bin,idx}
nvidia--Nemotron-Math-v2_default_high_part01_messages.{bin,idx}
competitive_programming_python_00_messages.{bin,idx}
competitive_programming_cpp_00_messages.{bin,idx}
MCQ_messages.{bin,idx}
RQA_messages.{bin,idx}
reasoning_off_messages.{bin,idx}
reasoning_on_messages.{bin,idx}
```