| Section | Description | Link |
|---|---|---|
| From JSONL files | Tokenize local JSONL files | [Link] |
| From Hugging Face Hub | Stream or download HF datasets and tokenize | [Link] |
reasoning_content for Post-Training v3 |
Control how chain-of-thought traces are handled | [Link] |
| Nemotron Pre/Post-Training Datasets | Ready-to-run commands for all Nemotron datasets | [Link] |
The distillation and pre-training scripts in Megatron-Bridge or Megatron-LM expect data pre-tokenized in Megatron's binary indexed format (.bin / .idx).
Use the megatron_preprocess_data utility to tokenize any JSONL or Hugging Face dataset.
The tokenization scripts below print the list of output prefixes (e.g. tokenized_qwen3/data1_text) that you can use for the data_paths argument (with relative weights on different files) in Megatron training scripts.
Important Notes:
- For Pretraining / raw-text data (
textkey) — use--append_eodso Megatron can tell where documents end when concatenating them into long sequences. - For Post-training chat data (
messageskey) — omit--append_eod; the chat template already appends EOS at the end of each conversation. - Set
--max_sequence_length 256_000to avoid rare OOM errors if some text is very long.
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
--jsonl_paths /path/to/data1.jsonl /path/to/data2.jsonl ... \
--json_keys text \
--tokenizer Qwen/Qwen3-0.6B \
--output_dir tokenized_qwen3 \
--workers 32 \
--append_eodpython -m modelopt.torch.utils.plugins.megatron_preprocess_data \
--jsonl_paths /path/to/sft_data.jsonl \
--json_keys messages \
--tokenizer Qwen/Qwen3-0.6B \
--output_dir tokenized_qwen3 \
--workers 32Instead of --jsonl_paths, pass --input_dir /path/to/dir to tokenize all JSONL files in a directory (.jsonl and .jsonl.gz are both supported).
To tokenize a dataset directly from Hugging Face Hub:
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
--hf_dataset nvidia/Nemotron-Pretraining-SFT-v1 \
--hf_name Nemotron-SFT-Code \
--hf_split train \
--hf_max_samples_per_split 10_000_000 \
--json_keys text \
--tokenizer Qwen/Qwen3-0.6B \
--output_dir tokenized_qwen3 \
--workers 32 \
--append_eodOmit --hf_name to process all subsets, --hf_split for all splits, or --hf_max_samples_per_split for all samples.
To quickly test, use nvidia/Nemotron-Pretraining-Dataset-sample.
For very large datasets (tens of millions of documents), or datasets with complex nested message schemas (e.g. tool_calls, function_call fields) that cause Arrow type-cast errors in non-streaming mode, add --hf_streaming to avoid downloading the full dataset — only the rows actually consumed are fetched. Optionally pair with --hf_max_samples_per_split <num_samples> to cap the row count; without it streaming still works but re-downloads on every run with no disk cache.
Performance note: Non-streaming mode downloads all Parquet shards once and caches them as Arrow files on disk. Re-runs read from cache and are much faster. Streaming re-downloads on every run with no cache, so it is slower for full-dataset processing.
v3 datasets include a reasoning_content field in assistant messages (chain-of-thought separate from
the final answer). Use --reasoning_content to control how it is handled:
| Value | Behaviour |
|---|---|
strip (default) |
Field is discarded before apply_chat_template. Safe for any tokenizer. |
inline |
Wrapped as <think>…</think> and prepended to content. Preserves reasoning in a tokenizer-agnostic way. |
native |
Passed unchanged. Requires the tokenizer's chat template to handle the field (e.g. Qwen3). |
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
--hf_dataset nvidia/Nemotron-Math-v2 \
--hf_split high_part00 \
--json_keys messages \
--tokenizer nvidia/NVIDIA-Nemotron-Nano-9B-v2 \
--output_dir tokenized_nemotron_v2 \
--workers 32 \
--reasoning_content inlineTokenization commands for all Nemotron Pre-Training and Post-Training datasets used in Megatron-Bridge distillation experiments.
Two parameters vary by model — set them before running the commands below:
TOKENIZER=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 # HuggingFace tokenizer (or local path)
OUTPUT_DIR=tokenized_nemotron_3 # Output directory for tokenized filesTip
Token count for a .bin file = file size in bytes ÷ 4. This is also printed by the tokenization script on completion.
Note
Tokenizing each of the datasets below will take anywhere between 10 minutes to few hours. You can tokenize all in parallel to speed up the process.
You may tokenize more datasets or skip some datasets depending on your needs.
nvidia/Nemotron-Pretraining-SFT-v1 — raw text; omitting --hf_name tokenizes all 3 subsets (Code, General, MATH) in one command, producing a separate output file per subset named after each:
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
--hf_dataset nvidia/Nemotron-Pretraining-SFT-v1 \
--hf_split train \
--hf_streaming \
--hf_max_samples_per_split 10_000_000 \
--json_keys text \
--tokenizer ${TOKENIZER} \
--output_dir ${OUTPUT_DIR} \
--workers 96 \
--max_sequence_length 256_000 \
--append_eod \
--strip_newlinesnvidia/Nemotron-Post-Training-Dataset-v1 — STEM subset, capped at 5M samples. v1 data does not contain reasoning traces:
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
--hf_dataset nvidia/Nemotron-Post-Training-Dataset-v1 \
--hf_name default \
--hf_split stem \
--hf_streaming \
--hf_max_samples_per_split 5_000_000 \
--json_keys messages \
--tokenizer ${TOKENIZER} \
--output_dir ${OUTPUT_DIR} \
--workers 96 \
--max_sequence_length 256_000Datasets below are from the Nemotron Post-Training v3 collection. All use --reasoning_content inline to preserve <think>…</think> traces. The collection contains many more datasets — if you care about benchmarks not covered here (e.g. multilingual, agentic/tool use, SWE, safety), pick the relevant datasets from the collection and tokenize them the same way.
nvidia/Nemotron-Math-v2 — tokenize high_part00 and high_part01 separately. --hf_streaming is required because the messages contain extra fields (e.g. tool_calls) that cause Arrow type-cast errors in non-streaming mode when using tokenizers with complex chat templates (such as Nemotron v3):
for SPLIT in high_part00 high_part01; do
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
--hf_dataset nvidia/Nemotron-Math-v2 \
--hf_split ${SPLIT} \
--hf_streaming \
--json_keys messages \
--tokenizer ${TOKENIZER} \
--output_dir ${OUTPUT_DIR} \
--workers 96 \
--max_sequence_length 256_000 \
--reasoning_content inline
donenvidia/Nemotron-SFT-Math-v3 — stored as raw JSONL on HuggingFace, download before tokenizing (more reliable than streaming for this dataset due to complex nested tool_calls fields):
hf download nvidia/Nemotron-SFT-Math-v3 \
--repo-type dataset \
--local-dir datasets/Nemotron-SFT-Math-v3/
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
--jsonl_paths datasets/Nemotron-SFT-Math-v3/data/train.jsonl \
--json_keys messages \
--tokenizer ${TOKENIZER} \
--output_dir ${OUTPUT_DIR} \
--workers 96 \
--max_sequence_length 256_000 \
--reasoning_content inline
# Rename to avoid generic file name
mv ${OUTPUT_DIR}/train_messages.bin ${OUTPUT_DIR}/nvidia--Nemotron-SFT-Math-v3_default_train_messages.bin
mv ${OUTPUT_DIR}/train_messages.idx ${OUTPUT_DIR}/nvidia--Nemotron-SFT-Math-v3_default_train_messages.idxnvidia/Nemotron-SFT-Competitive-Programming-v2 — stored as raw JSONL on HuggingFace, download before tokenizing:
hf download nvidia/Nemotron-SFT-Competitive-Programming-v2 \
--repo-type dataset \
--local-dir datasets/Nemotron-SFT-Competitive-Programming-v2/
for FILE in competitive_programming_python_00 competitive_programming_cpp_00; do
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
--jsonl_paths datasets/Nemotron-SFT-Competitive-Programming-v2/data/${FILE}.jsonl \
--json_keys messages \
--tokenizer ${TOKENIZER} \
--output_dir ${OUTPUT_DIR} \
--workers 96 \
--max_sequence_length 256_000 \
--reasoning_content inline
donenvidia/Nemotron-Science-v1 — stored as raw JSONL on HuggingFace, download before tokenizing:
hf download nvidia/Nemotron-Science-v1 \
--repo-type dataset \
--local-dir datasets/Nemotron-Science-v1/
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
--input_dir datasets/Nemotron-Science-v1/data/ \
--json_keys messages \
--tokenizer ${TOKENIZER} \
--output_dir ${OUTPUT_DIR} \
--workers 96 \
--max_sequence_length 256_000 \
--reasoning_content inlinenvidia/Nemotron-SFT-Instruction-Following-Chat-v2 — stored as raw JSONL on HuggingFace, download before tokenizing:
hf download nvidia/Nemotron-SFT-Instruction-Following-Chat-v2 \
--repo-type dataset \
--local-dir datasets/Nemotron-SFT-Instruction-Following-Chat-v2/
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
--input_dir datasets/Nemotron-SFT-Instruction-Following-Chat-v2/data/ \
--json_keys messages \
--tokenizer ${TOKENIZER} \
--output_dir ${OUTPUT_DIR} \
--workers 96 \
--max_sequence_length 256_000 \
--reasoning_content inlinenvidia/Nemotron-Agentic-v1 — tool_calling.jsonl (316K samples). Stored as raw JSONL on HuggingFace, download before tokenizing (more reliable than streaming for this dataset due to complex nested tool_calls / tools fields):
hf download nvidia/Nemotron-Agentic-v1 \
--repo-type dataset \
--local-dir datasets/Nemotron-Agentic-v1/
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
--jsonl_paths datasets/Nemotron-Agentic-v1/data/tool_calling.jsonl \
--json_keys messages \
--tokenizer ${TOKENIZER} \
--output_dir ${OUTPUT_DIR} \
--workers 96 \
--max_sequence_length 256_000 \
--reasoning_content inline
# Rename to avoid collision with potential future Nemotron-SFT-Agentic-v2 / tool_calling
mv ${OUTPUT_DIR}/tool_calling_messages.bin ${OUTPUT_DIR}/nvidia--Nemotron-Agentic-v1_tool_calling_messages.bin
mv ${OUTPUT_DIR}/tool_calling_messages.idx ${OUTPUT_DIR}/nvidia--Nemotron-Agentic-v1_tool_calling_messages.idxAfter running all commands above, ${OUTPUT_DIR}/ should contain the following .bin / .idx file pairs:
nvidia--Nemotron-Pretraining-SFT-v1_Nemotron-SFT-Code_train_text_max10000000.{bin,idx}
nvidia--Nemotron-Pretraining-SFT-v1_Nemotron-SFT-General_train_text_max10000000.{bin,idx}
nvidia--Nemotron-Pretraining-SFT-v1_Nemotron-SFT-MATH_train_text_max10000000.{bin,idx}
nvidia--Nemotron-Post-Training-Dataset-v1_default_stem_messages_max5000000.{bin,idx}
nvidia--Nemotron-Math-v2_default_high_part00_messages.{bin,idx}
nvidia--Nemotron-Math-v2_default_high_part01_messages.{bin,idx}
nvidia--Nemotron-SFT-Math-v3_default_train_messages.{bin,idx}
competitive_programming_python_00_messages.{bin,idx}
competitive_programming_cpp_00_messages.{bin,idx}
MCQ_messages.{bin,idx}
RQA_messages.{bin,idx}
reasoning_off_messages.{bin,idx}
reasoning_on_messages.{bin,idx}
nvidia--Nemotron-Agentic-v1_tool_calling_messages.{bin,idx}