
Commit 8815853

kevalmorabia97 and claude authored and committed
Improve Megatron Tokenization: streaming, reasoning_content support, HF in-memory tokenization, etc (#1221)
### What does this PR do?

**Type of change:** New feature

Improvements to `megatron_preprocess_data` for Nemotron v3 post-training datasets and Megatron-Bridge distillation workflows:

- **`--reasoning_content`** flag (`strip` / `inline` / `native`) to handle the `reasoning_content` field in Nemotron Post-Training v3 assistant messages
- **No intermediate JSONL** for HuggingFace datasets — load directly from the Arrow cache via `_iter_hf_as_json` + `process_hf_split`
- **Return output prefixes** (`list[str]`) from the Python API so callers can build `--data_paths` without hardcoding paths; also printed at the end of a run
- **Gzip input support** — `.jsonl.gz` files accepted directly; `--input_dir` globs both `*.jsonl` and `*.jsonl.gz`
- **`--strip_newlines`** flag (opt-in) to replace newlines with spaces in plain-text values; the default preserves newlines (no breaking change for code/structured-text datasets)
- **`--hf_streaming`** flag for very large datasets — only consumed rows are downloaded; automatically falls back to non-streaming (with a warning) if `--hf_max_samples_per_split` is not set, since streaming without a cap is slower than cached non-streaming
- **Auto-shuffle** when `--hf_max_samples_per_split` is set — reservoir sampling (buffer=10,000, seed=42) applied before capping to avoid biased prefix sampling
- Remove the `_document` suffix from output filenames (`_text.bin` instead of `_text_document.bin`)
- Fix duplicate BOS token for chat-template data (`add_special_tokens=False`)
- Fix `TypeError` in `process_hf_split` (`sum(list)`, not `sum(int)`)
- Suppress duplicate prints across pool workers via `_is_main_or_first_worker()`
- Raise `KeyError` instead of warning for missing JSON keys
- Default `hf_split=None` (all splits) instead of `"train"`

### Usage

```python
from modelopt.torch.utils.plugins.megatron_preprocess_data import megatron_preprocess_data

# Nemotron v3 with reasoning content preserved inline as <think>...</think>
prefixes = megatron_preprocess_data(
    hf_dataset="nvidia/Nemotron-Post-Training-Dataset-v3",
    json_keys=["messages"],
    tokenizer_name_or_path="Qwen/Qwen3-0.6B",
    output_dir="tokenized/",
    workers=32,
    reasoning_content="inline",
)
# prefixes == ["tokenized/nvidia--Nemotron-Post-Training-Dataset-v3_..._messages"]
data_paths = [x for p in prefixes for x in ("1.0", p)]

# Large pretraining dataset — stream + cap (auto-shuffled before capping)
prefixes = megatron_preprocess_data(
    hf_dataset="nvidia/Nemotron-CC-v2.1",
    hf_name="High-Quality",
    hf_max_samples_per_split=5_000_000,
    hf_streaming=True,
    json_keys=["text"],
    tokenizer_name_or_path="Qwen/Qwen3-0.6B",
    output_dir="tokenized/",
    workers=32,
    append_eod=True,
    strip_newlines=True,
)
```

### Testing

- New unit tests
- Tested tokenization on Nemotron Pretraining and Post-Training v3 datasets

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`).
Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅ (output filename changed: `_text_document` → `_text`; existing callers need to re-tokenize or rename files)
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`?: N/A
- Did you write any new necessary tests?: ✅
- Did you update the [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

- **New Features**
  - Preprocessing tool: configurable reasoning-content modes (strip|inline|native), optional newline stripping, gzip (`.jsonl.gz`) input support, HF streaming mode, auto-shuffle when per-split max samples is set, returns output-file prefixes, and processes all HF splits by default without writing intermediate JSONL.
- **Documentation**
  - Consolidated and reformatted dataset preparation and tokenization guidance; updated install/auth instructions and examples.
- **Tests**
  - Tests updated to validate returned prefixes, reasoning-content behaviors, gzip input handling, HF streaming warnings, and HF output assertions.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
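Reviewer's note: the auto-shuffle bullet above names reservoir sampling with buffer 10,000 and seed 42. For context, the general technique (Algorithm R) can be sketched as follows — `reservoir_sample` is a hypothetical name for illustration, not this PR's actual function:

```python
import random
from collections.abc import Iterable
from typing import TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, seed: int = 42) -> list[T]:
    """Uniformly sample up to k rows from a stream of unknown length (Algorithm R)."""
    rng = random.Random(seed)
    reservoir: list[T] = []
    for i, row in enumerate(rows):
        if i < k:
            reservoir.append(row)  # fill the buffer with the first k rows
        else:
            # Row i replaces a buffered row with probability k / (i + 1),
            # which keeps every row equally likely to end up in the sample.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = row
    return reservoir
```

The point of sampling before capping is that every row of the stream has equal probability of surviving, unlike simply taking the first N rows of a sorted dataset.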
1 parent cf70e89 commit 8815853

6 files changed

Lines changed: 585 additions & 181 deletions


CHANGELOG.rst

Lines changed: 1 addition & 0 deletions
````diff
@@ -28,6 +28,7 @@ Changelog
 - [Security] Changed the default of ``weights_only`` to ``True`` in ``torch.load`` for secure checkpoint loading. If you need to load a checkpoint that requires unpickling arbitrary objects, first register the class in ``torch.serialization.add_safe_globals([cls])`` before loading. Added :meth:`safe_save <modelopt.torch.utils.serialization.safe_save>` and :meth:`safe_load <modelopt.torch.utils.serialization.safe_load>` API to save and load checkpoints securely.
 - Bump minimum required PyTorch version to 2.8.
 - [Experimental] Add support for transformers>=5.0. Unified Hugging Face checkpoint export for quantized checkpoints may not work for MoE models with transformers>=5.0 yet.
+- Improve ``megatron_preprocess_data``: add ``--reasoning_content`` support for Nemotron v3 datasets, eliminate intermediate JSONL for HuggingFace datasets, return output file prefixes from the Python API, add gzip input support (``.jsonl.gz``), add ``--strip_newlines`` flag for plain-text pretraining data, add ``--hf_streaming`` for very large datasets (only consumed rows downloaded), and auto-shuffle when ``--hf_max_samples_per_split`` is set to avoid biased sampling.
 
 0.43 (2026-04-09)
 ^^^^^^^^^^^^^^^^^
````

examples/dataset/README.md

Lines changed: 118 additions & 25 deletions
````diff
@@ -1,15 +1,26 @@
-# Dataset Preparation Scripts
+# Dataset Preparation
+
+<div align="center">
+
+| **Section** | **Description** | **Link** |
+| :------------: | :------------: | :------------: |
+| Building Chat Datasets | Scripts to build conversation datasets from Nemotron and other HuggingFace sources | \[[Link](#building-chat-datasets)\] |
+| Tokenizing for Megatron Frameworks | Convert JSONL or HF datasets to Megatron binary format for distillation and pre-training | \[[Link](#tokenizing-for-megatron-frameworks)\] |
+
+</div>
+
+## Building Chat Datasets
 
 Utilities for building conversation datasets from NVIDIA Nemotron Post-Training
 collections and other HuggingFace sources. These scripts produce datasets in
 **standard OpenAI chat format** (`{"messages": [{"role": ..., "content": ...}]}`)
 and can be used for any downstream fine-tuning task — SFT, distillation,
 speculative decoding draft-model training, etc.
 
-## Files
+### Files
 
 | File | Description |
-|---|---|
+| --- | --- |
 | `make_nemotron_ptv3_dataset.py` | Build a dataset from the [Nemotron PT v3 collection](https://huggingface.co/collections/nvidia/nemotron-post-training-v3) using a configurable YAML mix |
 | `make_nemotron_ptv2_dataset.py` | Build a dataset from [Nemotron-Post-Training-Dataset-v2](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2) |
 | `make_dataset.py` | General-purpose mixer for arbitrary HuggingFace datasets (mtbench, sharegpt, ultrachat, magpie, etc.) |
@@ -19,16 +30,16 @@ speculative decoding draft-model training, etc.
 | `nemotron_ptv3_datasets.yaml` | Dataset mix config for `make_nemotron_ptv3_dataset.py` |
 | `example_data_config.yaml` | Example YAML config for `make_dataset.py` |
 
-## Quick Start
+### Quick Start
 
-### Install dependencies
+#### Install dependencies
 
 ```bash
-pip install datasets huggingface_hub pyyaml
-huggingface-cli login  # required for gated datasets
-```text
+pip install nvidia-modelopt[hf]
+hf auth login --token <your token>  # required for gated datasets
+```
 
-### Build a Nemotron PT v3 dataset
+#### Build a Nemotron PT v3 dataset
 
 ```bash
 # Synthetic data generation inputs (strips last assistant turn so a model can regenerate it)
@@ -39,45 +50,45 @@ python make_nemotron_ptv3_dataset.py --mode train --output-dir /tmp/ptv3_train
 
 # Use a custom dataset mix
 python make_nemotron_ptv3_dataset.py --config my_mix.yaml --output-dir /tmp/ptv3_custom
-```text
+```
 
-### Build a Nemotron PT v2 dataset
+#### Build a Nemotron PT v2 dataset
 
 ```bash
 python make_nemotron_ptv2_dataset.py --output-dir /tmp/ptv2_gen
 python make_nemotron_ptv2_dataset.py --mode train --output-dir /tmp/ptv2_train
-```text
+```
 
-### Build a general-purpose mixed dataset
+#### Build a general-purpose mixed dataset
 
 ```bash
 python make_dataset.py --config example_data_config.yaml --output-dir /tmp/mixed
-```text
+```
 
-## Dataset Modes
+### Dataset Modes
 
 Both `make_nemotron_pt*.py` scripts support two modes:
 
 | Mode | Description | Use case |
-|---|---|---|
+| --- | --- | --- |
 | `generate` (default) | Strips assistant turns, optionally augments prompts | Input data for synthetic generation (query a target model to produce training responses) |
 | `train` | Keeps all turns, normalizes to clean OpenAI format | Direct SFT / distillation training |
 
-## Synthetic Generation Pipeline
+### Synthetic Generation Pipeline
 
 The `generate` mode produces conversation skeletons that are fed to a target model
 via `tools/launcher/common/query.py` (vLLM or TRT-LLM). The output becomes training
 data for a draft model (e.g. EAGLE3 speculative decoding) or a distilled student:
 
-```text
+```bash
 make_nemotron_ptv3_dataset.py --mode generate → skeleton.jsonl
 
 query.py (target model generates responses turn-by-turn)
 
 training data for draft model / student
-```text
+```
 
-## Augmentations
+### Augmentations
 
 `augmentations.yaml` defines language-redirect and style-hint variants that are
 applied cyclically across the dataset. Each enabled entry produces one augmented
@@ -95,9 +106,9 @@ augmentations:
   - type: system_prompt
     content: "You are a helpful assistant."
    enabled: false  # disable without deleting
-```text
+```
 
-## Dataset Mix Config (`nemotron_ptv3_datasets.yaml`)
+### Dataset Mix Config (`nemotron_ptv3_datasets.yaml`)
 
 Edit this file to add, remove, or re-weight datasets without touching the script:
 
@@ -111,9 +122,9 @@ datasets:
   - repo_id: nvidia/OpenMathReasoning-mini
     splits: [train]
    augment: false  # multilingual — skip language-redirect augmentation
-```text
+```
 
-## Output Format
+### Output Format
 
 Every output row is a JSONL object with a single `messages` key:
 
@@ -123,6 +134,88 @@ Every output row is a JSONL object with a single `messages` key:
   {"role": "user", "content": "What is 2+2?"},
   {"role": "assistant", "content": "4"}
 ]}
-```text
+```
 
 In `generate` mode, assistant turns are stripped so the row ends with a user turn.
+
+## Tokenizing for Megatron Frameworks
+
+The distillation and pre-training scripts in Megatron-Bridge or Megatron-LM expect data pre-tokenized in Megatron's binary indexed format (`.bin` / `.idx`).
+Use the `megatron_preprocess_data` utility to tokenize any JSONL or Hugging Face dataset.
+The tokenization commands below print the list of output prefixes (e.g. `tokenized_qwen3/data1_text`) that you can use for the `data_paths` argument (with relative weights on different files) in Megatron training scripts.
+
+**Important Notes:**
+
+- For pretraining / raw-text data (`text` key) — use `--append_eod` so Megatron can tell where documents end when concatenating them into long sequences.
+- For post-training chat data (`messages` key) — omit `--append_eod`; the chat template already appends EOS at the end of each conversation.
+- Set `--max_sequence_length 256_000` to avoid rare OOM errors if some text is very long.
+
+### From JSONL files
+
+```bash
+python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
+    --jsonl_paths /path/to/data1.jsonl /path/to/data2.jsonl ... \
+    --json_keys text \
+    --tokenizer Qwen/Qwen3-0.6B \
+    --output_dir tokenized_qwen3 \
+    --workers 32 \
+    --append_eod
+```
+
+```bash
+python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
+    --jsonl_paths /path/to/sft_data.jsonl \
+    --json_keys messages \
+    --tokenizer Qwen/Qwen3-0.6B \
+    --output_dir tokenized_qwen3 \
+    --workers 32
+```
+
+Instead of `--jsonl_paths`, pass `--input_dir /path/to/dir` to tokenize all JSONL files in a directory (`.jsonl` and `.jsonl.gz` are both supported).
+
+### From Hugging Face Hub
+
+To tokenize a dataset directly from Hugging Face Hub:
+
+```bash
+python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
+    --hf_dataset nvidia/Nemotron-Pretraining-SFT-v1 \
+    --hf_name Nemotron-SFT-Code \
+    --hf_split train \
+    --hf_max_samples_per_split 10_000_000 \
+    --json_keys text \
+    --tokenizer Qwen/Qwen3-0.6B \
+    --output_dir tokenized_qwen3 \
+    --workers 32 \
+    --append_eod
+```
+
+Omit `--hf_name` to process all subsets, `--hf_split` for all splits, or `--hf_max_samples_per_split` for all samples.
+To quickly test, use [nvidia/Nemotron-Pretraining-Dataset-sample](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-Dataset-sample).
+
+For **very large datasets** (tens of millions of documents), add `--hf_streaming --hf_max_samples_per_split <num_samples>` to avoid downloading the full dataset — only the rows actually consumed are fetched.
+
+> **Performance note:** Non-streaming mode downloads all Parquet shards once and caches them as Arrow files on disk.
+> Re-runs read from cache and are much faster.
+> Streaming re-downloads on every run with no cache, so it is slower for full-dataset processing.
+
+### Nemotron Post-Training v3 (`reasoning_content`)
+
+v3 datasets include a `reasoning_content` field in assistant messages (chain-of-thought separate from
+the final answer). Use `--reasoning_content` to control how it is handled:
+
+| Value | Behaviour |
+| --- | --- |
+| `strip` (default) | Field is discarded before `apply_chat_template`. Safe for any tokenizer. |
+| `inline` | Wrapped as `<think>…</think>` and prepended to `content`. Preserves reasoning in a tokenizer-agnostic way. |
+| `native` | Passed unchanged. Requires the tokenizer's chat template to handle the field (e.g. Qwen3). |
+
+```bash
+python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
+    --hf_dataset nvidia/Nemotron-Post-Training-Dataset-v3 \
+    --json_keys messages \
+    --tokenizer Qwen/Qwen3-0.6B \
+    --output_dir tokenized_qwen3 \
+    --workers 32 \
+    --reasoning_content inline
+```
````
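Reviewer's note on the `inline` mode added above: it amounts to a small per-message rewrite applied before `apply_chat_template`. A minimal sketch under that assumption — `inline_reasoning` is a hypothetical helper name, not the module's actual code:

```python
def inline_reasoning(messages: list[dict]) -> list[dict]:
    """Fold assistant `reasoning_content` into `content` as <think>...</think>."""
    out = []
    for msg in messages:
        msg = dict(msg)  # shallow copy so the caller's rows stay untouched
        reasoning = msg.pop("reasoning_content", None)
        if msg.get("role") == "assistant" and reasoning:
            # Prepend the chain-of-thought in a tokenizer-agnostic wrapper
            msg["content"] = f"<think>{reasoning}</think>{msg['content']}"
        out.append(msg)
    return out
```

After this rewrite the messages carry no `reasoning_content` field, so any chat template that only knows `role`/`content` can tokenize them.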

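The `--input_dir` behavior described in the README above (globbing both `.jsonl` and `.jsonl.gz`, reading gzip transparently) can be approximated with stdlib tools. Helper names here are hypothetical sketches, not the plugin's internals:

```python
import gzip
import json
from pathlib import Path


def find_jsonl_files(input_dir: str) -> list[Path]:
    """Glob both plain and gzip-compressed JSONL files, as --input_dir does."""
    d = Path(input_dir)
    return sorted([*d.glob("*.jsonl"), *d.glob("*.jsonl.gz")])


def iter_jsonl(path: Path):
    """Yield one parsed JSON object per non-empty line, handling .gz transparently."""
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)
```

Opening in text mode (`"rt"`) lets the same parsing loop serve both compressed and uncompressed inputs.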
examples/megatron_bridge/README.md

Lines changed: 2 additions & 38 deletions
````diff
@@ -98,44 +98,8 @@ The [distill.py](distill.py) script loads student and teacher models from Huggin
 
 The distillation script expects pre-tokenized data in Megatron's binary format (`.bin` / `.idx` files).
 
-You can tokenize your JSONL datasets using the following command:
-
-```bash
-python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
-    --jsonl_paths /path/to/data1.jsonl /path/to/data2.jsonl ... \
-    --json_keys text \
-    --tokenizer Qwen/Qwen3-0.6B \
-    --output_dir tokenized_qwen3 \
-    --workers 32 \
-    --max_sequence_length 256_000
-```
-
-This will create `tokenized_qwen3/data1_text_document.{bin,idx}` and `tokenized_qwen3/data2_text_document.{bin,idx}` files. We can use these files in the distillation script by passing `--data_paths 1.0 tokenized_qwen3/data1_text_document 1.0 tokenized_qwen3/data2_text_document` (equal weight for both datasets).
-
-Instead of `--jsonl_paths`, you can also pass a directory path to the `--input_dir` argument to tokenize all JSONL files in the directory.
-We are setting a maximum sequence length of 256k to avoid rare OOM errors in tokenization if text is too long.
-
-If you want to download and tokenize a dataset from Hugging Face Hub directly, you can use the following command:
-
-```bash
-python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
-    --hf_dataset nvidia/Nemotron-Pretraining-SFT-v1 \
-    --hf_name Nemotron-SFT-General \
-    --hf_split train \
-    --hf_max_samples_per_split 10_000_000 \
-    --json_keys text \
-    --tokenizer Qwen/Qwen3-0.6B \
-    --output_dir tokenized_qwen3 \
-    --workers 32 \
-    --max_sequence_length 256_000
-```
-
-The [Nemotron-Pretraining-SFT-v1](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-SFT-v1) dataset is huge, so it will take a few hours to download and tokenize. You can also split the large `.jsonl` into multiple files (e.g. 10M samples per file using `split -l 10000000 -d --additional-suffix=.jsonl <file>.jsonl <file>_part`) and tokenize them in parallel via the `--jsonl_paths` argument.
-To quickly test the script, you can try the [nvidia/Nemotron-Pretraining-Dataset-sample](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-Dataset-sample) dataset.
-
-If you skip `--hf_name`, it will download and tokenize all subsets for the dataset.
-If you skip `--hf_split`, it will download and tokenize all splits for the subset.
-If you skip `--hf_max_samples_per_split`, it will download and tokenize all samples for the split.
+See the **[Dataset Preparation README](../dataset/README.md#tokenizing-for-megatron-frameworks)**
+for full instructions on tokenizing JSONL files and Hugging Face datasets and getting the list of output prefixes to use for the `--data_paths` argument.
 
 ### Distillation with Real Data
````

modelopt/torch/utils/dataset_utils.py

Lines changed: 5 additions & 9 deletions
````diff
@@ -690,7 +690,7 @@ def download_hf_dataset_as_jsonl(
     output_dir: str | Path,
     json_keys: str | list[str] = ["text"],
     name: str | None = None,
-    split: str | None = "train",
+    split: str | None = None,
     max_samples_per_split: int | None = None,
     num_proc: int | None = None,
 ) -> list[str]:
@@ -701,7 +701,7 @@
         output_dir: Directory to save the JSONL files
         json_keys: Key or list of keys to extract from the dataset. Defaults to ["text"].
         name: Name of the subset to download
-        split: Split of the dataset to download. Defaults to "train".
+        split: Split of the dataset to download. Defaults to None (all splits).
         max_samples_per_split: Maximum number of samples to download per split. Defaults to None.
         num_proc: Number of processes to use for parallel processing. Defaults to None.
 
@@ -744,7 +744,6 @@
         print(f"\t{entry}")
 
     for entry in splits_to_process:
-        skip_processing = False
         path = entry["dataset"]
         name = entry.get("config", None)
         split = entry["split"]
@@ -761,12 +760,9 @@
 
         for key in json_keys:
             if key not in ds.features:
-                warn(f"[SKIP] {key=} not found in {ds.features=}")
-                skip_processing = True
-                break
-
-        if skip_processing:
-            continue
+                raise KeyError(
+                    f"{key=} not found in dataset features. Available: {list(ds.features)}"
+                )
 
         print(f"Saving raw dataset to {jsonl_file_path}")
         ds.to_json(jsonl_file_path, num_proc=num_proc)
````
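The diff above replaces warn-and-skip with a fail-fast `KeyError`. A simplified, standalone stand-in illustrating the new behavior (not the actual `download_hf_dataset_as_jsonl` internals):

```python
def validate_json_keys(features: dict, json_keys: list[str]) -> None:
    """Raise immediately when a requested key is absent, instead of silently skipping."""
    for key in json_keys:
        if key not in features:
            raise KeyError(f"{key=} not found in dataset features. Available: {list(features)}")
```

Failing fast surfaces typos in `--json_keys` on the first split processed, rather than producing a partial run that silently skipped data.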
