
Commit 8815853

kevalmorabia97 and claude authored and committed
Improve Megatron Tokenization: streaming, reasoning_content support, HF in-memory tokenization, etc (#1221)
### What does this PR do?

**Type of change:** New feature

Improvements to `megatron_preprocess_data` for Nemotron v3 post-training datasets and Megatron-Bridge distillation workflows:

- **`--reasoning_content`** flag (`strip` / `inline` / `native`) to handle the `reasoning_content` field in Nemotron Post-Training v3 assistant messages
- **No intermediate JSONL** for HuggingFace datasets — load directly from the Arrow cache via `_iter_hf_as_json` + `process_hf_split`
- **Return output prefixes** (`list[str]`) from the Python API so callers can build `--data_paths` without hardcoding paths; also printed at the end of a run
- **Gzip input support** — `.jsonl.gz` files accepted directly; `--input_dir` globs both `*.jsonl` and `*.jsonl.gz`
- **`--strip_newlines`** flag (opt-in) to replace newlines with spaces in plain-text values; the default preserves newlines (no breaking change for code/structured-text datasets)
- **`--hf_streaming`** flag for very large datasets — only consumed rows are downloaded; automatically falls back to non-streaming (with a warning) if `--hf_max_samples_per_split` is not set, since streaming without a cap is slower than cached non-streaming
- **Auto-shuffle** when `--hf_max_samples_per_split` is set — reservoir sampling (buffer=10,000, seed=42) applied before capping to avoid biased prefix sampling
- Remove the `_document` suffix from output filenames (`_text.bin` instead of `_text_document.bin`)
- Fix duplicate BOS token for chat-template data (`add_special_tokens=False`)
- Fix `TypeError` in `process_hf_split` (`sum(list)`, not `sum(int)`)
- Suppress duplicate prints across pool workers via `_is_main_or_first_worker()`
- Raise `KeyError` instead of warning for missing JSON keys
- Default `hf_split=None` (all splits) instead of `"train"`

### Usage

```python
from modelopt.torch.utils.plugins.megatron_preprocess_data import megatron_preprocess_data

# Nemotron v3 with reasoning content preserved inline as <think>...</think>
prefixes = megatron_preprocess_data(
    hf_dataset="nvidia/Nemotron-Post-Training-Dataset-v3",
    json_keys=["messages"],
    tokenizer_name_or_path="Qwen/Qwen3-0.6B",
    output_dir="tokenized/",
    workers=32,
    reasoning_content="inline",
)
# prefixes == ["tokenized/nvidia--Nemotron-Post-Training-Dataset-v3_..._messages"]
data_paths = [x for p in prefixes for x in ("1.0", p)]

# Large pretraining dataset — stream + cap (auto-shuffled before capping)
prefixes = megatron_preprocess_data(
    hf_dataset="nvidia/Nemotron-CC-v2.1",
    hf_name="High-Quality",
    hf_max_samples_per_split=5_000_000,
    hf_streaming=True,
    json_keys=["text"],
    tokenizer_name_or_path="Qwen/Qwen3-0.6B",
    output_dir="tokenized/",
    workers=32,
    append_eod=True,
    strip_newlines=True,
)
```

### Testing

- New unit tests
- Tested tokenization on Nemotron Pretraining and Post-Training v3 datasets

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`).
Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅ (output filename changed: `_text_document` → `_text`; existing callers need to re-tokenize or rename files)
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`?: N/A
- Did you write any new necessary tests?: ✅
- Did you update the [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

- **New Features**
  - Preprocessing tool: configurable reasoning-content modes (strip|inline|native), optional newline stripping, gzip (`.jsonl.gz`) input support, HF streaming mode, auto-shuffle when per-split max samples is set, returns output-file prefixes, and processes all HF splits by default without writing intermediate JSONL.
- **Documentation**
  - Consolidated and reformatted dataset preparation and tokenization guidance; updated install/auth instructions and examples.
- **Tests**
  - Tests updated to validate returned prefixes, reasoning-content behaviors, gzip input handling, HF streaming warnings, and HF output assertions.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
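Reviewer's note: the auto-shuffle bullet above names reservoir sampling with buffer 10,000 and seed 42. For context, the general technique (Algorithm R) can be sketched as follows — `reservoir_sample` is a hypothetical name for illustration, not this PR's actual function:

```python
import random
from collections.abc import Iterable
from typing import TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, seed: int = 42) -> list[T]:
    """Uniformly sample up to k rows from a stream of unknown length (Algorithm R)."""
    rng = random.Random(seed)
    reservoir: list[T] = []
    for i, row in enumerate(rows):
        if i < k:
            reservoir.append(row)  # fill the buffer with the first k rows
        else:
            # Row i replaces a buffered row with probability k / (i + 1),
            # which keeps every row equally likely to end up in the sample.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = row
    return reservoir
```

The point of sampling before capping is that every row of the stream has equal probability of surviving, unlike simply taking the first N rows of a sorted dataset.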
1 parent cf70e89 commit 8815853

6 files changed

Lines changed: 585 additions & 181 deletions


CHANGELOG.rst

Lines changed: 1 addition & 0 deletions
````diff
@@ -28,6 +28,7 @@ Changelog
 - [Security] Changed the default of ``weights_only`` to ``True`` in ``torch.load`` for secure checkpoint loading. If you need to load a checkpoint that requires unpickling arbitrary objects, first register the class in ``torch.serialization.add_safe_globals([cls])`` before loading. Added :meth:`safe_save <modelopt.torch.utils.serialization.safe_save>` and :meth:`safe_load <modelopt.torch.utils.serialization.safe_load>` API to save and load checkpoints securely.
 - Bump minimum required PyTorch version to 2.8.
 - [Experimental] Add support for transformers>=5.0. Unified Hugging Face checkpoint export for quantized checkpoints may not work for MoE models with transformers>=5.0 yet.
+- Improve ``megatron_preprocess_data``: add ``--reasoning_content`` support for Nemotron v3 datasets, eliminate intermediate JSONL for HuggingFace datasets, return output file prefixes from the Python API, add gzip input support (``.jsonl.gz``), add ``--strip_newlines`` flag for plain-text pretraining data, add ``--hf_streaming`` for very large datasets (only consumed rows downloaded), and auto-shuffle when ``--hf_max_samples_per_split`` is set to avoid biased sampling.
 
 0.43 (2026-04-09)
 ^^^^^^^^^^^^^^^^^
````

examples/dataset/README.md

Lines changed: 118 additions & 25 deletions
````diff
@@ -1,15 +1,26 @@
-# Dataset Preparation Scripts
+# Dataset Preparation
+
+<div align="center">
+
+| **Section** | **Description** | **Link** |
+| :------------: | :------------: | :------------: |
+| Building Chat Datasets | Scripts to build conversation datasets from Nemotron and other HuggingFace sources | \[[Link](#building-chat-datasets)\] |
+| Tokenizing for Megatron Frameworks | Convert JSONL or HF datasets to Megatron binary format for distillation and pre-training | \[[Link](#tokenizing-for-megatron-frameworks)\] |
+
+</div>
+
+## Building Chat Datasets
 
 Utilities for building conversation datasets from NVIDIA Nemotron Post-Training
 collections and other HuggingFace sources. These scripts produce datasets in
 **standard OpenAI chat format** (`{"messages": [{"role": ..., "content": ...}]}`)
 and can be used for any downstream fine-tuning task — SFT, distillation,
 speculative decoding draft-model training, etc.
 
-## Files
+### Files
 
 | File | Description |
-|---|---|
+| --- | --- |
 | `make_nemotron_ptv3_dataset.py` | Build a dataset from the [Nemotron PT v3 collection](https://huggingface.co/collections/nvidia/nemotron-post-training-v3) using a configurable YAML mix |
 | `make_nemotron_ptv2_dataset.py` | Build a dataset from [Nemotron-Post-Training-Dataset-v2](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2) |
 | `make_dataset.py` | General-purpose mixer for arbitrary HuggingFace datasets (mtbench, sharegpt, ultrachat, magpie, etc.) |
@@ -19,16 +30,16 @@ speculative decoding draft-model training, etc.
 | `nemotron_ptv3_datasets.yaml` | Dataset mix config for `make_nemotron_ptv3_dataset.py` |
 | `example_data_config.yaml` | Example YAML config for `make_dataset.py` |
 
-## Quick Start
+### Quick Start
 
-### Install dependencies
+#### Install dependencies
 
 ```bash
-pip install datasets huggingface_hub pyyaml
-huggingface-cli login  # required for gated datasets
-```text
+pip install nvidia-modelopt[hf]
+hf auth login --token <your token>  # required for gated datasets
+```
 
-### Build a Nemotron PT v3 dataset
+#### Build a Nemotron PT v3 dataset
 
 ```bash
 # Synthetic data generation inputs (strips last assistant turn so a model can regenerate it)
@@ -39,45 +50,45 @@ python make_nemotron_ptv3_dataset.py --mode train --output-dir /tmp/ptv3_train
 
 # Use a custom dataset mix
 python make_nemotron_ptv3_dataset.py --config my_mix.yaml --output-dir /tmp/ptv3_custom
-```text
+```
 
-### Build a Nemotron PT v2 dataset
+#### Build a Nemotron PT v2 dataset
 
 ```bash
 python make_nemotron_ptv2_dataset.py --output-dir /tmp/ptv2_gen
 python make_nemotron_ptv2_dataset.py --mode train --output-dir /tmp/ptv2_train
-```text
+```
 
-### Build a general-purpose mixed dataset
+#### Build a general-purpose mixed dataset
 
 ```bash
 python make_dataset.py --config example_data_config.yaml --output-dir /tmp/mixed
-```text
+```
 
-## Dataset Modes
+### Dataset Modes
 
 Both `make_nemotron_pt*.py` scripts support two modes:
 
 | Mode | Description | Use case |
-|---|---|---|
+| --- | --- | --- |
 | `generate` (default) | Strips assistant turns, optionally augments prompts | Input data for synthetic generation (query a target model to produce training responses) |
 | `train` | Keeps all turns, normalizes to clean OpenAI format | Direct SFT / distillation training |
 
-## Synthetic Generation Pipeline
+### Synthetic Generation Pipeline
 
 The `generate` mode produces conversation skeletons that are fed to a target model
 via `tools/launcher/common/query.py` (vLLM or TRT-LLM). The output becomes training
 data for a draft model (e.g. EAGLE3 speculative decoding) or a distilled student:
 
-```text
+```bash
 make_nemotron_ptv3_dataset.py --mode generate → skeleton.jsonl
 
 query.py (target model generates responses turn-by-turn)
 
 training data for draft model / student
-```text
+```
 
-## Augmentations
+### Augmentations
 
 `augmentations.yaml` defines language-redirect and style-hint variants that are
 applied cyclically across the dataset. Each enabled entry produces one augmented
@@ -95,9 +106,9 @@ augmentations:
   - type: system_prompt
     content: "You are a helpful assistant."
    enabled: false  # disable without deleting
-```text
+```
 
-## Dataset Mix Config (`nemotron_ptv3_datasets.yaml`)
+### Dataset Mix Config (`nemotron_ptv3_datasets.yaml`)
 
 Edit this file to add, remove, or re-weight datasets without touching the script:
 
@@ -111,9 +122,9 @@ datasets:
   - repo_id: nvidia/OpenMathReasoning-mini
     splits: [train]
    augment: false  # multilingual — skip language-redirect augmentation
-```text
+```
 
-## Output Format
+### Output Format
 
 Every output row is a JSONL object with a single `messages` key:
 
@@ -123,6 +134,88 @@ Every output row is a JSONL object with a single `messages` key:
   {"role": "user", "content": "What is 2+2?"},
   {"role": "assistant", "content": "4"}
 ]}
-```text
+```
 
 In `generate` mode, assistant turns are stripped so the row ends with a user turn.
+
+## Tokenizing for Megatron Frameworks
+
+The distillation and pre-training scripts in Megatron-Bridge or Megatron-LM expect data pre-tokenized in Megatron's binary indexed format (`.bin` / `.idx`).
+Use the `megatron_preprocess_data` utility to tokenize any JSONL or Hugging Face dataset.
+The tokenization commands below print the list of output prefixes (e.g. `tokenized_qwen3/data1_text`) that you can use for the `data_paths` argument (with relative weights on different files) in Megatron training scripts.
+
+**Important Notes:**
+
+- For pretraining / raw-text data (`text` key) — use `--append_eod` so Megatron can tell where documents end when concatenating them into long sequences.
+- For post-training chat data (`messages` key) — omit `--append_eod`; the chat template already appends EOS at the end of each conversation.
+- Set `--max_sequence_length 256_000` to avoid rare OOM errors if some text is very long.
+
+### From JSONL files
+
+```bash
+python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
+    --jsonl_paths /path/to/data1.jsonl /path/to/data2.jsonl ... \
+    --json_keys text \
+    --tokenizer Qwen/Qwen3-0.6B \
+    --output_dir tokenized_qwen3 \
+    --workers 32 \
+    --append_eod
+```
+
+```bash
+python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
+    --jsonl_paths /path/to/sft_data.jsonl \
+    --json_keys messages \
+    --tokenizer Qwen/Qwen3-0.6B \
+    --output_dir tokenized_qwen3 \
+    --workers 32
+```
+
+Instead of `--jsonl_paths`, pass `--input_dir /path/to/dir` to tokenize all JSONL files in a directory (`.jsonl` and `.jsonl.gz` are both supported).
+
+### From Hugging Face Hub
+
+To tokenize a dataset directly from Hugging Face Hub:
+
+```bash
+python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
+    --hf_dataset nvidia/Nemotron-Pretraining-SFT-v1 \
+    --hf_name Nemotron-SFT-Code \
+    --hf_split train \
+    --hf_max_samples_per_split 10_000_000 \
+    --json_keys text \
+    --tokenizer Qwen/Qwen3-0.6B \
+    --output_dir tokenized_qwen3 \
+    --workers 32 \
+    --append_eod
+```
+
+Omit `--hf_name` to process all subsets, `--hf_split` for all splits, or `--hf_max_samples_per_split` for all samples.
+To quickly test, use [nvidia/Nemotron-Pretraining-Dataset-sample](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-Dataset-sample).
+
+For **very large datasets** (tens of millions of documents), add `--hf_streaming --hf_max_samples_per_split <num_samples>` to avoid downloading the full dataset — only the rows actually consumed are fetched.
+
+> **Performance note:** Non-streaming mode downloads all Parquet shards once and caches them as Arrow files on disk.
+> Re-runs read from cache and are much faster.
+> Streaming re-downloads on every run with no cache, so it is slower for full-dataset processing.
+
+### Nemotron Post-Training v3 (`reasoning_content`)
+
+v3 datasets include a `reasoning_content` field in assistant messages (chain-of-thought separate from
+the final answer). Use `--reasoning_content` to control how it is handled:
+
+| Value | Behaviour |
+| --- | --- |
+| `strip` (default) | Field is discarded before `apply_chat_template`. Safe for any tokenizer. |
+| `inline` | Wrapped as `<think>…</think>` and prepended to `content`. Preserves reasoning in a tokenizer-agnostic way. |
+| `native` | Passed unchanged. Requires the tokenizer's chat template to handle the field (e.g. Qwen3). |
+
+```bash
+python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
+    --hf_dataset nvidia/Nemotron-Post-Training-Dataset-v3 \
+    --json_keys messages \
+    --tokenizer Qwen/Qwen3-0.6B \
+    --output_dir tokenized_qwen3 \
+    --workers 32 \
+    --reasoning_content inline
+```
````
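Reviewer's note on the `inline` mode added above: it amounts to a small per-message rewrite applied before `apply_chat_template`. A minimal sketch under that assumption — `inline_reasoning` is a hypothetical helper name, not the module's actual code:

```python
def inline_reasoning(messages: list[dict]) -> list[dict]:
    """Fold assistant `reasoning_content` into `content` as <think>...</think>."""
    out = []
    for msg in messages:
        msg = dict(msg)  # shallow copy so the caller's rows stay untouched
        reasoning = msg.pop("reasoning_content", None)
        if msg.get("role") == "assistant" and reasoning:
            # Prepend the chain-of-thought in a tokenizer-agnostic wrapper
            msg["content"] = f"<think>{reasoning}</think>{msg['content']}"
        out.append(msg)
    return out
```

After this rewrite the messages carry no `reasoning_content` field, so any chat template that only knows `role`/`content` can tokenize them.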

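The `--input_dir` behavior described in the README above (globbing both `.jsonl` and `.jsonl.gz`, reading gzip transparently) can be approximated with stdlib tools. Helper names here are hypothetical sketches, not the plugin's internals:

```python
import gzip
import json
from pathlib import Path


def find_jsonl_files(input_dir: str) -> list[Path]:
    """Glob both plain and gzip-compressed JSONL files, as --input_dir does."""
    d = Path(input_dir)
    return sorted([*d.glob("*.jsonl"), *d.glob("*.jsonl.gz")])


def iter_jsonl(path: Path):
    """Yield one parsed JSON object per non-empty line, handling .gz transparently."""
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)
```

Opening in text mode (`"rt"`) lets the same parsing loop serve both compressed and uncompressed inputs.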
examples/megatron_bridge/README.md

Lines changed: 2 additions & 38 deletions
````diff
@@ -98,44 +98,8 @@ The [distill.py](distill.py) script loads student and teacher models from Huggin
 
 The distillation script expects pre-tokenized data in Megatron's binary format (`.bin` / `.idx` files).
 
-You can tokenize your JSONL datasets using the following command:
-
-```bash
-python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
-    --jsonl_paths /path/to/data1.jsonl /path/to/data2.jsonl ... \
-    --json_keys text \
-    --tokenizer Qwen/Qwen3-0.6B \
-    --output_dir tokenized_qwen3 \
-    --workers 32 \
-    --max_sequence_length 256_000
-```
-
-This will create `tokenized_qwen3/data1_text_document.{bin,idx}` and `tokenized_qwen3/data2_text_document.{bin,idx}` files. We can use these files in the distillation script by passing `--data_paths 1.0 tokenized_qwen3/data1_text_document 1.0 tokenized_qwen3/data2_text_document` (equal weight for both datasets).
-
-Instead of `--jsonl_paths`, you can also pass a directory path to the `--input_dir` argument to tokenize all JSONL files in the directory.
-We are setting a maximum sequence length of 256k to avoid rare OOM errors in tokenization if text is too long.
-
-If you want to download and tokenize a dataset from Hugging Face Hub directly, you can use the following command:
-
-```bash
-python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
-    --hf_dataset nvidia/Nemotron-Pretraining-SFT-v1 \
-    --hf_name Nemotron-SFT-General \
-    --hf_split train \
-    --hf_max_samples_per_split 10_000_000 \
-    --json_keys text \
-    --tokenizer Qwen/Qwen3-0.6B \
-    --output_dir tokenized_qwen3 \
-    --workers 32 \
-    --max_sequence_length 256_000
-```
-
-The [Nemotron-Pretraining-SFT-v1](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-SFT-v1) dataset is huge, so it will take a few hours to download and tokenize. You can also split the large `.jsonl` into multiple files (e.g. 10M samples per file using `split -l 10000000 -d --additional-suffix=.jsonl <file>.jsonl <file>_part`) and tokenize them in parallel via the `--jsonl_paths` argument.
-To quickly test the script, you can try the [nvidia/Nemotron-Pretraining-Dataset-sample](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-Dataset-sample) dataset.
-
-If you skip `--hf_name`, it will download and tokenize all subsets for the dataset.
-If you skip `--hf_split`, it will download and tokenize all splits for the subset.
-If you skip `--hf_max_samples_per_split`, it will download and tokenize all samples for the split.
+See the **[Dataset Preparation README](../dataset/README.md#tokenizing-for-megatron-frameworks)**
+for full instructions on tokenizing JSONL files and Hugging Face datasets and getting the list of output prefixes to use for the `--data_paths` argument.
 
 ### Distillation with Real Data
````

modelopt/torch/utils/dataset_utils.py

Lines changed: 5 additions & 9 deletions
````diff
@@ -690,7 +690,7 @@ def download_hf_dataset_as_jsonl(
     output_dir: str | Path,
     json_keys: str | list[str] = ["text"],
     name: str | None = None,
-    split: str | None = "train",
+    split: str | None = None,
     max_samples_per_split: int | None = None,
     num_proc: int | None = None,
 ) -> list[str]:
@@ -701,7 +701,7 @@
         output_dir: Directory to save the JSONL files
         json_keys: Key or list of keys to extract from the dataset. Defaults to ["text"].
         name: Name of the subset to download
-        split: Split of the dataset to download. Defaults to "train".
+        split: Split of the dataset to download. Defaults to None (all splits).
         max_samples_per_split: Maximum number of samples to download per split. Defaults to None.
         num_proc: Number of processes to use for parallel processing. Defaults to None.
 
@@ -744,7 +744,6 @@
         print(f"\t{entry}")
 
     for entry in splits_to_process:
-        skip_processing = False
         path = entry["dataset"]
         name = entry.get("config", None)
         split = entry["split"]
@@ -761,12 +760,9 @@
 
         for key in json_keys:
             if key not in ds.features:
-                warn(f"[SKIP] {key=} not found in {ds.features=}")
-                skip_processing = True
-                break
-
-        if skip_processing:
-            continue
+                raise KeyError(
+                    f"{key=} not found in dataset features. Available: {list(ds.features)}"
+                )
 
         print(f"Saving raw dataset to {jsonl_file_path}")
         ds.to_json(jsonl_file_path, num_proc=num_proc)
````
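The diff above replaces warn-and-skip with a fail-fast `KeyError`. A simplified, standalone stand-in illustrating the new behavior (not the actual `download_hf_dataset_as_jsonl` internals):

```python
def validate_json_keys(features: dict, json_keys: list[str]) -> None:
    """Raise immediately when a requested key is absent, instead of silently skipping."""
    for key in json_keys:
        if key not in features:
            raise KeyError(f"{key=} not found in dataset features. Available: {list(features)}")
```

Failing fast surfaces typos in `--json_keys` on the first split processed, rather than producing a partial run that silently skipped data.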
