### What does this PR do?
Type of change: New feature
Improvements to `megatron_preprocess_data` for Nemotron v3 post-training
datasets and Megatron-Bridge distillation workflows:
- **`--reasoning_content`** flag (`strip` / `inline` / `native`) to
handle the `reasoning_content` field in Nemotron Post-Training v3
assistant messages (see the sketch after this list)
- **No intermediate JSONL** for HuggingFace datasets — load directly
from Arrow cache via `_iter_hf_as_json` + `process_hf_split`
- **Return output prefixes** (`list[str]`) from the Python API so
callers can build `--data_paths` without hardcoding paths; also printed
at end of run
- **Gzip input support** — `.jsonl.gz` files accepted directly;
`--input_dir` globs both `*.jsonl` and `*.jsonl.gz`
- **`--strip_newlines`** flag (opt-in) to replace newlines with spaces
in plain-text values; default preserves newlines (no breaking change for
code/structured-text datasets)
- **`--hf_streaming`** flag for very large datasets — only consumed rows
are downloaded; automatically falls back to non-streaming (with a
warning) if `--hf_max_samples_per_split` is not set, since streaming
without a cap is slower than cached non-streaming
- **Auto-shuffle** when `--hf_max_samples_per_split` is set — reservoir
sampling (buffer=10,000, seed=42) applied before capping to avoid biased
prefix sampling (a sketch follows the usage examples below)
- Remove `_document` suffix from output filenames (`_text.bin` instead
of `_text_document.bin`)
- Fix duplicate BOS token for chat-template data
(`add_special_tokens=False`)
- Fix `TypeError` in `process_hf_split` (`sum(list)` not `sum(int)`)
- Suppress duplicate prints across pool workers via
`_is_main_or_first_worker()`
- Raise `KeyError` instead of warning for missing JSON keys
- Default `hf_split=None` (all splits) instead of `"train"`
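
For intuition, here is a minimal sketch of what the three `--reasoning_content` modes could do to an assistant message. `apply_reasoning_content` is a hypothetical helper, not the actual implementation, and the `strip`/`native` semantics are assumptions; only `inline` → `<think>...</think>` is confirmed by the usage example below.

```python
def apply_reasoning_content(message: dict, mode: str) -> dict:
    """Hypothetical helper illustrating the three modes (not the real code)."""
    reasoning = message.pop("reasoning_content", None)
    if reasoning is None or mode == "strip":
        # "strip" (assumed): drop the field entirely.
        return message
    if mode == "inline":
        # "inline": fold the reasoning into the content as <think>...</think>.
        message["content"] = f"<think>{reasoning}</think>{message['content']}"
    else:
        # "native" (assumed): keep the field for chat templates that consume it.
        message["reasoning_content"] = reasoning
    return message
```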
### Usage
```python
from modelopt.torch.utils.plugins.megatron_preprocess_data import megatron_preprocess_data
# Nemotron v3 with reasoning content preserved inline as <think>...</think>
prefixes = megatron_preprocess_data(
hf_dataset="nvidia/Nemotron-Post-Training-Dataset-v3",
json_keys=["messages"],
tokenizer_name_or_path="Qwen/Qwen3-0.6B",
output_dir="tokenized/",
workers=32,
reasoning_content="inline",
)
# prefixes == ["tokenized/nvidia--Nemotron-Post-Training-Dataset-v3_..._messages"]
data_paths = [x for p in prefixes for x in ("1.0", p)]
# Large pretraining dataset — stream + cap (auto-shuffled before capping)
prefixes = megatron_preprocess_data(
hf_dataset="nvidia/Nemotron-CC-v2.1",
hf_name="High-Quality",
hf_max_samples_per_split=5_000_000,
hf_streaming=True,
json_keys=["text"],
tokenizer_name_or_path="Qwen/Qwen3-0.6B",
output_dir="tokenized/",
workers=32,
append_eod=True,
strip_newlines=True,
)
```
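
The auto-shuffle noted in the feature list uses buffer-based reservoir sampling with the documented parameters (buffer=10,000, seed=42). A minimal sketch of the idea; `buffered_shuffle` is illustrative, not the actual implementation:

```python
import random

def buffered_shuffle(stream, buffer_size=10_000, seed=42):
    """Illustrative buffer-based reservoir shuffle (not the real code)."""
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)
            continue
        # Swap a random buffered item out for the incoming one.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    rng.shuffle(buffer)  # drain the remainder in random order
    yield from buffer
```

Capping after this shuffle draws samples from across the stream rather than only its prefix, which is the bias the auto-shuffle guards against.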
### Testing
- New unit tests
- Tested tokenization on Nemotron Pretraining and Post-training v3
datasets
### Before your PR is "*Ready for review*"
Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).
Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(...,
weights_only=False)`, `pickle`, etc.).
- Is this change backward compatible?: Mostly (output filename changed:
`_text_document` → `_text`; existing callers need to re-tokenize or
rename files)
- If you copied code from any other sources or added a new PIP
dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: ✅
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
✅
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **New Features**
* Preprocessing tool: configurable reasoning-content modes
(strip|inline|native), optional newline stripping, gzip (.jsonl.gz)
input support, HF streaming mode, auto-shuffle when per-split max
samples set, returns output-file prefixes, and processes all HF splits
by default without writing intermediate JSONL.
* **Documentation**
* Consolidated and reformatted dataset preparation and tokenization
guidance; updated install/auth instructions and examples.
* **Tests**
* Tests updated to validate returned prefixes, reasoning-content
behaviors, gzip input handling, HF streaming warnings, and HF output
assertions.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
---------
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
### CHANGELOG.rst (+1 line)
```diff
@@ -28,6 +28,7 @@ Changelog
 - [Security] Changed the default of ``weights_only`` to ``True`` in ``torch.load`` for secure checkpoint loading. If you need to load a checkpoint that requires unpickling arbitrary objects, first register the class in ``torch.serialization.add_safe_globals([cls])`` before loading. Added :meth:`safe_save <modelopt.torch.utils.serialization.safe_save>` and :meth:`safe_load <modelopt.torch.utils.serialization.safe_load>` API to save and load checkpoints securely.
 - Bump minimum required PyTorch version to 2.8.
 - [Experimental] Add support for transformers>=5.0. Unified Hugging Face checkpoint export for quantized checkpoints may not work for MoE models with transformers>=5.0 yet.
+- Improve ``megatron_preprocess_data``: add ``--reasoning_content`` support for Nemotron v3 datasets, eliminate intermediate JSONL for HuggingFace datasets, return output file prefixes from the Python API, add gzip input support (``.jsonl.gz``), add ``--strip_newlines`` flag for plain-text pretraining data, add ``--hf_streaming`` for very large datasets (only consumed rows downloaded), and auto-shuffle when ``--hf_max_samples_per_split`` is set to avoid biased sampling.
```
### Dataset Preparation README (`dataset/README.md`)

New table-of-contents entries:

| Building Chat Datasets | Scripts to build conversation datasets from Nemotron and other HuggingFace sources |\[[Link](#building-chat-datasets)\]|
| Tokenizing for Megatron Frameworks | Convert JSONL or HF datasets to Megatron binary format for distillation and pre-training |\[[Link](#tokenizing-for-megatron-frameworks)\]|

</div>
## Building Chat Datasets

Utilities for building conversation datasets from NVIDIA Nemotron Post-Training
collections and other HuggingFace sources. These scripts produce datasets in
the OpenAI chat format (JSONL rows with a single `messages` key)
and can be used for any downstream fine-tuning task — SFT, distillation,
speculative decoding draft-model training, etc.
### Files
| File | Description |
|---|---|
| `make_nemotron_ptv3_dataset.py` | Build a dataset from the [Nemotron PT v3 collection](https://huggingface.co/collections/nvidia/nemotron-post-training-v3) using a configurable YAML mix |
| `make_nemotron_ptv2_dataset.py` | Build a dataset from [Nemotron-Post-Training-Dataset-v2](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2) |
Both `make_nemotron_pt*.py` scripts support two modes:
| Mode | Description | Use case |
|---|---|---|
| `generate` (default) | Strips assistant turns, optionally augments prompts | Input data for synthetic generation (query a target model to produce training responses) |
| `train` | Keeps all turns, normalizes to clean OpenAI format | Direct SFT / distillation training |
### Synthetic Generation Pipeline
The `generate` mode produces conversation skeletons that are fed to a target model
via `tools/launcher/common/query.py` (vLLM or TRT-LLM). The output becomes training
data for a draft model (e.g. EAGLE3 speculative decoding) or a distilled student.
Every output row is a JSONL object with a single `messages` key:

```json
{"messages": [
  {"role": "user", "content": "What is 2+2?"},
  {"role": "assistant", "content": "4"}
]}
```

In `generate` mode, assistant turns are stripped so the row ends with a user turn.
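For example, the row above would become (before any optional prompt augmentation):

```json
{"messages": [
  {"role": "user", "content": "What is 2+2?"}
]}
```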
## Tokenizing for Megatron Frameworks
The distillation and pre-training scripts in Megatron-Bridge or Megatron-LM expect data pre-tokenized in Megatron's binary indexed format (`.bin` / `.idx`).
Use the `megatron_preprocess_data` utility to tokenize any JSONL or Hugging Face dataset.
The tokenization commands below print the list of output prefixes (e.g. `tokenized_qwen3/data1_text`) that you can use for the `data_paths` argument (with relative weights on different files) in Megatron training scripts; for example, `--data_paths 1.0 tokenized_qwen3/data1_text 1.0 tokenized_qwen3/data2_text` gives two files equal weight.
**Important Notes:**

- For pretraining / raw-text data (`text` key) — use `--append_eod` so Megatron can tell where documents end when concatenating them into long sequences.
- For post-training chat data (`messages` key) — omit `--append_eod`; the chat template already appends EOS at the end of each conversation.
- Set `--max_sequence_length 256_000` to avoid rare OOM errors if some text is very long.
Omit `--hf_name` to process all subsets, `--hf_split` to process all splits, or `--hf_max_samples_per_split` to process all samples.
To quickly test, use [nvidia/Nemotron-Pretraining-Dataset-sample](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-Dataset-sample).
For **very large datasets** (tens of millions of documents), add `--hf_streaming --hf_max_samples_per_split <num_samples>` to avoid downloading the full dataset — only the rows actually consumed are fetched.

> **Performance note:** Non-streaming mode downloads all Parquet shards once and caches them as Arrow files on disk.
> Re-runs read from cache and are much faster.
> Streaming re-downloads on every run with no cache, so it is slower for full-dataset processing.
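
Putting these notes together, a sketch of a pretraining-data run; the module invocation path and `--hf_dataset` flag spelling are assumptions inferred from this PR's flag list, not copied from the README:

```bash
# Hypothetical invocation — flag spellings assumed from the PR description.
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
    --hf_dataset nvidia/Nemotron-Pretraining-Dataset-sample \
    --json_keys text \
    --tokenizer_name_or_path Qwen/Qwen3-0.6B \
    --output_dir tokenized_qwen3/ \
    --append_eod \
    --strip_newlines \
    --max_sequence_length 256_000 \
    --workers 32
```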
In the example README, the old tokenization walkthrough (now consolidated above) is replaced with a pointer:

```diff
-This will create `tokenized_qwen3/data1_text_document.{bin,idx}` and `tokenized_qwen3/data2_text_document.{bin,idx}` files. We can use these files in the distillation script by passing `--data_paths 1.0 tokenized_qwen3/data1_text_document 1.0 tokenized_qwen3/data2_text_document` (equal weight for both datasets).
-Instead of `--jsonl_paths`, you can also pass a directory path to the `--input_dir` argument to tokenize all JSONL files in the directory.
-We are setting a maximum sequence length of 256k to avoid rare OOM errors in tokenization if text is too long.
-If you want to download and tokenize a dataset from Hugging Face Hub directly, you can use the following command:
-The [Nemotron-Pretraining-SFT-v1](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-SFT-v1) dataset is huge, so it will take a few hours to download and tokenize. You can also split the large `.jsonl` into multiple files (e.g. 10M samples per file using `split -l 10000000 -d --additional-suffix=.jsonl <file>.jsonl <file>_part`) and tokenize them in parallel via the `--jsonl_paths` argument.
-To quickly test the script, you can try the [nvidia/Nemotron-Pretraining-Dataset-sample](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-Dataset-sample) dataset.
-If you skip `--hf_name`, it will download and tokenize all subsets for the dataset.
-If you skip `--hf_split`, it will download and tokenize all splits for the subset.
-If you skip `--hf_max_samples_per_split`, it will download and tokenize all samples for the split.
+See the **[Dataset Preparation README](../dataset/README.md#tokenizing-for-megatron-frameworks)**
+for full instructions on tokenizing JSONL files and Hugging Face datasets, and to get the list of output prefixes to use for the `--data_paths` argument.
```