
Commit d0abca7

Support megatron tokenization for post training datasets (NVIDIA#1018)
### What does this PR do?

Update `megatron_preprocess_data.py` to support applying a chat template when tokenizing chat-based post-training datasets.

### Usage

```python
# Add a code snippet demonstrating how to use this
```

### Testing

- Tokenized Nemotron-Post-Training-Dataset-v2 (~2B tokens for stem + chat + math + code splits)
- Doing distillation on pruned nano v2 7B

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: ❌
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A

## Summary by CodeRabbit

* **Bug Fixes**
  * Added runtime logging for data truncation operations to enhance processing visibility
  * Improved handling of chat-formatted conversation data in list format
  * Eliminated duplicate log messages during data encoding operations

---------

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
1 parent bc87981 commit d0abca7

2 files changed

Lines changed: 55 additions & 12 deletions


modelopt/torch/utils/plugins/megatron_preprocess_data.py

Lines changed: 39 additions & 6 deletions
@@ -15,9 +15,12 @@
 
 # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
 
-"""Processing large data to tokenize for pretraining.
+"""Processing large data pretraining and post-training datasets to tokenize for usage in megatron pretraining scripts.
 
-Usage to tokenize one or more JSONL files:
+We apply chat_template to the data if the JSON key is a list of message dicts (e.g. Nemotron-Post-Training-Dataset-v2)
+so that we can tokenize the data for usage in megatron pretraining scripts.
+
+Usage to tokenize one or more JSONL files (pretraining, ``text`` key):
 
 ```bash
 python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
@@ -37,6 +40,21 @@
     --tokenizer Qwen/Qwen3-0.6B
 ```
 
+Usage to tokenize a post-training dataset with ``messages`` key (chat format):
+
+```bash
+python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
+    --jsonl_paths path/to/sft_data.jsonl \
+    --json_keys messages \
+    --output_dir /path/to/tokenized/Qwen3/ \
+    --tokenizer Qwen/Qwen3-0.6B
+```
+
+When the value for a JSON key is a list of message dicts (e.g.
+``[{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]``),
+``tokenizer.apply_chat_template`` is automatically used to render the conversation
+into a single text string before tokenization.
+
 Usage to download and tokenize a dataset from Hugging Face Hub:
 
 ```bash
@@ -69,6 +87,7 @@
 
 class _Encoder:
     tokenizer: AutoTokenizer = None
+    _chat_template_logged: set[str] = set()
 
     def __init__(
         self,
@@ -97,21 +116,35 @@ def encode(self, json_line: str):
         doc_len = 0
         enc_len = 0
         for key in self.json_keys:
-            text = data[key]
+            value = data[key]
+
+            if isinstance(value, list):
+                if key not in _Encoder._chat_template_logged:
+                    _Encoder._chat_template_logged.add(key)
+                    print(f"Applying chat_template to '{key}' key")
+                kwargs = {}
+                tools = data.get("tools")
+                if tools:
+                    kwargs["tools"] = tools
+                text = _Encoder.tokenizer.apply_chat_template(value, tokenize=False, **kwargs)
+            else:
+                text = value
 
             # Truncate text by character length if specified
-            doc_len += len(text)
             if self.max_document_length is not None:
+                original_length = len(text)
                 text = text[: self.max_document_length]
-                # print(f"Document truncated from {original_length} to {self.max_document_length} characters")
+                if original_length != len(text):
+                    print(f"Document truncated from {original_length} to {len(text)} characters")
+            doc_len += len(text)
 
             # Tokenize the entire text as one document
             encoded = _Encoder.tokenizer.encode(text)
 
-            enc_len += len(encoded)
             if self.max_sequence_length is not None:
                 encoded = encoded[: self.max_sequence_length]
                 # print(f"Sequence truncated from {original_length} to {self.max_sequence_length} tokens")
+            enc_len += len(encoded)
 
             if len(encoded) > 0 and self.append_eod:
                 encoded.append(_Encoder.tokenizer.eos_token_id)
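
The order of operations in the `encode` hunk above can be followed more easily in isolation. The sketch below is a simplified, self-contained approximation for a single value: `ToyTokenizer` and `encode_value` are hypothetical names introduced only for illustration, the toy chat rendering is not a real model template, and the `tools` forwarding and log messages from the diff are left out. It shows that both counters are now updated after truncation and that the end-of-document token is appended after `enc_len` is taken.

```python
class ToyTokenizer:
    """Hypothetical stand-in for a Hugging Face tokenizer (illustration only)."""

    eos_token_id = 0

    def apply_chat_template(self, messages, tokenize=False, **kwargs):
        # Render each turn as "<role>: <content>"; real templates are model-specific.
        return "\n".join(f"{m['role']}: {m['content']}" for m in messages)

    def encode(self, text):
        # One fake token id per whitespace-separated word (ids start at 1).
        return [i + 1 for i, _ in enumerate(text.split())]


def encode_value(value, tokenizer, max_document_length=None, max_sequence_length=None, append_eod=False):
    """Return (token_ids, doc_len, enc_len), following the step order shown in the diff."""
    # 1. Render a chat conversation to one string; plain strings pass through unchanged.
    if isinstance(value, list):
        text = tokenizer.apply_chat_template(value, tokenize=False)
    else:
        text = value

    # 2. Truncate by characters, then count the characters actually kept.
    if max_document_length is not None:
        text = text[:max_document_length]
    doc_len = len(text)

    # 3. Tokenize, truncate by tokens, then count the tokens actually kept.
    encoded = tokenizer.encode(text)
    if max_sequence_length is not None:
        encoded = encoded[:max_sequence_length]
    enc_len = len(encoded)

    # 4. Append the end-of-document token; it is not part of enc_len.
    if len(encoded) > 0 and append_eod:
        encoded.append(tokenizer.eos_token_id)
    return encoded, doc_len, enc_len


messages = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "4"},
]
ids, doc_len, enc_len = encode_value(messages, ToyTokenizer(), max_sequence_length=3, append_eod=True)
print(ids, doc_len, enc_len)
```

With a real tokenizer the rendered text and token ids will differ; only the sequencing of render, truncate, count, and append is meant to carry over.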

tests/gpu_megatron/torch/utils/plugins/test_megatron_preprocess_data.py

Lines changed: 16 additions & 6 deletions
@@ -17,6 +17,8 @@
 import os
 from pathlib import Path
 
+import pytest
+
 from modelopt.torch.utils.dataset_utils import download_hf_dataset_as_jsonl
 from modelopt.torch.utils.plugins.megatron_preprocess_data import megatron_preprocess_data
 
@@ -65,19 +67,27 @@ def test_megatron_preprocess_data_with_minipile_jsonl(tmp_path):
     assert os.path.getsize(expected_idx_file) > 0, "Index file should not be empty"
 
 
-def test_megatron_preprocess_data_with_hf_dataset(tmp_path):
+@pytest.mark.parametrize(
+    ("hf_dataset", "hf_split", "json_keys"),
+    [
+        ("nanotron/minipile_100_samples", "train", ["text"]),
+        ("HuggingFaceTB/everyday-conversations-llama3.1-2k", "test_sft", ["messages"]),
+    ],
+)
+def test_megatron_preprocess_data_with_hf_dataset(tmp_path, hf_dataset, hf_split, json_keys):
     """Test megatron_preprocess_data with dataset download, --append_eod and --max_sequence_length.
 
     Downloads nanotron/minipile_100_samples train split from Hugging Face and tokenizes it.
     """
     megatron_preprocess_data(
-        hf_dataset="nanotron/minipile_100_samples",
-        hf_split="train",
+        hf_dataset=hf_dataset,
+        hf_split=hf_split,
+        hf_max_samples_per_split=10,
         output_dir=tmp_path,
-        tokenizer_name_or_path="gpt2",
-        json_keys=["text"],
+        tokenizer_name_or_path="Qwen/Qwen3-0.6B",
+        json_keys=json_keys,
         append_eod=True,
-        max_sequence_length=512,
+        max_sequence_length=32,
         workers=4,
     )
 
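
The two parametrized cases differ in the shape of the value stored under the JSON key: a plain string under `text`, or a list of message dicts under `messages`. The sketch below, which uses only the standard library, writes one hypothetical record of each shape to a JSONL file and applies the same `isinstance(value, list)` check that the diff uses to decide whether a chat template is needed. The records, the file name, and the helper names `needs_chat_template` and `classify_jsonl` are assumptions made for illustration, not part of the repository.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sample records; real datasets carry other fields and content.
records = [
    {"text": "A plain pretraining document."},
    {
        "messages": [
            {"role": "user", "content": "Hi"},
            {"role": "assistant", "content": "Hello!"},
        ]
    },
]


def needs_chat_template(value) -> bool:
    """Mirror the check in the diff: list values are treated as chat conversations."""
    return isinstance(value, list)


def classify_jsonl(path: Path, json_keys: list[str]) -> list[tuple[str, bool]]:
    """Return (key, needs_chat_template) for each requested key found in each line.

    Unlike the code in the diff, which indexes data[key] directly, this sketch
    skips keys that are missing from a record.
    """
    results = []
    with path.open(encoding="utf-8") as f:
        for line in f:
            data = json.loads(line)
            for key in json_keys:
                if key in data:
                    results.append((key, needs_chat_template(data[key])))
    return results


with tempfile.TemporaryDirectory() as tmp_dir:
    jsonl_path = Path(tmp_dir) / "sample.jsonl"  # hypothetical file name
    jsonl_path.write_text("\n".join(json.dumps(r) for r in records) + "\n", encoding="utf-8")
    classified = classify_jsonl(jsonl_path, ["text", "messages"])

print(classified)
```

A file shaped like the second record is what the `--json_keys messages` invocation in the module docstring expects as input.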
