
Commit dd33fce

kevalmorabia97 authored and danielkorzekwa committed
flush print megatron tokenization stats and update readme (#927)
## What does this PR do?

When running the script, the tokenization stats printed every `log_interval` often do not show up, or show up very delayed. This PR uses `print(..., flush=True)` to fix that. It also updates the README to note that the tokenization example shown takes too long to run as-is: split the large dataset into multiple .jsonl files to run tokenization efficiently, and try a smaller dataset first to test the script.

## Testing

Split the Nemotron-Pretraining-SFT-v1 dataset into multiple .jsonl splits, then tokenized them in parallel in separate Slurm jobs.

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
1 parent dc3a6ea commit dd33fce

File tree

2 files changed: +5 −1 lines changed

examples/megatron_bridge/README.md

Lines changed: 3 additions & 0 deletions
````diff
@@ -128,6 +128,9 @@ python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
     --max_sequence_length 256_000
 ```
 
+The [Nemotron-Pretraining-SFT-v1](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-SFT-v1) dataset is huge, so it will take a while to download and tokenize. You can also split the large `.jsonl` into multiple files (e.g. 10M samples per file using `split -l 10000000 -d --additional-suffix=.jsonl <file>.jsonl <file>_part`) and tokenize them in parallel.
+To quickly test the script, you can try the [nvidia/Nemotron-Pretraining-Dataset-sample](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-Dataset-sample) dataset.
+
 If you skip `--hf_name`, it will download and tokenize all subsets for the dataset.
 If you skip `--hf_split`, it will download and tokenize all splits for the subset.
 If you skip `--hf_max_samples_per_split`, it will download and tokenize all samples for the split.
````
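The README suggests splitting a large `.jsonl` with the coreutils `split` command. The same idea can be mirrored in plain Python; the sketch below is illustrative only (the `split_jsonl` helper and its part-naming scheme are assumptions, not part of the repo), with a small chunk size in place of the README's 10M lines:

```python
# Minimal sketch: split a large .jsonl into fixed-size parts, mirroring
# `split -l <N> -d --additional-suffix=.jsonl <file>.jsonl <file>_part`.
from pathlib import Path


def split_jsonl(path: str, lines_per_part: int) -> list[str]:
    """Write consecutive chunks of `lines_per_part` lines to numbered part files."""
    src = Path(path)
    parts: list[str] = []
    buf: list[str] = []

    def _flush() -> None:
        # Write the buffered lines to the next numbered part file.
        if buf:
            part = src.with_name(f"{src.stem}_part{len(parts):02d}.jsonl")
            part.write_text("".join(buf))
            parts.append(str(part))
            buf.clear()

    with src.open() as f:
        for line in f:
            buf.append(line)
            if len(buf) == lines_per_part:
                _flush()
        _flush()  # remainder, if the line count is not a multiple
    return parts
```

Each resulting part file can then be tokenized by its own job, as described in the PR's testing notes.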

modelopt/torch/utils/plugins/megatron_preprocess_data.py

Lines changed: 2 additions & 1 deletion
````diff
@@ -137,7 +137,8 @@ def _print_processing_stats(
     ):
         if count % self.log_interval == 0 or force_print:
             print(
-                f"\tProcessed {num2hrb(count)} docs = {num2hrb(total_doc_len)} chars = {num2hrb(total_enc_len)} tokens"
+                f"\tProcessed {num2hrb(count)} docs = {num2hrb(total_doc_len)} chars = {num2hrb(total_enc_len)} tokens",
+                flush=True,
             )
 
     def process_json_file(
````
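The buffering issue this patch addresses can be shown with a standalone sketch. `print_stats` below is a hypothetical helper that mirrors the patched pattern (periodic logging plus `flush=True`), not the repo's actual API; `flush=True` matters because stdout is typically block-buffered when redirected to a file, e.g. under Slurm, so unflushed prints can sit in the buffer for a long time:

```python
def print_stats(count: int, log_interval: int, msg: str, force: bool = False) -> None:
    # Print only every `log_interval` iterations (or when forced), and flush
    # immediately so progress lines appear promptly even when stdout is
    # block-buffered rather than line-buffered.
    if count % log_interval == 0 or force:
        print(msg, flush=True)
```

For example, with `log_interval=3`, calling this for counts 1 through 6 emits output only at counts 3 and 6, each line flushed as soon as it is printed.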
