
Commit dd33fce

kevalmorabia97 authored and danielkorzekwa committed
flush print megatron tokenization stats and update readme (#927)
## What does this PR do?

When running the script, the tokenization stats printed every `log_interval` often do not show up, or show up very delayed. This PR uses `print(..., flush=True)` to fix that. It also updates the README to note that the tokenization example shown takes too long to run as-is: split the large dataset into multiple .jsonl files to run tokenization efficiently, and try a smaller dataset first to test the script.

## Testing

Split the Nemotron-Pretraining-SFT-v1 dataset into multiple .jsonl splits, then tokenized them in parallel in separate Slurm jobs.

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
1 parent dc3a6ea commit dd33fce

File tree

2 files changed: +5 −1 lines changed

examples/megatron_bridge/README.md

Lines changed: 3 additions & 0 deletions
````diff
@@ -128,6 +128,9 @@ python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
     --max_sequence_length 256_000
 ```
 
+The [Nemotron-Pretraining-SFT-v1](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-SFT-v1) dataset is huge, so it will take a while to download and tokenize. You can also split the large `.jsonl` into multiple files (e.g. 10M samples per file using `split -l 10000000 -d --additional-suffix=.jsonl <file>.jsonl <file>_part`) and tokenize them in parallel.
+To quickly test the script, you can try the [nvidia/Nemotron-Pretraining-Dataset-sample](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-Dataset-sample) dataset.
+
 If you skip `--hf_name`, it will download and tokenize all subsets for the dataset.
 If you skip `--hf_split`, it will download and tokenize all splits for the subset.
 If you skip `--hf_max_samples_per_split`, it will download and tokenize all samples for the split.
````
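The README suggests splitting a large `.jsonl` with the coreutils `split` command. The same idea can be mirrored in plain Python; the sketch below is illustrative only (the `split_jsonl` helper and its part-naming scheme are assumptions, not part of the repo), with a small chunk size in place of the README's 10M lines:

```python
# Minimal sketch: split a large .jsonl into fixed-size parts, mirroring
# `split -l <N> -d --additional-suffix=.jsonl <file>.jsonl <file>_part`.
from pathlib import Path


def split_jsonl(path: str, lines_per_part: int) -> list[str]:
    """Write consecutive chunks of `lines_per_part` lines to numbered part files."""
    src = Path(path)
    parts: list[str] = []
    buf: list[str] = []

    def _flush() -> None:
        # Write the buffered lines to the next numbered part file.
        if buf:
            part = src.with_name(f"{src.stem}_part{len(parts):02d}.jsonl")
            part.write_text("".join(buf))
            parts.append(str(part))
            buf.clear()

    with src.open() as f:
        for line in f:
            buf.append(line)
            if len(buf) == lines_per_part:
                _flush()
        _flush()  # remainder, if the line count is not a multiple
    return parts
```

Each resulting part file can then be tokenized by its own job, as described in the PR's testing notes.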

modelopt/torch/utils/plugins/megatron_preprocess_data.py

Lines changed: 2 additions & 1 deletion
````diff
@@ -137,7 +137,8 @@ def _print_processing_stats(
     ):
         if count % self.log_interval == 0 or force_print:
             print(
-                f"\tProcessed {num2hrb(count)} docs = {num2hrb(total_doc_len)} chars = {num2hrb(total_enc_len)} tokens"
+                f"\tProcessed {num2hrb(count)} docs = {num2hrb(total_doc_len)} chars = {num2hrb(total_enc_len)} tokens",
+                flush=True,
             )
 
     def process_json_file(
````
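The buffering issue this patch addresses can be shown with a standalone sketch. `print_stats` below is a hypothetical helper that mirrors the patched pattern (periodic logging plus `flush=True`), not the repo's actual API; `flush=True` matters because stdout is typically block-buffered when redirected to a file, e.g. under Slurm, so unflushed prints can sit in the buffer for a long time:

```python
def print_stats(count: int, log_interval: int, msg: str, force: bool = False) -> None:
    # Print only every `log_interval` iterations (or when forced), and flush
    # immediately so progress lines appear promptly even when stdout is
    # block-buffered rather than line-buffered.
    if count % log_interval == 0 or force:
        print(msg, flush=True)
```

For example, with `log_interval=3`, calling this for counts 1 through 6 emits output only at counts 3 and 6, each line flushed as soon as it is printed.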
