Commit dd33fce
flush print megatron tokenization stats and update readme (#927)
## What does this PR do?
When running the script, the tokenization stats printed every `log_interval` often do not show up, or show up with a long delay, because stdout is buffered. This PR switches to `print(..., flush=True)` so the stats appear immediately.
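The fix is a small one. A minimal sketch of the flushed stats print (the message format here is illustrative, not the PR's actual log line): `print` without `flush=True` leaves output in Python's block buffer when stdout is redirected to a file, which is exactly the situation under a Slurm job, so periodic stats can sit unseen for a long time.

```python
def log_stats(processed: int, total: int) -> None:
    # flush=True forces the line out immediately even when stdout is
    # block-buffered (e.g. when redirected to a log file under Slurm).
    print(f"tokenized {processed}/{total} samples", flush=True)

# Example: emit a stats line every `log_interval` samples.
log_interval = 1000
for i in range(0, 3001, log_interval):
    log_stats(i, 3000)
```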
Also update the README: the tokenization example shown takes too long to run on the full dataset, so split the data into multiple .jsonl files and tokenize them in parallel; and try a smaller dataset first to test the script.
## Testing
Split the Nemotron-pretraining-SFT-v1 dataset into multiple .jsonl shards and tokenized them in parallel in separate Slurm jobs.
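The splitting step can be sketched as follows; this is a hypothetical helper (the `split_jsonl` name, shard naming scheme, and round-robin strategy are illustrative assumptions, not code from this PR), showing one way to break a large .jsonl into shards that separate Slurm jobs can then tokenize independently.

```python
import json
from pathlib import Path


def split_jsonl(src: str, n_shards: int, out_dir: str) -> list[Path]:
    """Round-robin the lines of one large .jsonl into n_shards smaller files.

    Hypothetical helper: each line of a .jsonl is a self-contained JSON
    record, so lines can be distributed across shards in any order.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    shards = [open(out / f"shard_{i:03d}.jsonl", "w") for i in range(n_shards)]
    try:
        with open(src) as f:
            for lineno, line in enumerate(f):
                # Assign line i to shard i % n_shards for an even split.
                shards[lineno % n_shards].write(line)
    finally:
        for s in shards:
            s.close()
    return [Path(s.name) for s in shards]
```

Each resulting `shard_NNN.jsonl` can then be passed to its own Slurm job, so the shards tokenize concurrently instead of serially.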
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>

1 parent dc3a6ea · commit dd33fce
File tree: 2 files changed (+5, −1 lines)
- examples/megatron_bridge
- modelopt/torch/utils/plugins