examples/megatron_bridge/README.md: 10 changes (6 additions, 4 deletions)

@@ -105,11 +105,13 @@ python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
--jsonl_paths /path/to/data1.jsonl /path/to/data2.jsonl ... \
--json_keys text \
--tokenizer Qwen/Qwen3-0.6B \
- --output_dir /path/to/tokenized/data/qwen3 \
+ --output_dir tokenized_qwen3 \
--workers 32 \
--max_sequence_length 256_000
```

+ This will create `tokenized_qwen3/data1_text_document.{bin,idx}` and `tokenized_qwen3/data2_text_document.{bin,idx}` files. We can use these files in the distillation script by passing `--data_paths 1.0 tokenized_qwen3/data1_text_document 1.0 tokenized_qwen3/data2_text_document` (equal weight for both datasets).
+
Instead of `--jsonl_paths`, you can also pass a directory path via the `--input_dir` argument to tokenize all JSONL files in that directory, as sketched below.
We set the maximum sequence length to 256k to avoid rare OOM errors during tokenization when a text sample is extremely long.
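
A minimal sketch of the `--input_dir` variant, assuming a hypothetical `raw_data/` directory holding the JSONL files (all other flags as in the command above):

```sh
# Tokenize every JSONL file found under raw_data/ in one call
python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
    --input_dir raw_data \
    --json_keys text \
    --tokenizer Qwen/Qwen3-0.6B \
    --output_dir tokenized_qwen3 \
    --workers 32 \
    --max_sequence_length 256_000
```

On the `--data_paths` weights: assuming the usual Megatron blend semantics (weights are normalized, so they need not sum to 1), `--data_paths 2.0 tokenized_qwen3/data1_text_document 1.0 tokenized_qwen3/data2_text_document` would sample data1 roughly twice as often as data2.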

@@ -123,12 +125,12 @@ python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
--hf_max_samples_per_split 10_000_000 \
--json_keys text \
--tokenizer Qwen/Qwen3-0.6B \
- --output_dir /path/to/tokenized/data/qwen3 \
+ --output_dir tokenized_qwen3 \
--workers 32 \
--max_sequence_length 256_000
```

- The [Nemotron-Pretraining-SFT-v1](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-SFT-v1) dataset is huge, so it will take a while to download and tokenize. You can also split the large `.jsonl` into multiple files (e.g. 10M samples per file using `split -l 10000000 -d --additional-suffix=.jsonl <file>.jsonl <file>_part`) and tokenize them parallelly.
+ The [Nemotron-Pretraining-SFT-v1](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-SFT-v1) dataset is huge, so it will take a few hours to download and tokenize. You can also split the large `.jsonl` into multiple files (e.g. 10M samples per file using `split -l 10000000 -d --additional-suffix=.jsonl <file>.jsonl <file>_part`) and tokenize them parallelly via the `--jsonl_paths` argument.
Contributor:
⚠️ Potential issue | 🟡 Minor

Fix wording: use “in parallel” instead of “parallelly”.

Line 133 has a minor text-quality issue.

Suggested wording fix
-The [Nemotron-Pretraining-SFT-v1](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-SFT-v1) dataset is huge, so it will take a few hours to download and tokenize. You can also split the large `.jsonl` into multiple files (e.g. 10M samples per file using `split -l 10000000 -d --additional-suffix=.jsonl <file>.jsonl <file>_part`) and tokenize them parallelly via the `--jsonl_paths` argument.
+The [Nemotron-Pretraining-SFT-v1](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-SFT-v1) dataset is huge, so it will take a few hours to download and tokenize. You can also split the large `.jsonl` into multiple files (e.g. 10M samples per file using `split -l 10000000 -d --additional-suffix=.jsonl <file>.jsonl <file>_part`) and tokenize them in parallel via the `--jsonl_paths` argument.

To quickly test the script, you can try the [nvidia/Nemotron-Pretraining-Dataset-sample](https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-Dataset-sample) dataset.

If you skip `--hf_name`, it will download and tokenize all subsets of the dataset.
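
A minimal sketch of the split-and-tokenize-in-parallel approach mentioned above, assuming a hypothetical `big.jsonl` input (one background job per part; lower `--workers` per job so the total stays within your CPU budget):

```sh
# Split into 10M-line parts: big_part00.jsonl, big_part01.jsonl, ...
split -l 10000000 -d --additional-suffix=.jsonl big.jsonl big_part

# Tokenize each part as a separate background process, then wait for all of them
for f in big_part*.jsonl; do
    python -m modelopt.torch.utils.plugins.megatron_preprocess_data \
        --jsonl_paths "$f" \
        --json_keys text \
        --tokenizer Qwen/Qwen3-0.6B \
        --output_dir tokenized_qwen3 \
        --workers 8 \
        --max_sequence_length 256_000 &
done
wait
```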
@@ -144,7 +146,7 @@ torchrun --nnodes 1 --nproc_per_node 8 distill.py \
--tp_size 8 \
--teacher_hf_path Qwen/Qwen3-8B \
--student_hf_path Qwen/Qwen3-4B \
- --data_paths 1.0 /path/to/tokenized/data/qwen3 \
+ --data_paths 1.0 tokenized_qwen3/data1_text_document 1.0 tokenized_qwen3/data2_text_document \
kevalmorabia97 marked this conversation as resolved.
--data_path_to_cache /path/to/cache/dataset_indices_qwen3 \
--seq_length 8192 \
--mbs 1 \
examples/megatron_bridge/distill.py: 3 changes (2 additions, 1 deletion)

@@ -66,6 +66,7 @@ def get_args():
required=True,
help="HuggingFace model name or path for the teacher (e.g. Qwen/Qwen3-8B)",
)
parser.add_argument("--trust_remote_code", action="store_true", help="Trust remote code")
# Parallelism arguments
parser.add_argument("--tp_size", type=int, default=1, help="Tensor parallel size")
parser.add_argument("--pp_size", type=int, default=1, help="Pipeline parallel size")
@@ -135,7 +136,7 @@ def main(args: argparse.Namespace):

# Build student and teacher model providers
def _build_model_provider(hf_path):
- bridge = AutoBridge.from_hf_pretrained(hf_path)
+ bridge = AutoBridge.from_hf_pretrained(hf_path, trust_remote_code=args.trust_remote_code)
provider = bridge.to_megatron_provider(load_weights=True)

# Override parallelism / training settings
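
Usage note: with this change, a model whose HuggingFace repo ships custom modeling code can be loaded by adding the new flag to the distillation command, e.g. (a hypothetical invocation; `<teacher>` and `<student>` stand in for real model IDs):

```sh
torchrun --nnodes 1 --nproc_per_node 8 distill.py \
    --teacher_hf_path <teacher> \
    --student_hf_path <student> \
    --trust_remote_code \
    ...
```

Since `action="store_true"` defaults to `False`, existing invocations without the flag keep their current behavior.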