Once inside the container, you need to log in with your HuggingFace token to download gated datasets.
Note that the default dataset for pruning and quantization is [`nemotron-post-training-dataset-v2`](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2), which is gated.

```bash
hf auth login --token <your token>
```
## Pruning

## Distillation

The [distill.py](distill.py) script loads student and teacher models from HuggingFace…

### Data Preparation

The distillation script expects pre-tokenized data in Megatron's binary format (`.bin` / `.idx` files).
You can tokenize your JSONL dataset using the following function:

```python
from modelopt.torch.utils.plugins import megatron_preprocess_data

megatron_preprocess_data(
    input_path="/path/to/your/data.jsonl",
    output_dir="/path/to/tokenized/data",
    tokenizer_name_or_path="Qwen/Qwen3-0.6B",
    json_keys=["text"],  # change to your JSON key if needed
    workers=32,
    log_interval=100000,
    max_sequence_length=256000,  # to avoid rare OOM errors if the text is too long
)
```
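For reference, the preprocessing step expects one JSON object per line, with the text to tokenize stored under the key named in `json_keys` (here, `"text"`). A minimal sketch of producing such a file (the file name and sample records below are illustrative, not part of the example scripts):

```python
import json

# Each line of the JSONL file is a standalone JSON object;
# the "text" key here must match the json_keys passed to the preprocessor.
records = [
    {"text": "The quick brown fox jumps over the lazy dog."},
    {"text": "Distillation transfers knowledge from a teacher model to a student model."},
]

with open("data.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```

Any corpus in this one-object-per-line shape can then be pointed at via `input_path`, with `json_keys` adjusted to match your field name.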