Commit 01d4b30

Merge pull request #48 from KempnerInstitute/fix-internet-2
Update tokenizer caching instructions for clarity
2 parents 37ca7e1 + 66e9ed9 commit 01d4b30

1 file changed

Lines changed: 3 additions & 2 deletions

docs/how-to/prepare-tokenized-data.md

@@ -121,8 +121,9 @@ local directory.

 ### Cache the tokenizer first

-HuggingFace won't reach out to the hub from a compute node that
-doesn't have internet. Cache on the login node once:
+Hugging Face cannot access the hub from compute nodes with no or limited internet connectivity.
+Since compute nodes also have much slower bandwidth (~1 Gbps vs. ~100 Gbps on login nodes),
+cache the tokenizer once on the login node:

 ```bash
 python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('gpt2')"
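
Once the tokenizer is cached on the login node, a job on a compute node can be told to read only from that cache. The following is a minimal sketch, not part of this commit: the helper names `offline_env` and `load_cached_tokenizer` are hypothetical, and it assumes the compute node sees the same cache directory as the login node (for example through a shared home filesystem). `HF_HUB_OFFLINE`, `TRANSFORMERS_OFFLINE`, `HF_HOME`, and the `local_files_only` argument are existing Hugging Face settings.

```python
import os
from pathlib import Path


def offline_env(cache_dir=None):
    """Return environment variables that keep Hugging Face libraries off the network.

    cache_dir is optional and hypothetical here: pass it only if the cache
    lives somewhere other than the default location.
    """
    env = {"HF_HUB_OFFLINE": "1", "TRANSFORMERS_OFFLINE": "1"}
    if cache_dir is not None:
        env["HF_HOME"] = str(Path(cache_dir).expanduser())
    return env


def load_cached_tokenizer(name="gpt2", cache_dir=None):
    """Load a tokenizer from the local cache only.

    Raises an error if the tokenizer was never cached, instead of
    attempting a download from the hub.
    """
    # Set the variables before importing transformers, since the hub
    # client reads its offline flag when it is first imported.
    os.environ.update(offline_env(cache_dir))
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(name, local_files_only=True)
```

Calling `load_cached_tokenizer("gpt2")` inside a job should then succeed without network access if the login-node caching step ran first. Passing `local_files_only=True` in addition to the environment variables is a deliberate redundancy: it still applies when `transformers` was already imported before the variables were set.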