Commit 7f3e519

Merge pull request #47 from KempnerInstitute/fix-internet-access-1
Update tokenizer caching instructions for clarity
2 parents 01d4b30 + 2abbf54 commit 7f3e519

1 file changed

Lines changed: 2 additions & 3 deletions

docs/how-to/end-to-end-training-run.md

@@ -33,9 +33,8 @@ uv sync # creates .venv and installs all deps

 ## 2. Cache the tokenizer

-The reference config uses the GPT-2 tokenizer. Compute nodes
-typically don't have internet access, so pre-cache it on the login
-node:
+The reference config uses the GPT-2 tokenizer. Compute nodes typically have restricted or much slower
+internet access (~1 Gbps vs. ~100 Gbps on login nodes), so it’s best to pre-cache it on the login node:

 ```bash
 python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('gpt2')"
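
As a follow-up to the pre-cache step shown in this diff, a quick check on a compute node can confirm that the tokenizer loads without network access. The sketch below is not part of the commit: the helper names `offline_env`, `offline_check_command`, and `cache_is_usable_offline` are hypothetical, and it assumes the login and compute nodes share the same Hugging Face cache directory. It relies on the `HF_HUB_OFFLINE` environment variable and the `local_files_only` argument of `from_pretrained`.

```python
import os
import subprocess
import sys


def offline_env(base=None):
    """Return a copy of the environment with Hugging Face Hub access disabled."""
    env = dict(os.environ if base is None else base)
    env["HF_HUB_OFFLINE"] = "1"  # tell huggingface_hub not to make network calls
    return env


def offline_check_command(model_id="gpt2"):
    """Build a command that loads the tokenizer from the local cache only."""
    code = (
        "from transformers import AutoTokenizer; "
        f"AutoTokenizer.from_pretrained({model_id!r}, local_files_only=True)"
    )
    return [sys.executable, "-c", code]


def cache_is_usable_offline(model_id="gpt2"):
    """Hypothetical helper: True if the tokenizer loads with no network access.

    Assumes `transformers` is installed in the current environment and that
    the cache written on the login node is visible from this node.
    """
    result = subprocess.run(
        offline_check_command(model_id),
        env=offline_env(),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```

Calling `cache_is_usable_offline("gpt2")` on a compute node should return `True` once the login-node step has populated the shared cache. A `False` result suggests the cache is missing or lives in a directory the compute node cannot see.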
