1 parent cfd0090 commit 20135e2
1 file changed
docs/examples/Python/llm_dataset_creation.rst
@@ -8,7 +8,7 @@ all of it on the disk at once. This becomes a considerable problem when you just
In this example, we will be bypassing this problem by downloading a text dataset in parts, tokenizing it and saving it as a Lance dataset.
This can be done for as many or as few data samples as you wish with average memory consumption approximately 3-4 GBs!
-For this example, we are working with the `wikitext <https://huggingface.co/datasets/wikitext>`_ dataset,
+For this example, we are working with the `wikitext <https://huggingface.co/datasets/Salesforce/wikitext>`_ dataset,
which is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia.
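
The "download in parts, tokenize, save" idea described in this introduction can be illustrated with a minimal, standard-library-only sketch. It is not the example's actual implementation, which this excerpt does not show. The names `toy_tokenize`, `token_batches`, `fake_stream`, and `written` are hypothetical stand-ins: a real pipeline would presumably read from a streamed dataset, use a proper tokenizer, and append each batch to a Lance dataset on disk. The sketch only shows why memory use stays bounded: at most one batch of samples is held at a time.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def toy_tokenize(text: str, vocab: Dict[str, int]) -> List[int]:
    """Hypothetical stand-in tokenizer: one integer id per whitespace-separated word."""
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]


def token_batches(
    samples: Iterable[str], batch_size: int, vocab: Dict[str, int]
) -> Iterator[List[List[int]]]:
    """Lazily yield batches of tokenized samples, pulling only batch_size items at a time."""
    iterator = iter(samples)
    while True:
        batch = [toy_tokenize(text, vocab) for text in islice(iterator, batch_size)]
        if not batch:
            return
        yield batch


def fake_stream() -> Iterator[str]:
    """Hypothetical stand-in for a dataset that is streamed rather than fully downloaded."""
    for i in range(5):
        yield f"sample number {i}"


vocab: Dict[str, int] = {}
written: List[int] = []  # stands in for appending each batch to an on-disk dataset
for batch in token_batches(fake_stream(), batch_size=2, vocab=vocab):
    written.append(len(batch))  # only the current batch is in memory here
```

Because `token_batches` is a generator, the consumer decides how many samples to process: stopping the loop early simply leaves the rest of the stream unread.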
Preparing and pre-processing the raw dataset