(pretraining)=
# Pretraining

In this tutorial, we introduce how to run pretraining with real datasets. While synthetic data is commonly used for benchmarking, we rely on real datasets to obtain meaningful weights. Currently, MaxText supports three dataset input pipelines: HuggingFace, Grain, and TensorFlow Datasets (TFDS). We will walk you through: setting up a dataset, modifying the dataset and tokenizer configs for training, and optionally enabling evaluation.
Before starting this tutorial, ensure you have installed MaxText following the [official documentation](https://maxtext.readthedocs.io/en/latest/install_maxtext.html). For pre-training, install `maxtext[tpu]` for TPUs or `maxtext[cuda12]` for GPUs.
To start with, we focus on HuggingFace datasets for convenience.
- Later on, we will give brief examples for Grain and TFDS. For a comprehensive guide, see the Data Input Pipeline topic.
- For demonstration, we use the DeepSeek-V2-Lite model and the C4 dataset. C4 stands for "Colossal Clean Crawled Corpus", a high-quality pretraining dataset first introduced by Google's T5 work. Feel free to try other models and datasets.
## 1. HuggingFace pipeline

We use the HuggingFace dataset [allenai/c4](https://huggingface.co/datasets/allenai/c4), a processed version of Google's C4. This dataset is organized into subsets (e.g., `en`, `es`), and each subset contains data splits (e.g., `train`, `validation`).
Data preparation: You don't need to download data, as the pipeline can stream data directly from the HuggingFace Hub. Alternatively, it can stream from a Cloud Storage bucket; see the HuggingFace Pipeline page.
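To sanity-check the dataset before launching a run, you can stream a few records directly from the Hub with the `datasets` library. This is a minimal sketch, independent of MaxText:

```python
# Preview a few C4 examples streamed from the HuggingFace Hub (no download).
# Requires: pip install datasets
from datasets import load_dataset

ds = load_dataset("allenai/c4", "en", split="train", streaming=True)
for i, example in enumerate(ds):
    print(example["text"][:80])  # each C4 record has a "text" field
    if i == 2:
        break
```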
We can use this command for pretraining:
```bash
# replace base_output_directory with your bucket
python3 -m maxtext.trainers.pre_train.train \
    base_output_directory=gs://runner-maxtext-logs run_name=demo \
    model_name=deepseek2-16b per_device_batch_size=1 steps=10 max_target_length=2048 enable_checkpointing=false \
    dataset_type=hf hf_path=allenai/c4 hf_data_dir=en train_split=train \
    tokenizer_type=huggingface tokenizer_path=deepseek-ai/DeepSeek-V2-Lite
```

Dataset config:
- `dataset_type`: `hf`
- `hf_path`: the HuggingFace dataset repository is `allenai/c4`.
- `hf_data_dir`: the subset is `en`, corresponding to English data.
- `train_split`: `train`. Training will use the `train` split.
The above command runs training only: `steps=10` on the `train` split of the `en` subset of `allenai/c4`. The log shows:

```
completed step: 1, seconds: 0.287, TFLOP/s/device: 110.951, Tokens/s/device: 7131.788, total_weights: 7517, loss: 12.021
...
completed step: 9, seconds: 1.010, TFLOP/s/device: 31.541, Tokens/s/device: 2027.424, total_weights: 7979, loss: 9.436
```

The `total_weights` value is the number of real (non-padding) tokens processed in each step. More explanation can be found on the Understand Logs and Metrics page.
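As a quick sanity check, `total_weights` can be compared against the theoretical per-step maximum. A minimal sketch, assuming a hypothetical 4-device run with the settings from the command above:

```python
# Upper bound on tokens per step; the logged total_weights is lower
# because padding tokens do not count toward the loss.
num_devices = 4            # hypothetical topology; set to your device count
per_device_batch_size = 1  # from the training command
max_target_length = 2048   # from the training command

max_tokens_per_step = num_devices * per_device_batch_size * max_target_length
print(max_tokens_per_step)  # 8192, consistent with logged values like 7979
```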
Evaluation config (optional):
To add evaluation steps, we can specify a positive evaluation interval and the dataset split, for instance `eval_interval=5 eval_steps=10 hf_eval_split=validation`. Every 5 training steps, we run evaluation for 10 steps, using the `validation` split. In the log, you will additionally see:

```
Completed eval step 0
...
Completed eval step 9
eval metrics after step: 4, loss=9.855, total_weights=75264.0
Completed eval step 0
...
Completed eval step 9
eval metrics after step: 9, loss=9.420, total_weights=75264.0
```
Tokenizer config:
- `tokenizer_type`: `huggingface`. Note that the HuggingFace input pipeline only supports HuggingFace tokenizers.
- `tokenizer_path`: `deepseek-ai/DeepSeek-V2-Lite`, corresponding to the HuggingFace model repository.
HuggingFace access token (optional):
- For a gated dataset or a tokenizer from a gated model, you need to request access on HuggingFace and provide `hf_access_token=<YOUR_TOKEN>` in the command. For instance, meta-llama/Llama-2-7b is a gated model.
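Outside of MaxText, you can verify that the tokenizer loads with the `transformers` library. A minimal sketch; the `token` argument is only needed for gated repositories:

```python
# Load the tokenizer used above and tokenize a sample string.
# Requires: pip install transformers
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V2-Lite")
print(tok("MaxText pretraining demo")["input_ids"])

# For a gated repository, pass your token explicitly, e.g.:
# tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b", token="<YOUR_TOKEN>")
```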
## 2. Grain pipeline

Grain is a library for reading data for training and evaluating JAX models. It is the recommended input pipeline for determinism and resilience. It supports data formats like ArrayRecord and Parquet. You can check the Grain pipeline page for more details.
Data preparation: You need to download data to a Cloud Storage bucket and read it via Cloud Storage FUSE with `setup_gcsfuse.sh`.
- For example, we can mount the bucket `gs://maxtext-dataset` on the local path `/tmp/gcsfuse` before training:

```bash
bash setup_gcsfuse.sh DATASET_GCS_BUCKET=maxtext-dataset MOUNT_PATH=/tmp/gcsfuse
```
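A quick way to confirm the mount worked is to list a few files through the FUSE path. A minimal sketch, assuming the C4 ArrayRecord layout used below:

```python
# List ArrayRecord shards through the gcsfuse mount point.
import glob

files = glob.glob("/tmp/gcsfuse/array-record/c4/en/3.0.1/c4-train.array_record*")
print(len(files), "shards found;", files[:2])
```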
- After training, we unmount the local path:

```bash
fusermount -u /tmp/gcsfuse
```
This command shows pretraining with Grain pipeline, along with evaluation:
```bash
# replace DATASET_GCS_BUCKET and base_output_directory with your buckets
python3 -m maxtext.trainers.pre_train.train \
    base_output_directory=gs://runner-maxtext-logs run_name=demo \
    model_name=deepseek2-16b per_device_batch_size=1 steps=10 max_target_length=2048 enable_checkpointing=false \
    dataset_type=grain grain_file_type=arrayrecord grain_train_files=/tmp/gcsfuse/array-record/c4/en/3.0.1/c4-train.array_record* grain_worker_count=2 \
    eval_interval=5 eval_steps=10 grain_eval_files=/tmp/gcsfuse/array-record/c4/en/3.0.1/c4-validation.array_record* \
    tokenizer_type=huggingface tokenizer_path=deepseek-ai/DeepSeek-V2-Lite
```

Dataset config:
- `dataset_type`: `grain`
- `grain_file_type`: `arrayrecord`. We also support `parquet`.
- `grain_train_files`: `/tmp/gcsfuse/array-record/c4/en/3.0.1/c4-train.array_record*`, which is a file pattern (see the sketch below for reading these files directly).
- `grain_worker_count`: `2`. This parameter controls the number of child processes used by Grain and should be tuned for performance.
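For reference, the same ArrayRecord shards can be read directly with Grain. A minimal sketch, assuming Grain's `ArrayRecordDataSource` API (which may vary across Grain versions) and the mount path from above:

```python
# Read raw records from the mounted ArrayRecord shards.
# Requires: pip install grain
import glob
import grain.python as grain

paths = sorted(glob.glob("/tmp/gcsfuse/array-record/c4/en/3.0.1/c4-train.array_record*"))
source = grain.ArrayRecordDataSource(paths)
print(len(source))     # total number of records across all shards
print(source[0][:80])  # raw serialized bytes; MaxText handles parsing and tokenization
```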
Evaluation config (optional):
- `eval_interval=5 eval_steps=10`: after every 5 train steps, perform 10 evaluation steps.
- `grain_eval_files`: `/tmp/gcsfuse/array-record/c4/en/3.0.1/c4-validation.array_record*`, which is a file pattern.
Tokenizer config:
- The Grain pipeline supports `tokenizer_type`: `sentencepiece`, `huggingface`.
- Here we use the same `huggingface` tokenizer as in Section 1. If you use a HuggingFace tokenizer from a gated model, you will need to provide `hf_access_token`.
## 3. TFDS pipeline

The TensorFlow Datasets (TFDS) pipeline uses datasets in the TFRecord format. You can check the TFDS pipeline page for more details.
Data preparation: You need to download data to a Cloud Storage bucket, and the pipeline streams data from the bucket.
- To download the AllenAI C4 dataset to your bucket, you can use `download_dataset.sh`:

```bash
bash download_dataset.sh <GCS_PROJECT> <GCS_BUCKET_FOR_DATASET>
```
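Once the download finishes, you can confirm the shards are in place. A minimal sketch using TensorFlow's GFile API, assuming the bucket layout produced by the script:

```python
# Count the C4 train shards in your bucket.
# Requires: pip install tensorflow
import tensorflow as tf

files = tf.io.gfile.glob("gs://maxtext-dataset/c4/en/3.0.1/c4-train.tfrecord-*")
print(len(files), "train shards found")
```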
This command shows pretraining with TFDS pipeline, along with evaluation:
```bash
# replace base_output_directory and dataset_path with your buckets
python3 -m maxtext.trainers.pre_train.train \
    base_output_directory=gs://runner-maxtext-logs run_name=demo \
    model_name=deepseek2-16b per_device_batch_size=1 steps=10 max_target_length=2048 enable_checkpointing=false \
    dataset_type=tfds dataset_path=gs://maxtext-dataset dataset_name='c4/en:3.0.1' train_split=train \
    eval_interval=5 eval_steps=10 eval_dataset_name='c4/en:3.0.1' eval_split=validation \
    tokenizer_type=huggingface tokenizer_path=deepseek-ai/DeepSeek-V2-Lite
```

Dataset config:
- `dataset_type`: `tfds`
- `dataset_path`: the Cloud Storage bucket is `gs://maxtext-dataset`.
- `dataset_name`: `c4/en:3.0.1` corresponds to the subdirectory inside `dataset_path`: `gs://maxtext-dataset/c4/en/3.0.1`.
- `train_split`: `train`, corresponding to `*-train.tfrecord-*` files.
- Putting it together, we are training on files like `gs://maxtext-dataset/c4/en/3.0.1/c4-train.tfrecord-00000-of-01024`.
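You can also load the same dataset directly with TFDS to inspect a record. A minimal sketch, independent of MaxText:

```python
# Inspect one raw C4 example read from the bucket.
# Requires: pip install tensorflow tensorflow-datasets
import tensorflow_datasets as tfds

ds = tfds.load("c4/en:3.0.1", data_dir="gs://maxtext-dataset", split="train")
for example in ds.take(1):
    print(example["text"].numpy()[:80])  # each record has a "text" feature
```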
Evaluation config (optional):
- `eval_interval=5 eval_steps=10`: after every 5 train steps, perform 10 evaluation steps.
- `eval_dataset_name`: `c4/en:3.0.1` corresponds to the subdirectory inside `dataset_path`: `gs://maxtext-dataset/c4/en/3.0.1`. It can be different from `dataset_name`.
- `eval_split`: `validation`, corresponding to `*-validation.tfrecord-*` files.
- Putting it together, we are evaluating on files like `gs://maxtext-dataset/c4/en/3.0.1/c4-validation.tfrecord-00000-of-00008`.
Tokenizer config:
- The TFDS pipeline supports `tokenizer_type`: `sentencepiece`, `huggingface`, `tiktoken`.
- Here we use the same `huggingface` tokenizer as in Section 1. If you use a HuggingFace tokenizer from a gated model, you will need to provide `hf_access_token`.