This tutorial provides step-by-step instructions for setting up the environment and training the Llama3.1 70B-IT model on the GSM8K math reasoning dataset using [Pathways for orchestration](https://cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/pathways-intro) on multi-host TPU-VMs, such as `v5p-128`.
We utilize two RL algorithms, implemented via the Tunix library, to enhance the model's reasoning capabilities.
For efficient model inference and response generation during this process, we rely on the vLLM library.
Let's get started!
## Table of Contents

- [Create virtual environment and install MaxText dependencies](#create-virtual-environment-and-install-maxtext-dependencies)
- [Prerequisites](#prerequisites)
- [Setup Environment Variables](#setup-environment-variables)
- [Convert a Hugging Face Checkpoint to MaxText](#convert-a-hugging-face-checkpoint-to-maxtext)
- [Install Post-Training Dependencies](#install-post-training-dependencies)
  - [Option 1: Install stable releases of post-training dependencies](#option-1-install-stable-releases-of-post-training-dependencies)
  - [Option 2: Install from Git repositories of post-training dependencies](#option-2-install-from-git-repositories-of-post-training-dependencies)
  - [Upload the Docker Image](#upload-the-docker-image)
- [Submit the Workload via XPK](#submit-the-workload-via-xpk)
- [Troubleshooting](#troubleshooting)
## Create virtual environment and install MaxText dependencies

Follow the instructions in [Install MaxText](../../install_maxtext.md), but we recommend creating the virtual environment outside the `maxtext` directory.
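For example, a minimal setup might look like this (the environment name and location are illustrative):

```bash
# Create and activate a virtual environment outside the maxtext checkout;
# the name and location here are illustrative.
cd ~
python3 -m venv maxtext-env
source maxtext-env/bin/activate
```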
## Prerequisites

- A Pathways-ready GKE cluster (see [create GKE cluster](https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/create-gke-cluster)).
## Setup Environment Variables
Set up the following environment variables. Replace placeholders with your actual values.
```bash
# -- Model configuration --
export HF_TOKEN=<Hugging Face access token>

# -- MaxText configuration --
export BASE_OUTPUT_DIRECTORY=<output directory to store run logs> # e.g., gs://my-bucket/my-output-directory
export RUN_NAME=<Name for this run> # e.g., llama-3-70b-grpo
```

## Convert a Hugging Face Checkpoint to MaxText
You can convert a Hugging Face checkpoint to MaxText format using the `src/MaxText/utils/ckpt_conversion/to_maxtext.py` script. This is useful if you have a pre-trained model from Hugging Face that you want to use with MaxText.
First, ensure you have the necessary dependencies installed (PyTorch for the conversion script). Then, run the conversion script on a CPU machine. For large models, use the `--lazy_load_tensors` flag to reduce memory usage during conversion.
For example, converting a Llama3.1-70B model with `--lazy_load_tensors=true` uses around 200GB of RAM and completes in ~10 minutes. This command will download the Hugging Face model and convert it to the MaxText format, saving it to the specified GCS bucket.
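A sketch of the invocation follows. The script path and `--lazy_load_tensors` flag are named above; the remaining argument names are assumptions, so check the script's `--help` for the exact interface:

```bash
# Sketch only: argument names other than the script path and
# --lazy_load_tensors are assumptions; confirm them via --help.
# HF_TOKEN and BASE_OUTPUT_DIRECTORY come from the environment variables above.
python3 src/MaxText/utils/ckpt_conversion/to_maxtext.py \
  src/MaxText/configs/base.yml \
  model_name=llama3.1-70b \
  hf_access_token=${HF_TOKEN} \
  base_output_directory=${BASE_OUTPUT_DIRECTORY} \
  --lazy_load_tensors=true
```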
## Install Post-Training Dependencies

You can install the required dependencies using either of the following two options:
### Option 1: Install stable releases of post-training dependencies
> **Caution:** RL in MaxText is currently broken with stable releases of post-training dependencies. We are working on fixing this and recommend following [Option 2: Install from Git repositories of post-training dependencies](#option-2-install-from-git-repositories-of-post-training-dependencies) in the meantime.
Run the following script to create a Docker image with stable releases of MaxText, [Tunix](https://github.com/google/tunix), [vLLM](https://github.com/vllm-project/vllm), and [tpu-inference](https://github.com/vllm-project/tpu-inference) dependencies. This installs `vllm-tpu` which provides TPU inference for vLLM with unified JAX and PyTorch support. The build process takes approximately 10-15 minutes.
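The build command likely takes the form below; the `WORKFLOW=post-training` value is an assumption, mirroring the experimental variant shown next:

```bash
# WORKFLOW=post-training is an assumption, mirroring the
# post-training-experimental variant documented below.
bash dependencies/scripts/docker_build_dependency_image.sh WORKFLOW=post-training
```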
For experimental features (such as the improved pathwaysutils resharding API), use `bash dependencies/scripts/docker_build_dependency_image.sh WORKFLOW=post-training-experimental` instead.
### Option 2: Install from Git repositories of post-training dependencies
You can also locally clone the [tunix](https://github.com/google/tunix), [tpu-inference](https://github.com/vllm-project/tpu-inference), and [vllm](https://github.com/vllm-project/vllm) repositories and then build the Docker image from these local sources.
**Note:** Clone these repositories as siblings of the `maxtext` directory (e.g., in the same parent directory). After cloning, run the build from inside the `maxtext` repository so it picks up the local sources.
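A minimal sketch of that layout and build; the clone URLs come from the links above, while the `WORKFLOW` value and whether a dedicated flag selects local checkouts are assumptions to verify in the build script:

```bash
# Clone the three dependencies as siblings of the maxtext directory.
cd /path/to/parent-of-maxtext
git clone https://github.com/google/tunix.git
git clone https://github.com/vllm-project/tpu-inference.git
git clone https://github.com/vllm-project/vllm.git

# Build from inside the maxtext repository so it picks up the local sources.
# The WORKFLOW value, and whether an extra flag is needed to select local
# checkouts, are assumptions to verify in docker_build_dependency_image.sh.
cd maxtext
bash dependencies/scripts/docker_build_dependency_image.sh WORKFLOW=post-training
```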
### Upload the Docker Image
> **Note:** You will need the [**Artifact Registry Writer**](https://docs.cloud.google.com/artifact-registry/docs/access-control#permissions) role to push Docker images to your project's Artifact Registry. Contact your project administrator if you don't have this permission.
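A likely upload invocation, assuming the repository ships a `docker_upload_runner.sh` helper; the script path, argument name, and image name here are all assumptions to verify in your checkout:

```bash
# Script path, the CLOUD_IMAGE_NAME argument, and the image name itself are
# assumptions; verify them against the scripts in your checkout.
bash dependencies/scripts/docker_upload_runner.sh CLOUD_IMAGE_NAME=${USER}_maxtext_runner
```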
## Submit the Workload via XPK

Ensure you have a Pathways-ready GKE cluster (as mentioned in [Prerequisites](#prerequisites)) and submit the `train_rl.py` script via XPK. You can install XPK by following the instructions in the [official documentation](https://github.com/AI-Hypercomputer/xpk/blob/main/docs/installation.md).
See the **Troubleshooting** section for concise instructions on how to retry or resume a failed workload.
> **Note:** XPK v0.14.0+ automatically discovers your cluster's location from GCP. You don't need to specify `--zone` in the commands below. If using an older XPK version, add `--zone=<zone>` to the workload commands.
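A sketch of the submission; the `create-pathways` subcommand, the flag set, and the `train_rl.py` invocation are assumptions to adapt to your xpk version and repository layout:

```bash
# Sketch only: the subcommand, flags, and train_rl.py path are assumptions.
xpk workload create-pathways \
  --workload=${RUN_NAME} \
  --cluster=<your-cluster-name> \
  --docker-image=<artifact registry image from the previous step> \
  --tpu-type=v5p-128 \
  --num-slices=1 \
  --command="python3 src/MaxText/rl/train_rl.py <config and overrides>"
```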
## Troubleshooting

- **Tip**: Verify the checkpoint exists in GCS with read access before resuming; see the check below.
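A quick way to check, assuming the checkpoint path follows the output variables defined earlier (the exact layout is an assumption):

```bash
# Path layout under BASE_OUTPUT_DIRECTORY/RUN_NAME is an assumption;
# adjust to wherever your run writes checkpoints.
gsutil ls ${BASE_OUTPUT_DIRECTORY}/${RUN_NAME}/checkpoints/
```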
For more detailed troubleshooting, refer to the [MaxText documentation](https://maxtext.readthedocs.io) and [XPK documentation](https://github.com/AI-Hypercomputer/xpk).