diff --git a/docs/build_maxtext.md b/docs/build_maxtext.md new file mode 100644 index 0000000000..83cd058555 --- /dev/null +++ b/docs/build_maxtext.md @@ -0,0 +1,137 @@ + + +# Build and Upload MaxText Docker Images + +This guide covers setting up a MaxText development environment and building container images for TPU and GPU workloads. These images can be used to run MaxText on GKE clusters with TPUs or GPUs, and are also required for running MaxText through XPK. + +## Prerequisites + +Before starting, ensure you have the following tools installed and configured: + +1. Environment Prep: Install and configure all [XPK prerequisites](https://github.com/AI-Hypercomputer/xpk/blob/main/docs/installation.md#1-prerequisites). + +2. Docker Permissions: Follow the steps to [configure sudoless Docker](https://docs.docker.com/engine/install/linux-postinstall/) to run Docker without `sudo`. + +3. Artifact Registry Access: Authenticate with [Google Artifact Registry](https://docs.cloud.google.com/artifact-registry/docs/docker/authentication#gcloud-helper) for permission to push your images and other access. + +4. Authentication & Access: Run the following commands to authenticate your account and configure Docker: + +```bash +# Authenticate your user account for gcloud CLI access +gcloud auth login + +# Configure application default credentials for Docker and other tools +gcloud auth application-default login + +# Configure Docker credentials and test your access +gcloud auth configure-docker +docker run hello-world +``` + +## Installation Modes + +We recommend building MaxText inside a Python virtual environment using `uv` for speed and dependency management. + +### Option 1: From PyPI (Recommended) + +This is the easiest way to get started with the latest stable version. 
+
+```bash
+# Install uv, a fast Python package installer
+pip install uv
+
+# Create virtual environment
+export VENV_NAME= # e.g., docker_venv
+uv venv --python 3.12 --seed ${VENV_NAME?}
+source ${VENV_NAME?}/bin/activate
+
+# Install MaxText with the [runner] extra
+# This enables Docker image building and workload scheduling via XPK
+uv pip install maxtext[runner] --resolution=lowest
+```
+
+> **Note:** The `maxtext[runner]` extra includes all necessary dependencies for building MaxText Docker images and running workloads through XPK. It automatically installs XPK, so you do not need to install it separately to manage your clusters and workloads.
+
+### Option 2: From Source
+
+If you plan to contribute to MaxText or need the latest unreleased features, install from source.
+
+```bash
+# Clone the repository
+git clone https://github.com/AI-Hypercomputer/maxtext.git
+cd maxtext
+
+# Create virtual environment
+export VENV_NAME= # e.g., docker_venv
+uv venv --python 3.12 --seed ${VENV_NAME?}
+source ${VENV_NAME?}/bin/activate
+
+# Install MaxText with the [runner] extra in editable mode
+uv pip install -e .[runner] --resolution=lowest
+```
+
+> **Note:** The `maxtext[runner]` extra includes all necessary dependencies for building MaxText Docker images and running workloads through XPK. It automatically installs XPK, so you do not need to install it separately to manage your clusters and workloads.
+
+## Build MaxText Docker Image
+
+Select the appropriate build commands based on your hardware (`TPU` or `GPU`) and your specific workflow (`pre-training` or `post-training`). Each of these commands will generate a local Docker image named `maxtext_base_image`. 
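The snippets in this guide lean on bash's `${VAR?}` expansion (for example `${VENV_NAME?}` above and `${CLOUD_IMAGE_NAME?}` later). As a hedged aside, here is a minimal, self-contained sketch of what that syntax does; the variable names are only examples and are not special to MaxText:

```shell
# Minimal sketch of the ${VAR?} expansion used throughout these snippets.
# Referencing ${VENV_NAME?} aborts the command with an error when the
# variable is unset, instead of silently expanding to an empty string.
unset VENV_NAME

# The subshell fails because VENV_NAME is unset; failing fast here is the
# point: you find out before a long build runs with an empty name.
if ( : "${VENV_NAME?}" ) 2>/dev/null; then
  echo "VENV_NAME is set"
else
  echo "VENV_NAME is unset: the command was aborted"
fi

export VENV_NAME=docker_venv
echo "using ${VENV_NAME?}"  # expands normally once the variable is set
```

This is why the commands in this guide fail loudly, rather than proceeding with an empty name, if you forget one of the `export` lines.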
+ +### TPU Pre-Training Docker Image + +```bash +# Option 1: Build with the stable versions of dependencies (default) +build_maxtext_docker_image + +# Option 2: Build with latest nightly versions of jax/jaxlib +build_maxtext_docker_image MODE=nightly + +# Option 3: Build with the specified jax/jaxlib version +build_maxtext_docker_image MODE=nightly JAX_VERSION=$JAX_VERSION +``` + +### GPU Pre-Training Docker Image + +```bash +# Option 1: Build with the stable versions of dependencies (default) +build_maxtext_docker_image DEVICE=gpu + +# Option 2: Build with latest nightly versions of jax/jaxlib +build_maxtext_docker_image DEVICE=gpu MODE=nightly + +# Option 3: Build with base image as `ghcr.io/nvidia/jax:base-2024-12-04` +build_maxtext_docker_image DEVICE=gpu MODE=pinned + +# Option 4: Build with the specified jax/jaxlib version +build_maxtext_docker_image DEVICE=gpu MODE=nightly JAX_VERSION=$JAX_VERSION +``` + +### TPU Post-Training Docker Image + +```bash +# This build process takes approximately 10 to 15 minutes. +build_maxtext_docker_image WORKFLOW=post-training +``` + +## Upload MaxText Docker Image to Artifact Registry + +> **Note:** You will need the [**Artifact Registry Writer**](https://docs.cloud.google.com/artifact-registry/docs/access-control#permissions) role to push Docker images to your project's Artifact Registry and to allow the cluster to pull them during workload execution. If you don't have this permission, contact your project administrator to grant you this role through "Google Cloud Console -> IAM -> Grant access". + +```bash +# Make sure to replace with your desired image name. 
+export CLOUD_IMAGE_NAME= +upload_maxtext_docker_image CLOUD_IMAGE_NAME=${CLOUD_IMAGE_NAME?} +``` diff --git a/docs/guides/data_input_pipeline/data_input_grain.md b/docs/guides/data_input_pipeline/data_input_grain.md index 63c60482e8..a125cb2a13 100644 --- a/docs/guides/data_input_pipeline/data_input_grain.md +++ b/docs/guides/data_input_pipeline/data_input_grain.md @@ -34,7 +34,7 @@ Grain ensures determinism in data input pipelines by saving the pipeline's state 1. Grain currently supports two data formats: [ArrayRecord](https://github.com/google/array_record) (random access) and [Parquet](https://arrow.apache.org/docs/python/parquet.html) (partial random-access through row groups). Only the ArrayRecord format supports the global shuffle mentioned above. For converting a dataset into ArrayRecord, see [Apache Beam Integration for ArrayRecord](https://github.com/google/array_record/tree/main/beam). Additionally, other random access data sources can be supported via a custom [data source](https://google-grain.readthedocs.io/en/latest/data_sources.html) class. - **Community Resource**: The MaxText community has created a [ArrayRecord Documentation](https://array-record.readthedocs.io/). Note: we appreciate the contribution from the community, but as of now it has not been verified by the MaxText or ArrayRecord developers yet. -2. When the dataset is hosted on a Cloud Storage bucket, Grain can read it through [Cloud Storage FUSE](https://cloud.google.com/storage/docs/gcs-fuse). The installation of Cloud Storage FUSE is included in [setup.sh](https://github.com/google/maxtext/blob/main/tools/setup/setup.sh). The user then needs to mount the Cloud Storage bucket to a local path for each worker, using the script [setup_gcsfuse.sh](https://github.com/google/maxtext/blob/main/tools/setup/setup_gcsfuse.sh). The script configures some parameters for the mount. +2. 
When the dataset is hosted on a Cloud Storage bucket, Grain can read it through [Cloud Storage FUSE](https://cloud.google.com/storage/docs/gcs-fuse). The installation of Cloud Storage FUSE is included in [setup.sh](https://github.com/google/maxtext/blob/main/src/dependencies/scripts/setup.sh). The user then needs to mount the Cloud Storage bucket to a local path for each worker, using the script [setup_gcsfuse.sh](https://github.com/google/maxtext/blob/main/tools/setup/setup_gcsfuse.sh). The script configures some parameters for the mount. ```sh bash tools/setup/setup_gcsfuse.sh \ diff --git a/docs/index.md b/docs/index.md index 5755dab367..bc7a6ad611 100644 --- a/docs/index.md +++ b/docs/index.md @@ -17,7 +17,9 @@ # MaxText ```{raw} html -:file: index.html +--- +file: index.html +--- ``` :link: reference/api @@ -26,18 +28,22 @@
```{include} ../README.md -:start-after: -:end-before: +--- +start-after: +end-before: +--- ```
```{toctree}
-:maxdepth: 2
-:hidden:
-
+---
+maxdepth: 2
+hidden:
+---
 install_maxtext
+build_maxtext
 tutorials
 run_maxtext
 guides
diff --git a/docs/install_maxtext.md b/docs/install_maxtext.md index 52a6aea306..7f779db7a2 100644 --- a/docs/install_maxtext.md +++ b/docs/install_maxtext.md @@ -17,7 +17,7 @@ # Install MaxText

This document discusses how to install MaxText. We recommend installing MaxText inside a Python virtual environment.

-MaxText offers three installation modes:
+MaxText offers the following installation modes:

1. maxtext[tpu]. Used for pre-training and decode on TPUs.
2. maxtext[cuda12]. Used for pre-training and decode on GPUs.
@@ -37,18 +37,18 @@ uv venv --python 3.12 --seed maxtext_venv source maxtext_venv/bin/activate

# 3. Install MaxText and its dependencies. Choose a single
-# installation option from this list to fit your use case. 
+# installation option from this list to fit your use case.

# Option 1: Installing maxtext[tpu]
-uv pip install "maxtext[tpu]>=0.2.0" --resolution=lowest
+uv pip install maxtext[tpu] --resolution=lowest
install_maxtext_tpu_github_deps

# Option 2: Installing maxtext[cuda12]
-uv pip install "maxtext[cuda12]>=0.2.0" --resolution=lowest
+uv pip install maxtext[cuda12] --resolution=lowest
install_maxtext_cuda12_github_dep

# Option 3: Installing maxtext[tpu-post-train]
-uv pip install "maxtext[tpu-post-train]>=0.2.0" --resolution=lowest
+uv pip install maxtext[tpu-post-train] --resolution=lowest
install_maxtext_tpu_post_train_extra_deps

# Option 4: Installing maxtext[runner]
@@ -91,7 +91,7 @@ uv pip install -e .[tpu-post-train] --resolution=lowest install_maxtext_tpu_post_train_extra_deps

# Option 4: Installing maxtext[runner]
-uv pip install .[runner] --resolution=lowest
+uv pip install -e .[runner] --resolution=lowest
```

After installation, you can verify the package is available with `python3 -c "import maxtext"` and run training jobs with `python3 -m maxtext.trainers.pre_train.train ...`. 
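The verification step mentioned above can be wrapped in a small shell helper. This is an illustrative sketch only (the `verify_import` helper is not part of MaxText); it just wraps the documented `python3 -c "import maxtext"` check into a readable pass/fail line:

```shell
# Illustrative helper (not part of MaxText) around the verification
# command above: try to import a module and report the result.
verify_import() {
  if python3 -c "import $1" 2>/dev/null; then
    echo "$1: importable"
  else
    echo "$1: NOT importable - check that your virtual environment is active"
  fi
}

# Example with a stdlib module; in a MaxText environment you would run
# `verify_import maxtext` instead.
verify_import json
```

If `verify_import maxtext` fails in a freshly activated venv, the usual cause is that the install step above was run outside that venv.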
@@ -176,22 +176,6 @@ After generating the new requirements, you need to update the files in the MaxTe Finally, test that the new dependencies install correctly and that MaxText runs as expected. -1. **Create a clean environment:** It's best to start with a fresh Python virtual environment. - -```bash -uv venv --python 3.12 --seed maxtext_venv -source maxtext_venv/bin/activate -``` - -2. **Run the setup script:** Execute `bash setup.sh` to install the new dependencies. - -```bash -pip install uv -# install the tpu package -uv pip install -e .[tpu] --resolution=lowest -# or install the gpu package by running the following line: -# uv pip install -e .[cuda12] --resolution=lowest -install_maxtext_github_deps -``` +1. **Install MaxText and dependencies**: For instructions on installing MaxText on your VM, please refer to the [official documentation](https://maxtext.readthedocs.io/en/maxtext-v0.2.0/install_maxtext.html#from-source). -3. **Run tests:** Run MaxText tests to ensure there are no regressions. +2. **Verify the installation**: Run MaxText tests to ensure everything is working as expected with the newly installed dependencies and there are no regressions. diff --git a/docs/run_maxtext/run_maxtext_localhost.md b/docs/run_maxtext/run_maxtext_localhost.md index 4695d5237f..843c52a5f3 100644 --- a/docs/run_maxtext/run_maxtext_localhost.md +++ b/docs/run_maxtext/run_maxtext_localhost.md @@ -36,22 +36,7 @@ Local development on a single host TPU/GPU VM is a convenient way to run MaxText 1. Create and SSH to the single host VM of your choice. You can use any available single host TPU, such as `v5litepod-8`, `v5p-8`, or `v4-8`. For GPUs, you can use `nvidia-h100-mega-80gb`, `nvidia-h200-141gb`, or `nvidia-b200`. For setting up a TPU VM, use the Cloud TPU documentation available at https://cloud.google.com/tpu/docs/managing-tpus-tpu-vm. For a GPU setup, refer to the guide at https://cloud.google.com/compute/docs/gpus/create-vm-with-gpus. -2. Clone MaxText onto that VM. 
- - ```bash - git clone https://github.com/google/maxtext.git - cd maxtext - ``` - -3. Once you have cloned the repository, you have two primary options for setting up the necessary dependencies on your VM: Installing in a Python Environment, or building a Docker container. For single host workloads, we recommend to install dependencies in a python environment, and for multihost workloads we recommend the containerized approach. - -Within the root directory of the cloned repo, create a virtual environment and install dependencies and the pre-commit hook by running: - -```bash -python3.12 -m venv ~/venv-maxtext -source ~/venv-maxtext/bin/activate -bash tools/setup/setup.sh DEVICE={tpu|gpu} -``` +2. For instructions on installing MaxText on your VM, please refer to the [official documentation](https://maxtext.readthedocs.io/en/latest/install_maxtext.html). #### Run a Test Training Job diff --git a/docs/run_maxtext/run_maxtext_single_host_gpu.md b/docs/run_maxtext/run_maxtext_single_host_gpu.md index 94204cd428..2b0daaffb3 100644 --- a/docs/run_maxtext/run_maxtext_single_host_gpu.md +++ b/docs/run_maxtext/run_maxtext_single_host_gpu.md @@ -60,39 +60,9 @@ If you get the NVML Error: Please follow these instructions. https://stackoverflow.com/questions/72932940/failed-to-initialize-nvml-unknown-error-in-docker-after-few-hours -## Install MaxText - -Clone MaxText: - -```bash -git clone https://github.com/AI-Hypercomputer/maxtext.git -``` - ## Build MaxText Docker image -This builds a docker image called `maxtext_base_image`. You can retag to a different name. - -1. Check out the code changes: - -```bash -cd maxtext -``` - -2. 
Run the following commands to build and push the docker image: - -```bash -export LOCAL_IMAGE_NAME= -sudo bash docker_build_dependency_image.sh DEVICE=gpu -docker tag maxtext_base_image ${LOCAL_IMAGE_NAME?} -docker push ${LOCAL_IMAGE_NAME?} -``` - -Note that when running `bash docker_build_dependency_image.sh DEVICE=gpu`, it -uses `MODE=stable` by default. If you want to use other modes, you need to -specify it explicitly: - -- using nightly mode: `bash docker_build_dependency_image.sh DEVICE=gpu MODE=nightly` -- using pinned mode: `bash docker_build_dependency_image.sh DEVICE=gpu MODE=pinned` +For instructions on building the MaxText Docker image, please refer to the [official documentation](https://maxtext.readthedocs.io/en/latest/build_maxtext.html). ## Test diff --git a/docs/run_maxtext/run_maxtext_via_pathways.md b/docs/run_maxtext/run_maxtext_via_pathways.md index 9e954e5c8e..2e97c1f4d6 100644 --- a/docs/run_maxtext/run_maxtext_via_pathways.md +++ b/docs/run_maxtext/run_maxtext_via_pathways.md @@ -35,27 +35,7 @@ Before you can run a MaxText workload, you must complete the following setup ste 2. **Create a GKE cluster** configured for Pathways. -3. **Build and upload a MaxText Docker image** to your project's Artifact Registry. - - [Follow the steps to configure sudoless Docker](https://docs.docker.com/engine/install/linux-postinstall/) before running the commands below. - - Step 1: Build the Docker image for a TPU device. This image contains MaxText and its dependencies. - - ```shell - bash src/dependencies/scripts/docker_build_dependency_image.sh DEVICE=tpu MODE=stable - ``` - - Step 2: Configure Docker to authenticate with Google Cloud - - ```shell - gcloud auth configure-docker - ``` - - Step 3: Upload the image to your project's registry. Replace `$USER_runner` with your desired image name. - - ```shell - bash src/dependencies/scripts/docker_upload_runner.sh CLOUD_IMAGE_NAME=$USER_runner - ``` +3. 
**Build and upload a MaxText Docker image** to your project's Artifact Registry. For instructions, please refer to the [official documentation](https://maxtext.readthedocs.io/en/latest/build_maxtext.html).

## 2. Environment configuration

@@ -76,7 +56,7 @@ export WORKLOAD_NODEPOOL_COUNT=1 # Number of TPU slices for your job export BUCKET_NAME="your-gcs-bucket-name" export RUN_NAME="maxtext-run-1"

# The Docker image you pushed in the prerequisite step
-export DOCKER_IMAGE="gcr.io/${PROJECT?}/${USER}_runner"
+export DOCKER_IMAGE="gcr.io/${PROJECT?}/${CLOUD_IMAGE_NAME?}"
```

## 3. Running a batch workload

diff --git a/docs/run_maxtext/run_maxtext_via_xpk.md b/docs/run_maxtext/run_maxtext_via_xpk.md index 8d142ef9dc..760eea4b03 100644 --- a/docs/run_maxtext/run_maxtext_via_xpk.md +++ b/docs/run_maxtext/run_maxtext_via_xpk.md @@ -99,53 +99,13 @@ These commands configure your local environment to connect to Google Cloud servi

______________________________________________________________________

-## 3. Install XPK
+## 3. Build the MaxText Docker image

-It is best practice to install XPK in a dedicated Python virtual environment.
-
-```
-# Create a virtual environment (only needs to be done once)
-python3 -m venv ~/xpk_venv
-
-# Activate the virtual environment (do this every time you open a new terminal)
-source ~/xpk_venv/bin/activate
-
-# Install XPK
-pip install xpk
-```
-
-______________________________________________________________________
-
-## 4. Build the MaxText Docker image
-
-```{note}
-Ensure Docker is configured for sudoless use before running the build script. Follow the steps to [configure sudoless Docker](https://docs.docker.com/engine/install/linux-postinstall/).
-```
-
-1. **Clone the MaxText repository**
-
-   ```
-   git clone https://github.com/google/maxtext.git
-   cd maxtext
-   ```
-
-2. 
**Build the image for your target hardware (TPU or GPU)** This script creates a local Docker image named `maxtext_base_image`. - - - **For TPUs:** - - ``` - bash src/dependencies/scripts/docker_build_dependency_image.sh DEVICE=tpu MODE=stable - ``` - - - **For GPUs:** - - ``` - bash src/dependencies/scripts/docker_build_dependency_image.sh DEVICE=gpu MODE=stable - ``` +For instructions on building the MaxText Docker image, please refer to the [official documentation](https://maxtext.readthedocs.io/en/latest/build_maxtext.html). ______________________________________________________________________ -## 5. Run your first MaxText job +## 4. Run your first MaxText job This section assumes you have an existing GKE cluster with either TPU or GPU nodes. @@ -204,7 +164,7 @@ For instance, to run a job across **four TPU slices**, you would change `--num-s ______________________________________________________________________ -## 6. Managing and monitoring your job +## 5. Managing and monitoring your job - **View logs in real-time:** The easiest way to see the output of your training job is through the Google Cloud Console. diff --git a/docs/tutorials/first_run.md b/docs/tutorials/first_run.md index ae0bae76d2..3b7468129b 100644 --- a/docs/tutorials/first_run.md +++ b/docs/tutorials/first_run.md @@ -36,17 +36,8 @@ Local development is a convenient way to run MaxText on a single host. It doesn' multiple hosts but is a good way to learn about MaxText. 1. [Create and SSH to the single host VM of your choice](https://cloud.google.com/tpu/docs/managing-tpus-tpu-vm). You can use any available single host TPU, such as `v5litepod-8`, `v5p-8`, or `v4-8`. -2. Clone MaxText onto that TPU VM. -3. Within the root directory of the cloned repo, install dependencies and pre-commit hook by running: - -```sh -python3 -m venv ~/venv-maxtext -source ~/venv-maxtext/bin/activate -bash tools/setup/setup.sh -pre-commit install -``` - -4. 
After installation completes, run training on synthetic data with the following command: +2. For instructions on installing MaxText on your VM, please refer to the [official documentation](https://maxtext.readthedocs.io/en/latest/install_maxtext.html). +3. After installation completes, run training on synthetic data with the following command: ```sh python3 -m maxtext.trainers.pre_train.train \ @@ -58,7 +49,7 @@ python3 -m maxtext.trainers.pre_train.train \ Optional: If you want to try training on a Hugging Face dataset, see [Data Input Pipeline](../guides/data_input_pipeline.md) for data input options. -5. To demonstrate model output, run the following command: +4. To demonstrate model output, run the following command: ```sh python3 -m maxtext.inference.decode \ @@ -79,7 +70,7 @@ You can use [demo_decoding.ipynb](https://github.com/AI-Hypercomputer/maxtext/bl ### Run MaxText on NVIDIA GPUs -1. Use `bash src/dependencies/scripts/docker_build_dependency_image.sh DEVICE=gpu` to build a container with the required dependencies. +1. For instructions on installing MaxText on your VM, please refer to the [official documentation](https://maxtext.readthedocs.io/en/latest/install_maxtext.html). 2. After installation is complete, run training with the following command on synthetic data: ```sh diff --git a/docs/tutorials/posttraining/rl_on_multi_host.md b/docs/tutorials/posttraining/rl_on_multi_host.md index 39caa61c39..d1c20a68b2 100644 --- a/docs/tutorials/posttraining/rl_on_multi_host.md +++ b/docs/tutorials/posttraining/rl_on_multi_host.md @@ -44,9 +44,9 @@ rely on the vLLM library. 
## Table of Contents

- [Prerequisites](#prerequisites)
+- [Build and Upload MaxText Docker Image](#build-and-upload-maxtext-docker-image)
- [Setup Environment Variables](#setup-environment-variables)
- [Get Your Model Checkpoint](#get-your-model-checkpoint)
-- [Build and Upload MaxText Docker Image](#build-and-upload-maxtext-docker-image-with-post-training-dependencies)
- [Submit your RL workload via Pathways](#submit-your-rl-workload-via-pathways)
- [Managing Workloads](#managing-workloads)
- [Troubleshooting](#troubleshooting)
@@ -58,10 +58,14 @@ Before starting, ensure you have:

- Access to a Google Cloud Project with TPU quotas.
- A Hugging Face account with an access token for downloading models.
- Permissions for Google Artifact Registry (Artifact Registry Writer role).
-- XPK installed (follow [official documentation](https://github.com/AI-Hypercomputer/xpk/blob/main/docs/installation.md)).
+- XPK prerequisites installed (follow [official documentation](https://github.com/AI-Hypercomputer/xpk/blob/main/docs/installation.md#1-prerequisites)).
- A Pathways-ready GKE cluster (see [create GKE cluster](https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/create-gke-cluster)).
- **Docker** installed and configured for sudoless use. Follow the steps to [configure sudoless Docker](https://docs.docker.com/engine/install/linux-postinstall/).

+## Build and upload MaxText Docker image
+
+For instructions on building and uploading the MaxText Docker image with post-training dependencies, please refer to the [official documentation](https://maxtext.readthedocs.io/en/latest/build_maxtext.html).
+
## Setup Environment Variables

Set up the following environment variables. 
Replace placeholders with your @@ -82,7 +86,6 @@ export TPU_TYPE= # e.g., 'v5p-128' export TPU_CLUSTER= export PROJECT_ID= export ZONE= -export CLOUD_IMAGE_NAME= # Name for the Docker image to be built ``` ## Get Your Model Checkpoint @@ -104,77 +107,6 @@ Refer the steps in [Hugging Face to MaxText](../../guides/checkpointing_solution export MAXTEXT_CKPT_PATH= # e.g., gs://my-bucket/my-model-checkpoint/0/items ``` -## Build and upload MaxText Docker image with post-training dependencies - -Before building the Docker image, follow the steps to [configure sudoless Docker](https://docs.docker.com/engine/install/linux-postinstall/). - -Then, authenticate to -[Google Artifact Registry](https://docs.cloud.google.com/artifact-registry/docs/docker/authentication#gcloud-helper) -for permission to push your images and other access. - -```bash -# Authenticate your user account for gcloud CLI access -gcloud auth login - -# Configure application default credentials for Docker and other tools -gcloud auth application-default login - -# Configure Docker credentials and test your access -gcloud auth configure-docker -docker run hello-world -``` - -### Option 1: From PyPI releases (Recommended) - -Get the latest stable release of MaxText from PyPI. This will automatically pull -compatible versions of post-training dependencies, such as [Tunix](https://github.com/google/tunix), -[vLLM](https://github.com/vllm-project/vllm), and -[tpu-inference](https://github.com/vllm-project/tpu-inference). - -```bash -git clone https://github.com/AI-Hypercomputer/maxtext.git -cd maxtext - -# checkout the latest stable release here: https://pypi.org/project/maxtext/ -export MAXTEXT_VERSION=0.2.0 -git checkout maxtext-v${MAXTEXT_VERSION?} -``` - -Run the following script to create a Docker image with stable releases of -MaxText, and its post-training dependencies. The build process takes approximately 10-15 minutes. 
- -```bash -bash src/dependencies/scripts/docker_build_dependency_image.sh WORKFLOW=post-training -``` - -For experimental features (such as improved pathwaysutils resharding API), use: - -```bash -bash src/dependencies/scripts/docker_build_dependency_image.sh WORKFLOW=post-training-experimental -``` - -### Option 2: From Github - -For using a version newer than the latest PyPI release, you could also build the Docker image with the latest vetted versions of post-training dependencies and MaxText in the following way: - -```bash -git clone https://github.com/AI-Hypercomputer/maxtext.git -cd maxtext - -bash src/dependencies/scripts/docker_build_dependency_image.sh WORKFLOW=post-training -``` - -### Upload the Docker Image - -> **Note:** You will need the -> [**Artifact Registry Writer**](https://docs.cloud.google.com/artifact-registry/docs/access-control#permissions) -> role to push Docker images to your project's Artifact Registry. Contact your -> project administrator if you don't have this permission. - -```bash -bash src/dependencies/scripts/docker_upload_runner.sh CLOUD_IMAGE_NAME=${CLOUD_IMAGE_NAME?} -``` - ## Submit your RL workload via Pathways See the **Troubleshooting** section for concise instructions on how to retry or diff --git a/docs/tutorials/posttraining/sft_on_multi_host.md b/docs/tutorials/posttraining/sft_on_multi_host.md index 243ae56127..cc5c63b2ac 100644 --- a/docs/tutorials/posttraining/sft_on_multi_host.md +++ b/docs/tutorials/posttraining/sft_on_multi_host.md @@ -24,57 +24,26 @@ We use [Tunix](https://github.com/google/tunix), a JAX-based library designed fo Let's get started! -## 1. Build and upload MaxText Docker image +## Prerequisites -This section guides you through cloning the MaxText repository, building MaxText Docker image with dependencies, and uploading the docker image to your project's Artifact Registry. +Before starting, ensure you have: -### 1.1. 
Clone the MaxText repository
+- Access to a Google Cloud Project with TPU quotas.
+- A Hugging Face account with an access token for downloading models.
+- Permissions for Google Artifact Registry (Artifact Registry Writer role).
+- XPK prerequisites installed (follow [official documentation](https://github.com/AI-Hypercomputer/xpk/blob/main/docs/installation.md#1-prerequisites)).
+- A Pathways-ready GKE cluster (see [create GKE cluster](https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/create-gke-cluster)).
+- **Docker** installed and configured for sudoless use. Follow the steps to [configure sudoless Docker](https://docs.docker.com/engine/install/linux-postinstall/).

-```bash
-git clone https://github.com/google/maxtext.git
-cd maxtext
-```
-
-### 1.2. Build MaxText Docker image
-
-Before building the Docker image, follow the steps to [configure sudoless Docker](https://docs.docker.com/engine/install/linux-postinstall/). Then, authenticate to [Google Artifact Registry](https://docs.cloud.google.com/artifact-registry/docs/docker/authentication#gcloud-helper) for permission to push your images and other access.
-
-```bash
-# Authenticate your user account for gcloud CLI access
-gcloud auth login
-# Configure application default credentials for Docker and other tools
-gcloud auth application-default login
-# Configure Docker credentials and test your access
-gcloud auth configure-docker
-docker run hello-world
-```
-
-Then run the following command to create a local Docker image named `maxtext_base_image`. This build process takes approximately 10 to 15 minutes.
-
-```bash
-bash src/dependencies/scripts/docker_build_dependency_image.sh WORKFLOW=post-training
-```
-
-### 1.3. 
Upload the Docker image to Artifact Registry - -> **Note:** You will need the [**Artifact Registry Writer**](https://docs.cloud.google.com/artifact-registry/docs/access-control#permissions) role to push Docker images to your project's Artifact Registry and to allow the cluster to pull them during workload execution. If you don't have this permission, contact your project administrator to grant you this role through "Google Cloud Console -> IAM -> Grant access". - -```bash -export DOCKER_IMAGE_NAME= -bash src/dependencies/scripts/docker_upload_runner.sh CLOUD_IMAGE_NAME=${DOCKER_IMAGE_NAME?} -``` - -The `docker_upload_runner.sh` script uploads your Docker image to Artifact Registry. - -## 2. Install XPK +## Build and upload MaxText Docker image -Install XPK by following the instructions in the [official documentation](https://github.com/AI-Hypercomputer/xpk/blob/main/docs/installation.md). +For instructions on building and uploading the MaxText Docker image with post-training dependencies, please refer to the [official documentation](https://maxtext.readthedocs.io/en/latest/build_maxtext.html). -## 3. Create GKE cluster +## Create GKE cluster Use a pathways ready GKE cluster as described [here](https://docs.cloud.google.com/ai-hypercomputer/docs/workloads/pathways-on-cloud/create-gke-cluster). -## 4. Environment configuration +## Environment configuration ```bash # -- Google Cloud Configuration -- @@ -86,7 +55,6 @@ export ZONE= export WORKLOAD_NAME= # e.g., sft-$(date +%s) export TPU_TYPE= # e.g., v6e-256 export TPU_SLICE= -export DOCKER_IMAGE="gcr.io/${PROJECT?}/${DOCKER_IMAGE_NAME?}" # -- MaxText Configuration -- export OUTPUT_PATH= # e.g., gs://my-bucket/my-output-directory @@ -102,7 +70,7 @@ export TRAIN_SPLIT= # e.g., train_sft export TRAIN_DATA_COLUMNS= # e.g., ['messages'] ``` -## 5. Get MaxText model checkpoint +## Get MaxText model checkpoint This section explains how to prepare your model checkpoint for use with MaxText. 
You have two options: using an existing MaxText checkpoint or converting a Hugging Face checkpoint. @@ -130,18 +98,18 @@ Refer the steps in [Hugging Face to MaxText](../../guides/checkpointing_solution export MODEL_CHECKPOINT_PATH= # gs://my-bucket/my-checkpoint-directory/0/items ``` -## 6. Submit workload on GKE cluster +## Submit workload on GKE cluster This section provides the command to run SFT on a GKE cluster. -### 6.1. SFT with Multi-Controller JAX (McJAX) +### SFT with Multi-Controller JAX (McJAX) ```bash xpk workload create \ --cluster=${CLUSTER_NAME?} \ --project=${PROJECT?} \ --zone=${ZONE?} \ ---docker-image=${DOCKER_IMAGE?} \ +--docker-image=gcr.io/${PROJECT_ID?}/${CLOUD_IMAGE_NAME?} \ --workload=${WORKLOAD_NAME?} \ --tpu-type=${TPU_TYPE?} \ --num-slices=${TPU_SLICE?} \ @@ -150,7 +118,7 @@ xpk workload create \ Once the fine-tuning is completed, you can access your model checkpoints at `$OUTPUT_PATH/$WORKLOAD_NAME/checkpoints`. -### 6.2. SFT with Pathways +### SFT with Pathways ```bash export USE_PATHWAYS=1 @@ -159,7 +127,7 @@ xpk workload create-pathways \ --cluster=${CLUSTER_NAME?} \ --project=${PROJECT?} \ --zone=${ZONE?} \ ---docker-image=${DOCKER_IMAGE?} \ +--docker-image=gcr.io/${PROJECT_ID?}/${CLOUD_IMAGE_NAME?} \ --workload=${WORKLOAD_NAME?} \ --tpu-type=${TPU_TYPE?} \ --num-slices=${TPU_SLICE?} \ diff --git a/src/dependencies/scripts/docker_build_dependency_image.sh b/src/dependencies/scripts/docker_build_dependency_image.sh index c9c97c5614..3705334014 100644 --- a/src/dependencies/scripts/docker_build_dependency_image.sh +++ b/src/dependencies/scripts/docker_build_dependency_image.sh @@ -18,36 +18,7 @@ # different environments (stable, nightly) and use cases (pre-training, post-training). # IMPORTANT: This script must be executed from the root directory of the MaxText repository. 
-# ==================================
-# PRE-TRAINING BUILD EXAMPLES
-# ==================================
-
-# Build docker image with stable dependencies
-## bash src/dependencies/scripts/docker_build_dependency_image.sh DEVICE={{gpu|tpu}} MODE=stable
-
-# Build docker image with nightly dependencies
-## bash src/dependencies/scripts/docker_build_dependency_image.sh DEVICE={{gpu|tpu}} MODE=nightly
-
-# Build docker image with stable dependencies and, a pinned JAX_VERSION for TPUs
-## bash src/dependencies/scripts/docker_build_dependency_image.sh MODE=stable JAX_VERSION=0.4.13
-
-# Build docker image with a pinned JAX_VERSION and, a pinned LIBTPU_VERSION for TPUs
-## bash src/dependencies/scripts/docker_build_dependency_image.sh MODE={{stable|nightly}} JAX_VERSION=0.8.1 LIBTPU_VERSION=0.0.31.dev20251119+nightly
-
-# Build docker image with a custom libtpu.so for TPUs
-# Note: libtpu.so file must be present in the root directory of the MaxText repository
-## bash src/dependencies/scripts/docker_build_dependency_image.sh MODE={{stable|nightly}}
-
-# Build docker image with nightly dependencies and, a pinned JAX_VERSION for GPUs
-# Available versions listed at https://us-python.pkg.dev/ml-oss-artifacts-published/jax-public-nightly-artifacts-registry/simple/jax
-## bash src/dependencies/scripts/docker_build_dependency_image.sh DEVICE=gpu MODE=nightly JAX_VERSION=0.4.36.dev20241109
-
-# ==================================
-# POST-TRAINING BUILD EXAMPLES
-# ==================================
-
-# Build docker image with post-training dependencies
-## bash src/dependencies/scripts/docker_build_dependency_image.sh WORKFLOW=post-training
+# For instructions on building the MaxText Docker image, please refer to https://maxtext.readthedocs.io/en/latest/build_maxtext.html.
 PACKAGE_DIR="${PACKAGE_DIR:-src}"
 echo "PACKAGE_DIR: $PACKAGE_DIR"
@@ -153,4 +124,4 @@ echo "docker run -v $(pwd):/deps --rm -it --privileged --entrypoint bash ${LOCAL
 echo ""
 echo "You can run MaxText and your development tests inside of the docker image. Changes to your workspace will automatically be reflected inside the docker container."
-echo "Once you want you upload your docker container to GCR, take a look at docker_upload_runner.sh"
+echo "When you are ready to upload your docker container to GCR, run 'upload_maxtext_docker_image CLOUD_IMAGE_NAME=your_image_name'."
diff --git a/src/dependencies/scripts/docker_upload_runner.sh b/src/dependencies/scripts/docker_upload_runner.sh
index 81091d8648..ce4efb9f52 100644
--- a/src/dependencies/scripts/docker_upload_runner.sh
+++ b/src/dependencies/scripts/docker_upload_runner.sh
@@ -17,11 +17,10 @@
 # This script takes a docker image that already contains the MaxText dependencies, copies the local source code in and
 # uploads that image into GCR. Once in GCR the docker image can be used for development.
-# Each time you update the base image via a "bash docker_build_dependency_image.sh", there will be a slow upload process
-# (minutes). However, if you are simply changing local code and not updating dependencies, uploading just takes a few seconds.
+# For instructions on building and uploading the MaxText Docker image, please refer to https://maxtext.readthedocs.io/en/latest/build_maxtext.html.
-# Example command:
-# bash src/dependencies/scripts/docker_upload_runner.sh CLOUD_IMAGE_NAME=${USER}_runner
+# Each time you update the `maxtext_base_image` via `build_maxtext_docker_image`, there will be a slow upload process (minutes).
+# However, if you are simply changing local code and not updating dependencies, uploading just takes a few seconds.
 PACKAGE_DIR="${PACKAGE_DIR:-src}"
 echo "PACKAGE_DIR: $PACKAGE_DIR"
diff --git a/src/maxtext/examples/sft_train_and_evaluate.py b/src/maxtext/examples/sft_train_and_evaluate.py
index 029433113b..25efac29ff 100644
--- a/src/maxtext/examples/sft_train_and_evaluate.py
+++ b/src/maxtext/examples/sft_train_and_evaluate.py
@@ -21,19 +21,8 @@
 ## Example command to run on single-host TPU:
 ```
-
-# Create a virtual environment
-export VENV_NAME= # e.g., maxtext_venv
-pip install uv
-uv venv --python 3.12 --seed ${VENV_NAME?}
-source ${VENV_NAME?}/bin/activate
-
-# Run the following commands to get all the necessary installations.
-
-uv pip install "maxtext[tpu-post-train]>=0.2.0" --resolution=lowest
-install_maxtext_tpu_post_train_extra_deps
-
-
+# For instructions on installing MaxText with post-training dependencies,
+# please refer to the documentation at https://maxtext.readthedocs.io/en/latest/install_maxtext.html.
 # Environment configurations
 export RUN_NAME=$(date +%Y-%m-%d-%H-%M-%S)
@@ -51,17 +40,15 @@
 ## Example command to run on multi-host TPUs using McJAX:
 ```
-# Build & upload docker image
-export DOCKER_IMAGE_NAME=${USER}_runner
-bash docker_build_dependency_image.sh MODE=post-training && \
-  bash docker_upload_runner.sh CLOUD_IMAGE_NAME=${DOCKER_IMAGE_NAME?}
+# For instructions on building and uploading the MaxText Docker image with post-training dependencies,
+# please refer to the documentation at https://maxtext.readthedocs.io/en/latest/build_maxtext.html.
# Environment configurations export PROJECT= export CLUSTER_NAME= export ZONE= export TPU_TYPE= -export DOCKER_IMAGE="gcr.io/${PROJECT?}/${DOCKER_IMAGE_NAME?}" +export DOCKER_IMAGE="gcr.io/${PROJECT?}/${CLOUD_IMAGE_NAME?}" export RUN_NAME=$(date +%Y-%m-%d-%H-%M-%S) export OUTPUT_PATH= export MODEL_NAME=llama3.1-8b diff --git a/src/maxtext/inference/mlperf/README.md b/src/maxtext/inference/mlperf/README.md index 0d553c852a..78776914e7 100644 --- a/src/maxtext/inference/mlperf/README.md +++ b/src/maxtext/inference/mlperf/README.md @@ -59,11 +59,11 @@ mv 09292024_mixtral_15k_mintoken2_v1.pkl mixtral-processed-data.pkl ``` ### Install Maxtext +For instructions on installing MaxText on your VM, please refer to the [official documentation](https://maxtext.readthedocs.io/en/latest/install_maxtext.html). Then, run the following commands to install the additional dependencies: ``` cd ~ git clone https://github.com/AI-Hypercomputer/maxtext.git cd maxtext -bash setup.sh python3 -m pip install -r src/maxtext/inference/mlperf/requirements.txt ```
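A note on the shell commands throughout this change: environment variables are referenced as `${VAR?}` rather than `$VAR`. This parameter expansion makes the shell abort with an error when the variable is unset, so a forgotten `export` fails fast instead of silently expanding to an empty string. A standalone sketch of that behavior (the project and image names below are hypothetical, not part of this change):

```shell
# ${VAR?} fails the enclosing (sub)shell when VAR is unset.
unset CLOUD_IMAGE_NAME

# The subshell aborts before echo runs; the error message goes to stderr.
( echo "image: gcr.io/my-project/${CLOUD_IMAGE_NAME?}" ) 2>/dev/null \
  && echo "expanded" \
  || echo "aborted: CLOUD_IMAGE_NAME is unset"

# Once the variable is exported, the expansion succeeds as usual.
export CLOUD_IMAGE_NAME=my_runner
echo "image: gcr.io/my-project/${CLOUD_IMAGE_NAME?}"
```

This is why the `xpk workload create` examples fail immediately, rather than submitting a broken workload, if any of the required exports in the environment-configuration step were skipped.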