137 changes: 137 additions & 0 deletions docs/build_maxtext.md
@@ -0,0 +1,137 @@
<!--
Copyright 2023-2026 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# Build and Upload MaxText Docker Images

This guide covers setting up a MaxText development environment and building container images for TPU and GPU workloads. These images can be used to run MaxText on GKE clusters with TPUs or GPUs, and are also required for running MaxText through XPK.

## Prerequisites

Before starting, ensure you have the following tools installed and configured:

1. Environment Prep: Install and configure all [XPK prerequisites](https://github.com/AI-Hypercomputer/xpk/blob/main/docs/installation.md#1-prerequisites).

2. Docker Permissions: Follow the steps to [configure sudoless Docker](https://docs.docker.com/engine/install/linux-postinstall/) to run Docker without `sudo`.

3. Artifact Registry Access: Authenticate with [Google Artifact Registry](https://docs.cloud.google.com/artifact-registry/docs/docker/authentication#gcloud-helper) so that you have permission to push images to your project's registry.

4. Authentication & Access: Run the following commands to authenticate your account and configure Docker:

```bash
# Authenticate your user account for gcloud CLI access
gcloud auth login

# Configure application default credentials for Docker and other tools
gcloud auth application-default login

# Configure Docker credentials and test your access
gcloud auth configure-docker
docker run hello-world
```
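If your images live in a regional Artifact Registry repository rather than Container Registry, `gcloud auth configure-docker` also accepts the regional registry host. A minimal sketch of deriving that host, where the region value is a placeholder:

```shell
# Hypothetical region; substitute the region of your Artifact Registry repo.
REGION="us-central1"

# Artifact Registry Docker hosts follow the pattern <region>-docker.pkg.dev.
AR_HOST="${REGION}-docker.pkg.dev"

# You would then authenticate that specific host:
echo "gcloud auth configure-docker ${AR_HOST}"
```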

## Installation Modes

We recommend building MaxText inside a Python virtual environment using `uv` for speed and dependency management.

### Option 1: From PyPI (Recommended)

This is the easiest way to get started with the latest stable version.

```bash
# Install uv, a fast Python package installer
pip install uv

# Create virtual environment
export VENV_NAME=<your virtual env name> # e.g., docker_venv
uv venv --python 3.12 --seed ${VENV_NAME?}
source ${VENV_NAME?}/bin/activate

# Install MaxText with the [runner] extra
# This enables Docker image building and workload scheduling via XPK
uv pip install maxtext[runner] --resolution=lowest
```

> **Note:** The `maxtext[runner]` extra includes all necessary dependencies for building MaxText Docker images and running workloads through XPK. It automatically installs XPK, so you do not need to install it separately to manage your clusters and workloads.

### Option 2: From Source

If you plan to contribute to MaxText or need the latest unreleased features, install from source.

```bash
# Clone the repository
git clone https://github.com/AI-Hypercomputer/maxtext.git
cd maxtext

# Create virtual environment
export VENV_NAME=<your virtual env name> # e.g., docker_venv
uv venv --python 3.12 --seed ${VENV_NAME?}
source ${VENV_NAME?}/bin/activate

# Install MaxText with the [runner] extra in editable mode
uv pip install -e .[runner] --resolution=lowest
```


## Build MaxText Docker Image

Select the appropriate build commands based on your hardware (`TPU` or `GPU`) and your specific workflow (`pre-training` or `post-training`). Each of these commands will generate a local Docker image named `maxtext_base_image`.

### TPU Pre-Training Docker Image

```bash
# Option 1: Build with the stable versions of dependencies (default)
build_maxtext_docker_image

# Option 2: Build with latest nightly versions of jax/jaxlib
build_maxtext_docker_image MODE=nightly

# Option 3: Build with the specified jax/jaxlib version
build_maxtext_docker_image MODE=nightly JAX_VERSION=$JAX_VERSION
```
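The `MODE` and `JAX_VERSION` arguments compose as plain `KEY=VALUE` pairs. A sketch of assembling the third option from variables; the JAX version shown is a placeholder, not a recommended pin:

```shell
# Placeholder values; substitute the mode and JAX version you actually want.
MODE="nightly"
JAX_VERSION="0.4.35"

# Assemble the build command shown in Option 3 above.
BUILD_CMD="build_maxtext_docker_image MODE=${MODE} JAX_VERSION=${JAX_VERSION}"
echo "${BUILD_CMD}"
```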

### GPU Pre-Training Docker Image

```bash
# Option 1: Build with the stable versions of dependencies (default)
build_maxtext_docker_image DEVICE=gpu

# Option 2: Build with latest nightly versions of jax/jaxlib
build_maxtext_docker_image DEVICE=gpu MODE=nightly

# Option 3: Build with the pinned base image `ghcr.io/nvidia/jax:base-2024-12-04`
build_maxtext_docker_image DEVICE=gpu MODE=pinned

# Option 4: Build with the specified jax/jaxlib version
build_maxtext_docker_image DEVICE=gpu MODE=nightly JAX_VERSION=$JAX_VERSION
```

### TPU Post-Training Docker Image

```bash
# This build process takes approximately 10 to 15 minutes.
build_maxtext_docker_image WORKFLOW=post-training
```

## Upload MaxText Docker Image to Artifact Registry

> **Note:** You will need the [**Artifact Registry Writer**](https://docs.cloud.google.com/artifact-registry/docs/access-control#permissions) role to push Docker images to your project's Artifact Registry and to allow the cluster to pull them during workload execution. If you don't have this permission, ask your project administrator to grant you the role via "Google Cloud Console -> IAM -> Grant access".
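If you administer the project yourself, the same role can be granted from the command line instead of the Console. A sketch, where the project ID and member are placeholders:

```shell
# Placeholders; substitute your project ID and the account to grant.
PROJECT_ID="my-gcp-project"
MEMBER="user:developer@example.com"

# roles/artifactregistry.writer is the Artifact Registry Writer role.
GRANT_CMD="gcloud projects add-iam-policy-binding ${PROJECT_ID} --member=${MEMBER} --role=roles/artifactregistry.writer"
echo "${GRANT_CMD}"
```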

```bash
# Make sure to replace <Docker Image Name> with your desired image name.
export CLOUD_IMAGE_NAME=<Docker Image Name>
upload_maxtext_docker_image CLOUD_IMAGE_NAME=${CLOUD_IMAGE_NAME?}
```
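Once uploaded, the image is referenced by its full registry path in workload configs. A sketch of composing that path, assuming Container Registry style naming as used elsewhere in these docs; the project and image name are placeholders:

```shell
# Placeholders; substitute your project and the image name you uploaded.
PROJECT="my-gcp-project"
CLOUD_IMAGE_NAME="maxtext-runner"

# Full path you would pass to XPK or a GKE workload spec.
DOCKER_IMAGE="gcr.io/${PROJECT}/${CLOUD_IMAGE_NAME}"
echo "${DOCKER_IMAGE}"
```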
2 changes: 1 addition & 1 deletion docs/guides/data_input_pipeline/data_input_grain.md
@@ -34,7 +34,7 @@ Grain ensures determinism in data input pipelines by saving the pipeline's state

1. Grain currently supports two data formats: [ArrayRecord](https://github.com/google/array_record) (random access) and [Parquet](https://arrow.apache.org/docs/python/parquet.html) (partial random-access through row groups). Only the ArrayRecord format supports the global shuffle mentioned above. For converting a dataset into ArrayRecord, see [Apache Beam Integration for ArrayRecord](https://github.com/google/array_record/tree/main/beam). Additionally, other random access data sources can be supported via a custom [data source](https://google-grain.readthedocs.io/en/latest/data_sources.html) class.
- **Community Resource**: The MaxText community has created [ArrayRecord documentation](https://array-record.readthedocs.io/). Note: we appreciate this community contribution, but it has not yet been verified by the MaxText or ArrayRecord developers.
2. When the dataset is hosted on a Cloud Storage bucket, Grain can read it through [Cloud Storage FUSE](https://cloud.google.com/storage/docs/gcs-fuse). The installation of Cloud Storage FUSE is included in [setup.sh](https://github.com/google/maxtext/blob/main/tools/setup/setup.sh). The user then needs to mount the Cloud Storage bucket to a local path for each worker, using the script [setup_gcsfuse.sh](https://github.com/google/maxtext/blob/main/tools/setup/setup_gcsfuse.sh). The script configures some parameters for the mount.
2. When the dataset is hosted on a Cloud Storage bucket, Grain can read it through [Cloud Storage FUSE](https://cloud.google.com/storage/docs/gcs-fuse). The installation of Cloud Storage FUSE is included in [setup.sh](https://github.com/google/maxtext/blob/main/src/dependencies/scripts/setup.sh). The user then needs to mount the Cloud Storage bucket to a local path for each worker, using the script [setup_gcsfuse.sh](https://github.com/google/maxtext/blob/main/tools/setup/setup_gcsfuse.sh). The script configures some parameters for the mount.

```sh
bash tools/setup/setup_gcsfuse.sh \
18 changes: 12 additions & 6 deletions docs/index.md
@@ -17,7 +17,9 @@
# MaxText

```{raw} html
:file: index.html
---
file: index.html
---
```

:link: reference/api
@@ -26,18 +28,22 @@
<section class="latest-news">

```{include} ../README.md
:start-after: <!-- NEWS START -->
:end-before: <!-- NEWS END -->
---
start-after: <!-- NEWS START -->
end-before: <!-- NEWS END -->
---
```

</section>
</div>

```{toctree}
:maxdepth: 2
:hidden:

---
maxdepth: 2
hidden:
---
install_maxtext
build_maxtext
tutorials
run_maxtext
guides
32 changes: 8 additions & 24 deletions docs/install_maxtext.md
@@ -17,7 +17,7 @@
# Install MaxText

This document discusses how to install MaxText. We recommend installing MaxText inside a Python virtual environment.
MaxText offers three installation modes:
MaxText offers the following installation modes:

1. maxtext[tpu]. Used for pre-training and decode on TPUs.
2. maxtext[cuda12]. Used for pre-training and decode on GPUs.
Expand All @@ -37,18 +37,18 @@ uv venv --python 3.12 --seed maxtext_venv
source maxtext_venv/bin/activate

# 3. Install MaxText and its dependencies. Choose a single
# installation option from this list to fit your use case.
# installation option from this list to fit your use case.

# Option 1: Installing maxtext[tpu]
uv pip install "maxtext[tpu]>=0.2.0" --resolution=lowest
uv pip install maxtext[tpu] --resolution=lowest
install_maxtext_tpu_github_deps

# Option 2: Installing maxtext[cuda12]
uv pip install "maxtext[cuda12]>=0.2.0" --resolution=lowest
uv pip install maxtext[cuda12] --resolution=lowest
install_maxtext_cuda12_github_dep

# Option 3: Installing maxtext[tpu-post-train]
uv pip install "maxtext[tpu-post-train]>=0.2.0" --resolution=lowest
uv pip install maxtext[tpu-post-train] --resolution=lowest
install_maxtext_tpu_post_train_extra_deps

# Option 4: Installing maxtext[runner]
@@ -91,7 +91,7 @@ uv pip install -e .[tpu-post-train] --resolution=lowest
install_maxtext_tpu_post_train_extra_deps

# Option 4: Installing maxtext[runner]
uv pip install .[runner] --resolution=lowest
uv pip install -e .[runner] --resolution=lowest
```

After installation, you can verify the package is available with `python3 -c "import maxtext"` and run training jobs with `python3 -m maxtext.trainers.pre_train.train ...`.
@@ -176,22 +176,6 @@ After generating the new requirements, you need to update the files in the MaxTe

Finally, test that the new dependencies install correctly and that MaxText runs as expected.

1. **Create a clean environment:** It's best to start with a fresh Python virtual environment.

```bash
uv venv --python 3.12 --seed maxtext_venv
source maxtext_venv/bin/activate
```

2. **Run the setup script:** Execute `bash setup.sh` to install the new dependencies.

```bash
pip install uv
# install the tpu package
uv pip install -e .[tpu] --resolution=lowest
# or install the gpu package by running the following line:
# uv pip install -e .[cuda12] --resolution=lowest
install_maxtext_github_deps
```
1. **Install MaxText and dependencies**: For instructions on installing MaxText on your VM, please refer to the [official documentation](https://maxtext.readthedocs.io/en/maxtext-v0.2.0/install_maxtext.html#from-source).

3. **Run tests:** Run MaxText tests to ensure there are no regressions.
2. **Verify the installation**: Run MaxText tests to ensure everything is working as expected with the newly installed dependencies and there are no regressions.
17 changes: 1 addition & 16 deletions docs/run_maxtext/run_maxtext_localhost.md
@@ -36,22 +36,7 @@ Local development on a single host TPU/GPU VM is a convenient way to run MaxText

1. Create and SSH to the single host VM of your choice. You can use any available single host TPU, such as `v5litepod-8`, `v5p-8`, or `v4-8`. For GPUs, you can use `nvidia-h100-mega-80gb`, `nvidia-h200-141gb`, or `nvidia-b200`. For setting up a TPU VM, use the Cloud TPU documentation available at https://cloud.google.com/tpu/docs/managing-tpus-tpu-vm. For a GPU setup, refer to the guide at https://cloud.google.com/compute/docs/gpus/create-vm-with-gpus.

2. Clone MaxText onto that VM.

```bash
git clone https://github.com/google/maxtext.git
cd maxtext
```

3. Once you have cloned the repository, you have two primary options for setting up the necessary dependencies on your VM: installing in a Python environment, or building a Docker container. For single host workloads, we recommend installing dependencies in a Python environment, and for multihost workloads we recommend the containerized approach.

Within the root directory of the cloned repo, create a virtual environment and install dependencies and the pre-commit hook by running:

```bash
python3.12 -m venv ~/venv-maxtext
source ~/venv-maxtext/bin/activate
bash tools/setup/setup.sh DEVICE={tpu|gpu}
```
2. For instructions on installing MaxText on your VM, please refer to the [official documentation](https://maxtext.readthedocs.io/en/latest/install_maxtext.html).

#### Run a Test Training Job

32 changes: 1 addition & 31 deletions docs/run_maxtext/run_maxtext_single_host_gpu.md
@@ -60,39 +60,9 @@ If you get the NVML Error: Please follow these instructions.

https://stackoverflow.com/questions/72932940/failed-to-initialize-nvml-unknown-error-in-docker-after-few-hours

## Install MaxText

Clone MaxText:

```bash
git clone https://github.com/AI-Hypercomputer/maxtext.git
```

## Build MaxText Docker image

This builds a docker image called `maxtext_base_image`. You can retag to a different name.

1. Check out the code changes:

```bash
cd maxtext
```

2. Run the following commands to build and push the docker image:

```bash
export LOCAL_IMAGE_NAME=<docker_image_name>
sudo bash docker_build_dependency_image.sh DEVICE=gpu
docker tag maxtext_base_image ${LOCAL_IMAGE_NAME?}
docker push ${LOCAL_IMAGE_NAME?}
```

Note that when running `bash docker_build_dependency_image.sh DEVICE=gpu`, it
uses `MODE=stable` by default. If you want to use other modes, you need to
specify it explicitly:

- using nightly mode: `bash docker_build_dependency_image.sh DEVICE=gpu MODE=nightly`
- using pinned mode: `bash docker_build_dependency_image.sh DEVICE=gpu MODE=pinned`
For instructions on building the MaxText Docker image, please refer to the [official documentation](https://maxtext.readthedocs.io/en/latest/build_maxtext.html).

## Test

24 changes: 2 additions & 22 deletions docs/run_maxtext/run_maxtext_via_pathways.md
@@ -35,27 +35,7 @@ Before you can run a MaxText workload, you must complete the following setup ste

2. **Create a GKE cluster** configured for Pathways.

3. **Build and upload a MaxText Docker image** to your project's Artifact Registry.

[Follow the steps to configure sudoless Docker](https://docs.docker.com/engine/install/linux-postinstall/) before running the commands below.

Step 1: Build the Docker image for a TPU device. This image contains MaxText and its dependencies.

```shell
bash src/dependencies/scripts/docker_build_dependency_image.sh DEVICE=tpu MODE=stable
```

Step 2: Configure Docker to authenticate with Google Cloud

```shell
gcloud auth configure-docker
```

Step 3: Upload the image to your project's registry. Replace `$USER_runner` with your desired image name.

```shell
bash src/dependencies/scripts/docker_upload_runner.sh CLOUD_IMAGE_NAME=$USER_runner
```
3. **Build and upload a MaxText Docker image** to your project's Artifact Registry. For instructions on building and uploading the MaxText Docker image, please refer to the [official documentation](https://maxtext.readthedocs.io/en/latest/build_maxtext.html).

## 2. Environment configuration

@@ -76,7 +56,7 @@ export WORKLOAD_NODEPOOL_COUNT=1 # Number of TPU slices for your job
export BUCKET_NAME="your-gcs-bucket-name"
export RUN_NAME="maxtext-run-1"
# The Docker image you pushed in the prerequisite step
export DOCKER_IMAGE="gcr.io/${PROJECT?}/${USER}_runner"
export DOCKER_IMAGE="gcr.io/${PROJECT?}/${CLOUD_IMAGE_NAME}"
```

## 3. Running a batch workload