(run-pathways)=
# Run MaxText on GKE with Pathways

This guide provides a comprehensive walkthrough for running MaxText workloads on a Google Kubernetes Engine (GKE) cluster using Pathways. Pathways acts as an orchestrator for large-scale JAX jobs on AI Hypercomputer infrastructure.
This document assumes you have already created a Pathways GKE cluster using xpk. If you haven't, follow the instructions at the Google Cloud Pathways & XPK documentation.
We will cover two primary modes of operation:
- Batch workload: Ideal for long-running, non-interactive training jobs.
- Headless workload: Ideal for interactive development, debugging, and running code from a local machine or CPU VM.
## Prerequisites

Before you can run a MaxText workload, you must complete the following setup steps:
1. Install XPK and its dependencies. Ensure that the `xpk` command-line tool is installed.
2. Create a GKE cluster configured for Pathways.
3. Build and upload a MaxText Docker image to your project's Artifact Registry.
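As a quick sanity check that the prerequisites are in place, you can verify the required command-line tools are on your `PATH`. A minimal sketch (the tool list is an assumption based on the steps in this guide; `missing_tools` is an illustrative helper, not part of MaxText or xpk):

```python
import shutil

# CLI tools the steps in this guide rely on (assumed list).
REQUIRED_TOOLS = ["xpk", "gcloud", "kubectl", "docker"]

def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of tools that are not found on PATH."""
    return [t for t in tools if shutil.which(t) is None]

for tool in missing_tools():
    print(f"missing: {tool} -- install it before continuing")
```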
Step 1: Build the Docker image for a TPU device. This image contains MaxText and its dependencies.
```bash
bash src/dependencies/scripts/docker_build_dependency_image.sh DEVICE=tpu MODE=stable
```
Step 2: Configure Docker to authenticate with Google Cloud.

```bash
gcloud auth configure-docker
```
Step 3: Upload the image to your project's registry. Replace `$USER_runner` with your desired image name.

```bash
bash src/dependencies/scripts/docker_upload_runner.sh CLOUD_IMAGE_NAME=$USER_runner
```
## Configure environment variables

The following commands use placeholder variables. Before running them, set these environment variables in your shell.
```bash
# -- Google Cloud Configuration --
export PROJECT="your-gcp-project-id"
export ZONE="your-gcp-zone"
export CLUSTER="your-gke-cluster-name"

# -- Workload Configuration --
export WORKLOAD_NAME="maxtext-job-$(date +%Y%m%d-%H%M%S)"
export TPU_TYPE="v5p-8" # Or your desired TPU type, e.g., v5e-4
export WORKLOAD_NODEPOOL_COUNT=1 # Number of TPU slices for your job

# -- MaxText & Storage Configuration --
export BUCKET_NAME="your-gcs-bucket-name"
export RUN_NAME="maxtext-run-1"

# The Docker image you pushed in the prerequisite step
export DOCKER_IMAGE="gcr.io/${PROJECT?}/${USER}_runner"
```

## Run a batch workload

A batch workload runs entirely within the GKE cluster. You submit the job definition, and Pathways manages its execution.
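Before submitting a workload, you can sanity-check that every required variable is set. A minimal sketch (the variable list mirrors the exports above; `missing_vars` is an illustrative helper, not part of MaxText or xpk):

```python
import os

# Variables that the commands in this guide expect (mirrors the exports above).
REQUIRED_VARS = [
    "PROJECT", "ZONE", "CLUSTER",
    "WORKLOAD_NAME", "TPU_TYPE", "WORKLOAD_NODEPOOL_COUNT",
    "BUCKET_NAME", "RUN_NAME", "DOCKER_IMAGE",
]

def missing_vars(env=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]

missing = missing_vars()
if missing:
    print("Set these before running xpk:", ", ".join(missing))
else:
    print("All required variables are set.")
```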
Use the xpk workload create-pathways command to start the job.
```bash
xpk workload create-pathways \
  --workload=${WORKLOAD_NAME?} \
  --cluster=${CLUSTER?} \
  --num-slices=${WORKLOAD_NODEPOOL_COUNT?} \
  --tpu-type=${TPU_TYPE?} \
  --project=${PROJECT?} \
  --zone=${ZONE?} \
  --docker-image=${DOCKER_IMAGE?} \
  --command="python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml \
    base_output_directory=gs://${BUCKET_NAME?} \
    per_device_batch_size=1 \
    enable_checkpointing=false \
    dataset_type=synthetic \
    enable_single_controller=True \
    run_name=${RUN_NAME?}-pathways-batch"
```

You can check the status of your running workloads with the `xpk workload list` command.
```bash
xpk workload list --cluster=${CLUSTER?} --project=${PROJECT?} --zone=${ZONE?}
```

## Run a headless workload

A headless workload reserves TPUs on the cluster and sets up a controller, but the Python script itself runs on a separate machine, such as a local laptop or a Compute Engine VM. This is useful for rapid development and debugging. "Headless" here means that the Pathways backend services, such as the resource manager and the IFRT proxy, are launched without a predefined user-workload container.
Step 1: Reserve the TPUs and start the Pathways head service on the cluster. This command waits until the resources are ready.
```bash
xpk workload create-pathways \
  --headless \
  --workload=${WORKLOAD_NAME?} \
  --num-slices=${WORKLOAD_NODEPOOL_COUNT?} \
  --tpu-type=${TPU_TYPE?} \
  --project=${PROJECT?} \
  --zone=${ZONE?} \
  --cluster=${CLUSTER?}
```

Step 2: On the machine where you will run your Python script, open a new terminal and create a secure tunnel to the cluster's Pathways controller.
This command forwards local port 29000 to the controller pod in the cluster. It runs in the background.
```bash
kubectl port-forward \
  "$(kubectl get pods -o name | grep ${WORKLOAD_NAME?}-pathways-head)" \
  29000:29000 &> /dev/null &
```

Step 3: With the port forward active, you can now run your MaxText script. The JAX environment variables direct it to connect to the TPUs through the tunnel.
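Before launching the script, you may want to confirm that the tunnel is actually accepting connections. A minimal sketch (the `tunnel_open` helper is hypothetical, not part of MaxText; port 29000 matches the local end of the port forward above):

```python
import socket

def tunnel_open(host: str = "127.0.0.1", port: int = 29000, timeout: float = 2.0) -> bool:
    """Return True if something is listening on host:port (e.g. the port-forward)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if tunnel_open():
    print("Tunnel is up; safe to start the training script.")
else:
    print("Nothing listening on 127.0.0.1:29000; check kubectl port-forward.")
```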
```bash
# Set these environment variables to tell JAX how to connect to the TPUs
export JAX_PLATFORMS=proxy
export JAX_BACKEND_TARGET=grpc://127.0.0.1:29000

# Run the training script
python3 -m maxtext.trainers.pre_train.train src/maxtext/configs/base.yml \
  base_output_directory=gs://${BUCKET_NAME?} \
  per_device_batch_size=1 \
  enable_checkpointing=false \
  dataset_type=synthetic \
  enable_single_controller=True \
  run_name=${RUN_NAME?}-pathways-headless
```

The output streams directly to your terminal, just as if you were running on a local accelerator.
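The two exports above can equivalently be set from Python, as long as that happens before JAX is imported. A minimal sketch (the `pathways_proxy_env` helper is illustrative; the commented lines assume the dependencies in the MaxText image, including the `pathwaysutils` package):

```python
import os

def pathways_proxy_env(host: str = "127.0.0.1", port: int = 29000) -> dict:
    """Build the environment JAX needs to reach the TPUs through the tunnel."""
    return {
        "JAX_PLATFORMS": "proxy",
        "JAX_BACKEND_TARGET": f"grpc://{host}:{port}",
    }

# Apply the settings before importing JAX.
os.environ.update(pathways_proxy_env())

# import pathwaysutils, jax      # available with the MaxText dependencies installed
# pathwaysutils.initialize()     # call before running any JAX computation
# print(jax.devices())           # should list the reserved TPU devices
```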
## Troubleshooting

- Permission denied errors for the Cloud Storage bucket: Check that the service account used by your GKE nodes has "Storage Object Admin" permissions on your GCS bucket.
- `Image not found` or `ImagePullBackOff`:
  - Verify that your `DOCKER_IMAGE` variable is correct.
  - Ensure you have successfully pushed the image to your project's Artifact Registry.
  - Check that your GKE cluster has permissions to pull from the registry.
- `kubectl port-forward` fails:
  - Confirm that the pod from Step 1 is running (`kubectl get pods`). The name should match `${WORKLOAD_NAME?}-pathways-head-0`.
  - Ensure you are authenticated with `kubectl` and have the correct context set for your GKE cluster.
- Make sure you import the `pathwaysutils` package and call `pathwaysutils.initialize()` in your script when running the workload.
For more advanced configurations and a deeper dive into the Pathways architecture, see the official Pathways on Cloud documentation.