(run-xpk)=
This guide provides the recommended workflow for running MaxText on Google Kubernetes Engine (GKE) using the Accelerated Processing Kit (XPK). For a complete reference on XPK, please see the official XPK repository.
The process involves two main stages. First, you will package the MaxText application and its dependencies into a self-contained Docker image. This is done on your local machine or any environment where Docker is installed. Second, you will use the XPK command-line tool to orchestrate the deployment of this image as a training job on a GKE cluster equipped with accelerators (TPUs or GPUs).
XPK abstracts away the complexity of cluster management and job submission, handling tasks like uploading your Docker image to Artifact Registry and scheduling the workload on the cluster.
```text
+--------------------------+       +--------------------+       +-------------------+
|                          |       |                    |       |                   |
| Your Development Machine +------>| Artifact Registry  +------>| GKE Cluster       |
| (anywhere with Docker)   |       | (Stores your image)|       |(with Accelerators)|
|                          |       |                    |       |                   |
| 1. Build MaxText Docker  |       | 2. XPK uploads     |       | 3. XPK runs job   |
|    Image                 |       |    image for you   |       |    using the image|
+--------------------------+       +--------------------+       +-------------------+
```
Before you begin, you must have the necessary tools installed and permissions configured.
- Python >= 3.12 with `pip` and `venv`.
- Google Cloud CLI (`gcloud`): install it, then run `gcloud init`.
- `kubectl`: the Kubernetes command-line tool.
- Docker: follow the installation instructions, then configure sudoless Docker so you can run `docker` commands without `sudo`.
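Before moving on, it can help to confirm these tools are actually available. The following is only a preflight sketch: it checks that each CLI is on your `PATH`, not that it is correctly configured or authenticated.

```shell
# Preflight sketch: verify the required CLIs are installed and on PATH.
# This checks availability only, not versions or authentication state.
missing=0
for tool in python3 gcloud kubectl docker; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "found: $tool"
  else
    echo "MISSING: $tool" >&2
    missing=$((missing + 1))
  fi
done
echo "missing tools: $missing"
```

If any tool is reported missing, install it before continuing; every later step depends on this toolchain.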
Your Google Cloud user account needs the following IAM roles for the project you're using:
- Artifact Registry Writer
- Compute Admin
- Kubernetes Engine Admin
- Logging Admin
- Monitoring Admin
- Service Account User
- Storage Admin
- Vertex AI Administrator
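For reference, these display names correspond to the standard IAM role IDs shown below. The loop only prints the `gcloud projects add-iam-policy-binding` commands an administrator would run; `MEMBER` and `PROJECT_ID` are placeholders you must replace before executing anything for real.

```shell
# Dry-run sketch: print (rather than execute) the commands that would grant
# the required roles. MEMBER and PROJECT_ID below are placeholders.
MEMBER="user:you@example.com"
PROJECT_ID="your-gcp-project-id"
for role in \
  roles/artifactregistry.writer \
  roles/compute.admin \
  roles/container.admin \
  roles/logging.admin \
  roles/monitoring.admin \
  roles/iam.serviceAccountUser \
  roles/storage.admin \
  roles/aiplatform.admin; do
  echo gcloud projects add-iam-policy-binding "$PROJECT_ID" \
    --member="$MEMBER" --role="$role"
done
```

Piping the output to a shell (after reviewing it) applies the bindings; keeping it as a dry run avoids accidentally modifying IAM policy.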
These commands configure your local environment to connect to Google Cloud services.
- Authenticate gcloud:

  ```bash
  gcloud auth login
  ```

- Install the GKE auth plugin:

  ```bash
  sudo apt-get update && sudo apt-get install google-cloud-sdk-gke-gcloud-auth-plugin
  ```

- Configure Docker credentials so Docker can push to Artifact Registry:

  ```bash
  gcloud auth configure-docker
  ```
For instructions on building the MaxText Docker image, please refer to the official documentation.
This section assumes you have an existing GKE cluster with either TPU or GPU nodes.
This guide focuses on submitting workloads to an existing cluster. Cluster creation and management is a separate topic. For a comprehensive guide on all `xpk` commands, including `xpk cluster create`, please refer to the **[official XPK documentation](https://github.com/AI-Hypercomputer/xpk)**.
- Set your configuration:

  ```bash
  export PROJECT_ID="your-gcp-project-id"
  export ZONE="your-gcp-zone"  # e.g., us-central1-a
  export CLUSTER_NAME="your-existing-cluster-name"
  export BASE_OUTPUT_DIR="gs://your-output-bucket/"
  export DATASET_PATH="gs://your-dataset-bucket/"
  ```

- Configure the gcloud CLI:

  ```bash
  gcloud config set project ${PROJECT_ID?}
  gcloud config set compute/zone ${ZONE?}
  ```
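Every command that follows depends on these variables, so a quick completeness check can save a failed submission. This is a sketch that uses placeholder values; substitute your own before running the real workflow.

```shell
# Placeholder values; replace with your own before use.
export PROJECT_ID="your-gcp-project-id"
export ZONE="us-central1-a"
export CLUSTER_NAME="your-existing-cluster-name"
export BASE_OUTPUT_DIR="gs://your-output-bucket/"
export DATASET_PATH="gs://your-dataset-bucket/"

# Fail fast if any required variable is empty or unset.
missing_vars=0
for var in PROJECT_ID ZONE CLUSTER_NAME BASE_OUTPUT_DIR DATASET_PATH; do
  eval "val=\${$var:-}"
  if [ -z "$val" ]; then
    echo "ERROR: $var is not set" >&2
    missing_vars=$((missing_vars + 1))
  fi
done
echo "unset variables: $missing_vars"
```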
The examples below run on a single TPU slice (`--num-slices=1`) or a small number of GPU nodes (`--num-nodes=2`). To scale your job to a larger, multi-host configuration, simply increase these values.
For instance, to run a job across four TPU slices, change `--num-slices=1` to `--num-slices=4`. This tells XPK to allocate four v5litepod-256 slices and orchestrate the training job across all of them as a single workload. Similarly, for GPUs, increase the `--num-nodes` value.
- Create the workload (run the job):

  - On your TPU cluster:

    ```bash
    xpk workload create \
      --cluster ${CLUSTER_NAME?} \
      --workload ${USER}-tpu-job \
      --base-docker-image maxtext_base_image \
      --tpu-type v5litepod-256 \
      --num-slices 1 \
      --command "python3 -m maxtext.trainers.pre_train.train run_name=${USER}-tpu-job base_output_directory=${BASE_OUTPUT_DIR?} dataset_path=${DATASET_PATH?} steps=100"
    ```

  - On your GPU cluster:

    ```bash
    xpk workload create \
      --cluster ${CLUSTER_NAME?} \
      --workload ${USER}-gpu-job \
      --base-docker-image maxtext_base_image \
      --device-type h100-80gb-8 \
      --num-nodes 2 \
      --command "python3 -m maxtext.trainers.pre_train.train run_name=${USER}-gpu-job base_output_directory=${BASE_OUTPUT_DIR?} dataset_path=${DATASET_PATH?} steps=100"
    ```
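The `--command` string passed to XPK is long and quote-heavy. One way to keep it readable, sketched below with hypothetical placeholder defaults (`RUN_NAME` is not part of the original workflow), is to assemble the training command in a variable first and then pass `--command "$TRAIN_CMD"`:

```shell
# Sketch: build the training command separately, then hand it to
# `xpk workload create --command "$TRAIN_CMD"`. RUN_NAME is a placeholder.
RUN_NAME="${USER:-demo}-tpu-job"
TRAIN_CMD="python3 -m maxtext.trainers.pre_train.train \
run_name=${RUN_NAME} \
base_output_directory=${BASE_OUTPUT_DIR:-gs://your-output-bucket/} \
dataset_path=${DATASET_PATH:-gs://your-dataset-bucket/} \
steps=100"
echo "$TRAIN_CMD"
```

Because the backslash-newline pairs sit inside double quotes, the variable collapses to a single line, so the resulting `--command` argument is identical to the inline form above.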
- View logs in real time: the easiest way to see the output of your training job is through the Google Cloud Console.

  - Navigate to the Kubernetes Engine section.
  - Go to Workloads.
  - Find your workload (e.g., `${USER}-tpu-job`) and click on it.
  - Select the Logs tab to view the container logs.

- List your jobs:

  ```bash
  xpk workload list --cluster ${CLUSTER_NAME?}
  ```

- Analyze output: checkpoints and other artifacts are saved to the Google Cloud Storage bucket you specified in `BASE_OUTPUT_DIR`.

- Delete a job:

  ```bash
  xpk workload delete --cluster ${CLUSTER_NAME?} --workload <your-workload-name>
  ```