This guide outlines the steps to configure the environment required to run benchmark recipes on a Google Kubernetes Engine (GKE) cluster with A4 node pools.
Before you begin, ensure you have completed the following:
- Create a Google Cloud project with billing enabled.
  a. To create a project, see Creating and managing projects.
  b. To enable billing, see Verify the billing status of your projects.
- Enable the following APIs:
- Make sure that you have a reservation for the required number of a4-highgpu-8g machines using the DENSE deployment type.
- Ensure that you have been granted the following IAM roles:
  - Editor (roles/editor)
  - Project IAM Admin (roles/resourcemanager.projectIamAdmin)
  - Kubernetes Engine Admin (roles/container.admin)
  - Service Account Admin (roles/iam.serviceAccountAdmin)
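The role requirement above can be checked mechanically. The following is a minimal sketch, assuming you have already captured the roles granted to your account as plain text, one role per line. The helper name `missing_roles` is hypothetical and not part of any Google Cloud tooling.

```shell
# Roles required by this guide, one per line.
REQUIRED_ROLES="roles/editor
roles/resourcemanager.projectIamAdmin
roles/container.admin
roles/iam.serviceAccountAdmin"

# missing_roles (hypothetical helper): reads granted roles from stdin,
# one per line, and prints every required role that is absent.
missing_roles() {
  local granted
  granted="$(cat)"
  printf '%s\n' "$REQUIRED_ROLES" | while IFS= read -r role; do
    # -x matches whole lines only, -F treats the role as a literal string.
    if ! printf '%s\n' "$granted" | grep -qxF -- "$role"; then
      printf '%s\n' "$role"
    fi
  done
}
```

One way to produce the input is `gcloud projects get-iam-policy PROJECT_ID --flatten="bindings[].members" --filter="bindings.members:user:EMAIL" --format="value(bindings.role)" | missing_roles`, where EMAIL is a placeholder for your account. An empty result means that none of the listed roles is missing. Roles inherited from a folder or organization, or granted through a group, would not appear in this output.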
The environment comprises the following components:
- A client workstation: this is used to prepare, submit, and monitor ML workloads.
- An Artifact Registry: serves as a private container registry for storing and managing Docker images used in the deployment.
- A Google Kubernetes Engine (GKE) cluster configured as follows:
  - A regional Standard cluster running GKE version v1.32.4-gke.1236000 or later.
  - A GPU node pool with the user-specified number of a4-highgpu-8g nodes, provisioned using the DENSE deployment type.
  - Workload Identity Federation for GKE enabled.
  - Cloud Storage FUSE CSI driver for GKE enabled.
  - DCGM metrics enabled.
  - Kueue and JobSet APIs installed.
  - Kueue configured to support Topology Aware Scheduling.
- A regional Google Cloud Storage (GCS) Bucket for storing the test environment configuration and state and the execution logs generated by recipes.
- A regional GCS bucket with hierarchical namespace enabled for managing training datasets.
- A regional GCS bucket with hierarchical namespace enabled for managing checkpoints.
You have two options: you can use either your own workstation (e.g., a local machine or Google Cloud VM) or Google Cloud Shell.
Google Cloud Shell comes with all the necessary components pre-installed, so no additional configuration is needed.
IMPORTANT: Make sure that you have at least 2GB of disk space remaining in your home directory.
If you prefer to use your own workstation, ensure you have the following components installed:
- Cluster Toolkit dependencies.
- kubectl with GKE authentication plugin. To install, see the GKE documentation.
- Helm. To install, see the Helm documentation.
Launch your client workstation and set your Project ID.
gcloud config set project PROJECT_ID
Replace the following:
- PROJECT_ID: your project ID.
The bucket is used to manage the state of the Cluster Toolkit blueprint that you'll use to provision a GKE cluster. The bucket is also used by the recipes to manage execution logs.
To create the bucket, run the following command:
gcloud storage buckets create gs://BUCKET_NAME \
--location=BUCKET_LOCATION \
--no-public-access-prevention --uniform-bucket-level-access
Replace the following:
- BUCKET_NAME: the name of your bucket. The name must comply with the Cloud Storage bucket naming conventions.
- BUCKET_LOCATION: the location of your bucket. The bucket must be in the same region as your cluster.
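A bucket name can be screened before the create command is run. The sketch below checks only a subset of the Cloud Storage naming conventions (length of 3 to 63 characters, allowed characters, and the first and last character), so a name that passes can still be rejected by Cloud Storage. The helper name `check_bucket_name` is hypothetical.

```shell
# check_bucket_name (hypothetical helper): succeeds when NAME passes a subset
# of the Cloud Storage naming rules. The body runs in a subshell so that the
# locale override does not leak into the calling shell.
check_bucket_name() (
  LC_ALL=C                      # make [a-z] match ASCII lowercase only
  name="$1"
  [ "${#name}" -ge 3 ] || exit 1
  [ "${#name}" -le 63 ] || exit 1
  case "$name" in
    *[!a-z0-9._-]*) exit 1 ;;             # contains a disallowed character
    [!a-z0-9]*|*[!a-z0-9]) exit 1 ;;      # must start and end with a letter or digit
  esac
  exit 0
)
```

For example, `check_bucket_name my-recipes-bucket && echo ok` prints `ok`, while a name with uppercase letters or a leading dash fails the check.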
A4 compute recipes access Google Cloud Storage buckets using the Kubernetes default ServiceAccount. You need to grant this account the rights to access the Google Cloud Storage bucket.
gcloud storage buckets add-iam-policy-binding gs://BUCKET_NAME \
--role=roles/storage.objectAdmin \
--member=principal://iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/PROJECT_ID.svc.id.goog/subject/ns/default/sa/default \
--condition=None
gcloud storage buckets add-iam-policy-binding gs://BUCKET_NAME \
--role=roles/storage.legacyBucketReader \
--member=principal://iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/PROJECT_ID.svc.id.goog/subject/ns/default/sa/default \
--condition=None
Replace the following:
- BUCKET_NAME: the name of your bucket.
- PROJECT_ID: your Google Cloud project ID.
- PROJECT_NUMBER: your numerical Google Cloud project number.
You can retrieve the project number from Cloud Console or using the following command:
PROJECT_NUMBER=$(gcloud projects describe PROJECT_ID --format="value(projectNumber)")
Replace the following:
- PROJECT_ID: your Google Cloud project ID.
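The member string in the IAM binding commands above is long and easy to mistype. The following sketch assembles it from its parts. The helper name `wi_principal` and the example project values are hypothetical. The namespace and ServiceAccount both default to `default`, which matches the commands in this guide.

```shell
# wi_principal (hypothetical helper): prints the Workload Identity Federation
# principal for a Kubernetes ServiceAccount.
# Arguments: PROJECT_NUMBER PROJECT_ID [NAMESPACE] [SERVICE_ACCOUNT]
wi_principal() {
  local project_number="$1" project_id="$2"
  local namespace="${3:-default}" ksa="${4:-default}"
  printf 'principal://iam.googleapis.com/projects/%s/locations/global/workloadIdentityPools/%s.svc.id.goog/subject/ns/%s/sa/%s\n' \
    "$project_number" "$project_id" "$namespace" "$ksa"
}

# Example with placeholder values; substitute your own project number and ID.
MEMBER="$(wi_principal 123456789012 my-project)"
echo "$MEMBER"
```

The resulting value can then be passed as `--member="$MEMBER"` to both `gcloud storage buckets add-iam-policy-binding` commands, so the two bindings are guaranteed to reference the same principal.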
- If you use Cloud KMS for repository encryption, create your artifact registry by using the instructions here.
- If you don't use Cloud KMS, you can create your repository by using the following command:
  gcloud artifacts repositories create REPOSITORY \
      --repository-format=docker \
      --location=LOCATION \
      --description="DESCRIPTION"
  Replace the following:
  - REPOSITORY: the name of the repository. For each repository location in a project, repository names must be unique.
  - LOCATION: the regional or multi-regional location for the repository. You can omit this flag if you set a default region.
  - DESCRIPTION: a description of the repository. Don't include sensitive data because repository descriptions are not encrypted.
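Images stored in a Docker-format Artifact Registry repository are addressed as LOCATION-docker.pkg.dev/PROJECT_ID/REPOSITORY/IMAGE:TAG. The sketch below builds that path from the values used when the repository was created. The helper name `image_uri`, the image name, and the example values are hypothetical and do not come from the recipes.

```shell
# image_uri (hypothetical helper): prints the full path of an image in a
# Docker-format Artifact Registry repository.
# Arguments: LOCATION PROJECT_ID REPOSITORY IMAGE [TAG]
image_uri() {
  local location="$1" project_id="$2" repository="$3" image="$4"
  local tag="${5:-latest}"      # assumption: fall back to "latest" when no tag is given
  printf '%s-docker.pkg.dev/%s/%s/%s:%s\n' \
    "$location" "$project_id" "$repository" "$image" "$tag"
}

# Example with placeholder values.
image_uri us-central1 my-project my-repo trainer v1
```

The printed path is the value you would use when tagging and pushing an image to the repository.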
You'll use the Cluster Toolkit to create your GKE cluster environment. The Cluster Toolkit blueprint used in this setup creates and configures the following components:
- VPC networks, subnets, routers, and firewall rules.
- A GKE cluster with the required features enabled.
- Service accounts with the required permissions.
- An A4 node pool with a4-highgpu-8g nodes.
- JobSet and Kueue APIs.
- Cloud Storage buckets with hierarchical namespace enabled for training data and checkpoints.
The A4 compute recipes have been validated on a cluster created with the v1.51.1 version of the Cluster Toolkit.
- Configure Application Default Credentials
  Before deploying the Cluster Toolkit blueprint, you need to configure Application Default Credentials (ADC).
  gcloud auth application-default login
  You will be prompted to open your web browser and authenticate to Google Cloud.
- Clone the Cluster Toolkit from the GitHub repository:
  git clone --branch v1.51.1 --single-branch https://github.com/GoogleCloudPlatform/cluster-toolkit
- Install the Cluster Toolkit
  cd cluster-toolkit && make
- Deploy the cluster:
  ./gcluster deploy \
      examples/gke-a4/gke-a4.yaml \
      --backend-config "bucket=BUCKET_NAME" \
      --vars "deployment_name=CLUSTER_NAME" \
      --vars "project_id=PROJECT_ID" \
      --vars "region=COMPUTE_REGION" \
      --vars "zone=COMPUTE_ZONE" \
      --vars "authorized_cidr=AUTHORIZED_CIDR" \
      --vars "extended_reservation=RESERVATION_NAME" \
      --vars "static_node_count=NODE_COUNT" \
      --vars "system_node_pool_disk_size_gb=200"
Replace the following:
- BUCKET_NAME: the name of the Cloud Storage bucket created in the previous step. Don't use the gs:// prefix in the name.
- CLUSTER_NAME: the name for your cluster. Make sure that the name is shorter than 16 characters.
- PROJECT_ID: the project ID of your project.
- COMPUTE_REGION: the compute region for the cluster.
- COMPUTE_ZONE: the compute zone for the node pool of A4 machines.
- AUTHORIZED_CIDR: the IP address range that you want to allow to connect to the cluster. This CIDR block must include the IP address of the machine that runs the Cluster Toolkit. If you want to allow access from any IP address, use 0.0.0.0/0.
- NODE_COUNT: the number of A4 nodes to provision in your cluster.
- RESERVATION_NAME: the name of your reservation. If you want to target a specific block within your reservation when creating a node pool, use the following format: RESERVATION_NAME/reservationBlocks/BLOCK_NAME. To get the names of the blocks that are available for your reservation, run the following command:
  gcloud beta compute reservations blocks list RESERVATION_NAME \
      --zone=COMPUTE_ZONE --format "value(name)"
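Two of the constraints above (no gs:// prefix on the bucket name, and a cluster name shorter than 16 characters) are easy to violate and only surface after the deployment has started. The sketch below checks them up front. The helper name `validate_deploy_vars` is hypothetical, and the rule that NODE_COUNT must be a positive integer is an assumption rather than something this guide states.

```shell
# validate_deploy_vars (hypothetical helper): prints one message per violated
# constraint and returns non-zero if any check fails.
# Arguments: BUCKET_NAME CLUSTER_NAME NODE_COUNT
validate_deploy_vars() {
  local bucket_name="$1" cluster_name="$2" node_count="$3" rc=0

  case "$bucket_name" in
    gs://*) echo "BUCKET_NAME must not include the gs:// prefix"; rc=1 ;;
  esac

  if [ "${#cluster_name}" -ge 16 ]; then
    echo "CLUSTER_NAME must be shorter than 16 characters"; rc=1
  fi

  # Assumption: the node count is a positive integer without leading zeros.
  case "$node_count" in
    ''|*[!0-9]*|0*) echo "NODE_COUNT must be a positive integer"; rc=1 ;;
  esac

  return "$rc"
}
```

A typical use is `validate_deploy_vars "$BUCKET_NAME" "$CLUSTER_NAME" "$NODE_COUNT" && ./gcluster deploy ...`, so that the deployment starts only when all checks pass.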
After the Cluster Toolkit blueprint deployment has completed, verify the key configurations.
- Get cluster credentials:
  gcloud container clusters get-credentials CLUSTER_NAME \
      --location COMPUTE_REGION
  Replace the following:
  - CLUSTER_NAME: the name of your cluster.
  - COMPUTE_REGION: the region of your cluster.
- List Kueue local queues:
  kubectl get queues
  You should see output similar to the following:
  NAME   CLUSTERQUEUE   PENDING WORKLOADS   ADMITTED WORKLOADS
  a4     a4             0                   0
  The blueprint configures Kueue using a4-high as the default name for both the local queue and the cluster queue.
- Make sure that all A4 nodes are in the ready state:
  kubectl get nodes
  You should see output similar to the following:
  NAME                                             STATUS   ROLES    AGE    VERSION
  gke-dsm-a4-a4-highgpu-8g-a4-pool-79c66879-08rx   Ready    <none>   161m   v1.32.4-gke.1236000
  gke-dsm-a4-a4-highgpu-8g-a4-pool-79c66879-4l4m   Ready    <none>   161m   v1.32.4-gke.1236000
  gke-dsm-a4-system-6bf3b1fc-7kl7                  Ready    <none>   165m   v1.32.4-gke.1236000
  ...
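On a cluster with many nodes, scanning the node list by eye is error-prone. The following sketch filters the output of kubectl get nodes and prints only the nodes whose STATUS column is not exactly Ready. The helper name `not_ready_nodes` is hypothetical.

```shell
# not_ready_nodes (hypothetical helper): reads `kubectl get nodes` output on
# stdin and prints the name of every node whose STATUS is not exactly "Ready".
# Combined statuses such as "Ready,SchedulingDisabled" are reported as well.
not_ready_nodes() {
  # NR > 1 skips the header row; NF >= 2 skips blank or truncated lines.
  awk 'NR > 1 && NF >= 2 && $2 != "Ready" { print $1 }'
}
```

Run it as `kubectl get nodes | not_ready_nodes`. An empty result means that every listed node reports Ready.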
Grant the storage.admin role to the custom IAM node pool service account created by the Cluster Toolkit blueprint. This is required to support some recipes.
gcloud projects add-iam-policy-binding PROJECT_ID \
--member="serviceAccount:CLUSTER_NAME-gke-np-sa@PROJECT_ID.iam.gserviceaccount.com" \
--role="roles/storage.admin"
Replace the following:
- PROJECT_ID: the project ID of your project.
- CLUSTER_NAME: the name for your cluster.
Once you have set up your GKE cluster with A4 node pools, you can proceed to deploy and run your benchmark recipes.
If you want to remove the resources created when setting up the environment, follow the instructions below.
To remove resources created by the Cluster Toolkit blueprint:
cd ~/cluster-toolkit
./gcluster destroy DEPLOYMENT_NAME
Replace the following:
- DEPLOYMENT_NAME: the name you used during the deployment. This is the name of your cluster.
If you want to remove the Cloud Storage buckets in your environment, run the following command.
IMPORTANT: This command removes the bucket and all objects within it. You won't be able to recover them after the command is executed.
gcloud storage rm -r gs://BUCKET_NAME
Replace the following:
- BUCKET_NAME: the name of your bucket
To delete the Artifact Registry:
gcloud artifacts repositories delete REPOSITORY --location=LOCATION
Replace the following:
- REPOSITORY: the name of your repository
- LOCATION: the location of your repository
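The order of the cleanup steps matters: the bucket holds the state of the Cluster Toolkit blueprint, so the deployment should be destroyed before the bucket is deleted. The sketch below only prints the three commands from this section in that order for review and does not execute anything. The helper name `print_cleanup_plan` and the example values are hypothetical.

```shell
# print_cleanup_plan (hypothetical helper): prints the cleanup commands in a
# safe order without running them.
# Arguments: DEPLOYMENT_NAME BUCKET_NAME REPOSITORY LOCATION
print_cleanup_plan() {
  local deployment="$1" bucket="$2" repository="$3" location="$4"
  # 1. Destroy the blueprint first; it reads its state from the bucket.
  printf './gcluster destroy %s\n' "$deployment"
  # 2. Remove the bucket and everything in it.
  printf 'gcloud storage rm -r gs://%s\n' "$bucket"
  # 3. Delete the Artifact Registry repository.
  printf 'gcloud artifacts repositories delete %s --location=%s\n' "$repository" "$location"
}

# Example with placeholder values.
print_cleanup_plan a4-bench my-bucket my-repo us-central1
```

Review the printed commands, then run them one at a time from the cluster-toolkit directory.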
If you encounter any issues or have questions about this setup, use one of the following resources:
- Consult the official GKE documentation.
- Check the issues section of this repository for known problems and solutions.
- Reach out to Google Cloud support.