Configuring the environment for running benchmark recipes on a GKE Cluster with A4 Node Pools

This guide outlines the steps to configure the environment required to run benchmark recipes on a Google Kubernetes Engine (GKE) cluster with A4 node pools.

Prerequisites

Before you begin, ensure you have completed the following:

Create a Google Cloud project with billing enabled.
a. To create a project, see Creating and managing projects.
b. To enable billing, see Verify the billing status of your projects.
Enable the following APIs:
Make sure that you have a reservation for the required number of a4-highgpu-8g machines using the DENSE deployment type.
Ensure that you have been granted the following IAM roles:
- Editor (roles/editor)
- Project IAM Admin (roles/resourcemanager.projectIamAdmin)
- Kubernetes Engine Admin (roles/container.admin)
- Service Account Admin (roles/iam.serviceAccountAdmin)

The environment

The environment comprises the following components:

A client workstation: this is used to prepare, submit, and monitor ML workloads.
An Artifact Registry: serves as a private container registry for storing and managing Docker images used in the deployment.
A Google Kubernetes Engine (GKE) cluster configured as follows:
- A regional GKE Standard cluster running version v1.32.4-gke.1236000 or later.
- A GPU node pool with a user-specified number of a4-highgpu-8g machines, provisioned using the DENSE deployment type.
- Workload Identity Federation for GKE enabled.
- Cloud Storage FUSE CSI driver for GKE enabled.
- DCGM metrics enabled.
- Kueue and JobSet APIs installed.
- Kueue configured to support Topology Aware Scheduling.
A regional Google Cloud Storage (GCS) bucket that stores the test environment configuration and state, as well as the execution logs generated by recipes.
A regional GCS bucket with hierarchical namespace enabled for managing training datasets.
A regional GCS bucket with hierarchical namespace enabled for managing checkpoints.

Set up the client workstation

You have two options: you can use either your own workstation (e.g., a local machine or Google Cloud VM) or Google Cloud Shell.

Set up Google Cloud Shell

Google Cloud Shell comes with all the necessary components pre-installed, so no additional configuration is needed.

IMPORTANT: Make sure that you have at least 2GB of disk space remaining in your home directory.
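The disk space requirement can be checked before you start. The following is a minimal sketch that assumes a POSIX `df` and `awk` are available; the function names `free_kib` and `has_required_space` are illustrative and not part of any Google Cloud tooling.

```shell
# Free space, in KiB, on the filesystem that holds the given directory
# (defaults to $HOME). df -P forces the one-line-per-filesystem POSIX layout.
free_kib() {
  df -Pk "${1:-$HOME}" | awk 'NR == 2 { print $4 }'
}

# 2 GB expressed in KiB, matching the threshold stated in the note above.
REQUIRED_KIB=$((2 * 1024 * 1024))

# Succeeds when the directory has at least REQUIRED_KIB available.
has_required_space() {
  [ "$(free_kib "$1")" -ge "$REQUIRED_KIB" ]
}

# Example:
#   has_required_space "$HOME" || echo "Free up space in $HOME first"
```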

Set up your own workstation

If you prefer to use your own workstation, ensure you have the following components installed:

Cluster Toolkit dependencies.
kubectl with GKE authentication plugin. To install, see the GKE documentation.
Helm. To install, see the Helm documentation.
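To confirm that the tools are on your PATH before you continue, something like the sketch below may help. The helper name `missing_tools` and the `REQUIRED_TOOLS` list are assumptions made for illustration; the list covers the binaries this guide names or invokes (`git` and `make` are used later to build the Cluster Toolkit) and does not enumerate the full set of Cluster Toolkit dependencies.

```shell
# Print every command from the argument list that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Binaries named or invoked in this guide; adjust the list to your setup.
REQUIRED_TOOLS="gcloud kubectl gke-gcloud-auth-plugin helm git make"

# Example (word splitting of $REQUIRED_TOOLS is intentional):
#   missing=$(missing_tools $REQUIRED_TOOLS)
#   [ -z "$missing" ] || echo "Install before continuing: $missing"
```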

Set your Project ID

Launch your client workstation and set your Project ID.

gcloud config set project PROJECT_ID

Replace the following:

PROJECT_ID: your project ID.
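A typo in the project ID tends to surface only later as a permissions error, so a quick format check can be useful. The sketch below assumes the standard project ID rules (6 to 30 characters; lowercase letters, digits, and hyphens; starts with a letter; does not end with a hyphen) and does not handle domain-scoped IDs. The function name `valid_project_id` is hypothetical.

```shell
# Succeeds when the argument looks like a standard Google Cloud project ID.
# Domain-scoped IDs (for example "example.com:my-project") are not handled.
valid_project_id() {
  printf '%s\n' "$1" | grep -Eq '^[a-z][a-z0-9-]{4,28}[a-z0-9]$'
}

# Example:
#   valid_project_id "$PROJECT_ID" && gcloud config set project "$PROJECT_ID"
#   gcloud config get-value project   # prints the currently active project
```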

Set up a Google Cloud Storage bucket for environment state and logs

The bucket is used to manage the state of the Cluster Toolkit blueprint that you'll use to provision a GKE cluster. The bucket is also used by the recipes to manage execution logs.

To create the bucket, run the following command:

gcloud storage buckets create gs://BUCKET_NAME \
--location=BUCKET_LOCATION \
--no-public-access-prevention --uniform-bucket-level-access

Replace the following:

BUCKET_NAME: the name of your bucket. The name must comply with the Cloud Storage bucket naming conventions.
BUCKET_LOCATION: the location of your bucket. The bucket must be in the same region as your cluster.
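Before creating the bucket, you can run a rough check of the name against the naming conventions. This sketch covers only a subset of the rules (3 to 63 characters from lowercase letters, digits, dots, hyphens, and underscores; alphanumeric first and last character; no `goog` prefix; no `google` substring) and is not a substitute for the official conventions. The function name `valid_bucket_name` is an assumption.

```shell
# Partial check of Cloud Storage bucket naming rules. Names that contain
# dots have additional rules that this sketch does not enforce.
valid_bucket_name() {
  name=$1
  case $name in
    goog*|*google*) return 1 ;;   # reserved prefix and substring
  esac
  printf '%s\n' "$name" | grep -Eq '^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$'
}

# Example:
#   valid_bucket_name "$BUCKET_NAME" || echo "Pick a different bucket name"
```

Note that the check takes the bare name, without the `gs://` prefix.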

Configure access control to the bucket

A4 compute recipes access Google Cloud Storage buckets using the Kubernetes default ServiceAccount. You need to grant this account the rights to access the Google Cloud Storage bucket.

gcloud storage buckets add-iam-policy-binding gs://BUCKET_NAME \
--role=roles/storage.objectAdmin \
--member=principal://iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/PROJECT_ID.svc.id.goog/subject/ns/default/sa/default \
--condition=None

gcloud storage buckets add-iam-policy-binding gs://BUCKET_NAME \
--role=roles/storage.legacyBucketReader \
--member=principal://iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/PROJECT_ID.svc.id.goog/subject/ns/default/sa/default \
--condition=None

Replace the following:

BUCKET_NAME: the name of your bucket.
PROJECT_ID: your Google Cloud project ID.
PROJECT_NUMBER: your numerical Google Cloud project number.

You can retrieve the project number from Cloud Console or using the following command:

PROJECT_NUMBER=$(gcloud projects describe PROJECT_ID --format="value(projectNumber)")

Replace the following:

PROJECT_ID: your Google Cloud project ID.
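Because the `--member` value mixes the project number and the project ID, it is easy to mistype. As a sketch, the string can be assembled by a small helper; `workload_identity_member` is a hypothetical name, and the namespace and ServiceAccount default to `default`, as in the commands above.

```shell
# Build the Workload Identity Federation principal for a Kubernetes
# ServiceAccount. Arguments: PROJECT_NUMBER PROJECT_ID [NAMESPACE] [KSA].
workload_identity_member() {
  project_number=$1
  project_id=$2
  namespace=${3:-default}
  service_account=${4:-default}
  printf 'principal://iam.googleapis.com/projects/%s/locations/global/workloadIdentityPools/%s.svc.id.goog/subject/ns/%s/sa/%s\n' \
    "$project_number" "$project_id" "$namespace" "$service_account"
}

# Example:
#   MEMBER=$(workload_identity_member "$PROJECT_NUMBER" "$PROJECT_ID")
#   gcloud storage buckets add-iam-policy-binding "gs://$BUCKET_NAME" \
#     --role=roles/storage.objectAdmin --member="$MEMBER" --condition=None
```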

Set up an Artifact Registry

If you use Cloud KMS for repository encryption, create your artifact registry by using the instructions here.
If you don't use Cloud KMS, you can create your repository by using the following command:
```
  gcloud artifacts repositories create REPOSITORY \
      --repository-format=docker \
      --location=LOCATION \
      --description="DESCRIPTION"
```
Replace the following:
- REPOSITORY: the name of the repository. For each repository location in a project, repository names must be unique.
- LOCATION: the regional or multi-regional location for the repository. You can omit this flag if you set a default region.
- DESCRIPTION: a description of the repository. Don't include sensitive data because repository descriptions are not encrypted.
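Once the repository exists, images are addressed as `LOCATION-docker.pkg.dev/PROJECT_ID/REPOSITORY/IMAGE:TAG`. The helpers below are an illustrative sketch of that naming scheme; the function names and the `latest` default tag are assumptions, and the image name in the example is a placeholder.

```shell
# Hostname of the Docker registry for a given Artifact Registry location.
registry_host() {
  printf '%s-docker.pkg.dev\n' "$1"
}

# Full image reference. Arguments: LOCATION PROJECT_ID REPOSITORY IMAGE [TAG].
image_uri() {
  printf '%s/%s/%s/%s:%s\n' "$(registry_host "$1")" "$2" "$3" "$4" "${5:-latest}"
}

# Example:
#   gcloud auth configure-docker "$(registry_host "$LOCATION")"
#   docker push "$(image_uri "$LOCATION" "$PROJECT_ID" "$REPOSITORY" my-image v1)"
```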

Create a GKE Cluster environment with A4 Node Pools

You'll use the Cluster Toolkit to create your GKE cluster environment. The Cluster Toolkit blueprint used in this setup creates and configures the following components:

VPC networks, subnets, routers, and firewall rules.
A GKE cluster with the required features enabled.
Service accounts with the required permissions.
An A4 node pool with a4-highgpu-8g nodes.
JobSet and Kueue APIs.
Cloud Storage buckets with hierarchical namespace enabled for training data and checkpoints.

The A4 compute recipes have been validated on a cluster created with the v1.51.1 version of the Cluster Toolkit.

Configure Application Default Credentials

Before deploying the Cluster Toolkit blueprint, you need to configure Application Default Credentials (ADC).
```
gcloud auth application-default login
```
You will be prompted to open your web browser and authenticate to Google Cloud.
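To check whether credentials are already in place, you can look for the ADC file. This sketch assumes the usual lookup order on Linux and macOS: the `GOOGLE_APPLICATION_CREDENTIALS` environment variable first, then the file that `gcloud auth application-default login` writes under `$HOME/.config/gcloud`. It ignores other sources, such as a custom gcloud configuration directory or an attached service account, and the function names are hypothetical.

```shell
# Path where Application Default Credentials are expected to be found.
adc_file() {
  if [ -n "${GOOGLE_APPLICATION_CREDENTIALS:-}" ]; then
    printf '%s\n' "$GOOGLE_APPLICATION_CREDENTIALS"
  else
    printf '%s\n' "$HOME/.config/gcloud/application_default_credentials.json"
  fi
}

# Succeeds when the credentials file exists and is not empty.
has_adc() {
  [ -s "$(adc_file)" ]
}

# Example:
#   has_adc || gcloud auth application-default login
```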

Clone the Cluster Toolkit from the GitHub repository:

git clone --branch v1.51.1 --single-branch https://github.com/GoogleCloudPlatform/cluster-toolkit

Install the Cluster Toolkit:
```
cd cluster-toolkit && make
```

Deploy the cluster:

./gcluster deploy \
examples/gke-a4/gke-a4.yaml \
--backend-config "bucket=BUCKET_NAME" \
--vars "deployment_name=CLUSTER_NAME" \
--vars "project_id=PROJECT_ID" \
--vars "region=COMPUTE_REGION" \
--vars "zone=COMPUTE_ZONE" \
--vars "authorized_cidr=AUTHORIZED_CIDR" \
--vars "extended_reservation=RESERVATION_NAME" \
--vars "static_node_count=NODE_COUNT" \
--vars "system_node_pool_disk_size_gb=200"

Replace the following:

BUCKET_NAME: the name of the Cloud Storage bucket created in the previous step. Don't use the gs:// prefix in the name.
CLUSTER_NAME: the name for your cluster. Make sure that the name is shorter than 16 characters.
PROJECT_ID: the project ID of your project.
COMPUTE_REGION: the compute region for the cluster.
COMPUTE_ZONE: the compute zone for the node pool of A4 machines.
AUTHORIZED_CIDR: the IP address range that is allowed to connect to the cluster. This CIDR block must include the IP address of the machine that runs the Cluster Toolkit. To allow access from any IP address, use 0.0.0.0/0.
NODE_COUNT: the number of A4 nodes to provision in your cluster.
RESERVATION_NAME: the name of your reservation. To target a specific block within your reservation when creating the node pool, use the following format: RESERVATION_NAME/reservationBlocks/BLOCK_NAME. To get the names of the blocks that are available for your reservation, run the following command:
```
gcloud beta compute reservations blocks list RESERVATION_NAME \
--zone=COMPUTE_ZONE --format "value(name)"
```
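A deployment can take a while to fail on a bad input, so it may be worth checking the values first. The sketch below covers three constraints: the cluster name is shorter than 16 characters, the bucket name carries no `gs://` prefix, and the zone belongs to the chosen region. The last check is an assumption based on the usual `REGION-LETTER` zone naming and is not stated by this guide. All function names are hypothetical.

```shell
# Succeeds when the cluster name is non-empty and shorter than 16 characters.
valid_cluster_name() {
  [ -n "$1" ] && [ "${#1}" -lt 16 ]
}

# Strip an optional gs:// prefix; --backend-config expects the bare name.
bare_bucket_name() {
  printf '%s\n' "${1#gs://}"
}

# Derive the region from a zone by dropping the trailing "-<suffix>".
zone_to_region() {
  printf '%s\n' "${1%-*}"
}

# Succeeds when the zone (first argument) lies in the region (second).
zone_in_region() {
  [ "$(zone_to_region "$1")" = "$2" ]
}

# Example:
#   valid_cluster_name "$CLUSTER_NAME" || echo "Cluster name is too long"
#   zone_in_region "$COMPUTE_ZONE" "$COMPUTE_REGION" || echo "Zone/region mismatch"
```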

Verify cluster settings

After the Cluster Toolkit blueprint deployment completes, verify the key configurations.

Get cluster credentials:
```
gcloud container clusters get-credentials CLUSTER_NAME \
--location COMPUTE_REGION
```
Replace the following:
- CLUSTER_NAME: the name of your cluster.
- COMPUTE_REGION: the region of your cluster.

List the Kueue local queues:
```
kubectl get queues
```
You should see output similar to the following:
```
NAME   CLUSTERQUEUE   PENDING WORKLOADS   ADMITTED WORKLOADS
a4     a4             0                   0
```
The blueprint configures Kueue to use a4-high as the default name for both the local queue and the cluster queue.
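If you want to script this check, the queue listing can be parsed from standard input. The sketch below is illustrative: `has_local_queue` and `cluster_queue_of` are hypothetical helpers, and the queue name is a parameter, so pass whichever name your cluster actually reports.

```shell
# Reads `kubectl get queues` output on stdin and succeeds when a local
# queue with the given name is listed.
has_local_queue() {
  awk -v q="$1" '$1 == q { found = 1 } END { exit !found }'
}

# Reads the same output and prints the cluster queue that the named local
# queue points to (the second column).
cluster_queue_of() {
  awk -v q="$1" '$1 == q { print $2 }'
}

# Example:
#   kubectl get queues | has_local_queue a4 || echo "Local queue not found"
```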

Make sure that all A4 nodes are in the ready state:

kubectl get nodes

You should see output similar to the following:

NAME                                             STATUS   ROLES    AGE    VERSION
gke-dsm-a4-a4-highgpu-8g-a4-pool-79c66879-08rx   Ready    <none>   161m   v1.32.4-gke.1236000
gke-dsm-a4-a4-highgpu-8g-a4-pool-79c66879-4l4m   Ready    <none>   161m   v1.32.4-gke.1236000
gke-dsm-a4-system-6bf3b1fc-7kl7                  Ready    <none>   165m   v1.32.4-gke.1236000
...
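For larger node pools, counting Ready nodes by eye is error prone. As a sketch, the listing can be filtered for A4 nodes whose status is exactly `Ready`. This assumes node names contain `a4-highgpu-8g`, as in the sample output above, and the function names are hypothetical.

```shell
# Reads `kubectl get nodes` output on stdin and prints how many A4 nodes
# report a status of exactly "Ready". Cordoned nodes, which show
# "Ready,SchedulingDisabled", are not counted.
count_ready_a4_nodes() {
  awk '$1 ~ /a4-highgpu-8g/ && $2 == "Ready" { n++ } END { print n + 0 }'
}

# Succeeds when the number of Ready A4 nodes on stdin equals the argument.
expect_ready_a4_nodes() {
  [ "$(count_ready_a4_nodes)" -eq "$1" ]
}

# Example:
#   kubectl get nodes | expect_ready_a4_nodes "$NODE_COUNT" \
#     || echo "Not all A4 nodes are Ready yet"
```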

Additional permissions for MaxText recipes

Grant the storage.admin role to the custom IAM node pool service account created by the Cluster Toolkit blueprint. This is required to support some recipes.

gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:CLUSTER_NAME-gke-np-sa@PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/storage.admin"

Replace the following:

PROJECT_ID: the project ID of your project.
CLUSTER_NAME: the name for your cluster.
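The `--member` value follows the pattern shown in the command above. A small helper can assemble it from the two placeholders; `node_pool_sa_member` is a hypothetical name used only for this sketch.

```shell
# Build the IAM member string for the node pool service account.
# Arguments: CLUSTER_NAME PROJECT_ID.
node_pool_sa_member() {
  printf 'serviceAccount:%s-gke-np-sa@%s.iam.gserviceaccount.com\n' "$1" "$2"
}

# Example:
#   gcloud projects add-iam-policy-binding "$PROJECT_ID" \
#     --member="$(node_pool_sa_member "$CLUSTER_NAME" "$PROJECT_ID")" \
#     --role="roles/storage.admin"
```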

What's next

Once you have set up your GKE cluster with A4 node pools, you can proceed to deploy and run your benchmark recipes.

Clean up the environment

To remove the resources created while setting up the environment, follow the instructions below.

Clean up resources created by Cluster Toolkit

To remove resources created by the Cluster Toolkit blueprint:

cd ~/cluster-toolkit
./gcluster destroy DEPLOYMENT_NAME

Replace the following:

DEPLOYMENT_NAME: the name you used during the deployment. This is the name of your cluster.

Remove Cloud Storage buckets

To remove a Cloud Storage bucket in your environment, run the command that follows the note.

IMPORTANT: This command removes the bucket and all objects within it. You cannot recover them after the command runs.

gcloud storage rm -r gs://BUCKET_NAME

Replace the following:

BUCKET_NAME: the name of your bucket
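Because the deletion is irreversible, you may want to guard it with an explicit confirmation. The following sketch asks the user to retype the bucket name before proceeding; `confirm_delete` is a hypothetical helper and not part of the gcloud CLI.

```shell
# Prompt on stderr and succeed only when the line read from stdin matches
# the expected bucket name exactly.
confirm_delete() {
  expected=$1
  printf 'Type the bucket name (%s) to confirm deletion: ' "$expected" >&2
  IFS= read -r answer || return 1
  [ "$answer" = "$expected" ]
}

# Example:
#   confirm_delete "$BUCKET_NAME" && gcloud storage rm -r "gs://$BUCKET_NAME"
```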

Remove Artifact Registry

To delete the Artifact Registry:

gcloud artifacts repositories delete REPOSITORY --location=LOCATION

Replace the following:

REPOSITORY: the name of your repository
LOCATION: the location of your repository

Get Help

If you encounter any issues or have questions about this setup, use one of the following resources:

Consult the official GKE documentation.
Check the issues section of this repository for known problems and solutions.
Reach out to Google Cloud support.
