| 1 | +<!-- mdformat global-off --> |
| 2 | +# Pretrain llama3-1-70b workloads on a4x GKE Node pools with NVIDIA NeMo Framework |
| 3 | + |
| 4 | +This recipe outlines the steps for running a llama3-1-70b pretraining |
| 5 | +workload on [a4x GKE Node pools](https://cloud.google.com/kubernetes-engine) by using the |
| 6 | +[NVIDIA NeMo framework](https://github.com/NVIDIA/nemo). |
| 7 | + |
| 8 | +## Orchestration and deployment tools |
| 9 | + |
| 10 | +For this recipe, the following setup is used: |
| 11 | + |
| 12 | +- Orchestration - [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine) |
| 13 | +- Pretraining job configuration and deployment - A Helm chart is used to |
| 14 | + configure and deploy the [Kubernetes JobSet](https://kubernetes.io/blog/2025/03/23/introducing-jobset) resource, which manages the execution of the |
| 15 | + [NeMo pretraining workload](https://github.com/NVIDIA/nemo). |
| 16 | + |
| 17 | +## Test environment |
| 18 | + |
| 19 | +This recipe has been optimized for and tested with the following configuration: |
| 20 | + |
| 21 | +- GKE cluster: to create your a4x GKE cluster, follow the |
| 22 | +  Cluster Toolkit [instructions](https://github.com/GoogleCloudPlatform/cluster-toolkit/tree/main/examples/gke-a4x). |
| 24 | + |
| 25 | +## Training dataset |
| 26 | + |
| 27 | +This recipe uses a mock pretraining dataset provided by the NeMo framework. |
| 28 | + |
| 29 | +## Docker container image |
| 30 | + |
| 31 | +This recipe uses the following Docker images: |
| 32 | + |
| 33 | +- `nvcr.io/nvidia/nemo:26.02.01` |
| 34 | +- `us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib-diagnostic-arm64:v1.1.0` |
| 35 | + |
| 36 | +## Run the recipe |
| 37 | + |
| 38 | +From your client workstation, complete the following steps: |
| 39 | + |
| 40 | +### Configure environment settings |
| 41 | + |
| 42 | +Set the environment variables to match your environment: |
| 43 | + |
| 44 | + ```bash |
| 45 | + export PROJECT_ID=<PROJECT_ID> |
| 46 | + export CLUSTER_REGION=<CLUSTER_REGION> |
| 47 | + export CLUSTER_NAME=<CLUSTER_NAME> |
| 48 | + export GCS_BUCKET=<GCS_BUCKET> # Note: path should not be prefixed with gs:// |
| 49 | + export KUEUE_NAME=<KUEUE_NAME> |
| 50 | + ``` |
| 51 | + |
| 52 | +Replace the following values: |
| 53 | + |
| 54 | + - `<PROJECT_ID>`: your Google Cloud project ID. |
| 55 | + - `<CLUSTER_REGION>`: the region where your cluster is located. |
| 56 | + - `<CLUSTER_NAME>`: the name of your GKE cluster. |
| 57 | + - `<GCS_BUCKET>`: the name of your Cloud Storage bucket. Don't include the `gs://` prefix. |
| 58 | + - `<KUEUE_NAME>`: the name of the Kueue local queue. The default queue created by the cluster toolkit is `a4x`. Make sure to verify the name of the local queue in your cluster. |
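
Before continuing, it can help to confirm that every variable is set and that the bucket name carries no `gs://` prefix. The following is a minimal sketch, not part of the recipe; the helper names `normalize_bucket` and `missing_vars` are hypothetical.

```bash
# Hypothetical helper: strip an accidental gs:// prefix and one trailing slash
# from a bucket name, then print the result.
normalize_bucket() {
  local bucket="$1"
  bucket="${bucket#gs://}"   # drop a leading gs:// if present
  bucket="${bucket%/}"       # drop one trailing slash if present
  printf '%s\n' "$bucket"
}

# Hypothetical helper: print the name of each required variable that is
# unset or empty, one per line. Prints nothing when all are set.
missing_vars() {
  local name value
  for name in PROJECT_ID CLUSTER_REGION CLUSTER_NAME GCS_BUCKET KUEUE_NAME; do
    eval "value=\${$name:-}"   # portable indirect lookup of the variable named in $name
    if [ -z "$value" ]; then
      printf '%s\n' "$name"
    fi
  done
}

normalize_bucket "gs://my-bucket/"   # prints: my-bucket
```

Running `missing_vars` after the `export` lines should print nothing; any name it prints still needs a value.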
| 59 | + |
| 60 | +Set the default project: |
| 61 | + |
| 62 | + ```bash |
| 63 | + gcloud config set project $PROJECT_ID |
| 64 | + ``` |
| 65 | + |
| 66 | +### Get the recipe |
| 67 | + |
| 68 | +Clone the `gpu-recipes` repository and set a reference to the recipe folder. |
| 69 | + |
| 70 | +```bash |
| 71 | +git clone https://github.com/ai-hypercomputer/gpu-recipes.git |
| 72 | +cd gpu-recipes |
| 73 | +export REPO_ROOT=`git rev-parse --show-toplevel` |
| 74 | +export RECIPE_ROOT=$REPO_ROOT/training/a4x/llama3_70b/nemo-gke/nemo2602/64gpus-fp8mx-lustre/recipe |
| 75 | +cd $RECIPE_ROOT |
| 76 | +``` |
| 77 | + |
| 78 | +### Get cluster credentials |
| 79 | + |
| 80 | +```bash |
| 81 | +gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION |
| 82 | +``` |
| 83 | +### Set up Lustre PVCs |
| 84 | +1) Create a Lustre instance |
| 85 | + |
| 86 | +Follow [Create Lustre Instance](https://docs.cloud.google.com/managed-lustre/docs/create-instance) to create a Lustre instance. |
| 87 | + |
| 88 | +**Create the instance in the same zone and network as your cluster.** |
| 89 | + |
| 90 | +Run `gcloud lustre instances list --location $ZONE --project $PROJECT_ID`, where `$ZONE` is the zone of your Lustre instance, and take note of the filesystem, name, and network. The output looks something like this: |
| 91 | +``` |
| 92 | +capacityGib: '126000' |
| 93 | +createTime: '2026-01-07T00:24:57.572296415Z' |
| 94 | +filesystem: <FILESYSTEM> |
| 95 | +mountPoint: <LUSTRE_IP>@tcp:/<FILESYSTEM> |
| 96 | +name: projects/<PROJECT_ID>/locations/<ZONE>/instances/<NAME> |
| 97 | +network: projects/<PROJECT_ID>/global/networks/<NETWORK> |
| 98 | +perUnitStorageThroughput: '1000' |
| 99 | +state: ACTIVE |
| 100 | +uid: 9b9400c4-a669-48d0-89de-fe4bce4664fb |
| 101 | +updateTime: '2026-04-27T20:38:55.794373239Z' |
| 102 | +``` |
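
The `mountPoint` field already contains two of the values needed later (`<LUSTRE_IP>` and `<FILESYSTEM>`). As a small sketch, assuming the field always has the `<LUSTRE_IP>@tcp:/<FILESYSTEM>` shape shown above, shell parameter expansion can split it. The function name `parse_mount_point` and the sample address are illustrative only.

```bash
# Hypothetical helper: split "<LUSTRE_IP>@tcp:/<FILESYSTEM>" into two variables.
parse_mount_point() {
  local mount_point="$1"
  LUSTRE_IP="${mount_point%%@*}"     # everything before the first '@'
  FILESYSTEM="${mount_point##*:/}"   # everything after the last ':/'
}

# Example values only; substitute the mountPoint reported for your instance.
parse_mount_point "10.0.0.2@tcp:/lfs01"
echo "LUSTRE_IP=$LUSTRE_IP FILESYSTEM=$FILESYSTEM"
```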
| 103 | + |
| 104 | +2) Upload training dataset to the instance |
| 105 | +For commands, see: TODO |
| 106 | + |
| 107 | +3) Create PersistentVolumes for the instance and claim the volume |
| 108 | + |
| 109 | +Replace the following variables in `lustre_pv.yaml`: |
| 110 | + - `<FILESYSTEM>`: Filesystem name of the instance |
| 111 | + - `<LUSTRE_IP>`: IP of Lustre instance |
| 112 | + - `<PROJECT_ID>`: Project ID |
| 113 | + - `<ZONE>`: Zone of your compute nodes and Lustre instance |
| 114 | + - `<NAME>`: Lustre instance name |
| 115 | + - `<NETWORK>`: Network name of your cluster and instance |
| 116 | + |
| 117 | +```bash |
| 118 | +kubectl apply -f ./lustre_pv.yaml |
| 119 | +``` |
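
If you prefer not to edit the file by hand, the six placeholders can be filled with `sed`. This is a sketch under assumptions: the variable names `LUSTRE_NAME`, `NETWORK`, and `ZONE` and both helper names are hypothetical, and the values are assumed not to contain the `|` character used as the `sed` delimiter.

```bash
# Hypothetical helper: copy a template to a new file, replacing each placeholder
# with the value of a shell variable. LUSTRE_NAME, NETWORK and ZONE are assumed
# to have been set by you; they are not defined elsewhere in this recipe.
fill_lustre_pv() {
  local template="$1" output="$2"
  sed \
    -e "s|<FILESYSTEM>|${FILESYSTEM}|g" \
    -e "s|<LUSTRE_IP>|${LUSTRE_IP}|g" \
    -e "s|<PROJECT_ID>|${PROJECT_ID}|g" \
    -e "s|<ZONE>|${ZONE}|g" \
    -e "s|<NAME>|${LUSTRE_NAME}|g" \
    -e "s|<NETWORK>|${NETWORK}|g" \
    "$template" > "$output"
}

# Hypothetical helper: succeed only if no <UPPER_CASE> placeholder is left in the file.
no_placeholders_left() {
  ! grep -q '<[A-Z_][A-Z_]*>' "$1"
}

# Example usage (not executed here):
#   fill_lustre_pv lustre_pv.yaml lustre_pv.filled.yaml \
#     && no_placeholders_left lustre_pv.filled.yaml \
#     && kubectl apply -f lustre_pv.filled.yaml
```

Writing to a separate output file keeps the original template intact, so the substitution can be repeated with different values.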
| 120 | + |
| 121 | +4) Add the volume claim to `values.yaml`, using the claim name defined in `lustre_pv.yaml` |
| 122 | + |
| 123 | +```yaml |
| 124 | +... |
| 125 | +pvcMounts: |
| 126 | + - claimName: "asq-0106-lustre-pvc" |
| 127 | + mountPath: "/lustrefs" |
| 128 | +... |
| 129 | +``` |
| 130 | + |
| 131 | +### Configure and submit a pretraining job |
| 132 | + |
| 133 | +#### Using 32 nodes (64 GPUs) and FP8 precision with Lustre data loading and checkpointing |
| 134 | +To run the job with data loading and checkpoint saving, run the following: |
| 135 | + |
| 136 | +```bash |
| 137 | +cd $RECIPE_ROOT |
| 138 | +export WORKLOAD_NAME=$USER-a4x-llama3-1-70b-32node |
| 139 | +export DATASET_TYPE=<DATASET_TYPE> |
| 140 | +export DATASET_PATHS=<DATASET_PATHS> |
| 141 | +export INDEX_MAPPING_DIR=<INDEX_MAPPING_DIR> |
| 142 | +export CKPT_SAVE_DIR=<CKPT_SAVE_DIR> |
| 143 | +export CKPT_SAVE_INTERVAL=<CKPT_SAVE_INTERVAL> |
| 144 | +helm install $WORKLOAD_NAME . -f values.yaml \ |
| 145 | +--set-file workload_launcher=launcher.sh \ |
| 146 | +--set-file workload_config=llama3-1-70b-fp8cs-gbs2048-gpus64.py \ |
| 147 | +--set workload.image=nvcr.io/nvidia/nemo:26.02.01 \ |
| 148 | +--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \ |
| 149 | +--set volumes.gcsMounts[0].mountPath=/job-logs \ |
| 150 | +--set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \ |
| 151 | +--set workload.envs[5].value=${DATASET_TYPE} \ |
| 152 | +--set workload.envs[6].value="/lustrefs/data/${DATASET_PATHS}" \ |
| 153 | +--set workload.envs[7].value=/lustrefs/data/${INDEX_MAPPING_DIR} \ |
| 154 | +--set workload.envs[8].value=/lustrefs/ckpt/${CKPT_SAVE_DIR} \ |
| 155 | +--set workload.envs[9].value="${CKPT_SAVE_INTERVAL}" \ |
| 156 | +--set queue=${KUEUE_NAME} |
| 157 | +``` |
| 158 | + |
| 159 | +Replace the following values: |
| 160 | + - `<DATASET_TYPE>`: The type of dataset used (see [Megatron-Bridge data arguments](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/r0.3.0/scripts/performance#data-arguments)) |
| 161 | + - `<DATASET_PATHS>`: Paths to your dataset (for rp2 dataset) |
| 162 | + - `<INDEX_MAPPING_DIR>`: Index mapping dir (for rp2 dataset) |
| 163 | + - `<CKPT_SAVE_DIR>`: The directory where you want to save checkpoints |
| 164 | + - `<CKPT_SAVE_INTERVAL>`: Save checkpoint every CKPT_SAVE_INTERVAL train steps |
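
Because `CKPT_SAVE_INTERVAL` is a count of training steps, a quick check that it is a positive integer can catch a typo before the job is submitted. The following is a sketch only; the function name `is_positive_int` is hypothetical.

```bash
# Hypothetical helper: succeed only if the argument is a positive decimal integer.
is_positive_int() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit character
  esac
  [ "$1" -gt 0 ]               # reject zero
}

if is_positive_int "${CKPT_SAVE_INTERVAL:-}"; then
  echo "CKPT_SAVE_INTERVAL looks valid: ${CKPT_SAVE_INTERVAL}"
else
  echo "CKPT_SAVE_INTERVAL must be a positive integer number of steps" >&2
fi
```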
| 165 | + |
| 166 | +Note: Edit `recipe.dataset.num_workers = 8` in `run_script.py` to change the number of dataloading workers. |
| 167 | + |
| 168 | +This recipe uses the NVIDIA NeMo 26.02.01 container, which uses Megatron-Bridge for checkpointing and dataloading. See [checkpointing arguments](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/r0.3.0/scripts/performance#checkpointing-arguments) and [data arguments](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/r0.3.0/scripts/performance#data-arguments) for explanations of the checkpointing options and dataset types. |
| 169 | + |
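| 170 | +You can also edit `values.yaml` directly. Helm does not expand shell variables in values files, so replace each `${...}` placeholder below with its literal value. |
| 171 | +```yaml |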
| 172 | +- name: DATASET_TYPE |
| 173 | + value: "rp2" |
| 174 | +- name: DATASET_PATHS |
| 175 | + value: "/lustrefs/data/${DATASET_PATHS}" |
| 176 | +- name: INDEX_MAPPING_DIR |
| 177 | + value: "/lustrefs/data/${INDEX_MAPPING_DIR}" |
| 178 | +- name: CKPT_SAVE_DIR |
| 179 | + value: "/lustrefs/ckpt/${CKPT_SAVE_DIR}" |
| 180 | +- name: CKPT_SAVE_INTERVAL |
| 181 | + value: "10" |
| 182 | +``` |
| 183 | + |
| 184 | +To resume from a saved checkpoint, run the following: |
| 185 | +```bash |
| 186 | +cd $RECIPE_ROOT |
| 187 | +export WORKLOAD_NAME=$USER-a4x-llama3-1-70b-32node |
| 188 | +export CKPT_LOAD_DIR=<CKPT_LOAD_DIR> |
| 189 | +export CKPT_LOAD_STEP=<CKPT_LOAD_STEP> |
| 190 | +helm install $WORKLOAD_NAME . -f values.yaml \ |
| 191 | +--set-file workload_launcher=launcher.sh \ |
| 192 | +--set-file workload_config=llama3-1-70b-fp8cs-gbs2048-gpus64.py \ |
| 193 | +--set workload.image=nvcr.io/nvidia/nemo:26.02.01 \ |
| 194 | +--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \ |
| 195 | +--set volumes.gcsMounts[0].mountPath=/job-logs \ |
| 196 | +--set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \ |
| 197 | +--set workload.envs[10].value=/lustrefs/ckpt/${CKPT_LOAD_DIR} \ |
| 198 | +--set workload.envs[11].value="${CKPT_LOAD_STEP}" \ |
| 199 | +--set queue=${KUEUE_NAME} |
| 200 | +``` |
| 201 | + |
| 202 | +Or edit `values.yaml` directly, replacing `${CKPT_LOAD_DIR}` with its literal value: |
| 203 | +```yaml |
| 204 | +- name: CKPT_LOAD_DIR |
| 205 | + value: "/lustrefs/ckpt/${CKPT_LOAD_DIR}" |
| 206 | +- name: CKPT_LOAD_STEP |
| 207 | + value: "10" |
| 208 | +``` |
| 209 | + |
| 210 | +**Examples** |
| 211 | + |
| 212 | +- To set the number of training steps to 100, run the following command from your client: |
| 213 | + |
| 214 | +```bash |
| 215 | +cd $RECIPE_ROOT |
| 216 | +export WORKLOAD_NAME=$USER-a4x-llama3-1-70b-32node |
| 217 | +helm install $WORKLOAD_NAME . -f values.yaml \ |
| 218 | +--set-file workload_launcher=launcher.sh \ |
| 219 | +--set-file workload_config=llama3-1-70b-fp8cs-gbs2048-gpus64.py \ |
| 220 | +--set workload.image=nvcr.io/nvidia/nemo:26.02.01 \ |
| 221 | +--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \ |
| 222 | +--set volumes.gcsMounts[0].mountPath=/job-logs \ |
| 223 | +--set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \ |
| 224 | +--set workload.envs[5].value=${DATASET_TYPE} \ |
| 225 | +--set workload.envs[6].value="/lustrefs/data/${DATASET_PATHS}" \ |
| 226 | +--set workload.envs[7].value=/lustrefs/data/${INDEX_MAPPING_DIR} \ |
| 227 | +--set workload.envs[8].value=/lustrefs/ckpt/${CKPT_SAVE_DIR} \ |
| 228 | +--set workload.envs[9].value="${CKPT_SAVE_INTERVAL}" \ |
| 229 | +--set queue=${KUEUE_NAME} \ |
| 230 | +--set workload.arguments[0]="trainer.max_steps=100" |
| 231 | +``` |
| 232 | + |
| 233 | + |
| 234 | +### Monitor the job |
| 235 | + |
| 236 | +To check the status of pods in your job, run the following command: |
| 237 | + |
| 238 | +```bash |
| 239 | +kubectl get pods | grep JOB_NAME_PREFIX |
| 240 | +``` |
| 241 | + |
| 242 | +Replace the following: |
| 243 | + |
| 244 | +- JOB_NAME_PREFIX - your job name prefix. For example $USER-a4x-llama3-1-70b-32node. |
| 245 | + |
| 246 | +To get the logs for one of the pods, run the following command: |
| 247 | + |
| 248 | +``` |
| 249 | +kubectl logs POD_NAME |
| 250 | +``` |
| 251 | + |
| 252 | +Information about the training job's progress, including crucial details such as |
| 253 | +loss, step count, and step time, is generated by the rank 0 process. |
| 254 | +This process runs on the pod whose name begins with |
| 255 | +`JOB_NAME_PREFIX-workload-0-0`. |
| 256 | +For example: `$USER-a4x-llama3-1-70b-32node-workload-0-0-s9zrv`. |
| 257 | + |
| 258 | +The logs display the checkpoint save directory: |
| 259 | +``` |
| 260 | +28_worker0/0 successfully saved checkpoint from iteration 40 to /lustrefs/ckpt/asq-llama70b-ckpt-64gpu-lustre-2026-04-03-03-51-29 [ t 1/2, p 1/4 ] |
| 261 | +``` |
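
To list only the checkpoint saves, the log can be filtered with `sed`. This sketch assumes every save line has the same wording as the sample above; the function name `extract_checkpoints` is hypothetical.

```bash
# Hypothetical helper: read log lines on stdin and print "<iteration> <directory>"
# for each line that reports a saved checkpoint. Other lines are dropped.
extract_checkpoints() {
  sed -n 's/.*successfully saved checkpoint from iteration \([0-9][0-9]*\) to \([^ ]*\).*/\1 \2/p'
}

# Example usage (not executed here):
#   kubectl logs POD_NAME | extract_checkpoints
```

The iteration number in the first column is the value to pass as `CKPT_LOAD_STEP` when resuming from that checkpoint.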
| 262 | + |
| 263 | +### Uninstall the Helm release |
| 264 | + |
| 265 | +You can delete the job and other resources created by the Helm chart. To |
| 266 | +uninstall the Helm release, run the following command from your client: |
| 267 | + |
| 268 | +```bash |
| 269 | +helm uninstall $USER-a4x-llama3-1-70b-32node |
| 270 | +``` |
| 271 | + |