Commit 15e5c6c

Move a4x storage recipes
1 parent f1a5bb6 commit 15e5c6c

48 files changed

Lines changed: 5807 additions & 0 deletions

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
1+
# Copyright 2026 Google LLC
2+
#
3+
# Licensed under the Apache License, Version 2.0 (the "License");
4+
# you may not use this file except in compliance with the License.
5+
# You may obtain a copy of the License at
6+
#
7+
# http://www.apache.org/licenses/LICENSE-2.0
8+
#
9+
# Unless required by applicable law or agreed to in writing, software
10+
# distributed under the License is distributed on an "AS IS" BASIS,
11+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12+
# See the License for the specific language governing permissions and
13+
# limitations under the License.
14+
15+
apiVersion: v2
16+
name: a4x_jobset_workload
17+
description: a4x_jobset_workload
18+
type: application
19+
version: 0.1.0
20+
appVersion: "1.16.0"
21+
Lines changed: 264 additions & 0 deletions
@@ -0,0 +1,264 @@
1+
<!-- mdformat global-off -->
2+
# Pretrain llama3-1-70b workloads on a4x GKE node pools with NVIDIA NeMo Framework
3+
4+
This recipe outlines the steps for running a llama3-1-70b pretraining
5+
workload on [a4x GKE Node pools](https://cloud.google.com/kubernetes-engine) by using the
6+
[NVIDIA NeMo framework](https://github.com/NVIDIA/nemo).
7+
8+
## Orchestration and deployment tools
9+
10+
For this recipe, the following setup is used:
11+
12+
- Orchestration - [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine)
13+
- Pretraining job configuration and deployment - A Helm chart is used to
14+
configure and deploy the [Kubernetes JobSet](https://kubernetes.io/blog/2025/03/23/introducing-jobset) resource, which manages the execution of the
15+
[NeMo pretraining workload](https://github.com/NVIDIA/nemo).
16+
17+
## Test environment
18+
19+
This recipe has been optimized for and tested with the following configuration:
20+
21+
- GKE cluster
22+
Please follow Cluster Toolkit [instructions](https://github.com/GoogleCloudPlatform/cluster-toolkit/tree/main/examples/gke-a4x)
23+
to create your a4x GKE cluster.
24+
25+
## Training dataset
26+
27+
This recipe uses a mock pretraining dataset provided by the NeMo framework.
28+
29+
## Docker container image
30+
31+
This recipe uses the following Docker images:
32+
33+
- `nvcr.io/nvidia/nemo:26.02.01`
34+
- `us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib-diagnostic-arm64:v1.1.0`
35+
36+
## Run the recipe
37+
38+
From your client workstation, complete the following steps:
39+
40+
### Configure environment settings
41+
42+
Set the environment variables to match your environment:
43+
44+
```bash
45+
export PROJECT_ID=<PROJECT_ID>
46+
export CLUSTER_REGION=<CLUSTER_REGION>
47+
export CLUSTER_NAME=<CLUSTER_NAME>
48+
export GCS_BUCKET=<GCS_BUCKET> # Note: path should not be prefixed with gs://
49+
export KUEUE_NAME=<KUEUE_NAME>
50+
```
51+
52+
Replace the following values:
53+
54+
- `<PROJECT_ID>`: your Google Cloud project ID.
55+
- `<CLUSTER_REGION>`: the region where your cluster is located.
56+
- `<CLUSTER_NAME>`: the name of your GKE cluster.
57+
- `<GCS_BUCKET>`: the name of your Cloud Storage bucket. Don't include the `gs://` prefix.
58+
- `<KUEUE_NAME>`: the name of the Kueue local queue. The default queue created by Cluster Toolkit is `a4x`. Verify the name of the local queue in your cluster before you use it.
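
Because `GCS_BUCKET` must not carry the `gs://` prefix, a small guard can normalize the value and confirm that the variables above are set before you continue. The following is a minimal sketch only; the helper names `strip_gs_prefix` and `require_vars` are hypothetical and are not part of the recipe.

```bash
# Hypothetical helpers; not part of the gpu-recipes repository.

# Remove a leading "gs://" from a bucket name, if present.
strip_gs_prefix() {
  printf '%s\n' "${1#gs://}"
}

# Report every listed variable that is unset or empty.
# Returns non-zero if at least one is missing.
require_vars() {
  local missing=0 name value
  for name in "$@"; do
    # Indirect lookup of the variable called $name (POSIX-compatible).
    eval "value=\${${name}:-}"
    if [ -z "${value}" ]; then
      echo "missing: ${name}" >&2
      missing=1
    fi
  done
  return "${missing}"
}

# Example usage:
# GCS_BUCKET="$(strip_gs_prefix "${GCS_BUCKET}")"
# require_vars PROJECT_ID CLUSTER_REGION CLUSTER_NAME GCS_BUCKET KUEUE_NAME
```

Running `require_vars` before the later `gcloud` and `helm` commands surfaces an empty variable early, instead of as a confusing failure inside those tools.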
59+
60+
Set the default project:
61+
62+
```bash
63+
gcloud config set project $PROJECT_ID
64+
```
65+
66+
### Get the recipe
67+
68+
Clone the `gpu-recipes` repository and set a reference to the recipe folder.
69+
70+
```
71+
git clone https://github.com/ai-hypercomputer/gpu-recipes.git
72+
cd gpu-recipes
73+
export REPO_ROOT=`git rev-parse --show-toplevel`
74+
export RECIPE_ROOT=$REPO_ROOT/training/a4x/llama3_70b/nemo-gke/nemo2602/128gpus-fp8mx-gcs/recipe
75+
cd $RECIPE_ROOT
76+
```
77+
78+
### Get cluster credentials
79+
80+
```
81+
gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
82+
```
83+
### Set up GCS bucket PVCs
84+
1) Create GCS buckets for dataload and checkpointing.
85+
86+
Ensure the buckets are created with `--uniform-bucket-level-access` and `--enable-hierarchical-namespace`:
87+
88+
```
89+
export CKPT_BUCKET_NAME=a4x-storage-ckpt
90+
export DL_BUCKET_NAME=a4x-storage-data
91+
92+
gcloud storage buckets create "gs://${CKPT_BUCKET_NAME}" --location=$CLUSTER_REGION --uniform-bucket-level-access --enable-hierarchical-namespace
93+
94+
gcloud storage buckets create "gs://${DL_BUCKET_NAME}" --location=$CLUSTER_REGION --uniform-bucket-level-access --enable-hierarchical-namespace
95+
```
96+
97+
2) Upload the training dataset to the dataload bucket.
98+
For commands, see [Upload objects from a file system](https://docs.cloud.google.com/storage/docs/uploading-objects).
99+
100+
3) Create PersistentVolumes for the buckets and claim these volumes
101+
102+
Replace `<DL_BUCKET_NAME>` and `<CKPT_BUCKET_NAME>` in `gcs_pv.yaml` and run the following command.
103+
Ensure file cache is enabled for optimal performance.
104+
105+
```
106+
kubectl apply -f ./gcs_pv.yaml
107+
```
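
The placeholder replacement that precedes the `kubectl apply` step can also be scripted rather than done by hand. The sketch below assumes the placeholders appear in `gcs_pv.yaml` literally as `<DL_BUCKET_NAME>` and `<CKPT_BUCKET_NAME>`, and that the bucket names contain no `|` character; the helper name `render_gcs_pv` and the output file name are hypothetical.

```bash
# Hypothetical helper: fill the bucket placeholders in a PV manifest template.
# Takes the template path as $1 and writes the rendered manifest to stdout.
# Relies on DL_BUCKET_NAME and CKPT_BUCKET_NAME being set in the environment.
render_gcs_pv() {
  sed -e "s|<DL_BUCKET_NAME>|${DL_BUCKET_NAME}|g" \
      -e "s|<CKPT_BUCKET_NAME>|${CKPT_BUCKET_NAME}|g" \
      "$1"
}

# Example usage:
# render_gcs_pv ./gcs_pv.yaml > ./gcs_pv.rendered.yaml
# kubectl apply -f ./gcs_pv.rendered.yaml
```

Writing to a separate file leaves the checked-in template untouched, so it can be rendered again for a different pair of buckets.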
108+
109+
4) Add volume claims to values.yaml
110+
111+
```
112+
...
113+
gcsVolumes: true
114+
psVolumes: false
115+
pvcMounts:
116+
- claimName: "a4x-storage-ckpt-pvc"
117+
mountPath: "/gcsckpt"
118+
- claimName: "a4x-storage-data-pvc"
119+
mountPath: "/gcsdata"
120+
...
121+
```
122+
123+
Using PVCs gives the best performance, but you can also mount the GCS bucket by using the `gcsMounts` option.
124+
125+
### Configure and submit a pretraining job
126+
127+
#### Using 32 nodes (128 GPUs) with FP8 precision, GCS dataloading, and checkpointing
128+
To execute the job with dataloading and checkpoint saving, run the following:
129+
130+
```bash
131+
cd $RECIPE_ROOT
132+
export WORKLOAD_NAME=$USER-a4x-llama3-1-70b-32node
133+
export DATASET_TYPE=<DATASET_TYPE>
134+
export DATASET_PATHS=<DATASET_PATHS>
135+
export INDEX_MAPPING_DIR=<INDEX_MAPPING_DIR>
136+
export CKPT_SAVE_DIR=<CKPT_SAVE_DIR>
137+
export CKPT_SAVE_INTERVAL=<CKPT_SAVE_INTERVAL>
138+
helm install $WORKLOAD_NAME . -f values.yaml \
139+
--set-file workload_launcher=launcher.sh \
140+
--set-file workload_config=llama3-1-70b-fp8cs-gbs2048-gpus128.py \
141+
--set workload.image=nvcr.io/nvidia/nemo:26.02.01 \
142+
--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
143+
--set volumes.gcsMounts[0].mountPath=/job-logs \
144+
--set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
145+
--set workload.envs[5].value=${DATASET_TYPE} \
146+
--set workload.envs[6].value="/gcsdata/${DATASET_PATHS}" \
147+
--set workload.envs[7].value=/gcsdata/${INDEX_MAPPING_DIR} \
148+
--set workload.envs[8].value=/gcsckpt/${CKPT_SAVE_DIR} \
149+
--set workload.envs[9].value=${CKPT_SAVE_INTERVAL} \
150+
--set queue=${KUEUE_NAME}
151+
```
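
The positional `workload.envs[N]` indices are easy to mistype. As a sketch, the storage-related flags can be assembled in one place and previewed with Helm's `--dry-run` before anything is submitted. The helper name `storage_env_flags` is hypothetical. The index-to-variable mapping is taken from the command above and is assumed to match the order of `workload.envs` in `values.yaml`. The checkpoint interval is passed here as a bare number, matching the `values.yaml` form of this setting (`value: "10"`).

```bash
# Hypothetical helper: print one key=value pair per line for the
# dataset and checkpoint settings used by this recipe.
storage_env_flags() {
  printf '%s\n' \
    "workload.envs[5].value=${DATASET_TYPE}" \
    "workload.envs[6].value=/gcsdata/${DATASET_PATHS}" \
    "workload.envs[7].value=/gcsdata/${INDEX_MAPPING_DIR}" \
    "workload.envs[8].value=/gcsckpt/${CKPT_SAVE_DIR}" \
    "workload.envs[9].value=${CKPT_SAVE_INTERVAL}"
}

# Example usage: render the chart without installing it. Helm accepts
# comma-separated pairs in one --set, so the lines are joined with commas.
# Add the remaining flags from the full command (for example the
# --set-file options) as needed.
# helm install "$WORKLOAD_NAME" . -f values.yaml \
#   --dry-run --debug \
#   --set "$(storage_env_flags | paste -sd, -)" \
#   --set queue="${KUEUE_NAME}"
```

This joining approach does not work if any value contains a comma, because Helm treats commas inside `--set` as separators; in that case, pass each pair through its own `--set` flag.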
152+
153+
Replace the following values:
154+
- `<DATASET_TYPE>`: The type of dataset used (see [Megatron-Bridge data arguments](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/r0.3.0/scripts/performance#data-arguments))
155+
- `<DATASET_PATHS>`: Paths to your dataset (for rp2 dataset)
156+
- `<INDEX_MAPPING_DIR>`: Index mapping dir (for rp2 dataset)
157+
- `<CKPT_SAVE_DIR>`: The directory where you want to save checkpoints
158+
- `<CKPT_SAVE_INTERVAL>`: Save checkpoint every CKPT_SAVE_INTERVAL train steps
159+
160+
Note: Edit `recipe.dataset.num_workers = 8` in `run_script.py` to change the number of dataloading workers.
161+
162+
This recipe uses the NVIDIA NeMo 26.02.01 container, which uses Megatron-Bridge for checkpointing and dataloading. See [checkpointing arguments](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/r0.3.0/scripts/performance#checkpointing-arguments) and [data arguments](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/r0.3.0/scripts/performance#data-arguments) for details on the checkpointing options and dataset types.
163+
164+
You can also set these values directly in `values.yaml`:
165+
```
166+
- name: DATASET_TYPE
167+
value: "rp2"
168+
- name: DATASET_PATHS
169+
value: "/gcsdata/${DATASET_PATHS}"
170+
- name: INDEX_MAPPING_DIR
171+
value: "/gcsdata/${INDEX_MAPPING_DIR}"
172+
- name: CKPT_SAVE_DIR
173+
value: "/gcsckpt/${CKPT_SAVE_DIR}"
174+
- name: CKPT_SAVE_INTERVAL
175+
value: "10"
176+
```
177+
178+
To load a checkpoint, run the following:
179+
```bash
180+
cd $RECIPE_ROOT
181+
export WORKLOAD_NAME=$USER-a4x-llama3-1-70b-32node
182+
export CKPT_LOAD_DIR=<CKPT_LOAD_DIR>
183+
export CKPT_LOAD_STEP=<CKPT_LOAD_STEP>
184+
helm install $WORKLOAD_NAME . -f values.yaml \
185+
--set-file workload_launcher=launcher.sh \
186+
--set-file workload_config=llama3-1-70b-fp8cs-gbs2048-gpus128.py \
187+
--set workload.image=nvcr.io/nvidia/nemo:26.02.01 \
188+
--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
189+
--set volumes.gcsMounts[0].mountPath=/job-logs \
190+
--set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
191+
--set workload.envs[10].value=/gcsckpt/${CKPT_LOAD_DIR} \
192+
--set workload.envs[11].value="${CKPT_LOAD_STEP}" \
193+
--set queue=${KUEUE_NAME}
194+
```
195+
196+
Alternatively, edit `values.yaml` directly:
197+
```
198+
- name: CKPT_LOAD_DIR
199+
value: "/gcsckpt/${CKPT_LOAD_DIR}"
200+
- name: CKPT_LOAD_STEP
201+
value: "10"
202+
```
203+
204+
**Examples**
205+
206+
- To set the number of training steps to 100, run the following command from your client:
207+
208+
```bash
209+
cd $RECIPE_ROOT
210+
export WORKLOAD_NAME=$USER-a4x-llama3-1-70b-32node
211+
helm install $WORKLOAD_NAME . -f values.yaml \
212+
--set-file workload_launcher=launcher.sh \
213+
--set-file workload_config=llama3-1-70b-fp8cs-gbs2048-gpus128.py \
214+
--set workload.image=nvcr.io/nvidia/nemo:26.02.01 \
215+
--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
216+
--set volumes.gcsMounts[0].mountPath=/job-logs \
217+
--set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
218+
--set workload.envs[5].value=${DATASET_TYPE} \
219+
--set workload.envs[6].value="/gcsdata/${DATASET_PATHS}" \
220+
--set workload.envs[7].value=/gcsdata/${INDEX_MAPPING_DIR} \
221+
--set workload.envs[8].value=/gcsckpt/${CKPT_SAVE_DIR} \
222+
--set workload.envs[9].value=${CKPT_SAVE_INTERVAL} \
223+
--set queue=${KUEUE_NAME} \
224+
--set workload.arguments[0]="trainer.max_steps=100"
225+
```
226+
227+
228+
### Monitor the job
229+
230+
To check the status of pods in your job, run the following command:
231+
232+
```
233+
kubectl get pods | grep JOB_NAME_PREFIX
234+
```
235+
236+
Replace the following:
237+
238+
- `JOB_NAME_PREFIX`: your job name prefix. For example, `$USER-a4x-llama3-1-70b-32node`.
239+
240+
To get the logs for one of the pods, run the following command:
241+
242+
```
243+
kubectl logs POD_NAME
244+
```
245+
246+
Information about the training job's progress, including crucial details such as
247+
loss, step count, and step time, is generated by the rank 0 process.
248+
This process runs on the pod whose name begins with
249+
`JOB_NAME_PREFIX-workload-0-0`.
250+
For example: `$USER-a4x-llama3-1-70b-32node-workload-0-0-s9zrv`.
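
Finding the rank 0 pod by name can be scripted. The sketch below filters a list of pod names for the `-workload-0-0` pattern described above; the helper name `rank0_pod` is hypothetical, and the job name prefix is treated as a regular expression prefix, which is adequate for names made of letters, digits, and hyphens.

```bash
# Hypothetical helper: read pod names on stdin (one per line) and print the
# first one whose name begins with "<prefix>-workload-0-0".
rank0_pod() {
  grep -- "^$1-workload-0-0" | head -n 1
}

# Example usage:
# POD_NAME="$(kubectl get pods --no-headers -o custom-columns=NAME:.metadata.name \
#   | rank0_pod "$USER-a4x-llama3-1-70b-32node")"
# kubectl logs -f "$POD_NAME"
```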
251+
252+
The logs show the directory each checkpoint is saved to:
253+
```
254+
28_worker0/0 successfully saved checkpoint from iteration 40 to /gcsckpt/ckpt/asq-llama70b-ckpt-128gpu-gcs-2026-04-03-03-51-29 [ t 1/2, p 1/4 ]
255+
```
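
To list saved checkpoints from a long log, lines of this shape can be reduced to the iteration number and target directory. This is a sketch that assumes the log wording matches the sample line above exactly; the helper name `saved_checkpoints` is hypothetical.

```bash
# Hypothetical helper: read log lines on stdin and print "<iteration> <directory>"
# for every line that reports a successfully saved checkpoint.
saved_checkpoints() {
  sed -n 's/.*successfully saved checkpoint from iteration  *\([0-9][0-9]*\) to \([^ ]*\).*/\1 \2/p'
}

# Example usage:
# kubectl logs POD_NAME | saved_checkpoints
```

Lines that do not match the pattern are dropped, so the output contains only checkpoint events.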
256+
257+
### Uninstall the Helm release
258+
259+
You can delete the job and other resources created by the Helm chart. To
260+
uninstall the Helm release, run the following command from your client:
261+
262+
```bash
263+
helm uninstall $USER-a4x-llama3-1-70b-32node
264+
```
