Commit d926d6b

Reorg recipe files
1 parent 7ba7ee7 commit d926d6b

14 files changed

Lines changed: 1436 additions & 3 deletions

training/a4x/llama3_70b/nemo-gke/nemo2602/checkpoint/128gpu-fp8mx-lustre/recipe/Chart.yaml

Lines changed: 2 additions & 2 deletions
@@ -13,8 +13,8 @@
 # limitations under the License.
 
 apiVersion: v2
-name: a4x_max_jobset_workload
-description: a4x_max_jobset_workload
+name: a4x_jobset_workload
+description: a4x_jobset_workload
 type: application
 version: 0.1.0
 appVersion: "1.16.0"

training/a4x/llama3_70b/nemo-gke/nemo2602/checkpoint/128gpu-fp8mx-lustre/recipe/launcher.sh

Lines changed: 1 addition & 1 deletion
@@ -114,7 +114,7 @@ worker_command=$(cat <<- EOM
 nice -10 \
 python scripts/performance/custom_setup_experiment.py \
 --gpu gb200 \
---account asq_google_com --partition a4xmaxpartition \
+--account asq_google_com --partition a4xpartition \
 --model_family_name llama \
 --model_recipe_name llama3_70b \
 --list_config_variants \
Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
# Copyright 2026 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: v2
name: a4x_jobset_workload
description: a4x_jobset_workload
type: application
version: 0.1.0
appVersion: "1.16.0"
Lines changed: 271 additions & 0 deletions
@@ -0,0 +1,271 @@
<!-- mdformat global-off -->
# Pretrain llama3-1-70b workloads on a4x GKE Node pools with NVIDIA NeMo Framework

This recipe outlines the steps for running a llama3-1-70b pretraining
workload on [a4x GKE Node pools](https://cloud.google.com/kubernetes-engine) by using the
[NVIDIA NeMo framework](https://github.com/NVIDIA/nemo).

## Orchestration and deployment tools

For this recipe, the following setup is used:

- Orchestration - [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine)
- Pretraining job configuration and deployment - A Helm chart is used to
  configure and deploy the [Kubernetes Jobset](https://kubernetes.io/blog/2025/03/23/introducing-jobset) resource which manages the execution of the
  [NeMo pretraining workload](https://github.com/NVIDIA/nemo).

## Test environment

This recipe has been optimized for and tested with the following configuration:

- GKE cluster
  Please follow Cluster Toolkit [instructions](https://github.com/GoogleCloudPlatform/cluster-toolkit/tree/main/examples/gke-a4x)
  to create your a4x GKE cluster.
25+
## Training dataset
26+
27+
This recipe uses a mock pretraining dataset provided by the NeMo framework.
28+
29+
## Docker container image
30+
31+
This recipe uses the following docker images:
32+
33+
- `nvcr.io/nvidia/nemo:26.02.01`
34+
- `us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib-diagnostic-arm64:v1.1.0`
35+
36+
## Run the recipe
37+
38+
From your client workstation, complete the following steps:
39+
40+
### Configure environment settings
41+
42+
Set the environment variables to match your environment:
43+
44+
```bash
45+
export PROJECT_ID=<PROJECT_ID>
46+
export CLUSTER_REGION=<CLUSTER_REGION>
47+
export CLUSTER_NAME=<CLUSTER_NAME>
48+
export GCS_BUCKET=<GCS_BUCKET> # Note: path should not be prefixed with gs://
49+
export KUEUE_NAME=<KUEUE_NAME>
50+
```
51+
52+
Replace the following values:
53+
54+
- `<PROJECT_ID>`: your Google Cloud project ID.
55+
- `<CLUSTER_REGION>`: the region where your cluster is located.
56+
- `<CLUSTER_NAME>`: the name of your GKE cluster.
57+
- `<GCS_BUCKET>`: the name of your Cloud Storage bucket. Don't include the `gs://` prefix.
58+
- `<KUEUE_NAME>`: the name of the Kueue local queue. The default queue created by the cluster toolkit is `a4x`. Make sure to verify the name of the local queue in your cluster.
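
Before continuing, it can help to confirm that every variable above is set and that the bucket name carries no `gs://` prefix. The following is a minimal sketch in plain shell and is not part of the recipe; the helper names `strip_gs_prefix` and `missing_vars` are hypothetical.

```bash
# Hypothetical helpers, not part of the recipe.

# Strip an optional gs:// prefix so the value is a bare bucket name.
strip_gs_prefix() {
  printf '%s\n' "${1#gs://}"
}

# Print the names of required variables that are unset or empty.
# Returns non-zero if any are missing.
missing_vars() {
  missing=""
  for name in PROJECT_ID CLUSTER_REGION CLUSTER_NAME GCS_BUCKET KUEUE_NAME; do
    eval "value=\${$name:-}"
    [ -n "$value" ] || missing="$missing $name"
  done
  [ -z "$missing" ] || { echo "missing:$missing"; return 1; }
}

# Usage:
#   GCS_BUCKET=$(strip_gs_prefix "$GCS_BUCKET")
#   missing_vars || echo "set the variables listed above before continuing"
```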

Set the default project:

```bash
gcloud config set project $PROJECT_ID
```

### Get the recipe

Clone the `gpu-recipes` repository and set a reference to the recipe folder.

```bash
git clone https://github.com/ai-hypercomputer/gpu-recipes.git
cd gpu-recipes
export REPO_ROOT=`git rev-parse --show-toplevel`
export RECIPE_ROOT=$REPO_ROOT/training/a4x/llama3_70b/nemo-gke/nemo2602/64gpus-fp8mx-lustre/recipe
cd $RECIPE_ROOT
```

### Get cluster credentials

```bash
gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
```

### Set up Lustre PVCs

1) Create Lustre instance

Follow [Create Lustre Instance](https://docs.cloud.google.com/managed-lustre/docs/create-instance) to create a Lustre instance.

**Create the instance in the same zone and network as your cluster.**

Run `gcloud lustre instances list --location $ZONE --project $PROJECT_ID`, where `$ZONE` is the zone of your Lustre instance, and take note of the filesystem, name, and network. The output looks something like this:
```
capacityGib: '126000'
createTime: '2026-01-07T00:24:57.572296415Z'
filesystem: <FILESYSTEM>
mountPoint: <LUSTRE_IP>@tcp:/<FILESYSTEM>
name: projects/<PROJECT_ID>/locations/<ZONE>/instances/<NAME>
network: projects/<PROJECT_ID>/global/networks/<NETWORK>
perUnitStorageThroughput: '1000'
state: ACTIVE
uid: 9b9400c4-a669-48d0-89de-fe4bce4664fb
updateTime: '2026-04-27T20:38:55.794373239Z'
```
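
The `mountPoint` field in the output above combines the Lustre IP and the filesystem name. As a sketch, shell parameter expansion can split it into the two values needed later; the function name `parse_mount_point` and the sample values `10.0.0.2` and `lfs01` are illustrative assumptions, not values from a real instance.

```bash
# Hypothetical helper: split a mount point of the form IP@tcp:/FILESYSTEM.
parse_mount_point() {
  mount_point=$1
  LUSTRE_IP=${mount_point%%@*}     # text before the first '@'
  FILESYSTEM=${mount_point##*:/}   # text after the last ':/'
}

# Illustrative input only; use the mountPoint value from your own instance.
parse_mount_point "10.0.0.2@tcp:/lfs01"
echo "LUSTRE_IP=$LUSTRE_IP FILESYSTEM=$FILESYSTEM"
```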

2) Upload training dataset to the instance
For commands, see: TODO

3) Create PersistentVolumes for the instance and claim the volume

Replace the following variables in `lustre_pv.yaml`:
- `<FILESYSTEM>`: Filesystem name of the instance
- `<LUSTRE_IP>`: IP of Lustre instance
- `<PROJECT_ID>`: Project ID
- `<ZONE>`: Zone of your compute nodes and Lustre instance
- `<NAME>`: Lustre instance name
- `<NETWORK>`: Network name of your cluster and instance
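
If you prefer not to edit the file by hand, the substitutions listed above could be scripted with `sed`. This is a sketch only: it assumes the placeholders appear in `lustre_pv.yaml` literally as written in the list, and the function name `render_lustre_pv` and the variable name `LUSTRE_NAME` are hypothetical.

```bash
# Hypothetical helper: print a copy of the given file with the placeholders
# replaced by the values of the corresponding shell variables.
# Assumes none of the values contain the '|' character.
render_lustre_pv() {
  sed -e "s|<FILESYSTEM>|${FILESYSTEM}|g" \
      -e "s|<LUSTRE_IP>|${LUSTRE_IP}|g" \
      -e "s|<PROJECT_ID>|${PROJECT_ID}|g" \
      -e "s|<ZONE>|${ZONE}|g" \
      -e "s|<NAME>|${LUSTRE_NAME}|g" \
      -e "s|<NETWORK>|${NETWORK}|g" \
      "$1"
}

# Usage (writes a filled-in copy and leaves the original as a template):
#   render_lustre_pv lustre_pv.yaml > lustre_pv.rendered.yaml
```

With this approach, the rendered copy rather than the template is the file to pass to `kubectl apply -f`.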

```bash
kubectl apply -f ./lustre_pv.yaml
```

4) Add volume claims to values.yaml

```
...
pvcMounts:
- claimName: "asq-0106-lustre-pvc"
  mountPath: "/lustrefs"
...
```

### Configure and submit a pretraining job

#### Using 32 node (64 gpus) fp8 precision with Lustre dataload and checkpointing
To execute the job with dataloading and checkpoint saving, run the following:

```bash
cd $RECIPE_ROOT
export WORKLOAD_NAME=$USER-a4x-llama3-1-70b-32node
export DATASET_TYPE=<DATASET_TYPE>
export DATASET_PATHS=<DATASET_PATHS>
export INDEX_MAPPING_DIR=<INDEX_MAPPING_DIR>
export CKPT_SAVE_DIR=<CKPT_SAVE_DIR>
export CKPT_SAVE_INTERVAL=<CKPT_SAVE_INTERVAL>
helm install $WORKLOAD_NAME . -f values.yaml \
    --set-file workload_launcher=launcher.sh \
    --set-file workload_config=llama3-1-70b-fp8cs-gbs2048-gpus64.py \
    --set workload.image=nvcr.io/nvidia/nemo:26.02.01 \
    --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
    --set volumes.gcsMounts[0].mountPath=/job-logs \
    --set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
    --set workload.envs[5].value=${DATASET_TYPE} \
    --set workload.envs[6].value="/lustrefs/data/${DATASET_PATHS}" \
    --set workload.envs[7].value=/lustrefs/data/${INDEX_MAPPING_DIR} \
    --set workload.envs[8].value=/lustrefs/ckpt/${CKPT_SAVE_DIR} \
    --set workload.envs[9].value="${CKPT_SAVE_INTERVAL}" \
    --set queue=${KUEUE_NAME}
```

Replace the following values:

- `<DATASET_TYPE>`: The type of dataset used (see [Megatron-Bridge data arguments](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/r0.3.0/scripts/performance#data-arguments))
- `<DATASET_PATHS>`: Paths to your dataset (for rp2 dataset)
- `<INDEX_MAPPING_DIR>`: Index mapping dir (for rp2 dataset)
- `<CKPT_SAVE_DIR>`: The directory where you wish to save checkpoints
- `<CKPT_SAVE_INTERVAL>`: Save a checkpoint every `CKPT_SAVE_INTERVAL` training steps
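
Because `CKPT_SAVE_INTERVAL` is a step count rather than a path (the `values.yaml` example in this recipe sets it to `"10"`), a quick check before running `helm install` can catch a mistyped value. The following is a minimal sketch; the function name `is_positive_int` is hypothetical.

```bash
# Hypothetical helper: succeed only if the argument is a positive integer.
is_positive_int() {
  case $1 in
    ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit character
    *) [ "$1" -gt 0 ] ;;       # digits only: require a value above zero
  esac
}

# Usage:
#   is_positive_int "$CKPT_SAVE_INTERVAL" || echo "CKPT_SAVE_INTERVAL must be a step count"
```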

Note: Edit `recipe.dataset.num_workers = 8` in `run_script.py` to change the number of dataloading workers.

This recipe uses the NVIDIA NeMo 26.02.01 container, which uses Megatron-Bridge for checkpointing and dataloading. See [checkpointing arguments](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/r0.3.0/scripts/performance#checkpointing-arguments) and [data arguments](https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/r0.3.0/scripts/performance#data-arguments) for explanations of the arguments and dataset types.

You can also edit `values.yaml` directly:
```
- name: DATASET_TYPE
  value: "rp2"
- name: DATASET_PATHS
  value: "/lustrefs/data/${DATASET_PATHS}"
- name: INDEX_MAPPING_DIR
  value: "/lustrefs/data/${INDEX_MAPPING_DIR}"
- name: CKPT_SAVE_DIR
  value: "/lustrefs/ckpt/${CKPT_SAVE_DIR}"
- name: CKPT_SAVE_INTERVAL
  value: "10"
```

To load a checkpoint, add the following:
```bash
cd $RECIPE_ROOT
export WORKLOAD_NAME=$USER-a4x-llama3-1-70b-32node
export CKPT_LOAD_DIR=<CKPT_LOAD_DIR>
export CKPT_LOAD_STEP=<CKPT_LOAD_STEP>
helm install $WORKLOAD_NAME . -f values.yaml \
    --set-file workload_launcher=launcher.sh \
    --set-file workload_config=llama3-1-70b-fp8cs-gbs2048-gpus64.py \
    --set workload.image=nvcr.io/nvidia/nemo:26.02.01 \
    --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
    --set volumes.gcsMounts[0].mountPath=/job-logs \
    --set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
    --set workload.envs[10].value=/lustrefs/ckpt/${CKPT_LOAD_DIR} \
    --set workload.envs[11].value="${CKPT_LOAD_STEP}" \
    --set queue=${KUEUE_NAME}
```

Or edit `values.yaml` directly:
```
- name: CKPT_LOAD_DIR
  value: "/lustrefs/ckpt/${CKPT_LOAD_DIR}"
- name: CKPT_LOAD_STEP
  value: "10"
```

**Examples**

- To set the number of training steps to 100, run the following command from your client:

  ```bash
  cd $RECIPE_ROOT
  export WORKLOAD_NAME=$USER-a4x-llama3-1-70b-32node
  helm install $WORKLOAD_NAME . -f values.yaml \
      --set-file workload_launcher=launcher.sh \
      --set-file workload_config=llama3-1-70b-fp8cs-gbs2048-gpus64.py \
      --set workload.image=nvcr.io/nvidia/nemo:26.02.01 \
      --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
      --set volumes.gcsMounts[0].mountPath=/job-logs \
      --set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
      --set workload.envs[5].value=${DATASET_TYPE} \
      --set workload.envs[6].value="/lustrefs/data/${DATASET_PATHS}" \
      --set workload.envs[7].value=/lustrefs/data/${INDEX_MAPPING_DIR} \
      --set workload.envs[8].value=/lustrefs/ckpt/${CKPT_SAVE_DIR} \
      --set workload.envs[9].value="${CKPT_SAVE_INTERVAL}" \
      --set queue=${KUEUE_NAME} \
      --set workload.arguments[0]="trainer.max_steps=100"
  ```


### Monitor the job

To check the status of pods in your job, run the following command:

```bash
kubectl get pods | grep $USER-a4x-llama3-1-70b-32node
```

In this command, `$USER-a4x-llama3-1-70b-32node` is the job name prefix, referred to below as `JOB_NAME_PREFIX`. If you set a different `WORKLOAD_NAME`, use that value instead.

To get the logs for one of the pods, run the following command:

```bash
kubectl logs POD_NAME
```

Information about the training job's progress, including crucial details such as
loss, step count, and step time, is generated by the rank 0 process.
This process runs on the pod whose name begins with
`JOB_NAME_PREFIX-workload-0-0`.
For example: `$USER-a4x-llama3-1-70b-32node-workload-0-0-s9zrv`.
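
The naming rule above can be turned into a small filter that picks the rank 0 pod out of a list of pod names. This is a sketch under the assumption that pod names follow the `JOB_NAME_PREFIX-workload-0-0-<suffix>` pattern described here; the function name `rank0_pods` is hypothetical.

```bash
# Hypothetical helper: read pod names on stdin, one per line, and print only
# those that start with "<prefix>-workload-0-0-".
rank0_pods() {
  prefix="$1-workload-0-0"
  while IFS= read -r pod; do
    case $pod in
      "$prefix"-*) printf '%s\n' "$pod" ;;
    esac
  done
}

# Usage with a live cluster (kubectl prints names as "pod/<name>"):
#   kubectl get pods -o name | sed 's|^pod/||' | rank0_pods "$WORKLOAD_NAME"
```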

The logs show the checkpoint save directory:
```
28_worker0/0 successfully saved checkpoint from iteration 40 to /lustrefs/ckpt/asq-llama70b-ckpt-64gpu-lustre-2026-04-03-03-51-29 [ t 1/2, p 1/4 ]
```
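
To pull the iteration number and directory out of such a line, for instance when choosing values for `CKPT_LOAD_DIR` and `CKPT_LOAD_STEP`, a `sed` expression can be used. This sketch assumes the log line has exactly the shape shown above; the function name `parse_ckpt_line` is hypothetical.

```bash
# Hypothetical helper: read log lines on stdin and print
# "<iteration> <checkpoint directory>" for each checkpoint-saved message.
parse_ckpt_line() {
  sed -n 's/.*successfully saved checkpoint from iteration \([0-9][0-9]*\) to \([^ ]*\).*/\1 \2/p'
}

# Usage with a live cluster:
#   kubectl logs POD_NAME | parse_ckpt_line
```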

### Uninstall the Helm release

You can delete the job and other resources created by the Helm chart. To
uninstall the Helm release, run the following command from your client:

```bash
helm uninstall $USER-a4x-llama3-1-70b-32node
```
