Commit a3aadeb

Recipe for llama3.1-8b 16nodes with gbs 256/seq 8192 (#155)
* Recipe for llama3.1-8b 16nodes with gbs 256/seq 8192
* Update WORKLOAD_NAME in README for consistency
* feat: add 8-node bf16 recipe for llama3-1-8b
* Remove '-8node' suffix from workload name
* feat: add 1-node bf16 recipe for llama3-1-8b
* Remove '-1node' suffix from workload name
* feat: add 8-node configuration for seq4096
* feat: add 4-node bf16 recipe for llama3-1-8b
* feat: add 8-node bf16 recipe for seq8192 gbs2048
1 parent a221911 commit a3aadeb

60 files changed

Lines changed: 5203 additions & 0 deletions

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: v2
name: a4_jobset_workload
description: a4_jobset_workload
type: application
version: 0.1.0
appVersion: "1.16.0"
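To inspect the Kubernetes manifests this chart renders without installing anything, a dry-run render along these lines can help; this is only a sketch, assuming you run it from the recipe folder that contains this chart together with the `values.yaml` and `launcher.sh` referenced later in this recipe, and `test-render` is a hypothetical release name:

```bash
# Render the chart locally and page through the generated manifests.
helm template test-render . -f values.yaml \
  --set-file workload_launcher=launcher.sh \
  --set-file workload_config=llama3-1-8b-bf16-seq8192-gbs256-gpus128.py \
  --set queue=a4 | less
```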
Lines changed: 153 additions & 0 deletions
@@ -0,0 +1,153 @@
<!-- mdformat global-off -->
# Pretrain llama3-1-8b workloads on a4 GKE node pools with the NVIDIA NeMo Framework

This recipe outlines the steps for running a llama3-1-8b pretraining
workload on [a4 GKE node pools](https://cloud.google.com/kubernetes-engine) by using the
[NVIDIA NeMo framework](https://github.com/NVIDIA/nemo).

## Orchestration and deployment tools

For this recipe, the following setup is used:

- Orchestration - [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine)
- Pretraining job configuration and deployment - A Helm chart is used to
  configure and deploy the [Kubernetes JobSet](https://kubernetes.io/blog/2025/03/23/introducing-jobset)
  resource, which manages the execution of the
  [NeMo pretraining workload](https://github.com/NVIDIA/nemo).

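If you want to confirm that the JobSet API is available on your cluster before deploying, a check such as the following can be used; the CRD name below is the upstream JobSet default, so adjust it if your installation differs:

```bash
# An error here means the JobSet controller is not installed on the cluster.
kubectl get crd jobsets.jobset.x-k8s.io
```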
## Test environment

This recipe has been optimized for and tested with the following configuration:

- GKE cluster - follow the Cluster Toolkit
  [instructions](https://github.com/GoogleCloudPlatform/cluster-toolkit/tree/main/examples/gke-a4)
  to create your a4 GKE cluster.

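As an optional sanity check, you can confirm the cluster exists and inspect its node pools; a sketch, where `<CLUSTER_NAME>` and `<CLUSTER_REGION>` are the values you will export in the settings section below:

```bash
# List the cluster's node pools; the a4 GPU node pool should appear here.
gcloud container node-pools list \
  --cluster <CLUSTER_NAME> \
  --region <CLUSTER_REGION>
```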
## Training dataset

This recipe uses a mock pretraining dataset provided by the NeMo framework.

## Docker container image

This recipe uses the following Docker images:

- `nvcr.io/nvidia/nemo:25.07`
- `us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib:v1.1.0`

## Run the recipe

From your client workstation, complete the following steps:

### Configure environment settings

Set the environment variables to match your environment:

```bash
export PROJECT_ID=<PROJECT_ID>
export CLUSTER_REGION=<CLUSTER_REGION>
export CLUSTER_NAME=<CLUSTER_NAME>
export GCS_BUCKET=<GCS_BUCKET> # Note: path should not be prefixed with gs://
export KUEUE_NAME=<KUEUE_NAME>
```

Replace the following values:

- `<PROJECT_ID>`: your Google Cloud project ID.
- `<CLUSTER_REGION>`: the region where your cluster is located.
- `<CLUSTER_NAME>`: the name of your GKE cluster.
- `<GCS_BUCKET>`: the name of your Cloud Storage bucket. Don't include the `gs://` prefix.
- `<KUEUE_NAME>`: the name of the Kueue local queue. The default queue created by the Cluster Toolkit is `a4`. Make sure to verify the name of the local queue in your cluster.

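For illustration only, a filled-in version might look like this; all values below are hypothetical placeholders, so substitute your own:

```bash
export PROJECT_ID=my-gpu-project          # hypothetical project ID
export CLUSTER_REGION=us-central1         # hypothetical region
export CLUSTER_NAME=my-a4-cluster         # hypothetical cluster name
export GCS_BUCKET=my-training-logs        # bucket name without the gs:// prefix
export KUEUE_NAME=a4                      # default local queue from the Cluster Toolkit
```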
Set the default project:

```bash
gcloud config set project $PROJECT_ID
```

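You can verify that the setting took effect before proceeding:

```bash
# Print the project gcloud will use for subsequent commands.
gcloud config get-value project
```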
### Get the recipe

Clone the `gpu-recipes` repository and set a reference to the recipe folder.

```bash
git clone https://github.com/ai-hypercomputer/gpu-recipes.git
cd gpu-recipes
export REPO_ROOT=`git rev-parse --show-toplevel`
export RECIPE_ROOT=$REPO_ROOT/training/a4/llama3-1-8b/nemo-pretraining-gke/16_nodes
cd $RECIPE_ROOT
```

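As a quick sanity check, confirm the recipe folder contains the expected assets; the file names below are the ones referenced by the `helm install` commands later in this recipe:

```bash
# The recipe folder should contain the Helm values, the launcher script, and
# the NeMo workload config used later in this recipe.
ls $RECIPE_ROOT
# expected (among others): values.yaml  launcher.sh  llama3-1-8b-bf16-seq8192-gbs256-gpus128.py
```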
### Get cluster credentials

```bash
gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
```

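To confirm that `kubectl` can now reach the cluster, a quick check like the following should list the cluster's nodes:

```bash
# Verify cluster access; node names and statuses should print without errors.
kubectl get nodes --no-headers | head -n 5
```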
### Configure and submit a pretraining job

#### Using 16 nodes (128 GPUs) with bf16 precision

To execute the job with the default settings, run the following command from
your client:

```bash
cd $RECIPE_ROOT
export WORKLOAD_NAME=$USER-a4-llama3-1-8b
helm install $WORKLOAD_NAME . -f values.yaml \
  --set-file workload_launcher=launcher.sh \
  --set-file workload_config=llama3-1-8b-bf16-seq8192-gbs256-gpus128.py \
  --set workload.image=nvcr.io/nvidia/nemo:25.07 \
  --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
  --set volumes.gcsMounts[0].mountPath=/job-logs \
  --set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
  --set queue=${KUEUE_NAME}
```

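After submitting, you can confirm the release was created and watch the pods come up; a quick sketch, where the pod count assumes this recipe's 16-node default:

```bash
# Confirm the Helm release exists.
helm list --filter "$WORKLOAD_NAME"
# List the workload's pods; this 16-node recipe should eventually show 16 pods.
kubectl get pods | grep "$WORKLOAD_NAME"
```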
**Examples**

- To set the number of training steps to 100, run the following command from
  your client:

  ```bash
  cd $RECIPE_ROOT
  export WORKLOAD_NAME=$USER-a4-llama3-1-8b
  helm install $WORKLOAD_NAME . -f values.yaml \
    --set-file workload_launcher=launcher.sh \
    --set-file workload_config=llama3-1-8b-bf16-seq8192-gbs256-gpus128.py \
    --set workload.image=nvcr.io/nvidia/nemo:25.07 \
    --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
    --set volumes.gcsMounts[0].mountPath=/job-logs \
    --set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
    --set queue=${KUEUE_NAME} \
    --set workload.arguments[0]="trainer.max_steps=100"
  ```

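To combine several NeMo overrides in one submission, additional entries can likely be appended to the `workload.arguments` list. The following is a hypothetical variant, assuming the chart forwards every list entry to the launcher (which accepts multiple `key=value` overrides); `trainer.log_every_n_steps` is used here only as an illustrative second override:

```bash
cd $RECIPE_ROOT
export WORKLOAD_NAME=$USER-a4-llama3-1-8b
helm install $WORKLOAD_NAME . -f values.yaml \
  --set-file workload_launcher=launcher.sh \
  --set-file workload_config=llama3-1-8b-bf16-seq8192-gbs256-gpus128.py \
  --set workload.image=nvcr.io/nvidia/nemo:25.07 \
  --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
  --set volumes.gcsMounts[0].mountPath=/job-logs \
  --set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
  --set queue=${KUEUE_NAME} \
  --set workload.arguments[0]="trainer.max_steps=100" \
  --set workload.arguments[1]="trainer.log_every_n_steps=10"
```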
### Monitor the job

To check the status of pods in your job, run the following command:

```bash
kubectl get pods | grep $USER-a4-llama3-1-8b
```

In what follows, `JOB_NAME_PREFIX` refers to your job name prefix - in this recipe, `$USER-a4-llama3-1-8b`.

To get the logs for one of the pods, run the following command:

```bash
kubectl logs POD_NAME
```

Information about the training job's progress, including crucial details such as
loss, step count, and step time, is generated by the rank 0 process.
This process runs on the pod whose name begins with
`JOB_NAME_PREFIX-workload-0-0`.
For example: `$USER-a4-llama3-1-8b-workload-0-0-s9zrv`.

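For convenience, the rank 0 pod can be located by the name prefix described above and its logs streamed directly; a minimal sketch, assuming `WORKLOAD_NAME` is still set from the install step:

```bash
# Find the rank 0 pod (name begins with ${WORKLOAD_NAME}-workload-0-0) and
# stream its logs, which include loss, step count, and step time.
RANK0_POD=$(kubectl get pods -o name | grep "${WORKLOAD_NAME}-workload-0-0" | head -n 1)
kubectl logs -f "$RANK0_POD"
```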
### Uninstall the Helm release

You can delete the job and other resources created by the Helm chart. To
uninstall the Helm release, run the following command from your client:

```bash
helm uninstall $USER-a4-llama3-1-8b
```
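If you also want to remove the training logs that the job wrote to Cloud Storage, something like the following should work, assuming the default `/job-logs/$WORKLOAD_NAME` mount path used in this recipe; double-check the path before deleting:

```bash
# Optional cleanup: delete the job's log folder from the Cloud Storage bucket.
gcloud storage rm -r gs://${GCS_BUCKET}/${WORKLOAD_NAME}
```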
Lines changed: 106 additions & 0 deletions
@@ -0,0 +1,106 @@
# Print usage information for the launcher.
usage()
{
cat << EOF
usage: bash ./launcher.sh [config-override [config-override ...]]
config-override  (Optional) A NeMo configuration override. E.g. trainer.max_steps=10000.
EOF
}

# Collect key=value NeMo config overrides from the command line.
parse_args() {
  while [ "$1" != "" ]; do
    case $(grep -o "=" <<< "$1" | wc -l) in
      1 )
        config_overrides+=("$1")
        ;;
      * )
        echo "Invalid config override: $1"
        usage
        exit 1
    esac
    shift
  done
  config_overrides="${config_overrides[*]}"
}

config_overrides=()
parse_args "$@"

if [ -z "${config_overrides}" ]; then
  echo "No NeMo config overrides specified"
else
  echo "NeMo config overrides:"
  echo "  ${config_overrides}"
fi

# Make the NCCL plugin libraries visible to the dynamic linker.
if [[ -n "${NCCL_PLUGIN_PATH}" ]]; then
  export LD_LIBRARY_PATH="$NCCL_PLUGIN_PATH"
  ldconfig $LD_LIBRARY_PATH
  echo "Added $LD_LIBRARY_PATH to ldconfig:"
  ldconfig -p | grep libcuda | sed 's/^/  /'
  echo ""
fi

# Resolve the log directory, falling back to a local default.
if [[ -n "${EXPLICIT_LOG_DIR}" ]]; then
  explicit_log_dir=${EXPLICIT_LOG_DIR}
else
  explicit_log_dir=workload_logs
fi
echo "Logging to ${explicit_log_dir}"

# Copy tokenizer files into the working directory if a path is provided.
if [[ -n "${TOKENIZER_PATH}" ]]; then
  echo "Getting tokenizer files"
  cp ${TOKENIZER_PATH}/* .
  echo ""
fi

echo "Launching Torch distributed on the node rank $JOB_COMPLETION_INDEX out of $NNODES nodes"

pip install git+https://github.com/NVIDIA/dllogger#egg=dllogger

export HF_TOKEN="<YOUR_HF_TOKEN>"

# Export the nemo2 config to yaml.
# Note: trainer.num_nodes is set from $NNODES and then pinned to 16 for this
# 16-node recipe; trainer.devices=8 matches the 8 GPUs per a4 node.
python ${NEMO_LAUNCH_SCRIPT} --factory "recipe()" \
  trainer.num_nodes="$NNODES" \
  log.explicit_log_dir="${explicit_log_dir}" \
  trainer.max_steps=30 \
  trainer.num_nodes=16 \
  trainer.devices=8 \
  ${config_overrides} \
  --to-yaml exported_nemo_config.yaml

# Create the nsys directory.
mkdir -p ${explicit_log_dir}/nsys

# Run the pretraining workload under Nsight Systems profiling, launching one
# torchrun process per node with 8 workers each.
OMP_NUM_THREADS=12 NSYS_CONFIG_DIRECTIVES="AgentLaunchTimeoutSec=240;AppLaunchTimeoutSec=240" TORCH_NCCL_ENABLE_MONITORING=0 \
/usr/local/bin/nsys profile -s none -t nvtx,cuda --capture-range=cudaProfilerApi --capture-range-end=stop \
-o ${explicit_log_dir}/nsys/noderank-${JOB_COMPLETION_INDEX} \
--session-new "nemo-rank${JOB_COMPLETION_INDEX}"-$RANDOM \
--wait all \
torchrun \
--nproc-per-node="8" \
--nnodes="${NNODES}" \
--node_rank="${JOB_COMPLETION_INDEX}" \
--rdzv_id="${JOB_IDENTIFIER}" \
--master_addr="${MASTER_ADDR}" \
--master_port="${MASTER_PORT}" \
${NEMO_LAUNCH_SCRIPT} --factory "recipe()" \
  trainer.num_nodes="$NNODES" \
  log.explicit_log_dir="${explicit_log_dir}" \
  trainer.max_steps=30 \
  trainer.num_nodes=16 \
  trainer.devices=8 \
  ${config_overrides}

# The node rank 0 pod copies logs, configs, and environment details to the
# artifact directory.
if [[ "$JOB_COMPLETION_INDEX" == "0" ]]; then
  mkdir -p ${ARTIFACT_DIR}
  cp -r ${explicit_log_dir}/* ${ARTIFACT_DIR}/
  cp ${NEMO_LAUNCH_SCRIPT} ${ARTIFACT_DIR}/run-cli.py
  cp dllogger.json ${ARTIFACT_DIR}/dllogger.json
  cp exported_nemo_config.yaml ${ARTIFACT_DIR}/nemo-configuration.yaml
  env > ${ARTIFACT_DIR}/environ.txt
  ls ${ARTIFACT_DIR}
fi
echo "Training completed"
echo "Pod on $(hostname --fqdn) is exiting"
