Commit d3672e2

Add DeepSeek-V3 and DeepSeek-V3-671B Megatron-Bridge recipes
1 parent e68a087 commit d3672e2

20 files changed: 2333 additions & 0 deletions

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: v2
name: a4_jobset_workload
description: a4_jobset_workload
type: application
version: 0.1.0
appVersion: "1.16.0"
Lines changed: 151 additions & 0 deletions
@@ -0,0 +1,151 @@
<!-- mdformat global-off -->

# Pretrain deepseek_v3-bf16-gbs256-gpus256 workloads on a4 GKE Node pools with Megatron-Bridge

This recipe outlines the steps for running a deepseek_v3 pretraining
workload on [a4 GKE Node pools](https://cloud.google.com/kubernetes-engine) by using the
[NVIDIA Megatron-Bridge framework](https://github.com/NVIDIA-NeMo/Megatron-Bridge).
## Orchestration and deployment tools

For this recipe, the following setup is used:

- Orchestration - [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine)
- Pretraining job configuration and deployment - A Helm chart is used to configure and deploy the Kubernetes JobSet resource, which manages the execution of the [Megatron-Bridge pretraining workload](https://github.com/NVIDIA-NeMo/Megatron-Bridge).
## Test environment

This recipe has been optimized for and tested with the following configuration:

- GKE cluster: Follow the Cluster Toolkit [instructions](https://github.com/GoogleCloudPlatform/cluster-toolkit/tree/main/examples/gke-a4) to create your a4 GKE cluster.
- Node configuration: 32 nodes (8 GPUs per node, 256 GPUs total).
- GPU architecture: NVIDIA Blackwell (B200).
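The scale encoded in the recipe name can be cross-checked with a little arithmetic. The snippet below only restates numbers that appear in this recipe's name and configuration (32 nodes, 8 GPUs per node, global batch size 256, sequence length 4096); it queries nothing on the cluster:

```shell
# Back-of-the-envelope numbers for this configuration (arithmetic only).
NODES=32
GPUS_PER_NODE=8
TOTAL_GPUS=$(( NODES * GPUS_PER_NODE ))          # 256 GPUs in total
GLOBAL_BATCH=256
SEQ_LEN=4096
TOKENS_PER_STEP=$(( GLOBAL_BATCH * SEQ_LEN ))    # tokens per optimizer step
echo "$TOTAL_GPUS GPUs, $TOKENS_PER_STEP tokens/step"
```

At these settings each optimizer step consumes 1,048,576 tokens, which is useful when estimating how many steps a given token budget corresponds to.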
## Training dataset

This recipe uses a mock pretraining dataset provided by the [Megatron-Bridge dataset utilities](https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/scripts/performance/utils/datasets.py).
## Docker container image

This recipe uses the following Docker images:

- `nvcr.io/nvidia/nemo:25.11.01`
- `us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib:v1.1.1`
## Run the recipe

From your client workstation, complete the following steps:

### Configure environment settings

Set the environment variables to match your environment:

```bash
export PROJECT_ID=<PROJECT_ID>
export CLUSTER_REGION=<CLUSTER_REGION>
export CLUSTER_NAME=<CLUSTER_NAME>
export GCS_BUCKET=<GCS_BUCKET> # Note: the bucket name should not be prefixed with gs://
export KUEUE_NAME=<KUEUE_NAME>
```
Replace the following values:

- `<PROJECT_ID>`: your Google Cloud project ID.
- `<CLUSTER_REGION>`: the region where your cluster is located.
- `<CLUSTER_NAME>`: the name of your GKE cluster.
- `<GCS_BUCKET>`: the name of your Cloud Storage bucket. Don't include the `gs://` prefix.
- `<KUEUE_NAME>`: the name of the Kueue local queue. The default queue created by the Cluster Toolkit is `a4`.
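For illustration, a filled-in version might look like the following. Every value below is a placeholder invented for this example; substitute your own project, region, cluster, bucket, and queue names:

```shell
# Illustrative placeholder values only -- replace with your own settings.
export PROJECT_ID=my-gcp-project
export CLUSTER_REGION=us-central1
export CLUSTER_NAME=my-a4-cluster
export GCS_BUCKET=my-training-logs   # bucket name only, no gs:// prefix
export KUEUE_NAME=a4                 # default local queue from the Cluster Toolkit
```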
Set the default project:

```bash
gcloud config set project $PROJECT_ID
```
### Get cluster credentials

```bash
gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
```
### Get the recipe

Clone the `gpu-recipes` repository and set a reference to the recipe folder:

```bash
git clone https://github.com/ai-hypercomputer/gpu-recipes.git
cd gpu-recipes
export REPO_ROOT=`git rev-parse --show-toplevel`
export RECIPE_ROOT=$REPO_ROOT/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS256/recipe
cd $RECIPE_ROOT
```
82+
### Configure and submit a pretraining job
83+
84+
#### Using 32 nodes (256 gpus) bf16 precision
85+
86+
To execute the job with the default settings, run the following command from your client:
87+
88+
```bash
89+
cd $RECIPE_ROOT
90+
export WORKLOAD_NAME=$USER-deepseek-v3-32node-bf16-seq4096-gbs256
91+
helm install $WORKLOAD_NAME . -f values.yaml \
92+
--set-file workload_launcher=launcher.sh \
93+
--set-file workload_config=custom_setup_experiment.py \
94+
--set workload.image=nvcr.io/nvidia/nemo:25.11.01 \
95+
--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
96+
--set volumes.gcsMounts[0].mountPath=/job-logs \
97+
--set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
98+
--set queue=${KUEUE_NAME}
99+
```
100+
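Each `--set` flag above overrides a field in the chart's `values.yaml`. As a rough sketch, the overridden fields plausibly correspond to a structure like the following. The field paths are inferred from the flags themselves, not confirmed against the chart, and the environment variable name is a placeholder since only its value is set above:

```yaml
# Hypothetical values.yaml fragment, reconstructed from the --set flags.
queue: <KUEUE_NAME>
workload:
  image: nvcr.io/nvidia/nemo:25.11.01
  envs:
    - name: <ENV_VAR_NAME>          # name unknown; only the value is overridden
      value: /job-logs/<WORKLOAD_NAME>
volumes:
  gcsMounts:
    - bucketName: <GCS_BUCKET>
      mountPath: /job-logs
```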
**Examples**

- To set the number of training steps to 100, run the following command from
  your client:

  ```bash
  cd $RECIPE_ROOT
  export WORKLOAD_NAME=$USER-deepseek-v3-32node-bf16-seq4096-gbs256
  helm install $WORKLOAD_NAME . -f values.yaml \
    --set-file workload_launcher=launcher.sh \
    --set-file workload_config=custom_setup_experiment.py \
    --set workload.image=nvcr.io/nvidia/nemo:25.11.01 \
    --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
    --set volumes.gcsMounts[0].mountPath=/job-logs \
    --set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
    --set queue=${KUEUE_NAME} \
    --set workload.arguments[0]="trainer.max_steps=100"
  ```
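When experimenting with different step counts, the long command can be wrapped in a small shell function so only the override changes between runs. This is a convenience sketch, not part of the recipe: the helper name and its default step count are my own, while the helm flags mirror the examples above:

```shell
# Hypothetical convenience wrapper around the helm install command above.
submit_pretrain() {
  # Usage: submit_pretrain [MAX_STEPS]; defaults to 100 training steps.
  local max_steps=${1:-100}
  helm install "$WORKLOAD_NAME" . -f values.yaml \
    --set-file workload_launcher=launcher.sh \
    --set-file workload_config=custom_setup_experiment.py \
    --set workload.image=nvcr.io/nvidia/nemo:25.11.01 \
    --set volumes.gcsMounts[0].bucketName="${GCS_BUCKET}" \
    --set volumes.gcsMounts[0].mountPath=/job-logs \
    --set workload.envs[0].value="/job-logs/${WORKLOAD_NAME}" \
    --set queue="${KUEUE_NAME}" \
    --set workload.arguments[0]="trainer.max_steps=${max_steps}"
}
```

For example, `submit_pretrain 200` would submit the same workload with `trainer.max_steps=200`.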
### Monitor the job

To check the status of pods in your job, run the following command:

```bash
kubectl get pods | grep JOB_NAME_PREFIX
```

Replace the following:

- `JOB_NAME_PREFIX`: your job name prefix. For example, `$USER-deepseek-v3-32node-bf16-seq4096-gbs256`.

To get the logs for one of the pods, run the following command:

```bash
kubectl logs POD_NAME
```

Replace `POD_NAME` with the name of a pod returned by the previous command.
Information about the training job's progress, including crucial details such as
loss, step count, and step time, is generated by the rank 0 process.
This process runs on the pod whose name begins with
`JOB_NAME_PREFIX-workload-0-0`.
For example: `$USER-deepseek-v3-32node-bf16-seq4096-gbs256-workload-0-0-s9zrv`.
### Uninstall the Helm release
145+
146+
You can delete the job and other resources created by the Helm chart. To
147+
uninstall Helm, run the following command from your client:
148+
149+
```bash
150+
helm uninstall $USER-deepseek-v3-32node-bf16-seq4096-gbs256
151+
```
