Commit fb072fb

Update references from GBS 256 to GBS 2048 in directory paths and READMEs
1 parent 65f49e0 commit fb072fb

20 files changed

Lines changed: 16 additions & 16 deletions

training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS256-NEMO25.11/recipe/Chart.yaml renamed to training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO25.11/recipe/Chart.yaml

File renamed without changes.

training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS256-NEMO25.11/recipe/README.md renamed to training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO25.11/recipe/README.md

Lines changed: 8 additions & 8 deletions
@@ -1,5 +1,5 @@
 <!-- mdformat global-off -->
-# Pretrain deepseek_v3-bf16-gbs256-gpus256 workloads on a4 GKE Node pools with Megatron-Bridge
+# Pretrain deepseek_v3-bf16-gbs2048-gpus256 workloads on a4 GKE Node pools with Megatron-Bridge

 This recipe outlines the steps for running a deepseek_v3 pretraining
 workload on [a4 GKE Node pools](https://cloud.google.com/kubernetes-engine) by using the
@@ -75,7 +75,7 @@ Clone the `gpu-recipes` repository and set a reference to the recipe folder.
 git clone https://github.com/ai-hypercomputer/gpu-recipes.git
 cd gpu-recipes
 export REPO_ROOT=`git rev-parse --show-toplevel`
-export RECIPE_ROOT=$REPO_ROOT/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS256-NEMO25.11/recipe
+export RECIPE_ROOT=$REPO_ROOT/training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO25.11/recipe
 cd $RECIPE_ROOT
 ```

@@ -87,7 +87,7 @@ To execute the job with the default settings, run the following command from you

 ```bash
 cd $RECIPE_ROOT
-export WORKLOAD_NAME=$USER-deepseek-v3-32node-bf16-seq4096-gbs256
+export WORKLOAD_NAME=$USER-deepseek-v3-32node-bf16-seq4096-gbs2048
 helm install $WORKLOAD_NAME . -f values.yaml \
 --set-file workload_launcher=launcher.sh \
 --set-file workload_config=custom_setup_experiment.py \
@@ -105,7 +105,7 @@ helm install $WORKLOAD_NAME . -f values.yaml \

 ```bash
 cd $RECIPE_ROOT
-export WORKLOAD_NAME=$USER-deepseek-v3-32node-bf16-seq4096-gbs256
+export WORKLOAD_NAME=$USER-deepseek-v3-32node-bf16-seq4096-gbs2048
 helm install $WORKLOAD_NAME . -f values.yaml \
 --set-file workload_launcher=launcher.sh \
 --set-file workload_config=custom_setup_experiment.py \
@@ -122,12 +122,12 @@ helm install $WORKLOAD_NAME . -f values.yaml \
 To check the status of pods in your job, run the following command:

 ```
-kubectl get pods | grep $USER-deepseek-v3-32node-bf16-seq4096-gbs256
+kubectl get pods | grep $USER-deepseek-v3-32node-bf16-seq4096-gbs2048
 ```

 Replace the following:

-- JOB_NAME_PREFIX - your job name prefix. For example $USER-deepseek-v3-32node-bf16-seq4096-gbs256.
+- JOB_NAME_PREFIX - your job name prefix. For example $USER-deepseek-v3-32node-bf16-seq4096-gbs2048.

 To get the logs for one of the pods, run the following command:

@@ -139,13 +139,13 @@ Information about the training job's progress, including crucial details such as
 loss, step count, and step time, is generated by the rank 0 process.
 This process runs on the pod whose name begins with
 `JOB_NAME_PREFIX-workload-0-0`.
-For example: `$USER-deepseek-v3-32node-bf16-seq4096-gbs256-workload-0-0-s9zrv`.
+For example: `$USER-deepseek-v3-32node-bf16-seq4096-gbs2048-workload-0-0-s9zrv`.

 ### Uninstall the Helm release

 You can delete the job and other resources created by the Helm chart. To
 uninstall Helm, run the following command from your client:

 ```bash
-helm uninstall $USER-deepseek-v3-32node-bf16-seq4096-gbs256
+helm uninstall $USER-deepseek-v3-32node-bf16-seq4096-gbs2048
 ```
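
The recipe directory name and the `WORKLOAD_NAME` values changed in this README diff encode the same fields (node count, precision, sequence length, global batch size), which is why they have to move together. The following is a minimal Python sketch of that relationship, assuming the naming pattern visible in the diff holds for this recipe. `parse_recipe_dir`, `workload_name`, and `tokens_per_step` are hypothetical helpers that are not part of the repository, and the tokens-per-step figure assumes that GBS counts sequences of SEQ tokens each.

```python
import re


def parse_recipe_dir(name: str) -> dict:
    """Split a directory name such as '32node-BF16-SEQ4096-GBS2048-NEMO25.11'.

    The pattern is inferred from the paths in this commit (an assumption).
    """
    m = re.fullmatch(r"(\d+)node-([A-Za-z0-9]+)-SEQ(\d+)-GBS(\d+)-(.+)", name)
    if m is None:
        raise ValueError(f"unrecognized recipe directory name: {name!r}")
    nodes, precision, seq_len, gbs, framework = m.groups()
    return {
        "nodes": int(nodes),
        "precision": precision,
        "seq_len": int(seq_len),
        "gbs": int(gbs),
        "framework": framework,
    }


def workload_name(user: str, model: str, fields: dict) -> str:
    """Rebuild the lowercase Helm release name used in the README examples."""
    return (
        f"{user}-{model}-{fields['nodes']}node-"
        f"{fields['precision'].lower()}-seq{fields['seq_len']}-gbs{fields['gbs']}"
    )


def tokens_per_step(fields: dict) -> int:
    """Tokens consumed per optimizer step, assuming GBS is counted in sequences."""
    return fields["seq_len"] * fields["gbs"]


fields = parse_recipe_dir("32node-BF16-SEQ4096-GBS2048-NEMO25.11")
print(workload_name("alice", "deepseek-v3", fields))
print(tokens_per_step(fields))
```

Under the stated assumption, moving from GBS 256 to GBS 2048 at sequence length 4096 raises the tokens per step from 1,048,576 to 8,388,608.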

training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS256-NEMO25.11/recipe/custom_setup_experiment.py renamed to training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO25.11/recipe/custom_setup_experiment.py

File renamed without changes.

training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS256-NEMO25.11/recipe/launcher.sh renamed to training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO25.11/recipe/launcher.sh

File renamed without changes.

training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS256-NEMO25.11/recipe/recipe_launch_command.sh renamed to training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO25.11/recipe/recipe_launch_command.sh

File renamed without changes.

training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS256-NEMO25.11/recipe/templates/workload-config-configmap.yaml renamed to training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO25.11/recipe/templates/workload-config-configmap.yaml

File renamed without changes.

training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS256-NEMO25.11/recipe/templates/workload-job.yaml renamed to training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO25.11/recipe/templates/workload-job.yaml

File renamed without changes.

training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS256-NEMO25.11/recipe/templates/workload-launcher-configmap.yaml renamed to training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO25.11/recipe/templates/workload-launcher-configmap.yaml

File renamed without changes.

training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS256-NEMO25.11/recipe/templates/workload-svc.yaml renamed to training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO25.11/recipe/templates/workload-svc.yaml

File renamed without changes.

training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS256-NEMO25.11/recipe/values.yaml renamed to training/a4/deepseek_v3/megatron-bridge-pretraining-gke/32node-BF16-SEQ4096-GBS2048-NEMO25.11/recipe/values.yaml

File renamed without changes.
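
The substitution this commit applies to paths and README text can be sketched in Python. This is an illustration only, not the tool the author used; `retarget_gbs` is a hypothetical helper. The point it makes is that the replacement has to be anchored on the `gbs` prefix: the README title also contains `gpus256`, which must stay unchanged, so a bare replacement of `256` would corrupt it.

```python
import re


def retarget_gbs(text: str, old: int = 256, new: int = 2048) -> str:
    """Rewrite 'GBS<old>' / 'gbs<old>' to the new batch size, keeping the case.

    The negative lookahead avoids touching longer numbers such as 'gbs2560'.
    """
    pattern = re.compile(rf"(gbs){old}(?!\d)", flags=re.IGNORECASE)
    return pattern.sub(lambda m: f"{m.group(1)}{new}", text)


# Directory names use upper case, workload names use lower case.
print(retarget_gbs("32node-BF16-SEQ4096-GBS256-NEMO25.11"))
# The GPU count in the README title is left alone.
print(retarget_gbs("deepseek_v3-bf16-gbs256-gpus256"))
```

Applying the function twice gives the same result as applying it once, because `gbs2048` no longer matches the `gbs256` pattern.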
