Commit cabe055

savitha-eng and claude committed
Switch torchrun to static mode matching proven Lepton scripts
Use --node_rank=$NODE_RANK --master_addr --master_port instead of rdzv mode. This matches submit_og2_lepton_eden.py which has been running multi-node training successfully.

Also clarify in the Multi-Node Launch Protocol that workers must use single-quoted heredocs to preserve $NODE_RANK/$MASTER_ADDR as variables.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 2c503d0 commit cabe055

1 file changed

Lines changed: 25 additions & 9 deletions

bionemo-recipes/recipes/opengenome2_llama_native_te/OG2_FP8_AGENT_GUIDE.md

@@ -158,8 +158,9 @@ cd $(dirname $TRAINING_SCRIPT)
 torchrun \
   --nproc_per_node=$NPROC_PER_NODE \
   --nnodes=$NNODES \
-  --rdzv_backend=c10d \
-  --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
+  --node_rank=$NODE_RANK \
+  --master_addr=$MASTER_ADDR \
+  --master_port=$MASTER_PORT \
   $(basename $TRAINING_SCRIPT) \
   --config-name $CONFIG_NAME \
   num_train_steps=$NUM_TRAIN_STEPS \
@@ -227,23 +228,38 @@ Training runs on `$NNODES` nodes. This agent runs on rank 0 only. Worker nodes (
 Example (first launch):

 ```bash
-# Write launch script for workers
+# Step 1: Write launch script for workers (use single-quoted heredoc to preserve $variables)
 cat > $LAUNCH_DIR/1.sh << 'LAUNCH_EOF'
 #!/bin/bash
 cd /path/to/training/dir
-torchrun --nproc_per_node=8 --nnodes=6 --rdzv_backend=c10d \
-  --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
-  train_fsdp2.py --config-name og2_7b_thd_gqa_fp8 ...
+torchrun \
+  --nproc_per_node=8 \
+  --nnodes=6 \
+  --node_rank=$NODE_RANK \
+  --master_addr=$MASTER_ADDR \
+  --master_port=$MASTER_PORT \
+  train_fsdp2.py --config-name og2_7b_thd_gqa_fp8 \
+  fp8_layers='[3,4,...,30]' \
+  ... all other args ...
 LAUNCH_EOF

-# Then run the same command on rank 0
+# Step 2: Run the same torchrun command on rank 0
 cd /path/to/training/dir
-torchrun --nproc_per_node=8 --nnodes=6 --rdzv_backend=c10d ...
+torchrun \
+  --nproc_per_node=8 \
+  --nnodes=6 \
+  --node_rank=$NODE_RANK \
+  --master_addr=$MASTER_ADDR \
+  --master_port=$MASTER_PORT \
+  train_fsdp2.py --config-name og2_7b_thd_gqa_fp8 \
+  fp8_layers='[3,4,...,30]' \
+  ... all other args ...
 ```

 **CRITICAL rules:**

-- Write the launch script BEFORE starting torchrun on rank 0. Workers poll every 5 seconds. Torchrun waits for all nodes at the rendezvous point, so workers joining a few seconds later is fine.
+- **Use single-quoted heredoc** (`<< 'LAUNCH_EOF'`) when writing the script. This preserves `$NODE_RANK`, `$MASTER_ADDR`, and `$MASTER_PORT` as literal variables. Each worker node has these set to its own values by the Lepton environment. If you expand them, all workers will think they are rank 0.
+- Write the launch script BEFORE starting torchrun on rank 0. Workers poll every 5 seconds. Torchrun waits for all nodes to connect, so workers joining a few seconds later is fine.
 - The launch script must contain the EXACT same torchrun command you run on rank 0 (same arguments, same working directory).
 - When killing training: just kill the torchrun process on rank 0. Workers detect the disconnection (NCCL timeout) and their processes exit automatically. Workers then poll for the next numbered script.
 - Each relaunch (after demotion/recovery) uses the next number: `1.sh`, `2.sh`, `3.sh`, etc.
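
The single-quoted heredoc rule in the diff above can be checked in isolation. The following is a minimal sketch and not part of the commit: the temporary directory, the file names `quoted.sh` and `unquoted.sh`, and the simulated worker with `NODE_RANK=3` are assumptions made for illustration. It writes the same one-line script twice, once with a quoted delimiter and once with an unquoted one, then runs both as if on a worker node.

```bash
#!/bin/bash
# Illustration only: paths and rank values below are hypothetical.
set -euo pipefail

demo_dir=$(mktemp -d)

# Value seen by the node that writes the script (rank 0 in the guide).
export NODE_RANK=0

# Quoted delimiter: $NODE_RANK is written to the file literally.
cat > "$demo_dir/quoted.sh" << 'LAUNCH_EOF'
echo "node_rank=$NODE_RANK"
LAUNCH_EOF

# Unquoted delimiter: $NODE_RANK is expanded while writing,
# so rank 0's value is baked into the file.
cat > "$demo_dir/unquoted.sh" << LAUNCH_EOF
echo "node_rank=$NODE_RANK"
LAUNCH_EOF

# Simulate a worker whose environment sets NODE_RANK=3.
quoted_out=$(NODE_RANK=3 bash "$demo_dir/quoted.sh")
unquoted_out=$(NODE_RANK=3 bash "$demo_dir/unquoted.sh")

echo "quoted:   $quoted_out"
echo "unquoted: $unquoted_out"

rm -rf "$demo_dir"
```

The quoted variant reports the worker's own rank, while the unquoted variant reports the writer's rank regardless of where it runs, which is the failure the guide warns about.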
