Switch torchrun to static mode matching proven Lepton scripts
Use --node_rank=$NODE_RANK --master_addr --master_port instead of
rdzv mode. This matches submit_og2_lepton_eden.py which has been
running multi-node training successfully. Also clarify in the
Multi-Node Launch Protocol that workers must use single-quoted
heredocs to preserve $NODE_RANK/$MASTER_ADDR as variables.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
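
As a rough illustration of the static mode described in the commit message, the sketch below assembles a torchrun command that passes `--node_rank`, `--master_addr`, and `--master_port` explicitly rather than using the `--rdzv_*` options. `NNODES`, `NPROC_PER_NODE`, their default values, and `train.py` are hypothetical placeholders, not taken from `submit_og2_lepton_eden.py`. The function only prints the command so it can be inspected before being run.

```shell
#!/usr/bin/env bash
# Sketch only: NNODES, NPROC_PER_NODE, and train.py are hypothetical placeholders.
# NODE_RANK, MASTER_ADDR, and MASTER_PORT are assumed to be set on each node
# by the cluster environment, as the commit message describes.

build_torchrun_cmd() {
  # Static mode: each node is told its own rank and the master endpoint.
  # ${VAR:?msg} aborts with msg if the variable is unset or empty.
  printf '%s ' torchrun \
    --nnodes="${NNODES:-2}" \
    --nproc_per_node="${NPROC_PER_NODE:-8}" \
    --node_rank="${NODE_RANK:?NODE_RANK must be set}" \
    --master_addr="${MASTER_ADDR:?MASTER_ADDR must be set}" \
    --master_port="${MASTER_PORT:-29500}" \
    train.py
}
```

Calling `build_torchrun_cmd` on a node prints the full command line with that node's own values substituted.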
-- Write the launch script BEFORE starting torchrun on rank 0. Workers poll every 5 seconds. Torchrun waits for all nodes at the rendezvous point, so workers joining a few seconds later is fine.
+- **Use single-quoted heredoc** (`<< 'LAUNCH_EOF'`) when writing the script. This preserves `$NODE_RANK`, `$MASTER_ADDR`, and `$MASTER_PORT` as literal variables. Each worker node has these set to its own values by the Lepton environment. If you expand them, all workers will think they are rank 0.
+- Write the launch script BEFORE starting torchrun on rank 0. Workers poll every 5 seconds. Torchrun waits for all nodes to connect, so workers joining a few seconds later is fine.
 - The launch script must contain the EXACT same torchrun command you run on rank 0 (same arguments, same working directory).
 - When killing training: just kill the torchrun process on rank 0. Workers detect the disconnection (NCCL timeout) and their processes exit automatically. Workers then poll for the next numbered script.
 - Each relaunch (after demotion/recovery) uses the next number: `1.sh`, `2.sh`, `3.sh`, etc.
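
The protocol bullets above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual scripts: the directory argument, the torchrun arguments, and `train.py` are hypothetical, and the write-then-rename step is an added precaution that the source does not mention. The quoted delimiter `'LAUNCH_EOF'` is what keeps `$NODE_RANK`, `$MASTER_ADDR`, and `$MASTER_PORT` unexpanded in the written file.

```shell
#!/usr/bin/env bash
# Sketch only. The directory layout and train.py are hypothetical; the
# source only specifies the numbered names 1.sh, 2.sh, 3.sh, ...

# Rank 0: write launch script number N into DIR.
write_launch_script() {
  dir="$1"; n="$2"
  mkdir -p "$dir"
  # Quoted delimiter: the shell performs no expansion inside the heredoc,
  # so the variables below are written to the file literally.
  cat > "$dir/$n.sh.tmp" << 'LAUNCH_EOF'
#!/usr/bin/env bash
torchrun --nnodes=2 --nproc_per_node=8 \
  --node_rank=$NODE_RANK --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT \
  train.py
LAUNCH_EOF
  # Assumption, not from the source: rename last so a polling worker
  # never picks up a half-written script.
  mv "$dir/$n.sh.tmp" "$dir/$n.sh"
}

# Worker: poll until script N exists in DIR, then print its path.
# The interval defaults to 5 seconds, matching the protocol text.
wait_for_script() {
  dir="$1"; n="$2"; interval="${3:-5}"
  while [ ! -f "$dir/$n.sh" ]; do
    sleep "$interval"
  done
  printf '%s\n' "$dir/$n.sh"
}
```

A worker might then run `bash "$(wait_for_script /shared/launch 1)"` and, once that process exits, wait for `2.sh` in the same way. Here `/shared/launch` stands in for whatever shared location the nodes actually use.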