
Implement startup_order and stop_criteria #2714

Merged
r4victor merged 8 commits into master from issue_2467_simpler_mpi
Jun 2, 2025
Conversation

@r4victor
Collaborator

A part of #2467

The PR introduces new run configuration properties:

  • startup_order allows specifying the order in which the master and worker jobs are started.
  • stop_criteria allows specifying the criteria that determine when a multi-node run is considered finished.

They simplify running mpirun with dstack:

```yaml
type: task
name: nccl-tests

nodes: 2
startup_order: workers-first
stop_criteria: master-done

image: dstackai/efa
commands:
  - |
    if [ ${DSTACK_NODE_RANK} -eq 0 ]; then
      cd /root/nccl-tests/build
      : > hostfile
      for ip in ${DSTACK_NODES_IPS}; do
        echo "${ip} slots=${DSTACK_GPUS_PER_NODE}" >> hostfile
      done
      MPIRUN='mpirun --allow-run-as-root --hostfile hostfile'
      # Run NCCL Tests
      ${MPIRUN} \
        -n ${DSTACK_GPUS_NUM} -N ${DSTACK_GPUS_PER_NODE} \
        --mca btl_tcp_if_exclude lo,docker0 \
        --bind-to none \
        ./all_reduce_perf -b 8 -e 8G -f 2 -g 1
    else
      sleep infinity
    fi
```

Other multi-node tasks, such as iperf, may require startup_order: master-first. Most, including pytorch, will work with the default startup_order: any.
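For illustration, a master-first task along those lines might look like this. This is only a sketch, not part of the PR: the iperf3 flags, the assumption that iperf3 is installed in the image, and the assumption that the first address in DSTACK_NODES_IPS belongs to the master are all hypothetical.

```yaml
type: task
name: iperf-sketch

nodes: 2
# The master's server must be listening before workers try to connect
startup_order: master-first
stop_criteria: master-done

commands:
  - |
    if [ ${DSTACK_NODE_RANK} -eq 0 ]; then
      # Master: run an iperf3 server; -1 exits after serving one client
      iperf3 -s -1
    else
      # Worker: assumes the first IP in DSTACK_NODES_IPS is the master's
      MASTER_IP=$(echo ${DSTACK_NODES_IPS} | awk '{print $1}')
      iperf3 -c ${MASTER_IP} -t 10
    fi
```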

TODO:

  • use the new properties in the NCCL tests examples

@peterschmidt85
Contributor

Do you plan to also support DSTACK_MPI_HOSTFILE? I guess we just need to mount this file and pass the environment variable on container start.

@r4victor
Collaborator Author

@peterschmidt85 in a separate PR

@r4victor
Collaborator Author

and then we can update the NCCL tests example

@r4victor r4victor requested a review from jvstme May 30, 2025 10:45
Comment thread: src/dstack/_internal/server/background/tasks/process_runs.py (Outdated)
```python
if run.run_spec.merged_profile.stop_criteria != StopCriteria.MASTER_DONE:
    return False
for job in run.jobs:
    if job.job_spec.job_num == 0 and job.job_submissions[-1].status == JobStatus.DONE:
```
Collaborator

@jvstme Jun 2, 2025


(nit) Can also check for termination_reason == JobTerminationReason.DONE_BY_RUNNER to terminate the run faster, without waiting for the terminating -> done master job transition. See line 241

Collaborator Author


I'm not sure if we want to terminate the run before the master is really done.
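To make the trade-off being discussed concrete, here is a self-contained sketch of the master-done check with the reviewer's suggested fast path included. The enum values and object shapes (job_spec, job_submissions) are stand-ins inferred from the snippet above, not dstack's actual internals.

```python
from enum import Enum

# Assumed stand-ins for dstack's internal enums; names come from the
# review snippet, values are guesses for illustration only.
class JobStatus(str, Enum):
    RUNNING = "running"
    TERMINATING = "terminating"
    DONE = "done"

class JobTerminationReason(str, Enum):
    DONE_BY_RUNNER = "done_by_runner"

def master_is_done(jobs) -> bool:
    """True once the master job (job_num == 0) has reached DONE, or,
    per the reviewer's nit, as soon as the runner reports it done."""
    for job in jobs:
        if job.job_spec.job_num != 0:
            continue
        last = job.job_submissions[-1]
        if last.status == JobStatus.DONE:
            return True
        # Reviewer's suggestion: don't wait for the
        # terminating -> done transition of the master job.
        if last.termination_reason is JobTerminationReason.DONE_BY_RUNNER:
            return True
    return False
```

The author's objection corresponds to the second `if`: with it, the run can be stopped while the master job is still in the terminating state.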

```python
class StartupOrder(str, Enum):
    ANY = "any"
    MASTER_FIRST = "master-first"
    WORKERS_FIRST = "workers-first"
```
Collaborator


(nit) I'm not sure about calling non-master nodes "workers", because the master node is also a "worker": it performs the same work the other nodes do.

I can suggest using "secondary" (secondary-first) or avoiding names altogether (master-last). Although we might still need a name to use in the code.

Collaborator Author


master/worker is standard terminology used for pytorch, mpi, etc. Let's not reinvent it.

@r4victor r4victor merged commit 2ddae6e into master Jun 2, 2025
25 checks passed
@r4victor r4victor deleted the issue_2467_simpler_mpi branch June 2, 2025 08:56