-
-
Notifications
You must be signed in to change notification settings - Fork 223
Implement startup_order and stop_criteria #2714
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
199d4a3
8214ce9
1a1eb4e
9156dbe
6b1f71a
7fbc7af
f33ef4a
2ed20b7
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -10,7 +10,7 @@ | |
| import dstack._internal.server.services.gateways as gateways | ||
| import dstack._internal.server.services.services.autoscalers as autoscalers | ||
| from dstack._internal.core.errors import ServerError | ||
| from dstack._internal.core.models.profiles import RetryEvent | ||
| from dstack._internal.core.models.profiles import RetryEvent, StopCriteria | ||
| from dstack._internal.core.models.runs import ( | ||
| Job, | ||
| JobStatus, | ||
|
|
@@ -313,6 +313,10 @@ async def _process_active_run(session: AsyncSession, run_model: RunModel): | |
| termination_reason = RunTerminationReason.RETRY_LIMIT_EXCEEDED | ||
| else: | ||
| raise ValueError(f"Unexpected termination reason {run_termination_reasons}") | ||
| elif _should_stop_on_master_done(run): | ||
| new_status = RunStatus.TERMINATING | ||
| # ALL_JOBS_DONE is used for all DONE reasons including master-done | ||
| termination_reason = RunTerminationReason.ALL_JOBS_DONE | ||
|
jvstme marked this conversation as resolved.
|
||
| elif RunStatus.RUNNING in run_statuses: | ||
| new_status = RunStatus.RUNNING | ||
| elif RunStatus.PROVISIONING in run_statuses: | ||
|
|
@@ -434,3 +438,12 @@ def _can_retry_single_job(run_spec: RunSpec) -> bool: | |
| # We could make partial retry in some multi-node cases. | ||
| # E.g. restarting a worker node, independent jobs. | ||
| return False | ||
|
|
||
|
|
||
| def _should_stop_on_master_done(run: Run) -> bool: | ||
| if run.run_spec.merged_profile.stop_criteria != StopCriteria.MASTER_DONE: | ||
| return False | ||
| for job in run.jobs: | ||
| if job.job_spec.job_num == 0 and job.job_submissions[-1].status == JobStatus.DONE: | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. (nit) Can also check for
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not sure if we want to terminate the run before the master is really done. |
||
| return True | ||
| return False | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit) I'm not sure about calling non-master nodes "workers", because the master node is also a "worker" - it performs the same work other nodes do.
I can suggest to use "secondary" (
secondary-first) or avoid any names (master-last). Although we might still need a name to use in the codeThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
master/worker is standard terminalogy used for pytorch, mpi, etc. Let's not reinvent.