* Drop deprecated scheduled tasks
* Clean up dead code
* Update AUTOSCALING.md and RUNS-AND-JOBS.md for pipelines
* Make `JobSubmittedPipeline` wait for master election
* Fix tests
`contributing/AUTOSCALING.md` (10 additions & 7 deletions)
- STEP 1: `dstack-gateway` parses nginx `access.log` to collect per-second statistics about requests to the service and request times.
- STEP 2: `dstack-gateway` aggregates statistics over a 1-minute window.
- STEP 3: The server keeps gateway connections alive in the scheduled `process_gateways_connections` task and continuously collects stats from active gateways. This is separate from `GatewayPipeline`, which handles gateway provisioning and deletion.
- STEP 4: When `RunPipeline` processes a service run, it loads the latest collected gateway stats for that service.
- STEP 5: The autoscaler (configured via `dstack.yml`) computes the desired replica count for each replica group.
- STEP 6: `RunPipeline` applies that desired state.
  - For scale-up, it creates new `SUBMITTED` jobs. `JobSubmittedPipeline` then assigns existing capacity or provisions new capacity for them.
  - For scale-down, it marks the least-important active replicas as `TERMINATING` with `SCALED_DOWN`. `JobTerminatingPipeline` unregisters and cleans them up.
- STEP 7: If the service is in a rolling deployment, `RunPipeline` handles it in the same active-run processing path.
  - It allows only a limited surge of replacement replicas.
  - It delays teardown of old replicas until replacement capacity is available.
  - It also cleans up replicas that belong to replica groups removed from the configuration.
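The scale decision in STEPS 5-6 can be sketched roughly as follows. This is an illustrative sketch, not dstack's actual autoscaler API: `WindowStats`, `desired_replicas`, and the requests-per-second-per-replica target are all assumed names.

```python
import math
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Aggregated gateway stats for one service over the 1-minute window."""
    requests_per_second: float


def desired_replicas(
    stats: WindowStats,
    target_rps_per_replica: float,
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Compute the desired replica count for one replica group."""
    # No traffic: fall back to the configured minimum (possibly zero
    # for scale-to-zero services).
    if stats.requests_per_second <= 0:
        return min_replicas
    # Enough replicas to keep each one under the target load, clamped
    # to the configured bounds.
    needed = math.ceil(stats.requests_per_second / target_rps_per_replica)
    return min(max(needed, min_replicas), max_replicas)
```

`RunPipeline` would then compare this desired count with the current replicas: a positive difference creates new `SUBMITTED` jobs, a negative one marks replicas `TERMINATING` with `SCALED_DOWN`.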
`contributing/RUNS-AND-JOBS.md`

- STEP 1: The user submits the run. `services.runs.submit_run` creates jobs with status `SUBMITTED`. The run starts in `SUBMITTED`.
- For active runs, the run status is derived from the latest job states in priority order:
  1. If any non-retryable failure is present, the run becomes `TERMINATING` with the relevant `RunTerminationReason`.
  2. If `stop_criteria == MASTER_DONE` and the master job is done, the run becomes `TERMINATING` with `ALL_JOBS_DONE`.
  3. Otherwise, if any job is `RUNNING`, the run becomes `RUNNING`.
  4. Otherwise, if any job is `PROVISIONING` or `PULLING`, the run becomes `PROVISIONING`.
  5. Otherwise, if jobs are still waiting for placement or provisioning, the run stays `SUBMITTED`.
  6. Otherwise, if all contributing jobs are `DONE`, the run becomes `TERMINATING` with `ALL_JOBS_DONE`.
  7. Otherwise, if no active replicas remain and the run should be retried, the run becomes `PENDING`.
- Retryable replica failures are handled before the final transition is applied:
  - If a replica fails with a retryable reason while other replicas are still active, `RunPipeline` creates a new `SUBMITTED` submission for that replica and terminates the old jobs in that replica.
  - If all remaining work is retryable, the run ends up in `PENDING`.
- STEP 3: If the run is `PENDING`, `RunPipeline` processes it in the pending phase.
  - For retrying runs, it waits for an exponential backoff before resubmitting.
  - For scheduled runs, it waits until `next_triggered_at`.
  - For scaled-to-zero services, it can keep the run in `PENDING` until autoscaling wants replicas again.
  - Once the run is ready to continue, `RunPipeline` creates new `SUBMITTED` jobs and moves the run back to `SUBMITTED`.
- STEP 4: If the run is `TERMINATING`, `RunPipeline` marks active jobs as `TERMINATING` and assigns the corresponding `JobTerminationReason`.
- STEP 5: Once all jobs are finished, the terminating phase of `RunPipeline` either:
  - assigns the final run status (`TERMINATED`, `DONE`, or `FAILED`), or
  - for scheduled runs that were not stopped or aborted by the user, returns the run to `PENDING` and computes a new `next_triggered_at`.
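The priority order above can be sketched as a single function. This is a hedged illustration only: the real statuses are enum members and the real logic lives in `RunPipeline`; the plain-string statuses and the `derive_run_status` name are assumptions.

```python
def derive_run_status(job_statuses, non_retryable_failure=False,
                      master_done_stop=False):
    """Return the next run status for an active run, checked in priority order."""
    if non_retryable_failure:
        return "TERMINATING"   # rule 1: relevant RunTerminationReason attached
    if master_done_stop:
        return "TERMINATING"   # rule 2: stop_criteria == MASTER_DONE satisfied
    if "RUNNING" in job_statuses:
        return "RUNNING"       # rule 3
    if "PROVISIONING" in job_statuses or "PULLING" in job_statuses:
        return "PROVISIONING"  # rule 4
    if "SUBMITTED" in job_statuses:
        return "SUBMITTED"     # rule 5: still waiting for placement
    if job_statuses and all(s == "DONE" for s in job_statuses):
        return "TERMINATING"   # rule 6: ALL_JOBS_DONE
    return "PENDING"           # rule 7: retry with no active replicas left
```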
### Services

Services' run lifecycle has some modifications:

- During STEP 1, the service itself is registered on the gateway or the in-server proxy. If the gateway is not accessible or the domain name is taken, submission fails.
- During STEP 2, active run processing also computes desired replica counts from gateway stats and handles scale-up, scale-down, rolling deployment, and cleanup of removed replica groups.
- During STEP 2, jobs already marked `SCALED_DOWN` do not contribute to the run status.
- During STEP 3, a service can stay in `PENDING` when autoscaling currently wants zero replicas.
- During STEP 5, the terminating phase of `RunPipeline` unregisters the service from the gateway.
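Scale-down replica selection can be sketched as a sort by "importance". The ordering criteria here are assumptions for illustration (not-yet-running replicas go first, then higher `replica_num` first); only the idea of picking the least-important replicas comes from the docs above.

```python
def pick_replicas_to_scale_down(replicas, count):
    """Return `count` replicas to mark TERMINATING/SCALED_DOWN, least important first.

    Each replica is a (status, replica_num) tuple in this sketch.
    """
    def importance(replica):
        status, replica_num = replica
        # RUNNING replicas are more important than ones still starting;
        # among equals, lower replica_num is kept longer.
        return (status == "RUNNING", -replica_num)

    return sorted(replicas, key=importance)[:count]
```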
### When can the job be retried?

## Job's Lifecycle

- STEP 1: A newly submitted job has status `SUBMITTED`. It is not assigned to any instance yet.
- STEP 2: `JobSubmittedPipeline` tries to assign an existing instance or provision new capacity.
  - On success, the job becomes `PROVISIONING`.
  - On failure, the job becomes `TERMINATING`. `JobTerminatingPipeline` later assigns the final failed status.
- STEP 3: `JobRunningPipeline` processes `PROVISIONING`, `PULLING`, and `RUNNING` jobs.
  - While `dstack-shim`/`dstack-runner` is not responding, the job stays `PROVISIONING`.
  - Once `dstack-shim` (for VM-featured backends) becomes available, the pipeline submits the image and the job becomes `PULLING`.
  - Once `dstack-runner` inside the container becomes available, the pipeline uploads the code and job spec, and the job becomes `RUNNING`.
  - While the job is `RUNNING`, the pipeline keeps collecting logs and runner status.
  - If startup, runner communication, or replica registration fails, the job becomes `TERMINATING`.
- STEP 4: Once the job is actually ready, `JobRunningPipeline` initializes probes.
- If the job has `remove_at` in the future, it waits. This gives the job time for a graceful stop.
- Once `remove_at` is in the past, it stops the container, detaches volumes, unregisters service replicas if needed, and releases the instance assignment.
- If some volumes are not detached yet, the job stays `TERMINATING` and is retried.
- When cleanup is complete, the job becomes `TERMINATED`, `DONE`, `FAILED`, or `ABORTED` based on `JobTerminationReason`.
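The termination steps above can be sketched as an idempotent processing function that is retried until cleanup completes. Everything here is illustrative: the dict-based job record and helper stubs stand in for dstack's real models and services.

```python
from datetime import datetime, timedelta, timezone


def stop_container(job):
    """Stand-in for the shim call that stops the job container."""
    job["container_stopped"] = True


def detach_volumes(job):
    """Stand-in; returns True once all volumes are detached."""
    return job.get("volumes_detached", True)


def release_instance(job):
    """Stand-in; frees the instance assignment."""
    job["instance"] = None


def process_terminating_job(job, now):
    """One processing pass; returns the job's status after this pass."""
    remove_at = job.get("remove_at")
    if remove_at is not None and remove_at > now:
        return "TERMINATING"  # grace period for a graceful stop
    stop_container(job)
    if not detach_volumes(job):
        return "TERMINATING"  # stay TERMINATING; detach is retried later
    release_instance(job)
    # Finished status is derived from JobTerminationReason:
    # TERMINATED, DONE, FAILED, or ABORTED.
    return job["finished_status"]
```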
### Services' Jobs

Services' jobs lifecycle has some modifications:

- During STEP 3, once the primary job of a replica is `RUNNING` and ready to receive traffic, `JobRunningPipeline` registers that replica on the gateway. If the gateway is not accessible, the job fails with a gateway-related termination reason.
- During STEP 5, `JobTerminatingPipeline` unregisters the replica from receiving requests before the job is fully cleaned up.
`docs/docs/reference/environment-variables.md` (0 additions & 1 deletion)

- `DSTACK_SERVER_GCS_BUCKET`{ #DSTACK_SERVER_GCS_BUCKET } - The bucket that repo diffs will be uploaded to if set. If unset, diffs are uploaded to the database.
- `DSTACK_DB_POOL_SIZE`{ #DSTACK_DB_POOL_SIZE } - The client DB connections pool size. Defaults to `20`.
- `DSTACK_DB_MAX_OVERFLOW`{ #DSTACK_DB_MAX_OVERFLOW } - The client DB connections pool allowed overflow. Defaults to `20`.
- `DSTACK_SERVER_BACKGROUND_PROCESSING_DISABLED`{ #DSTACK_SERVER_BACKGROUND_PROCESSING_DISABLED } - Disables background processing if set to any value. Useful to run only the web frontend and API server.
- `DSTACK_SERVER_MAX_PROBES_PER_JOB`{ #DSTACK_SERVER_MAX_PROBES_PER_JOB } - Maximum number of probes allowed in a run configuration. Validated at apply time.
- `DSTACK_SERVER_MAX_PROBE_TIMEOUT`{ #DSTACK_SERVER_MAX_PROBE_TIMEOUT } - Maximum allowed timeout for a probe. Validated at apply time.
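As an illustration of how the DB pool variables might be consumed, here is a minimal reading sketch; only the variable names and their defaults come from the reference above, the function itself is invented.

```python
import os


def db_pool_settings(env=os.environ):
    """Read the documented DB pool variables, falling back to the defaults."""
    return {
        "pool_size": int(env.get("DSTACK_DB_POOL_SIZE", "20")),
        "max_overflow": int(env.get("DSTACK_DB_MAX_OVERFLOW", "20")),
    }
```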