Commit 5eacbef

Merge remote-tracking branch 'origin/master' into issue_3727_default_docker_registry

2 parents: 423eba6 + dd6234e

99 files changed: +1759 −15523 lines


.gitignore

Lines changed: 1 addition & 0 deletions

@@ -26,5 +26,6 @@ uv.lock
 /src/dstack/_internal/server/statics

 profiling_results.html
+docs/docs/reference/api/http/openapi.json
 docs/docs/reference/api/rest/openapi.json
 docs/docs/reference/plugins/rest/rest_plugin_openapi.json

contributing/AUTOSCALING.md

Lines changed: 10 additions & 7 deletions

@@ -4,13 +4,16 @@

 - STEP 1: `dstack-gateway` parses nginx `access.log` to collect per-second statistics about requests to the service and request times.
 - STEP 2: `dstack-gateway` aggregates statistics over a 1-minute window.
-- STEP 3: The dstack server pulls all service statistics in the `process_gateways` background task.
-- STEP 4: The `process_runs` background task passes statistics and current replicas to the autoscaler.
-- STEP 5: The autoscaler (configured via the `dstack.yml` file) returns the replica change as an int.
-- STEP 6: `process_runs` calls `scale_run_replicas` to add or remove replicas.
-- STEP 7: `scale_run_replicas` terminates or starts replicas.
-  - `SUBMITTED` and `PROVISIONING` replicas get terminated before `RUNNING`.
-  - Replicas are terminated by descending `replica_num` and launched by ascending `replica_num`.
+- STEP 3: The server keeps gateway connections alive in the scheduled `process_gateways_connections` task and continuously collects stats from active gateways. This is separate from `GatewayPipeline`, which handles gateway provisioning and deletion.
+- STEP 4: When `RunPipeline` processes a service run, it loads the latest collected gateway stats for that service.
+- STEP 5: The autoscaler (configured via `dstack.yml`) computes the desired replica count for each replica group.
+- STEP 6: `RunPipeline` applies that desired state.
+  - For scale-up, it creates new `SUBMITTED` jobs. `JobSubmittedPipeline` then assigns existing capacity or provisions new capacity for them.
+  - For scale-down, it marks the least-important active replicas as `TERMINATING` with `SCALED_DOWN`. `JobTerminatingPipeline` unregisters and cleans them up.
+- STEP 7: If the service is in rolling deployment, `RunPipeline` handles that in the same active-run processing path.
+  - It allows only a limited surge of replacement replicas.
+  - It delays teardown of old replicas until replacement capacity is available.
+  - It also cleans up replicas that belong to replica groups removed from the configuration.

 ## RPSAutoscaler
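The core arithmetic of STEP 5 can be sketched as follows. This is an illustrative simplification, not the actual `RPSAutoscaler` implementation; `target_rps_per_replica` and the clamping behavior are assumptions:

```python
import math

def desired_replicas(avg_rps: float, target_rps_per_replica: float,
                     min_replicas: int, max_replicas: int) -> int:
    # Size the service so each replica handles roughly the target RPS.
    if target_rps_per_replica <= 0:
        raise ValueError("target RPS per replica must be positive")
    desired = math.ceil(avg_rps / target_rps_per_replica)
    # Clamp to the configured range; min_replicas == 0 allows scale-to-zero.
    return max(min_replicas, min(desired, max_replicas))
```

The real autoscaler additionally aggregates stats over a window and can rate-limit scaling decisions; this sketch only shows the replica-count computation.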

contributing/RUNNER-AND-SHIM.md

Lines changed: 2 additions & 2 deletions

@@ -31,7 +31,7 @@ A container is started in either `host` or `bridge` network mode depending on th

 In `bridge` mode, container ports are mapped to ephemeral host ports. `dstack-shim` stores the port mapping as part of the task's state. Currently, the default `bridge` network is used for all containers, but this could be changed in the future to improve container isolation.

-All communication between the `dstack` server and `dstack-shim` happens via REST API through an SSH tunnel. `dstack-shim` doesn't collect logs. Usually, it is run from a `cloud-init` user-data script.
+All communication between the `dstack` server and `dstack-shim` happens via HTTP API through an SSH tunnel. `dstack-shim` doesn't collect logs. Usually, it is run from a `cloud-init` user-data script.

 The entrypoint for the container:
 - Installs `openssh-server`
@@ -52,7 +52,7 @@ The entrypoint for the container:
 - Wait for the signal to terminate the commands
 - STEP 5: Wait until all logs are read by the server and the CLI. Or exit after a timeout

-All communication between the `dstack` server and `dstack-runner` happens via REST API through an SSH tunnel. `dstack-runner` collects the job logs and its own logs. Only the job logs are served via WebSocket.
+All communication between the `dstack` server and `dstack-runner` happens via HTTP API through an SSH tunnel. `dstack-runner` collects the job logs and its own logs. Only the job logs are served via WebSocket.

 ## SSH tunnels

contributing/RUNS-AND-JOBS.md

Lines changed: 45 additions & 42 deletions

@@ -17,31 +17,38 @@ A run can spawn one or multiple jobs, depending on the configuration. A task tha

 ## Run's Lifecycle

-- STEP 1: The user submits the run. `services.runs.submit_run` creates jobs with status `SUBMITTED`. Now the run has status `SUBMITTED`.
-- STEP 2: `background.tasks.process_runs` periodically pulls unfinished runs and processes them:
-  - If any job is `RUNNING`, the run becomes `RUNNING`.
-  - If any job is `PROVISIONING` or `PULLING`, the run becomes `PROVISIONING`.
-  - If any job fails and cannot be retried, the run becomes `TERMINATING`, and after processing, `FAILED`.
-  - If all jobs are `DONE`, the run becomes `TERMINATING`, and after processing, `DONE`.
-  - If any job fails, can be retried, and there is any other active job, the failed job will be resubmitted in-place.
-  - If any jobs in a replica fail and can be retried and there are other active replicas, the jobs of the failed replica are resubmitted in-place (without stopping other replicas). But if some jobs in a replica fail, then all the jobs in the replica are terminated and resubmitted. This includes multi-node tasks that represent one replica with multiple jobs.
-  - If all jobs fail and can be resubmitted, the run becomes `PENDING`.
-- STEP 3: If the run is `TERMINATING`, the server makes all jobs `TERMINATING`. `background.tasks.process_runs` sets their status to `TERMINATING`, assigns `JobTerminationReason`, and sends a graceful stop command to `dstack-runner`. `process_terminating_jobs` then ensures that jobs are terminated and assigns a finished status.
-- STEP 4: Once all jobs are finished, the run becomes `TERMINATED`, `DONE`, or `FAILED` based on `RunTerminationReason`.
-- STEP 0: If the run is `PENDING`, `background.tasks.process_runs` will resubmit jobs. The run becomes `SUBMITTED` again.
-
-> Use `switch_run_status()` for all status transitions. Do not set `RunModel.status` directly.
-
-> No one must assign the finished status to the run, except `services.runs.process_terminating_run`. To terminate the run, assign `TERMINATING` status and `RunTerminationReason`.
+- STEP 1: The user submits the run. `services.runs.submit_run` creates jobs with status `SUBMITTED`. The run starts in `SUBMITTED`.
+- STEP 2: `RunPipeline` continuously processes unfinished runs.
+  - For active runs, it derives the run status from the latest job states in priority order:
+    1. If any non-retryable failure is present, the run becomes `TERMINATING` with the relevant `RunTerminationReason`.
+    2. If `stop_criteria == MASTER_DONE` and the master job is done, the run becomes `TERMINATING` with `ALL_JOBS_DONE`.
+    3. Otherwise, if any job is `RUNNING`, the run becomes `RUNNING`.
+    4. Otherwise, if any job is `PROVISIONING` or `PULLING`, the run becomes `PROVISIONING`.
+    5. Otherwise, if jobs are still waiting for placement or provisioning, the run stays `SUBMITTED`.
+    6. Otherwise, if all contributing jobs are `DONE`, the run becomes `TERMINATING` with `ALL_JOBS_DONE`.
+    7. Otherwise, if no active replicas remain and the run should be retried, the run becomes `PENDING`.
+  - Retryable replica failures are handled before the final transition is applied:
+    - If a replica fails with a retryable reason while other replicas are still active, `RunPipeline` creates a new `SUBMITTED` submission for that replica and terminates the old jobs in that replica.
+    - If all remaining work is retryable, the run ends up in `PENDING`.
+- STEP 3: If the run is `PENDING`, `RunPipeline` processes it in the pending phase.
+  - For retrying runs, it waits for an exponential backoff before resubmitting.
+  - For scheduled runs, it waits until `next_triggered_at`.
+  - For scaled-to-zero services, it can keep the run in `PENDING` until autoscaling wants replicas again.
+  - Once the run is ready to continue, `RunPipeline` creates new `SUBMITTED` jobs and moves the run back to `SUBMITTED`.
+- STEP 4: If the run is `TERMINATING`, `RunPipeline` marks active jobs as `TERMINATING` and assigns the corresponding `JobTerminationReason`.
+- STEP 5: Once all jobs are finished, the terminating phase of `RunPipeline` either:
+  - assigns the final run status (`TERMINATED`, `DONE`, or `FAILED`), or
+  - for scheduled runs that were not stopped or aborted by the user, returns the run to `PENDING` and computes a new `next_triggered_at`.

 ### Services

-Services' lifecycle has some modifications:
+Services' run lifecycle has some modifications:

-- During STEP 1, the service is registered on the gateway. If the gateway is not accessible or the domain name is taken, the run submission fails.
-- During STEP 2, downscaled jobs are ignored.
-- During STEP 4, the service is unregistered on the gateway.
-- During STEP 0, the service can stay in `PENDING` status if it was downscaled to zero (WIP).
+- During STEP 1, the service itself is registered on the gateway or the in-server proxy. If the gateway is not accessible or the domain name is taken, submission fails.
+- During STEP 2, active run processing also computes desired replica counts from gateway stats and handles scale-up, scale-down, rolling deployment, and cleanup of removed replica groups.
+- During STEP 2, jobs already marked `SCALED_DOWN` do not contribute to the run status.
+- During STEP 3, a service can stay in `PENDING` when autoscaling currently wants zero replicas.
+- During STEP 5, the terminating phase of `RunPipeline` unregisters the service from the gateway.

 ### When can the job be retried?
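The priority order in STEP 2 can be sketched as a pure function over job states. The status strings and the simplified retry flag are illustrative assumptions; the real derivation lives in the server's run processing:

```python
def derive_run_status(job_statuses: list[str], all_retryable: bool) -> str:
    # Checked in priority order: failures win, then liveness, then completion.
    if "FAILED" in job_statuses and not all_retryable:
        return "TERMINATING"  # non-retryable failure
    if "RUNNING" in job_statuses:
        return "RUNNING"
    if "PROVISIONING" in job_statuses or "PULLING" in job_statuses:
        return "PROVISIONING"
    if "SUBMITTED" in job_statuses:
        return "SUBMITTED"  # still waiting for placement or provisioning
    if job_statuses and all(s == "DONE" for s in job_statuses):
        return "TERMINATING"  # terminates with ALL_JOBS_DONE
    return "PENDING"  # no active replicas remain; the run may be retried
```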

@@ -54,29 +61,25 @@ Services' lifecycle has some modifications:

 ## Job's Lifecycle

 - STEP 1: A newly submitted job has status `SUBMITTED`. It is not assigned to any instance yet.
-- STEP 2: `background.tasks.process_submitted_jobs` tries to assign an existing instance or provision a new one.
-  - On success, the job becomes `PROVISIONING`.
-  - On failure, the job becomes `TERMINATING`, and after processing, `FAILED` because of `FAILED_TO_START_DUE_TO_NO_CAPACITY`.
-- STEP 3: `background.tasks.process_running_jobs` periodically pulls unfinished jobs and processes them.
-  - While `dstack-shim`/`dstack-runner` is not responding, the job stays `PROVISIONING`.
-  - Once `dstack-shim` (for VM-featured backends) becomes available, it submits the docker image name, and the job becomes `PULLING`.
-  - Once `dstack-runner` inside a docker container becomes available, it submits the code and the job spec, and the job becomes `RUNNING`.
-  - If `dstack-shim` or `dstack-runner` don't respond for a long time or fail to respond after a successful connection and multiple retries, the job becomes `TERMINATING`, and after processing, `FAILED`.
-- STEP 4: `background.tasks.process_running_jobs` processes `RUNNING` jobs, pulling job logs, runner logs, and job status.
-  - If the pulled status is `DONE`, the job becomes `TERMINATING`, and after processing, `DONE`.
-  - Otherwise, the job becomes `TERMINATING`, and after processing, `FAILED`.
-- STEP 5: `background.tasks.process_terminating_jobs` processes `TERMINATING` jobs.
-  - If the job has `remove_at` in the future, nothing happens. This is to give the job some time for a graceful stop.
-  - Once `remove_at` is in the past, it stops the container via `dstack-shim`, detaches instance volumes, and releases the instance. The job becomes `TERMINATED`, `DONE`, `FAILED`, or `ABORTED` based on `JobTerminationReason`.
-  - If some volumes fail to detach, it keeps the job `TERMINATING` and checks the volumes' attachment status.
-
-> Use `switch_job_status()` for all status transitions. Do not set `JobModel.status` directly.
-
-> No one must assign the finished status to the job, except `services.jobs.process_terminating_job`. To terminate the job, assign `TERMINATING` status and `JobTerminationReason`.
+- STEP 2: `JobSubmittedPipeline` tries to assign an existing instance or provision new capacity.
+  - On success, the job becomes `PROVISIONING`.
+  - On failure, the job becomes `TERMINATING`. `JobTerminatingPipeline` later assigns the final failed status.
+- STEP 3: `JobRunningPipeline` processes `PROVISIONING`, `PULLING`, and `RUNNING` jobs.
+  - While `dstack-shim` / `dstack-runner` is not responding, the job stays `PROVISIONING`.
+  - Once `dstack-shim` (for VM-featured backends) becomes available, the pipeline submits the image and the job becomes `PULLING`.
+  - Once `dstack-runner` inside the container becomes available, the pipeline uploads the code and job spec, and the job becomes `RUNNING`.
+  - While the job is `RUNNING`, the pipeline keeps collecting logs and runner status.
+  - If startup, runner communication, or replica registration fails, the job becomes `TERMINATING`.
+- STEP 4: Once the job is actually ready, `JobRunningPipeline` initializes probes.
+- STEP 5: `JobTerminatingPipeline` processes `TERMINATING` jobs.
+  - If the job has `remove_at` in the future, it waits. This gives the job time for a graceful stop.
+  - Once `remove_at` is in the past, it stops the container, detaches volumes, unregisters service replicas if needed, and releases the instance assignment.
+  - If some volumes are not detached yet, the job stays `TERMINATING` and is retried.
+  - When cleanup is complete, the job becomes `TERMINATED`, `DONE`, `FAILED`, or `ABORTED` based on `JobTerminationReason`.

 ### Services' Jobs

 Services' jobs lifecycle has some modifications:

-- During STEP 3, once the job becomes `RUNNING`, it is registered on the gateway as a replica. If the gateway is not accessible, the job fails.
-- During STEP 5, the job is unregistered on the gateway (WIP).
+- During STEP 3, once the primary job of a replica is `RUNNING` and ready to receive traffic, `JobRunningPipeline` registers that replica on the gateway. If the gateway is not accessible, the job fails with a gateway-related termination reason.
+- During STEP 5, `JobTerminatingPipeline` unregisters the replica from receiving requests before the job is fully cleaned up.
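STEP 5's graceful-stop window can be sketched like this. The function and parameter names are hypothetical; only the `remove_at` semantics come from the text above:

```python
from datetime import datetime

def process_terminating_job(remove_at: datetime, volumes_detached: bool,
                            now: datetime) -> str:
    # Wait out the graceful-stop window, then wait for volume detachment.
    if now < remove_at:
        return "TERMINATING"  # give the job time to stop gracefully
    if not volumes_detached:
        return "TERMINATING"  # retried until all volumes are detached
    # The real final status is TERMINATED/DONE/FAILED/ABORTED,
    # chosen from JobTerminationReason; simplified here.
    return "TERMINATED"
```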

docs/blog/posts/dstack-metrics.md

Lines changed: 2 additions & 2 deletions

@@ -31,9 +31,9 @@ difference is that `dstack stats` includes GPU VRAM usage and GPU utilization pe
 Similar to `kubectl top`, if a run consists of multiple jobs (such as distributed training or an auto-scalable service),
 `dstack stats` will display metrics per job.

-!!! info "REST API"
+!!! info "HTTP API"
     In addition to the `dstack stats` CLI commands, metrics can be obtained via the
-    [`/api/project/{project_name}/metrics/job/{run_name}`](../../docs/reference/api/rest/#operations-tag-metrics) REST endpoint.
+    [`/api/project/{project_name}/metrics/job/{run_name}`](../../docs/reference/api/http/#operations-tag-metrics) HTTP endpoint.

 ## Why monitor GPU usage
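A request to that metrics endpoint could be built as follows. The bearer-token `Authorization` header is an assumption here; check your server's authentication setup:

```python
from urllib.request import Request

def metrics_request(server: str, project: str, run: str, token: str) -> Request:
    # Endpoint path as documented: /api/project/{project_name}/metrics/job/{run_name}
    url = f"{server}/api/project/{project}/metrics/job/{run}"
    return Request(url, headers={"Authorization": f"Bearer {token}"})
```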

docs/docs/concepts/projects.md

Lines changed: 1 addition & 1 deletion

@@ -66,4 +66,4 @@ You can find the command on the project’s settings page:
 <img src="https://dstack.ai/static-assets/static-assets/images/dstack-projects-project-cli-v2.png" width="750px" />

 ??? info "API"
-    In addition to the UI, managing projects, users, and user permissions can also be done via the [REST API](../reference/api/rest/index.md).
+    In addition to the UI, managing projects, users, and user permissions can also be done via the [HTTP API](../reference/api/http/index.md).

docs/docs/guides/migration/slurm.md

Lines changed: 1 addition & 1 deletion

@@ -27,7 +27,7 @@ Both Slurm and `dstack` follow a client-server architecture with a control plane
 |---|---------------|-------------------|
 | **Control plane** | `slurmctld` (controller) | `dstack-server` |
 | **State persistence** | `slurmdbd` (database) | `dstack-server` (SQLite/PostgreSQL) |
-| **REST API** | `slurmrestd` (REST API) | `dstack-server` (HTTP API) |
+| **API** | `slurmrestd` (REST API) | `dstack-server` (HTTP API) |
 | **Compute plane** | `slurmd` (compute agent) | `dstack-shim` (on VMs/hosts) and/or `dstack-runner` (inside containers) |
 | **Client** | CLI from login nodes | CLI from anywhere |
 | **High availability** | Active-passive failover (typically 2 controller nodes) | Horizontal scaling with multiple server replicas (requires PostgreSQL) |

docs/docs/guides/server-deployment.md

Lines changed: 27 additions & 18 deletions

@@ -135,6 +135,11 @@ To store the server state in Postgres, set the `DSTACK_DATABASE_URL` environment
 $ DSTACK_DATABASE_URL=postgresql+asyncpg://user:password@db-host:5432/dstack dstack server
 ```

+The minimum requirements for the DB instance are 2 CPU, 2GB of RAM, and at least 50 `max_connections` per server replica,
+or a configured connection pooler to handle that many connections.
+If you're using a smaller DB instance, you may need to set lower `DSTACK_DB_POOL_SIZE` and `DSTACK_DB_MAX_OVERFLOW`, e.g.
+`DSTACK_DB_POOL_SIZE=10` and `DSTACK_DB_MAX_OVERFLOW=0`.
+
 ??? info "Migrate from SQLite to PostgreSQL"
     You can migrate the existing state from SQLite to PostgreSQL using `pgloader`:
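The connection math above can be sketched as: each server replica may open up to `DSTACK_DB_POOL_SIZE + DSTACK_DB_MAX_OVERFLOW` connections, so Postgres `max_connections` must cover all replicas. The extra headroom for maintenance sessions is an assumed figure, not a documented requirement:

```python
def required_max_connections(replicas: int, pool_size: int,
                             max_overflow: int, headroom: int = 10) -> int:
    # Each server replica can open up to pool_size + max_overflow connections.
    return replicas * (pool_size + max_overflow) + headroom
```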

@@ -349,6 +354,22 @@ The bucket must be created beforehand. `dstack` won't try to create it.
     storage.objects.update
 ```

+## SSH proxy
+
+[`dstack-sshproxy`](https://github.com/dstackai/sshproxy) is an optional component that provides direct SSH access to workloads.
+
+Without SSH proxy, connecting to a job via SSH or using an IDE URL requires the `dstack attach` CLI command, which configures the user's SSH client in a backend-specific way for each job.
+
+When SSH proxy is deployed, there is one well-known entry point – a proxy address – for all `dstack` jobs, which can be used for SSH access without any additional steps on the user's side (such as installing `dstack` and executing `dstack attach` each time). All the user has to do is upload their public key to the `dstack` server once – there is a dedicated “SSH keys” tab on the user's page of the control plane UI.
+
+To deploy SSH proxy, see the `dstack-sshproxy` [Deployment guide](https://github.com/dstackai/sshproxy/blob/main/DEPLOYMENT.md).
+
+To enable SSH proxy integration on the `dstack` server side, set the following environment variables:
+
+* `DSTACK_SSHPROXY_API_TOKEN` – a token used to authenticate SSH proxy API requests; must be the same value as when deploying `dstack-sshproxy`.
+* `DSTACK_SERVER_SSHPROXY_ADDRESS` – an address where SSH proxy is available to `dstack` users, in the `HOSTNAME[:PORT]` form, where `HOSTNAME` is a domain name or an IP address, and `PORT`, if not specified, defaults to 22.
+
 ## Encryption

 By default, `dstack` stores data in plaintext. To enforce encryption, you
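Parsing the `HOSTNAME[:PORT]` form from `DSTACK_SERVER_SSHPROXY_ADDRESS` can be sketched with a hypothetical helper; bracketed IPv6 addresses are not handled in this sketch:

```python
def parse_sshproxy_address(addr: str) -> tuple[str, int]:
    # HOSTNAME[:PORT]; PORT defaults to 22 when omitted.
    host, sep, port = addr.rpartition(":")
    if not sep:
        return addr, 22
    return host, int(port)
```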
@@ -456,26 +477,14 @@ Backward compatibility is maintained based on these principles:

 ## Server limits

-A single `dstack` server replica can support:
-
-* Up to 150 active runs.
-* Up to 150 active jobs.
-* Up to 150 active instances.
-
-Having more active resources will work but can affect server performance.
-If you hit these limits, consider using Postgres with multiple server replicas.
-You can also increase processing rates of a replica by setting the `DSTACK_SERVER_BACKGROUND_PROCESSING_FACTOR` environment variable.
-You should also increase `DSTACK_DB_POOL_SIZE` and `DSTACK_DB_MAX_OVERFLOW` proportionally.
-For example, to increase processing rates 4 times, set:
-
-```
-export DSTACK_SERVER_BACKGROUND_PROCESSING_FACTOR=4
-export DSTACK_DB_POOL_SIZE=80
-export DSTACK_DB_MAX_OVERFLOW=80
-```
-
-You have to ensure your Postgres installation supports that many connections by
-configuring [`max_connections`](https://www.postgresql.org/docs/current/runtime-config-connection.html#GUC-MAX-CONNECTIONS) and/or using a connection pooler.
+A single `dstack` server replica can support at least:
+
+* 1000 active instances
+* 1000 active runs
+* 1000 active jobs
+
+If you hit server performance limits, try scaling up server instances and/or configuring Postgres with multiple server replicas.
+Also, please [submit a GitHub issue](https://github.com/dstackai/dstack/issues) describing your setup – we strive to improve `dstack` scalability and efficiency.

 ## Server upgrades

Lines changed: 2 additions & 2 deletions

@@ -1,8 +1,8 @@
 ---
-title: REST API
+title: HTTP API
 ---

-The REST API enables running tasks, services, and managing runs programmatically.
+The HTTP API enables running tasks, services, and managing runs programmatically.

 ## Usage example