Skip to content

Commit 62e5b87

Browse files
authored
Service probes (#2927)
Allow configuring HTTP probes for services and use probe results to avoid downtime during deployments.
1 parent 756bf5b commit 62e5b87

28 files changed

Lines changed: 1108 additions & 43 deletions

File tree

docs/docs/concepts/services.md

Lines changed: 49 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -187,6 +187,53 @@ port: 8000
187187

188188
</div>
189189

190+
### Probes
191+
192+
Configure one or more HTTP probes to periodically check the health of the service.
193+
194+
<div editor-title="service.dstack.yml">
195+
196+
```yaml
197+
type: service
198+
name: my-service
199+
port: 80
200+
image: my-app:latest
201+
probes:
202+
- type: http
203+
url: /health
204+
interval: 15s
205+
```
206+
207+
</div>
208+
209+
You can track probe statuses in `dstack ps --verbose`.
210+
211+
<div class="termy">
212+
213+
```shell
214+
$ dstack ps --verbose
215+
216+
NAME BACKEND STATUS PROBES SUBMITTED
217+
my-service deployment=1 running 11 mins ago
218+
replica=0 job=0 deployment=0 aws (us-west-2) running ✓ 11 mins ago
219+
replica=1 job=0 deployment=1 aws (us-west-2) running × 1 min ago
220+
```
221+
222+
</div>
223+
224+
??? info "Probe statuses"
225+
The following symbols are used for probe statuses:
226+
227+
- `×` &mdash; the last probe execution failed.
228+
- `~` &mdash; the last probe execution succeeded, but the [`ready_after`](../reference/dstack.yml/service.md#ready_after) threshold is not yet reached.
229+
- `` &mdash; the last `ready_after` probe executions succeeded.
230+
231+
If multiple probes are configured for the service, their statuses are displayed in the order in which the probes appear in the configuration.
232+
233+
Probes are executed for each service replica while the replica is `running`. Probe statuses do not affect how `dstack` handles replicas, except during [rolling deployments](#rolling-deployment).
234+
235+
See the [reference](../reference/dstack.yml/service.md#probes) for more probe configuration options.
236+
190237
### Path prefix { #path-prefix }
191238

192239
If your `dstack` project doesn't have a [gateway](gateways.md), services are hosted with the
@@ -758,7 +805,7 @@ Update the run? [y/n]:
758805

759806
</div>
760807

761-
If approved, `dstack` gradually updates the service replicas. To update a replica, `dstack` starts a new replica, waits for it to become `running`, then terminates the old replica. This process is repeated for each replica, one at a time.
808+
If approved, `dstack` gradually updates the service replicas. To update a replica, `dstack` starts a new replica, waits for it to become `running` and for all of its [probes](#probes) to pass, then terminates the old replica. This process is repeated for each replica, one at a time.
762809

763810
You can track the progress of rolling deployment in both `dstack apply` or `dstack ps`.
764811
Older replicas have lower `deployment` numbers; newer ones have higher.
@@ -783,7 +830,7 @@ The rolling deployment stops when all replicas are updated or when a new deploym
783830
??? info "Supported properties"
784831
<!-- NOTE: should be in sync with constants in server/services/runs.py -->
785832

786-
Rolling deployment supports changes to the following properties: `port`, `resources`, `volumes`, `docker`, `files`, `image`, `user`, `privileged`, `entrypoint`, `working_dir`, `python`, `nvcc`, `single_branch`, `env`, `shell`, `commands`, as well as changes to [repo](repos.md) or [file](#files) contents.
833+
Rolling deployment supports changes to the following properties: `port`, `probes`, `resources`, `volumes`, `docker`, `files`, `image`, `user`, `privileged`, `entrypoint`, `working_dir`, `python`, `nvcc`, `single_branch`, `env`, `shell`, `commands`, as well as changes to [repo](repos.md) or [file](#files) contents.
787834

788835
Changes to `replicas` and `scaling` can be applied without redeploying replicas.
789836

docs/docs/reference/dstack.yml/service.md

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -106,6 +106,16 @@ The `service` configuration type allows running [services](../../concepts/servic
106106
type:
107107
required: true
108108

109+
### `probes`
110+
111+
#### `probes[n]`
112+
113+
#SCHEMA# dstack._internal.core.models.configurations.ProbeConfig
114+
overrides:
115+
show_root_heading: false
116+
type:
117+
required: true
118+
109119
### `retry`
110120

111121
#SCHEMA# dstack._internal.core.models.profiles.ProfileRetry

docs/docs/reference/environment-variables.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -124,6 +124,8 @@ For more details on the options below, refer to the [server deployment](../guide
124124
- `DSTACK_DB_MAX_OVERFLOW`{ #DSTACK_DB_MAX_OVERFLOW } - The client DB connections pool allowed overflow. Defaults to `20`.
125125
- `DSTACK_SERVER_BACKGROUND_PROCESSING_FACTOR`{ #DSTACK_SERVER_BACKGROUND_PROCESSING_FACTOR } - The number of background jobs for processing server resources. Increase if you need to process more resources per server replica quickly. Defaults to `1`.
126126
- `DSTACK_SERVER_BACKGROUND_PROCESSING_DISABLED`{ #DSTACK_SERVER_BACKGROUND_PROCESSING_DISABLED } - Disables background processing if set to any value. Useful to run only web frontend and API server.
127+
- `DSTACK_SERVER_MAX_PROBES_PER_JOB`{ #DSTACK_SERVER_MAX_PROBES_PER_JOB } - Maximum number of probes allowed in a run configuration. Validated at apply time.
128+
- `DSTACK_SERVER_MAX_PROBE_TIMEOUT`{ #DSTACK_SERVER_MAX_PROBE_TIMEOUT } - Maximum allowed timeout for a probe. Validated at apply time.
127129

128130
??? info "Internal environment variables"
129131
The following environment variables are intended for development purposes:

src/dstack/_internal/cli/utils/run.py

Lines changed: 32 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -12,11 +12,15 @@
1212
TerminationPolicy,
1313
)
1414
from dstack._internal.core.models.runs import (
15+
JobStatus,
16+
Probe,
17+
ProbeSpec,
1518
RunPlan,
1619
)
1720
from dstack._internal.core.services.profiles import get_termination
1821
from dstack._internal.utils.common import (
1922
DateFormatter,
23+
batched,
2024
format_duration_multiunit,
2125
format_pretty_duration,
2226
pretty_date,
@@ -156,6 +160,12 @@ def get_runs_table(
156160
table.add_column("INSTANCE TYPE", no_wrap=True, ratio=1)
157161
table.add_column("PRICE", style="grey58", ratio=1)
158162
table.add_column("STATUS", no_wrap=True, ratio=1)
163+
if verbose or any(
164+
run._run.is_deployment_in_progress()
165+
and any(job.job_submissions[-1].probes for job in run._run.jobs)
166+
for run in runs
167+
):
168+
table.add_column("PROBES", ratio=1)
159169
table.add_column("SUBMITTED", style="grey58", no_wrap=True, ratio=1)
160170
if verbose:
161171
table.add_column("ERROR", no_wrap=True, ratio=2)
@@ -198,6 +208,9 @@ def get_runs_table(
198208
else ""
199209
),
200210
"STATUS": latest_job_submission.status_message,
211+
"PROBES": _format_job_probes(
212+
job.job_spec.probes, latest_job_submission.probes, latest_job_submission.status
213+
),
201214
"SUBMITTED": format_date(latest_job_submission.submitted_at),
202215
"ERROR": latest_job_submission.error,
203216
}
@@ -226,3 +239,22 @@ def get_runs_table(
226239
add_row_from_dict(table, job_row, style="secondary" if len(run.jobs) != 1 else None)
227240

228241
return table
242+
243+
244+
def _format_job_probes(
245+
probe_specs: list[ProbeSpec], probes: list[Probe], job_status: JobStatus
246+
) -> str:
247+
if not probes or job_status != JobStatus.RUNNING:
248+
return ""
249+
statuses = []
250+
for probe_spec, probe in zip(probe_specs, probes):
251+
# NOTE: the symbols are documented in concepts/services.md, keep in sync.
252+
if probe.success_streak >= probe_spec.ready_after:
253+
status = "[code]✓[/]"
254+
elif probe.success_streak > 0:
255+
status = "[warning]~[/]"
256+
else:
257+
status = "[error]×[/]"
258+
statuses.append(status)
259+
# split into whitespace-delimited batches to allow column wrapping
260+
return " ".join("".join(batch) for batch in batched(statuses, 5))

src/dstack/_internal/core/compatibility/runs.py

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -53,6 +53,8 @@ def get_apply_plan_excludes(plan: ApplyRunPlanInput) -> Optional[IncludeExcludeD
5353
job_submissions_excludes["exit_status"] = True
5454
if all(js.deployment_num == 0 for js in job_submissions):
5555
job_submissions_excludes["deployment_num"] = True
56+
if all(not js.probes for js in job_submissions):
57+
job_submissions_excludes["probes"] = True
5658
latest_job_submission = current_resource.latest_job_submission
5759
if latest_job_submission is not None:
5860
latest_job_submission_excludes: IncludeExcludeDictType = {}
@@ -69,6 +71,8 @@ def get_apply_plan_excludes(plan: ApplyRunPlanInput) -> Optional[IncludeExcludeD
6971
latest_job_submission_excludes["exit_status"] = True
7072
if latest_job_submission.deployment_num == 0:
7173
latest_job_submission_excludes["deployment_num"] = True
74+
if not latest_job_submission.probes:
75+
latest_job_submission_excludes["probes"] = True
7276
return {"plan": apply_plan_excludes}
7377

7478

@@ -120,6 +124,8 @@ def get_run_spec_excludes(run_spec: RunSpec) -> IncludeExcludeDictType:
120124
profile_excludes.add("startup_order")
121125
if configuration.stop_criteria is None:
122126
configuration_excludes["stop_criteria"] = True
127+
if isinstance(configuration, ServiceConfiguration) and not configuration.probes:
128+
configuration_excludes["probes"] = True
123129
if profile is not None and profile.stop_criteria is None:
124130
profile_excludes.add("stop_criteria")
125131
if not configuration.files:
@@ -154,6 +160,8 @@ def get_job_spec_excludes(job_specs: list[JobSpec]) -> IncludeExcludeDictType:
154160
spec_excludes["file_archives"] = True
155161
if all(s.service_port is None for s in job_specs):
156162
spec_excludes["service_port"] = True
163+
if all(not s.probes for s in job_specs):
164+
spec_excludes["probes"] = True
157165

158166
return spec_excludes
159167

src/dstack/_internal/core/models/configurations.py

Lines changed: 90 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -14,7 +14,7 @@
1414
from dstack._internal.core.models.files import FilePathMapping
1515
from dstack._internal.core.models.fleets import FleetConfiguration
1616
from dstack._internal.core.models.gateways import GatewayConfiguration
17-
from dstack._internal.core.models.profiles import ProfileParams, parse_off_duration
17+
from dstack._internal.core.models.profiles import ProfileParams, parse_duration, parse_off_duration
1818
from dstack._internal.core.models.resources import Range, ResourcesSpec
1919
from dstack._internal.core.models.services import AnyModel, OpenAIChatModel
2020
from dstack._internal.core.models.unix import UnixUser
@@ -32,6 +32,13 @@
3232
RUN_PRIOTIRY_MAX = 100
3333
RUN_PRIORITY_DEFAULT = 0
3434
DEFAULT_REPO_DIR = "/workflow"
35+
MIN_PROBE_TIMEOUT = 1
36+
MIN_PROBE_INTERVAL = 1
37+
DEFAULT_PROBE_URL = "/"
38+
DEFAULT_PROBE_TIMEOUT = 10
39+
DEFAULT_PROBE_INTERVAL = 15
40+
DEFAULT_PROBE_READY_AFTER = 1
41+
MAX_PROBE_URL_LEN = 2048
3542

3643

3744
class RunConfigurationType(str, Enum):
@@ -162,6 +169,74 @@ class RateLimit(CoreModel):
162169
] = 0
163170

164171

172+
class ProbeConfig(CoreModel):
173+
type: Literal["http"] # expect other probe types in the future, namely `exec`
174+
url: Annotated[
175+
Optional[str], Field(description=f"The URL to request. Defaults to `{DEFAULT_PROBE_URL}`")
176+
] = None
177+
timeout: Annotated[
178+
Optional[Union[int, str]],
179+
Field(
180+
description=(
181+
f"Maximum amount of time the HTTP request is allowed to take. Defaults to `{DEFAULT_PROBE_TIMEOUT}s`"
182+
)
183+
),
184+
] = None
185+
interval: Annotated[
186+
Optional[Union[int, str]],
187+
Field(
188+
description=(
189+
"Minimum amount of time between the end of one probe execution"
190+
f" and the start of the next. Defaults to `{DEFAULT_PROBE_INTERVAL}s`"
191+
)
192+
),
193+
] = None
194+
ready_after: Annotated[
195+
Optional[int],
196+
Field(
197+
ge=1,
198+
description=(
199+
"The number of consecutive successful probe executions required for the replica"
200+
" to be considered ready. Used during rolling deployments."
201+
f" Defaults to `{DEFAULT_PROBE_READY_AFTER}`"
202+
),
203+
),
204+
] = None
205+
206+
class Config:
207+
frozen = True
208+
209+
@validator("timeout")
210+
def parse_timeout(cls, v: Optional[Union[int, str]]) -> Optional[int]:
211+
if v is None:
212+
return v
213+
parsed = parse_duration(v)
214+
if parsed < MIN_PROBE_TIMEOUT:
215+
raise ValueError(f"Probe timeout cannot be shorter than {MIN_PROBE_TIMEOUT}s")
216+
return parsed
217+
218+
@validator("interval")
219+
def parse_interval(cls, v: Optional[Union[int, str]]) -> Optional[int]:
220+
if v is None:
221+
return v
222+
parsed = parse_duration(v)
223+
if parsed < MIN_PROBE_INTERVAL:
224+
raise ValueError(f"Probe interval cannot be shorter than {MIN_PROBE_INTERVAL}s")
225+
return parsed
226+
227+
@validator("url")
228+
def validate_url(cls, v: Optional[str]) -> Optional[str]:
229+
if v is None:
230+
return v
231+
if not v.startswith("/"):
232+
raise ValueError("Must start with `/`")
233+
if len(v) > MAX_PROBE_URL_LEN:
234+
raise ValueError(f"Cannot be longer than {MAX_PROBE_URL_LEN} characters")
235+
if not v.isprintable():
236+
raise ValueError("Cannot contain non-printable characters")
237+
return v
238+
239+
165240
class BaseRunConfiguration(CoreModel):
166241
type: Literal["none"]
167242
name: Annotated[
@@ -448,6 +523,10 @@ class ServiceConfigurationParams(CoreModel):
448523
Field(description="The auto-scaling rules. Required if `replicas` is set to a range"),
449524
] = None
450525
rate_limits: Annotated[list[RateLimit], Field(description="Rate limiting rules")] = []
526+
probes: Annotated[
527+
list[ProbeConfig],
528+
Field(description="List of probes used to determine job health"),
529+
] = []
451530

452531
@validator("port")
453532
def convert_port(cls, v) -> PortMapping:
@@ -511,6 +590,16 @@ def validate_rate_limits(cls, v: list[RateLimit]) -> list[RateLimit]:
511590
)
512591
return v
513592

593+
@validator("probes")
594+
def validate_probes(cls, v: list[ProbeConfig]) -> list[ProbeConfig]:
595+
if len(v) != len(set(v)):
596+
# Using a custom validator instead of Field(unique_items=True) to avoid Pydantic bug:
597+
# https://github.com/pydantic/pydantic/issues/3765
598+
# Because of the bug, our gen_schema_reference.py fails to determine the type of
599+
# ServiceConfiguration.probes and insert the correct hyperlink.
600+
raise ValueError("Probes must be unique")
601+
return v
602+
514603

515604
class ServiceConfiguration(
516605
ProfileParams, BaseRunConfigurationWithCommands, ServiceConfigurationParams

src/dstack/_internal/core/models/runs.py

Lines changed: 15 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
from datetime import datetime, timedelta
22
from enum import Enum
3-
from typing import Any, Dict, List, Optional, Type
3+
from typing import Any, Dict, List, Literal, Optional, Type
44

55
from pydantic import UUID4, Field, root_validator
66
from typing_extensions import Annotated
@@ -223,6 +223,14 @@ class JobSSHKey(CoreModel):
223223
public: str
224224

225225

226+
class ProbeSpec(CoreModel):
227+
type: Literal["http"] # expect other probe types in the future, namely `exec`
228+
url: str
229+
timeout: int
230+
interval: int
231+
ready_after: int
232+
233+
226234
class JobSpec(CoreModel):
227235
replica_num: int = 0 # default value for backward compatibility
228236
job_num: int
@@ -256,6 +264,7 @@ class JobSpec(CoreModel):
256264
file_archives: list[FileArchiveMapping] = []
257265
# None for non-services and pre-0.19.19 services. See `get_service_port`
258266
service_port: Optional[int] = None
267+
probes: list[ProbeSpec] = []
259268

260269

261270
class JobProvisioningData(CoreModel):
@@ -325,6 +334,10 @@ class ClusterInfo(CoreModel):
325334
gpus_per_job: int
326335

327336

337+
class Probe(CoreModel):
338+
success_streak: int
339+
340+
328341
class JobSubmission(CoreModel):
329342
id: UUID4
330343
submission_num: int
@@ -341,6 +354,7 @@ class JobSubmission(CoreModel):
341354
job_provisioning_data: Optional[JobProvisioningData]
342355
job_runtime_data: Optional[JobRuntimeData]
343356
error: Optional[str] = None
357+
probes: list[Probe] = []
344358

345359
@property
346360
def age(self) -> timedelta:

src/dstack/_internal/core/services/ssh/tunnel.py

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -236,6 +236,13 @@ def __enter__(self):
236236
def __exit__(self, exc_type, exc_val, exc_tb):
237237
self.close()
238238

239+
async def __aenter__(self):
240+
await self.aopen()
241+
return self
242+
243+
async def __aexit__(self, exc_type, exc_val, exc_tb):
244+
await self.aclose()
245+
239246
def _get_proxy_command(self) -> Optional[str]:
240247
proxy_command: Optional[str] = None
241248
for params, identity_path in self.ssh_proxies:

0 commit comments

Comments
 (0)