Commit 4870df9

Review and TODO fixes
- Docs
- Compatibility
- More tests
- Optimized DB queries
- Limit max probes and probe timeout
- Defaults out of `ProbeConfig`
- Stricter validation
- Migration rebase
- Locking fix
1 parent 2c05b93 commit 4870df9

22 files changed

Lines changed: 497 additions & 91 deletions

docs/docs/concepts/services.md

Lines changed: 48 additions & 3 deletions
@@ -187,6 +187,53 @@ port: 8000
 
 </div>
 
+### Probes
+
+Configure one or more HTTP probes to periodically check the health of the service.
+
+<div editor-title="service.dstack.yml">
+
+```yaml
+type: service
+name: my-service
+port: 80
+image: my-app:latest
+probes:
+  - type: http
+    url: /health
+    interval: 15s
+```
+
+</div>
+
+You can track probe statuses in `dstack ps --verbose`.
+
+<div class="termy">
+
+```shell
+$ dstack ps --verbose
+
+ NAME                            BACKEND          STATUS   PROBES  SUBMITTED
+ my-service deployment=1                          running          11 mins ago
+   replica=0 job=0 deployment=0  aws (us-west-2)  running  ✓       11 mins ago
+   replica=1 job=0 deployment=1  aws (us-west-2)  running  ×       1 min ago
+```
+
+</div>
+
+??? info "Probe statuses"
+    The following symbols are used for probe statuses:
+
+    - `×` &mdash; the last probe execution failed.
+    - `~` &mdash; the last probe execution succeeded, but the [`ready_after`](../reference/dstack.yml/service.md#ready_after) threshold is not yet reached.
+    - `✓` &mdash; the last `ready_after` probe executions succeeded.
+
+    If multiple probes are configured for the service, their statuses are displayed in the order in which the probes appear in the configuration.
+
+Probes are executed for each service replica while the replica is `running`. Probe statuses do not affect how `dstack` handles replicas, except during [rolling deployments](#rolling-deployment).
+
+See the [reference](../reference/dstack.yml/service.md#probes) for more probe configuration options.
+
 ### Path prefix { #path-prefix }
 
 If your `dstack` project doesn't have a [gateway](gateways.md), services are hosted with the
@@ -758,7 +805,7 @@ Update the run? [y/n]:
 
 </div>
 
-If approved, `dstack` gradually updates the service replicas. To update a replica, `dstack` starts a new replica, waits for it to become `running`, then terminates the old replica. This process is repeated for each replica, one at a time.
+If approved, `dstack` gradually updates the service replicas. To update a replica, `dstack` starts a new replica, waits for it to become `running` and for all of its [probes](#probes) to pass, then terminates the old replica. This process is repeated for each replica, one at a time.
 
 You can track the progress of rolling deployment in both `dstack apply` or `dstack ps`.
 Older replicas have lower `deployment` numbers; newer ones have higher.
@@ -792,8 +839,6 @@ The rolling deployment stops when all replicas are updated or when a new deploym
 To trigger a rolling deployment when no properties have changed (e.g., after updating [secrets](secrets.md) or to restart all replicas),
 make a minor config change, such as adding a dummy [environment variable](#environment-variables).
 
-<!-- TODO: probes -->
-
 --8<-- "docs/concepts/snippets/manage-runs.ext"
 
 !!! info "What's next?"
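
Reviewer note: the rolling-deployment wording changed above says a new replica must be `running` and have all of its probes passing before the old replica is terminated. A minimal sketch of that gate follows, assuming "passing" means the probe's success streak has reached `ready_after`. `ProbeState` and `replica_is_ready` are hypothetical names used for illustration only and are not taken from the dstack code base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeState:
    """Hypothetical view of one probe on one replica."""

    success_streak: int  # consecutive successful executions so far
    ready_after: int  # threshold from the probe configuration


def replica_is_ready(status: str, probes: list[ProbeState]) -> bool:
    """Return True when the replica may replace an old one.

    Assumption: a replica with no probes configured is ready as soon
    as it is running.
    """
    if status != "running":
        return False
    return all(p.success_streak >= p.ready_after for p in probes)
```

With this reading, a replica whose only probe has `success_streak=1` and `ready_after=3` would keep the old replica alive until two more probe executions succeed.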

docs/docs/reference/dstack.yml/service.md

Lines changed: 10 additions & 0 deletions
@@ -106,6 +106,16 @@ The `service` configuration type allows running [services](../../concepts/servic
       type:
         required: true
 
+### `probes`
+
+#### `probes[n]`
+
+#SCHEMA# dstack._internal.core.models.configurations.ProbeConfig
+    overrides:
+      show_root_heading: false
+      type:
+        required: true
+
 ### `retry`
 
 #SCHEMA# dstack._internal.core.models.profiles.ProfileRetry

docs/docs/reference/environment-variables.md

Lines changed: 2 additions & 0 deletions
@@ -124,6 +124,8 @@ For more details on the options below, refer to the [server deployment](../guide
 - `DSTACK_DB_MAX_OVERFLOW`{ #DSTACK_DB_MAX_OVERFLOW } - The client DB connections pool allowed overflow. Defaults to `20`.
 - `DSTACK_SERVER_BACKGROUND_PROCESSING_FACTOR`{ #DSTACK_SERVER_BACKGROUND_PROCESSING_FACTOR } - The number of background jobs for processing server resources. Increase if you need to process more resources per server replica quickly. Defaults to `1`.
 - `DSTACK_SERVER_BACKGROUND_PROCESSING_DISABLED`{ #DSTACK_SERVER_BACKGROUND_PROCESSING_DISABLED } - Disables background processing if set to any value. Useful to run only web frontend and API server.
+- `DSTACK_SERVER_MAX_PROBES_PER_JOB`{ #DSTACK_SERVER_MAX_PROBES_PER_JOB } - Maximum number of probes allowed in a run configuration. Validated at apply time.
+- `DSTACK_SERVER_MAX_PROBE_TIMEOUT`{ #DSTACK_SERVER_MAX_PROBE_TIMEOUT } - Maximum allowed timeout for a probe. Validated at apply time.
 
 ??? info "Internal environment variables"
     The following environment variables are intended for development purposes:
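
Reviewer note: the two new variables are described as limits that are validated at apply time. The sketch below shows one way such a check could look. It is an illustration under stated assumptions, not the server's implementation: the helper names are hypothetical, the timeout limit is assumed to be in seconds, and an unset variable is treated here as "no limit" because the documentation above does not state defaults.

```python
import os
from typing import Mapping, Optional


def load_limit(environ: Mapping[str, str], name: str) -> Optional[int]:
    """Read an integer limit; None means the variable is unset (assumed: no limit)."""
    raw = environ.get(name)
    return int(raw) if raw is not None else None


def check_probe_limits(
    probe_timeouts: list[int], environ: Mapping[str, str] = os.environ
) -> None:
    """Raise ValueError if the probes exceed the configured server limits.

    `probe_timeouts` holds one timeout (assumed seconds) per configured probe.
    """
    max_probes = load_limit(environ, "DSTACK_SERVER_MAX_PROBES_PER_JOB")
    max_timeout = load_limit(environ, "DSTACK_SERVER_MAX_PROBE_TIMEOUT")
    if max_probes is not None and len(probe_timeouts) > max_probes:
        raise ValueError(f"Cannot configure more than {max_probes} probes")
    if max_timeout is not None:
        for timeout in probe_timeouts:
            if timeout > max_timeout:
                raise ValueError(f"Probe timeout cannot be longer than {max_timeout}s")
```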

src/dstack/_internal/cli/utils/run.py

Lines changed: 6 additions & 4 deletions
@@ -5,7 +5,7 @@
 from rich.table import Table
 
 from dstack._internal.cli.utils.common import NO_OFFERS_WARNING, add_row_from_dict, console
-from dstack._internal.core.models.configurations import DevEnvironmentConfiguration, ProbeConfig
+from dstack._internal.core.models.configurations import DevEnvironmentConfiguration
 from dstack._internal.core.models.instances import InstanceAvailability
 from dstack._internal.core.models.profiles import (
     DEFAULT_RUN_TERMINATION_IDLE_TIME,
@@ -14,6 +14,7 @@
 from dstack._internal.core.models.runs import (
     JobStatus,
     Probe,
+    ProbeSpec,
     RunPlan,
 )
 from dstack._internal.core.services.profiles import get_termination
@@ -241,13 +242,14 @@ def get_runs_table(
 
 
 def _format_job_probes(
-    probe_configs: list[ProbeConfig], probes: list[Probe], job_status: JobStatus
+    probe_specs: list[ProbeSpec], probes: list[Probe], job_status: JobStatus
 ) -> str:
     if not probes or job_status != JobStatus.RUNNING:
         return ""
     statuses = []
-    for probe_config, probe in zip(probe_configs, probes):
-        if probe.success_streak >= probe_config.ready_after:
+    for probe_spec, probe in zip(probe_specs, probes):
+        # NOTE: the symbols are documented in concepts/services.md, keep in sync.
+        if probe.success_streak >= probe_spec.ready_after:
             status = "[code]✓[/]"
         elif probe.success_streak > 0:
             status = "[warning]~[/]"
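
Reviewer note: the hunk above is cut off before the final branch. The following standalone sketch restates the symbol mapping as documented in `concepts/services.md`, without the Rich markup. The function names are hypothetical, and joining the per-probe symbols with an empty string is an assumption, since the part of `_format_job_probes` that combines them is not shown in this diff.

```python
def probe_symbol(success_streak: int, ready_after: int) -> str:
    """Map one probe's state to the symbol described in the docs."""
    if success_streak >= ready_after:
        return "✓"  # the last `ready_after` executions succeeded
    if success_streak > 0:
        return "~"  # succeeding, but the threshold is not reached yet
    return "×"  # the last execution failed


def format_probes(states: list[tuple[int, int]], running: bool) -> str:
    """Format (success_streak, ready_after) pairs in configuration order.

    Mirrors the early return above: nothing is shown for jobs that are
    not running or have no probes.
    """
    if not states or not running:
        return ""
    return "".join(probe_symbol(streak, threshold) for streak, threshold in states)
```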

src/dstack/_internal/core/compatibility/runs.py

Lines changed: 8 additions & 2 deletions
@@ -53,6 +53,8 @@ def get_apply_plan_excludes(plan: ApplyRunPlanInput) -> Optional[IncludeExcludeD
             job_submissions_excludes["exit_status"] = True
         if all(js.deployment_num == 0 for js in job_submissions):
             job_submissions_excludes["deployment_num"] = True
+        if all(not js.probes for js in job_submissions):
+            job_submissions_excludes["probes"] = True
         latest_job_submission = current_resource.latest_job_submission
         if latest_job_submission is not None:
             latest_job_submission_excludes: IncludeExcludeDictType = {}
@@ -69,6 +71,8 @@ def get_apply_plan_excludes(plan: ApplyRunPlanInput) -> Optional[IncludeExcludeD
                 latest_job_submission_excludes["exit_status"] = True
             if latest_job_submission.deployment_num == 0:
                 latest_job_submission_excludes["deployment_num"] = True
+            if not latest_job_submission.probes:
+                latest_job_submission_excludes["probes"] = True
     return {"plan": apply_plan_excludes}
 
 
@@ -120,7 +124,8 @@ def get_run_spec_excludes(run_spec: RunSpec) -> IncludeExcludeDictType:
         profile_excludes.add("startup_order")
     if configuration.stop_criteria is None:
         configuration_excludes["stop_criteria"] = True
-    # TODO: probes
+    if isinstance(configuration, ServiceConfiguration) and not configuration.probes:
+        configuration_excludes["probes"] = True
     if profile is not None and profile.stop_criteria is None:
         profile_excludes.add("stop_criteria")
     if not configuration.files:
@@ -155,7 +160,8 @@ def get_job_spec_excludes(job_specs: list[JobSpec]) -> IncludeExcludeDictType:
         spec_excludes["file_archives"] = True
     if all(s.service_port is None for s in job_specs):
         spec_excludes["service_port"] = True
-    # TODO: probes
+    if all(not s.probes for s in job_specs):
+        spec_excludes["probes"] = True
 
     return spec_excludes
 

src/dstack/_internal/core/models/configurations.py

Lines changed: 47 additions & 16 deletions
@@ -34,6 +34,11 @@
 DEFAULT_REPO_DIR = "/workflow"
 MIN_PROBE_TIMEOUT = 1
 MIN_PROBE_INTERVAL = 1
+DEFAULT_PROBE_URL = "/"
+DEFAULT_PROBE_TIMEOUT = 10
+DEFAULT_PROBE_INTERVAL = 15
+DEFAULT_PROBE_READY_AFTER = 1
+MAX_PROBE_URL_LEN = 2048
 
 
 class RunConfigurationType(str, Enum):
@@ -166,53 +171,69 @@ class RateLimit(CoreModel):
 
 class ProbeConfig(CoreModel):
     type: Literal["http"]  # expect other probe types in the future, namely `exec`
-    url: Annotated[str, Field(description="The URL to request")] = "/"
+    url: Annotated[
+        Optional[str], Field(description=f"The URL to request. Defaults to `{DEFAULT_PROBE_URL}`")
+    ] = None
     timeout: Annotated[
-        Union[int, str],
-        Field(description=("Maximum amount of time the HTTP request is allowed to take")),
-    ] = "10s"
+        Optional[Union[int, str]],
+        Field(
+            description=(
+                f"Maximum amount of time the HTTP request is allowed to take. Defaults to `{DEFAULT_PROBE_TIMEOUT}s`"
+            )
+        ),
+    ] = None
     interval: Annotated[
-        Union[int, str],
+        Optional[Union[int, str]],
         Field(
             description=(
                 "Minimum amount of time between the end of one probe execution"
-                " and the start of the next"
+                f" and the start of the next. Defaults to `{DEFAULT_PROBE_INTERVAL}s`"
             )
         ),
-    ] = "15s"
+    ] = None
     ready_after: Annotated[
-        int,
+        Optional[int],
         Field(
             ge=1,
             description=(
-                "The number of consecutive successful probe executions required for the job"
-                " to be considered ready. Used during rolling deployments"
+                "The number of consecutive successful probe executions required for the replica"
+                " to be considered ready. Used during rolling deployments."
+                f" Defaults to `{DEFAULT_PROBE_READY_AFTER}`"
             ),
         ),
-    ] = 1
+    ] = None
 
     class Config:
         frozen = True
 
     @validator("timeout")
-    def parse_timeout(cls, v: Union[int, str]) -> int:
+    def parse_timeout(cls, v: Optional[Union[int, str]]) -> Optional[int]:
+        if v is None:
+            return v
         parsed = parse_duration(v)
         if parsed < MIN_PROBE_TIMEOUT:
             raise ValueError(f"Probe timeout cannot be shorter than {MIN_PROBE_TIMEOUT}s")
         return parsed
 
     @validator("interval")
-    def parse_interval(cls, v: Union[int, str]) -> int:
+    def parse_interval(cls, v: Optional[Union[int, str]]) -> Optional[int]:
+        if v is None:
+            return v
         parsed = parse_duration(v)
         if parsed < MIN_PROBE_INTERVAL:
             raise ValueError(f"Probe interval cannot be shorter than {MIN_PROBE_INTERVAL}s")
         return parsed
 
     @validator("url")
-    def validate_url(cls, v: str) -> str:
-        # TODO: stricter constraints to avoid HTTPX URL parsing errors
+    def validate_url(cls, v: Optional[str]) -> Optional[str]:
+        if v is None:
+            return v
         if not v.startswith("/"):
             raise ValueError("Must start with `/`")
+        if len(v) > MAX_PROBE_URL_LEN:
+            raise ValueError(f"Cannot be longer than {MAX_PROBE_URL_LEN} characters")
+        if not v.isprintable():
+            raise ValueError("Cannot contain non-printable characters")
         return v
 
 
@@ -504,7 +525,7 @@ class ServiceConfigurationParams(CoreModel):
     rate_limits: Annotated[list[RateLimit], Field(description="Rate limiting rules")] = []
     probes: Annotated[
         list[ProbeConfig],
-        Field(unique_items=True, description="List of probes used to determine job health"),
+        Field(description="List of probes used to determine job health"),
     ] = []
 
     @validator("port")
@@ -569,6 +590,16 @@ def validate_rate_limits(cls, v: list[RateLimit]) -> list[RateLimit]:
             )
         return v
 
+    @validator("probes")
+    def validate_probes(cls, v: list[ProbeConfig]) -> list[ProbeConfig]:
+        if len(v) != len(set(v)):
+            # Using a custom validator instead of Field(unique_items=True) to avoid Pydantic bug:
+            # https://github.com/pydantic/pydantic/issues/3765
+            # Because of the bug, our gen_schema_reference.py fails to determine the type of
+            # ServiceConfiguration.probes and insert the correct hyperlink.
+            raise ValueError("Probes must be unique")
+        return v
+
 
 class ServiceConfiguration(
     ProfileParams, BaseRunConfigurationWithCommands, ServiceConfigurationParams
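
Reviewer note: the new `validate_probes` check compares `len(v)` with `len(set(v))`, which only works if probe objects are hashable. In the diff this appears to be what `frozen = True` on `ProbeConfig` provides. The sketch below reproduces the idea with a frozen dataclass from the standard library so it runs without Pydantic. `ProbeConfigSketch` is a hypothetical stand-in with a reduced set of fields.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)  # frozen + default eq=True makes instances hashable
class ProbeConfigSketch:
    type: str
    url: Optional[str] = None
    interval: Optional[int] = None


def validate_probes(probes: list[ProbeConfigSketch]) -> list[ProbeConfigSketch]:
    """Reject configurations that list the same probe more than once."""
    if len(probes) != len(set(probes)):
        raise ValueError("Probes must be unique")
    return probes
```

Two probes count as duplicates only when every field matches, so `/health` probes with different intervals would both be accepted.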

src/dstack/_internal/core/models/runs.py

Lines changed: 10 additions & 3 deletions
@@ -1,6 +1,6 @@
 from datetime import datetime, timedelta
 from enum import Enum
-from typing import Any, Dict, List, Optional, Type
+from typing import Any, Dict, List, Literal, Optional, Type
 
 from pydantic import UUID4, Field, root_validator
 from typing_extensions import Annotated
@@ -10,7 +10,6 @@
 from dstack._internal.core.models.configurations import (
     DEFAULT_REPO_DIR,
     AnyRunConfiguration,
-    ProbeConfig,
     RunConfiguration,
     ServiceConfiguration,
 )
@@ -224,6 +223,14 @@ class JobSSHKey(CoreModel):
     public: str
 
 
+class ProbeSpec(CoreModel):
+    type: Literal["http"]  # expect other probe types in the future, namely `exec`
+    url: str
+    timeout: int
+    interval: int
+    ready_after: int
+
+
 class JobSpec(CoreModel):
     replica_num: int = 0  # default value for backward compatibility
     job_num: int
@@ -257,7 +264,7 @@ class JobSpec(CoreModel):
     file_archives: list[FileArchiveMapping] = []
     # None for non-services and pre-0.19.19 services. See `get_service_port`
     service_port: Optional[int] = None
-    probes: list[ProbeConfig] = []
+    probes: list[ProbeSpec] = []
 
 
 class JobProvisioningData(CoreModel):
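
Reviewer note: with defaults moved out of `ProbeConfig`, every optional field there may be `None`, while `ProbeSpec` requires concrete values. The conversion between the two is not part of the files shown here, so the sketch below is only a guess at its shape. The `DEFAULT_PROBE_*` values are copied from the `configurations.py` hunk; `ProbeConfigSketch`, `ProbeSpecSketch`, and `to_spec` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

# Values copied from the configurations.py hunk in this commit.
DEFAULT_PROBE_URL = "/"
DEFAULT_PROBE_TIMEOUT = 10
DEFAULT_PROBE_INTERVAL = 15
DEFAULT_PROBE_READY_AFTER = 1


@dataclass(frozen=True)
class ProbeConfigSketch:
    """User-facing config: unset fields stay None."""

    url: Optional[str] = None
    timeout: Optional[int] = None
    interval: Optional[int] = None
    ready_after: Optional[int] = None


@dataclass(frozen=True)
class ProbeSpecSketch:
    """Resolved spec: every field has a concrete value."""

    url: str
    timeout: int
    interval: int
    ready_after: int


def _or_default(value, default):
    # Compare with None explicitly so an explicit user value is never replaced.
    return default if value is None else value


def to_spec(config: ProbeConfigSketch) -> ProbeSpecSketch:
    """Fill in defaults for every field the user left unset."""
    return ProbeSpecSketch(
        url=_or_default(config.url, DEFAULT_PROBE_URL),
        timeout=_or_default(config.timeout, DEFAULT_PROBE_TIMEOUT),
        interval=_or_default(config.interval, DEFAULT_PROBE_INTERVAL),
        ready_after=_or_default(config.ready_after, DEFAULT_PROBE_READY_AFTER),
    )
```

Keeping `None` in the config and resolving it later means a serialized configuration only contains what the user actually wrote, which fits the compatibility excludes elsewhere in this commit.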
