Commit 2916120

Clarify how retry works for tasks and services (#2600)
1 parent e420791 commit 2916120

2 files changed: +24 -3 lines changed

docs/docs/concepts/services.md

Lines changed: 20 additions & 2 deletions
@@ -518,12 +518,30 @@ via the [`spot_policy`](../reference/dstack.yml/service.md#spot_policy) property
 
 ### Retry policy
 
-By default, if `dstack` can't find capacity, the task exits with an error, or the instance is interrupted,
-the run will fail.
+By default, if `dstack` can't find capacity, or the service exits with an error, or the instance is interrupted, the run will fail.
 
 If you'd like `dstack` to automatically retry, configure the
 [retry](../reference/dstack.yml/service.md#retry) property accordingly:
 
+<div editor-title="service.dstack.yml">
+
+```yaml
+type: service
+image: my-app:latest
+port: 80
+
+retry:
+  # Retry on specific events
+  on_events: [no-capacity, error, interruption]
+  # Retry for up to 1 hour
+  duration: 1h
+```
+
+</div>
+
+If one replica of a multi-replica service fails with retry enabled,
+`dstack` will resubmit only the failed replica while keeping active replicas running.
+
 --8<-- "docs/concepts/snippets/manage-fleets.ext"
 
 --8<-- "docs/concepts/snippets/manage-runs.ext"
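
The replica behavior described in the added paragraph could be exercised with a configuration along the following lines. This is a minimal sketch and not part of the commit: the `replicas` value is an illustrative assumption, and the `image` and `port` values are simply carried over from the example in the diff.

```yaml
type: service
image: my-app:latest
port: 80

# Assumption for illustration: run two replicas of the service
replicas: 2

retry:
  # Same events as in the example added by this commit
  on_events: [no-capacity, error, interruption]
  # Retry for up to 1 hour
  duration: 1h
```

Under the behavior this commit documents, if one of the two replicas fails, only that replica would be resubmitted while the other keeps running.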

docs/docs/concepts/tasks.md

Lines changed: 4 additions & 1 deletion
@@ -387,7 +387,7 @@ via the [`spot_policy`](../reference/dstack.yml/task.md#spot_policy) property. I
 
 ### Retry policy
 
-By default, if `dstack` can't find capacity, the task exits with an error, or the instance is interrupted,
+By default, if `dstack` can't find capacity, or the task exits with an error, or the instance is interrupted,
 the run will fail.
 
 If you'd like `dstack` to automatically retry, configure the
@@ -416,6 +416,9 @@ retry:
 
 </div>
 
+If one job of a multi-node task fails with retry enabled,
+`dstack` will stop all the jobs and resubmit the run.
+
 --8<-- "docs/concepts/snippets/manage-fleets.ext"
 
 --8<-- "docs/concepts/snippets/manage-runs.ext"
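
For contrast with the service case, a multi-node task with retry enabled might be configured roughly as below. This is a hedged sketch and not part of the commit: the `nodes` value and the `commands` entry (including the hypothetical `train.py` script) are illustrative assumptions, and the `retry` block mirrors the one shown for services.

```yaml
type: task

# Assumption for illustration: a task distributed across two nodes
nodes: 2

commands:
  # Hypothetical training entry point
  - python train.py

retry:
  on_events: [no-capacity, error, interruption]
  # Retry for up to 1 hour
  duration: 1h
```

Under the behavior this commit documents, a failure of either job would cause `dstack` to stop both jobs and resubmit the whole run, rather than restarting only the failed job.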
