From 120c4141a7f95337921c0c53830b545d1ed6cdea Mon Sep 17 00:00:00 2001
From: Victor Skvortsov
Date: Mon, 5 May 2025 10:32:20 +0500
Subject: [PATCH] Clarify how retry works for tasks and services

---
 docs/docs/concepts/services.md | 22 ++++++++++++++++++++--
 docs/docs/concepts/tasks.md    |  5 ++++-
 2 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/docs/docs/concepts/services.md b/docs/docs/concepts/services.md
index 95b4fb79c5..4d13bc3e68 100644
--- a/docs/docs/concepts/services.md
+++ b/docs/docs/concepts/services.md
@@ -518,12 +518,30 @@ via the [`spot_policy`](../reference/dstack.yml/service.md#spot_policy) property

 ### Retry policy

-By default, if `dstack` can't find capacity, the task exits with an error, or the instance is interrupted,
-the run will fail.
+By default, if `dstack` can't find capacity, or the service exits with an error, or the instance is interrupted, the run will fail.

 If you'd like `dstack` to automatically retry, configure the
 [retry](../reference/dstack.yml/service.md#retry) property accordingly:

+
+
+```yaml
+type: service
+image: my-app:latest
+port: 80
+
+retry:
+  # Retry on specific events
+  on_events: [no-capacity, error, interruption]
+  # Retry for up to 1 hour
+  duration: 1h
+```
+
+
+If one replica of a multi-replica service fails with retry enabled,
+`dstack` will resubmit only the failed replica while keeping active replicas running.
+
 --8<-- "docs/concepts/snippets/manage-fleets.ext"

 --8<-- "docs/concepts/snippets/manage-runs.ext"
diff --git a/docs/docs/concepts/tasks.md b/docs/docs/concepts/tasks.md
index 05b9dbf340..cf3c6fdd50 100644
--- a/docs/docs/concepts/tasks.md
+++ b/docs/docs/concepts/tasks.md
@@ -387,7 +387,7 @@ via the [`spot_policy`](../reference/dstack.yml/task.md#spot_policy) property. I

 ### Retry policy

-By default, if `dstack` can't find capacity, the task exits with an error, or the instance is interrupted,
+By default, if `dstack` can't find capacity, or the task exits with an error, or the instance is interrupted,
 the run will fail.

 If you'd like `dstack` to automatically retry, configure the
@@ -416,6 +416,9 @@ retry:

+If one job of a multi-node task fails with retry enabled,
+`dstack` will stop all the jobs and resubmit the run.
+
 --8<-- "docs/concepts/snippets/manage-fleets.ext"

 --8<-- "docs/concepts/snippets/manage-runs.ext"
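
For reference, the two retry behaviours this patch documents could be exercised with configurations along the following lines. This is a sketch only: the `replicas` and `nodes` properties and the `python train.py` command do not appear in the patch and are assumptions made for illustration, while the `retry` block is copied from the example added to `services.md`.

```yaml
# Hypothetical multi-replica service: per the patch, only a failed
# replica would be resubmitted while the other replicas keep running.
type: service
image: my-app:latest
port: 80
# Assumed property, not shown in the patch
replicas: 2

retry:
  on_events: [no-capacity, error, interruption]
  duration: 1h
```

```yaml
# Hypothetical multi-node task: per the patch, a failure of any one job
# would stop all the jobs and resubmit the whole run.
type: task
# Assumed property, not shown in the patch
nodes: 2
commands:
  # Placeholder command for illustration
  - python train.py

retry:
  on_events: [no-capacity, error, interruption]
  duration: 1h
```

The same `retry` settings therefore have different effects in the two cases: a service recovers one replica at a time, whereas a multi-node task restarts as a whole.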