docs/saga/recovery.md
19 additions & 6 deletions
@@ -104,23 +104,34 @@ Each saga in storage has a **recovery_attempts** counter. It is used to:
**Automatic increment:** When `recover_saga()` fails (exception during resume), the storage's `increment_recovery_attempts(saga_id, new_status=SagaStatus.FAILED)` is called automatically. Callers do **not** need to call `increment_recovery_attempts` themselves.
+
**Explicit set:** Use `storage.set_recovery_attempts(saga_id, attempts)` to set the counter to a specific value, e.g. `0` after one of the steps has been recovered successfully, or the maximum value (the `max_recovery_attempts` threshold used by the recovery job) so that the saga is excluded from further recovery without changing its status.
**Getting sagas for recovery:** Use `storage.get_sagas_for_recovery()` instead of a custom query:
```python
+
# All saga types (default)
ids = await storage.get_sagas_for_recovery(
    limit=50,
    max_recovery_attempts=5,  # Only sagas with recovery_attempts < 5
    stale_after_seconds=120,  # Only sagas not updated in last 2 minutes (avoids picking active sagas)
)
+
+
# Only sagas of a specific type (e.g. one recovery job per saga name)
+
ids = await storage.get_sagas_for_recovery(
+
    limit=50,
+
    max_recovery_attempts=5,
+
    saga_name="OrderSaga",
+
)
```
| Parameter | Description |
|-----------|-------------|
| `limit` | Maximum number of saga IDs to return |
| `max_recovery_attempts` | Only include sagas with `recovery_attempts` strictly less than this value (default: 5) |
| `stale_after_seconds` | If set, only include sagas whose `updated_at` is older than (now − this value). Use to avoid picking sagas currently being executed. `None` = no filter |
+
| `saga_name` | If set, only include sagas with this name (e.g. handler/type name). `None` (default) = return all saga types |
-
Returns saga IDs in status RUNNING, COMPENSATING, or FAILED, ordered by `updated_at` ascending (oldest first).
+
Returns saga IDs in status RUNNING or COMPENSATING, ordered by `updated_at` ascending (oldest first).
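
The selection rules in the table can be sketched as a plain in-memory filter. This is only an illustration of the documented semantics, not the storage implementation: `SagaRow` and `select_for_recovery` are hypothetical names, and statuses are modelled as plain strings rather than `SagaStatus` members.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional

# Statuses eligible for recovery according to the updated docs.
RECOVERABLE = {"RUNNING", "COMPENSATING"}


@dataclass
class SagaRow:
    """Hypothetical record holding only the fields the filter needs."""
    saga_id: str
    name: str
    status: str
    recovery_attempts: int
    updated_at: datetime


def select_for_recovery(
    rows: List[SagaRow],
    *,
    limit: int,
    max_recovery_attempts: int = 5,
    stale_after_seconds: Optional[int] = None,
    saga_name: Optional[str] = None,
    now: Optional[datetime] = None,
) -> List[str]:
    """Apply the documented filters and return saga IDs, oldest first."""
    now = now or datetime.now(timezone.utc)
    cutoff = None
    if stale_after_seconds is not None:
        cutoff = now - timedelta(seconds=stale_after_seconds)
    picked = [
        r for r in rows
        if r.status in RECOVERABLE
        and r.recovery_attempts < max_recovery_attempts  # strictly less than
        and (cutoff is None or r.updated_at < cutoff)     # stale enough
        and (saga_name is None or r.name == saga_name)    # optional type filter
    ]
    picked.sort(key=lambda r: r.updated_at)  # updated_at ascending
    return [r.saga_id for r in picked[:limit]]
```

Passing `now` explicitly is a choice made for this sketch so the staleness cutoff can be tested deterministically.
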
## Strict Backward Recovery
@@ -132,18 +143,19 @@ This prevents "zombie states" where compensation actions conflict with new execu
### Background Recovery Job
-
Use `storage.get_sagas_for_recovery()` to get saga IDs that need recovery. On recovery failure, `recover_saga()` calls `increment_recovery_attempts` internally — no extra code needed.
+
Use `storage.get_sagas_for_recovery()` to get saga IDs that need recovery. On recovery failure, `recover_saga()` calls `increment_recovery_attempts` internally — no extra code needed. You can pass `saga_name` to run separate recovery jobs per saga type.
stale_after_seconds=120, # Avoid sagas currently being executed
+
saga_name=saga_name, # None = all types; or e.g. "OrderSaga" for one type
)
for saga_id in ids:
try:
@@ -182,6 +194,7 @@ scheduler.start()
1. **Run recovery periodically**: Background job using `get_sagas_for_recovery()` to scan for incomplete sagas
2. **Use `max_recovery_attempts`**: Exclude sagas that fail recovery too many times (e.g. 5) to avoid infinite retries
3. **Use `stale_after_seconds`**: Avoid picking sagas that are currently being executed by another worker
-
4. **Handle failures**: Log errors and send alerts; `increment_recovery_attempts` is called automatically by `recover_saga`
-
5. **Monitor metrics**: Track recovery rate, duration, failures, and sagas exceeding max attempts
-
6. **Use persistent storage**: Memory storage loses data on restart
+
4. **Use `saga_name` for per-type recovery**: When running separate recovery jobs per saga type, pass `saga_name` so each job only processes its own sagas
+
5. **Handle failures**: Log errors and send alerts; `increment_recovery_attempts` is called automatically by `recover_saga`
+
6. **Monitor metrics**: Track recovery rate, duration, failures, and sagas exceeding max attempts
+
7. **Use persistent storage**: Memory storage loses data on restart
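
The practices above can be combined into one recovery pass per saga type, roughly as in the sketch below. It assumes only the keyword arguments of `get_sagas_for_recovery()` shown in this diff. `RecoveryStorage`, `run_recovery_pass`, and the `recover_one` callable are hypothetical names introduced for illustration; `recover_one` stands in for whatever wrapper calls `recover_saga()`, whose full signature is not shown here.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Optional, Protocol, Sequence, Tuple

logger = logging.getLogger("saga.recovery")


class RecoveryStorage(Protocol):
    """Hypothetical protocol covering only the method this sketch needs."""

    async def get_sagas_for_recovery(
        self,
        limit: int,
        max_recovery_attempts: int = 5,
        stale_after_seconds: Optional[int] = None,
        saga_name: Optional[str] = None,
    ) -> Sequence[str]: ...


async def run_recovery_pass(
    storage: RecoveryStorage,
    recover_one: Callable[[str], Awaitable[None]],
    *,
    saga_name: Optional[str] = None,
    limit: int = 50,
    max_recovery_attempts: int = 5,
    stale_after_seconds: Optional[int] = 120,
) -> Tuple[int, int]:
    """Run one pass and return (recovered, failed) counts for metrics."""
    ids = await storage.get_sagas_for_recovery(
        limit=limit,
        max_recovery_attempts=max_recovery_attempts,
        stale_after_seconds=stale_after_seconds,
        saga_name=saga_name,  # None = all saga types
    )
    recovered = failed = 0
    for saga_id in ids:
        try:
            await recover_one(saga_id)
            recovered += 1
        except Exception:
            # The attempt counter is incremented by recover_saga itself,
            # so the job only logs the failure and counts it.
            logger.exception("Recovery failed for saga %s", saga_id)
            failed += 1
    return recovered, failed
```

A scheduler could call `run_recovery_pass(storage, recover_one, saga_name="OrderSaga")` once per saga type and export the returned counts as metrics.
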
- **get_sagas_for_recovery**: Returns saga IDs that need recovery (RUNNING, COMPENSATING, FAILED) with `recovery_attempts` < `max_recovery_attempts`, optionally filtered by staleness. Used by recovery jobs.
+
- **get_sagas_for_recovery**: Returns saga IDs that need recovery (RUNNING, COMPENSATING) with `recovery_attempts` < `max_recovery_attempts`, optionally filtered by staleness and by saga name. When `saga_name` is `None` (default), returns all saga types; when set, only sagas with that name. Used by recovery jobs.
- **increment_recovery_attempts**: Called automatically by `recover_saga()` on recovery failure; increments `recovery_attempts` and optionally updates status (e.g. to FAILED).
+
- **set_recovery_attempts**: Sets the recovery attempt counter to an explicit value. Use to reset after successfully recovering a step (e.g. set to `0`) or to set to the maximum so the saga is excluded from further recovery (e.g. mark as permanently failed without changing status).
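
How the counter interacts with the `max_recovery_attempts` filter can be illustrated with a small in-memory stand-in. `RecoveryCounter` and `is_eligible` are hypothetical and exist only for this sketch; the methods are synchronous for brevity, whereas the storage calls in the examples are awaited, and statuses are plain strings.

```python
from typing import Dict, Optional


class RecoveryCounter:
    """Hypothetical in-memory model of the recovery_attempts bookkeeping."""

    def __init__(self) -> None:
        self.attempts: Dict[str, int] = {}
        self.status: Dict[str, str] = {}

    def increment_recovery_attempts(
        self, saga_id: str, new_status: Optional[str] = None
    ) -> None:
        # Mirrors the documented behaviour: bump the counter and
        # optionally update the status (e.g. to FAILED).
        self.attempts[saga_id] = self.attempts.get(saga_id, 0) + 1
        if new_status is not None:
            self.status[saga_id] = new_status

    def set_recovery_attempts(self, saga_id: str, attempts: int) -> None:
        # Explicit set: touches only the counter, never the status.
        self.attempts[saga_id] = attempts

    def is_eligible(self, saga_id: str, max_recovery_attempts: int = 5) -> bool:
        # Same strict comparison as the max_recovery_attempts filter.
        return self.attempts.get(saga_id, 0) < max_recovery_attempts
```

Setting the counter to `0` makes the saga eligible again under the attempts filter, while setting it to the threshold value excludes it without touching the stored status.
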
## Memory Storage
@@ -83,7 +85,7 @@ Database-backed implementation for production. It uses a session factory to mana
- `version` (INTEGER) - Optimistic locking version (default: 1)
-
- `recovery_attempts` (INTEGER) - Number of failed recovery attempts (default: 0); used by `get_sagas_for_recovery` and `increment_recovery_attempts`
+
- `recovery_attempts` (INTEGER) - Number of failed recovery attempts (default: 0); used by `get_sagas_for_recovery`, `increment_recovery_attempts`, and `set_recovery_attempts`