- `heartbeat_delay` : Time in seconds to wait before expecting a heartbeat from the application.
- `heartbeat_interval` : Maximum time in seconds between heartbeats (`0` disables heartbeat checks).
- `cmd` : Command to start the application.
- `max_retries` : Optional. Maximum number of restart attempts under the restart policy. Set to `0` to disable backoff and restart immediately (the existing behavior). Default: `0`.
- `base_delay` : Optional. Base delay in seconds for the exponential backoff calculation. Only used when `max_retries > 0`. Default: `1`.

### Restart Policy with Exponential Backoff

The Process Watchdog supports a configurable restart policy with exponential backoff to prevent rapid restart loops when a process consistently fails to start.

#### Configuration

Add `max_retries` and `base_delay` to your app configuration:

```ini
[app:MyApp]
start_delay = 10
heartbeat_delay = 60
heartbeat_interval = 20
cmd = /usr/bin/python myapp.py
max_retries = 5
base_delay = 1
```

For a practical test example, see the `BackoffTest` app in `config.ini`:

```ini
[app:BackoffTest]
start_delay = 5
heartbeat_delay = 10
heartbeat_interval = 60
cmd = /usr/bin/python test_child.py 5 crash
max_retries = 3
base_delay = 2
```
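
As an illustration of how the two optional keys and their documented defaults might be read, here is a minimal Python sketch using the standard `configparser` module. It is not the watchdog's own parser. The helper name `load_backoff_settings` and the `[app:Legacy]` section are hypothetical, and the sketch assumes both values are integers.

```python
import configparser

# Sample text: [app:MyApp] is copied from above; [app:Legacy] is a
# hypothetical section that omits both optional keys.
SAMPLE = """
[app:MyApp]
start_delay = 10
heartbeat_delay = 60
heartbeat_interval = 20
cmd = /usr/bin/python myapp.py
max_retries = 5
base_delay = 1

[app:Legacy]
start_delay = 10
heartbeat_delay = 60
heartbeat_interval = 20
cmd = /usr/bin/python legacy.py
"""


def load_backoff_settings(ini_text):
    """Return {app_name: (max_retries, base_delay)} using the documented defaults."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    settings = {}
    for section in parser.sections():
        if not section.startswith("app:"):
            continue  # ignore sections that do not describe an app
        name = section[len("app:"):]
        # Documented defaults: max_retries = 0 (backoff disabled), base_delay = 1.
        max_retries = parser.getint(section, "max_retries", fallback=0)
        base_delay = parser.getint(section, "base_delay", fallback=1)
        settings[name] = (max_retries, base_delay)
    return settings


settings = load_backoff_settings(SAMPLE)
print(settings)
```

An app that omits both keys ends up with `(0, 1)`, which corresponds to backoff being disabled.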

#### How It Works

- **Backoff Disabled (default)**: When `max_retries` is `0` or not specified, the watchdog restarts failed processes immediately (the existing behavior).
- **Backoff Enabled**: When `max_retries > 0`, the watchdog applies exponential backoff:
  - On failure, the retry count is incremented.
  - The delay before the next restart is calculated as `delay = base_delay * 2^(retry_count - 1)`.
  - The delay is capped at 3600 seconds (1 hour).
  - When `retry_count` reaches `max_retries`, a permanent failure is logged and retries stop.
  - On a successful start, the retry count is reset to `0`.

#### Example Backoff Delays

With `base_delay = 1`:
- 1st retry: 1 second delay (2^0)
- 2nd retry: 2 seconds delay (2^1)
- 3rd retry: 4 seconds delay (2^2)
- 4th retry: 8 seconds delay (2^3)
- 5th retry: 16 seconds delay (2^4)

Any computed delay above 3600 seconds is capped at 3600 seconds.
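
The delay formula and the one-hour cap can be checked with a short Python sketch. The function name `backoff_delay` and the constant `MAX_DELAY_SECONDS` are illustrative names, not identifiers taken from the watchdog source.

```python
MAX_DELAY_SECONDS = 3600  # cap described in this section (1 hour)


def backoff_delay(retry_count, base_delay=1):
    """Seconds to wait for a 1-based retry count: base_delay * 2^(retry_count - 1), capped."""
    if retry_count < 1:
        raise ValueError("retry_count must be >= 1")
    return min(base_delay * 2 ** (retry_count - 1), MAX_DELAY_SECONDS)


# Reproduce the list for base_delay = 1.
for n in range(1, 6):
    print(f"retry {n}: {backoff_delay(n)} s")
```

With `base_delay = 1` the cap first applies at the 13th retry, because 2^12 = 4096 exceeds 3600.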

#### Behavior

- **Retry Count**: Incremented on process failure and reset to `0` on a successful start.
- **Failure Logging**: When the maximum number of retries is reached, the watchdog logs a permanent failure together with the retry count.
- **Backoff Logging**: During a backoff wait, the watchdog logs the remaining delay.
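
The retry-count behavior described above can be sketched as a small state holder in Python. This is an illustration under stated assumptions, not the watchdog's implementation: the class name `RestartPolicy`, its method names, the log wording, and the choice to check the retry limit before computing a delay are all assumptions.

```python
from dataclasses import dataclass

MAX_DELAY_SECONDS = 3600  # cap from the restart policy description


@dataclass
class RestartPolicy:
    max_retries: int = 0   # 0 disables backoff
    base_delay: int = 1    # seconds
    retry_count: int = 0

    def on_failure(self):
        """Return seconds to wait before the next restart, or None to stop retrying."""
        if self.max_retries == 0:
            return 0  # backoff disabled: restart immediately
        self.retry_count += 1
        if self.retry_count >= self.max_retries:
            # Assumed ordering: the limit is checked before any further delay.
            print(f"permanent failure after {self.retry_count} retries")
            return None
        delay = min(self.base_delay * 2 ** (self.retry_count - 1), MAX_DELAY_SECONDS)
        print(f"backing off for {delay} s")
        return delay

    def on_successful_start(self):
        """A successful start resets the retry count."""
        self.retry_count = 0


policy = RestartPolicy(max_retries=3, base_delay=2)
delays = [policy.on_failure() for _ in range(3)]
```

Under these assumptions, the `BackoffTest` values (`max_retries = 3`, `base_delay = 2`) give two delayed restarts followed by a permanent failure on the third consecutive failure.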

## Heartbeat Message
For processes that send heartbeats: a heartbeat message is a UDP packet containing the process ID (`PID`) prefixed by `p` (e.g., `p12345` for PID `12345`). Every managed process sends it periodically to a specified UDP port.
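
A managed Python process could build and send such a message roughly as follows. This is a minimal sketch: the function names are hypothetical, the host and port are supplied by the caller rather than taken from this document, and encoding the payload as ASCII text is an assumption based on the `p12345` example.

```python
import os
import socket


def build_heartbeat(pid=None):
    """Return the heartbeat payload: 'p' followed by the PID (ASCII assumed)."""
    pid = os.getpid() if pid is None else pid
    return f"p{pid}".encode("ascii")


def send_heartbeat(host, port, pid=None):
    """Send one heartbeat datagram to the given host and UDP port."""
    message = build_heartbeat(pid)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(message, (host, port))
    return message
```

A process would call `send_heartbeat` in a loop, sleeping between calls for less than its configured `heartbeat_interval`.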
@@ -414,7 +492,6 @@ Or just `./run.sh &` which is recommended.
- Add periodic server health reporting
- Add IPC and TCP support
- Add JSON support
-
- PID reuse detection (validate process start time to prevent false positives when PIDs are recycled)