Commit 50a296c

Update Readme
1 parent a4366d5 commit 50a296c

2 files changed

Lines changed: 100 additions & 2 deletions

README.md

Lines changed: 78 additions & 1 deletion
@@ -43,24 +43,45 @@ start_delay = 10
 heartbeat_delay = 60
 heartbeat_interval = 20
 cmd = /usr/bin/python test_child.py 1 crash
+# Restart policy with exponential backoff (optional)
+# max_retries = 5
+# base_delay = 1
 
 [app:Bot]
 start_delay = 20
 heartbeat_delay = 90
 heartbeat_interval = 30
 cmd = /usr/bin/python test_child.py 2 noheartbeat
+# Restart policy with exponential backoff (optional)
+# max_retries = 5
+# base_delay = 1
 
 [app:Publisher]
 start_delay = 35
 heartbeat_delay = 70
 heartbeat_interval = 16
 cmd = /usr/bin/python test_child.py 3 crash
+# Restart policy with exponential backoff (optional)
+# max_retries = 5
+# base_delay = 1
 
 [app:Alert]
 start_delay = 35
 heartbeat_delay = 130
 heartbeat_interval = 13
 cmd = /usr/bin/python test_child.py 4 noheartbeat
+# Restart policy with exponential backoff (optional)
+# max_retries = 5
+# base_delay = 1
+
+[app:BackoffTest]
+start_delay = 5
+heartbeat_delay = 10
+heartbeat_interval = 120
+cmd = /usr/bin/python test_child.py 5 crash
+# Restart policy with exponential backoff
+max_retries = 3
+base_delay = 2
 ```
 
 ### Fields
@@ -77,6 +98,63 @@ cmd = /usr/bin/python test_child.py 4 noheartbeat
 - `heartbeat_delay` : Time in seconds to wait before expecting a heartbeat from the application.
 - `heartbeat_interval` : Maximum time period in seconds between heartbeats (`0`:disables heartbeat checks).
 - `cmd` : Command to start the application.
+- `max_retries` : Optional. Maximum number of retry attempts for restart policy. Set to `0` to disable backoff (current behavior - immediate restart). Default: `0`.
+- `base_delay` : Optional. Base delay in seconds for exponential backoff calculation. Only used when `max_retries > 0`. Default: `1`.
+
+### Restart Policy with Exponential Backoff
+
+The Process Watchdog supports a configurable restart policy with exponential backoff to prevent rapid restart loops when a process consistently fails to start.
+
+#### Configuration
+
+Add `max_retries` and `base_delay` to your app configuration:
+
+```ini
+[app:MyApp]
+start_delay = 10
+heartbeat_delay = 60
+heartbeat_interval = 20
+cmd = /usr/bin/python myapp.py
+max_retries = 5
+base_delay = 1
+```
+
+For a practical test example, see the `BackoffTest` app in `config.ini`:
+
+```ini
+[app:BackoffTest]
+start_delay = 5
+heartbeat_delay = 10
+heartbeat_interval = 60
+cmd = /usr/bin/python test_child.py 5 crash
+max_retries = 3
+base_delay = 2
+```
+
+#### How It Works
+
+- **Backoff Disabled (default)**: When `max_retries` is `0` or not specified, the watchdog immediately restarts failed processes (current behavior).
+- **Backoff Enabled**: When `max_retries > 0`, the watchdog implements exponential backoff:
+  - On failure, the retry count is incremented
+  - The delay before the next restart is calculated as: `delay = base_delay * 2^(retry_count - 1)`
+  - Maximum delay is capped at 3600 seconds (1 hour)
+  - When `retry_count` reaches `max_retries`, permanent failure is logged and retries stop
+  - On successful start, retry count is reset to `0`
+
+#### Example Backoff Delays
+
+With `base_delay = 1`:
+- 1st retry: 1 second delay (2^0)
+- 2nd retry: 2 seconds delay (2^1)
+- 3rd retry: 4 seconds delay (2^2)
+- 4th retry: 8 seconds delay (2^3)
+- 5th retry: 16 seconds delay (2^4), capped at 3600s if higher
+
+#### Behavior
+
+- **Retry Count**: Incremented on process failure, reset to `0` on successful start
+- **Failure Logging**: When max retries reached, logs permanent failure with retry count
+- **Backoff Logging**: During backoff wait, logs remaining delay time
 
 ## Heartbeat Message
 For heartbeat sending processes : A heartbeat message is a UDP packet with the process ID (`PID`) prefixed by `p` (e.g., `p12345` for PID `12345`). It is sent periodically by every managed process to a specified UDP port.
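
Review note: the heartbeat format quoted in the context line above (`p` followed by the PID, sent as a UDP packet) can be illustrated with a minimal Python sketch. The function names `build_heartbeat` and `send_heartbeat` are hypothetical, and the host and port are placeholders for whatever UDP port the watchdog is configured to listen on, which this diff does not show.

```python
import os
import socket
from typing import Optional


def build_heartbeat(pid: int) -> bytes:
    """Return the heartbeat payload: the PID prefixed by 'p', e.g. b'p12345'."""
    return f"p{pid}".encode("ascii")


def send_heartbeat(host: str, port: int, pid: Optional[int] = None) -> bytes:
    """Send one heartbeat datagram over UDP and return the payload sent.

    host and port are assumptions supplied by the caller; when pid is
    omitted, the PID of the current process is used.
    """
    payload = build_heartbeat(os.getpid() if pid is None else pid)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
    return payload
```

A managed process would call something like `send_heartbeat("127.0.0.1", port)` more often than its `heartbeat_interval`, since that field is the maximum allowed gap between heartbeats.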
@@ -414,7 +492,6 @@ Or just `./run.sh &` which is recommended.
 - Add periodic server health reporting
 - Add IPC and TCP support
 - Add json support
-- PID reuse detection (validate process start time to prevent false positives when PIDs are recycled)
 - Configurable timeouts (resource sampling interval, stats write interval, process termination timeout)
 
 ## :snowman: Author
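
Review note: the restart policy documented in the README.md changes above can be checked against a small Python sketch of the stated formula, `delay = base_delay * 2^(retry_count - 1)`, with the 3600-second cap. This is not the watchdog's implementation; the names `backoff_delay` and `backoff_schedule` are hypothetical. It also assumes that exactly `max_retries` retries are attempted, which the README text leaves somewhat open.

```python
from typing import List

MAX_DELAY_SECONDS = 3600  # cap stated in the README (1 hour)


def backoff_delay(base_delay: int, retry_count: int) -> int:
    """Delay in seconds before the given retry, per the README formula.

    retry_count is 1 for the first retry, so the first delay equals
    base_delay. The result is capped at MAX_DELAY_SECONDS.
    """
    if retry_count < 1:
        raise ValueError("retry_count starts at 1 for the first retry")
    return min(base_delay * 2 ** (retry_count - 1), MAX_DELAY_SECONDS)


def backoff_schedule(base_delay: int, max_retries: int) -> List[int]:
    """Delays for retries 1..max_retries.

    Returns an empty list when max_retries is 0, which the README
    describes as backoff disabled (immediate restart).
    """
    return [backoff_delay(base_delay, n) for n in range(1, max_retries + 1)]
```

Under these assumptions, `backoff_schedule(1, 5)` reproduces the README's example delays of 1, 2, 4, 8 and 16 seconds, and the `BackoffTest` settings (`max_retries = 3`, `base_delay = 2`) give 2, 4 and 8 seconds. With `base_delay = 1` the cap first applies at the 13th retry, where 2^12 = 4096 exceeds 3600.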

config.ini

Lines changed: 22 additions & 1 deletion
@@ -7,21 +7,42 @@ start_delay = 10
 heartbeat_delay = 60
 heartbeat_interval = 20
 cmd = /usr/bin/python test_child.py 1 crash
+# Restart policy with exponential backoff (optional)
+# max_retries = 5
+# base_delay = 1
 
 [app:Bot]
 start_delay = 20
 heartbeat_delay = 90
 heartbeat_interval = 30
 cmd = /usr/bin/python test_child.py 2 noheartbeat
+# Restart policy with exponential backoff (optional)
+# max_retries = 5
+# base_delay = 1
 
 [app:Publisher]
 start_delay = 35
 heartbeat_delay = 70
 heartbeat_interval = 16
 cmd = /usr/bin/python test_child.py 3 crash
+# Restart policy with exponential backoff (optional)
+# max_retries = 5
+# base_delay = 1
 
 [app:Alert]
 start_delay = 35
 heartbeat_delay = 130
 heartbeat_interval = 13
-cmd = /usr/bin/python test_child.py 4 noheartbeat
+cmd = /usr/bin/python test_child.py 4 noheartbeat
+# Restart policy with exponential backoff (optional)
+# max_retries = 5
+# base_delay = 1
+
+[app:BackoffTest]
+start_delay = 5
+heartbeat_delay = 10
+heartbeat_interval = 60
+cmd = /usr/bin/python test_child.py 5 crash
+# Restart policy with exponential backoff
+max_retries = 3
+base_delay = 2
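
Review note: to see how the new optional keys interact with the defaults documented in the README (`max_retries` defaults to `0`, `base_delay` to `1`), here is a minimal Python sketch that reads an excerpt of the config above with the standard `configparser` module. It is only an illustration, not the watchdog's own parser; `SAMPLE_CONFIG` and `load_restart_policies` are hypothetical names.

```python
import configparser

# Excerpt of config.ini after this commit: Alert keeps the new keys
# commented out, BackoffTest sets them explicitly.
SAMPLE_CONFIG = """
[app:Alert]
start_delay = 35
heartbeat_delay = 130
heartbeat_interval = 13
cmd = /usr/bin/python test_child.py 4 noheartbeat
# max_retries = 5
# base_delay = 1

[app:BackoffTest]
start_delay = 5
heartbeat_delay = 10
heartbeat_interval = 60
cmd = /usr/bin/python test_child.py 5 crash
max_retries = 3
base_delay = 2
"""


def load_restart_policies(text: str) -> dict:
    """Map each app name to its restart policy, applying documented defaults."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    policies = {}
    for section in parser.sections():
        if not section.startswith("app:"):
            continue  # skip any non-app sections
        name = section[len("app:"):]
        policies[name] = {
            # Missing or commented-out keys fall back to the README defaults.
            "max_retries": parser.getint(section, "max_retries", fallback=0),
            "base_delay": parser.getint(section, "base_delay", fallback=1),
        }
    return policies
```

With this excerpt, `Alert` resolves to `max_retries = 0` (backoff disabled, immediate restart) because its keys are commented out, while `BackoffTest` resolves to `max_retries = 3` and `base_delay = 2`.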
