Commit d0ca9ac
fix(egress): retry mitmdump restart with backoff instead of giving up (#942)
* fix(egress): retry mitmdump restart with backoff instead of giving up
Previously, if Launch or WaitListenPort failed during a restart (e.g.
under node memory pressure that just OOM-killed mitmdump), the watchdog
goroutine would log "giving up" and return, leaving egress in a silent
dead state with no future restarts.
Replace the one-shot restart with restartWithBackoff: retry forever with
exponential backoff (1s -> 30s), kill half-launched processes, drain
stale exit signals on success, and respect ctx cancellation. Readiness
gate stays false across the retry window so k8s drains traffic.
* fix(egress): tag mitmdump exits with generation to avoid losing real death
The previous restart-with-backoff fix drained restartCh after a successful
relaunch to discard stale events from killed half-launched attempts. That
drain has a race: if the freshly-restarted mitmdump dies between
WaitListenPort returning and the drain executing, the real death event is
swallowed and the watcher returns to the outer loop with nothing pending,
re-entering the silent dead state the original fix was meant to prevent.
Tag every Launch with a monotonic generation captured in its OnExit
closure. The watcher compares the event's generation against the currently
live generation (set atomically with setRunning) and ignores stale events
without draining. Real deaths of the current mitmdump always match the
live generation and trigger a restart.
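
The generation check can be sketched as follows. This is an illustration under assumptions, not the commit's code: the watchdog type, newWatchdog, markLive and isLiveDeath are hypothetical names, and only launchTagged is a name taken from the commit message (its real signature is not shown there).

```go
package main

import "sync"

// exitEvent carries the generation of the process that exited.
type exitEvent struct{ gen uint64 }

// watchdog is a hypothetical holder for the generation state.
type watchdog struct {
	mu      sync.Mutex
	nextGen uint64 // last generation handed out
	liveGen uint64 // generation of the process currently serving
	exits   chan exitEvent
}

func newWatchdog(buf int) *watchdog {
	return &watchdog{exits: make(chan exitEvent, buf)}
}

// launchTagged allocates a new generation and returns an OnExit-style
// callback that captures it in its closure.
func (w *watchdog) launchTagged() (uint64, func()) {
	w.mu.Lock()
	w.nextGen++
	gen := w.nextGen
	w.mu.Unlock()
	return gen, func() { w.exits <- exitEvent{gen: gen} }
}

// markLive records gen as the live generation once the process is ready.
func (w *watchdog) markLive(gen uint64) {
	w.mu.Lock()
	w.liveGen = gen
	w.mu.Unlock()
}

// isLiveDeath reports whether ev belongs to the live process. Stale
// events from killed half-launched attempts return false and are simply
// ignored, so no drain step is needed.
func (w *watchdog) isLiveDeath(ev exitEvent) bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	return ev.gen == w.liveGen
}
```

Because stale events are filtered by comparison rather than discarded in bulk, a real death that arrives immediately after a successful relaunch is still seen by the watcher.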
* fix(egress): widen restartCh buffer and use GracefulShutdown on retry
Two follow-ups to the generation-tagged watchdog:
1. restartCh buffer was 1, so under a retry storm a single stale exit
   event from a half-launched-and-killed mitmdump occupies the slot.
   When a later attempt succeeds and the freshly-restarted mitmdump
   dies immediately (continued OOM pressure), its OnExit hits the
   default branch in launchTagged and the real death event is dropped.
   Watcher then reads the stale event, ignores it on gen mismatch, and
   blocks forever, the same silent-dead-state the watchdog is meant
   to prevent. Buffer is bumped to 64; stale events are still cheap to
   discard via the gen check, we just need room to hold them.
2. Replace direct Process.Kill on the half-launched mitmdump with
   mitmproxy.GracefulShutdown(_, 1s). Kill returns immediately, so the
   next attempt's Launch can race the dying process for the listen
   port and fail WaitListenPort purely on contention; GracefulShutdown
   waits for reap and is consistent with shutdown.go.
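
The drop described in item 1 can be reproduced with a small sketch. It assumes the OnExit send was a select with a default branch, as the message states; trySend and liveEventSurvives are hypothetical helper names used only for this illustration.

```go
package main

// exitEvent carries the generation of the process that exited.
type exitEvent struct{ gen uint64 }

// trySend mirrors a select-with-default send: it never blocks and
// reports whether the event was queued.
func trySend(ch chan<- exitEvent, ev exitEvent) bool {
	select {
	case ch <- ev:
		return true
	default:
		return false // buffer full: the event is dropped
	}
}

// liveEventSurvives queues staleEvents stale exits into a channel of
// size bufSize, with no reader running, and then reports whether one
// more (live) exit event could still be queued.
func liveEventSurvives(bufSize, staleEvents int) bool {
	ch := make(chan exitEvent, bufSize)
	for gen := 1; gen <= staleEvents; gen++ {
		trySend(ch, exitEvent{gen: uint64(gen)})
	}
	return trySend(ch, exitEvent{gen: uint64(staleEvents) + 1})
}
```

With a buffer of 1, a single stale event is enough to drop the live one. With 64 the live event survives unless 64 stale events are already queued, so the larger buffer narrows the window without closing it.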
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(egress): block OnExit send with shutdown escape so live death events are not dropped
The previous default-drop send in launchTagged could still lose the only
exit signal for the currently-live mitmdump under sustained retry storms:
once the buffer fills with stale events from killed half-launched
attempts, a fresh process's death event hits the default branch and the
watcher sees only stale generations, leaving the watchdog idle in the
exact silent-dead-state it is meant to prevent.
Switch OnExit to a blocking send guarded by a shutdownCh that is closed
when watchMitmproxy's ctx is cancelled. The watcher always drains the
channel, so blocking is bounded; on shutdown the escape branch fires and
we log a warning so any drop is observable. Buffer stays at 64 purely as
a perf cushion against goroutine pile-up during retry storms; correctness
no longer depends on the size.
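
A minimal sketch of a blocking send with a shutdown escape, assuming the shape described above. The helper name sendExit and its boolean return value are hypothetical; shutdown plays the role of shutdownCh.

```go
package main

import "log"

// exitEvent carries the generation of the process that exited.
type exitEvent struct{ gen uint64 }

// sendExit blocks until ev is queued on exits or shutdown is closed.
// It returns false when the event was dropped because of shutdown.
func sendExit(exits chan<- exitEvent, shutdown <-chan struct{}, ev exitEvent) bool {
	select {
	case exits <- ev:
		return true
	case <-shutdown:
		// Log the drop so it is observable, as the commit message describes.
		log.Printf("watchdog shutting down; dropping exit event gen=%d", ev.gen)
		return false
	}
}
```

Unlike a default-branch send, this never discards an event while the watcher is still running: the sender waits until the watcher has consumed enough stale events to make room. The only drop path is shutdown, where no restart would follow anyway.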
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
1 parent dd03224 commit d0ca9ac
1 file changed
Lines changed: 142 additions & 33 deletions