Skip to content

Commit d0ca9ac

Browse files
Pangjipingclaude
andauthored
fix(egress): retry mitmdump restart with backoff instead of giving up (#942)
* fix(egress): retry mitmdump restart with backoff instead of giving up Previously, if Launch or WaitListenPort failed during a restart (e.g. under node memory pressure that just OOM-killed mitmdump), the watchdog goroutine would log "giving up" and return, leaving egress in a silent dead state with no future restarts. Replace the one-shot restart with restartWithBackoff: retry forever with exponential backoff (1s -> 30s), kill half-launched processes, drain stale exit signals on success, and respect ctx cancellation. Readiness gate stays false across the retry window so k8s drains traffic. * fix(egress): tag mitmdump exits with generation to avoid losing real death The previous restart-with-backoff fix drained restartCh after a successful relaunch to discard stale events from killed half-launched attempts. That drain has a race: if the freshly-restarted mitmdump dies between WaitListenPort returning and the drain executing, the real death event is swallowed and the watcher returns to the outer loop with nothing pending, re-entering the silent dead state the original fix was meant to prevent. Tag every Launch with a monotonic generation captured in its OnExit closure. The watcher compares the event's generation against the currently live generation (set atomically with setRunning) and ignores stale events without draining. Real deaths of the current mitmdump always match the live generation and trigger a restart. * fix(egress): widen restartCh buffer and use GracefulShutdown on retry Two follow-ups to the generation-tagged watchdog: 1. restartCh buffer was 1, so under a retry storm a single stale exit event from a half-launched-and-killed mitmdump occupies the slot. When a later attempt succeeds and the freshly-restarted mitmdump dies immediately (continued OOM pressure), its OnExit hits the default branch in launchTagged and the real death event is dropped. Watcher then reads the stale event, ignores it on gen mismatch, and blocks forever — the same silent-dead-state the watchdog is meant to prevent. Buffer is bumped to 64; stale events are still cheap to discard via the gen check, we just need room to hold them. 2. Replace direct Process.Kill on the half-launched mitmdump with mitmproxy.GracefulShutdown(_, 1s). Kill returns immediately, so the next attempt's Launch can race the dying process for the listen port and fail WaitListenPort purely on contention; GracefulShutdown waits for reap and is consistent with shutdown.go. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(egress): block OnExit send with shutdown escape so live death events are not dropped The previous default-drop send in launchTagged could still lose the only exit signal for the currently-live mitmdump under sustained retry storms: once the buffer fills with stale events from killed half-launched attempts, a fresh process's death event hits the default branch and the watcher sees only stale generations, leaving the watchdog idle in the exact silent-dead-state it is meant to prevent. Switch OnExit to a blocking send guarded by a shutdownCh that is closed when watchMitmproxy's ctx is cancelled. The watcher always drains the channel, so blocking is bounded; on shutdown the escape branch fires and we log a warning so any drop is observable. Buffer stays at 64 purely as a perf cushion against goroutine pile-up during retry storms; correctness no longer depends on the size. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
1 parent dd03224 commit d0ca9ac

1 file changed

Lines changed: 142 additions & 33 deletions

File tree

components/egress/mitmproxy_transparent.go

Lines changed: 142 additions & 33 deletions
Original file line numberDiff line numberDiff line change
@@ -20,6 +20,7 @@ import (
2020
"os"
2121
"strings"
2222
"sync"
23+
"sync/atomic"
2324
"time"
2425

2526
"github.com/alibaba/opensandbox/egress/pkg/constants"
@@ -29,13 +30,25 @@ import (
2930
"github.com/alibaba/opensandbox/internal/safego"
3031
)
3132

33+
// exitEvent carries an OnExit notification tagged with the generation of the
34+
// mitmdump process that produced it. Generation tagging lets the watcher tell
35+
// "the currently-running mitmdump just died" apart from "a half-launched
36+
// attempt we already killed during a retry storm just finished reaping".
37+
type exitEvent struct {
38+
gen uint64
39+
err error
40+
}
41+
3242
type mitmTransparent struct {
33-
mu sync.Mutex
34-
running *mitmproxy.Running
35-
port int
36-
uid uint32
37-
cfg mitmproxy.Config
38-
restartCh chan error
43+
mu sync.Mutex
44+
running *mitmproxy.Running
45+
currentGen uint64 // generation of the mitmdump currently considered live
46+
port int
47+
uid uint32
48+
cfg mitmproxy.Config // OnExit must NOT be set here; built per-Launch
49+
nextGen uint64 // atomic; monotonic gen counter handed to each Launch
50+
restartCh chan exitEvent
51+
shutdownCh chan struct{} // closed by watchMitmproxy on ctx cancel; lets OnExit unblock during shutdown
3952
}
4053

4154
func (m *mitmTransparent) getRunning() *mitmproxy.Running {
@@ -44,10 +57,38 @@ func (m *mitmTransparent) getRunning() *mitmproxy.Running {
4457
return m.running
4558
}
4659

47-
func (m *mitmTransparent) setRunning(r *mitmproxy.Running) {
60+
func (m *mitmTransparent) setRunning(r *mitmproxy.Running, gen uint64) {
4861
m.mu.Lock()
4962
defer m.mu.Unlock()
5063
m.running = r
64+
m.currentGen = gen
65+
}
66+
67+
func (m *mitmTransparent) getCurrentGen() uint64 {
68+
m.mu.Lock()
69+
defer m.mu.Unlock()
70+
return m.currentGen
71+
}
72+
73+
// launchTagged starts mitmdump with an OnExit closure that publishes the death
74+
// of this specific process (identified by gen) into restartCh.
75+
//
76+
// The send is blocking with shutdownCh as the only escape: dropping an exit
77+
// event while the watcher is still running can leave egress in a silent dead
78+
// state (the watcher would never see the death and never trigger a restart).
79+
// Stale events from killed half-launched attempts are still cheap to discard
80+
// downstream via the gen check in watchMitmproxy; we just must not lose them
81+
// in transit. Shutdown is the only legitimate reason to give up on a send,
82+
// and we log a warning when that happens so the drop is observable.
83+
func launchTagged(cfg mitmproxy.Config, restartCh chan<- exitEvent, shutdownCh <-chan struct{}, gen uint64) (*mitmproxy.Running, error) {
84+
cfg.OnExit = func(err error) {
85+
select {
86+
case restartCh <- exitEvent{gen: gen, err: err}:
87+
case <-shutdownCh:
88+
log.Warnf("[mitmproxy] dropping exit event during shutdown (gen=%d): %v", gen, err)
89+
}
90+
}
91+
return mitmproxy.Launch(cfg)
5192
}
5293

5394
// startMitmproxyTransparentIfEnabled starts mitmdump in transparent mode, waits for the listener, and installs OUTPUT REDIRECT, then syncs the CA.
@@ -68,14 +109,15 @@ func startMitmproxyTransparentIfEnabled() (*mitmTransparent, error) {
68109
ConfDir: strings.TrimSpace(os.Getenv(constants.EnvMitmproxyConfDir)),
69110
ScriptPath: strings.TrimSpace(os.Getenv(constants.EnvMitmproxyScript)),
70111
}
71-
restartCh := make(chan error, 1)
72-
cfg.OnExit = func(err error) {
73-
select {
74-
case restartCh <- err:
75-
default:
76-
}
77-
}
78-
running, err := mitmproxy.Launch(cfg)
112+
// Buffer absorbs OnExit events from a retry storm so OnExit goroutines
113+
// don't all park waiting for the watcher to drain. Correctness does not
114+
// depend on the size: launchTagged uses a blocking send with shutdownCh
115+
// as the only escape, so events cannot be silently dropped while the
116+
// watcher is alive.
117+
restartCh := make(chan exitEvent, 64)
118+
shutdownCh := make(chan struct{})
119+
const initialGen uint64 = 1
120+
running, err := launchTagged(cfg, restartCh, shutdownCh, initialGen)
79121
if err != nil {
80122
return nil, fmt.Errorf("start mitmdump: %w", err)
81123
}
@@ -93,44 +135,111 @@ func startMitmproxyTransparentIfEnabled() (*mitmTransparent, error) {
93135
if err := mitmproxy.SyncRootCA(confDir, mpHome); err != nil {
94136
return nil, fmt.Errorf("mitm CA export: %w", err)
95137
}
96-
return &mitmTransparent{running: running, port: mpPort, uid: mpUID, cfg: cfg, restartCh: restartCh}, nil
138+
return &mitmTransparent{
139+
running: running,
140+
currentGen: initialGen,
141+
port: mpPort,
142+
uid: mpUID,
143+
cfg: cfg,
144+
nextGen: initialGen,
145+
restartCh: restartCh,
146+
shutdownCh: shutdownCh,
147+
}, nil
97148
}
98149

99150
// watchMitmproxy monitors mitmdump for unexpected exits, logs the error, and restarts it.
100151
// Must be called after startMitmproxyTransparentIfEnabled.
101152
func (m *mitmTransparent) watchMitmproxy(ctx context.Context, gate *mitmproxy.HealthGate) {
153+
// Closing shutdownCh on ctx cancel unblocks any OnExit closures that are
154+
// parked on the (now-unread) restartCh send so they don't leak past
155+
// shutdown.
156+
safego.Go(func() {
157+
<-ctx.Done()
158+
close(m.shutdownCh)
159+
})
102160
safego.Go(func() {
103161
for {
104162
select {
105-
case err := <-m.restartCh:
163+
case ev := <-m.restartCh:
106164
select {
107165
case <-ctx.Done():
108166
return
109167
default:
110168
}
169+
cur := m.getCurrentGen()
170+
if ev.gen != cur {
171+
// Stale event: a previous half-launched attempt that we
172+
// killed is just now being reaped. The currently-live
173+
// mitmdump is unaffected; ignore and keep watching.
174+
log.Infof("[mitmproxy] ignoring stale exit event (gen=%d, current=%d): %v", ev.gen, cur, ev.err)
175+
continue
176+
}
111177

112-
log.Errorf("[mitmproxy] mitmdump exited: %v; restarting...", err)
178+
log.Errorf("[mitmproxy] mitmdump exited (gen=%d): %v; restarting...", ev.gen, ev.err)
113179
gate.SetReady(false)
180+
m.restartWithBackoff(ctx, gate)
114181

115-
newRunning, launchErr := mitmproxy.Launch(m.cfg)
116-
if launchErr != nil {
117-
log.Errorf("[mitmproxy] failed to restart mitmdump: %v; giving up", launchErr)
118-
return
119-
}
182+
case <-ctx.Done():
183+
return
184+
}
185+
}
186+
})
187+
}
120188

121-
waitAddr := fmt.Sprintf("127.0.0.1:%d", m.cfg.ListenPort)
122-
if waitErr := mitmproxy.WaitListenPort(waitAddr, 15*time.Second); waitErr != nil {
123-
log.Errorf("[mitmproxy] restart: wait listen %s: %v; giving up", waitAddr, waitErr)
124-
return
125-
}
189+
// restartWithBackoff retries mitmdump launch indefinitely with exponential backoff
190+
// (1s, 2s, 4s, ..., capped at 30s) until it succeeds or ctx is cancelled.
191+
// Transient OOM / resource pressure must not leave egress in a permanent dead state.
192+
//
193+
// Each attempt is tagged with a fresh generation; setRunning publishes that
194+
// generation as the "live" one. Exit events for older (killed) generations are
195+
// filtered out by watchMitmproxy, so we do NOT drain restartCh here -- doing
196+
// so could swallow a real death of the freshly-restarted mitmdump.
197+
func (m *mitmTransparent) restartWithBackoff(ctx context.Context, gate *mitmproxy.HealthGate) {
198+
const (
199+
initialBackoff = time.Second
200+
maxBackoff = 30 * time.Second
201+
)
202+
backoff := initialBackoff
203+
waitAddr := fmt.Sprintf("127.0.0.1:%d", m.cfg.ListenPort)
204+
205+
for attempt := 1; ; attempt++ {
206+
select {
207+
case <-ctx.Done():
208+
return
209+
default:
210+
}
126211

127-
m.setRunning(newRunning)
212+
gen := atomic.AddUint64(&m.nextGen, 1)
213+
newRunning, launchErr := launchTagged(m.cfg, m.restartCh, m.shutdownCh, gen)
214+
if launchErr == nil {
215+
if waitErr := mitmproxy.WaitListenPort(waitAddr, 15*time.Second); waitErr == nil {
216+
m.setRunning(newRunning, gen)
128217
gate.SetReady(true)
129-
log.Infof("[mitmproxy] mitmdump restarted (pid %d)", newRunning.Cmd.Process.Pid)
130-
131-
case <-ctx.Done():
218+
log.Infof("[mitmproxy] mitmdump restarted (pid %d, gen %d, attempt %d)", newRunning.Cmd.Process.Pid, gen, attempt)
132219
return
220+
} else {
221+
log.Errorf("[mitmproxy] restart attempt %d (gen %d): wait listen %s: %v", attempt, gen, waitAddr, waitErr)
222+
// GracefulShutdown SIGTERMs then SIGKILLs and waits for reap, so
223+
// the listen port is released before the next attempt's Launch
224+
// races to bind it. Direct Process.Kill returns immediately and
225+
// can cause spurious WaitListenPort failures on port contention.
226+
mitmproxy.GracefulShutdown(newRunning, time.Second)
133227
}
228+
} else {
229+
log.Errorf("[mitmproxy] restart attempt %d (gen %d): launch failed: %v", attempt, gen, launchErr)
134230
}
135-
})
231+
232+
log.Warnf("[mitmproxy] restart attempt %d failed; retrying in %s", attempt, backoff)
233+
select {
234+
case <-ctx.Done():
235+
return
236+
case <-time.After(backoff):
237+
}
238+
if backoff < maxBackoff {
239+
backoff *= 2
240+
if backoff > maxBackoff {
241+
backoff = maxBackoff
242+
}
243+
}
244+
}
136245
}

0 commit comments

Comments
 (0)