fix(egress): retry mitmdump restart with backoff instead of giving up#942
Conversation
Previously, if Launch or WaitListenPort failed during a restart (e.g. under node memory pressure that just OOM-killed mitmdump), the watchdog goroutine would log "giving up" and return, leaving egress in a silent dead state with no future restarts. Replace the one-shot restart with restartWithBackoff: retry forever with exponential backoff (1s -> 30s), kill half-launched processes, drain stale exit signals on success, and respect ctx cancellation. Readiness gate stays false across the retry window so k8s drains traffic.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 56c2623741
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
…death The previous restart-with-backoff fix drained restartCh after a successful relaunch to discard stale events from killed half-launched attempts. That drain has a race: if the freshly-restarted mitmdump dies between WaitListenPort returning and the drain executing, the real death event is swallowed and the watcher returns to the outer loop with nothing pending, re-entering the silent dead state the original fix was meant to prevent. Tag every Launch with a monotonic generation captured in its OnExit closure. The watcher compares the event's generation against the currently live generation (set atomically with setRunning) and ignores stale events without draining. Real deaths of the current mitmdump always match the live generation and trigger a restart.
Two follow-ups to the generation-tagged watchdog: 1. restartCh buffer was 1, so under a retry storm a single stale exit event from a half-launched-and-killed mitmdump occupies the slot. When a later attempt succeeds and the freshly-restarted mitmdump dies immediately (continued OOM pressure), its OnExit hits the default branch in launchTagged and the real death event is dropped. Watcher then reads the stale event, ignores it on gen mismatch, and blocks forever — the same silent-dead-state the watchdog is meant to prevent. Buffer is bumped to 64; stale events are still cheap to discard via the gen check, we just need room to hold them. 2. Replace direct Process.Kill on the half-launched mitmdump with mitmproxy.GracefulShutdown(_, 1s). Kill returns immediately, so the next attempt's Launch can race the dying process for the listen port and fail WaitListenPort purely on contention; GracefulShutdown waits for reap and is consistent with shutdown.go. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bd2d66d397
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
…nts are not dropped The previous default-drop send in launchTagged could still lose the only exit signal for the currently-live mitmdump under sustained retry storms: once the buffer fills with stale events from killed half-launched attempts, a fresh process's death event hits the default branch and the watcher sees only stale generations, leaving the watchdog idle in the exact silent-dead-state it is meant to prevent. Switch OnExit to a blocking send guarded by a shutdownCh that is closed when watchMitmproxy's ctx is cancelled. The watcher always drains the channel, so blocking is bounded; on shutdown the escape branch fires and we log a warning so any drop is observable. Buffer stays at 64 purely as a perf cushion against goroutine pile-up during retry storms; correctness no longer depends on the size. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(egress): wrap with supervisor + cleanup hook Hard-crashed egress leaves stale iptables/nft rules and a zombie mitmdump holding port 18081; restarting the container then accumulates duplicate rules and the new mitmdump fails to bind, sending the in-process mitm watchdog (PR #942) into a retry loop. This change keeps the egress process under a dedicated supervisor so restarts are deterministic and the dirty state is reset on every launch and exit. components/internal/supervisor: new shared single-worker supervisor. Exponential backoff with jitter, pre-start / post-exit hooks (failures non-fatal), crashloop circuit breaker, JSONL event log. SIGTERM is forwarded to the worker with a configurable grace period before SIGKILL. Includes unit + integration tests using a re-exec'd test binary as a fake child. components/internal/cmd/supervisor: opensandbox-supervisor binary built from the same module; flag-driven, no new external deps. components/egress/scripts/cleanup.sh: best-effort, idempotent reset of the iptables DNS REDIRECT rules, transparent-HTTP rules, the nftables `opensandbox` table, and stray mitmdump processes. Hard contract: never exit non-zero so a misbehaving cleanup cannot block restarts. components/egress/Dockerfile: builds and installs the supervisor and the cleanup script alongside the egress binary under /opt/opensandbox-egress/, then switches the ENTRYPOINT to run the supervisor with cleanup as both pre-start and post-exit, grace 20s to cover the egress-internal shutdown budget. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * style(supervisor): gofmt Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(supervisor): jitter-disable, env aliasing, cleanup dedupe, tests Self-review fixes: - BackoffJitter is now *float64 so callers can pass &zero to disable jitter explicitly. The previous default override turned 0 into 0.1, making "no jitter" impossible. cmd/supervisor exposes the value via --backoff-jitter (default 0.1). - Build hookEnv into a fresh slice instead of `append(spec.Env, ...)`, which could write into spec.Env's underlying array when cap > len. - Hoist delete_until_gone to file scope in cleanup.sh; remove the two inline duplicates. - Add cmd/supervisor argv tests: splitOnDoubleDash table cases, toHooks, openEventLog stderr + dir creation, eventLogDest label. - Add backoff tests covering jitter=0 and applyDefaults() pointer semantics. - Document signal handling in the package doc: SIGINT/SIGTERM trigger shutdown via ctx; SIGHUP / SIGUSR1 / SIGUSR2 / SIGWINCH / SIGQUIT are NOT forwarded. - Remove dead fakeClock.advance helper. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(egress): keep enforcement active during restart backoff PR review (#951, Codex P1) caught a real security regression: the post-exit cleanup hook was tearing down the iptables DNS REDIRECT rules and the `inet opensandbox` nft table before the supervisor slept for backoff and relaunched egress. Because the egress sidecar shares its network namespace with the workload it is meant to filter, that window left the workload with unfiltered egress instead of the stale default-deny rules continuing to protect it. With a worst-case crashloop budget of 10 launches over 5 minutes, that window can stretch to minutes. The fix is to leave netfilter state alone between runs: - Drop the post-exit hook entirely. The backoff window now keeps the previous run's enforcement rules in place. - Slim cleanup.sh to mitmdump-reaping only. iptables rule accumulation across many restarts is a slower-burn drift that egress's own SetupRedirect tolerates (first match wins); the nftables manager already prepends `delete table` to its ruleset script, so ApplyStatic is idempotent. Neither needs hook intervention. - Keep the pre-start mitmdump kill so the new egress can bind the transparent-MITM listen port without colliding with an orphan. (Codex P2 — zombie reaping when supervisor is PID 1 — is intentionally not addressed in this commit; it does not gate the security fix.) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Testing
Breaking Changes
Checklist