Skip to content

fix(egress): retry mitmdump restart with backoff instead of giving up#942

Merged
hittyt merged 4 commits into
opensandbox-group:mainfrom
Pangjiping:fix/egress-mitmproxy-watchdog-backoff
May 27, 2026
Merged

fix(egress): retry mitmdump restart with backoff instead of giving up#942
hittyt merged 4 commits into
opensandbox-group:mainfrom
Pangjiping:fix/egress-mitmproxy-watchdog-backoff

Conversation

@Pangjiping
Copy link
Copy Markdown
Collaborator

Summary

  • Previously, if Launch or WaitListenPort failed during a restart (e.g. under node memory pressure that just OOM-killed mitmdump), the watchdog goroutine would log "giving up" and return, leaving egress in a silent dead state with no future restarts.
  • Replace the one-shot restart with restartWithBackoff: retry forever with exponential backoff (1s -> 30s), kill half-launched processes, drain stale exit signals on success, and respect ctx cancellation. Readiness gate stays false across the retry window so k8s drains traffic.

Testing

  • Not run (explain why)
  • Unit tests
  • Integration tests
  • e2e / manual verification

Breaking Changes

  • None
  • Yes (describe impact and migration path)

Checklist

  • Linked Issue or clearly described motivation
  • Added/updated docs (if needed)
  • Added/updated tests (if needed)
  • Security impact considered
  • Backward compatibility considered

Previously, if Launch or WaitListenPort failed during a restart (e.g.
under node memory pressure that just OOM-killed mitmdump), the watchdog
goroutine would log "giving up" and return, leaving egress in a silent
dead state with no future restarts.

Replace the one-shot restart with restartWithBackoff: retry forever with
exponential backoff (1s -> 30s), kill half-launched processes, drain
stale exit signals on success, and respect ctx cancellation. Readiness
gate stays false across the retry window so k8s drains traffic.
@Pangjiping Pangjiping added the bug Something isn't working label May 25, 2026
@Pangjiping Pangjiping requested review from hittyt and jwx0925 as code owners May 25, 2026 02:57
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 56c2623741

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread components/egress/mitmproxy_transparent.go Outdated
Comment thread components/egress/mitmproxy_transparent.go Outdated
Pangjiping and others added 2 commits May 25, 2026 11:06
…death

The previous restart-with-backoff fix drained restartCh after a successful
relaunch to discard stale events from killed half-launched attempts. That
drain has a race: if the freshly-restarted mitmdump dies between
WaitListenPort returning and the drain executing, the real death event is
swallowed and the watcher returns to the outer loop with nothing pending,
re-entering the silent dead state the original fix was meant to prevent.

Tag every Launch with a monotonic generation captured in its OnExit
closure. The watcher compares the event's generation against the currently
live generation (set atomically with setRunning) and ignores stale events
without draining. Real deaths of the current mitmdump always match the
live generation and trigger a restart.
Two follow-ups to the generation-tagged watchdog:

1. restartCh buffer was 1, so under a retry storm a single stale exit
   event from a half-launched-and-killed mitmdump occupies the slot.
   When a later attempt succeeds and the freshly-restarted mitmdump
   dies immediately (continued OOM pressure), its OnExit hits the
   default branch in launchTagged and the real death event is dropped.
   Watcher then reads the stale event, ignores it on gen mismatch, and
   blocks forever — the same silent-dead-state the watchdog is meant
   to prevent. Buffer is bumped to 64; stale events are still cheap to
   discard via the gen check, we just need room to hold them.

2. Replace direct Process.Kill on the half-launched mitmdump with
   mitmproxy.GracefulShutdown(_, 1s). Kill returns immediately, so the
   next attempt's Launch can race the dying process for the listen
   port and fail WaitListenPort purely on contention; GracefulShutdown
   waits for reap and is consistent with shutdown.go.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bd2d66d397

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread components/egress/mitmproxy_transparent.go
…nts are not dropped

The previous default-drop send in launchTagged could still lose the only
exit signal for the currently-live mitmdump under sustained retry storms:
once the buffer fills with stale events from killed half-launched
attempts, a fresh process's death event hits the default branch and the
watcher sees only stale generations, leaving the watchdog idle in the
exact silent-dead-state it is meant to prevent.

Switch OnExit to a blocking send guarded by a shutdownCh that is closed
when watchMitmproxy's ctx is cancelled. The watcher always drains the
channel, so blocking is bounded; on shutdown the escape branch fires and
we log a warning so any drop is observable. Buffer stays at 64 purely as
a perf cushion against goroutine pile-up during retry storms; correctness
no longer depends on the size.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown
Collaborator

@hittyt hittyt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@hittyt hittyt merged commit d0ca9ac into opensandbox-group:main May 27, 2026
11 checks passed
hittyt pushed a commit that referenced this pull request Jun 3, 2026
* feat(egress): wrap with supervisor + cleanup hook

Hard-crashed egress leaves stale iptables/nft rules and a zombie mitmdump
holding port 18081; restarting the container then accumulates duplicate
rules and the new mitmdump fails to bind, sending the in-process mitm
watchdog (PR #942) into a retry loop. This change keeps the egress
process under a dedicated supervisor so restarts are deterministic and
the dirty state is reset on every launch and exit.

components/internal/supervisor: new shared single-worker supervisor.
Exponential backoff with jitter, pre-start / post-exit hooks (failures
non-fatal), crashloop circuit breaker, JSONL event log. SIGTERM is
forwarded to the worker with a configurable grace period before SIGKILL.
Includes unit + integration tests using a re-exec'd test binary as a
fake child.

components/internal/cmd/supervisor: opensandbox-supervisor binary built
from the same module; flag-driven, no new external deps.

components/egress/scripts/cleanup.sh: best-effort, idempotent reset of
the iptables DNS REDIRECT rules, transparent-HTTP rules, the
nftables `opensandbox` table, and stray mitmdump processes. Hard
contract: never exit non-zero so a misbehaving cleanup cannot block
restarts.

components/egress/Dockerfile: builds and installs the supervisor and
the cleanup script alongside the egress binary under
/opt/opensandbox-egress/, then switches the ENTRYPOINT to run the
supervisor with cleanup as both pre-start and post-exit, grace 20s to
cover the egress-internal shutdown budget.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* style(supervisor): gofmt

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(supervisor): jitter-disable, env aliasing, cleanup dedupe, tests

Self-review fixes:

- BackoffJitter is now *float64 so callers can pass &zero to disable
  jitter explicitly. The previous default override turned 0 into 0.1,
  making "no jitter" impossible. cmd/supervisor exposes the value via
  --backoff-jitter (default 0.1).
- Build hookEnv into a fresh slice instead of `append(spec.Env, ...)`,
  which could write into spec.Env's underlying array when cap > len.
- Hoist delete_until_gone to file scope in cleanup.sh; remove the two
  inline duplicates.
- Add cmd/supervisor argv tests: splitOnDoubleDash table cases, toHooks,
  openEventLog stderr + dir creation, eventLogDest label.
- Add backoff tests covering jitter=0 and applyDefaults() pointer
  semantics.
- Document signal handling in the package doc: SIGINT/SIGTERM trigger
  shutdown via ctx; SIGHUP / SIGUSR1 / SIGUSR2 / SIGWINCH / SIGQUIT are
  NOT forwarded.
- Remove dead fakeClock.advance helper.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(egress): keep enforcement active during restart backoff

PR review (#951, Codex P1) caught a real security
regression: the post-exit cleanup hook was tearing down the iptables DNS
REDIRECT rules and the `inet opensandbox` nft table before the supervisor
slept for backoff and relaunched egress. Because the egress sidecar shares
its network namespace with the workload it is meant to filter, that window
left the workload with unfiltered egress instead of the stale default-deny
rules continuing to protect it. With a worst-case crashloop budget of
10 launches over 5 minutes, that window can stretch to minutes.

The fix is to leave netfilter state alone between runs:

- Drop the post-exit hook entirely. The backoff window now keeps the
  previous run's enforcement rules in place.
- Slim cleanup.sh to mitmdump-reaping only. iptables rule accumulation
  across many restarts is a slower-burn drift that egress's own
  SetupRedirect tolerates (first match wins); the nftables manager
  already prepends `delete table` to its ruleset script, so ApplyStatic
  is idempotent. Neither needs hook intervention.
- Keep the pre-start mitmdump kill so the new egress can bind the
  transparent-MITM listen port without colliding with an orphan.

(Codex P2 — zombie reaping when supervisor is PID 1 — is intentionally
not addressed in this commit; it does not gate the security fix.)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working component/egress

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants