feat(egress): wrap with supervisor + cleanup hook#951
Merged
Conversation
Hard-crashed egress leaves stale iptables/nft rules and a zombie mitmdump holding port 18081; restarting the container then accumulates duplicate rules and the new mitmdump fails to bind, sending the in-process mitm watchdog (PR opensandbox-group#942) into a retry loop. This change keeps the egress process under a dedicated supervisor so restarts are deterministic and the dirty state is reset on every launch and exit. components/internal/supervisor: new shared single-worker supervisor. Exponential backoff with jitter, pre-start / post-exit hooks (failures non-fatal), crashloop circuit breaker, JSONL event log. SIGTERM is forwarded to the worker with a configurable grace period before SIGKILL. Includes unit + integration tests using a re-exec'd test binary as a fake child. components/internal/cmd/supervisor: opensandbox-supervisor binary built from the same module; flag-driven, no new external deps. components/egress/scripts/cleanup.sh: best-effort, idempotent reset of the iptables DNS REDIRECT rules, transparent-HTTP rules, the nftables `opensandbox` table, and stray mitmdump processes. Hard contract: never exit non-zero so a misbehaving cleanup cannot block restarts. components/egress/Dockerfile: builds and installs the supervisor and the cleanup script alongside the egress binary under /opt/opensandbox-egress/, then switches the ENTRYPOINT to run the supervisor with cleanup as both pre-start and post-exit, grace 20s to cover the egress-internal shutdown budget. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Self-review fixes: - BackoffJitter is now *float64 so callers can pass &zero to disable jitter explicitly. The previous default override turned 0 into 0.1, making "no jitter" impossible. cmd/supervisor exposes the value via --backoff-jitter (default 0.1). - Build hookEnv into a fresh slice instead of `append(spec.Env, ...)`, which could write into spec.Env's underlying array when cap > len. - Hoist delete_until_gone to file scope in cleanup.sh; remove the two inline duplicates. - Add cmd/supervisor argv tests: splitOnDoubleDash table cases, toHooks, openEventLog stderr + dir creation, eventLogDest label. - Add backoff tests covering jitter=0 and applyDefaults() pointer semantics. - Document signal handling in the package doc: SIGINT/SIGTERM trigger shutdown via ctx; SIGHUP / SIGUSR1 / SIGUSR2 / SIGWINCH / SIGQUIT are NOT forwarded. - Remove dead fakeClock.advance helper. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8b82b6ee00
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
PR review (opensandbox-group#951, Codex P1) caught a real security regression: the post-exit cleanup hook was tearing down the iptables DNS REDIRECT rules and the `inet opensandbox` nft table before the supervisor slept for backoff and relaunched egress. Because the egress sidecar shares its network namespace with the workload it is meant to filter, that window left the workload with unfiltered egress instead of the stale default-deny rules continuing to protect it. With a worst-case crashloop budget of 10 launches over 5 minutes, that window can stretch to minutes. The fix is to leave netfilter state alone between runs: - Drop the post-exit hook entirely. The backoff window now keeps the previous run's enforcement rules in place. - Slim cleanup.sh to mitmdump-reaping only. iptables rule accumulation across many restarts is a slower-burn drift that egress's own SetupRedirect tolerates (first match wins); the nftables manager already prepends `delete table` to its ruleset script, so ApplyStatic is idempotent. Neither needs hook intervention. - Keep the pre-start mitmdump kill so the new egress can bind the transparent-MITM listen port without colliding with an orphan. (Codex P2 — zombie reaping when supervisor is PID 1 — is intentionally not addressed in this commit; it does not gate the security fix.) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pangjiping
added a commit
to Pangjiping/OpenSandbox
that referenced
this pull request
Jun 4, 2026
…ibility PR opensandbox-group#951 moved the egress binary from /egress to /opt/opensandbox-egress/egress so the supervisor and binary could share a single grouped directory. External tooling and older deployment manifests may still reference the old /egress path; add a symlink so both paths resolve to the same binary. Symlink rather than COPY: zero extra image size, single source of truth for chmod and replacement, and `exec /egress` resolves to the supervisor-managed binary like before.
Pangjiping
added a commit
to Pangjiping/OpenSandbox
that referenced
this pull request
Jun 4, 2026
…ibility PR opensandbox-group#951 moved the egress binary from /egress to /opt/opensandbox-egress/egress so the supervisor and binary could share a single grouped directory. External tooling and older deployment manifests may still reference the old /egress path; add a symlink so both paths resolve to the same binary. Symlink rather than COPY: zero extra image size, single source of truth for chmod and replacement, and `exec /egress` resolves to the supervisor-managed binary like before.
Pangjiping
added a commit
to Pangjiping/OpenSandbox
that referenced
this pull request
Jun 4, 2026
…ibility PR opensandbox-group#951 moved the egress binary from /egress to /opt/opensandbox-egress/egress so the supervisor and binary could share a single grouped directory. External tooling and older deployment manifests may still reference the old /egress path; add a symlink so both paths resolve to the same binary. Symlink rather than COPY: zero extra image size, single source of truth for chmod and replacement, and `exec /egress` resolves to the supervisor-managed binary like before.
11 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Hard-crashed egress leaves stale iptables/nft rules and a zombie mitmdump holding port 18081; restarting the container then accumulates duplicate rules and the new mitmdump fails to bind, sending the in-process mitm watchdog (PR #942) into a retry loop. This change keeps the egress process under a dedicated supervisor so restarts are deterministic and the dirty state is reset on every launch and exit.
components/internal/supervisor: new shared single-worker supervisor. Exponential backoff with jitter, pre-start / post-exit hooks (failures non-fatal), crashloop circuit breaker, JSONL event log. SIGTERM is forwarded to the worker with a configurable grace period before SIGKILL. Includes unit + integration tests using a re-exec'd test binary as a fake child.
components/internal/cmd/supervisor: opensandbox-supervisor binary built from the same module; flag-driven, no new external deps.
components/egress/scripts/cleanup.sh: best-effort, idempotent reset of the iptables DNS REDIRECT rules, transparent-HTTP rules, the nftables
opensandboxtable, and stray mitmdump processes. Hard contract: never exit non-zero so a misbehaving cleanup cannot block restarts.components/egress/Dockerfile: builds and installs the supervisor and the cleanup script alongside the egress binary under /opt/opensandbox-egress/, then switches the ENTRYPOINT to run the supervisor with cleanup as both pre-start and post-exit, grace 20s to cover the egress-internal shutdown budget.
Testing
Breaking Changes
Checklist