fix(envd): self-heal MMDS routing on /init lookup failure #2701
When a user-installed PREROUTING/OUTPUT NAT rule in the same netns shadows the MMDS route (e.g. `iptables -t nat -A PREROUTING -d 169.254.169.254 -p tcp --dport 80 -j REDIRECT`), envd's resume-time token-hash lookup fails and /init returns 401. The user's app keeps working because it depends on the redirect.

On MMDS lookup failure only (no happy-path cost), install a private nat chain (E2B_MMDS) with a RETURN for 169.254.169.254:80, re-pin its jump at position 1 in PREROUTING and OUTPUT, and retry the MMDS lookup once.

Self-healing rather than always-on: every /init that hits the slow path puts our jump back at position 1, so subsequent resumes recover even if the customer template still installs the conflicting rule.
PR Summary: Medium Risk. Reviewed by Cursor Bugbot for commit 6589bfb.
❌ 3 Tests Failed (3 ❄️ flaky tests)
Code Review
Using the RETURN target in the E2B_MMDS chain allows subsequent rules in PREROUTING or OUTPUT to still process the packet; using ACCEPT instead would effectively bypass user-installed NAT rules by terminating chain traversal. The iptables commands should also include the -w flag to prevent failures due to concurrent lock acquisition in the network namespace.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: RETURN target does not prevent subsequent NAT rule matching
- Changed iptables target from RETURN to ACCEPT on line 21 to terminate all NAT rule evaluation and prevent subsequent REDIRECT rules from hijacking MMDS traffic.
Preview (33f7a29d4b)
diff --git a/packages/envd/internal/host/mmds_route_linux.go b/packages/envd/internal/host/mmds_route_linux.go
--- a/packages/envd/internal/host/mmds_route_linux.go
+++ b/packages/envd/internal/host/mmds_route_linux.go
@@ -18,7 +18,7 @@
commands := [][]string{
{"-N", "E2B_MMDS"},
{"-F", "E2B_MMDS"},
- {"-A", "E2B_MMDS", "-d", "169.254.169.254", "-p", "tcp", "--dport", "80", "-j", "RETURN"},
+ {"-A", "E2B_MMDS", "-d", "169.254.169.254", "-p", "tcp", "--dport", "80", "-j", "ACCEPT"},
{"-D", "PREROUTING", "-j", "E2B_MMDS"},
{"-I", "PREROUTING", "1", "-j", "E2B_MMDS"},
{"-D", "OUTPUT", "-j", "E2B_MMDS"},
Reviewed by Cursor Bugbot for commit 37c4de7.
Switch from a private E2B_MMDS chain (jump + RETURN) to inserting the RETURN rule directly into nat PREROUTING/OUTPUT at position 1, matching the rules known to work. The chain approach was not equivalent: RETURN from a user-defined chain only resumes traversal at the next rule in the calling chain, so customer DNAT/REDIRECT rules at PREROUTING[2:] would still match.
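To make the difference concrete, here is a toy model of nat chain traversal. It is only a sketch: `Rule`, `Table`, `traverse`, and `Evaluate` are hypothetical names, rules match on destination only, and the built-in chain policy is modeled as "packet left unchanged".

```go
package main

// Rule is a simplified nat rule that matches on destination only.
type Rule struct {
	Dst    string // destination to match; "" matches every packet
	Target string // "RETURN", "REDIRECT", or the name of a user-defined chain
}

// Table maps chain names to their ordered rules.
type Table map[string][]Rule

// traverse walks one chain. done=false means the chain ended or hit RETURN
// without a terminating verdict, so the caller resumes at its next rule.
func traverse(t Table, chain, dst string) (verdict string, done bool) {
	for _, r := range t[chain] {
		if r.Dst != "" && r.Dst != dst {
			continue
		}
		switch r.Target {
		case "RETURN":
			return "", false
		case "REDIRECT":
			return "REDIRECT", true
		default: // jump into a user-defined chain
			if v, ok := traverse(t, r.Target, dst); ok {
				return v, true
			}
		}
	}
	return "", false
}

// Evaluate runs a built-in chain. With no terminating verdict the chain
// policy applies, modeled here as "UNCHANGED" (no address rewrite).
func Evaluate(t Table, builtin, dst string) string {
	if v, ok := traverse(t, builtin, dst); ok {
		return v
	}
	return "UNCHANGED"
}
```

In this model, a jump to a chain that RETURNs still lets a later REDIRECT in PREROUTING match, while a RETURN placed directly at PREROUTING position 1 ends traversal before the customer rule is reached.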
Gemini review: a user iptables process can hold the xtables lock when our self-heal fires; -w 5 makes us wait up to 5s instead of failing immediately on EAGAIN.
- pkg/version.go: 0.5.23 -> 0.5.24 (new behavioral path on /init).
- init.go: comment referenced the private chain that was removed in b425dcf; describe the direct PREROUTING/OUTPUT re-pin instead.
Previously the self-heal swallowed every iptables error, which hid real breakage (e.g. iptables binary missing, broken iptables module load, xtables lock contention beyond -w). Now PinMMDSRoute returns the first -I failure with the iptables stderr attached, and the caller logs at warn. -D failures stay silent: "rule absent" is the expected first-run case.
Drop the in-pin cooldown (we want self-heal to fire on every retry in case user rules keep landing back on top) and instead floor the warn log to once per 10s with a suppressed-since-last counter, mirroring internal/logs/exporter/rate_limited_logger.go.
- Extract the gate from internal/logs/exporter/rate_limited_logger.go into internal/logs/ratelimit.Limiter (logger-agnostic; returns (allowed, suppressedSinceLast)). Exporter wraps it for log.Printf, init.go uses it directly with zerolog.
- Add Timeout to PollForMMDSOpts' http.Client (matches the existing mmdsAccessTokenClient 10s). Without it a -j DROP on MMDS would hang each tick on the TCP handshake until the parent ctx expired.
Coalesce concurrent /init self-heal calls with an atomic.Bool TryLock so we don't fire iptables mutations in parallel against the same nat table. Same pattern as isMountingNFS.
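As a rough sketch of this coalescing pattern, assuming Go 1.19+ for `atomic.Bool` (the `healGate` type and `TryRun` method are hypothetical names, not taken from the PR):

```go
package main

import "sync/atomic"

// healGate coalesces concurrent self-heal attempts: one caller runs at a
// time and any caller that arrives meanwhile skips instead of waiting.
type healGate struct {
	running atomic.Bool
}

// TryRun runs fn if no other call is in flight and reports whether it ran.
func (g *healGate) TryRun(fn func()) bool {
	if !g.running.CompareAndSwap(false, true) {
		return false // another caller is already mutating the nat table
	}
	defer g.running.Store(false)
	fn()
	return true
}
```

Skipping rather than blocking fits here because the in-flight call already performs the same re-pin the skipped caller wanted.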
# Conflicts:
#	packages/envd/internal/host/mmds.go


On MMDS lookup failure (no happy-path cost), install a private nat chain (E2B_MMDS) that `RETURN`s for `169.254.169.254:80` and pin its jump at position 1 in PREROUTING and OUTPUT, removing prior copies first. Retry the lookup once.
Idempotent: `-N`/`-D` errors are expected and swallowed.
Follow-ups (separate PRs)