fix(envd): bound MMDS lookup in /init to prevent initLock starvation#2689
fix(envd): bound MMDS lookup in /init to prevent initLock starvation#2689ValentaTomas wants to merge 2 commits into
Conversation
PostInit holds initLock across SetData, which calls validateInitAccessToken → checkMMDSHash → mmdsClient.GetAccessTokenHash. That call used the request context directly. The orchestrator's per-request ctx has a 50 ms timeout but the handler keeps running after the client cancels (Go's net/http doesn't abort handlers on client disconnect), so an MMDS thread stalled by FC's single VMM thread would hold initLock indefinitely while every retry queued behind it. Wrap the MMDS hash lookup in a 1 s context so a stuck call can't lock the handler. MMDS responses are otherwise sub-millisecond on healthy hosts so this is well above the steady-state cost.
PR SummaryLow Risk Overview Reviewed by Cursor Bugbot for commit b685c36. Bugbot is set up for automated code reviews on this repo. Configure here. |
❌ 8 Tests Failed:
View the top 1 failed test(s) by shortest run time
View the full list of 15 ❄️ flaky test(s)
To view more test analytics, go to the Test Analytics Dashboard |
There was a problem hiding this comment.
Code Review
This pull request introduces a one-second timeout for MMDS hash lookups to prevent the initialization handler from blocking indefinitely. A security issue was identified in the error handling of checkMMDSHash, where a lookup timeout or failure could be misinterpreted as the absence of a security policy, potentially allowing an authentication bypass. It is recommended to update the return values to ensure failed lookups are treated as validation failures.
|
Superseded by #2700 (move envd into a dedicated network namespace), which addresses the root cause. |
|
Closing — the existing 10s http.Client.Timeout on mmdsAccessTokenClient already bounds the worst case. Tightening to 1s blindly risks turning slow-but-successful MMDS calls (VMM contention, customer egress queue) into fast-but-failing ones, with no observed problem from the 10s ceiling. Other PRs (#2701 self-heal, #2702 ctx-aware lock, #2687 stable Timestamp, #2688 cgroup freeze) collectively cover the realistic failure modes. |
|
Closing — risk-asymmetric. If MMDS is legitimately slow due to in-guest VMM thread contention or other deferred response, tightening to a short ctx deadline would cause spurious /init failures (orchestrator retries hit the same MMDS timeout repeatedly inside the 40s envelope) where the existing 10s http.Client.Timeout would have allowed one slow call to eventually succeed. The 10s ceiling is the right backstop when slowness comes from inside the guest, where we can't bound it from the network layer. If we observe MMDS legitimately hanging forever (not just slow), revisit. |
Bound MMDS hash lookup in
/initwith a 1 s context so a stuck MMDS thread cannot holdinitLock.