fix(orch): keep PostInit Timestamp stable across retries to break envd livelock#2687
fix(orch): keep PostInit Timestamp stable across retries to break envd livelock#2687ValentaTomas wants to merge 3 commits into
Conversation
…d livelock doRequestWithInfiniteRetries previously set jsonBody.Timestamp = time.Now() on every iteration of the retry loop. envd's PostInit guards SetData with lastSetTime.SetToGreater(initRequest.Timestamp.UnixNano()), which is meant to skip work for an out-of-order or replayed request — but with a fresh timestamp per retry the guard always observed a newer value and ran SetData end-to-end every time. When the orchestrator's 50 ms EnvdInitRequestTimeout fires while envd is still serving the cold first request, envd queues the next retry on initLock and the next one and the next one. Each one then runs SetData against a connection the orchestrator already gave up on, writes 204 to a dead socket, and the orchestrator never observes a success — the livelock described in the bitfrost / Mode-B investigation. Pin Timestamp once before the loop. On the cold path envd's idempotency guard now actually fires for replays, the queue drains by skipping the duplicates, and the next live retry's response reaches the orchestrator. Also marshal the body once instead of on every retry (it doesn't change).
PR SummaryMedium Risk Overview Reusing a single marshaled request body across retries means any per-attempt fields (like deadlines) are no longer reflected in the payload; combined with the pinned timestamp this could also trigger unintended system time adjustments if the first attempt is significantly delayed. Reviewed by Cursor Bugbot for commit df5007c. Bugbot is set up for automated code reviews on this repo. Configure here. |
❌ 8 Tests Failed:
View the top 1 failed test(s) by shortest run time
View the full list of 21 ❄️ flaky test(s)
To view more test analytics, go to the Test Analytics Dashboard |
There was a problem hiding this comment.
Code Review
Pinning the timestamp before the retry loop causes the sandbox system clock to be set to a stale value if initialization takes significant time, leading to clock skew that affects TLS verification and log ordering. If a request fails after updating the idempotency guard but before completing initialization, subsequent retries with the same pinned timestamp will skip initialization logic and return success, leaving the sandbox in an inconsistent state.
With the orchestrator now pinning Timestamp once across retries, a SetData failure would otherwise leave lastSetTime updated and cause subsequent retries with the same timestamp to skip the (still-needed) SetData and return 204 success. Move the SetToGreater after the success check.
|
Superseded by #2700 (move envd into a dedicated network namespace), which addresses the root cause. |
|
Close this to prevent risking outdated time. |
Pin
Timestamponce before the retry loop so envdlastSetTime.SetToGreatercan skip replays; marshal body once.