fix(sandbox): fail-fast on stalled HTTP clones to escape NAT blackholes #3318
Merged
Set `http.lowSpeedLimit=1000` / `http.lowSpeedTime=30` in the system git config so libcurl aborts streams that drop below ~1 KB/s for 30 s, and extend `TRANSIENT_ERRORS` so the existing 3-attempt retry loop in `clone.ts` catches the resulting "Operation too slow" / "RPC failed" / "transfer closed with" outputs.

Without this, an in-flight clone whose egress path is severed mid-stream (e.g. fck-nat ASG instance refresh, PMTUD blackhole) waits on the kernel TCP keepalive (~2 h default) and visibly hangs at "Receiving objects".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pedrofrxncx
approved these changes
May 8, 2026
Carries the `http.lowSpeedLimit` / `http.lowSpeedTime` + `clone.ts` retry-string update so the next sandbox image build picks up the fail-fast fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- Set `http.lowSpeedLimit=1000` / `http.lowSpeedTime=30` in the system git config baked into the sandbox image, so libcurl aborts any clone stream whose throughput drops below ~1 KB/s for 30 s.
- Extend `TRANSIENT_ERRORS` in `packages/sandbox/daemon/setup/clone.ts` with the libcurl/git outputs the new threshold produces (`"Operation too slow"`, `"transfer closed with"`, `"RPC failed"`, `"the remote end hung up"`), so the existing 3-attempt retry loop actually fires on these failures.

Why
Sandbox pods clone user repositories over HTTPS through fck-nat instances on `eks-decocms` (sa-east-1, three single-instance ASGs, one per AZ). When a NAT instance is replaced (e.g. ASG instance refresh) or otherwise drops in-flight packets, the clone's TCP stream is silently blackholed: the kernel waits on the keepalive default (~2 h) before raising any error, so the user-visible symptom is an indefinite hang at "Receiving objects" with no diagnostic.

Today's investigation against the prod cluster confirmed all three fck-nat instances were replaced via instance refresh between `2026-05-08T17:13Z` and `17:18Z`, which lines up with the reported intermittent failures.

This change converts that silent hang into a fail-fast error within ~30 s. Combined with the retry loop in `clone.ts`, most affected clones will recover transparently. The structural NAT-layer fixes (HA per AZ / Managed NAT Gateway) are tracked separately and require infra/SRE work.

Test plan
- `git config --system --get http.lowSpeedLimit` returns `1000` and `http.lowSpeedTime` returns `30` inside the container.
- `git clone --depth 1` of a small repo: completes unchanged.
- Simulate a blackhole (e.g. `iptables -A OUTPUT -p tcp --sport <github-conn> -j DROP` mid-clone) and verify the clone aborts within ~30 s with `Operation too slow` / `RPC failed`, then the daemon retries and succeeds.
- `studio-sandbox` chart bump in stg; monitor `Receiving objects` failure rate.

🤖 Generated with Claude Code
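For readers without the diff at hand, the interaction between the new error strings and the 3-attempt retry loop might look roughly like the sketch below. This is an illustration only, not the contents of `clone.ts`: the `AttemptResult` shape and the `isTransientError` / `cloneWithRetry` names are hypothetical, and only the four marker strings and the attempt count come from the PR description. The loop is synchronous for brevity, whereas the daemon presumably awaits a child process.

```typescript
// Hypothetical sketch: the real TRANSIENT_ERRORS list and retry loop live in
// packages/sandbox/daemon/setup/clone.ts and may be shaped differently.

// Substrings named in the PR summary that git/libcurl emit when a stream stalls.
const TRANSIENT_ERRORS: readonly string[] = [
  "Operation too slow",
  "transfer closed with",
  "RPC failed",
  "the remote end hung up",
];

// Outcome of a single clone attempt (assumed shape, for illustration only).
interface AttemptResult {
  ok: boolean;
  stderr: string;
}

// True when stderr contains any of the known transient markers.
function isTransientError(stderr: string): boolean {
  return TRANSIENT_ERRORS.some((marker) => stderr.includes(marker));
}

// Runs `attempt` up to `maxAttempts` times, retrying only when the failure
// output matches a transient marker. Non-transient failures return at once.
function cloneWithRetry(
  attempt: () => AttemptResult,
  maxAttempts = 3,
): { result: AttemptResult; attempts: number } {
  let result = attempt();
  let attempts = 1;
  while (!result.ok && isTransientError(result.stderr) && attempts < maxAttempts) {
    result = attempt();
    attempts += 1;
  }
  return { result, attempts };
}
```

Under these assumptions, a stalled clone that libcurl aborts after 30 s is retried, while a failure whose output matches none of the markers (an authentication error, say) surfaces immediately instead of consuming the remaining attempts.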
Summary by cubic
Make sandbox HTTP git clones fail fast by setting image-level low-speed thresholds (`http.lowSpeedLimit=1000`, `http.lowSpeedTime=30`) and treating the resulting curl/git messages as transient so the 3-retry loop kicks in. Bumped `@decocms/sandbox` to 0.4.6 to ship the change; hangs at "Receiving objects" during NAT blackholes now fail in ~30 s and auto-retry.

Written for commit c2a7708. Summary will update on new commits.