fix(sandbox): fail-fast on stalled HTTP clones to escape NAT blackholes #3318

Merged
tlgimenes merged 2 commits into main from tlgimenes/git-clone-hang-analysis
May 8, 2026

@tlgimenes tlgimenes commented May 8, 2026

Summary

  • Set http.lowSpeedLimit=1000 / http.lowSpeedTime=30 in the system git config baked into the sandbox image, so libcurl aborts any clone stream whose throughput drops below ~1 KB/s for 30 s.
  • Extend TRANSIENT_ERRORS in packages/sandbox/daemon/setup/clone.ts with the libcurl/git outputs the new threshold produces ("Operation too slow", "transfer closed with", "RPC failed", "the remote end hung up"), so the existing 3-attempt retry loop actually fires on these failures.
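
The retry behaviour described in the second bullet can be sketched roughly as follows. This is a minimal illustration, not the contents of clone.ts: only the four substrings come from this PR, while `isTransient`, `cloneWithRetry`, `MAX_ATTEMPTS`, and the `runClone` callback are hypothetical names, and the real TRANSIENT_ERRORS list presumably holds other entries as well.

```typescript
// Hypothetical sketch: names and structure are illustrative and are NOT
// copied from packages/sandbox/daemon/setup/clone.ts.

// The four substrings this PR adds; the real list likely has more entries.
const TRANSIENT_ERRORS: readonly string[] = [
  "Operation too slow",
  "transfer closed with",
  "RPC failed",
  "the remote end hung up",
];

// Assumed to match the "3-attempt retry loop" mentioned in the summary.
const MAX_ATTEMPTS = 3;

// True when git/curl output contains any known transient marker.
function isTransient(output: string): boolean {
  return TRANSIENT_ERRORS.some((needle) => output.includes(needle));
}

// `runClone` stands in for whatever spawns `git clone`; it is assumed to
// reject with an Error whose message carries git's stderr.
// Resolves with the attempt number that succeeded.
async function cloneWithRetry(runClone: () => Promise<void>): Promise<number> {
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    try {
      await runClone();
      return attempt;
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      // Give up on non-transient failures or once attempts are exhausted.
      if (!isTransient(message) || attempt === MAX_ATTEMPTS) throw err;
    }
  }
  throw new Error("unreachable: loop always returns or throws");
}
```

The point of the sketch is that the low-speed abort only helps if its error text is classified as transient; otherwise the first stalled attempt would surface as a hard failure instead of triggering another attempt.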

Why

Sandbox pods clone user repositories over HTTPS through fck-nat instances on eks-decocms (sa-east-1, three single-instance ASGs, one per AZ). When a NAT instance is replaced (e.g. ASG instance refresh) or otherwise drops in-flight packets, the clone's TCP stream is silently blackholed: the kernel waits on the keepalive default (~2 h) before raising any error, so the user-visible symptom is an indefinite hang at "Receiving objects" with no diagnostic.

Today's investigation against the prod cluster confirmed all three fck-nat instances were replaced via instance refresh between 2026-05-08T17:13Z and 17:18Z, which lines up with the reported intermittent failures.

This change converts that silent hang into a fail-fast error within ~30 s. Combined with the retry loop in clone.ts, most affected clones will recover transparently. The structural NAT-layer fixes (HA per AZ / Managed NAT Gateway) are tracked separately and require infra/SRE work.
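
To make the ~30 s bound concrete, here is a simplified model of a low-speed abort rule. It is an assumption-laden sketch, not libcurl's implementation: `secondsUntilAbort` is a hypothetical helper that looks at one throughput sample per second and reports when a run of consecutive below-limit seconds reaches the configured window.

```typescript
// Simplified model of a low-speed abort rule. This is NOT libcurl's code;
// it only illustrates how http.lowSpeedLimit=1000 and http.lowSpeedTime=30
// bound the duration of a stall.
function secondsUntilAbort(
  bytesPerSecond: readonly number[], // one throughput sample per second
  limit = 1000, // bytes/s, mirrors http.lowSpeedLimit
  windowSeconds = 30, // seconds, mirrors http.lowSpeedTime
): number | null {
  let slowRun = 0; // consecutive seconds spent below the limit
  let elapsed = 0; // total seconds observed so far
  for (const rate of bytesPerSecond) {
    elapsed += 1;
    slowRun = rate < limit ? slowRun + 1 : 0;
    if (slowRun >= windowSeconds) return elapsed;
  }
  return null; // the transfer never stayed slow long enough to abort
}

// Example: 10 s of healthy transfer, then a blackholed stream.
const samples = [...Array(10).fill(50_000), ...Array(60).fill(0)];
const abortedAt = secondsUntilAbort(samples); // 40: 10 healthy + 30 stalled
```

Under this model a blackholed stream is cut off 30 s after data stops flowing, while a brief dip that recovers before the window elapses resets the counter and never aborts.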

Test plan

  • Build the sandbox image locally and confirm git config --system --get http.lowSpeedLimit returns 1000 and http.lowSpeedTime returns 30 inside the container.
  • Smoke-test a normal git clone --depth 1 of a small repo — completes unchanged.
  • Simulate a stalled stream (iptables -A OUTPUT -p tcp --sport <github-conn> -j DROP mid-clone) and verify the clone aborts within ~30 s with Operation too slow / RPC failed, then the daemon retries and succeeds.
  • Roll the new image out through a studio-sandbox chart bump in stg and monitor the "Receiving objects" failure rate.
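
The first check in the test plan could be automated along these lines. This is a sketch under assumptions: `parseGitConfigList` and `missingLowSpeedSettings` are hypothetical helpers, and the input is assumed to be the `key=value` text printed by `git config --system --list` inside the container. Keys are lowercased before comparison so the check does not depend on how the output spells them.

```typescript
// Hypothetical helpers for checking the baked-in system git config.
// Input is assumed to be the output of `git config --system --list`.

// Settings this PR expects, with keys lowercased for comparison.
const EXPECTED: ReadonlyArray<readonly [string, string]> = [
  ["http.lowspeedlimit", "1000"],
  ["http.lowspeedtime", "30"],
];

// Parse `key=value` lines into a map keyed by the lowercased key.
function parseGitConfigList(output: string): Map<string, string> {
  const entries = new Map<string, string>();
  for (const line of output.split("\n")) {
    const idx = line.indexOf("=");
    if (idx <= 0) continue; // skip blank lines and lines without a key
    const key = line.slice(0, idx).trim().toLowerCase();
    entries.set(key, line.slice(idx + 1).trim());
  }
  return entries;
}

// Return the expected keys that are absent or carry a different value.
function missingLowSpeedSettings(output: string): string[] {
  const actual = parseGitConfigList(output);
  return EXPECTED.filter(([key, value]) => actual.get(key) !== value).map(
    ([key]) => key,
  );
}
```

An empty result means both thresholds are present with the expected values; any returned key names a setting the image build failed to bake in.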

🤖 Generated with Claude Code


Summary by cubic

Make sandbox HTTP git clones fail fast by setting image-level low-speed thresholds (http.lowSpeedLimit=1000, http.lowSpeedTime=30) and treating the resulting curl/git messages as transient so the existing 3-attempt retry loop kicks in. Bumped @decocms/sandbox to 0.4.6 to ship the change; hangs at "Receiving objects" during NAT blackholes now fail in ~30 s and auto-retry.

Written for commit c2a7708. Summary will update on new commits.

Set http.lowSpeedLimit=1000 / http.lowSpeedTime=30 in the system git
config so libcurl aborts streams that drop below ~1KB/s for 30s, and
extend TRANSIENT_ERRORS so the existing 3-attempt retry loop in clone.ts
catches the resulting "Operation too slow" / "RPC failed" / "transfer
closed with" outputs.

Without this, an in-flight clone whose egress path is severed mid-stream
(e.g. fck-nat ASG instance refresh, PMTUD blackhole) waits on the kernel
TCP keepalive (~2h default) and visibly hangs at "Receiving objects".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions Bot commented May 8, 2026

🧪 Benchmark

Should we run the Virtual MCP strategy benchmark for this PR?

React with 👍 to run the benchmark.

| Reaction | Action |
| --- | --- |
| 👍 | Run quick benchmark (10 & 128 tools) |

Benchmark will run on the next push after you react.


github-actions Bot commented May 8, 2026

Release Options

Suggested: Patch (2.311.2) — based on fix: prefix

React with an emoji to override the release type:

| Reaction | Type | Next Version |
| --- | --- | --- |
| 👍 | Prerelease | 2.311.2-alpha.1 |
| 🎉 | Patch | 2.311.2 |
| ❤️ | Minor | 2.312.0 |
| 🚀 | Major | 3.0.0 |

Current version: 2.311.1

Note: If multiple reactions exist, the smallest bump wins. If no reactions, the suggested bump is used (default: patch).

Carries the http.lowSpeedLimit/lowSpeedTime + clone.ts retry-string
update so the next sandbox image build picks up the fail-fast fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@tlgimenes tlgimenes merged commit 9975850 into main May 8, 2026
16 checks passed
@tlgimenes tlgimenes deleted the tlgimenes/git-clone-hang-analysis branch May 8, 2026 19:09