Commit 320dcb8

Retry transient full-scan upload failures (502/503/504/408, dropped connections)
Production api-v0 backends occasionally stop reading an upload mid-request (pod event-loop stalls during load episodes, e.g. the 2026-06-08 'Degraded API Performance' incident): the connection freezes mid-body, Cloudflare returns a 502 to the client after ~30s, and the backend later 408s the half-read request. The CLI previously made exactly one attempt, so an entire tier-1 reachability run died on a single transient failure even though a retried upload routed to a healthy backend almost always succeeds.

create_full_scan now retries the fullscans POST up to 3 total attempts with increasing waits (~10s, then ~30s, plus jitter) on transient failures only: APIBadGateway (502), APIConnectionError, APITimeout, and catch-all APIFailure whose embedded original_status_code is 408/503/504. Dedicated 4xx classes, catch-all 400s, and error payloads are never retried. Because the server never finished reading the request body, no scan was created server-side, so a retry cannot duplicate a scan.

The retry loop lives inside the existing try/finally so the brotli-compressed .socket.facts.json.br temp files survive until every attempt has finished; fullscans.post rebuilds its lazy file loaders from the plain paths on every call, so re-invoking it per attempt is safe.

Assisted-by: Claude Code:claude-opus-4-8
1 parent f5ae591 commit 320dcb8
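
The transient/non-transient split described in the commit message can be sketched in isolation. This is a minimal illustration, not the committed code: the exception classes below are hypothetical stand-ins defined locally so the snippet runs without the SDK installed, and the assumption that the dedicated classes subclass `APIFailure` is taken from the diff's `except APIFailure` clause rather than verified against the SDK.

```python
import re


# Hypothetical stand-ins for the SDK exception classes (assumed hierarchy).
class APIFailure(Exception):
    pass


class APIBadGateway(APIFailure):
    pass


class APIConnectionError(APIFailure):
    pass


class APITimeout(APIFailure):
    pass


class APIAccessDenied(APIFailure):
    pass


RETRYABLE_STATUS_CODES = frozenset({408, 503, 504})
STATUS_CODE_RE = re.compile(r"original_status_code:(\d{3})")


def is_transient(error: Exception) -> bool:
    """Return True only for failures that are safe to retry."""
    # Dedicated classes: 502, dropped/reset connection, request timeout.
    if isinstance(error, (APIBadGateway, APIConnectionError, APITimeout)):
        return True
    # Exact-class check: subclasses such as APIAccessDenied must not match,
    # even if their message happens to contain a retryable status code.
    if type(error) is APIFailure:
        match = STATUS_CODE_RE.search(str(error))
        if match:
            return int(match.group(1)) in RETRYABLE_STATUS_CODES
    return False
```

Using `type(error) is APIFailure` instead of `isinstance` is the key design choice: it keeps the message-based status check from ever applying to a dedicated 4xx subclass.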

6 files changed

Lines changed: 356 additions & 5 deletions

CHANGELOG.md

Lines changed: 14 additions & 0 deletions
@@ -1,5 +1,19 @@
 # Changelog
 
+## 2.4.8
+
+### Fixed: retry transient full-scan upload failures
+
+- The full-scan upload (`POST /orgs/<org>/full-scans`) now retries transient
+  gateway/connection failures — HTTP 502/503/504/408, dropped or reset connections, and
+  request timeouts — up to 3 total attempts with increasing waits (~10s, then ~30s, plus
+  jitter). Production gateways occasionally drop an upload mid-request when a backend pod
+  stalls and stops reading the body (the client sees a 502 after ~30s); these episodes are
+  transient and a retried upload almost always succeeds. Since the server never finished
+  reading the request body, no scan was created, so retrying cannot duplicate a scan.
+  Non-transient errors (400/401/403/404/429 and error payloads) are never retried. Each
+  retry logs a warning explaining what failed and when the next attempt happens.
+
 ## 2.4.7
 
 ### Changed: pin @coana-tech/cli version; auto-update is now opt-in
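
The "increasing waits (~10s, then ~30s, plus jitter)" wording in the changelog entry corresponds to a fixed schedule indexed by the attempt that just failed. A minimal sketch of that arithmetic follows; the function name `wait_before_next_attempt` and the injected `rng` parameter are illustrative assumptions, while the constants mirror the values stated in this commit.

```python
import random

BACKOFF_SCHEDULE_SECONDS = (10.0, 30.0)
BACKOFF_JITTER_SECONDS = 2.0


def wait_before_next_attempt(failed_attempt: int, rng: random.Random) -> float:
    """Seconds to wait after the 1-based attempt `failed_attempt` failed."""
    # Clamp to the last schedule entry so a longer attempt budget never
    # indexes past the end of the schedule.
    index = min(failed_attempt, len(BACKOFF_SCHEDULE_SECONDS)) - 1
    # Uniform jitter keeps many CI jobs from retrying at the same instant.
    return BACKOFF_SCHEDULE_SECONDS[index] + rng.uniform(0, BACKOFF_JITTER_SECONDS)


if __name__ == "__main__":
    rng = random.Random()
    for failed_attempt in (1, 2):
        wait = wait_before_next_attempt(failed_attempt, rng)
        print(f"attempt {failed_attempt} failed, waiting {wait:.1f}s")
```

With three total attempts only the first two failures ever wait, so the worst-case added delay is roughly 40 to 44 seconds before the final error is raised.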

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@ build-backend = "hatchling.build"
 
 [project]
 name = "socketsecurity"
-version = "2.4.7"
+version = "2.4.8"
 requires-python = ">= 3.11"
 license = {"file" = "LICENSE"}
 dependencies = [

socketsecurity/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1,3 +1,3 @@
 __author__ = 'socket.dev'
-__version__ = '2.4.7'
+__version__ = '2.4.8'
 USER_AGENT = f'SocketPythonCLI/{__version__}'

socketsecurity/core/__init__.py

Lines changed: 74 additions & 2 deletions
@@ -1,5 +1,6 @@
 import logging
 import os
+import random
 import re
 import sys
 import tarfile
@@ -13,7 +14,12 @@
 if TYPE_CHECKING:
     from socketsecurity.config import CliConfig
 from socketdev import socketdev
-from socketdev.exceptions import APIFailure
+from socketdev.exceptions import (
+    APIBadGateway,
+    APIConnectionError,
+    APIFailure,
+    APITimeout,
+)
 from socketdev.fullscans import FullScanParams, SocketArtifact
 from socketdev.org import Organization
 from socketdev.repos import RepositoryInfo
@@ -76,6 +82,46 @@
 TIER1_FINALIZE_MAX_ATTEMPTS = 3
 TIER1_FINALIZE_BACKOFF_SECONDS = 1.0
 
+# Full scan upload retry policy. Production gateways occasionally drop an upload mid-request
+# (a backend pod stalls and stops reading the body; the client then sees a 502/408 or a reset
+# connection). Those episodes are transient and pod-local: a retried upload routed to another
+# backend almost always succeeds, and because the server never finished reading the request
+# body, no scan was created, so retrying cannot duplicate a scan.
+FULL_SCAN_UPLOAD_MAX_ATTEMPTS = 3
+# Wait before retry attempt 2 and attempt 3 respectively (plus a little jitter so a fleet of
+# CI jobs hitting the same episode doesn't retry in lock-step).
+FULL_SCAN_UPLOAD_BACKOFF_SCHEDULE_SECONDS = (10.0, 30.0)
+FULL_SCAN_UPLOAD_BACKOFF_JITTER_SECONDS = 2.0
+# Transient gateway/timeout HTTP statuses that the SDK does NOT raise as a dedicated
+# exception class (502 has APIBadGateway; 408/503/504 surface as the catch-all APIFailure
+# with the status only present in the message text - see _is_transient_full_scan_upload_error).
+FULL_SCAN_UPLOAD_RETRYABLE_STATUS_CODES = frozenset({408, 503, 504})
+# Matches the status code the SDK embeds in catch-all APIFailure messages
+# (socketdev/core/api.py: "Bad Request: HTTP original_status_code:<code> ...").
+_API_FAILURE_STATUS_CODE_RE = re.compile(r"original_status_code:(\d{3})")
+
+
+def _is_transient_full_scan_upload_error(error: Exception) -> bool:
+    """Whether a full-scan upload failure is transient and safe to retry.
+
+    Transient means the failure happened at the gateway/connection level before the server
+    finished reading the request body (so no scan was created server-side): HTTP 502/503/504/408,
+    client-side timeouts, and dropped/reset connections. 4xx client errors (400/401/403/404/429)
+    and success responses carrying an error payload are never retried.
+    """
+    if isinstance(error, (APIBadGateway, APIConnectionError, APITimeout)):
+        # 502 / connection reset-dropped / request timeout - the SDK raises dedicated classes.
+        return True
+    if type(error) is APIFailure:
+        # The SDK raises 408/503/504 (and every other status without a dedicated class,
+        # including 400) as the catch-all APIFailure, so match on the exact class plus the
+        # status code embedded in the message. Subclasses (APIAccessDenied, APIResourceNotFound,
+        # APIInsufficientQuota, ...) are deliberately excluded - those are never transient.
+        match = _API_FAILURE_STATUS_CODE_RE.search(str(error))
+        if match:
+            return int(match.group(1)) in FULL_SCAN_UPLOAD_RETRYABLE_STATUS_CODES
+    return False
+
 
 def _humanize_alert_type(alert_type: str) -> str:
     """Convert a camelCase/PascalCase alert type into a Title-Cased label.
@@ -787,7 +833,33 @@ def create_full_scan(self, files: List[str], params: FullScanParams, base_paths:
         # facts file under the per-file upload size cap. See _compress_facts_files_for_upload.
         upload_files, compressed_temp_files = self._compress_facts_files_for_upload(files)
         try:
-            res = self.sdk.fullscans.post(upload_files, params, use_types=True, use_lazy_loading=True, max_open_files=50, base_paths=base_paths)
+            # Retry transient gateway/timeout failures (502/503/504/408, dropped connections,
+            # timeouts) with increasing waits; a stalled backend pod recovers or gets routed
+            # around within minutes, and since it never finished reading the request body no
+            # scan was created, so a retry cannot duplicate one. fullscans.post() rebuilds its
+            # lazy file loaders from the plain paths in upload_files on every call, so simply
+            # calling it again per attempt is safe. The loop must stay inside this try so the
+            # temp .br files (cleaned up in the finally below) outlive every attempt.
+            for attempt in range(1, FULL_SCAN_UPLOAD_MAX_ATTEMPTS + 1):
+                try:
+                    res = self.sdk.fullscans.post(upload_files, params, use_types=True, use_lazy_loading=True, max_open_files=50, base_paths=base_paths)
+                    break
+                except APIFailure as error:
+                    if attempt >= FULL_SCAN_UPLOAD_MAX_ATTEMPTS or not _is_transient_full_scan_upload_error(error):
+                        raise
+                    backoff_index = min(attempt, len(FULL_SCAN_UPLOAD_BACKOFF_SCHEDULE_SECONDS)) - 1
+                    wait_seconds = FULL_SCAN_UPLOAD_BACKOFF_SCHEDULE_SECONDS[backoff_index] + random.uniform(
+                        0, FULL_SCAN_UPLOAD_BACKOFF_JITTER_SECONDS
+                    )
+                    # SDK error messages can span many lines (path + response headers); the
+                    # first line carries the status, which is all the warning needs.
+                    error_summary = str(error).strip().splitlines()[0] if str(error).strip() else ""
+                    log.warning(
+                        f"Full scan upload failed with {type(error).__name__}({error_summary}), "
+                        f"retrying in {wait_seconds:.0f}s "
+                        f"(attempt {attempt + 1}/{FULL_SCAN_UPLOAD_MAX_ATTEMPTS})"
+                    )
+                    time.sleep(wait_seconds)
         finally:
             for temp_file in compressed_temp_files:
                 try: