Commit d32806b

Retry transient full-scan upload failures (502/503/504/408, dropped connections)
A full-scan upload can fail transiently at the gateway/connection level - an HTTP 502/503/504/408, a dropped or reset connection, or a client-side timeout - without the server having created the scan. The CLI previously made exactly one attempt, so an entire run (including a completed reachability analysis) died on a single transient failure even though a retried upload almost always succeeds.

create_full_scan now retries the fullscans POST up to 3 total attempts with increasing waits (~10s, then ~30s, plus jitter) on transient failures only: APIBadGateway (502), APIConnectionError, APITimeout, and catch-all APIFailure whose embedded original_status_code is 408/503/504. Dedicated 4xx classes, catch-all 400s, and error payloads are never retried.

In these failure modes the server never finished reading the request body, so no scan was created and a retry does not duplicate one; in the rare case where a gateway timeout races a request the server later completes, the extra scan is benign and superseded by the retry (as if the CLI had run twice).

The retry loop lives inside the existing try/finally so the brotli-compressed .socket.facts.json.br temp files survive until every attempt has finished; fullscans.post rebuilds its lazy file loaders from the plain paths on every call, so re-invoking it per attempt is safe.

Assisted-by: Claude Code:claude-opus-4-8
1 parent f5ae591 commit d32806b

6 files changed

Lines changed: 361 additions & 5 deletions

CHANGELOG.md

Lines changed: 16 additions & 0 deletions
@@ -1,5 +1,21 @@
 # Changelog
 
+## 2.4.8
+
+### Fixed: retry transient full-scan upload failures
+
+- The full-scan upload (`POST /orgs/<org>/full-scans`) now retries transient
+  gateway/connection failures — HTTP 502/503/504/408, dropped or reset connections, and
+  request timeouts — up to 3 total attempts with increasing waits (~10s, then ~30s, plus
+  jitter). Such failures are intermittent and a retried upload almost always succeeds.
+  In these failure modes the server never finished reading the request body, so no scan
+  was created and a retry does not duplicate one; in the rare case where a gateway
+  timeout races a request the server later
+  completes, the extra scan is benign and superseded by the retried one (as if the CLI had
+  run twice).
+  Non-transient errors (400/401/403/404/429 and error payloads) are never retried. Each
+  retry logs a warning explaining what failed and when the next attempt happens.
+
 ## 2.4.7
 
 ### Changed: pin @coana-tech/cli version; auto-update is now opt-in
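
The wait schedule in the 2.4.8 entry can be sketched as below. This is a minimal illustration, assuming constants that mirror the values named in this commit; `wait_before_retry` is a hypothetical helper written for this example and is not part of the CLI.

```python
import random
from typing import Optional

# Values taken from the changelog entry; the names here are illustrative.
MAX_ATTEMPTS = 3
BACKOFF_SCHEDULE_SECONDS = (10.0, 30.0)
JITTER_SECONDS = 2.0


def wait_before_retry(failed_attempt: int, rng: Optional[random.Random] = None) -> float:
    """Seconds to wait after the 1-based `failed_attempt` before trying again."""
    rng = rng or random.Random()
    # Clamp to the last schedule entry so a longer attempt budget never indexes out of range.
    index = min(failed_attempt, len(BACKOFF_SCHEDULE_SECONDS)) - 1
    return BACKOFF_SCHEDULE_SECONDS[index] + rng.uniform(0, JITTER_SECONDS)


# Only attempts 1 and 2 are followed by a wait; attempt 3 is the last one.
for failed_attempt in range(1, MAX_ATTEMPTS):
    print(f"after attempt {failed_attempt}: wait {wait_before_retry(failed_attempt):.1f}s")
```

With these values the worst case adds roughly 10 to 12 seconds plus 30 to 32 seconds of waiting to a run, on top of the time the three upload attempts themselves take.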

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@ build-backend = "hatchling.build"
 
 [project]
 name = "socketsecurity"
-version = "2.4.7"
+version = "2.4.8"
 requires-python = ">= 3.11"
 license = {"file" = "LICENSE"}
 dependencies = [

socketsecurity/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1,3 +1,3 @@
 __author__ = 'socket.dev'
-__version__ = '2.4.7'
+__version__ = '2.4.8'
 USER_AGENT = f'SocketPythonCLI/{__version__}'

socketsecurity/core/__init__.py

Lines changed: 77 additions & 2 deletions
@@ -1,5 +1,6 @@
 import logging
 import os
+import random
 import re
 import sys
 import tarfile
@@ -13,7 +14,12 @@
 if TYPE_CHECKING:
     from socketsecurity.config import CliConfig
 from socketdev import socketdev
-from socketdev.exceptions import APIFailure
+from socketdev.exceptions import (
+    APIBadGateway,
+    APIConnectionError,
+    APIFailure,
+    APITimeout,
+)
 from socketdev.fullscans import FullScanParams, SocketArtifact
 from socketdev.org import Organization
 from socketdev.repos import RepositoryInfo
@@ -76,6 +82,48 @@
 TIER1_FINALIZE_MAX_ATTEMPTS = 3
 TIER1_FINALIZE_BACKOFF_SECONDS = 1.0
 
+# Full scan upload retry policy. An upload can fail transiently at the gateway/connection
+# level (an HTTP 502/503/504/408, a dropped or reset connection, or a client-side timeout)
+# without the server having created the scan; a retried upload almost always succeeds.
+# In these failure modes no scan was created, so a retry does not duplicate one. (A
+# duplicate is possible only if a gateway timeout races a request the server later
+# completes; that is benign - the retried scan supersedes the orphaned one, same as
+# running the CLI twice.)
+FULL_SCAN_UPLOAD_MAX_ATTEMPTS = 3
+# Wait before retry attempt 2 and attempt 3 respectively (plus a little jitter so a fleet of
+# CI jobs hitting the same failure doesn't retry in lock-step).
+FULL_SCAN_UPLOAD_BACKOFF_SCHEDULE_SECONDS = (10.0, 30.0)
+FULL_SCAN_UPLOAD_BACKOFF_JITTER_SECONDS = 2.0
+# Transient gateway/timeout HTTP statuses that the SDK does NOT raise as a dedicated
+# exception class (502 has APIBadGateway; 408/503/504 surface as the catch-all APIFailure
+# with the status only present in the message text - see _is_transient_full_scan_upload_error).
+FULL_SCAN_UPLOAD_RETRYABLE_STATUS_CODES = frozenset({408, 503, 504})
+# Matches the status code the SDK embeds in catch-all APIFailure messages
+# (socketdev/core/api.py: "Bad Request: HTTP original_status_code:<code> ...").
+_API_FAILURE_STATUS_CODE_RE = re.compile(r"original_status_code:(\d{3})")
+
+
+def _is_transient_full_scan_upload_error(error: Exception) -> bool:
+    """Whether a full-scan upload failure is transient and safe to retry.
+
+    Transient means the failure happened at the gateway/connection level, normally before the
+    server finished reading the request body (so no scan was created server-side): HTTP
+    502/503/504/408, client-side timeouts, and dropped/reset connections. 4xx client errors
+    (400/401/403/404/429) and success responses carrying an error payload are never retried.
+    """
+    if isinstance(error, (APIBadGateway, APIConnectionError, APITimeout)):
+        # 502 / connection reset-dropped / request timeout - the SDK raises dedicated classes.
+        return True
+    if type(error) is APIFailure:
+        # The SDK raises 408/503/504 (and every other status without a dedicated class,
+        # including 400) as the catch-all APIFailure, so match on the exact class plus the
+        # status code embedded in the message. Subclasses (APIAccessDenied, APIResourceNotFound,
+        # APIInsufficientQuota, ...) are deliberately excluded - those are never transient.
+        match = _API_FAILURE_STATUS_CODE_RE.search(str(error))
+        if match:
+            return int(match.group(1)) in FULL_SCAN_UPLOAD_RETRYABLE_STATUS_CODES
+    return False
+
 
 def _humanize_alert_type(alert_type: str) -> str:
     """Convert a camelCase/PascalCase alert type into a Title-Cased label.
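
The exact-class check in the hunk above is easy to misread, so here is a self-contained sketch of the same classification idea. The exception classes below are local stand-ins defined only for this example (the real ones live in `socketdev.exceptions`), and `is_transient` is a hypothetical name; the message format is the one quoted in the commit's own comment.

```python
import re


# Local stand-ins for the SDK hierarchy; assumed shape: dedicated classes subclass APIFailure.
class APIFailure(Exception):
    pass


class APIBadGateway(APIFailure):
    pass


class APIAccessDenied(APIFailure):
    pass


RETRYABLE_STATUS_CODES = frozenset({408, 503, 504})
STATUS_CODE_RE = re.compile(r"original_status_code:(\d{3})")


def is_transient(error: Exception) -> bool:
    """Classify an upload error as retryable or not."""
    if isinstance(error, APIBadGateway):
        # Dedicated transient class: retry regardless of the message text.
        return True
    if type(error) is APIFailure:
        # Exact-class match: subclasses such as APIAccessDenied must not fall through here.
        match = STATUS_CODE_RE.search(str(error))
        return bool(match) and int(match.group(1)) in RETRYABLE_STATUS_CODES
    return False


print(is_transient(APIFailure("Bad Request: HTTP original_status_code:503")))  # retryable
print(is_transient(APIFailure("Bad Request: HTTP original_status_code:400")))  # not retryable
print(is_transient(APIAccessDenied("original_status_code:503")))  # subclass, not retryable
```

Using `type(error) is APIFailure` rather than `isinstance` is the design choice that keeps a dedicated 4xx subclass from being retried even if its message happens to contain a retryable status code.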
@@ -787,7 +835,34 @@ def create_full_scan(self, files: List[str], params: FullScanParams, base_paths:
         # facts file under the per-file upload size cap. See _compress_facts_files_for_upload.
         upload_files, compressed_temp_files = self._compress_facts_files_for_upload(files)
         try:
-            res = self.sdk.fullscans.post(upload_files, params, use_types=True, use_lazy_loading=True, max_open_files=50, base_paths=base_paths)
+            # Retry transient gateway/timeout failures (502/503/504/408, dropped connections,
+            # timeouts) with increasing waits; such failures are intermittent and a retried
+            # upload almost always succeeds. In these failure modes the server never finished
+            # reading the request body, so no scan was created and a retry does not duplicate
+            # one (see the retry-policy comment above FULL_SCAN_UPLOAD_MAX_ATTEMPTS). fullscans.post()
+            # rebuilds its lazy file loaders from the plain paths in upload_files on every call,
+            # so simply calling it again per attempt is safe. The loop must stay inside this try
+            # so the temp .br files (cleaned up in the finally below) outlive every attempt.
+            for attempt in range(1, FULL_SCAN_UPLOAD_MAX_ATTEMPTS + 1):
+                try:
+                    res = self.sdk.fullscans.post(upload_files, params, use_types=True, use_lazy_loading=True, max_open_files=50, base_paths=base_paths)
+                    break
+                except APIFailure as error:
+                    if attempt >= FULL_SCAN_UPLOAD_MAX_ATTEMPTS or not _is_transient_full_scan_upload_error(error):
+                        raise
+                    backoff_index = min(attempt, len(FULL_SCAN_UPLOAD_BACKOFF_SCHEDULE_SECONDS)) - 1
+                    wait_seconds = FULL_SCAN_UPLOAD_BACKOFF_SCHEDULE_SECONDS[backoff_index] + random.uniform(
+                        0, FULL_SCAN_UPLOAD_BACKOFF_JITTER_SECONDS
+                    )
+                    # SDK error messages can span many lines (path + response headers); the
+                    # first line carries the status, which is all the warning needs.
+                    error_summary = str(error).strip().splitlines()[0] if str(error).strip() else ""
+                    log.warning(
+                        f"Full scan upload failed with {type(error).__name__}({error_summary}), "
+                        f"retrying in {wait_seconds:.0f}s "
+                        f"(attempt {attempt + 1}/{FULL_SCAN_UPLOAD_MAX_ATTEMPTS})"
+                    )
+                    time.sleep(wait_seconds)
         finally:
             for temp_file in compressed_temp_files:
                 try:
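
Why the loop has to sit inside the outer `try` can be shown with a small stand-alone sketch. Everything here is hypothetical scaffolding for illustration: `TransientError`, `upload_with_retry`, and `flaky_upload` are not names from the CLI or the SDK, and the sleep function is injected so the example runs instantly.

```python
import os
import tempfile
import time


class TransientError(Exception):
    """Stand-in for a retryable upload failure."""


def upload_with_retry(upload, temp_paths, waits=(10.0, 30.0), sleep=time.sleep):
    """Call `upload` up to len(waits) + 1 times, then clean up the temp files."""
    max_attempts = len(waits) + 1
    try:
        for attempt in range(1, max_attempts + 1):
            try:
                return upload()
            except TransientError:
                if attempt >= max_attempts:
                    raise
                sleep(waits[attempt - 1])
    finally:
        # Runs once, after the last attempt, so every attempt still sees the temp files.
        for path in temp_paths:
            try:
                os.remove(path)
            except OSError:
                pass


# Demo: an upload that fails twice and records whether its temp file still exists.
fd, temp_path = tempfile.mkstemp(suffix=".br")
os.close(fd)
calls = []
slept = []


def flaky_upload():
    calls.append(os.path.exists(temp_path))
    if len(calls) < 3:
        raise TransientError("simulated 503")
    return "scan-created"


result = upload_with_retry(flaky_upload, [temp_path], sleep=slept.append)
```

If the loop wrapped the whole `try`/`finally` instead, the cleanup would run after the first failed attempt and later attempts would find their compressed files gone.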
