docs: soften the no-duplicate-scans claim to cover the 504 race

mtorp · mtorp · commit 5d25c749116b · 2026-06-10T11:07:24.000+02:00
In the investigated failure mode (the backend stops reading the body
mid-upload), no scan is created server-side, so a retry cannot duplicate
one. But a gateway timeout (504) can in principle race a request that
the origin later completes, in which case a retry creates a second scan.
That case is benign: the retried scan supersedes the orphaned one as
pending head, the same outcome as running the CLI twice. Still, the
comments, docstring, and changelog should not claim duplicates are
impossible.
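
For reference, the retry behavior that the touched comments describe can
be sketched roughly as below. This is a minimal illustration under stated
assumptions, not the code in socketsecurity/core/__init__.py: UploadError,
is_transient, upload_with_retry, and the 0-2s jitter range are
hypothetical stand-ins (the real code classifies SDK exceptions such as
APIBadGateway, APIConnectionError, and APITimeout and uses its own
constants).

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

# Hypothetical constants mirroring the documented policy: 3 total attempts,
# waiting ~10s before attempt 2 and ~30s before attempt 3.
MAX_ATTEMPTS = 3
BASE_WAITS = (10.0, 30.0)
TRANSIENT_STATUSES = frozenset({408, 502, 503, 504})


class UploadError(Exception):
    """Hypothetical error; status is None for dropped connections and timeouts."""

    def __init__(self, status: Optional[int]) -> None:
        super().__init__(f"upload failed (status={status})")
        self.status = status


def is_transient(error: Exception) -> bool:
    """Gateway/connection-level failures are retried; 4xx client errors are not."""
    if isinstance(error, UploadError):
        return error.status is None or error.status in TRANSIENT_STATUSES
    return False


def upload_with_retry(
    upload: Callable[[], T],
    sleep: Callable[[float], None] = time.sleep,
    jitter: Callable[[], float] = lambda: random.uniform(0.0, 2.0),  # assumed range
) -> T:
    """Call upload() up to MAX_ATTEMPTS times, retrying only transient failures."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return upload()
        except Exception as error:
            # Give up on the last attempt or on any non-transient failure.
            if attempt == MAX_ATTEMPTS or not is_transient(error):
                raise
            # A 504 may race a request the origin later completes, so this
            # retry can create a second scan; that duplicate is benign.
            sleep(BASE_WAITS[attempt - 1] + jitter())
    raise AssertionError("unreachable: the loop returns or raises")
```

Injecting sleep and jitter keeps the sketch testable without real waits;
nothing here tries to detect or deduplicate the raced 504 case, matching
the commit's position that the extra scan is harmless.
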
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -9,8 +9,11 @@
   request timeouts — up to 3 total attempts with increasing waits (~10s, then ~30s, plus
   jitter). Production gateways occasionally drop an upload mid-request when a backend pod
   stalls and stops reading the body (the client sees a 502 after ~30s); these episodes are
-  transient and a retried upload almost always succeeds. Since the server never finished
-  reading the request body, no scan was created, so retrying cannot duplicate a scan.
+  transient and a retried upload almost always succeeds. In this failure mode the server
+  never finished reading the request body, so no scan was created and a retry does not
+  duplicate one; in the rare case where a gateway timeout races a request the server later
+  completes, the extra scan is benign and superseded by the retried one (as if the CLI had
+  run twice).
   Non-transient errors (400/401/403/404/429 and error payloads) are never retried. Each
   retry logs a warning explaining what failed and when the next attempt happens.
 
diff --git a/socketsecurity/core/__init__.py b/socketsecurity/core/__init__.py
@@ -85,8 +85,10 @@
 # Full scan upload retry policy. Production gateways occasionally drop an upload mid-request
 # (a backend pod stalls and stops reading the body; the client then sees a 502/408 or a reset
 # connection). Those episodes are transient and pod-local: a retried upload routed to another
-# backend almost always succeeds, and because the server never finished reading the request
-# body, no scan was created, so retrying cannot duplicate a scan.
+# backend almost always succeeds. In this failure mode the server never finished reading the
+# request body, so no scan was created and a retry does not duplicate one. (A duplicate is
+# possible only if a gateway timeout races a request the server later completes; that is
+# benign - the retried scan supersedes the orphaned one, same as running the CLI twice.)
 FULL_SCAN_UPLOAD_MAX_ATTEMPTS = 3
 # Wait before retry attempt 2 and attempt 3 respectively (plus a little jitter so a fleet of
 # CI jobs hitting the same episode doesn't retry in lock-step).
@@ -104,10 +106,10 @@
 def _is_transient_full_scan_upload_error(error: Exception) -> bool:
     """Whether a full-scan upload failure is transient and safe to retry.
 
-    Transient means the failure happened at the gateway/connection level before the server
-    finished reading the request body (so no scan was created server-side): HTTP 502/503/504/408,
-    client-side timeouts, and dropped/reset connections. 4xx client errors (400/401/403/404/429)
-    and success responses carrying an error payload are never retried.
+    Transient means the failure happened at the gateway/connection level, normally before the
+    server finished reading the request body (so no scan was created server-side): HTTP
+    502/503/504/408, client-side timeouts, and dropped/reset connections. 4xx client errors
+    (400/401/403/404/429) and success responses carrying an error payload are never retried.
     """
     if isinstance(error, (APIBadGateway, APIConnectionError, APITimeout)):
         # 502 / connection reset-dropped / request timeout - the SDK raises dedicated classes.
@@ -835,11 +837,12 @@ def create_full_scan(self, files: List[str], params: FullScanParams, base_paths:
         try:
             # Retry transient gateway/timeout failures (502/503/504/408, dropped connections,
             # timeouts) with increasing waits; a stalled backend pod recovers or gets routed
-            # around within minutes, and since it never finished reading the request body no
-            # scan was created, so a retry cannot duplicate one. fullscans.post() rebuilds its
-            # lazy file loaders from the plain paths in upload_files on every call, so simply
-            # calling it again per attempt is safe. The loop must stay inside this try so the
-            # temp .br files (cleaned up in the finally below) outlive every attempt.
+            # around within minutes, and in this failure mode the server never finished reading
+            # the request body, so no scan was created and a retry does not duplicate one (see
+            # the retry-policy comment above FULL_SCAN_UPLOAD_MAX_ATTEMPTS). fullscans.post()
+            # rebuilds its lazy file loaders from the plain paths in upload_files on every call,
+            # so simply calling it again per attempt is safe. The loop must stay inside this try
+            # so the temp .br files (cleaned up in the finally below) outlive every attempt.
             for attempt in range(1, FULL_SCAN_UPLOAD_MAX_ATTEMPTS + 1):
                 try:
                     res = self.sdk.fullscans.post(upload_files, params, use_types=True, use_lazy_loading=True, max_open_files=50, base_paths=base_paths)