Merge remote-tracking branch 'upstream/main' into multivalue-chunker

thodson-usgs · thodson-usgs · commit 84d1d333d731 · 2026-05-15T15:02:20.000-05:00
# Conflicts:
#	NEWS.md
#	dataretrieval/waterdata/utils.py
#	tests/waterdata_test.py
diff --git a/.github/workflows/python-package.yml b/.github/workflows/python-package.yml
@@ -13,9 +13,9 @@ jobs:
   lint:
     runs-on: ubuntu-latest
     steps:
-      - uses: actions/checkout@v4
+      - uses: actions/checkout@v6
       - name: Set up Python 3.14
-        uses: actions/setup-python@v5
+        uses: actions/setup-python@v6
         with:
           python-version: "3.14"
           cache: "pip"
@@ -36,9 +36,9 @@ jobs:
         python-version: ["3.9", "3.13", "3.14"]
 
     steps:
-      - uses: actions/checkout@v4
+      - uses: actions/checkout@v6
       - name: Set up Python ${{ matrix.python-version }}
-        uses: actions/setup-python@v5
+        uses: actions/setup-python@v6
         with:
           python-version: ${{ matrix.python-version }}
           cache: "pip"
diff --git a/.github/workflows/python-publish.yml b/.github/workflows/python-publish.yml
@@ -21,9 +21,9 @@ jobs:
     runs-on: ubuntu-latest
 
     steps:
-    - uses: actions/checkout@v4
+    - uses: actions/checkout@v6
     - name: Set up Python
-      uses: actions/setup-python@v5
+      uses: actions/setup-python@v6
       with:
         python-version: '3.x'
         cache: 'pip'
diff --git a/.github/workflows/sphinx-docs.yml b/.github/workflows/sphinx-docs.yml
@@ -11,11 +11,11 @@ jobs:
     runs-on: ubuntu-latest
     steps:
       - name: Checkout
-        uses: actions/checkout@v4
+        uses: actions/checkout@v6
         with:
           persist-credentials: false
       - name: Set up Python
-        uses: actions/setup-python@v5
+        uses: actions/setup-python@v6
         with:
           python-version: "3.13"
           cache: "pip"
diff --git a/NEWS.md b/NEWS.md
@@ -1,5 +1,7 @@
 **05/15/2026:** The OGC `waterdata` getters (`get_daily`, `get_continuous`, `get_field_measurements`, and the rest of the multi-value-capable functions) now transparently chunk requests whose URLs would otherwise exceed the server's ~8 KB byte limit. A common chained-query pattern — pull a long site list from `get_monitoring_locations`, then feed it into `get_daily` — previously failed with HTTP 414 once the resulting URL grew past the limit; it now fans out across multiple sub-requests under the hood and returns one combined DataFrame. The chunker coordinates with the existing CQL `filter` chunker (long top-level-`OR` filters still split correctly when used alongside long multi-value lists), caps cartesian-product plans at 1000 sub-requests (the default USGS hourly quota), and aborts mid-call with a structured `QuotaExhausted` exception — carrying the partial result and a resume offset — if `x-ratelimit-remaining` drops below a safety floor. Mirrors R `dataRetrieval`'s [#870](https://github.com/DOI-USGS/dataRetrieval/pull/870), generalized to N dimensions.
 
+**05/14/2026:** Fixed two latent bugs in the paginated `waterdata` request loop (`_walk_pages` and `get_stats_data`). Previously, when `requests.Session.request(...)` itself raised mid-pagination (network error, timeout), the except block called `_error_body()` on the *prior page's* response, so the logged "error" described the wrong request and could itself crash on non-JSON bodies. Separately, no status-code check was performed on subsequent paginated responses, so a 5xx body that didn't include `numberReturned` was silently treated as an empty page — pagination quietly stopped and the user got truncated data with no error logged. The loop now status-checks each page like the initial request and reports the actual exception. The "best-effort" behavior (return whatever pages were collected) is preserved.
+
 **05/07/2026:** Bumped the declared minimum Python version from **3.8** to **3.9** (`pyproject.toml`'s `requires-python` and the ruff target). This brings the manifest in line with what was already being tested — CI's matrix has long covered only 3.9, 3.13, and 3.14, the `waterdata` test module already skipped itself on Python < 3.10, and several modules already use 3.9-only stdlib (e.g. `zoneinfo`). Users on 3.8 will no longer be able to install the package; please upgrade.
 
 **05/07/2026:** `waterdata.get_samples()` and `wqp.get_results()` now append a derived `<prefix>DateTime` UTC column for every Date/Time/TimeZone triplet in the response (e.g. `Activity_StartDate` + `Activity_StartTime` + `Activity_StartTimeZone` → `Activity_StartDateTime`). Both the WQX3 (`<X>Date`/`<X>Time`/`<X>TimeZone`) and legacy WQP (`<X>Date`/`<X>Time/Time`/`<X>Time/TimeZoneCode`) shapes are recognized; abbreviations like EST/EDT/CST/PST resolve to a UTC `Timestamp`, unknown codes resolve to `NaT`, and the original triplet columns are preserved. Returned rows are also now sorted by `Activity_StartDateTime` (or the legacy `ActivityStartDateTime`) — the underlying APIs return rows in an unstable order. Mirrors R's `create_dateTime` and end-of-pipeline sort. Closes #266.
diff --git a/dataretrieval/waterdata/utils.py b/dataretrieval/waterdata/utils.py
@@ -410,6 +410,18 @@ def _error_body(resp: requests.Response):
     )
 
 
+def _raise_for_non_200(resp: requests.Response) -> None:
+    """Raise ``RuntimeError(_error_body(resp))`` if the status code is not 200.
+
+    Routes through ``_error_body`` (USGS-API-aware: handles 429/403
+    specially, extracts ``code``/``description`` from JSON error bodies)
+    rather than ``Response.raise_for_status``, which raises
+    ``HTTPError`` with a generic message.
+    """
+    if resp.status_code != 200:
+        raise RuntimeError(_error_body(resp))
+
+
 def _construct_api_requests(
     service: str,
     properties: list[str] | None = None,
@@ -464,12 +476,12 @@ def _construct_api_requests(
 
     if service in _CQL2_REQUIRED_SERVICES:
         # POST with CQL2 JSON: multi-value params go in the request body.
+        # The date-range loop above has already collapsed any _DATE_RANGE_PARAMS
+        # value to a string, so the list/tuple check below cannot match them.
         post_params = {
             k: v
             for k, v in kwargs.items()
-            if k not in _DATE_RANGE_PARAMS
-            and isinstance(v, (list, tuple))
-            and len(v) > 1
+            if isinstance(v, (list, tuple)) and len(v) > 1
         }
         params = {k: v for k, v in kwargs.items() if k not in post_params}
     else:
@@ -652,8 +664,7 @@ def _walk_pages(
     client = client or requests.Session()
     try:
         resp = client.send(req)
-        if resp.status_code != 200:
-            raise RuntimeError(_error_body(resp))
+        _raise_for_non_200(resp)
 
         # Store the initial response for metadata
         initial_response = resp
@@ -675,11 +686,11 @@ def _walk_pages(
                     headers=headers,
                     data=content if method == "POST" else None,
                 )
+                _raise_for_non_200(resp)
                 dfs.append(_get_resp_data(resp, geopd=geopd))
                 curr_url = _next_req_url(resp)
-            except Exception:  # noqa: BLE001
-                error_text = _error_body(resp)
-                logger.error("Request incomplete. %s", error_text)
+            except Exception as e:  # noqa: BLE001
+                logger.error("Request incomplete: %s", e)
                 logger.warning(
                     "Request failed for URL: %s. Data download interrupted.", curr_url
                 )
@@ -1115,8 +1126,7 @@ def get_stats_data(
 
     try:
         resp = client.send(req)
-        if resp.status_code != 200:
-            raise RuntimeError(_error_body(resp))
+        _raise_for_non_200(resp)
 
         # Store the initial response for metadata
         initial_response = resp
@@ -1142,14 +1152,17 @@ def get_stats_data(
                     params=args,
                     headers=headers,
                 )
+                _raise_for_non_200(resp)
                 body = resp.json()
                 all_dfs.append(_handle_stats_nesting(body, geopd=GEOPANDAS))
                 next_token = body["next"]
-            except Exception:  # noqa: BLE001
-                error_text = _error_body(resp)
-                logger.error("Request incomplete. %s", error_text)
+            except Exception as e:  # noqa: BLE001
+                logger.error("Request incomplete: %s", e)
                 logger.warning(
-                    "Request failed for URL: %s. Data download interrupted.", resp.url
+                    "Request failed for URL: %s (next_token=%s). "
+                    "Data download interrupted.",
+                    url,
+                    next_token,
                 )
                 next_token = None
 
diff --git a/pyproject.toml b/pyproject.toml
@@ -35,6 +35,7 @@ dataretrieval = ["py.typed"]
 test = [
   "pytest > 5.0.0",
   "pytest-cov[all]",
+  "pytest-rerunfailures",
   "coverage",
   "requests-mock",
   "ruff",
diff --git a/tests/waterdata_test.py b/tests/waterdata_test.py
@@ -1,4 +1,5 @@
 import datetime
+import json
 import sys
 from unittest import mock
 
@@ -45,6 +46,26 @@
     _normalize_str_iterable,
 )
 
+# Most tests in this module call the live USGS Water Data API. After
+# PR #273, transient upstream errors (5xx / 429 / connection drops)
+# propagate instead of silently truncating, which makes CI susceptible
+# to flaking on a brief upstream blip. Auto-retry such failures, but
+# only for the narrow set of transient-error trace patterns below —
+# library bugs raising other exception types still fail on the first
+# try. The marker is attached to every test in the module, but the
+# patterns match only traces produced by real network round-trips
+# (``_raise_for_non_200`` output, ``requests`` exceptions), so the
+# rerun is a no-op for tests using ``requests_mock`` or ``mock.patch``.
+pytestmark = pytest.mark.flaky(
+    reruns=2,
+    reruns_delay=5,
+    only_rerun=[
+        r"RuntimeError:\s*(?:429|5\d\d):",  # _raise_for_non_200 output
+        r"ConnectionError",
+        r"ReadTimeout|ConnectTimeout|Timeout",
+    ],
+)
+
 
 def mock_request(requests_mock, request_url, file_path):
     """Mock request code"""
@@ -142,7 +163,18 @@ def test_construct_api_requests_monitoring_locations_post():
         hydrologic_unit_code=["010802050102", "010802050103"],
     )
     assert req.method == "POST"
-    assert req.body is not None
+    assert req.headers["Content-Type"] == "application/query-cql-json"
+
+    body = json.loads(req.body)
+    # Top-level shape: AND over a list of per-param predicates.
+    assert body["op"] == "and"
+    assert isinstance(body["args"], list) and len(body["args"]) == 1
+
+    # The single predicate is an IN over hydrologic_unit_code with both values.
+    predicate = body["args"][0]
+    assert predicate["op"] == "in"
+    assert predicate["args"][0] == {"property": "hydrologic_unit_code"}
+    assert predicate["args"][1] == ["010802050102", "010802050103"]
 
 
 def test_construct_api_requests_single_value_stays_get():
diff --git a/tests/waterdata_utils_test.py b/tests/waterdata_utils_test.py