Skip to content

Commit 0a14ba5

Browse files
thodson-usgsclaude
andauthored
Fix silent / misleading failures in paginated waterdata pagination (#273)
Fixes bugs in _walk_pages (and the parallel get_stats_data pagination loop) caused silent or misleading failures when a paginated request was interrupted mid-walk: 1. The except handler called _error_body(resp) — but when client.request() itself raised (ConnectionError, Timeout, etc.), `resp` still pointed at the *previous successful page*. The "incomplete" log message therefore described the wrong request, and on a non-JSON 200 body would itself raise inside resp.json() and bury the original failure. 2. No status-code check was performed on paginated responses. The initial request guards `if resp.status_code != 200`, but the loop doesn't. A 5xx body that didn't include "numberReturned" was silently turned into an empty DataFrame by _get_resp_data, the `next` link wasn't found, and the loop quietly exited — handing the user truncated data with no error logged anywhere. This affects every paginated waterdata getter (get_daily, get_continuous, get_monitoring_locations, …). Fix: - Guard the loop body with `if resp.status_code != 200: raise RuntimeError(_error_body(resp))`, mirroring the initial request. - Capture the exception with `as e` and log `e` instead of touching the stale `resp`. Same change in get_stats_data. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent c755f6b commit 0a14ba5

3 files changed

Lines changed: 195 additions & 11 deletions

File tree

NEWS.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,5 @@
1+
**05/14/2026:** Fixed two latent bugs in the paginated `waterdata` request loop (`_walk_pages` and `get_stats_data`). Previously, when `requests.Session.request(...)` itself raised mid-pagination (network error, timeout), the except block called `_error_body()` on the *prior page's* response, so the logged "error" described the wrong request and could itself crash on non-JSON bodies. Separately, no status-code check was performed on subsequent paginated responses, so a 5xx body that didn't include `numberReturned` was silently treated as an empty page — pagination quietly stopped and the user got truncated data with no error logged. The loop now status-checks each page like the initial request and reports the actual exception. The "best-effort" behavior (return whatever pages were collected) is preserved.
2+
13
**05/07/2026:** Bumped the declared minimum Python version from **3.8** to **3.9** (`pyproject.toml`'s `requires-python` and the ruff target). This brings the manifest in line with what was already being tested — CI's matrix has long covered only 3.9, 3.13, and 3.14, the `waterdata` test module already skipped itself on Python < 3.10, and several modules already use 3.9-only stdlib (e.g. `zoneinfo`). Users on 3.8 will no longer be able to install the package; please upgrade.
24

35
**05/07/2026:** `waterdata.get_samples()` and `wqp.get_results()` now append a derived `<prefix>DateTime` UTC column for every Date/Time/TimeZone triplet in the response (e.g. `Activity_StartDate` + `Activity_StartTime` + `Activity_StartTimeZone``Activity_StartDateTime`). Both the WQX3 (`<X>Date`/`<X>Time`/`<X>TimeZone`) and legacy WQP (`<X>Date`/`<X>Time/Time`/`<X>Time/TimeZoneCode`) shapes are recognized; abbreviations like EST/EDT/CST/PST resolve to a UTC `Timestamp`, unknown codes resolve to `NaT`, and the original triplet columns are preserved. Returned rows are also now sorted by `Activity_StartDateTime` (or the legacy `ActivityStartDateTime`) — the underlying APIs return rows in an unstable order. Mirrors R's `create_dateTime` and end-of-pipeline sort. Closes #266.

dataretrieval/waterdata/utils.py

Lines changed: 24 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -406,6 +406,18 @@ def _error_body(resp: requests.Response):
406406
)
407407

408408

409+
def _raise_for_non_200(resp: requests.Response) -> None:
410+
"""Raise ``RuntimeError(_error_body(resp))`` if ``resp`` is not 200.
411+
412+
Routes through ``_error_body`` (USGS-API-aware: handles 429/403
413+
specially, extracts ``code``/``description`` from JSON error bodies)
414+
rather than ``Response.raise_for_status``, which raises
415+
``HTTPError`` with a generic message.
416+
"""
417+
if resp.status_code != 200:
418+
raise RuntimeError(_error_body(resp))
419+
420+
409421
def _construct_api_requests(
410422
service: str,
411423
properties: list[str] | None = None,
@@ -645,8 +657,7 @@ def _walk_pages(
645657
client = client or requests.Session()
646658
try:
647659
resp = client.send(req)
648-
if resp.status_code != 200:
649-
raise RuntimeError(_error_body(resp))
660+
_raise_for_non_200(resp)
650661

651662
# Store the initial response for metadata
652663
initial_response = resp
@@ -668,11 +679,11 @@ def _walk_pages(
668679
headers=headers,
669680
data=content if method == "POST" else None,
670681
)
682+
_raise_for_non_200(resp)
671683
dfs.append(_get_resp_data(resp, geopd=geopd))
672684
curr_url = _next_req_url(resp)
673-
except Exception: # noqa: BLE001
674-
error_text = _error_body(resp)
675-
logger.error("Request incomplete. %s", error_text)
685+
except Exception as e: # noqa: BLE001
686+
logger.error("Request incomplete: %s", e)
676687
logger.warning(
677688
"Request failed for URL: %s. Data download interrupted.", curr_url
678689
)
@@ -1105,8 +1116,7 @@ def get_stats_data(
11051116

11061117
try:
11071118
resp = client.send(req)
1108-
if resp.status_code != 200:
1109-
raise RuntimeError(_error_body(resp))
1119+
_raise_for_non_200(resp)
11101120

11111121
# Store the initial response for metadata
11121122
initial_response = resp
@@ -1132,14 +1142,17 @@ def get_stats_data(
11321142
params=args,
11331143
headers=headers,
11341144
)
1145+
_raise_for_non_200(resp)
11351146
body = resp.json()
11361147
all_dfs.append(_handle_stats_nesting(body, geopd=GEOPANDAS))
11371148
next_token = body["next"]
1138-
except Exception: # noqa: BLE001
1139-
error_text = _error_body(resp)
1140-
logger.error("Request incomplete. %s", error_text)
1149+
except Exception as e: # noqa: BLE001
1150+
logger.error("Request incomplete: %s", e)
11411151
logger.warning(
1142-
"Request failed for URL: %s. Data download interrupted.", resp.url
1152+
"Request failed for URL: %s (next_token=%s). "
1153+
"Data download interrupted.",
1154+
url,
1155+
next_token,
11431156
)
11441157
next_token = None
11451158

tests/waterdata_utils_test.py

Lines changed: 169 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,8 +1,10 @@
1+
import logging
12
from unittest import mock
23

34
import pandas as pd
45
import requests
56

7+
import dataretrieval.waterdata.utils as _utils_module
68
from dataretrieval.waterdata.utils import (
79
_arrange_cols,
810
_error_body,
@@ -12,6 +14,8 @@
1214
_walk_pages,
1315
)
1416

17+
_LOGGER_NAME = _utils_module.__name__
18+
1519

1620
def test_get_args_basic():
1721
local_vars = {
@@ -87,6 +91,171 @@ def test_walk_pages_multiple_mocked():
8791
assert mock_client.request.call_args[0][1] == "https://example.com/page2"
8892

8993

94+
def _resp_ok(features):
95+
"""Build a 200-OK mock response carrying the given features list."""
96+
links = [{"rel": "next", "href": "https://example.com/page2"}] if features else []
97+
resp = mock.MagicMock()
98+
resp.json.return_value = {
99+
"numberReturned": len(features),
100+
"features": features,
101+
"links": links,
102+
}
103+
resp.headers = {}
104+
resp.status_code = 200
105+
resp.url = "https://example.com/page1"
106+
return resp
107+
108+
109+
def _error_log_messages(caplog):
110+
"""Pull ERROR-and-above message strings out of caplog. Shared by the
111+
pagination-failure tests below."""
112+
return [r.getMessage() for r in caplog.records if r.levelno >= logging.ERROR]
113+
114+
115+
def _walk_pages_with_failure(failure_resp_or_exc):
116+
"""Run _walk_pages where page 1 succeeds and page 2 fails as given."""
117+
resp1 = _resp_ok([{"id": "1", "properties": {"val": "a"}}])
118+
119+
mock_client = mock.MagicMock(spec=requests.Session)
120+
mock_client.send.return_value = resp1
121+
if isinstance(failure_resp_or_exc, BaseException):
122+
mock_client.request.side_effect = failure_resp_or_exc
123+
else:
124+
mock_client.request.return_value = failure_resp_or_exc
125+
126+
mock_req = mock.MagicMock(spec=requests.PreparedRequest)
127+
mock_req.method = "GET"
128+
mock_req.headers = {}
129+
mock_req.url = "https://example.com/page1"
130+
131+
return _walk_pages(geopd=False, req=mock_req, client=mock_client)
132+
133+
134+
def test_walk_pages_logs_actual_exception_when_request_raises(caplog):
135+
"""Exception from client.request() must be logged with its actual message."""
136+
caplog.set_level(logging.ERROR, logger=_LOGGER_NAME)
137+
138+
df, _ = _walk_pages_with_failure(requests.ConnectionError("boom"))
139+
140+
# First page's data is preserved (best-effort behavior).
141+
assert list(df["val"]) == ["a"]
142+
# Logged error mentions the actual ConnectionError, not a stale page body.
143+
messages = _error_log_messages(caplog)
144+
assert any("boom" in m for m in messages), messages
145+
146+
147+
def test_walk_pages_surfaces_5xx_mid_pagination(caplog):
148+
"""A non-200 mid-pagination response must be logged, not silently swallowed."""
149+
caplog.set_level(logging.ERROR, logger=_LOGGER_NAME)
150+
151+
page2_503 = mock.MagicMock()
152+
page2_503.status_code = 503
153+
page2_503.json.return_value = {
154+
"code": "ServiceUnavailable",
155+
"description": "upstream timeout",
156+
}
157+
page2_503.url = "https://example.com/page2"
158+
159+
df, _ = _walk_pages_with_failure(page2_503)
160+
161+
assert list(df["val"]) == ["a"]
162+
messages = _error_log_messages(caplog)
163+
assert any("503" in m or "ServiceUnavailable" in m for m in messages), messages
164+
165+
166+
def _stats_initial_ok():
167+
"""A 200-OK initial stats response: empty data list, signals one more page."""
168+
resp = mock.MagicMock()
169+
resp.status_code = 200
170+
resp.json.return_value = {
171+
"next": "tok2",
172+
"features": [],
173+
}
174+
resp.headers = {}
175+
resp.url = "https://example.com/stats?service=foo"
176+
return resp
177+
178+
179+
def _run_get_stats_data_with_failure(failure_resp_or_exc, monkeypatch):
180+
"""Exercise get_stats_data where the initial response succeeds and the
181+
paginated follow-up fails as given. Mirrors _walk_pages_with_failure.
182+
`monkeypatch` stubs ``_handle_stats_nesting`` so the synthetic minimal
183+
response body doesn't need to parse — these tests only assert on the
184+
pagination loop's error surfacing."""
185+
from dataretrieval.waterdata.utils import get_stats_data
186+
187+
monkeypatch.setattr(
188+
_utils_module,
189+
"_handle_stats_nesting",
190+
mock.MagicMock(return_value=pd.DataFrame()),
191+
)
192+
193+
mock_client = mock.MagicMock(spec=requests.Session)
194+
mock_client.send.return_value = _stats_initial_ok()
195+
if isinstance(failure_resp_or_exc, BaseException):
196+
mock_client.request.side_effect = failure_resp_or_exc
197+
else:
198+
mock_client.request.return_value = failure_resp_or_exc
199+
200+
return get_stats_data(
201+
args={"monitoring_location_id": "USGS-1"},
202+
service="observationNormals",
203+
expand_percentiles=False,
204+
client=mock_client,
205+
)
206+
207+
208+
def test_get_stats_data_logs_actual_exception_when_request_raises(caplog, monkeypatch):
209+
"""get_stats_data variant of the connection-error scenario."""
210+
caplog.set_level(logging.ERROR, logger=_LOGGER_NAME)
211+
212+
_run_get_stats_data_with_failure(
213+
requests.ConnectionError("stats-boom"),
214+
monkeypatch,
215+
)
216+
217+
messages = _error_log_messages(caplog)
218+
assert any("stats-boom" in m for m in messages), messages
219+
220+
221+
def test_get_stats_data_surfaces_5xx_mid_pagination(caplog, monkeypatch):
222+
"""get_stats_data variant of the mid-pagination 5xx scenario."""
223+
caplog.set_level(logging.ERROR, logger=_LOGGER_NAME)
224+
225+
page2_503 = mock.MagicMock()
226+
page2_503.status_code = 503
227+
page2_503.json.return_value = {
228+
"code": "ServiceUnavailable",
229+
"description": "upstream timeout",
230+
}
231+
page2_503.url = "https://example.com/stats?service=foo&next_token=tok2"
232+
233+
_run_get_stats_data_with_failure(page2_503, monkeypatch)
234+
235+
messages = _error_log_messages(caplog)
236+
assert any("503" in m or "ServiceUnavailable" in m for m in messages), messages
237+
238+
239+
def test_get_stats_data_warning_includes_next_token(caplog, monkeypatch):
240+
"""The pagination-failure warning includes the next_token so operators
241+
can identify which page in the sequence failed. (Addresses Copilot's
242+
PR #273 review note: the base URL alone drops cursor context.)"""
243+
caplog.set_level(logging.WARNING, logger=_LOGGER_NAME)
244+
245+
page2_503 = mock.MagicMock()
246+
page2_503.status_code = 503
247+
page2_503.json.return_value = {
248+
"code": "ServiceUnavailable",
249+
"description": "upstream timeout",
250+
}
251+
252+
_run_get_stats_data_with_failure(page2_503, monkeypatch)
253+
254+
warnings_ = [r.getMessage() for r in caplog.records if r.levelno == logging.WARNING]
255+
# The initial response from _stats_initial_ok carries next=tok2.
256+
assert any("tok2" in m for m in warnings_), warnings_
257+
258+
90259
def test_handle_stats_nesting_tolerates_missing_drop_columns():
91260
"""If the upstream stats response shape ever changes such that one of
92261
the columns we try to drop ("type", "properties.data") is absent, the

0 commit comments

Comments
 (0)