Skip to content

Commit 463106c

Browse files
thodson-usgsclaude
andcommitted
Raise on any mid-pagination failure instead of silently truncating
`_walk_pages` and `get_stats_data`'s pagination loops have, since PR DOI-USGS#273, logged failures correctly but preserved a "best effort" contract of returning whatever pages had been collected when a follow-up page failed. The waterdata API exposes no resume cursor once a paginated walk is interrupted, so the partial DataFrame couldn't reliably be extended — silently returning it handed callers truncated data they had no way to know was truncated. Both loops now wrap any mid-pagination exception (429, 5xx, network error) in a ``RuntimeError`` carrying: - the number of pages successfully collected, - the upstream cause (as both the message text and ``__cause__`` for programmatic inspection), - a short menu of recovery actions (wait and retry, reduce request size, or obtain an API token). The shared helper ``_paginated_failure_message`` builds the user- facing string so both loops stay aligned. Behavior change: callers that previously consumed partial DataFrames on transient upstream blips will now see an exception they need to handle (typically: retry, possibly with a smaller ``limit`` or narrower query). Called out in NEWS. Tests: - Replaced the prior best-effort-preserves-partial assertions with raises-with-cause-chain assertions for all three failure modes (connection error, 5xx, 429), in both ``_walk_pages`` and ``get_stats_data`` variants. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 36866a0 commit 463106c

3 files changed

Lines changed: 99 additions & 24 deletions

File tree

NEWS.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,5 @@
1+
**05/16/2026:** Paginated `waterdata` request loops (`_walk_pages` and `get_stats_data`) now raise on *any* mid-pagination failure (HTTP 429, 5xx, network error, …) instead of logging a warning and returning the rows accumulated up to the failure. The API exposes no resume cursor once a paginated walk is interrupted, so the prior best-effort return-partial behavior silently handed callers truncated DataFrames they couldn't reliably extend — a data-correctness footgun. The wrapper `RuntimeError` carries the upstream cause via `__cause__` plus a short menu of recovery actions (wait and retry, reduce the request, or obtain an API token). **Behavior change**: callers that previously consumed partial DataFrames on transient upstream blips will now see an exception; retry the call (possibly with a smaller `limit` or narrower query).
2+
13
**05/14/2026:** Fixed two latent bugs in the paginated `waterdata` request loop (`_walk_pages` and `get_stats_data`). Previously, when `requests.Session.request(...)` itself raised mid-pagination (network error, timeout), the except block called `_error_body()` on the *prior page's* response, so the logged "error" described the wrong request and could itself crash on non-JSON bodies. Separately, no status-code check was performed on subsequent paginated responses, so a 5xx body that didn't include `numberReturned` was silently treated as an empty page — pagination quietly stopped and the user got truncated data with no error logged. The loop now status-checks each page like the initial request and reports the actual exception. The "best-effort" behavior (return whatever pages were collected) is preserved.
24

35
**05/07/2026:** Bumped the declared minimum Python version from **3.8** to **3.9** (`pyproject.toml`'s `requires-python` and the ruff target). This brings the manifest in line with what was already being tested — CI's matrix has long covered only 3.9, 3.13, and 3.14, the `waterdata` test module already skipped itself on Python < 3.10, and several modules already use 3.9-only stdlib (e.g. `zoneinfo`). Users on 3.8 will no longer be able to install the package; please upgrade.

dataretrieval/waterdata/utils.py

Lines changed: 30 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -422,6 +422,24 @@ def _raise_for_non_200(resp: requests.Response) -> None:
422422
raise RuntimeError(_error_body(resp))
423423

424424

425+
def _paginated_failure_message(pages_collected: int, cause: BaseException) -> str:
426+
"""Build the user-facing message for a mid-pagination failure.
427+
428+
Wraps the upstream error with the page count we got through and a
429+
short menu of recovery actions, since the API offers no resume
430+
cursor: a partial result here cannot be reliably extended, so the
431+
caller's next step is always to retry the whole call (possibly
432+
smaller, possibly later, possibly with an API token).
433+
"""
434+
return (
435+
f"Paginated request failed after collecting {pages_collected} "
436+
f"page(s): {str(cause).rstrip('.')}. To recover: wait for the "
437+
f"rate-limit window to reset and retry, reduce the request size "
438+
f"(e.g. fewer locations, a shorter time range, or a smaller "
439+
f"``limit``), or obtain an API token."
440+
)
441+
442+
425443
def _construct_api_requests(
426444
service: str,
427445
properties: list[str] | None = None,
@@ -647,8 +665,16 @@ def _walk_pages(
647665
648666
Raises
649667
------
650-
Exception
651-
If a request fails/returns a non-200 status code.
668+
RuntimeError
669+
On any failure during pagination (non-200 status on any page, or
670+
any exception from the underlying ``client.request``). Because
671+
the API exposes no resume cursor once a paginated walk is
672+
interrupted, returning the partial pages would hand the caller
673+
truncated data with no reliable way to fill the gap. The
674+
exception message reports how many pages were collected, the
675+
upstream cause, and recovery options (wait, reduce the request,
676+
or obtain an API token); the original exception is set as
677+
``__cause__``.
652678
"""
653679
logger.info("Requesting: %s", req.url)
654680

@@ -694,7 +720,7 @@ def _walk_pages(
694720
logger.warning(
695721
"Request failed for URL: %s. Data download interrupted.", curr_url
696722
)
697-
curr_url = None
723+
raise RuntimeError(_paginated_failure_message(len(dfs), e)) from e
698724

699725
# Concatenate all pages at once for efficiency
700726
return pd.concat(dfs, ignore_index=True), initial_response
@@ -1161,7 +1187,7 @@ def get_stats_data(
11611187
url,
11621188
next_token,
11631189
)
1164-
next_token = None
1190+
raise RuntimeError(_paginated_failure_message(len(all_dfs), e)) from e
11651191

11661192
dfs = pd.concat(all_dfs, ignore_index=True) if len(all_dfs) > 1 else all_dfs[0]
11671193

tests/waterdata_utils_test.py

Lines changed: 67 additions & 20 deletions
Original file line numberDiff line numberDiff line change
@@ -2,6 +2,7 @@
22
from unittest import mock
33

44
import pandas as pd
5+
import pytest
56
import requests
67

78
import dataretrieval.waterdata.utils as _utils_module
@@ -131,21 +132,28 @@ def _walk_pages_with_failure(failure_resp_or_exc):
131132
return _walk_pages(geopd=False, req=mock_req, client=mock_client)
132133

133134

134-
def test_walk_pages_logs_actual_exception_when_request_raises(caplog):
135-
"""Exception from client.request() must be logged with its actual message."""
135+
def test_walk_pages_raises_on_connection_error_mid_pagination(caplog):
136+
"""A connection error mid-pagination must raise with the upstream cause
137+
chained, and the wrapper message must include recovery guidance."""
136138
caplog.set_level(logging.ERROR, logger=_LOGGER_NAME)
137139

138-
df, _ = _walk_pages_with_failure(requests.ConnectionError("boom"))
140+
with pytest.raises(RuntimeError, match="Paginated request failed") as excinfo:
141+
_walk_pages_with_failure(requests.ConnectionError("boom"))
142+
143+
# Original exception preserved via chaining for callers who need it.
144+
assert isinstance(excinfo.value.__cause__, requests.ConnectionError)
145+
assert "boom" in str(excinfo.value)
146+
# And the actionable advice is present in the message.
147+
assert "API token" in str(excinfo.value)
139148

140-
# First page's data is preserved (best-effort behavior).
141-
assert list(df["val"]) == ["a"]
142-
# Logged error mentions the actual ConnectionError, not a stale page body.
143149
messages = _error_log_messages(caplog)
144150
assert any("boom" in m for m in messages), messages
145151

146152

147-
def test_walk_pages_surfaces_5xx_mid_pagination(caplog):
148-
"""A non-200 mid-pagination response must be logged, not silently swallowed."""
153+
def test_walk_pages_raises_on_5xx_mid_pagination(caplog):
154+
"""A 5xx mid-pagination must raise — partial data is no longer returned
155+
because the API has no resume cursor, so silently truncating is the
156+
wrong default."""
149157
caplog.set_level(logging.ERROR, logger=_LOGGER_NAME)
150158

151159
page2_503 = mock.MagicMock()
@@ -156,13 +164,30 @@ def test_walk_pages_surfaces_5xx_mid_pagination(caplog):
156164
}
157165
page2_503.url = "https://example.com/page2"
158166

159-
df, _ = _walk_pages_with_failure(page2_503)
167+
with pytest.raises(RuntimeError, match="Paginated request failed"):
168+
_walk_pages_with_failure(page2_503)
160169

161-
assert list(df["val"]) == ["a"]
162170
messages = _error_log_messages(caplog)
163171
assert any("503" in m or "ServiceUnavailable" in m for m in messages), messages
164172

165173

174+
def test_walk_pages_raises_on_mid_pagination_429(caplog):
175+
"""A 429 mid-pagination must raise. Specific status code is preserved in
176+
the chained cause so callers can branch on rate-limit vs other failures."""
177+
caplog.set_level(logging.ERROR, logger=_LOGGER_NAME)
178+
179+
page2_429 = mock.MagicMock()
180+
page2_429.status_code = 429
181+
page2_429.url = "https://example.com/page2"
182+
183+
with pytest.raises(RuntimeError, match="Paginated request failed") as excinfo:
184+
_walk_pages_with_failure(page2_429)
185+
186+
assert "429" in str(excinfo.value)
187+
messages = _error_log_messages(caplog)
188+
assert any("429" in m for m in messages), messages
189+
190+
166191
def _stats_initial_ok():
167192
"""A 200-OK initial stats response: empty data list, signals one more page."""
168193
resp = mock.MagicMock()
@@ -205,21 +230,25 @@ def _run_get_stats_data_with_failure(failure_resp_or_exc, monkeypatch):
205230
)
206231

207232

208-
def test_get_stats_data_logs_actual_exception_when_request_raises(caplog, monkeypatch):
209-
"""get_stats_data variant of the connection-error scenario."""
233+
def test_get_stats_data_raises_on_connection_error_mid_pagination(caplog, monkeypatch):
234+
"""get_stats_data variant of the connection-error-raises contract."""
210235
caplog.set_level(logging.ERROR, logger=_LOGGER_NAME)
211236

212-
_run_get_stats_data_with_failure(
213-
requests.ConnectionError("stats-boom"),
214-
monkeypatch,
215-
)
237+
with pytest.raises(RuntimeError, match="Paginated request failed") as excinfo:
238+
_run_get_stats_data_with_failure(
239+
requests.ConnectionError("stats-boom"),
240+
monkeypatch,
241+
)
242+
243+
assert isinstance(excinfo.value.__cause__, requests.ConnectionError)
244+
assert "stats-boom" in str(excinfo.value)
216245

217246
messages = _error_log_messages(caplog)
218247
assert any("stats-boom" in m for m in messages), messages
219248

220249

221-
def test_get_stats_data_surfaces_5xx_mid_pagination(caplog, monkeypatch):
222-
"""get_stats_data variant of the mid-pagination 5xx scenario."""
250+
def test_get_stats_data_raises_on_5xx_mid_pagination(caplog, monkeypatch):
251+
"""get_stats_data variant of the 5xx-raises contract."""
223252
caplog.set_level(logging.ERROR, logger=_LOGGER_NAME)
224253

225254
page2_503 = mock.MagicMock()
@@ -230,12 +259,29 @@ def test_get_stats_data_surfaces_5xx_mid_pagination(caplog, monkeypatch):
230259
}
231260
page2_503.url = "https://example.com/stats?service=foo&next_token=tok2"
232261

233-
_run_get_stats_data_with_failure(page2_503, monkeypatch)
262+
with pytest.raises(RuntimeError, match="Paginated request failed"):
263+
_run_get_stats_data_with_failure(page2_503, monkeypatch)
234264

235265
messages = _error_log_messages(caplog)
236266
assert any("503" in m or "ServiceUnavailable" in m for m in messages), messages
237267

238268

269+
def test_get_stats_data_raises_on_mid_pagination_429(caplog, monkeypatch):
270+
"""get_stats_data variant of the 429-raises contract."""
271+
caplog.set_level(logging.ERROR, logger=_LOGGER_NAME)
272+
273+
page2_429 = mock.MagicMock()
274+
page2_429.status_code = 429
275+
page2_429.url = "https://example.com/stats?service=foo&next_token=tok2"
276+
277+
with pytest.raises(RuntimeError, match="Paginated request failed") as excinfo:
278+
_run_get_stats_data_with_failure(page2_429, monkeypatch)
279+
280+
assert "429" in str(excinfo.value)
281+
messages = _error_log_messages(caplog)
282+
assert any("429" in m for m in messages), messages
283+
284+
239285
def test_get_stats_data_warning_includes_next_token(caplog, monkeypatch):
240286
"""The pagination-failure warning includes the next_token so operators
241287
can identify which page in the sequence failed. (Addresses Copilot's
@@ -249,7 +295,8 @@ def test_get_stats_data_warning_includes_next_token(caplog, monkeypatch):
249295
"description": "upstream timeout",
250296
}
251297

252-
_run_get_stats_data_with_failure(page2_503, monkeypatch)
298+
with pytest.raises(RuntimeError):
299+
_run_get_stats_data_with_failure(page2_503, monkeypatch)
253300

254301
warnings_ = [r.getMessage() for r in caplog.records if r.levelno == logging.WARNING]
255302
# The initial response from _stats_initial_ok carries next=tok2.

0 commit comments

Comments
 (0)