Skip to content

Commit d654dd4

Browse files
thodson-usgsclaude
andcommitted
Raise on any mid-pagination failure instead of silently truncating
`_walk_pages` and `get_stats_data`'s pagination loops have, since PR #273, logged failures correctly but preserved a "best effort" contract of returning whatever pages had been collected when a follow-up page failed. The waterdata API exposes no resume cursor once a paginated walk is interrupted, so the partial DataFrame couldn't reliably be extended — silently returning it handed callers truncated data they had no way to know was truncated. Both loops now wrap any mid-pagination exception (429, 5xx, network error) in a ``RuntimeError`` carrying: - the number of pages successfully collected, - the upstream cause (as both the message text and ``__cause__`` for programmatic inspection), - a short menu of recovery actions (wait and retry, reduce request size, or obtain an API token). The shared helper ``_paginated_failure_message`` builds the user- facing string so both loops stay aligned. Behavior change: callers that previously consumed partial DataFrames on transient upstream blips will now see an exception they need to handle (typically: retry, possibly with a smaller ``limit`` or narrower query). Called out in NEWS. Tests: - Replaced the prior best-effort-preserves-partial assertions with raises-with-cause-chain assertions for all three failure modes (connection error, 5xx, 429), in both ``_walk_pages`` and ``get_stats_data`` variants. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 36866a0 commit d654dd4

3 files changed

Lines changed: 96 additions & 49 deletions

File tree

NEWS.md

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,6 @@
1-
**05/14/2026:** Fixed two latent bugs in the paginated `waterdata` request loop (`_walk_pages` and `get_stats_data`). Previously, when `requests.Session.request(...)` itself raised mid-pagination (network error, timeout), the except block called `_error_body()` on the *prior page's* response, so the logged "error" described the wrong request and could itself crash on non-JSON bodies. Separately, no status-code check was performed on subsequent paginated responses, so a 5xx body that didn't include `numberReturned` was silently treated as an empty page — pagination quietly stopped and the user got truncated data with no error logged. The loop now status-checks each page like the initial request and reports the actual exception. The "best-effort" behavior (return whatever pages were collected) is preserved.
1+
**05/16/2026:** Paginated `waterdata` request loops (`_walk_pages` and `get_stats_data`) now raise on *any* mid-pagination failure (HTTP 429, 5xx, network error, …) instead of logging a warning and returning the rows accumulated up to the failure. The API exposes no resume cursor once a paginated walk is interrupted, so the prior best-effort return-partial behavior silently handed callers truncated DataFrames they couldn't reliably extend — a data-correctness footgun. The wrapper `RuntimeError` carries the upstream cause via `__cause__` plus a short menu of recovery actions (wait and retry, reduce the request, or obtain an API token). **Behavior change**: callers that previously consumed partial DataFrames on transient upstream blips will now see an exception; retry the call (possibly with a smaller `limit` or narrower query).
2+
3+
**05/14/2026:** Fixed two latent bugs in the paginated `waterdata` request loop (`_walk_pages` and `get_stats_data`). Previously, when `requests.Session.request(...)` itself raised mid-pagination (network error, timeout), the except block called `_error_body()` on the *prior page's* response, so the logged "error" described the wrong request and could itself crash on non-JSON bodies. Separately, no status-code check was performed on subsequent paginated responses, so a 5xx body that didn't include `numberReturned` was silently treated as an empty page — pagination quietly stopped and the user got truncated data with no error logged. The loop now status-checks each page like the initial request and reports the actual exception. (Superseded on 05/16/2026, see entry above: the "best-effort" return-partial behavior preserved here was subsequently replaced with raising on any mid-pagination failure, since the API offers no resume cursor.)
24

35
**05/07/2026:** Bumped the declared minimum Python version from **3.8** to **3.9** (`pyproject.toml`'s `requires-python` and the ruff target). This brings the manifest in line with what was already being tested — CI's matrix has long covered only 3.9, 3.13, and 3.14, the `waterdata` test module already skipped itself on Python < 3.10, and several modules already use 3.9-only stdlib (e.g. `zoneinfo`). Users on 3.8 will no longer be able to install the package; please upgrade.
46

dataretrieval/waterdata/utils.py

Lines changed: 29 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -422,6 +422,27 @@ def _raise_for_non_200(resp: requests.Response) -> None:
422422
raise RuntimeError(_error_body(resp))
423423

424424

425+
def _paginated_failure_message(pages_collected: int, cause: BaseException) -> str:
426+
"""User-facing message for a mid-pagination failure.
427+
428+
The API exposes no resume cursor, so the caller's only recovery is
429+
to retry the whole call — the message lists the practical knobs,
430+
tailored to whether the failure was rate-limit (429) or something
431+
else.
432+
"""
433+
cause_str = str(cause).removesuffix(".")
434+
if cause_str.startswith("429"):
435+
action = "wait for the rate-limit window to reset and retry"
436+
else:
437+
action = "retry the request (possibly after a short backoff)"
438+
return (
439+
f"Paginated request failed after collecting {pages_collected} "
440+
f"page(s): {cause_str}. To recover: {action}, reduce the "
441+
f"request size (e.g. fewer locations, a shorter time range, or "
442+
f"a smaller ``limit``), or obtain an API token."
443+
)
444+
445+
425446
def _construct_api_requests(
426447
service: str,
427448
properties: list[str] | None = None,
@@ -647,8 +668,12 @@ def _walk_pages(
647668
648669
Raises
649670
------
650-
Exception
651-
If a request fails/returns a non-200 status code.
671+
RuntimeError
672+
On any request failure. A non-200 initial response raises the
673+
bare message from ``_error_body``. A failure on any subsequent
674+
page (non-200 status or any exception from ``client.request``)
675+
raises a wrapped message built by ``_paginated_failure_message``
676+
with the original exception chained as ``__cause__``.
652677
"""
653678
logger.info("Requesting: %s", req.url)
654679

@@ -690,11 +715,10 @@ def _walk_pages(
690715
dfs.append(_get_resp_data(resp, geopd=geopd))
691716
curr_url = _next_req_url(resp)
692717
except Exception as e: # noqa: BLE001
693-
logger.error("Request incomplete: %s", e)
694718
logger.warning(
695719
"Request failed for URL: %s. Data download interrupted.", curr_url
696720
)
697-
curr_url = None
721+
raise RuntimeError(_paginated_failure_message(len(dfs), e)) from e
698722

699723
# Concatenate all pages at once for efficiency
700724
return pd.concat(dfs, ignore_index=True), initial_response
@@ -1154,14 +1178,13 @@ def get_stats_data(
11541178
all_dfs.append(_handle_stats_nesting(body, geopd=GEOPANDAS))
11551179
next_token = body["next"]
11561180
except Exception as e: # noqa: BLE001
1157-
logger.error("Request incomplete: %s", e)
11581181
logger.warning(
11591182
"Request failed for URL: %s (next_token=%s). "
11601183
"Data download interrupted.",
11611184
url,
11621185
next_token,
11631186
)
1164-
next_token = None
1187+
raise RuntimeError(_paginated_failure_message(len(all_dfs), e)) from e
11651188

11661189
dfs = pd.concat(all_dfs, ignore_index=True) if len(all_dfs) > 1 else all_dfs[0]
11671190

tests/waterdata_utils_test.py

Lines changed: 64 additions & 42 deletions
Original file line numberDiff line numberDiff line change
@@ -2,6 +2,7 @@
22
from unittest import mock
33

44
import pandas as pd
5+
import pytest
56
import requests
67

78
import dataretrieval.waterdata.utils as _utils_module
@@ -106,12 +107,6 @@ def _resp_ok(features):
106107
return resp
107108

108109

109-
def _error_log_messages(caplog):
110-
"""Pull ERROR-and-above message strings out of caplog. Shared by the
111-
pagination-failure tests below."""
112-
return [r.getMessage() for r in caplog.records if r.levelno >= logging.ERROR]
113-
114-
115110
def _walk_pages_with_failure(failure_resp_or_exc):
116111
"""Run _walk_pages where page 1 succeeds and page 2 fails as given."""
117112
resp1 = _resp_ok([{"id": "1", "properties": {"val": "a"}}])
@@ -131,23 +126,24 @@ def _walk_pages_with_failure(failure_resp_or_exc):
131126
return _walk_pages(geopd=False, req=mock_req, client=mock_client)
132127

133128

134-
def test_walk_pages_logs_actual_exception_when_request_raises(caplog):
135-
"""Exception from client.request() must be logged with its actual message."""
136-
caplog.set_level(logging.ERROR, logger=_LOGGER_NAME)
137-
138-
df, _ = _walk_pages_with_failure(requests.ConnectionError("boom"))
139-
140-
# First page's data is preserved (best-effort behavior).
141-
assert list(df["val"]) == ["a"]
142-
# Logged error mentions the actual ConnectionError, not a stale page body.
143-
messages = _error_log_messages(caplog)
144-
assert any("boom" in m for m in messages), messages
129+
def test_walk_pages_raises_on_connection_error_mid_pagination():
130+
"""A connection error mid-pagination must raise with the upstream cause
131+
chained, and the wrapper message must include recovery guidance that
132+
is NOT rate-limit-specific (no quota window involved)."""
133+
with pytest.raises(RuntimeError, match="Paginated request failed") as excinfo:
134+
_walk_pages_with_failure(requests.ConnectionError("boom"))
145135

136+
msg = str(excinfo.value)
137+
assert isinstance(excinfo.value.__cause__, requests.ConnectionError)
138+
assert "boom" in msg
139+
assert "retry the request" in msg
140+
assert "rate-limit window" not in msg
146141

147-
def test_walk_pages_surfaces_5xx_mid_pagination(caplog):
148-
"""A non-200 mid-pagination response must be logged, not silently swallowed."""
149-
caplog.set_level(logging.ERROR, logger=_LOGGER_NAME)
150142

143+
def test_walk_pages_raises_on_5xx_mid_pagination():
144+
"""A 5xx mid-pagination must raise — partial data is no longer returned
145+
because the API has no resume cursor, so silently truncating is the
146+
wrong default."""
151147
page2_503 = mock.MagicMock()
152148
page2_503.status_code = 503
153149
page2_503.json.return_value = {
@@ -156,11 +152,27 @@ def test_walk_pages_surfaces_5xx_mid_pagination(caplog):
156152
}
157153
page2_503.url = "https://example.com/page2"
158154

159-
df, _ = _walk_pages_with_failure(page2_503)
155+
with pytest.raises(RuntimeError, match="Paginated request failed") as excinfo:
156+
_walk_pages_with_failure(page2_503)
157+
158+
msg = str(excinfo.value)
159+
assert "503" in msg or "ServiceUnavailable" in msg
160+
assert "rate-limit window" not in msg # not rate-limited
161+
162+
163+
def test_walk_pages_raises_on_mid_pagination_429():
164+
"""A 429 mid-pagination must raise. Specific status code is preserved in
165+
the chained cause so callers can branch on rate-limit vs other failures."""
166+
page2_429 = mock.MagicMock()
167+
page2_429.status_code = 429
168+
page2_429.url = "https://example.com/page2"
169+
170+
with pytest.raises(RuntimeError, match="Paginated request failed") as excinfo:
171+
_walk_pages_with_failure(page2_429)
160172

161-
assert list(df["val"]) == ["a"]
162-
messages = _error_log_messages(caplog)
163-
assert any("503" in m or "ServiceUnavailable" in m for m in messages), messages
173+
msg = str(excinfo.value)
174+
assert "429" in msg
175+
assert "rate-limit window" in msg # 429-specific guidance present
164176

165177

166178
def _stats_initial_ok():
@@ -205,23 +217,20 @@ def _run_get_stats_data_with_failure(failure_resp_or_exc, monkeypatch):
205217
)
206218

207219

208-
def test_get_stats_data_logs_actual_exception_when_request_raises(caplog, monkeypatch):
209-
"""get_stats_data variant of the connection-error scenario."""
210-
caplog.set_level(logging.ERROR, logger=_LOGGER_NAME)
220+
def test_get_stats_data_raises_on_connection_error_mid_pagination(monkeypatch):
221+
"""get_stats_data variant of the connection-error-raises contract."""
222+
with pytest.raises(RuntimeError, match="Paginated request failed") as excinfo:
223+
_run_get_stats_data_with_failure(
224+
requests.ConnectionError("stats-boom"),
225+
monkeypatch,
226+
)
211227

212-
_run_get_stats_data_with_failure(
213-
requests.ConnectionError("stats-boom"),
214-
monkeypatch,
215-
)
216-
217-
messages = _error_log_messages(caplog)
218-
assert any("stats-boom" in m for m in messages), messages
228+
assert isinstance(excinfo.value.__cause__, requests.ConnectionError)
229+
assert "stats-boom" in str(excinfo.value)
219230

220231

221-
def test_get_stats_data_surfaces_5xx_mid_pagination(caplog, monkeypatch):
222-
"""get_stats_data variant of the mid-pagination 5xx scenario."""
223-
caplog.set_level(logging.ERROR, logger=_LOGGER_NAME)
224-
232+
def test_get_stats_data_raises_on_5xx_mid_pagination(monkeypatch):
233+
"""get_stats_data variant of the 5xx-raises contract."""
225234
page2_503 = mock.MagicMock()
226235
page2_503.status_code = 503
227236
page2_503.json.return_value = {
@@ -230,10 +239,22 @@ def test_get_stats_data_surfaces_5xx_mid_pagination(caplog, monkeypatch):
230239
}
231240
page2_503.url = "https://example.com/stats?service=foo&next_token=tok2"
232241

233-
_run_get_stats_data_with_failure(page2_503, monkeypatch)
242+
with pytest.raises(RuntimeError, match="Paginated request failed") as excinfo:
243+
_run_get_stats_data_with_failure(page2_503, monkeypatch)
244+
245+
assert "503" in str(excinfo.value) or "ServiceUnavailable" in str(excinfo.value)
246+
247+
248+
def test_get_stats_data_raises_on_mid_pagination_429(monkeypatch):
249+
"""get_stats_data variant of the 429-raises contract."""
250+
page2_429 = mock.MagicMock()
251+
page2_429.status_code = 429
252+
page2_429.url = "https://example.com/stats?service=foo&next_token=tok2"
253+
254+
with pytest.raises(RuntimeError, match="Paginated request failed") as excinfo:
255+
_run_get_stats_data_with_failure(page2_429, monkeypatch)
234256

235-
messages = _error_log_messages(caplog)
236-
assert any("503" in m or "ServiceUnavailable" in m for m in messages), messages
257+
assert "429" in str(excinfo.value)
237258

238259

239260
def test_get_stats_data_warning_includes_next_token(caplog, monkeypatch):
@@ -249,7 +270,8 @@ def test_get_stats_data_warning_includes_next_token(caplog, monkeypatch):
249270
"description": "upstream timeout",
250271
}
251272

252-
_run_get_stats_data_with_failure(page2_503, monkeypatch)
273+
with pytest.raises(RuntimeError):
274+
_run_get_stats_data_with_failure(page2_503, monkeypatch)
253275

254276
warnings_ = [r.getMessage() for r in caplog.records if r.levelno == logging.WARNING]
255277
# The initial response from _stats_initial_ok carries next=tok2.

0 commit comments

Comments
 (0)