Skip to content

Commit 0d02181

Browse files
thodson-usgsclaude
andcommitted
Raise on mid-pagination 429 instead of silently truncating
`_walk_pages` and `get_stats_data`'s pagination loops have, since PR #273, logged failures correctly but preserved the "best effort" behavior of returning whatever pages had been collected when a page mid-walk failed. For 429 specifically that's a data-correctness footgun: a rate-limit hit means rows are *definitively* missing (not "the upstream had a transient glitch"), and the caller can retry once the rate-limit window resets — but only if they know the result was incomplete. Silently handing back a short DataFrame hides that. The loops now re-raise the `RuntimeError` raised by `_raise_for_non_200` when the failing response carried HTTP 429. For transient network errors and 5xx the existing best-effort return-partial behavior is preserved (so an upstream blip on the last page of a long walk doesn't lose the prior pages). Two new tests assert the new contract: - `test_walk_pages_raises_on_mid_pagination_429` - `test_get_stats_data_raises_on_mid_pagination_429` Existing 5xx tests still pass — proves the network/5xx path is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 36866a0 commit 0d02181

3 files changed

Lines changed: 52 additions & 2 deletions

File tree

NEWS.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,5 @@
1+
**05/16/2026:** Paginated `waterdata` request loops (`_walk_pages` and `get_stats_data`) now raise on a mid-pagination HTTP 429 instead of logging a warning and returning the rows accumulated up to that point. Rate-limit hits mean rows are definitively missing and the caller can retry once the rate-limit window resets, so silently returning a short DataFrame was a data-correctness footgun. Best-effort return-partial behavior is preserved for transient network and 5xx failures.
2+
13
**05/14/2026:** Fixed two latent bugs in the paginated `waterdata` request loop (`_walk_pages` and `get_stats_data`). Previously, when `requests.Session.request(...)` itself raised mid-pagination (network error, timeout), the except block called `_error_body()` on the *prior page's* response, so the logged "error" described the wrong request and could itself crash on non-JSON bodies. Separately, no status-code check was performed on subsequent paginated responses, so a 5xx body that didn't include `numberReturned` was silently treated as an empty page — pagination quietly stopped and the user got truncated data with no error logged. The loop now status-checks each page like the initial request and reports the actual exception. The "best-effort" behavior (return whatever pages were collected) is preserved.
24

35
**05/07/2026:** Bumped the declared minimum Python version from **3.8** to **3.9** (`pyproject.toml`'s `requires-python` and the ruff target). This brings the manifest in line with what was already being tested — CI's matrix has long covered only 3.9, 3.13, and 3.14, the `waterdata` test module already skipped itself on Python < 3.10, and several modules already use 3.9-only stdlib (e.g. `zoneinfo`). Users on 3.8 will no longer be able to install the package; please upgrade.

dataretrieval/waterdata/utils.py

Lines changed: 14 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -647,8 +647,12 @@ def _walk_pages(
647647
648648
Raises
649649
------
650-
Exception
651-
If a request fails/returns a non-200 status code.
650+
RuntimeError
651+
If the initial request or any subsequent page returns a non-200
652+
status code. Mid-pagination 429s also raise — best-effort
653+
return-partial is preserved only for transient network/5xx
654+
failures, since 429s mean rows are definitively missing and the
655+
caller can retry once the rate-limit window resets.
652656
"""
653657
logger.info("Requesting: %s", req.url)
654658

@@ -694,6 +698,12 @@ def _walk_pages(
694698
logger.warning(
695699
"Request failed for URL: %s. Data download interrupted.", curr_url
696700
)
701+
# Surface mid-pagination 429s instead of returning truncated
702+
# rows silently: rate-limit hits are deterministic and
703+
# recoverable on retry, unlike transient network/5xx blips
704+
# for which the best-effort behavior is preserved.
705+
if getattr(resp, "status_code", None) == 429:
706+
raise
697707
curr_url = None
698708

699709
# Concatenate all pages at once for efficiency
@@ -1161,6 +1171,8 @@ def get_stats_data(
11611171
url,
11621172
next_token,
11631173
)
1174+
if getattr(resp, "status_code", None) == 429:
1175+
raise
11641176
next_token = None
11651177

11661178
dfs = pd.concat(all_dfs, ignore_index=True) if len(all_dfs) > 1 else all_dfs[0]

tests/waterdata_utils_test.py

Lines changed: 36 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2,6 +2,7 @@
22
from unittest import mock
33

44
import pandas as pd
5+
import pytest
56
import requests
67

78
import dataretrieval.waterdata.utils as _utils_module
@@ -163,6 +164,26 @@ def test_walk_pages_surfaces_5xx_mid_pagination(caplog):
163164
assert any("503" in m or "ServiceUnavailable" in m for m in messages), messages
164165

165166

167+
def test_walk_pages_raises_on_mid_pagination_429(caplog):
168+
"""A 429 mid-pagination must raise instead of returning partial rows.
169+
170+
Rate-limit hits mean rows are definitively missing; silently
171+
returning the pages we already collected hands the caller truncated
172+
data with no exception to catch.
173+
"""
174+
caplog.set_level(logging.ERROR, logger=_LOGGER_NAME)
175+
176+
page2_429 = mock.MagicMock()
177+
page2_429.status_code = 429
178+
page2_429.url = "https://example.com/page2"
179+
180+
with pytest.raises(RuntimeError, match="429"):
181+
_walk_pages_with_failure(page2_429)
182+
183+
messages = _error_log_messages(caplog)
184+
assert any("429" in m for m in messages), messages
185+
186+
166187
def _stats_initial_ok():
167188
"""A 200-OK initial stats response: empty data list, signals one more page."""
168189
resp = mock.MagicMock()
@@ -236,6 +257,21 @@ def test_get_stats_data_surfaces_5xx_mid_pagination(caplog, monkeypatch):
236257
assert any("503" in m or "ServiceUnavailable" in m for m in messages), messages
237258

238259

260+
def test_get_stats_data_raises_on_mid_pagination_429(caplog, monkeypatch):
261+
"""get_stats_data variant of the 429-raises-instead-of-truncating contract."""
262+
caplog.set_level(logging.ERROR, logger=_LOGGER_NAME)
263+
264+
page2_429 = mock.MagicMock()
265+
page2_429.status_code = 429
266+
page2_429.url = "https://example.com/stats?service=foo&next_token=tok2"
267+
268+
with pytest.raises(RuntimeError, match="429"):
269+
_run_get_stats_data_with_failure(page2_429, monkeypatch)
270+
271+
messages = _error_log_messages(caplog)
272+
assert any("429" in m for m in messages), messages
273+
274+
239275
def test_get_stats_data_warning_includes_next_token(caplog, monkeypatch):
240276
"""The pagination-failure warning includes the next_token so operators
241277
can identify which page in the sequence failed. (Addresses Copilot's

0 commit comments

Comments
 (0)