Skip to content

Commit 5c9ee68

Browse files
authored
Merge branch 'main' into multivalue-chunker
2 parents 526a7e4 + 4a65fb1 commit 5c9ee68

3 files changed

Lines changed: 117 additions & 48 deletions

File tree

NEWS.md

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
1-
**05/15/2026:** The OGC `waterdata` getters (`get_daily`, `get_continuous`, `get_field_measurements`, and the rest of the multi-value-capable functions) now transparently chunk requests whose URLs would otherwise exceed the server's ~8 KB byte limit. A common chained-query pattern — pull a long site list from `get_monitoring_locations`, then feed it into `get_daily` — previously failed with HTTP 414 once the resulting URL grew past the limit; it now fans out across multiple sub-requests under the hood and returns one combined DataFrame. The chunker coordinates with the existing CQL `filter` chunker (long top-level-`OR` filters still split correctly when used alongside long multi-value lists), caps cartesian-product plans at 1000 sub-requests (the default USGS hourly quota), and aborts mid-call with a structured `QuotaExhausted` exception — carrying the partial result and a resume offset — if `x-ratelimit-remaining` drops below a safety floor. Mirrors R `dataRetrieval`'s [#870](https://github.com/DOI-USGS/dataRetrieval/pull/870), generalized to N dimensions. Note one metadata-behavior change for paginated/chunked calls: `BaseMetadata.url` still reflects the user's original query (unchanged), but `BaseMetadata.header` now carries the *last* page's / sub-request's headers (so `x-ratelimit-remaining` is current) rather than the first, and `BaseMetadata.query_time` is now the cumulative wall-clock across pages instead of the first page's elapsed.
1+
**05/17/2026:** The OGC `waterdata` getters (`get_daily`, `get_continuous`, `get_field_measurements`, and the rest of the multi-value-capable functions) now transparently chunk requests whose URLs would otherwise exceed the server's ~8 KB byte limit. A common chained-query pattern — pull a long site list from `get_monitoring_locations`, then feed it into `get_daily` — previously failed with HTTP 414 once the resulting URL grew past the limit; it now fans out across multiple sub-requests under the hood and returns one combined DataFrame. The chunker coordinates with the existing CQL `filter` chunker (long top-level-`OR` filters still split correctly when used alongside long multi-value lists), caps cartesian-product plans at 1000 sub-requests (the default USGS hourly quota), and aborts mid-call with a structured `QuotaExhausted` exception — carrying the partial result and a resume offset — if `x-ratelimit-remaining` drops below a safety floor. Mirrors R `dataRetrieval`'s [#870](https://github.com/DOI-USGS/dataRetrieval/pull/870), generalized to N dimensions. Note one metadata-behavior change for paginated/chunked calls: `BaseMetadata.url` still reflects the user's original query (unchanged), but `BaseMetadata.header` now carries the *last* page's / sub-request's headers (so `x-ratelimit-remaining` is current) rather than the first, and `BaseMetadata.query_time` is now the cumulative wall-clock across pages instead of the first page's elapsed.
22

3-
**05/14/2026:** Fixed two latent bugs in the paginated `waterdata` request loop (`_walk_pages` and `get_stats_data`). Previously, when `requests.Session.request(...)` itself raised mid-pagination (network error, timeout), the except block called `_error_body()` on the *prior page's* response, so the logged "error" described the wrong request and could itself crash on non-JSON bodies. Separately, no status-code check was performed on subsequent paginated responses, so a 5xx body that didn't include `numberReturned` was silently treated as an empty page — pagination quietly stopped and the user got truncated data with no error logged. The loop now status-checks each page like the initial request and reports the actual exception. The "best-effort" behavior (return whatever pages were collected) is preserved.
3+
**05/16/2026:** Fixed silent truncation in the paginated `waterdata` request loops (`_walk_pages` and `get_stats_data`). Mid-pagination failures (HTTP 429, 5xx, network error) were previously swallowed — pagination would quietly stop and the function would return whatever rows it had collected, leaving callers with truncated DataFrames they had no way to detect. The loops now status-check every page like the initial request and raise `RuntimeError` on any failure, with the upstream exception chained as `__cause__` and a short menu of recovery actions (wait and retry, reduce the request, or obtain an API token) in the message. **Behavior change**: callers that previously consumed partial DataFrames on transient upstream blips will now see an exception; retry the call (possibly with a smaller `limit` or narrower query).
44

55
**05/07/2026:** Bumped the declared minimum Python version from **3.8** to **3.9** (`pyproject.toml`'s `requires-python` and the ruff target). This brings the manifest in line with what was already being tested — CI's matrix has long covered only 3.9, 3.13, and 3.14, the `waterdata` test module already skipped itself on Python < 3.10, and several modules already use 3.9-only stdlib (e.g. `zoneinfo`). Users on 3.8 will no longer be able to install the package; please upgrade.
66

dataretrieval/waterdata/utils.py

Lines changed: 39 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -422,6 +422,32 @@ def _raise_for_non_200(resp: requests.Response) -> None:
422422
raise RuntimeError(_error_body(resp))
423423

424424

425+
def _paginated_failure_message(pages_collected: int, cause: BaseException) -> str:
426+
"""User-facing message for a mid-pagination failure.
427+
428+
The API exposes no resume cursor, so the caller's only recovery is
429+
to retry the whole call — the message lists the practical knobs,
430+
tailored to whether the failure was rate-limit (429) or something
431+
else.
432+
"""
433+
cause_str = str(cause).removesuffix(".")
434+
# Some ``requests`` exceptions (e.g. ``Timeout()`` with no args)
435+
# stringify to empty; fall back to the class name so the wrapper
436+
# message is always informative.
437+
if not cause_str.strip():
438+
cause_str = type(cause).__name__
439+
if cause_str.startswith("429"):
440+
action = "wait for the rate-limit window to reset and retry"
441+
else:
442+
action = "retry the request (possibly after a short backoff)"
443+
return (
444+
f"Paginated request failed after collecting {pages_collected} "
445+
f"page(s): {cause_str}. To recover: {action}, reduce the "
446+
f"request size (e.g. fewer locations, a shorter time range, or "
447+
f"a smaller ``limit``), or obtain an API token."
448+
)
449+
450+
425451
def _construct_api_requests(
426452
service: str,
427453
properties: list[str] | None = None,
@@ -671,8 +697,17 @@ def _walk_pages(
671697
672698
Raises
673699
------
674-
Exception
675-
If a request fails/returns a non-200 status code.
700+
RuntimeError
701+
On a non-200 initial response (bare message from ``_error_body``)
702+
or any failure on a subsequent page (wrapped message built by
703+
``_paginated_failure_message`` with the original exception
704+
chained as ``__cause__``).
705+
requests.exceptions.RequestException
706+
Network-level failures on the *initial* request (e.g.
707+
``ConnectionError``, ``Timeout``) propagate unmodified to
708+
preserve their specific type for callers that branch on it.
709+
Equivalent failures on *subsequent* pages are caught and
710+
re-raised as ``RuntimeError`` per the rule above.
676711
"""
677712
logger.info("Requesting: %s", req.url)
678713

@@ -717,11 +752,10 @@ def _walk_pages(
717752
total_elapsed += resp.elapsed
718753
curr_url = _next_req_url(resp)
719754
except Exception as e: # noqa: BLE001
720-
logger.error("Request incomplete: %s", e)
721755
logger.warning(
722756
"Request failed for URL: %s. Data download interrupted.", curr_url
723757
)
724-
curr_url = None
758+
raise RuntimeError(_paginated_failure_message(len(dfs), e)) from e
725759

726760
_finalize_paginated_response(initial_response, resp, total_elapsed)
727761

@@ -1189,14 +1223,13 @@ def get_stats_data(
11891223
total_elapsed += resp.elapsed
11901224
next_token = body["next"]
11911225
except Exception as e: # noqa: BLE001
1192-
logger.error("Request incomplete: %s", e)
11931226
logger.warning(
11941227
"Request failed for URL: %s (next_token=%s). "
11951228
"Data download interrupted.",
11961229
url,
11971230
next_token,
11981231
)
1199-
next_token = None
1232+
raise RuntimeError(_paginated_failure_message(len(all_dfs), e)) from e
12001233

12011234
_finalize_paginated_response(initial_response, resp, total_elapsed)
12021235

tests/waterdata_utils_test.py

Lines changed: 76 additions & 40 deletions
Original file line numberDiff line numberDiff line change
@@ -2,6 +2,7 @@
22
from unittest import mock
33

44
import pandas as pd
5+
import pytest
56
import requests
67

78
import dataretrieval.waterdata.utils as _utils_module
@@ -106,12 +107,6 @@ def _resp_ok(features):
106107
return resp
107108

108109

109-
def _error_log_messages(caplog):
110-
"""Pull ERROR-and-above message strings out of caplog. Shared by the
111-
pagination-failure tests below."""
112-
return [r.getMessage() for r in caplog.records if r.levelno >= logging.ERROR]
113-
114-
115110
def _walk_pages_with_failure(failure_resp_or_exc):
116111
"""Run _walk_pages where page 1 succeeds and page 2 fails as given."""
117112
resp1 = _resp_ok([{"id": "1", "properties": {"val": "a"}}])
@@ -131,23 +126,38 @@ def _walk_pages_with_failure(failure_resp_or_exc):
131126
return _walk_pages(geopd=False, req=mock_req, client=mock_client)
132127

133128

134-
def test_walk_pages_logs_actual_exception_when_request_raises(caplog):
135-
"""Exception from client.request() must be logged with its actual message."""
136-
caplog.set_level(logging.ERROR, logger=_LOGGER_NAME)
129+
def test_walk_pages_raises_on_connection_error_mid_pagination():
130+
"""A connection error mid-pagination must raise with the upstream cause
131+
chained, and the wrapper message must include recovery guidance that
132+
is NOT rate-limit-specific (no quota window involved)."""
133+
with pytest.raises(RuntimeError, match="Paginated request failed") as excinfo:
134+
_walk_pages_with_failure(requests.ConnectionError("boom"))
137135

138-
df, _ = _walk_pages_with_failure(requests.ConnectionError("boom"))
136+
msg = str(excinfo.value)
137+
assert isinstance(excinfo.value.__cause__, requests.ConnectionError)
138+
assert "boom" in msg
139+
assert "retry the request" in msg
140+
assert "rate-limit window" not in msg
139141

140-
# First page's data is preserved (best-effort behavior).
141-
assert list(df["val"]) == ["a"]
142-
# Logged error mentions the actual ConnectionError, not a stale page body.
143-
messages = _error_log_messages(caplog)
144-
assert any("boom" in m for m in messages), messages
145142

143+
def test_walk_pages_raises_with_class_name_when_cause_stringifies_empty():
144+
"""Some ``requests`` exceptions (e.g. ``Timeout()`` with no args)
145+
stringify to ``""``. The wrapper must still produce an informative
146+
message — fall back to the exception class name."""
147+
with pytest.raises(RuntimeError, match="Paginated request failed") as excinfo:
148+
_walk_pages_with_failure(requests.Timeout())
146149

147-
def test_walk_pages_surfaces_5xx_mid_pagination(caplog):
148-
"""A non-200 mid-pagination response must be logged, not silently swallowed."""
149-
caplog.set_level(logging.ERROR, logger=_LOGGER_NAME)
150+
msg = str(excinfo.value)
151+
assert "Timeout" in msg, msg
152+
# Sanity-check the malformed-empty placeholder didn't slip through.
153+
assert "page(s): ." not in msg
154+
assert "page(s): To recover" not in msg
150155

156+
157+
def test_walk_pages_raises_on_5xx_mid_pagination():
158+
"""A 5xx mid-pagination must raise — partial data is no longer returned
159+
because the API has no resume cursor, so silently truncating is the
160+
wrong default."""
151161
page2_503 = mock.MagicMock()
152162
page2_503.status_code = 503
153163
page2_503.json.return_value = {
@@ -156,11 +166,27 @@ def test_walk_pages_surfaces_5xx_mid_pagination(caplog):
156166
}
157167
page2_503.url = "https://example.com/page2"
158168

159-
df, _ = _walk_pages_with_failure(page2_503)
169+
with pytest.raises(RuntimeError, match="Paginated request failed") as excinfo:
170+
_walk_pages_with_failure(page2_503)
171+
172+
msg = str(excinfo.value)
173+
assert "503" in msg or "ServiceUnavailable" in msg
174+
assert "rate-limit window" not in msg # not rate-limited
175+
176+
177+
def test_walk_pages_raises_on_mid_pagination_429():
178+
"""A 429 mid-pagination must raise. Specific status code is preserved in
179+
the chained cause so callers can branch on rate-limit vs other failures."""
180+
page2_429 = mock.MagicMock()
181+
page2_429.status_code = 429
182+
page2_429.url = "https://example.com/page2"
183+
184+
with pytest.raises(RuntimeError, match="Paginated request failed") as excinfo:
185+
_walk_pages_with_failure(page2_429)
160186

161-
assert list(df["val"]) == ["a"]
162-
messages = _error_log_messages(caplog)
163-
assert any("503" in m or "ServiceUnavailable" in m for m in messages), messages
187+
msg = str(excinfo.value)
188+
assert "429" in msg
189+
assert "rate-limit window" in msg # 429-specific guidance present
164190

165191

166192
def _stats_initial_ok():
@@ -205,23 +231,20 @@ def _run_get_stats_data_with_failure(failure_resp_or_exc, monkeypatch):
205231
)
206232

207233

208-
def test_get_stats_data_logs_actual_exception_when_request_raises(caplog, monkeypatch):
209-
"""get_stats_data variant of the connection-error scenario."""
210-
caplog.set_level(logging.ERROR, logger=_LOGGER_NAME)
234+
def test_get_stats_data_raises_on_connection_error_mid_pagination(monkeypatch):
235+
"""get_stats_data variant of the connection-error-raises contract."""
236+
with pytest.raises(RuntimeError, match="Paginated request failed") as excinfo:
237+
_run_get_stats_data_with_failure(
238+
requests.ConnectionError("stats-boom"),
239+
monkeypatch,
240+
)
211241

212-
_run_get_stats_data_with_failure(
213-
requests.ConnectionError("stats-boom"),
214-
monkeypatch,
215-
)
216-
217-
messages = _error_log_messages(caplog)
218-
assert any("stats-boom" in m for m in messages), messages
242+
assert isinstance(excinfo.value.__cause__, requests.ConnectionError)
243+
assert "stats-boom" in str(excinfo.value)
219244

220245

221-
def test_get_stats_data_surfaces_5xx_mid_pagination(caplog, monkeypatch):
222-
"""get_stats_data variant of the mid-pagination 5xx scenario."""
223-
caplog.set_level(logging.ERROR, logger=_LOGGER_NAME)
224-
246+
def test_get_stats_data_raises_on_5xx_mid_pagination(monkeypatch):
247+
"""get_stats_data variant of the 5xx-raises contract."""
225248
page2_503 = mock.MagicMock()
226249
page2_503.status_code = 503
227250
page2_503.json.return_value = {
@@ -230,10 +253,22 @@ def test_get_stats_data_surfaces_5xx_mid_pagination(caplog, monkeypatch):
230253
}
231254
page2_503.url = "https://example.com/stats?service=foo&next_token=tok2"
232255

233-
_run_get_stats_data_with_failure(page2_503, monkeypatch)
256+
with pytest.raises(RuntimeError, match="Paginated request failed") as excinfo:
257+
_run_get_stats_data_with_failure(page2_503, monkeypatch)
258+
259+
assert "503" in str(excinfo.value) or "ServiceUnavailable" in str(excinfo.value)
260+
261+
262+
def test_get_stats_data_raises_on_mid_pagination_429(monkeypatch):
263+
"""get_stats_data variant of the 429-raises contract."""
264+
page2_429 = mock.MagicMock()
265+
page2_429.status_code = 429
266+
page2_429.url = "https://example.com/stats?service=foo&next_token=tok2"
267+
268+
with pytest.raises(RuntimeError, match="Paginated request failed") as excinfo:
269+
_run_get_stats_data_with_failure(page2_429, monkeypatch)
234270

235-
messages = _error_log_messages(caplog)
236-
assert any("503" in m or "ServiceUnavailable" in m for m in messages), messages
271+
assert "429" in str(excinfo.value)
237272

238273

239274
def test_get_stats_data_warning_includes_next_token(caplog, monkeypatch):
@@ -249,7 +284,8 @@ def test_get_stats_data_warning_includes_next_token(caplog, monkeypatch):
249284
"description": "upstream timeout",
250285
}
251286

252-
_run_get_stats_data_with_failure(page2_503, monkeypatch)
287+
with pytest.raises(RuntimeError):
288+
_run_get_stats_data_with_failure(page2_503, monkeypatch)
253289

254290
warnings_ = [r.getMessage() for r in caplog.records if r.levelno == logging.WARNING]
255291
# The initial response from _stats_initial_ok carries next=tok2.

0 commit comments

Comments
 (0)