Skip to content

Commit 95bd5f0

Browse files
thodson-usgsclaude
andcommitted
feat(waterdata): Auto-chunk OGC requests over the URL byte limit
The OGC `waterdata` getters (`get_daily`, `get_continuous`, `get_field_measurements`, and the rest of the multi-value-capable functions) previously failed with HTTP 414 when the request URL exceeded the server's ~8 KB byte limit. The common chained-query pattern — pull a long site list from `get_monitoring_locations`, then feed it into `get_daily` — was the main offender: from dataretrieval.waterdata import get_daily, get_monitoring_locations sites_df, _ = get_monitoring_locations( state_name="Ohio", site_type_code="ST", skip_geometry=True, ) # Before: HTTP 414 once `sites_df` exceeded ~500 rows. # After: transparently chunked into multiple sub-requests, one # combined DataFrame returned. df, md = get_daily( monitoring_location_id=sites_df["monitoring_location_id"].tolist(), parameter_code="00060", time="P7D", ) This patch introduces a joint chunker that models every multi-value list parameter AND the cql-text `filter` (split on its top-level `OR` clauses) as a chunkable axis. Greedy halving splits the biggest chunk across all axes until each sub-request URL fits the limit; the chunker fans out into multiple HTTP requests under the hood and returns one combined DataFrame. Callers see no API change. Every axis (a list-shaped kwarg, or the filter split into its top-level `OR` clauses) is represented by an `_Axis` dataclass: the args key, the tuple of indivisible atoms (site IDs or clauses), and the joiner used to compose them back into URL text (`,` for list axes, ` OR ` for the filter axis). `ChunkPlan` extracts the chunkable axes for a request and runs greedy halving against the biggest chunk across all axes until the worst-case sub-request URL fits. `ChunkedCall` iterates the joint cartesian product of axis chunks and drives the sub-requests to completion. Requests that already fit get a trivial single-step plan — one code path either way. After the first sub-request, `ChunkedCall` reads `x-ratelimit-remaining`; if the rest of the plan can't fit the current per-key rate-limit window, it raises `RequestExceedsQuota` reporting the deficit before burning more budget. Set `API_USGS_LIMIT=0` to bypass the pre-emptive check. Mid-stream transient failures surface as a `ChunkInterrupted` subclass — `QuotaExhausted` for HTTP 429, `ServiceInterrupted` for HTTP 5xx. Both carry the partial result plus a resumable call handle on `exc.call`: import time from dataretrieval.waterdata import get_daily from dataretrieval.waterdata.chunking import ChunkInterrupted try: df, md = get_daily(monitoring_location_id=long_list) except ChunkInterrupted as exc: time.sleep(exc.retry_after or 5 * 60) # Re-issues only the still-pending sub-requests; banked work # is preserved on `exc.call`. df, md = exc.call.resume() `ChunkedCall.resume` opens one `requests.Session` for the entire fan-out and publishes it via a `ContextVar` so paginated-loop helpers downstream (`_walk_pages`, `get_stats_data` via the new `_paginate` helper) reuse the same connection pool across every sub-request — saves one TCP/TLS handshake per sub-request after the first. Measured 41% wall-clock reduction on a 2000-site / 8-chunk fan-out against the live USGS API (1.78s shared vs 3.03s per-sub-request). One behavior change for paginated/chunked calls: - `BaseMetadata.url` still reflects the user's original query (unchanged). - `BaseMetadata.header` now carries the *last* page/sub-request headers so downstream code that branches on `x-ratelimit-remaining` sees current state (was: first page's headers). - `BaseMetadata.query_time` is now cumulative wall-clock across every page/sub-request (was: first page's elapsed). - New module `dataretrieval.waterdata.chunking`: joint planner, exception hierarchy (`_RetryableTransportError`, `RateLimited`, `ServiceUnavailable`, `RequestTooLarge`, `RequestExceedsQuota`, `ChunkInterrupted`, `QuotaExhausted`, `ServiceInterrupted`), `ChunkPlan`, `ChunkedCall`, `multi_value_chunked` decorator, shared-session ContextVar plumbing. - `dataretrieval.waterdata.utils`: paginated-loop body consolidated into a `_paginate` strategy helper that `_walk_pages` and `get_stats_data` both delegate to; typed transport exceptions moved out to `chunking` so the layer direction is strictly `utils → chunking` (no more lazy cross-module import). - `dataretrieval.waterdata.filters`: existing top-level-OR splitter and filter-chunkability detector kept as primitives the joint planner consumes. 80 new unit tests in `tests/waterdata_chunking_test.py` covering the planner, axis extraction, cartesian-product enumeration, rate-limit gating, resume idempotency and equivalence, transient- error classification, shared-session reuse, and a URL-construction stress test against the real `_construct_api_requests` builder (not a fake) — 500 USGS site IDs × 20 datetime OR-clauses, asserting every sub-request URL stays under 8000 bytes and the joint planner beats the bail-floor worst case. Mid-pagination 429/5xx now also covered for both the OGC and stats paginators. Mirrors R `dataRetrieval`'s [#870](DOI-USGS/dataRetrieval#870), generalized from one filter axis to N joint axes. Also fixes a handful of pre-existing docstring typos in `waterdata/api.py` (`meaining` → `meaning`, `instantanous` → `instantaneous`). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
1 parent 4a65fb1 commit 95bd5f0

9 files changed

Lines changed: 3302 additions & 580 deletions

File tree

NEWS.md

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,5 @@
1+
**05/17/2026:** The OGC `waterdata` getters (`get_daily`, `get_continuous`, `get_field_measurements`, and the rest of the multi-value-capable functions) now transparently chunk requests whose URLs would otherwise exceed the server's ~8 KB byte limit. A common chained-query pattern — pull a long site list from `get_monitoring_locations`, then feed it into `get_daily` — previously failed with HTTP 414 once the resulting URL grew past the limit; it now fans out across multiple sub-requests under the hood and returns one combined DataFrame. Every multi-value list parameter and the cql-text `filter` (split on its top-level `OR`s) is modeled as a chunkable axis; greedy halving splits the biggest chunk across all axes until each sub-request URL fits. After the first sub-request `ChunkedCall` reads `x-ratelimit-remaining`; if the rest of the plan won't fit the window it raises `RequestExceedsQuota` reporting the deficit. Mid-call transient failures (429 or 5xx) surface as a `ChunkInterrupted` subclass — `QuotaExhausted` for 429, `ServiceInterrupted` for 5xx — carrying the partial result plus a resumable call handle (`exc.call`); call `exc.call.resume()` to continue only the still-pending sub-requests once the underlying condition clears. Mirrors R `dataRetrieval`'s [#870](https://github.com/DOI-USGS/dataRetrieval/pull/870), generalized to N axes. Note one metadata-behavior change for paginated/chunked calls: `BaseMetadata.url` still reflects the user's original query (unchanged), but `BaseMetadata.header` now carries the *last* page/sub-request headers (so `x-ratelimit-remaining` is current) rather than the first, and `BaseMetadata.query_time` is now the cumulative wall-clock across pages instead of the first page's elapsed.
2+
13
**05/16/2026:** Fixed silent truncation in the paginated `waterdata` request loops (`_walk_pages` and `get_stats_data`). Mid-pagination failures (HTTP 429, 5xx, network error) were previously swallowed — pagination would quietly stop and the function would return whatever rows it had collected, leaving callers with truncated DataFrames they had no way to detect. The loops now status-check every page like the initial request and raise `RuntimeError` on any failure, with the upstream exception chained as `__cause__` and a short menu of recovery actions (wait and retry, reduce the request, or obtain an API token) in the message. **Behavior change**: callers that previously consumed partial DataFrames on transient upstream blips will now see an exception; retry the call (possibly with a smaller `limit` or narrower query).
24

35
**05/07/2026:** Bumped the declared minimum Python version from **3.8** to **3.9** (`pyproject.toml`'s `requires-python` and the ruff target). This brings the manifest in line with what was already being tested — CI's matrix has long covered only 3.9, 3.13, and 3.14, the `waterdata` test module already skipped itself on Python < 3.10, and several modules already use 3.9-only stdlib (e.g. `zoneinfo`). Users on 3.8 will no longer be able to install the package; please upgrade.
@@ -36,4 +38,4 @@
3638

3739
**03/01/2024:** USGS data availability and format have changed on Water Quality Portal (WQP). Since March 2024, data obtained from WQP legacy profiles will not include new USGS data or recent updates to existing data. All USGS data (up to and beyond March 2024) are available using the new WQP beta services. You can access the beta services by setting `legacy=False` in the functions in the `wqp` module.
3840

39-
To view the status of changes in data availability and code functionality, visit: https://doi-usgs.github.io/dataRetrieval/articles/Status.html
41+
To view the status of changes in data availability and code functionality, visit: https://doi-usgs.github.io/dataRetrieval/articles/Status.html

dataretrieval/waterdata/api.py

Lines changed: 21 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -113,7 +113,7 @@ def get_daily(
113113
data are released on the condition that neither the USGS nor the United
114114
States Government may be held liable for any damages resulting from its
115115
use. This field reflects the approval status of each record, and is either
116-
"Approved", meaining processing review has been completed and the data is
116+
"Approved", meaning processing review has been completed and the data is
117117
approved for publication, or "Provisional" and subject to revision. For
118118
more information about provisional data, go to:
119119
https://waterdata.usgs.gov/provisional-data-statement/.
@@ -230,6 +230,21 @@ def get_daily(
230230
... parameter_code="00060",
231231
... last_modified="P7D",
232232
... )
233+
234+
>>> # Chain queries: pull all stream sites in a state, then their
235+
>>> # daily discharge for the last week. The site list can be hundreds
236+
>>> # of values long — the request is transparently chunked across
237+
>>> # multiple sub-requests so the URL stays under the server's byte
238+
>>> # limit. Combined output looks like a single query.
239+
>>> sites_df, _ = dataretrieval.waterdata.get_monitoring_locations(
240+
... state_name="Ohio",
241+
... site_type="Stream",
242+
... )
243+
>>> df, md = dataretrieval.waterdata.get_daily(
244+
... monitoring_location_id=sites_df["monitoring_location_id"].tolist(),
245+
... parameter_code="00060",
246+
... time="P7D",
247+
... )
233248
"""
234249
service = "daily"
235250
output_id = "daily_id"
@@ -259,7 +274,7 @@ def get_continuous(
259274
convert_type: bool = True,
260275
) -> tuple[pd.DataFrame, BaseMetadata]:
261276
"""
262-
Continuous data provide instantanous water conditions.
277+
Continuous data provide instantaneous water conditions.
263278
264279
This is an early version of the continuous endpoint that is feature-complete
265280
and is being made available for limited use. Geometries are not included
@@ -320,7 +335,7 @@ def get_continuous(
320335
data are released on the condition that neither the USGS nor the United
321336
States Government may be held liable for any damages resulting from its
322337
use. This field reflects the approval status of each record, and is either
323-
"Approved", meaining processing review has been completed and the data is
338+
"Approved", meaning processing review has been completed and the data is
324339
approved for publication, or "Provisional" and subject to revision. For
325340
more information about provisional data, go to:
326341
https://waterdata.usgs.gov/provisional-data-statement/.
@@ -1254,7 +1269,7 @@ def get_latest_continuous(
12541269
data are released on the condition that neither the USGS nor the United
12551270
States Government may be held liable for any damages resulting from its
12561271
use. This field reflects the approval status of each record, and is either
1257-
"Approved", meaining processing review has been completed and the data is
1272+
"Approved", meaning processing review has been completed and the data is
12581273
approved for publication, or "Provisional" and subject to revision. For
12591274
more information about provisional data, go to:
12601275
https://waterdata.usgs.gov/provisional-data-statement/.
@@ -1451,7 +1466,7 @@ def get_latest_daily(
14511466
data are released on the condition that neither the USGS nor the United
14521467
States Government may be held liable for any damages resulting from its
14531468
use. This field reflects the approval status of each record, and is either
1454-
"Approved", meaining processing review has been completed and the data is
1469+
"Approved", meaning processing review has been completed and the data is
14551470
approved for publication, or "Provisional" and subject to revision. For
14561471
more information about provisional data, go to:
14571472
https://waterdata.usgs.gov/provisional-data-statement/.
@@ -1633,7 +1648,7 @@ def get_field_measurements(
16331648
data are released on the condition that neither the USGS nor the United
16341649
States Government may be held liable for any damages resulting from its
16351650
use. This field reflects the approval status of each record, and is either
1636-
"Approved", meaining processing review has been completed and the data is
1651+
"Approved", meaning processing review has been completed and the data is
16371652
approved for publication, or "Provisional" and subject to revision. For
16381653
more information about provisional data, go to:
16391654
https://waterdata.usgs.gov/provisional-data-statement/.

0 commit comments

Comments
 (0)