Commit 95bd5f0
feat(waterdata): Auto-chunk OGC requests over the URL byte limit
The OGC `waterdata` getters (`get_daily`, `get_continuous`,
`get_field_measurements`, and the rest of the multi-value-capable
functions) previously failed with HTTP 414 when the request URL
exceeded the server's limit of roughly 8 KB. The common chained-query
pattern — pull a long site list from `get_monitoring_locations`,
then feed it into `get_daily` — was the main offender:

    from dataretrieval.waterdata import get_daily, get_monitoring_locations

    sites_df, _ = get_monitoring_locations(
        state_name="Ohio",
        site_type_code="ST",
        skip_geometry=True,
    )

    # Before: HTTP 414 once `sites_df` exceeded ~500 rows.
    # After: transparently chunked into multiple sub-requests, one
    # combined DataFrame returned.
    df, md = get_daily(
        monitoring_location_id=sites_df["monitoring_location_id"].tolist(),
        parameter_code="00060",
        time="P7D",
    )

This patch introduces a joint chunker that models every multi-value
list parameter AND the cql-text `filter` (split on its top-level
`OR` clauses) as a chunkable axis. Greedy halving splits the biggest
chunk across all axes until each sub-request URL fits the limit; the
chunker fans out into multiple HTTP requests under the hood and
returns one combined DataFrame. Callers see no API change.
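To illustrate what splitting a cql-text filter on its top-level `OR` clauses involves, here is a minimal sketch. It is not the splitter that ships in `dataretrieval.waterdata.filters`; the function name `split_top_level_or` is hypothetical, and the sketch assumes single-quoted string literals and parentheses are the only constructs that can hide an `OR`.

```python
def split_top_level_or(expr: str) -> list[str]:
    """Split a filter on OR keywords that sit outside parentheses and quotes."""
    parts: list[str] = []
    depth = 0          # current parenthesis nesting level
    in_quote = False   # True while inside a '...' string literal
    start = 0          # index where the current clause begins
    i, n = 0, len(expr)
    while i < n:
        ch = expr[i]
        if ch == "'":
            # A doubled '' escape toggles twice, so the state is unchanged.
            in_quote = not in_quote
        elif not in_quote:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            elif (
                depth == 0
                and expr[i:i + 2].upper() == "OR"
                and i > 0 and expr[i - 1].isspace()
                and i + 2 < n and expr[i + 2].isspace()
            ):
                # Whitespace-delimited OR at nesting level 0: clause boundary.
                parts.append(expr[start:i].strip())
                start = i + 2
                i += 2
                continue
        i += 1
    parts.append(expr[start:].strip())
    return parts
```

Each returned clause can then be treated as one indivisible atom, and any subset of clauses can be rejoined with ` OR ` to form a valid smaller filter.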
Every axis (a list-shaped kwarg, or the filter split into its
top-level `OR` clauses) is represented by an `_Axis` dataclass: the
args key, the tuple of indivisible atoms (site IDs or clauses), and
the joiner used to compose them back into URL text (`,` for list
axes, ` OR ` for the filter axis). `ChunkPlan` extracts the
chunkable axes for a request and runs greedy halving against the
biggest chunk across all axes until the worst-case sub-request URL
fits. `ChunkedCall` iterates the joint cartesian product of axis
chunks and drives the sub-requests to completion. Requests that
already fit get a trivial single-step plan — one code path either
way.
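The planning step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `chunking.py`: `Axis`, `encoded_len`, `plan`, and `sub_requests` are hypothetical names, the URL cost model (a fixed `base_len` plus one URL-encoded `key=value` pair per axis) is a simplification, and the real planner may pick the chunk to split differently.

```python
from dataclasses import dataclass
from itertools import product
from urllib.parse import quote


@dataclass(frozen=True)
class Axis:
    key: str                 # request-argument name
    atoms: tuple[str, ...]   # indivisible values (site IDs or OR clauses)
    joiner: str              # "," for list axes, " OR " for the filter axis


def encoded_len(axis: Axis, chunk: tuple[str, ...]) -> int:
    """Length of one chunk rendered as a URL-encoded `key=value` pair."""
    return len(quote(f"{axis.key}={axis.joiner.join(chunk)}", safe="="))


def plan(axes: list[Axis], base_len: int, limit: int) -> list[list[tuple[str, ...]]]:
    """Halve the biggest chunk until the worst-case sub-request fits `limit`."""
    chunks = [[ax.atoms] for ax in axes]  # start with one chunk per axis
    while True:
        # The worst case pairs the longest chunk of every axis.
        worst = [
            max(cs, key=lambda c, ax=ax: encoded_len(ax, c))
            for ax, cs in zip(axes, chunks)
        ]
        # The +1 accounts for the "&" separator in front of each pair.
        total = base_len + sum(encoded_len(ax, c) + 1 for ax, c in zip(axes, worst))
        if total <= limit:
            return chunks
        # Only chunks holding more than one atom can still be halved.
        splittable = [
            (encoded_len(ax, c), i)
            for i, (ax, c) in enumerate(zip(axes, worst))
            if len(c) > 1
        ]
        if not splittable:
            raise ValueError("request cannot be split below the URL limit")
        _, i = max(splittable)
        big = worst[i]
        j = chunks[i].index(big)
        mid = len(big) // 2
        chunks[i][j:j + 1] = [big[:mid], big[mid:]]  # replace in place, order kept


def sub_requests(axes: list[Axis], chunks: list[list[tuple[str, ...]]]):
    """Yield one argument dict per element of the cartesian product of chunks."""
    for combo in product(*chunks):
        yield {ax.key: ax.joiner.join(c) for ax, c in zip(axes, combo)}
```

A request that already fits returns from the first loop iteration with one chunk per axis, so the product yields exactly one sub-request and no separate fast path is needed.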
After the first sub-request, `ChunkedCall` reads
`x-ratelimit-remaining`; if the rest of the plan can't fit the
current per-key rate-limit window, it raises `RequestExceedsQuota`
reporting the deficit before burning more budget. Set
`API_USGS_LIMIT=0` to bypass the pre-emptive check.
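A rough sketch of such a pre-emptive check is shown below. The `RequestExceedsQuota` class here is a local stand-in for the one named above, `check_quota` is a hypothetical helper, and reading `API_USGS_LIMIT` from the process environment is an assumption about how the bypass is configured.

```python
import os


class RequestExceedsQuota(RuntimeError):
    """Local stand-in for the exception named in the commit message."""

    def __init__(self, needed: int, remaining: int) -> None:
        self.deficit = needed - remaining
        super().__init__(
            f"plan needs {needed} more requests but only {remaining} remain "
            f"(short by {self.deficit})"
        )


def check_quota(headers: dict[str, str], pending: int) -> None:
    """Raise if `pending` sub-requests exceed the remaining rate-limit budget."""
    # Assumption: the bypass switch is an environment variable.
    if os.environ.get("API_USGS_LIMIT") == "0":
        return
    raw = headers.get("x-ratelimit-remaining")
    if raw is None:
        return  # server did not report a budget; nothing to compare against
    try:
        remaining = int(raw)
    except ValueError:
        return  # unparseable header; skip the check rather than guess
    if pending > remaining:
        raise RequestExceedsQuota(pending, remaining)
```

Failing after the first sub-request rather than midway means the caller learns the size of the shortfall while most of the rate-limit window is still unspent.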
Mid-stream transient failures surface as a `ChunkInterrupted`
subclass — `QuotaExhausted` for HTTP 429, `ServiceInterrupted` for
HTTP 5xx. Both carry the partial result plus a resumable call handle
on `exc.call`:

    import time

    from dataretrieval.waterdata import get_daily
    from dataretrieval.waterdata.chunking import ChunkInterrupted

    try:
        df, md = get_daily(monitoring_location_id=long_list)
    except ChunkInterrupted as exc:
        time.sleep(exc.retry_after or 5 * 60)
        # Re-issues only the still-pending sub-requests; banked work
        # is preserved on `exc.call`.
        df, md = exc.call.resume()

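The resume contract used above (completed sub-requests are banked, only pending ones are re-issued) could be modeled along these lines. `ResumableCall` and `Interrupted` are hypothetical stand-ins for `ChunkedCall` and `ChunkInterrupted`, and using `ConnectionError` as the transient failure is an assumption made to keep the sketch free of HTTP dependencies.

```python
from typing import Callable


class Interrupted(Exception):
    """Carries the partial results plus a handle that can be resumed."""

    def __init__(self, call: "ResumableCall", cause: Exception) -> None:
        super().__init__(str(cause))
        self.call = call
        self.partial = list(call.done.values())


class ResumableCall:
    def __init__(self, steps: list[str], fetch: Callable[[str], object]) -> None:
        self.steps = steps
        self.fetch = fetch
        self.done: dict[int, object] = {}  # banked results keyed by step index

    def resume(self) -> list[object]:
        for i, step in enumerate(self.steps):
            if i in self.done:
                continue  # already banked, so it is never fetched again
            try:
                self.done[i] = self.fetch(step)
            except ConnectionError as err:
                # Surface the failure with the handle attached.
                raise Interrupted(self, err) from err
        return [self.done[i] for i in range(len(self.steps))]
```

Because results are keyed by step index, calling `resume()` again after a failure returns them in plan order regardless of how many interruptions occurred.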
`ChunkedCall.resume` opens one `requests.Session` for the entire
fan-out and publishes it via a `ContextVar` so paginated-loop
helpers downstream (`_walk_pages`, `get_stats_data` via the new
`_paginate` helper) reuse the same connection pool across every
sub-request, saving one TCP/TLS handshake per sub-request after the
first. Measured a 41% wall-clock reduction on a 2000-site / 8-chunk
fan-out against the live USGS API (1.78s shared vs 3.03s
per-sub-request).
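The session-sharing pattern can be sketched with the standard-library `contextvars` module. The names `shared_session` and `current_session` are hypothetical, and the sketch takes a generic `factory` so it runs without third-party packages; in the setting described above the factory would presumably be `requests.Session`.

```python
from contextlib import contextmanager
from contextvars import ContextVar
from typing import Callable, Iterator, Optional

# Holds the session published by the outermost fan-out, if any.
_shared: ContextVar[Optional[object]] = ContextVar("shared_session", default=None)


@contextmanager
def shared_session(factory: Callable[[], object]) -> Iterator[object]:
    """Publish one session for the block; reuse an outer one when present."""
    existing = _shared.get()
    if existing is not None:
        yield existing  # nested call: the outer owner is responsible for closing
        return
    session = factory()
    token = _shared.set(session)
    try:
        yield session
    finally:
        _shared.reset(token)
        close = getattr(session, "close", None)
        if callable(close):
            close()


def current_session() -> Optional[object]:
    """What a downstream helper would call to pick up the shared session."""
    return _shared.get()
```

A `ContextVar` keeps the session out of every function signature between the fan-out driver and the paginator, while staying isolated per thread and per async task.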
Behavior changes to `BaseMetadata` for paginated/chunked calls:
- `BaseMetadata.url` still reflects the user's original query
(unchanged).
- `BaseMetadata.header` now carries the *last* page/sub-request
headers so downstream code that branches on
`x-ratelimit-remaining` sees current state (was: first page's
headers).
- `BaseMetadata.query_time` is now cumulative wall-clock across
every page/sub-request (was: first page's elapsed).
Per-module changes:
- New module `dataretrieval.waterdata.chunking`: joint planner,
exception hierarchy (`_RetryableTransportError`, `RateLimited`,
`ServiceUnavailable`, `RequestTooLarge`, `RequestExceedsQuota`,
`ChunkInterrupted`, `QuotaExhausted`, `ServiceInterrupted`),
`ChunkPlan`, `ChunkedCall`, `multi_value_chunked` decorator,
shared-session ContextVar plumbing.
- `dataretrieval.waterdata.utils`: paginated-loop body consolidated
into a `_paginate` strategy helper that `_walk_pages` and
`get_stats_data` both delegate to; typed transport exceptions
moved out to `chunking` so the layer direction is strictly
`utils → chunking` (no more lazy cross-module import).
- `dataretrieval.waterdata.filters`: existing top-level-OR splitter
and filter-chunkability detector kept as primitives the joint
planner consumes.
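As a hedged illustration of the metadata semantics listed above (headers from the last page, `query_time` summed across pages), a shared pagination loop might look like the sketch below. It is not the `_paginate` helper itself: the `paginate` name, the `fetch_page` callback shape, and the cursor-based paging are all assumptions.

```python
import time
from typing import Callable, Optional


def paginate(
    fetch_page: Callable[[Optional[str]], tuple[list, dict, Optional[str]]],
) -> tuple[list, dict, float]:
    """Walk pages until there is no next cursor.

    `fetch_page(cursor)` returns `(rows, headers, next_cursor)`.
    Returns `(all_rows, last_headers, total_seconds)`.
    """
    rows: list = []
    headers: dict = {}
    elapsed = 0.0
    cursor: Optional[str] = None
    while True:
        started = time.perf_counter()
        page_rows, headers, cursor = fetch_page(cursor)  # headers: latest page wins
        elapsed += time.perf_counter() - started         # cumulative wall-clock
        rows.extend(page_rows)
        if cursor is None:
            return rows, headers, elapsed
```

Passing the page-fetching strategy in as a callback is what would let two different paginators share one loop body.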
80 new unit tests in `tests/waterdata_chunking_test.py` covering
the planner, axis extraction, cartesian-product enumeration,
rate-limit gating, resume idempotency and equivalence, transient-
error classification, shared-session reuse, and a URL-construction
stress test against the real `_construct_api_requests` builder (not
a fake) — 500 USGS site IDs × 20 datetime OR-clauses, asserting
every sub-request URL stays under 8000 bytes and the joint planner
beats the bail-floor worst case. Mid-pagination 429/5xx now also
covered for both the OGC and stats paginators.
Mirrors R `dataRetrieval`'s [#870](DOI-USGS/dataRetrieval#870),
generalized from one filter axis to N joint axes.
Also fixes a handful of pre-existing docstring typos in
`waterdata/api.py` (`meaining` → `meaning`,
`instantanous` → `instantaneous`).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
1 parent 4a65fb1, commit 95bd5f0
9 files changed
Lines changed: 3302 additions & 580 deletions
File tree
- dataretrieval/waterdata
- tests