Skip to content

Commit df25e0d

Browse files
thodson-usgsclaude
andcommitted
Address PR review: lazy max_chunks default, early-exit cap check, doc fixes
Five fixes from the PR review: - ``_plan_chunks`` checks ``total > max_chunks`` inside the halving loop now: each split only grows the cartesian product, so once the cap is crossed it can never come back under. Continuing to halve the URL just wastes work. - ``_plan_chunks``'s ``max_chunks`` default becomes ``int | None = None`` and resolves to ``_DEFAULT_MAX_CHUNKS`` at call time. The previous ``max_chunks: int = _DEFAULT_MAX_CHUNKS`` bound the constant at module-import time, defeating the documented monkeypatch path for direct callers (the wrapper already resolved lazily, but ``_plan_chunks`` direct calls saw the import-time value). - ``_chunk_bytes`` docstring no longer claims the URL-encoded comma overhead is "constant per chunk" — it scales with ``2 * (len - 1)``. The function still uses raw ``,`` length because the planner only needs a monotone comparator across dims, but the wording was wrong. - ``QuotaExhausted.partial_response`` docstring now says "last completed sub-request" to match the bug_001 fix in ``_combine_chunk_responses``. - Module-level docstring drops the chained-query example (duplicated from ``get_daily``'s docstring) and points readers there. No behavior change for existing callers. 209 waterdata tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 13d5e0c commit df25e0d

1 file changed

Lines changed: 36 additions & 37 deletions

File tree

dataretrieval/waterdata/chunking.py

Lines changed: 36 additions & 37 deletions
Original file line numberDiff line numberDiff line change
@@ -4,22 +4,8 @@
44
(e.g. ``monitoring_location_id=USGS-A,USGS-B,...``). Long lists can blow
55
the server's ~8 KB URL byte limit. This module adds a decorator that
66
sits OUTSIDE ``filters.chunked`` and splits multi-value list params
7-
across multiple sub-requests so each URL fits.
8-
9-
Motivating use case: chained queries where one getter feeds the next:
10-
11-
>>> # All stream sites in Ohio, then their daily discharge.
12-
>>> # Without chunking the second call's URL would exceed the
13-
>>> # server's byte limit for any state with > ~500 stations.
14-
>>> sites_df, _ = waterdata.get_monitoring_locations(
15-
... state_name="Ohio",
16-
... site_type="Stream",
17-
... )
18-
>>> df, _ = waterdata.get_daily(
19-
... monitoring_location_id=sites_df["monitoring_location_id"].tolist(),
20-
... parameter_code="00060",
21-
... time="P7D",
22-
... )
7+
across multiple sub-requests so each URL fits. See ``get_daily``'s
8+
docstring for an end-to-end chained-query example.
239
2410
Design (orthogonal to filter chunking):
2511
@@ -138,9 +124,11 @@ class QuotaExhausted(RuntimeError):
138124
Concatenated, deduplicated result of every sub-request that
139125
completed before the floor was crossed.
140126
partial_response : requests.Response
141-
Aggregated response (URL/headers of the first sub-request,
142-
summed ``elapsed``). Wrap in ``BaseMetadata`` to surface to
143-
the caller alongside the partial frame.
127+
Aggregated response (URL/headers of the last completed sub-
128+
request, summed ``elapsed``). The last sub-request's headers
129+
are preferred so callers inspecting ``x-ratelimit-remaining``
130+
see current quota state. Wrap in ``BaseMetadata`` to surface
131+
to the caller alongside the partial frame.
144132
completed_chunks : int
145133
Number of sub-requests successfully completed.
146134
total_chunks : int
@@ -218,9 +206,11 @@ def _filter_aware_probe_args(args: dict[str, Any]) -> dict[str, Any]:
218206
def _chunk_bytes(chunk: list) -> int:
219207
"""Byte length of ``chunk`` when comma-joined into a URL param value.
220208
221-
This is the cost the planner uses to compare chunks across dims; the
222-
real URL builder also URL-encodes the comma, but the byte counts come
223-
out the same modulo a constant per-chunk overhead.
209+
The planner uses this only to compare chunks across dims, so the
210+
raw `",".join` length is sufficient: the real URL builder
211+
URL-encodes each comma to ``%2C`` (adding ``2 * (len(chunk) - 1)``
212+
bytes), but that overhead is monotone in chunk size and doesn't
213+
change the relative ordering this function exists to support.
224214
"""
225215
return len(",".join(map(str, chunk)))
226216

@@ -258,7 +248,7 @@ def _plan_chunks(
258248
args: dict[str, Any],
259249
build_request: Callable[..., Any],
260250
url_limit: int,
261-
max_chunks: int = _DEFAULT_MAX_CHUNKS,
251+
max_chunks: int | None = None,
262252
) -> dict[str, list[list]] | None:
263253
"""Greedy halving until the worst-case sub-request URL fits.
264254
@@ -267,9 +257,16 @@ def _plan_chunks(
267257
- every multi-value param is already a singleton chunk AND the
268258
filter (if any) is already at its smallest OR-clause and the URL
269259
still exceeds ``url_limit`` (irreducible), or
270-
- the converged cartesian-product plan would issue more than
271-
``max_chunks`` sub-requests (hourly API budget).
260+
- the cartesian-product plan exceeds ``max_chunks`` sub-requests
261+
(the hourly API budget); checked after each split so we bail
262+
promptly once the cap is unreachable.
263+
264+
``max_chunks`` defaults to ``_DEFAULT_MAX_CHUNKS`` resolved at call
265+
time, so monkeypatching the module constant takes effect for
266+
direct callers too.
272267
"""
268+
if max_chunks is None:
269+
max_chunks = _DEFAULT_MAX_CHUNKS
273270
chunkable = _chunkable_params(args)
274271
if not chunkable:
275272
return None
@@ -282,7 +279,7 @@ def _plan_chunks(
282279
while True:
283280
worst = _worst_case_args(probe_args, plan)
284281
if _request_bytes(build_request(**worst)) <= url_limit:
285-
break
282+
return plan
286283

287284
# Find the single biggest chunk across all dims and halve it.
288285
best: tuple[str, int, int] | None = None # (dim, chunk_index, size)
@@ -306,15 +303,18 @@ def _plan_chunks(
306303
mid = len(big) // 2
307304
plan[dim] = plan[dim][:idx] + [big[:mid], big[mid:]] + plan[dim][idx + 1 :]
308305

309-
total = math.prod(len(chunks) for chunks in plan.values())
310-
if total > max_chunks:
311-
raise RequestTooLarge(
312-
f"Chunked plan would issue {total} sub-requests, exceeding "
313-
f"max_chunks={max_chunks} (USGS API's default hourly rate "
314-
f"limit per key). Reduce input list sizes, narrow the time "
315-
f"window, or raise max_chunks if you have a higher quota."
316-
)
317-
return plan
306+
# Each split only grows the cartesian product, so once we
307+
# cross max_chunks we can never come back under. Bail now
308+
# rather than keep splitting (the URL probe could still take
309+
# many more iterations).
310+
total = math.prod(len(chunks) for chunks in plan.values())
311+
if total > max_chunks:
312+
raise RequestTooLarge(
313+
f"Chunked plan would issue {total} sub-requests, exceeding "
314+
f"max_chunks={max_chunks} (USGS API's default hourly rate "
315+
f"limit per key). Reduce input list sizes, narrow the time "
316+
f"window, or raise max_chunks if you have a higher quota."
317+
)
318318

319319

320320
_FetchOnce = TypeVar(
@@ -376,13 +376,12 @@ def wrapper(
376376
if url_limit is not None
377377
else filters._WATERDATA_URL_BYTE_LIMIT
378378
)
379-
cap = max_chunks if max_chunks is not None else _DEFAULT_MAX_CHUNKS
380379
floor = (
381380
quota_safety_floor
382381
if quota_safety_floor is not None
383382
else _DEFAULT_QUOTA_SAFETY_FLOOR
384383
)
385-
plan = _plan_chunks(args, build_request, limit, cap)
384+
plan = _plan_chunks(args, build_request, limit, max_chunks)
386385
if plan is None:
387386
return fetch_once(args)
388387

0 commit comments

Comments
 (0)