Skip to content

Commit 526a7e4

Browse files
thodson-usgsclaude
andcommitted
Tighten chunking docstrings for accuracy and clarity
Five docstring fixes, no behavior change: 1. ``_plan_chunks`` docstring described the first ``RequestTooLarge`` case as "filter (if any) is already at its smallest OR-clause" — same wrong-direction error already fixed on ``RequestTooLarge``. The planner probes at the inner chunker's BAIL floor (longest clause's URL contribution), not the smallest. Rewrite to "the smallest reducible plan" with the bail-floor clarification. 2. The ``RequestTooLarge`` raised inside the halving loop carried a matching wrong-direction phrase ("any chunkable filter reduced to one OR-clause"). Rewrite to "any chunkable filter at the inner chunker's bail-floor size" and broaden the user advice to include "shorten the filter". 3. ``_chunk_bytes`` docstring claimed it's "indirectly used as the URL contribution estimate" — that's not what the code does; the function is a comparator only. Trim the misleading sentence and keep the rationale for ``quote_plus`` over raw join length. 4. ``_worst_case_args`` docstring's "with the filter already reduced to its filter-chunker floor" was oblique. Rewrite to make the chain explicit: caller passes ``probe_args`` (already through ``_filter_aware_probe_args``), and this function uses each dim's largest chunk against that. 5. The ``_DEFAULT_MAX_CHUNKS`` module comment said it's "read lazily in the wrapper" — stale; ``_plan_chunks`` now also resolves lazily. Update to "both the decorator wrapper and ``_plan_chunks``". 209 waterdata tests pass; ruff clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 3b11ce5 commit 526a7e4

1 file changed

Lines changed: 25 additions & 19 deletions

File tree

dataretrieval/waterdata/chunking.py

Lines changed: 25 additions & 19 deletions
Original file line numberDiff line numberDiff line change
@@ -84,7 +84,8 @@
8484
# 1000 matches the default hourly quota and is a reasonable upper bound
8585
# for single-page sub-requests; tune lower if your queries paginate.
8686
# Override per-decorator via ``max_chunks=`` or by monkeypatching this
87-
# module attribute (read lazily in the wrapper).
87+
# module attribute — both the decorator wrapper and ``_plan_chunks``
88+
# read it lazily.
8889
_DEFAULT_MAX_CHUNKS = 1000
8990

9091
# When ``x-ratelimit-remaining`` drops below this between sub-requests,
@@ -203,14 +204,14 @@ def _chunk_bytes(chunk: list[Any]) -> int:
203204
"""URL-encoded byte length of ``chunk`` when comma-joined into a
204205
URL parameter value.
205206
206-
Used both as the planner's biggest-chunk comparator (in
207-
``_worst_case_args`` and the halving loop) and indirectly as the
208-
URL contribution estimate. Sizing with ``quote_plus`` rather than
209-
raw ``,``-join length avoids mis-ranking chunks whose values
210-
contain characters that expand under URL encoding (``%``, ``+``,
211-
``/``, ``&``, etc.) — for typical USGS multi-value workloads
212-
(alphanumeric IDs and codes) the two are equal, but the encoded
213-
form is always correct.
207+
Used as the planner's biggest-chunk comparator in
208+
``_worst_case_args`` and the halving loop. ``quote_plus`` (rather
209+
than raw ``,``-join length) keeps the comparator faithful to what
210+
the real URL builder produces, so values containing characters
211+
that expand under URL encoding (``%``, ``+``, ``/``, ``&``, …)
212+
can't be mis-ranked. For typical USGS multi-value workloads
213+
(alphanumeric IDs and codes) raw and encoded lengths are equal,
214+
but the encoded form is always correct.
214215
"""
215216
return len(quote_plus(",".join(map(str, chunk))))
216217

@@ -235,9 +236,12 @@ def _request_bytes(req: requests.PreparedRequest) -> int:
235236
def _worst_case_args(
236237
probe_args: dict[str, Any], plan: dict[str, list[list[Any]]]
237238
) -> dict[str, Any]:
238-
"""Args dict using the LARGEST chunk from each dim — represents the
239-
most byte-heavy sub-request the plan will issue, with the filter
240-
already reduced to its filter-chunker floor."""
239+
"""Args representing the worst-case sub-request the plan will issue:
240+
each dim's largest chunk (by URL-encoded bytes), against the
241+
``probe_args`` already returned by ``_filter_aware_probe_args``
242+
(so any chunkable filter is at the inner chunker's bail-floor
243+
size). Driving the URL probe with this args dict tells the
244+
planner whether the *biggest* sub-request it would emit fits."""
241245
out = dict(probe_args)
242246
for k, chunks in plan.items():
243247
out[k] = max(chunks, key=_chunk_bytes)
@@ -254,9 +258,9 @@ def _plan_chunks(
254258
255259
Returns ``None`` when no chunking is needed (request as-is fits or
256260
no chunkable lists). Raises ``RequestTooLarge`` when:
257-
- every multi-value param is already a singleton chunk AND the
258-
filter (if any) is already at its smallest OR-clause and the URL
259-
still exceeds ``url_limit`` (irreducible), or
261+
- the smallest reducible plan still exceeds ``url_limit`` (every
262+
multi-value param at a singleton chunk and any chunkable filter
263+
already at the inner chunker's bail-floor size), or
260264
- the cartesian-product plan exceeds ``max_chunks`` sub-requests
261265
(the hourly API budget); checked after each split so we bail
262266
promptly once the cap is unreachable.
@@ -291,10 +295,12 @@ def _plan_chunks(
291295
biggest = max(splittable, key=lambda t: _chunk_bytes(t[2]), default=None)
292296
if biggest is None:
293297
raise RequestTooLarge(
294-
f"Request exceeds {url_limit} bytes (URL + body) even "
295-
f"with every multi-value parameter at a singleton chunk "
296-
f"and any chunkable filter reduced to one OR-clause. "
297-
f"Reduce the number of values or split the call manually."
298+
f"Request exceeds {url_limit} bytes (URL + body) at the "
299+
f"smallest reducible plan: every multi-value parameter "
300+
f"at a singleton chunk and any chunkable filter at the "
301+
f"inner chunker's bail-floor size. Reduce the number "
302+
f"of values, shorten the filter, or split the call "
303+
f"manually."
298304
)
299305
dim, idx, chunk = biggest
300306
mid = len(chunk) // 2

0 commit comments

Comments
 (0)