Skip to content

Commit c7ef181

Browse files
thodson-usgsclaude
andcommitted
Probe with longest OR-clause, not shortest
The filter chunker (`filters.chunked`) splits a filter into chunks each ≤ the per-sub-request budget, but bails (returns the full filter unchanged) when ANY single OR-clause exceeds the budget. So the smallest filter size the inner chunker can guarantee to emit per sub-request is bounded below by the LARGEST single clause, not the smallest. The original implementation used `min(parts)` to model the chunker's output floor. For filters with uniform clause sizes (all my prior tests), min == max and the bug was hidden. For filters with lopsided clauses — e.g. `id='1' OR id='abcdef…long-string'` — using `min` would let the planner falsely declare a plan feasible. The inner chunker would then bail on the large clause, the real per-sub-request URL would carry the full filter, and the request would 414 server-side. Switch to `max(parts, key=len)`. If singleton+max-clause fits the URL limit, the inner chunker's budget is ≥ max(parts), so all clauses fit individually and chunking succeeds. If singleton+max-clause doesn't fit, the planner correctly raises `RequestTooLarge` instead of producing an unservable plan. Regression test: `test_plan_chunks_probes_with_max_clause_not_min`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 22a09c7 commit c7ef181

2 files changed

Lines changed: 47 additions & 18 deletions

File tree

dataretrieval/waterdata/chunking.py

Lines changed: 19 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -19,13 +19,14 @@
1919
determines output schema and chunking it would shard columns.
2020
2121
Coordination with ``filters.chunked``:
22-
The planner probes URL length using the SHORTEST top-level OR-clause
22+
The planner probes URL length using the LONGEST top-level OR-clause
2323
when a chunkable filter is present, not the full filter. ``filters.
24-
chunked`` (inner) will split the filter per sub-request, so probing
25-
with the smallest clause models the per-sub-request URL the stack will
26-
actually produce. Without this, a long OR-filter plus multi-value
27-
lists would trigger a premature ``RequestTooLarge`` even though the
28-
combined chunkers would have made things fit.
24+
chunked`` (inner) will split the filter per sub-request but bails if
25+
any single clause exceeds its budget, so the longest clause is the
26+
smallest filter size the stack is guaranteed to emit. Without this
27+
coordination, a long OR-filter plus multi-value lists would trigger a
28+
premature ``RequestTooLarge`` even though the combined chunkers would
29+
have made things fit.
2930
"""
3031

3132
from __future__ import annotations
@@ -155,23 +156,27 @@ def _chunkable_params(args: dict[str, Any]) -> dict[str, list]:
155156

156157

157158
def _filter_aware_probe_args(args: dict[str, Any]) -> dict[str, Any]:
158-
"""Substitute the filter with its shortest top-level OR-clause if the
159+
"""Substitute the filter with its LONGEST top-level OR-clause if the
159160
filter is chunkable, otherwise return ``args`` unchanged.
160161
161-
The inner ``filters.chunked`` decorator will reduce the filter per
162-
sub-request to at most one OR-clause (its hard floor — see
163-
``_chunk_cql_or``). Probing with that minimum models the per-sub-
164-
request URL the decorator stack will actually emit, so we don't
165-
plan around bytes the filter chunker has already promised to remove.
162+
The inner ``filters.chunked`` decorator splits a filter into chunks
163+
each ≤ the per-sub-request byte budget, but bails (returns the full
164+
filter unchanged) when ANY single OR-clause exceeds the budget. So
165+
the smallest filter the inner chunker is guaranteed to emit per
166+
sub-request is bounded below by the largest single clause — not the
167+
smallest. Probing with ``max(parts, key=len)`` models the worst
168+
achievable per-sub-request URL the decorator stack can produce; if
169+
that fits, we know the inner chunker won't bail and the actual URL
170+
will fit too.
166171
"""
167172
filter_expr = args.get("filter")
168173
filter_lang = args.get("filter_lang")
169174
if not _is_chunkable(filter_expr, filter_lang):
170175
return args
171176
parts = _split_top_level_or(filter_expr)
172177
if len(parts) < 2:
173-
return args # one-clause filter — filter chunker can't shrink it
174-
return {**args, "filter": min(parts, key=len)}
178+
return args # one-clause filter — inner chunker can't shrink it
179+
return {**args, "filter": max(parts, key=len)}
175180

176181

177182
def _worst_case_args(

tests/waterdata_test.py

Lines changed: 28 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -225,11 +225,14 @@ def test_filter_aware_probe_args_passes_through_when_not_chunkable():
225225
assert _filter_aware_probe_args(args) == args
226226

227227

228-
def test_filter_aware_probe_args_substitutes_shortest_or_clause():
229-
"""Chunkable filter → return args with filter replaced by shortest clause."""
228+
def test_filter_aware_probe_args_substitutes_longest_or_clause():
229+
"""Chunkable filter → return args with filter replaced by the LONGEST
230+
clause. The inner filter chunker bails if any single clause exceeds
231+
the per-sub-request budget, so the longest clause is the smallest
232+
filter size the chunker can guarantee to emit per sub-request."""
230233
args = {"filter": "a='1' OR a='22' OR a='333'", "x": 7}
231234
probe = _filter_aware_probe_args(args)
232-
assert probe["filter"] == "a='1'"
235+
assert probe["filter"] == "a='333'"
233236
assert probe["x"] == 7
234237
assert args["filter"] == "a='1' OR a='22' OR a='333'" # input not mutated
235238

@@ -288,7 +291,9 @@ def test_plan_chunks_coordinates_with_filter_chunker():
288291
"filter": " OR ".join(clauses),
289292
}
290293
# singleton+full-filter ≈ 200 + 10 + 86 = 296 (over limit 240) — would raise.
291-
# min-clause probe model ≈ 200 + 10 + 5 = 215 (under limit) — plan succeeds.
294+
# max-clause probe model ≈ 200 + 10 + 5 = 215 (under limit) — plan succeeds.
295+
# (Here all clauses are the same length, so min==max; the fix matters for
296+
# filters with lopsided clauses where min < max.)
292297
plan = _plan_chunks(args, _fake_build, url_limit=240)
293298
assert plan is not None # coordination prevented the premature raise
294299
assert len(plan["monitoring_location_id"]) > 1 # planner did split
@@ -306,6 +311,25 @@ def test_plan_chunks_coordinates_with_filter_chunker():
306311
ch._filter_aware_probe_args = saved
307312

308313

314+
def test_plan_chunks_probes_with_max_clause_not_min():
315+
"""Regression: with lopsided OR-clauses (one short, one long), probing
316+
with min(parts) lets the planner falsely declare a plan feasible that
317+
the inner filter chunker can't actually deliver — it bails when any
318+
single clause exceeds the per-sub-request budget. Probing with the
319+
longest clause is the safe lower bound on per-sub-request filter
320+
size, so the planner correctly raises when no plan can fit."""
321+
args = {
322+
"sites": ["A" * 10, "B" * 10],
323+
"filter": "x='1' OR x='" + "a" * 28 + "'", # 5-char and 33-char clauses
324+
}
325+
# base 200 + singleton sites 10 + min-clause 5 = 215 (limit 230 -> fits)
326+
# base 200 + singleton sites 10 + max-clause 33 = 243 (limit 230 -> exceeds)
327+
# With min: planner succeeds, but real URL with full filter (42) exceeds
328+
# 230 -> server 414. With max: planner raises early, as it should.
329+
with pytest.raises(RequestTooLarge):
330+
_plan_chunks(args, _fake_build, url_limit=230)
331+
332+
309333
def test_plan_chunks_still_raises_when_even_min_clause_doesnt_fit():
310334
"""If the limit is so tight that singleton + shortest-clause STILL
311335
exceeds it, filter chunker can't save us either — raise."""

0 commit comments

Comments
 (0)