
Commit 08c36fe

thodson-usgs and claude committed
Budget against max per-clause encoding ratio
The whole-filter ratio is an average; a chunk that happens to contain only the heavier-encoding clauses (e.g. heavy clauses clustered at one end of the filter) can exceed the average ratio and push the full URL a few bytes past _WATERDATA_URL_BYTE_LIMIT. The overflow was invisible in practice, because the gap between the declared 8,000-byte budget and the observed HTTP 414 cliff at 8,200 bytes gave enough headroom, but the computed budget was technically being violated, and a more adversarial clause mix could grow the overflow.

Compute the encoding ratio from the heaviest-encoding clause instead of the whole filter. This adds one extra chunk on adversarial inputs (8 instead of 7 for 100 heavy + 400 light) in exchange for every chunk provably staying under the declared URL limit.

Verified live: the adversarial clustered-heavy filter now produces 8 chunks with a maximum URL length of 7806 bytes, all returning 200 OK.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 6868247 commit 08c36fe
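
The commit message's point that the whole-filter ratio is only an average can be illustrated with a small sketch. It reuses the two clause shapes from the added test. For these particular shapes, the clause named `light` is the one that inflates more under `quote_plus`. The helper name `encoding_ratio` is illustrative and not part of `dataretrieval`.

```python
from urllib.parse import quote_plus


def encoding_ratio(text: str) -> float:
    """URL-encoded length divided by raw length (illustrative helper)."""
    return len(quote_plus(text)) / len(text)


# Clause shapes copied from the test added in this commit.
heavy = (
    "(time >= '2023-01-15T00:00:00Z' AND time <= '2023-01-15T00:30:00Z' "
    "AND approval_status IN ('Approved','Provisional','Revised'))"
)
light = "(time >= '2023-01-15T00:00:00Z' AND time <= '2023-01-15T00:30:00Z')"

clauses = [heavy] * 100 + [light] * 400
expr = " OR ".join(clauses)

# Ratio of the whole filter: an average over every clause and separator.
whole_filter_ratio = encoding_ratio(expr)

# Ratio of the single worst clause: the bound the commit budgets against.
max_clause_ratio = max(encoding_ratio(c) for c in clauses)

print(f"whole filter: {whole_filter_ratio:.4f}")
print(f"max clause:   {max_clause_ratio:.4f}")
```

A chunk made up only of the worst-ratio clause shape inflates at close to the per-clause maximum, which is why a budget derived from the average can be too generous for that chunk.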

2 files changed

Lines changed: 46 additions & 4 deletions


dataretrieval/waterdata/utils.py

Lines changed: 7 additions & 4 deletions
@@ -350,14 +350,17 @@ def _effective_filter_budget(args: dict[str, Any], filter_expr: str) -> int:
        the request with a 1-byte placeholder filter.
     2. Subtract from the URL limit to get the bytes available for the
        encoded filter value.
-    3. Convert back to raw CQL bytes using the filter's own URL-encoding
-       ratio (e.g. uniform time-interval clauses inflate ~1.4x; heavy
-       special-char clauses can inflate more).
+    3. Convert back to raw CQL bytes using the *maximum* per-clause
+       encoding ratio, not the whole-filter average. A chunk can end up
+       containing only the heavier-encoding clauses (e.g. heavy ones
+       clustered at one end of the filter), so budgeting against the
+       average lets such a chunk overflow the URL limit by a few bytes.
     """
     probe = _construct_api_requests(**{**args, "filter": "x"})
     non_filter_url_bytes = len(probe.url) - 1
     available_url_bytes = _WATERDATA_URL_BYTE_LIMIT - non_filter_url_bytes
-    encoding_ratio = len(quote_plus(filter_expr)) / len(filter_expr)
+    parts = _split_top_level_or(filter_expr) or [filter_expr]
+    encoding_ratio = max(len(quote_plus(p)) / len(p) for p in parts if p)
     return max(100, int(available_url_bytes / encoding_ratio))
 
 
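
The budget conversion in this hunk can be sketched in isolation as follows. This is a minimal stand-in under stated assumptions, not the library's implementation: `raw_filter_budget` is a hypothetical name, the naive `" OR "` split replaces `_split_top_level_or` (whose body is not shown in this diff) and ignores nesting and quoted literals, the 8,000-byte limit is taken from the commit message, and the 200 bytes of non-filter URL are made up for the example.

```python
from urllib.parse import quote_plus

URL_BYTE_LIMIT = 8000  # declared budget quoted in the commit message


def raw_filter_budget(
    filter_expr: str,
    non_filter_url_bytes: int,
    url_byte_limit: int = URL_BYTE_LIMIT,
) -> int:
    """Raw CQL bytes that fit once encoded, budgeting for the worst clause."""
    if not filter_expr:
        raise ValueError("filter_expr must be non-empty")
    # Assumption: a flat OR chain, so a plain string split finds the clauses.
    parts = [p for p in filter_expr.split(" OR ") if p] or [filter_expr]
    # Worst-case inflation of any single clause under quote_plus.
    ratio = max(len(quote_plus(p)) / len(p) for p in parts)
    available_url_bytes = url_byte_limit - non_filter_url_bytes
    # Same floor of 100 raw bytes as in the diff above.
    return max(100, int(available_url_bytes / ratio))


heavy = (
    "(time >= '2023-01-15T00:00:00Z' AND time <= '2023-01-15T00:30:00Z' "
    "AND approval_status IN ('Approved','Provisional','Revised'))"
)
light = "(time >= '2023-01-15T00:00:00Z' AND time <= '2023-01-15T00:30:00Z')"
expr = " OR ".join([heavy] * 100 + [light] * 400)

NON_FILTER_URL_BYTES = 200  # hypothetical size of the rest of the URL
budget = raw_filter_budget(expr, NON_FILTER_URL_BYTES)
print(budget)
```

Because the `" OR "` separator encodes to `+OR+` at a ratio of 1, any run of clauses whose raw length is within this budget encodes to at most the available bytes.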

tests/waterdata_utils_test.py

Lines changed: 39 additions & 0 deletions
@@ -337,6 +337,45 @@ def test_effective_filter_budget_respects_url_limit():
     assert len(quote_plus(padded)) <= _WATERDATA_URL_BYTE_LIMIT
 
 
+def test_effective_filter_budget_uses_max_clause_ratio():
+    """Heavy clauses clustered in one part of the filter must not be able
+    to push any chunk over the URL limit. The budget is computed against
+    the max per-clause encoding ratio, not the whole-filter average, so
+    a chunk of only-heaviest-clauses still fits."""
+    from urllib.parse import quote_plus
+
+    heavy = (
+        "(time >= '2023-01-15T00:00:00Z' AND time <= '2023-01-15T00:30:00Z' "
+        "AND approval_status IN ('Approved','Provisional','Revised'))"
+    )
+    light = "(time >= '2023-01-15T00:00:00Z' AND time <= '2023-01-15T00:30:00Z')"
+    # Heavy ratio < light ratio for these shapes; cluster them at opposite
+    # ends so the chunker must produce at least one light-only chunk.
+    clauses = [heavy] * 100 + [light] * 400
+    expr = " OR ".join(clauses)
+    args = {
+        "service": "continuous",
+        "monitoring_location_id": "USGS-02238500",
+        "filter": expr,
+        "filter_lang": "cql-text",
+    }
+    budget = _effective_filter_budget(args, expr)
+    chunks = _chunk_cql_or(expr, max_len=budget)
+    assert len(chunks) > 1
+
+    # Every chunk, once built into a full request, fits under the URL byte
+    # limit — even the all-light chunks that have a higher-than-average ratio.
+    for chunk in chunks:
+        req = _construct_api_requests(**{**args, "filter": chunk})
+        assert len(req.url) <= _WATERDATA_URL_BYTE_LIMIT, (
+            f"chunk url {len(req.url)} exceeds {_WATERDATA_URL_BYTE_LIMIT}"
+        )
+
+    # Budget should be tight enough that a chunk of only-light clauses
+    # (the heavier-encoding shape here) still fits.
+    assert len(quote_plus(light)) * (budget // len(light)) < _WATERDATA_URL_BYTE_LIMIT
+
+
 def test_effective_filter_budget_shrinks_with_more_url_params():
     """Adding more scalar query params consumes URL bytes and should
     shrink the raw filter budget accordingly."""
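
The property the new test checks can be reproduced without the library, as a rough sketch. Everything here is an assumption for illustration: `greedy_chunks` is a hypothetical stand-in for `_chunk_cql_or` (not shown in this diff), and 7,800 available bytes is an invented figure. The sketch packs the same 100 heavy + 400 light clauses under an average-based budget and under a max-clause budget, then compares the largest encoded chunk of each.

```python
from urllib.parse import quote_plus


def greedy_chunks(clauses: list[str], max_len: int, sep: str = " OR ") -> list[str]:
    """Pack clauses in order into chunks whose raw length is <= max_len.

    A single clause longer than max_len still gets a chunk of its own.
    """
    chunks: list[str] = []
    current = ""
    for clause in clauses:
        candidate = clause if not current else current + sep + clause
        if current and len(candidate) > max_len:
            chunks.append(current)  # current chunk is full; start a new one
            current = clause
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


heavy = (
    "(time >= '2023-01-15T00:00:00Z' AND time <= '2023-01-15T00:30:00Z' "
    "AND approval_status IN ('Approved','Provisional','Revised'))"
)
light = "(time >= '2023-01-15T00:00:00Z' AND time <= '2023-01-15T00:30:00Z')"
clauses = [heavy] * 100 + [light] * 400
expr = " OR ".join(clauses)

AVAILABLE_URL_BYTES = 7800  # hypothetical bytes left for the encoded filter

# Old approach: budget from the whole-filter (average) ratio.
avg_ratio = len(quote_plus(expr)) / len(expr)
avg_budget = int(AVAILABLE_URL_BYTES / avg_ratio)

# New approach: budget from the worst single-clause ratio.
max_ratio = max(len(quote_plus(c)) / len(c) for c in clauses)
max_budget = int(AVAILABLE_URL_BYTES / max_ratio)

avg_chunks = greedy_chunks(clauses, avg_budget)
max_chunks = greedy_chunks(clauses, max_budget)

worst_avg = max(len(quote_plus(c)) for c in avg_chunks)
worst_max = max(len(quote_plus(c)) for c in max_chunks)
print(len(avg_chunks), worst_avg)
print(len(max_chunks), worst_max)
```

Chunk counts here depend on the invented `AVAILABLE_URL_BYTES`, so they need not match the 7 and 8 reported in the commit message; the point of the sketch is only that the max-clause budget keeps every encoded chunk within the available bytes.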
