Skip to content

Commit 7002420

Browse files
thodson-usgsclaude
andcommitted
Document filters.py scope; fix three edge-case pitfall gaps
Document the scope explicitly in the module docstring — what gets parsed vs what passes through to the server untouched — since a reviewer came away from the previous version wondering "are we only handling OR?" (answer: yes for chunking, three forms for pitfall detection, everything else passes through). While adding the docs I swept more CQL-text shapes and found three real gaps in the pitfall check: 1. Leading-dot decimals (``value > .5``, ``value > -.5``, ``value > .5e-3``) bypassed the numeric regex because it required ``\d+`` before the decimal point. Users writing fractions this way were getting server 500s instead of the client-side ValueError. Extend ``_NUM`` to also accept ``\.\d+``. 2. ``NOT IN`` / ``NOT BETWEEN`` misattributed the field. The regexes matched with ``field='NOT'`` and the error said ``against 'NOT' (``NOT IN (…)``)``, which is confusing. Add a ``(?!NOT\b)`` lookahead so ``NOT`` can't be captured as a field, and an optional ``(?P<negated>NOT\s+)?`` before the keyword so the error reports ``against 'value' (``value NOT IN (…)``)`` instead. 3. Consequence-free guard: ``NOTES = 'hello'`` (a real field name starting with "NOT") continues to pass — the ``(?!NOT\b)`` lookahead uses a word boundary so only the bare keyword is excluded. Tests: +18 cases on the raise/allow parametrize lists; +5 cases on a new ``test_pitfall_error_names_real_field_not_NOT_keyword`` that pins the attribution contract. 176 non-live tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 194c0b2 commit 7002420

2 files changed

Lines changed: 98 additions & 17 deletions

File tree

dataretrieval/waterdata/filters.py

Lines changed: 63 additions & 17 deletions
Original file line numberDiff line numberDiff line change
@@ -7,7 +7,42 @@
77
``_NUMERIC_COMPARE_RE`` for why), and the ``chunked`` decorator that
88
``utils.py`` applies to its single-request fetch function.
99
10-
Isolation contract (rolling the feature back):
10+
Scope — what this module parses vs. what passes through
11+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
12+
13+
The CQL-text filter the caller supplies is **forwarded verbatim** to
14+
the server as a ``filter=`` query parameter. This module doesn't
15+
parse CQL semantics; it inspects the string for exactly two
16+
purposes:
17+
18+
1. **Chunking** (``_chunk_cql_or``) splits the filter on top-level
19+
``OR`` boundaries so the full URL can stay under the server's
20+
~8 KB limit. Only ``OR`` splits cleanly into independent
21+
sub-requests whose result sets can be union'd back together by
22+
``pd.concat`` + dedup. Everything else — ``AND`` chains,
23+
``NOT``, ``LIKE``, ``IS NULL``, spatial / temporal predicates,
24+
function calls — is sent as a single request. If such a filter
25+
is long enough to trip the URL limit the caller gets the
26+
server's ``414``; we don't try to rewrite it, because no rewrite
27+
preserves set semantics for those shapes.
28+
29+
2. **Pitfall detection** (``_check_numeric_filter_pitfall``) scans
30+
for three patterns where the user almost certainly means a
31+
numeric comparison but the server would do a lexicographic one
32+
(every queryable is string-typed):
33+
34+
- ``<field> <op> <unquoted_num>`` (and the reverse)
35+
- ``<field> [NOT] IN (<unquoted_num>, ...)``
36+
- ``<field> [NOT] BETWEEN <unquoted_num> AND <unquoted_num>``
37+
38+
``LIKE``, ``IS NULL``, function-call RHS (``COUNT(x) > 5``),
39+
``CAST`` expressions, and arithmetic (``value > 1 + 1`` —
40+
partially caught on ``value > 1``) are not flagged. The first
41+
three the server doesn't support anyway; the last is a minor
42+
offense-text imprecision rather than a correctness issue.
43+
44+
Isolation contract (rolling the feature back)
45+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1146
1247
- ``dataretrieval/waterdata/filters.py`` and
1348
``tests/waterdata_filters_test.py`` can be deleted wholesale.
@@ -22,9 +57,11 @@
2257
getters.
2358
- ``__init__.py``: drop the ``FILTER_LANG`` re-export.
2459
25-
Exports:
26-
- ``FILTER_LANG`` — type alias used by ``api.py`` and re-exported.
27-
- ``chunked`` — decorator used by ``utils.py`` on its fetch helper.
60+
Exports
61+
~~~~~~~
62+
63+
- ``FILTER_LANG`` — type alias used by ``api.py`` and re-exported.
64+
- ``chunked`` — decorator used by ``utils.py`` on its fetch helper.
2865
2966
Everything else in this module is private (leading underscore).
3067
"""
@@ -93,8 +130,12 @@
93130
# numeric>`` is a bug — quote the literal or drop the comparison and
94131
# filter in pandas.
95132

96-
# Unquoted numeric literal: integer, decimal, or scientific notation.
97-
_NUM = r"-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?"
133+
# Unquoted numeric literal: integer, decimal (with or without leading
134+
# zero), or scientific notation. ``\d+(?:\.\d+)?`` covers ``1``,
135+
# ``1.5``; ``\.\d+`` covers the leading-dot form ``.5`` that users
136+
# sometimes write as a fraction. Trailing-dot ``5.`` is deliberately
137+
# not accepted (not a common numeric spelling and not required CQL).
138+
_NUM = r"-?(?:\d+(?:\.\d+)?|\.\d+)(?:[eE][+-]?\d+)?"
98139
_IDENT = r"[A-Za-z_]\w*"
99140
_OP = r">=|<=|<>|!=|==|=|>|<"
100141

@@ -113,20 +154,23 @@
113154
re.VERBOSE,
114155
)
115156

116-
# ``<field> IN (<numeric>, ...)`` — same footgun as simple comparison
117-
# but using the list form. Caught separately because ``IN`` isn't one
118-
# of the comparison operators in ``_OP``. We only need to see one
119-
# unquoted numeric inside the parentheses to know the user intends
120-
# numeric membership.
157+
# ``<field> [NOT] IN (<numeric>, ...)`` — same footgun as a simple
158+
# comparison but using the list form. Caught separately because ``IN``
159+
# isn't one of the comparison operators in ``_OP``. The ``(?!NOT\b)``
160+
# lookahead keeps the bare keyword ``NOT`` from being captured as the
161+
# field when the caller writes ``value NOT IN (…)``; the optional
162+
# ``(?:NOT\s+)?`` after the field captures the negation so the error
163+
# message can report the offending form accurately.
121164
_IN_NUMERIC_RE = re.compile(
122-
rf"\b(?P<field>{_IDENT})\s+IN\s*\(\s*{_NUM}",
165+
rf"\b(?!NOT\b)(?P<field>{_IDENT})\s+(?P<negated>NOT\s+)?IN\s*\(\s*{_NUM}",
123166
re.IGNORECASE,
124167
)
125168

126-
# ``<field> BETWEEN <numeric> AND <numeric>`` — range form of the same
127-
# footgun.
169+
# ``<field> [NOT] BETWEEN <numeric> AND <numeric>`` — range form with
170+
# the same ``NOT`` handling as above.
128171
_BETWEEN_NUMERIC_RE = re.compile(
129-
rf"\b(?P<field>{_IDENT})\s+BETWEEN\s+{_NUM}\s+AND\s+{_NUM}\b",
172+
rf"\b(?!NOT\b)(?P<field>{_IDENT})\s+(?P<negated>NOT\s+)?"
173+
rf"BETWEEN\s+{_NUM}\s+AND\s+{_NUM}\b",
130174
re.IGNORECASE,
131175
)
132176

@@ -323,12 +367,14 @@ def _check_numeric_filter_pitfall(filter_expr: str) -> None:
323367
membership = _IN_NUMERIC_RE.search(masked)
324368
if membership:
325369
field = membership.group("field")
326-
_raise_pitfall(field, f"{field} IN (…)")
370+
op = "NOT IN" if membership.group("negated") else "IN"
371+
_raise_pitfall(field, f"{field} {op} (…)")
327372

328373
between = _BETWEEN_NUMERIC_RE.search(masked)
329374
if between:
330375
field = between.group("field")
331-
_raise_pitfall(field, f"{field} BETWEEN …")
376+
op = "NOT BETWEEN" if between.group("negated") else "BETWEEN"
377+
_raise_pitfall(field, f"{field} {op} …")
332378

333379

334380
def _raise_pitfall(field: str, offense: str) -> None:

tests/waterdata_filters_test.py

Lines changed: 35 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -480,13 +480,23 @@ def fake_construct_api_requests(**kwargs):
480480
"value > 1e5",
481481
"value >= 2.5E+3",
482482
"value < 1.5e-3",
483+
# Leading-dot decimals (``.5`` is a fraction, not a typo)
484+
"value > .5",
485+
"value >= -.5",
486+
"value < .5e-3",
483487
# ``IN`` list form — same footgun, common pattern for codes
484488
"parameter_code IN (60, 61)",
485489
"value IN (10, 20, 30)",
486490
"statistic_id in (11)", # case-insensitive, single-element
491+
# ``NOT IN`` with numbers — same footgun via negation
492+
"value NOT IN (1, 2, 3)",
493+
"parameter_code not in (60, 61)",
487494
# ``BETWEEN`` range form — same footgun
488495
"value BETWEEN 5 AND 10",
489496
"channel_flow between 100 and 500",
497+
# ``NOT BETWEEN`` with numbers
498+
"value NOT BETWEEN 0 AND 100",
499+
"channel_flow not between 50 and 150",
490500
# Composite expressions
491501
"time >= '2023-01-01T00:00:00Z' AND value >= 1000",
492502
"value > 1000 OR value < 0",
@@ -529,6 +539,10 @@ def test_check_numeric_filter_pitfall_raises(expr):
529539
"parameter_code = '00060' AND statistic_id = '00011'",
530540
# CQL escape-quote (``O''Reilly``) within a quoted literal
531541
"name = 'O''Reilly 1000'",
542+
# Identifiers that start with "NOT" (e.g. ``NOTES``) must not be
543+
# mistakenly treated as the CQL negation keyword
544+
"NOTES = 'hello'",
545+
"NOTE_VAL LIKE 'anything%'",
532546
],
533547
)
534548
def test_check_numeric_filter_pitfall_allows(expr):
@@ -537,6 +551,27 @@ def test_check_numeric_filter_pitfall_allows(expr):
537551
_check_numeric_filter_pitfall(expr) # must not raise
538552

539553

554+
@pytest.mark.parametrize(
555+
"expr,field,op",
556+
[
557+
("value NOT IN (1, 2)", "value", "NOT IN"),
558+
("parameter_code NOT IN (60, 61)", "parameter_code", "NOT IN"),
559+
("value IN (1, 2)", "value", "IN"),
560+
("value NOT BETWEEN 0 AND 10", "value", "NOT BETWEEN"),
561+
("channel_flow between 100 and 500", "channel_flow", "BETWEEN"),
562+
],
563+
)
564+
def test_pitfall_error_names_real_field_not_NOT_keyword(expr, field, op):
565+
"""The CQL keyword ``NOT`` must not be reported as the offending field
566+
— the error should identify the actual column and include ``NOT`` as
567+
part of the operator form so the caller knows what to quote."""
568+
with pytest.raises(ValueError) as exc:
569+
_check_numeric_filter_pitfall(expr)
570+
msg = str(exc.value)
571+
assert f"against {field!r}" in msg, msg
572+
assert op.upper() in msg.upper(), msg
573+
574+
540575
def test_get_continuous_surfaces_pitfall_to_caller():
541576
"""End-to-end: the check runs at the ``get_continuous`` boundary,
542577
not as a deep internal-only protection, so callers see the error

0 commit comments

Comments
 (0)