Commit bd8c09b

thodson-usgs and claude committed
Add CQL filter passthrough to OGC waterdata getters
Every `get_*` function that targets an OGC collection (`continuous`, `daily`, `field_measurements`, `monitoring_locations`, `time_series_metadata`, `latest_continuous`, `latest_daily`, `channel`) now accepts `filter` and `filter_lang` kwargs that are forwarded as the OGC `filter` / `filter-lang` query parameters.

This unlocks server-side expressions that aren't expressible via the other kwargs. The motivating use case is pulling one-shot windows of continuous data around many field-measurement timestamps in a single request via OR'd BETWEEN clauses, instead of N round-trips.

Caveats documented in each docstring and NEWS.md:

- The server currently accepts `cql-text` (default) and `cql-json`; `cql2-text` / `cql2-json` are not yet supported.
- Long filters can exceed the URI length limit. A `UserWarning` is emitted above 5000 characters and the practical cap is around 75 OR-clauses before the server returns HTTP 414.

Includes unit tests covering the filter / filter-lang URL construction for all OGC services and the long-filter warning.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
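To make the motivating use case concrete, here is a minimal sketch of how the OR'd time-window expression might be assembled from a list of field-measurement timestamps. The helper name `build_time_window_filter` and the one-hour half-width are assumptions for illustration and are not part of this commit; the clause shape mirrors the `time >= ... AND time <= ...` form used in the commit's unit tests.

```python
from datetime import datetime, timedelta, timezone


def build_time_window_filter(timestamps, half_width=timedelta(hours=1)):
    """Return a cql-text expression OR-ing one time window per timestamp.

    Hypothetical helper: not part of dataretrieval.
    """
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    clauses = []
    for ts in timestamps:
        ts = ts.astimezone(timezone.utc)  # normalise to UTC before formatting
        start = (ts - half_width).strftime(fmt)
        end = (ts + half_width).strftime(fmt)
        clauses.append(f"(time >= '{start}' AND time <= '{end}')")
    return " OR ".join(clauses)


# Two example visit times (timezone-aware, UTC).
visits = [
    datetime(2023, 1, 6, 17, 0, tzinfo=timezone.utc),
    datetime(2023, 1, 10, 19, 0, tzinfo=timezone.utc),
]
expr = build_time_window_filter(visits)
print(expr)
```

The resulting string could then be passed as the `filter` kwarg of one of the getters named above, for example `get_continuous(..., filter=expr)`.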
1 parent 224f4bc commit bd8c09b

4 files changed

Lines changed: 238 additions & 1 deletion

NEWS.md

Lines changed: 2 additions & 0 deletions
@@ -1,3 +1,5 @@
+**04/22/2026:** The OGC `waterdata` getters (`get_continuous`, `get_daily`, `get_field_measurements`, and others) now accept `filter` and `filter_lang` kwargs that are passed through to the service's CQL filter parameter. This enables advanced server-side filtering that isn't expressible via the other kwargs, most commonly OR'ing multiple time ranges into a single request. The server currently accepts `cql-text` (default) and `cql-json`; long filters can exceed the URI length limit, so `dataretrieval` warns above 5000 characters and the practical cap is around 75 OR-clauses before the server returns HTTP 414.
+
 **12/04/2025:** The `get_continuous()` function was added to the `waterdata` module, which provides access to measurements collected via automated sensors at a high frequency (often 15 minute intervals) at a monitoring location. This is an early version of the continuous endpoint and should be used with caution as the API team improves its performance. In the future, we anticipate the addition of an endpoint(s) specifically for handling large data requests, so it may make sense for power users to hold off on heavy development using the new continuous endpoint.
 
 **11/24/2025:** `dataretrieval` is pleased to offer a new module, `waterdata`, which gives users access to USGS's modernized [Water Data APIs](https://api.waterdata.usgs.gov/). The Water Data API endpoints include daily values, instantaneous values, field measurements (modernized groundwater levels service), time series metadata, and discrete water quality data from the Samples database. Though there will be a period of overlap, the functions within `waterdata` will eventually replace the `nwis` module, which currently provides access to the legacy [NWIS Water Services](https://waterservices.usgs.gov/). More example workflows and functions coming soon. Check `help(waterdata)` for more information.
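Since the NEWS entry notes that the library warns above 5000 characters, one way a caller might stay under that threshold is to pack clauses greedily into several shorter filters and issue one request per batch. The sketch below is an assumption about how that could look; `batch_clauses` is a hypothetical helper, not a function in `dataretrieval`.

```python
def batch_clauses(clauses, max_chars=5000, joiner=" OR "):
    """Greedily pack clauses into filter strings of at most max_chars.

    Hypothetical helper. A single clause longer than max_chars is still
    emitted on its own, so such a batch can exceed the limit.
    """
    batches, current, length = [], [], 0
    for clause in clauses:
        # Cost of appending this clause, including the joiner if needed.
        added = len(clause) + (len(joiner) if current else 0)
        if current and length + added > max_chars:
            batches.append(joiner.join(current))
            current, length = [], 0
            added = len(clause)
        current.append(clause)
        length += added
    if current:
        batches.append(joiner.join(current))
    return batches


clause = "(time >= '2023-01-01T00:00:00Z' AND time <= '2023-01-01T00:30:00Z')"
filters = batch_clauses([clause] * 100)
print(len(filters), [len(f) for f in filters])
```

Each element of `filters` can then be sent as its own `filter` value and the resulting data frames concatenated by the caller.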

dataretrieval/waterdata/api.py

Lines changed: 112 additions & 0 deletions
@@ -51,6 +51,8 @@ def get_daily(
     time: str | list[str] | None = None,
     bbox: list[float] | None = None,
     limit: int | None = None,
+    filter: str | None = None,
+    filter_lang: str | None = None,
     convert_type: bool = True,
 ) -> tuple[pd.DataFrame, BaseMetadata]:
     """Daily data provide one data value to represent water conditions for the
@@ -177,6 +179,18 @@ def get_daily(
         allowable limit is 50000. It may be beneficial to set this number lower
         if your internet connection is spotty. The default (NA) will set the
         limit to the maximum allowable limit for the service.
+    filter : string, optional
+        A CQL text or JSON expression passed through to the OGC API
+        ``filter`` query parameter. Commonly used to OR several time
+        ranges into a single request. At the time of writing the server
+        accepts ``cql-text`` (default) and ``cql-json``; ``cql2-text`` /
+        ``cql2-json`` are not yet supported. Long filters can exceed the
+        server's URI length limit: a warning is emitted above 5000
+        characters, and the practical cap is ~75 OR-clauses before the
+        server returns HTTP 414.
+    filter_lang : string, optional
+        Language of the ``filter`` expression, for example ``cql-text``
+        (default) or ``cql-json``. Sent as ``filter-lang`` in the URL.
     convert_type : boolean, optional
         If True, converts columns to appropriate types.
 
@@ -228,6 +242,8 @@ def get_continuous(
     last_modified: str | None = None,
     time: str | list[str] | None = None,
     limit: int | None = None,
+    filter: str | None = None,
+    filter_lang: str | None = None,
     convert_type: bool = True,
 ) -> tuple[pd.DataFrame, BaseMetadata]:
     """
@@ -348,6 +364,18 @@ def get_continuous(
         allowable limit is 10000. It may be beneficial to set this number lower
         if your internet connection is spotty. The default (NA) will set the
         limit to the maximum allowable limit for the service.
+    filter : string, optional
+        A CQL text or JSON expression passed through to the OGC API
+        ``filter`` query parameter. Commonly used to OR several time
+        ranges into a single request. At the time of writing the server
+        accepts ``cql-text`` (default) and ``cql-json``; ``cql2-text`` /
+        ``cql2-json`` are not yet supported. Long filters can exceed the
+        server's URI length limit: a warning is emitted above 5000
+        characters, and the practical cap is ~75 OR-clauses before the
+        server returns HTTP 414.
+    filter_lang : string, optional
+        Language of the ``filter`` expression, for example ``cql-text``
+        (default) or ``cql-json``. Sent as ``filter-lang`` in the URL.
     convert_type : boolean, optional
         If True, the function will convert the data to dates and qualifier to
         string vector
@@ -426,6 +454,8 @@ def get_monitoring_locations(
     time: str | list[str] | None = None,
     bbox: list[float] | None = None,
     limit: int | None = None,
+    filter: str | None = None,
+    filter_lang: str | None = None,
     convert_type: bool = True,
 ) -> tuple[pd.DataFrame, BaseMetadata]:
     """Location information is basic information about the monitoring location
@@ -635,6 +665,18 @@ def get_monitoring_locations(
         The returning object will be a data frame with no spatial information.
         Note that the USGS Water Data APIs use camelCase "skipGeometry" in
         CQL2 queries.
+    filter : string, optional
+        A CQL text or JSON expression passed through to the OGC API
+        ``filter`` query parameter. Commonly used to OR several time
+        ranges into a single request. At the time of writing the server
+        accepts ``cql-text`` (default) and ``cql-json``; ``cql2-text`` /
+        ``cql2-json`` are not yet supported. Long filters can exceed the
+        server's URI length limit: a warning is emitted above 5000
+        characters, and the practical cap is ~75 OR-clauses before the
+        server returns HTTP 414.
+    filter_lang : string, optional
+        Language of the ``filter`` expression, for example ``cql-text``
+        (default) or ``cql-json``. Sent as ``filter-lang`` in the URL.
     convert_type : boolean, optional
         If True, converts columns to appropriate types.
 
@@ -697,6 +739,8 @@ def get_time_series_metadata(
     time: str | list[str] | None = None,
     bbox: list[float] | None = None,
     limit: int | None = None,
+    filter: str | None = None,
+    filter_lang: str | None = None,
     convert_type: bool = True,
 ) -> tuple[pd.DataFrame, BaseMetadata]:
     """Daily data and continuous measurements are grouped into time series,
@@ -851,6 +895,18 @@ def get_time_series_metadata(
         allowable limit is 50000. It may be beneficial to set this number lower
         if your internet connection is spotty. The default (None) will set the
         limit to the maximum allowable limit for the service.
+    filter : string, optional
+        A CQL text or JSON expression passed through to the OGC API
+        ``filter`` query parameter. Commonly used to OR several time
+        ranges into a single request. At the time of writing the server
+        accepts ``cql-text`` (default) and ``cql-json``; ``cql2-text`` /
+        ``cql2-json`` are not yet supported. Long filters can exceed the
+        server's URI length limit: a warning is emitted above 5000
+        characters, and the practical cap is ~75 OR-clauses before the
+        server returns HTTP 414.
+    filter_lang : string, optional
+        Language of the ``filter`` expression, for example ``cql-text``
+        (default) or ``cql-json``. Sent as ``filter-lang`` in the URL.
     convert_type : boolean, optional
         If True, converts columns to appropriate types.
 
@@ -903,6 +959,8 @@ def get_latest_continuous(
     time: str | list[str] | None = None,
     bbox: list[float] | None = None,
     limit: int | None = None,
+    filter: str | None = None,
+    filter_lang: str | None = None,
     convert_type: bool = True,
 ) -> tuple[pd.DataFrame, BaseMetadata]:
     """This endpoint provides the most recent observation for each time series
@@ -1026,6 +1084,18 @@ def get_latest_continuous(
         allowable limit is 50000. It may be beneficial to set this number lower
         if your internet connection is spotty. The default (None) will set the
         limit to the maximum allowable limit for the service.
+    filter : string, optional
+        A CQL text or JSON expression passed through to the OGC API
+        ``filter`` query parameter. Commonly used to OR several time
+        ranges into a single request. At the time of writing the server
+        accepts ``cql-text`` (default) and ``cql-json``; ``cql2-text`` /
+        ``cql2-json`` are not yet supported. Long filters can exceed the
+        server's URI length limit: a warning is emitted above 5000
+        characters, and the practical cap is ~75 OR-clauses before the
+        server returns HTTP 414.
+    filter_lang : string, optional
+        Language of the ``filter`` expression, for example ``cql-text``
+        (default) or ``cql-json``. Sent as ``filter-lang`` in the URL.
     convert_type : boolean, optional
         If True, converts columns to appropriate types.
 
@@ -1075,6 +1145,8 @@ def get_latest_daily(
     time: str | list[str] | None = None,
     bbox: list[float] | None = None,
     limit: int | None = None,
+    filter: str | None = None,
+    filter_lang: str | None = None,
     convert_type: bool = True,
 ) -> tuple[pd.DataFrame, BaseMetadata]:
     """Daily data provide one data value to represent water conditions for the
@@ -1200,6 +1272,18 @@ def get_latest_daily(
         allowable limit is 50000. It may be beneficial to set this number lower
         if your internet connection is spotty. The default (None) will set the
         limit to the maximum allowable limit for the service.
+    filter : string, optional
+        A CQL text or JSON expression passed through to the OGC API
+        ``filter`` query parameter. Commonly used to OR several time
+        ranges into a single request. At the time of writing the server
+        accepts ``cql-text`` (default) and ``cql-json``; ``cql2-text`` /
+        ``cql2-json`` are not yet supported. Long filters can exceed the
+        server's URI length limit: a warning is emitted above 5000
+        characters, and the practical cap is ~75 OR-clauses before the
+        server returns HTTP 414.
+    filter_lang : string, optional
+        Language of the ``filter`` expression, for example ``cql-text``
+        (default) or ``cql-json``. Sent as ``filter-lang`` in the URL.
     convert_type : boolean, optional
         If True, converts columns to appropriate types.
 
@@ -1251,6 +1335,8 @@ def get_field_measurements(
     time: str | list[str] | None = None,
     bbox: list[float] | None = None,
     limit: int | None = None,
+    filter: str | None = None,
+    filter_lang: str | None = None,
     convert_type: bool = True,
 ) -> tuple[pd.DataFrame, BaseMetadata]:
     """Field measurements are physically measured values collected during a
@@ -1366,6 +1452,18 @@ def get_field_measurements(
         allowable limit is 50000. It may be beneficial to set this number lower
         if your internet connection is spotty. The default (None) will set the
         limit to the maximum allowable limit for the service.
+    filter : string, optional
+        A CQL text or JSON expression passed through to the OGC API
+        ``filter`` query parameter. Commonly used to OR several time
+        ranges into a single request. At the time of writing the server
+        accepts ``cql-text`` (default) and ``cql-json``; ``cql2-text`` /
+        ``cql2-json`` are not yet supported. Long filters can exceed the
+        server's URI length limit: a warning is emitted above 5000
+        characters, and the practical cap is ~75 OR-clauses before the
+        server returns HTTP 414.
+    filter_lang : string, optional
+        Language of the ``filter`` expression, for example ``cql-text``
+        (default) or ``cql-json``. Sent as ``filter-lang`` in the URL.
     convert_type : boolean, optional
         If True, converts columns to appropriate types.
 
@@ -2017,6 +2115,8 @@ def get_channel(
     skip_geometry: bool | None = None,
     bbox: list[float] | None = None,
     limit: int | None = None,
+    filter: str | None = None,
+    filter_lang: str | None = None,
     convert_type: bool = True,
 ) -> tuple[pd.DataFrame, BaseMetadata]:
     """
@@ -2123,6 +2223,18 @@ def get_channel(
         vertical_velocity_description, longitudinal_velocity_description,
         measurement_type, last_modified, channel_measurement_type. The default (NA) will
         return all columns of the data.
+    filter : string, optional
+        A CQL text or JSON expression passed through to the OGC API
+        ``filter`` query parameter. Commonly used to OR several time
+        ranges into a single request. At the time of writing the server
+        accepts ``cql-text`` (default) and ``cql-json``; ``cql2-text`` /
+        ``cql2-json`` are not yet supported. Long filters can exceed the
+        server's URI length limit: a warning is emitted above 5000
+        characters, and the practical cap is ~75 OR-clauses before the
+        server returns HTTP 414.
+    filter_lang : string, optional
+        Language of the ``filter`` expression, for example ``cql-text``
+        (default) or ``cql-json``. Sent as ``filter-lang`` in the URL.
     convert_type : boolean, optional
         If True, the function will convert the data to dates and qualifier to
         string vector
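The docstrings above state that `filter_lang` is sent as `filter-lang` in the URL and that long filters can exceed the URI length limit. The following standard-library sketch illustrates both points: the key rename, and the fact that percent-encoding makes the query string longer than the raw filter text. The base URL is a placeholder chosen for illustration, not the real service address, and this is not how `dataretrieval` builds its requests internally.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Placeholder endpoint for illustration only; not the real service URL.
BASE_URL = "https://example.invalid/collections/continuous/items"

expr = (
    "(time >= '2023-01-06T16:00:00Z' AND time <= '2023-01-06T18:00:00Z') "
    "OR (time >= '2023-01-10T18:00:00Z' AND time <= '2023-01-10T20:00:00Z')"
)

# Python-side kwargs use an underscore; the URL key uses a hyphen.
kwargs = {"filter": expr, "filter_lang": "cql-text"}
params = dict(kwargs)
params["filter-lang"] = params.pop("filter_lang")

url = f"{BASE_URL}?{urlencode(params)}"
query = urlsplit(url).query
decoded = parse_qs(query)

# Quotes, colons, and comparison operators are percent-encoded, so the
# encoded query is longer than the raw expression.
print(len(expr), len(query))
print(decoded["filter-lang"])
```

Because encoding inflates the text, a filter that is under 5000 raw characters can still produce a considerably longer URL.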

dataretrieval/waterdata/utils.py

Lines changed: 19 additions & 0 deletions
@@ -4,6 +4,7 @@
 import logging
 import os
 import re
+import warnings
 from datetime import datetime
 from typing import Any, get_args
 
@@ -419,6 +420,24 @@ def _construct_api_requests(
     if properties:
         params["properties"] = ",".join(properties)
 
+    # Translate CQL filter Python names to the hyphenated URL parameter that
+    # the OGC API expects. The Python kwarg is `filter_lang` because hyphens
+    # aren't valid in Python identifiers.
+    if "filter_lang" in params:
+        params["filter-lang"] = params.pop("filter_lang")
+    # Emit a warning when a long CQL filter is at risk of exceeding the
+    # server's URI length limit (HTTP 414). Empirically, the waterdata
+    # continuous endpoint begins returning 414 around ~7 KB of filter text
+    # (~75 OR-clauses of typical interval form). The threshold here is
+    # conservative.
+    if isinstance(params.get("filter"), str) and len(params["filter"]) > 5000:
+        warnings.warn(
+            "CQL `filter` is longer than 5000 characters; the server may "
+            "return HTTP 414 (URI Too Long). Consider splitting into batched "
+            "requests.",
+            stacklevel=2,
+        )
+
     headers = _default_headers()
 
     if POST:
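Because the length check above uses `warnings.warn` without a category, it emits a `UserWarning`, which callers can escalate to an exception if they would rather fail fast than send an over-long request. The sketch below shows that pattern with a stand-in function; `check_filter_length` and `filter_is_safe` are hypothetical names that only mimic the check in the diff, not functions from `dataretrieval`.

```python
import warnings


def check_filter_length(filter_text, threshold=5000):
    """Hypothetical stand-in for the length check in the diff above."""
    if isinstance(filter_text, str) and len(filter_text) > threshold:
        warnings.warn(
            "CQL `filter` is longer than 5000 characters; the server may "
            "return HTTP 414 (URI Too Long).",
            UserWarning,
            stacklevel=2,
        )


def filter_is_safe(filter_text):
    """Return False instead of warning when the filter is too long."""
    with warnings.catch_warnings():
        # Escalate UserWarning to an exception inside this block only.
        warnings.simplefilter("error", UserWarning)
        try:
            check_filter_length(filter_text)
        except UserWarning:
            return False
    return True


print(filter_is_safe("value > 0"))   # True
print(filter_is_safe("x" * 5001))    # False
```

The `catch_warnings` context manager restores the previous warning filters on exit, so the escalation does not leak into the rest of the program.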

tests/waterdata_utils_test.py

Lines changed: 105 additions & 1 deletion
@@ -1,8 +1,19 @@
+import warnings
 from unittest import mock
+from urllib.parse import parse_qs, urlsplit
 
+import pytest
 import requests
 
-from dataretrieval.waterdata.utils import _get_args, _walk_pages
+from dataretrieval.waterdata.utils import (
+    _construct_api_requests,
+    _get_args,
+    _walk_pages,
+)
+
+
+def _query_params(prepared_request):
+    return parse_qs(urlsplit(prepared_request.url).query)
 
 
 def test_get_args_basic():
@@ -77,3 +88,96 @@ def test_walk_pages_multiple_mocked():
     assert mock_client.send.called
     assert mock_client.request.called
     assert mock_client.request.call_args[0][1] == "https://example.com/page2"
+
+
+def test_construct_filter_passthrough():
+    """`filter` is forwarded verbatim as a query parameter."""
+    expr = (
+        "(time >= '2023-01-06T16:00:00Z' AND time <= '2023-01-06T18:00:00Z') "
+        "OR (time >= '2023-01-10T18:00:00Z' AND time <= '2023-01-10T20:00:00Z')"
+    )
+    req = _construct_api_requests(
+        service="continuous",
+        monitoring_location_id="USGS-07374525",
+        parameter_code="72255",
+        filter=expr,
+    )
+    qs = _query_params(req)
+    assert qs["filter"] == [expr]
+
+
+def test_construct_filter_lang_hyphenated():
+    """The Python kwarg `filter_lang` is sent as URL key `filter-lang`."""
+    req = _construct_api_requests(
+        service="continuous",
+        monitoring_location_id="USGS-07374525",
+        parameter_code="72255",
+        filter="time >= '2023-01-01T00:00:00Z'",
+        filter_lang="cql-text",
+    )
+    qs = _query_params(req)
+    assert qs["filter-lang"] == ["cql-text"]
+    # The underscore form must NOT appear in the URL
+    assert "filter_lang" not in qs
+
+
+def test_construct_long_filter_emits_warning():
+    """CQL filter strings longer than 5000 characters warn about URI limits."""
+    long_clause = "(time >= '2023-01-01T00:00:00Z' AND time <= '2023-01-01T00:30:00Z')"
+    # Build a filter string that comfortably exceeds the threshold
+    big = " OR ".join([long_clause] * 100)
+    assert len(big) > 5000
+    with warnings.catch_warnings(record=True) as caught:
+        warnings.simplefilter("always")
+        _construct_api_requests(
+            service="continuous",
+            monitoring_location_id="USGS-07374525",
+            filter=big,
+        )
+    matching = [
+        w
+        for w in caught
+        if issubclass(w.category, UserWarning) and "414" in str(w.message)
+    ]
+    seen = [str(w.message) for w in caught]
+    assert matching, f"expected a URI-length warning, got: {seen}"
+
+
+def test_construct_short_filter_does_not_warn():
+    with warnings.catch_warnings(record=True) as caught:
+        warnings.simplefilter("always")
+        _construct_api_requests(
+            service="continuous",
+            monitoring_location_id="USGS-07374525",
+            filter="time >= '2023-01-01T00:00:00Z'",
+        )
+    assert not [
+        w
+        for w in caught
+        if issubclass(w.category, UserWarning) and "414" in str(w.message)
+    ]
+
+
+@pytest.mark.parametrize(
+    "service",
+    [
+        "daily",
+        "continuous",
+        "monitoring-locations",
+        "time-series-metadata",
+        "latest-continuous",
+        "latest-daily",
+        "field-measurements",
+        "channel-measurements",
+    ],
+)
+def test_construct_filter_on_all_ogc_services(service):
+    """Filter passthrough works uniformly for every OGC collection endpoint."""
+    req = _construct_api_requests(
+        service=service,
+        filter="value > 0",
+        filter_lang="cql-text",
+    )
+    qs = _query_params(req)
+    assert qs["filter"] == ["value > 0"]
+    assert qs["filter-lang"] == ["cql-text"]
