Commit dffb168

thodson-usgs and claude committed
feat(waterdata): add get_waterdata for generalized CQL2 queries
Python analogue of R ``dataRetrieval::read_waterdata``.

The typed ``get_*`` wrappers (``get_daily``, ``get_continuous``, …) only support exact-equality predicates on whitelisted parameters. Some users need more (top-level ``or``, ``like`` with ``%`` wildcards, comparison operators, nested boolean trees) and today have no surface for it.

``get_waterdata(service, cql, ...)`` accepts a raw CQL2 query (``dict`` or pre-serialized JSON string) and POSTs it against any recognized collection, then walks pages and post-processes the result with the same pipeline the typed wrappers use.

Reuses existing infrastructure: ``_walk_pages``, ``_deal_with_empty``, ``_arrange_cols``, ``_type_cols``, ``_sort_rows``, and ``_switch_properties_id``. The new pieces are:

- ``_OUTPUT_ID_BY_SERVICE`` (utils.py): a single mapping from service name to the renamed-``id`` column the rest of the package exposes, hoisted from the typed wrappers so the generalized entry point can pick the right one.
- ``_construct_cql_request`` (utils.py): focused POST/CQL2 request builder; distinct from ``_construct_api_requests`` because the body comes in verbatim rather than being derived from typed kwargs.
- ``get_waterdata`` (api.py): public entry point.

CQL2 grammar reference: https://api.waterdata.usgs.gov/docs/ogcapi/complex-queries/

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
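
The commit message names top-level ``or`` and ``like`` with ``%`` wildcards as predicates the typed wrappers cannot express. A minimal sketch of such a query written as a Python ``dict`` and serialized with the same compact separators the new function uses for ``dict`` input. The helper name ``to_cql_body`` and the second HUC prefix are illustrative assumptions, not part of the commit.

```python
import json

# Top-level OR of two LIKE predicates in CQL2-JSON form.
# "02070010%" comes from the docstring example in this commit; the second
# prefix is a made-up value used only for illustration.
cql = {
    "op": "or",
    "args": [
        {"op": "like", "args": [{"property": "hydrologic_unit_code"}, "02070010%"]},
        {"op": "like", "args": [{"property": "hydrologic_unit_code"}, "02070008%"]},
    ],
}


def to_cql_body(query):
    """Hypothetical stand-in mirroring the dict-or-str handling in get_waterdata."""
    if isinstance(query, dict):
        # Compact separators keep padding whitespace out of the POST body.
        return json.dumps(query, separators=(",", ":"))
    # Strings are assumed to be pre-serialized CQL2-JSON and pass through as-is.
    return query


body = to_cql_body(cql)
```

Either ``cql`` or ``body`` should then be usable as the ``cql`` argument of ``get_waterdata`` with ``service="monitoring-locations"``, since the function accepts both forms.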
1 parent 36866a0 commit dffb168

3 files changed

Lines changed: 235 additions & 0 deletions

dataretrieval/waterdata/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -28,6 +28,7 @@
     get_stats_date_range,
     get_stats_por,
     get_time_series_metadata,
+    get_waterdata,
 )
 from .filters import FILTER_LANG
 from .nearest import get_nearest_continuous
@@ -64,4 +65,5 @@
     "get_stats_date_range",
     "get_stats_por",
     "get_time_series_metadata",
+    "get_waterdata",
 ]

dataretrieval/waterdata/api.py

Lines changed: 160 additions & 0 deletions
@@ -26,10 +26,20 @@
     SERVICES,
 )
 from dataretrieval.waterdata.utils import (
+    _OUTPUT_ID_BY_SERVICE,
+    GEOPANDAS,
     SAMPLES_URL,
+    _arrange_cols,
     _check_profiles,
+    _construct_cql_request,
+    _deal_with_empty,
     _default_headers,
     _get_args,
+    _normalize_str_iterable,
+    _sort_rows,
+    _switch_properties_id,
+    _type_cols,
+    _walk_pages,
     get_ogc_data,
     get_stats_data,
 )
@@ -2832,3 +2842,153 @@ def get_channel(
     args = _get_args(locals())

     return get_ogc_data(args, output_id, service)
+
+
+def get_waterdata(
+    service: str,
+    cql: str | dict,
+    *,
+    properties: str | Iterable[str] | None = None,
+    bbox: list[float] | None = None,
+    limit: int | None = None,
+    skip_geometry: bool | None = None,
+    convert_type: bool = True,
+    client: requests.Session | None = None,
+) -> tuple[pd.DataFrame, BaseMetadata]:
+    """Generalized OGC API CQL2 query.
+
+    Python analogue of R ``dataRetrieval::read_waterdata``. Use this
+    when you need a predicate the typed wrappers (``get_daily``,
+    ``get_continuous``, …) can't express: top-level ``or``, ``like``
+    with ``%`` wildcards, comparison operators, nested boolean trees,
+    geometry-based predicates beyond a bbox, and so on. The typed
+    wrappers are nicer when they cover the case; reach for this when
+    they don't.
+
+    The CQL2 grammar is documented at
+    https://api.waterdata.usgs.gov/docs/ogcapi/complex-queries/.
+
+    Parameters
+    ----------
+    service : str
+        OGC collection name (e.g. ``"daily"``, ``"monitoring-locations"``).
+        See :data:`dataretrieval.waterdata.utils._OUTPUT_ID_BY_SERVICE`
+        for the recognized list.
+    cql : str or dict
+        CQL2 query. A ``dict`` is JSON-serialized for transport; a
+        ``str`` is sent through unchanged. The query goes into the
+        HTTP POST body with ``Content-Type:
+        application/query-cql-json``.
+    properties : str or iterable of str, optional
+        Server-side property whitelist (passed as ``properties=`` on
+        the URL). Reduces payload size and bypasses the response-shape
+        post-processing for any column not listed. ``"id"`` resolves
+        to the service's ``output_id`` (e.g. ``daily_id``) the same way
+        it does in the typed wrappers.
+    bbox : list of float, optional
+        Bounding box ``[xmin, ymin, xmax, ymax]`` in CRS 4326.
+    limit : int, optional
+        Page size. Omitted or larger values are sent as 50,000 (server max).
+    skip_geometry : bool, optional
+        If True, the server omits geometry from each feature
+        (``skipGeometry=true``).
+    convert_type : bool, default True
+        Coerce date/datetime/numeric columns to typed dtypes after the
+        DataFrame is built.
+    client : requests.Session, optional
+        Reuse an existing HTTP session (handy for batching or
+        injecting custom retry/auth adapters). A short-lived session
+        is created internally if not provided.
+
+    Returns
+    -------
+    df : pandas.DataFrame or geopandas.GeoDataFrame
+        Result of the query. GeoDataFrame when ``geopandas`` is
+        installed and geometry is present.
+    md : :class:`dataretrieval.utils.BaseMetadata`
+        Request metadata (URL, query time, response headers).
+
+    Examples
+    --------
+    .. code::
+
+        >>> # Daily values for two parameter codes at two sites
+        >>> # (compound AND-of-INs).
+        >>> from dataretrieval import waterdata
+        >>> cql = {
+        ...     "op": "and",
+        ...     "args": [
+        ...         {
+        ...             "op": "in",
+        ...             "args": [
+        ...                 {"property": "parameter_code"},
+        ...                 ["00060", "00065"],
+        ...             ],
+        ...         },
+        ...         {
+        ...             "op": "in",
+        ...             "args": [
+        ...                 {"property": "monitoring_location_id"},
+        ...                 ["USGS-07367300", "USGS-03277200"],
+        ...             ],
+        ...         },
+        ...     ],
+        ... }
+        >>> df, md = waterdata.get_waterdata(
+        ...     service="daily",
+        ...     cql=cql,
+        ...     skip_geometry=True,
+        ... )
+
+        >>> # Monitoring locations whose HUC starts with "02070010"
+        >>> # (LIKE with the CQL2 ``%`` wildcard).
+        >>> df, md = waterdata.get_waterdata(
+        ...     service="monitoring-locations",
+        ...     cql='{"op": "like", "args": ['
+        ...     '{"property": "hydrologic_unit_code"},'
+        ...     ' "02070010%"]}',
+        ... )
+    """
+    if service not in _OUTPUT_ID_BY_SERVICE:
+        raise ValueError(
+            f"Unknown service {service!r}. Valid services: "
+            f"{sorted(_OUTPUT_ID_BY_SERVICE)}."
+        )
+    output_id = _OUTPUT_ID_BY_SERVICE[service]
+
+    # ``dict`` is the pythonic input; serialize on the way out. ``str``
+    # is sent verbatim so callers who already have a CQL2 doc (e.g.
+    # imported from a config file) don't need to re-parse it.
+    body = json.dumps(cql, separators=(",", ":")) if isinstance(cql, dict) else cql
+
+    if isinstance(properties, str):
+        properties_list = [properties]
+    elif properties is None:
+        properties_list = None
+    else:
+        properties_list = _normalize_str_iterable(properties, "properties")
+
+    # Translate user-facing names (``daily_id``) to the wire-format
+    # ``id`` the OGC API expects, matching the typed wrappers.
+    wire_properties = _switch_properties_id(
+        properties_list, id_name=output_id, service=service
+    )
+
+    req = _construct_cql_request(
+        service=service,
+        cql_body=body,
+        properties=wire_properties or None,
+        bbox=bbox,
+        limit=limit,
+        skip_geometry=skip_geometry,
+    )
+
+    df, response = _walk_pages(geopd=GEOPANDAS, req=req, client=client)
+
+    df = _deal_with_empty(df, properties_list, service)
+    if convert_type:
+        df = _type_cols(df)
+    df = _arrange_cols(df, properties_list, output_id)
+    df = _sort_rows(df)
+
+    return df, BaseMetadata(response)
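
The first steps of ``get_waterdata`` (rejecting unknown services, then normalizing ``properties``) can be tried in isolation. A rough sketch under stated assumptions: the mapping below is a three-entry subset copied from this commit, and ``resolve_output_id`` and ``normalize_properties`` are hypothetical stand-ins. In particular, the real ``_normalize_str_iterable`` is not shown in the diff and may validate more than this does.

```python
# Subset of _OUTPUT_ID_BY_SERVICE copied from this commit, for illustration.
OUTPUT_ID_BY_SERVICE = {
    "daily": "daily_id",
    "monitoring-locations": "monitoring_location_id",
    "peaks": "peak_id",
}


def resolve_output_id(service):
    """Hypothetical helper mirroring the guard at the top of get_waterdata."""
    if service not in OUTPUT_ID_BY_SERVICE:
        raise ValueError(
            f"Unknown service {service!r}. Valid services: "
            f"{sorted(OUTPUT_ID_BY_SERVICE)}."
        )
    return OUTPUT_ID_BY_SERVICE[service]


def normalize_properties(properties):
    """Hypothetical stand-in: None stays None, a bare string becomes a list."""
    if properties is None:
        return None
    if isinstance(properties, str):
        return [properties]
    # Assumption: any other iterable of strings is simply materialized.
    return list(properties)
```

Failing before any request is built means a misspelled collection name surfaces as a ``ValueError`` that lists the valid services.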

dataretrieval/waterdata/utils.py

Lines changed: 73 additions & 0 deletions
@@ -157,6 +157,26 @@ def _switch_properties_id(properties: list[str] | None, id_name: str, service: s
 # parameters and require POST with CQL2 JSON instead.
 _CQL2_REQUIRED_SERVICES = frozenset({"monitoring-locations"})

+# Service name → the column the rest of the package exposes the
+# per-record OGC ``id`` under. Used by the generalized
+# :func:`get_waterdata` entry point, which doesn't have a per-service
+# wrapper to hard-code this in. Values here match what each typed
+# ``get_*`` function in :mod:`dataretrieval.waterdata.api` uses for
+# its own ``output_id``.
+_OUTPUT_ID_BY_SERVICE: dict[str, str] = {
+    "daily": "daily_id",
+    "continuous": "continuous_id",
+    "latest-continuous": "latest_continuous_id",
+    "latest-daily": "latest_daily_id",
+    "field-measurements": "field_measurement_id",
+    "field-measurements-metadata": "field_series_id",
+    "monitoring-locations": "monitoring_location_id",
+    "time-series-metadata": "time_series_id",
+    "combined-metadata": "combined_meta_id",
+    "peaks": "peak_id",
+    "channel-measurements": "channel_measurements_id",
+}
+

 def _parse_datetime(value: str) -> datetime | None:
     """Parse a single datetime string against the supported formats.
@@ -528,6 +548,59 @@ def _construct_api_requests(
     return request.prepare()


+def _construct_cql_request(
+    service: str,
+    cql_body: str,
+    properties: list[str] | None = None,
+    bbox: list[float] | None = None,
+    limit: int | None = None,
+    skip_geometry: bool | None = None,
+) -> requests.PreparedRequest:
+    """Build a POST/CQL2 request for the generalized ``get_waterdata`` path.
+
+    Distinct from :func:`_construct_api_requests` because that function
+    derives the CQL2 body from typed kwargs; here the body is passed
+    through verbatim so a caller can express predicates the typed
+    wrappers can't (top-level ``or``, ``like`` with ``%`` wildcards,
+    comparison operators, …).
+
+    Parameters
+    ----------
+    service : str
+        Collection name, e.g. ``"daily"``.
+    cql_body : str
+        Pre-serialized CQL2-JSON. Sent as the request body unchanged.
+    properties : list of str, optional
+        Server-side property whitelist. Joined with commas.
+    bbox : list of float, optional
+        Bounding box ``[xmin, ymin, xmax, ymax]``. Joined with commas.
+    limit : int, optional
+        Page size; clamped to the server max of 50,000.
+    skip_geometry : bool, optional
+        If True, sets ``skipGeometry=true`` so the server omits
+        geometry from each feature.
+    """
+    service_url = f"{OGC_API_URL}/collections/{service}/items"
+    params: dict[str, Any] = {
+        "skipGeometry": bool(skip_geometry),
+        "limit": 50000 if limit is None or limit > 50000 else limit,
+    }
+    if bbox is not None and len(bbox) > 0:
+        params["bbox"] = ",".join(map(str, bbox))
+    if properties:
+        params["properties"] = ",".join(properties)
+    headers = _default_headers()
+    headers["Content-Type"] = "application/query-cql-json"
+    request = requests.Request(
+        method="POST",
+        url=service_url,
+        headers=headers,
+        data=cql_body,
+        params=params,
+    )
+    return request.prepare()
+
+
 def _next_req_url(resp: requests.Response) -> str | None:
     """
     Extracts the URL for the next page of results from an HTTP response from a
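
The query-parameter handling in ``_construct_cql_request`` can be exercised without a network call. A standard-library sketch, assuming a hypothetical ``build_params`` helper that reproduces the ``params`` logic from the diff. ``urllib.parse.urlencode`` is used here only to render the result; the commit itself prepares the request with ``requests``, and the bounding box values are made up.

```python
from urllib.parse import urlencode

SERVER_MAX_LIMIT = 50000  # page-size ceiling stated in the commit


def build_params(properties=None, bbox=None, limit=None, skip_geometry=None):
    """Hypothetical helper reproducing the params logic of _construct_cql_request."""
    # A missing or oversized limit falls back to the server maximum.
    if limit is None or limit > SERVER_MAX_LIMIT:
        limit = SERVER_MAX_LIMIT
    params = {
        "skipGeometry": bool(skip_geometry),  # None is treated as False
        "limit": limit,
    }
    # bbox and properties are only sent when they carry values.
    if bbox is not None and len(bbox) > 0:
        params["bbox"] = ",".join(map(str, bbox))
    if properties:
        params["properties"] = ",".join(properties)
    return params


params = build_params(
    properties=["monitoring_location_id", "parameter_code"],
    bbox=[-77.5, 38.5, -76.5, 39.5],  # illustrative coordinates
    limit=100000,
)
query_string = urlencode(params)
```

Because ``limit`` is adjusted before the request is prepared, a caller who asks for more than 50,000 rows per page gets the maximum page size and no error.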
