Commit 7532b47

thodson-usgs and claude committed
feat(waterdata): add get_waterdata for generalized CQL2 queries
Adds get_waterdata(service, cql, ...), a Python analogue of R dataRetrieval::read_waterdata, for OGC API predicates the typed getters can't express (top-level or, like with % wildcards, comparisons, nested boolean trees). The CQL2 body (str or dict) is POSTed verbatim and the result is shaped exactly like the typed getters: wire id renamed, columns ordered/sorted, typed dtypes, (df, BaseMetadata) return.

Ports the feature from the original PR onto the current async-httpx module (the branch predated the httpx migration and the async chunker), reusing existing scaffolding instead of duplicating it:

- get_cql_data fetches via the same non-chunked anyio-portal path as get_stats_data (extracted as _run_sync) and shapes via the existing _finalize_ogc finalize hook.
- _construct_cql_request shares the skipGeometry/limit/bbox/properties block with _construct_api_requests (extracted as _ogc_query_params).
- _OUTPUT_ID_BY_SERVICE / WATERDATA_SERVICES enumerate the 11 OGC time-series collections and their id columns (kept in sync).

Like get_stats_data, this is a single logical request (the CQL body is sent verbatim, so there is nothing to chunk), so there is no resume-on-interrupt handle; server-side CQL errors surface as the module's standard typed errors.

Tests: unit (request construction, skip_geometry omission, service validation) + live (compound AND/IN, str/dict equivalence, id translation, LIKE wildcard).

Addresses #198

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
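To make the predicate shapes named in the message concrete, here is a minimal sketch of a CQL2 JSON document that combines a top-level or, a like with a % wildcard, an equality comparison, and a nested and. The helpers prop and op are hypothetical and exist only for this illustration; they are not part of dataretrieval. The property names are the ones used in the docstring examples of this commit, and whether the server accepts this exact combination is an assumption.

```python
import json


def prop(name: str) -> dict:
    """Hypothetical helper: reference a collection property in CQL2 JSON."""
    return {"property": name}


def op(name: str, *args) -> dict:
    """Hypothetical helper: build one CQL2 JSON operator node."""
    return {"op": name, "args": list(args)}


# Top-level OR over a LIKE (with the % wildcard) and a nested AND that mixes
# an equality comparison with an IN list.
cql = op(
    "or",
    op("like", prop("monitoring_location_id"), "USGS-0736%"),
    op(
        "and",
        op("=", prop("parameter_code"), "00060"),
        op("in", prop("monitoring_location_id"), ["USGS-07367300", "USGS-03277200"]),
    ),
)

# Compact serialization, using the same separators as the diff in this commit.
body = json.dumps(cql, separators=(",", ":"))
print(body)
```

The resulting dict could then be passed as cql to get_waterdata(service="daily", cql=cql); the live call is left out here because it needs network access.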
1 parent 484a86a commit 7532b47

5 files changed

Lines changed: 465 additions & 15 deletions


dataretrieval/waterdata/__init__.py

Lines changed: 4 additions & 0 deletions
@@ -28,6 +28,7 @@
     get_stats_date_range,
     get_stats_por,
     get_time_series_metadata,
+    get_waterdata,
 )
 from .filters import FILTER_LANG
 from .nearest import get_nearest_continuous
@@ -37,6 +38,7 @@
     PROFILE_LOOKUP,
     PROFILES,
     SERVICES,
+    WATERDATA_SERVICES,
 )

 __all__ = [
@@ -45,6 +47,7 @@
     "PROFILES",
     "PROFILE_LOOKUP",
     "SERVICES",
+    "WATERDATA_SERVICES",
     "get_channel",
     "get_codes",
     "get_combined_metadata",
@@ -64,4 +67,5 @@
     "get_stats_date_range",
     "get_stats_por",
     "get_time_series_metadata",
+    "get_waterdata",
 ]

dataretrieval/waterdata/api.py

Lines changed: 124 additions & 0 deletions
@@ -28,13 +28,16 @@
     METADATA_COLLECTIONS,
     PROFILES,
     SERVICES,
+    WATERDATA_SERVICES,
 )
 from dataretrieval.waterdata.utils import (
+    _OUTPUT_ID_BY_SERVICE,
     SAMPLES_URL,
     _check_profiles,
     _default_headers,
     _get_args,
     _raise_for_non_200,
+    get_cql_data,
     get_ogc_data,
     get_stats_data,
 )
@@ -2851,3 +2854,124 @@ def get_channel(
     args = _get_args(locals())

     return get_ogc_data(args, output_id, service)
+
+
+def get_waterdata(
+    service: WATERDATA_SERVICES,
+    cql: str | dict,
+    *,
+    properties: str | Iterable[str] | None = None,
+    bbox: list[float] | None = None,
+    limit: int | None = None,
+    skip_geometry: bool | None = None,
+    convert_type: bool = True,
+) -> tuple[pd.DataFrame, BaseMetadata]:
+    """Generalized OGC API CQL2 query.
+
+    Python analogue of R ``dataRetrieval::read_waterdata``. Use this when you
+    need a predicate the typed wrappers (``get_daily``, ``get_continuous``, …)
+    can't express: top-level ``or``, ``like`` with ``%`` wildcards, comparison
+    operators, nested boolean trees, geometry-based predicates beyond a bbox,
+    and so on. The typed wrappers are nicer when they cover the case; reach for
+    this when they don't.
+
+    Unlike the typed getters, this issues a single request (the CQL2 body is
+    sent verbatim, so there are no multi-value arguments to chunk); a query
+    whose URL/body exceeds the server's size cap should be narrowed rather than
+    relying on automatic chunking.
+
+    The CQL2 grammar is documented at
+    https://api.waterdata.usgs.gov/docs/ogcapi/complex-queries/.
+
+    Parameters
+    ----------
+    service : str
+        OGC collection name. Must be one of
+        :data:`dataretrieval.waterdata.types.WATERDATA_SERVICES`
+        (e.g. ``"daily"``, ``"monitoring-locations"``).
+    cql : str or dict
+        CQL2 query. A ``dict`` is JSON-serialized for transport; a ``str`` is
+        sent through unchanged. The query goes into the HTTP POST body with
+        ``Content-Type: application/query-cql-json``.
+    properties : str or iterable of str, optional
+        Server-side property whitelist (passed as ``properties=`` on the URL).
+        Reduces payload size. ``"id"`` resolves to the service's ``output_id``
+        (e.g. ``daily_id``) the same way it does in the typed wrappers.
+    bbox : list of float, optional
+        Bounding box ``[xmin, ymin, xmax, ymax]`` in CRS 4326. Combines with the
+        CQL filter as an additional spatial predicate.
+    limit : int, optional
+        Page size, clamped server-side to 50,000.
+    skip_geometry : bool, optional
+        If True, the server omits geometry from each feature
+        (``skipGeometry=true``).
+    convert_type : bool, default True
+        Coerce date/datetime/numeric columns to typed dtypes after the
+        DataFrame is built.
+
+    Returns
+    -------
+    df : pandas.DataFrame or geopandas.GeoDataFrame
+        Result of the query. GeoDataFrame when ``geopandas`` is installed and
+        geometry is present.
+    md : :class:`dataretrieval.utils.BaseMetadata`
+        Request metadata (URL, query time, response headers).
+
+    Examples
+    --------
+    .. code::
+
+        >>> # Daily values for two parameter codes at two sites
+        >>> # (compound AND-of-INs).
+        >>> from dataretrieval import waterdata
+        >>> cql = {
+        ...     "op": "and",
+        ...     "args": [
+        ...         {
+        ...             "op": "in",
+        ...             "args": [
+        ...                 {"property": "parameter_code"},
+        ...                 ["00060", "00065"],
+        ...             ],
+        ...         },
+        ...         {
+        ...             "op": "in",
+        ...             "args": [
+        ...                 {"property": "monitoring_location_id"},
+        ...                 ["USGS-07367300", "USGS-03277200"],
+        ...             ],
+        ...         },
+        ...     ],
+        ... }
+        >>> df, md = waterdata.get_waterdata(service="daily", cql=cql)
+
+        >>> # Monitoring locations whose HUC starts with "02070010"
+        >>> # (LIKE with the CQL2 ``%`` wildcard).
+        >>> df, md = waterdata.get_waterdata(
+        ...     service="monitoring-locations",
+        ...     cql='{"op": "like", "args": ['
+        ...     '{"property": "hydrologic_unit_code"},'
+        ...     ' "02070010%"]}',
+        ... )
+    """
+    if service not in _OUTPUT_ID_BY_SERVICE:
+        raise ValueError(
+            f"Unknown service {service!r}. Valid services: "
+            f"{sorted(_OUTPUT_ID_BY_SERVICE)}."
+        )
+
+    # ``dict`` is the pythonic input; serialize on the way out. ``str`` is sent
+    # verbatim so callers who already have a CQL2 doc (e.g. imported from a
+    # config file) don't need to re-parse it.
+    body = json.dumps(cql, separators=(",", ":")) if isinstance(cql, dict) else cql
+
+    return get_cql_data(
+        service,
+        body,
+        properties=properties,
+        output_id=_OUTPUT_ID_BY_SERVICE[service],
+        bbox=bbox,
+        limit=limit,
+        skip_geometry=skip_geometry,
+        convert_type=convert_type,
+    )
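The body handling and the URL parameters described in the get_waterdata docstring can be sketched without touching the network. This is an illustrative stand-in, not the module's private code: normalize_cql copies the one-line dict-or-str rule from get_waterdata, while sketch_query_params is a hypothetical function whose behaviour (omit unset options, join lists with commas) is an assumption inferred from the docstring and the "skip_geometry omission" unit test named in the commit message.

```python
from __future__ import annotations

import json
from collections.abc import Iterable


def normalize_cql(cql: str | dict) -> str:
    """Serialize a dict compactly; pass a str through unchanged."""
    return json.dumps(cql, separators=(",", ":")) if isinstance(cql, dict) else cql


def sketch_query_params(
    properties: str | Iterable[str] | None = None,
    bbox: list[float] | None = None,
    limit: int | None = None,
    skip_geometry: bool | None = None,
) -> dict[str, str]:
    """Hypothetical parameter builder: only options that were set are emitted."""
    params: dict[str, str] = {}
    if properties is not None:
        names = [properties] if isinstance(properties, str) else list(properties)
        params["properties"] = ",".join(names)
    if bbox is not None:
        params["bbox"] = ",".join(str(v) for v in bbox)
    if limit is not None:
        params["limit"] = str(limit)
    if skip_geometry is not None:
        # Assumed wire spelling, taken from the ``skipGeometry=true`` docstring note.
        params["skipGeometry"] = "true" if skip_geometry else "false"
    return params


# The same query given as a dict and as a str parses to the same document.
as_dict = {"op": "like", "args": [{"property": "hydrologic_unit_code"}, "02070010%"]}
as_str = '{"op": "like", "args": [{"property": "hydrologic_unit_code"}, "02070010%"]}'
same_document = json.loads(normalize_cql(as_dict)) == json.loads(normalize_cql(as_str))

params = sketch_query_params(properties=["monitoring_location_id", "id"], limit=100)
```

Note that the str form is not re-serialized, so the two request bodies can differ in whitespace even though they describe the same query.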

dataretrieval/waterdata/types.py

Lines changed: 18 additions & 0 deletions
@@ -40,6 +40,24 @@
     "results",
 ]

+# OGC API time-series/monitoring collections queryable via ``get_waterdata``.
+# Keep in sync with ``utils._OUTPUT_ID_BY_SERVICE`` (same keys): that dict maps
+# each service to its user-facing ``id`` column and is the runtime source of
+# truth ``get_waterdata`` validates against.
+WATERDATA_SERVICES = Literal[
+    "channel-measurements",
+    "combined-metadata",
+    "continuous",
+    "daily",
+    "field-measurements",
+    "field-measurements-metadata",
+    "latest-continuous",
+    "latest-daily",
+    "monitoring-locations",
+    "peaks",
+    "time-series-metadata",
+]
+
 PROFILES = Literal[
     "actgroup",
     "actmetric",
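The "keep in sync" comment could be enforced mechanically. A minimal sketch, assuming a unit test is an acceptable place for it: typing.get_args extracts the names from the Literal so they can be compared with the mapping's keys. The function service_drift and the stub mapping are hypothetical. Only the pairing "daily" to "daily_id" is stated in this commit, so every other id value below is a placeholder and not the real column name.

```python
from typing import Literal, get_args

# Copied from the types.py hunk in this commit.
WATERDATA_SERVICES = Literal[
    "channel-measurements",
    "combined-metadata",
    "continuous",
    "daily",
    "field-measurements",
    "field-measurements-metadata",
    "latest-continuous",
    "latest-daily",
    "monitoring-locations",
    "peaks",
    "time-series-metadata",
]


def service_drift(literal_type, id_by_service) -> tuple[list[str], list[str]]:
    """Return (declared but unmapped, mapped but undeclared) service names."""
    declared = set(get_args(literal_type))
    mapped = set(id_by_service)
    return sorted(declared - mapped), sorted(mapped - declared)


# Hypothetical stand-in for utils._OUTPUT_ID_BY_SERVICE; values are placeholders.
stub = {name: "placeholder_id" for name in get_args(WATERDATA_SERVICES)}
stub["daily"] = "daily_id"
in_sync = service_drift(WATERDATA_SERVICES, stub)

# Simulate drift: one service dropped from the mapping, one unknown key added.
drifted = dict(stub)
del drifted["peaks"]
drifted["not-a-collection"] = "placeholder_id"
out_of_sync = service_drift(WATERDATA_SERVICES, drifted)
```

A test asserting that both lists are empty would fail as soon as the Literal and the dict diverge, instead of relying on the comment alone.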
