Skip to content

Commit d4c1f29

Browse files
thodson-usgsclaude
andcommitted
Add optional DuckDB connectors prototype for waterdata + wqp
Adds dataretrieval.duckdb_connectors, an optional extension that wraps DuckDB connections with helper methods exposing the dataretrieval waterdata (OGC) and wqp endpoints as registerable SQL views. Each helper returns a duckdb.DuckDBPyRelation; pass register_as=<name> to also publish the result as a named view that subsequent SQL can reference. Highlights: * Per-source layout (duckdb_connectors/{waterdata,wqp}.py) sharing a thin _BaseConnection in _base.py * Optional dependency: pip install dataretrieval[duckdb] * Compound spatial extra (pip install dataretrieval[spatial]) bundles geopandas + duckdb; spatial=True flag on connect() runs INSTALL spatial; LOAD spatial on the underlying connection so ST_GeomFromText etc. become available against registered views * Geometry handled by converting GeoDataFrame geometry to WKT plus longitude/latitude columns so the prototype works without the spatial extension by default * WQP connector threads connection-level legacy / ssl_check defaults through to every helper; per-call overrides supported * dataretrieval.duckdb_connector preserved as a backward-compat alias for the waterdata connector * Demo notebook covering site discovery, daily values, monthly aggregation, top-N window functions, sites x daily joins, latest readings, cross-source waterdata x WQP joins, and the spatial flag 15 tests pass; ruff check + format clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 33e928d commit d4c1f29

10 files changed

Lines changed: 1584 additions & 2 deletions

File tree

NOTES.md

Lines changed: 180 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,180 @@
1+
# DuckDB connector — design notes
2+
3+
Notes captured before implementation. The goal is a prototype that lets users
4+
query USGS waterdata endpoints via DuckDB SQL.
5+
6+
## Package conventions to follow
7+
8+
- **Style**: ruff-managed, py38 target, double quotes, docstring code
9+
formatted at width 72.
10+
- **Type hints**: full hints, use `from __future__ import annotations`,
11+
`str | list[str] | None` style.
12+
- **Docstrings**: numpy-style (Parameters / Returns sections with dashes).
13+
Module top has a short one-paragraph summary.
14+
- **Naming**: snake_case functions, leading underscore for private helpers,
15+
UPPER_SNAKE_CASE module constants.
16+
- **Logging**: `logger = logging.getLogger(__name__)` at module top.
17+
- **Errors**: raise `ImportError` with a pip install hint when an optional
18+
dep is missing, `ValueError` for bad arguments, `RuntimeError` for
19+
unexpected response shapes.
20+
21+
## Optional-dependency pattern (mirror `geopandas`)
22+
23+
`dataretrieval/waterdata/utils.py:24`:
24+
```python
25+
try:
26+
import geopandas as gpd
27+
GEOPANDAS = True
28+
except ImportError:
29+
GEOPANDAS = False
30+
```
31+
32+
For the connector we will:
33+
- attempt `import duckdb` at the top of the module;
34+
- on `ImportError` set a sentinel and raise a clear `ImportError` from
35+
the public entry point telling users to `pip install dataretrieval[duckdb]`.
36+
37+
`pyproject.toml` extras (current state):
38+
```toml
39+
[project.optional-dependencies]
40+
test = [...]
41+
doc = [...]
42+
nldi = ['geopandas>=0.10']
43+
```
44+
Add a new `duckdb = ["duckdb>=1.0.0"]` extra.
45+
46+
## Endpoint shape
47+
48+
All `dataretrieval.waterdata.api.get_*` functions return
49+
`tuple[pandas.DataFrame | geopandas.GeoDataFrame, BaseMetadata]`. Pagination
50+
is fully handled inside `_walk_pages`, so a single call is the whole
51+
result set.
52+
53+
Endpoints we will expose first (highest user value, all OGC):
54+
- `get_monitoring_locations` — site discovery (returns GeoDataFrame when
55+
geopandas installed)
56+
- `get_daily` — daily values
57+
- `get_continuous` — instantaneous values (≤3y per call by API contract)
58+
- `get_time_series_metadata` — what's available at each site
59+
- `get_latest_continuous`, `get_latest_daily` — most recent obs
60+
61+
Each accepts `monitoring_location_id`, `parameter_code`, etc. as scalar or
62+
list, plus the new `filter` / `filter_lang` CQL passthrough (#238).
63+
64+
## Architecture (after wqp + per-source split)
65+
66+
After surveying `wqp.py` we moved to a per-source connector package:
67+
68+
```
69+
dataretrieval/duckdb_connectors/
70+
├── _base.py # _require_duckdb, _flatten_geometry, _BaseConnection
71+
├── waterdata.py # WaterdataConnection + connect()
72+
└── wqp.py # WQPConnection + connect() (handles legacy / WQX3 flag)
73+
```
74+
75+
`dataretrieval/duckdb_connector.py` stays as a thin alias re-exporting
76+
the waterdata connector so the older import path keeps working.
77+
78+
WQP differences vs waterdata that the connector has to absorb:
79+
80+
* WQP getters take `**kwargs` (CamelCase URL params) rather than fully
81+
enumerated signatures, so the connector can't validate kwargs — it
82+
just forwards them.
83+
* Two parallel schemas (legacy WQX vs WQX 3.0) controlled by `legacy=`
84+
per call. The connection holds a default that callers can override
85+
per call.
86+
* `ssl_check` is also a connection-level default.
87+
* WQP returns a custom `WQP_Metadata` instead of `BaseMetadata`, but
88+
since the connector only consumes the DataFrame this doesn't matter.
89+
90+
Joining across the two sources: each connector owns its own duckdb
91+
connection, so to join you either materialise to a DataFrame and
92+
`.con.register(name, df)` it onto the other connection, or open a
93+
single `duckdb.connect()` directly and pass it into both
94+
`WaterdataConnection(con)` and `WQPConnection(con)` manually.
95+
96+
Other modules surveyed but not given connectors:
97+
98+
* `nwis` — deprecated; users are being pushed to waterdata.
99+
* `nldi` — returns GeoDataFrames / dicts; spatial-only, different
100+
contract; possible later.
101+
* `streamstats`, `nadp` — return non-tabular data (Watershed objects,
102+
zip files / TIFs); not connector candidates.
103+
* `ngwmn` — does return DataFrames but very narrow scope; could add
104+
later if needed.
105+
* `samples` — already covered by the waterdata connector via
106+
`wd.samples(...)` (the `samples.py` module is a deprecated shim
107+
that forwards to `waterdata.get_samples`).
108+
109+
## DuckDB integration choices
110+
111+
DuckDB ≥0.8 supports registering Python objects via `con.register(name, df)`
112+
which makes a pandas DataFrame queryable as a view. That's the simplest
113+
path and works with any DuckDB build — no compiled extension needed for a
114+
prototype.
115+
116+
DuckDB also supports `create_function` for **scalar** UDFs but **table**
117+
UDFs (table-valued functions callable as `FROM tvf(...)`) require either
118+
the in-progress python table-function API or a workaround. For a
119+
prototype the simpler API is preferable — register helper *methods* on a
120+
connection that take kwargs, fetch a DataFrame, register it under a
121+
caller-chosen name, and return a `duckdb.DuckDBPyRelation`. The user
122+
writes:
123+
124+
```python
125+
con = waterdata_duckdb.connect()
126+
sites = con.monitoring_locations(state_name="Illinois") # relation
127+
con.sql("SELECT * FROM sites WHERE site_type = 'Stream'")
128+
```
129+
130+
This keeps it pythonic and lets users compose with arbitrary SQL,
131+
including joins across two registered relations.
132+
133+
A second affordance: a `con.sql_table(name, fn, **kwargs)` that registers
134+
a one-shot DataFrame view by name, so:
135+
136+
```python
137+
con.sql_table("daily", waterdata.get_daily,
138+
monitoring_location_id="USGS-05586100",
139+
parameter_code="00060", time="2023/2024")
140+
con.sql("SELECT date_trunc('month', time) AS m, avg(value) "
141+
"FROM daily GROUP BY 1 ORDER BY 1")
142+
```
143+
144+
## Geometry handling
145+
146+
When geopandas is available, `get_monitoring_locations` returns a
147+
GeoDataFrame with a `geometry` column. DuckDB has a `spatial` extension
148+
that understands WKB/WKT but it isn't loaded by default. Safe path for
149+
the prototype: convert geometry to WKT string and add `longitude` /
150+
`latitude` columns. That keeps the relation queryable from plain DuckDB
151+
without extension setup.
152+
153+
## Tests
154+
155+
Existing tests (`tests/waterdata_test.py`) use `requests-mock` against
156+
real URLs. For our connector we don't need to re-test the HTTP layer —
157+
we should mock the waterdata `get_*` functions directly with
158+
`unittest.mock.patch` (this is the pattern in
159+
`tests/waterdata_nearest_test.py`) and assert that:
160+
161+
1. `connect()` raises a clean `ImportError` if duckdb isn't installed.
162+
2. Helper methods invoke the underlying `get_*` with the kwargs we passed.
163+
3. The returned object is a queryable DuckDB relation.
164+
4. `sql_table` registers a view that returns the same row count as the
165+
source DataFrame.
166+
5. Geometry conversion produces WKT + lon/lat columns and drops the
167+
GeoDataFrame `geometry` column (or keeps it as WKT) without needing
168+
the spatial extension.
169+
170+
## Notebook
171+
172+
Goes in `demos/`. Should:
173+
- show a real query against `api.waterdata.usgs.gov`
174+
- demonstrate something easier in SQL than pandas (window function over
175+
daily flow, monthly aggregation, join of monitoring-location metadata
176+
to daily values)
177+
- gracefully note that this needs `pip install dataretrieval[duckdb]`
178+
179+
Demos are excluded from ruff (`extend-exclude = ["demos"]` in
180+
pyproject.toml) so we don't have to fight formatting there.

dataretrieval/duckdb_connector.py

Lines changed: 30 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,30 @@
1+
"""Backwards-compatible alias for :mod:`dataretrieval.duckdb_connectors.waterdata`.
2+
3+
The single-source connector originally lived here. It has since been
4+
generalised into a per-source package so additional sources (WQP,
5+
NGWMN, …) can be added without bloating one module. New code should
6+
import directly from :mod:`dataretrieval.duckdb_connectors`::
7+
8+
from dataretrieval.duckdb_connectors import waterdata
9+
10+
con = waterdata.connect()
11+
12+
This module preserves the older entry point::
13+
14+
from dataretrieval import duckdb_connector
15+
16+
con = duckdb_connector.connect()
17+
18+
which is equivalent to the waterdata connector.
19+
"""
20+
21+
from __future__ import annotations
22+
23+
from .duckdb_connectors._base import DUCKDB
24+
from .duckdb_connectors.waterdata import WaterdataConnection, connect
25+
26+
__all__ = [
27+
"DUCKDB",
28+
"WaterdataConnection",
29+
"connect",
30+
]
Lines changed: 22 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,22 @@
1+
"""DuckDB connectors for ``dataretrieval`` data sources.
2+
3+
Each submodule wraps one ``dataretrieval`` source (waterdata, wqp, …)
4+
behind a duckdb connection so its endpoints can be queried as named
5+
SQL views. The connectors are an *optional* extension; install with::
6+
7+
pip install dataretrieval[duckdb]
8+
9+
Quickstart
10+
----------
11+
>>> from dataretrieval.duckdb_connectors import waterdata, wqp
12+
>>> with waterdata.connect() as con:
13+
... con.monitoring_locations(state_name="Illinois", register_as="sites")
14+
... con.sql("SELECT count(*) FROM sites").fetchone()
15+
"""
16+
17+
from __future__ import annotations
18+
19+
from . import waterdata, wqp
20+
from ._base import DUCKDB
21+
22+
__all__ = ["DUCKDB", "waterdata", "wqp"]

0 commit comments

Comments
 (0)