|
| 1 | +# DuckDB connector — design notes |
| 2 | + |
| 3 | +Notes captured before implementation. The goal is a prototype that lets users |
| 4 | +query USGS waterdata endpoints via DuckDB SQL. |
| 5 | + |
| 6 | +## Package conventions to follow |
| 7 | + |
| 8 | +- **Style**: ruff-managed, py38 target, double quotes, docstring code |
| 9 | + formatted at width 72. |
| 10 | +- **Type hints**: full hints, use `from __future__ import annotations`, |
| 11 | + `str | list[str] | None` style. |
| 12 | +- **Docstrings**: numpy-style (Parameters / Returns sections with dashes). |
| 13 | + Module top has a short one-paragraph summary. |
| 14 | +- **Naming**: snake_case functions, leading underscore for private helpers, |
| 15 | + UPPER_SNAKE_CASE module constants. |
| 16 | +- **Logging**: `logger = logging.getLogger(__name__)` at module top. |
| 17 | +- **Errors**: raise `ImportError` with a pip install hint when an optional |
| 18 | + dep is missing, `ValueError` for bad arguments, `RuntimeError` for |
| 19 | + unexpected response shapes. |
| 20 | + |
| 21 | +## Optional-dependency pattern (mirror `geopandas`) |
| 22 | + |
| 23 | +`dataretrieval/waterdata/utils.py:24`: |
| 24 | +```python |
| 25 | +try: |
| 26 | + import geopandas as gpd |
| 27 | + GEOPANDAS = True |
| 28 | +except ImportError: |
| 29 | + GEOPANDAS = False |
| 30 | +``` |
| 31 | + |
| 32 | +For the connector we will: |
| 33 | +- attempt `import duckdb` at the top of the module; |
| 34 | +- on `ImportError` set a sentinel and raise a clear `ImportError` from |
| 35 | + the public entry point telling users to `pip install dataretrieval[duckdb]`. |
| 36 | + |
| 37 | +`pyproject.toml` extras (current state): |
| 38 | +```toml |
| 39 | +[project.optional-dependencies] |
| 40 | +test = [...] |
| 41 | +doc = [...] |
| 42 | +nldi = ['geopandas>=0.10'] |
| 43 | +``` |
| 44 | +Add a new `duckdb = ["duckdb>=1.0.0"]` extra. |
| 45 | + |
| 46 | +## Endpoint shape |
| 47 | + |
| 48 | +All `dataretrieval.waterdata.api.get_*` functions return |
| 49 | +`tuple[pandas.DataFrame | geopandas.GeoDataFrame, BaseMetadata]`. Pagination |
| 50 | +is fully handled inside `_walk_pages`, so a single call is the whole |
| 51 | +result set. |
| 52 | + |
| 53 | +Endpoints we will expose first (highest user value, all OGC): |
| 54 | +- `get_monitoring_locations` — site discovery (returns GeoDataFrame when |
| 55 | + geopandas installed) |
| 56 | +- `get_daily` — daily values |
| 57 | +- `get_continuous` — instantaneous values (≤3y per call by API contract) |
| 58 | +- `get_time_series_metadata` — what's available at each site |
| 59 | +- `get_latest_continuous`, `get_latest_daily` — most recent obs |
| 60 | + |
| 61 | +Each accepts `monitoring_location_id`, `parameter_code`, etc. as scalar or |
| 62 | +list, plus the new `filter` / `filter_lang` CQL passthrough (#238). |
| 63 | + |
| 64 | +## Architecture (after wqp + per-source split) |
| 65 | + |
| 66 | +After surveying `wqp.py` we moved to a per-source connector package: |
| 67 | + |
| 68 | +``` |
| 69 | +dataretrieval/duckdb_connectors/ |
| 70 | +├── _base.py # _require_duckdb, _flatten_geometry, _BaseConnection |
| 71 | +├── waterdata.py # WaterdataConnection + connect() |
| 72 | +└── wqp.py # WQPConnection + connect() (handles legacy / WQX3 flag) |
| 73 | +``` |
| 74 | + |
| 75 | +`dataretrieval/duckdb_connector.py` stays as a thin alias re-exporting |
| 76 | +the waterdata connector so the older import path keeps working. |
| 77 | + |
| 78 | +WQP differences vs waterdata that the connector has to absorb: |
| 79 | + |
| 80 | +* WQP getters take `**kwargs` (CamelCase URL params) rather than fully |
| 81 | + enumerated signatures, so the connector can't validate kwargs — it |
| 82 | + just forwards them. |
| 83 | +* Two parallel schemas (legacy WQX vs WQX 3.0) selected by `legacy=`. |
| 84 | + The connection holds a default that callers can override on any |
| 85 | + individual call. |
| 86 | +* `ssl_check` is also a connection-level default. |
| 87 | +* WQP returns a custom `WQP_Metadata` instead of `BaseMetadata`, but |
| 88 | + since the connector only consumes the DataFrame this doesn't matter. |
| 89 | + |
| 90 | +Joining across the two sources: each connector owns its own duckdb |
| 91 | +connection, so to join you either materialise one result to a DataFrame |
| 92 | +and register it on the other connection with `.con.register(name, df)`, |
| 93 | +or open a single `duckdb.connect()` yourself and pass it to both |
| 94 | +`WaterdataConnection(con)` and `WQPConnection(con)`. |
| 95 | + |
| 96 | +Other modules surveyed but not given connectors: |
| 97 | + |
| 98 | +* `nwis` — deprecated; users are being pushed to waterdata. |
| 99 | +* `nldi` — returns GeoDataFrames / dicts; spatial-only, different |
| 100 | + contract; possible later. |
| 101 | +* `streamstats`, `nadp` — return non-tabular data (Watershed objects, |
| 102 | + zip files / TIFs); not connector candidates. |
| 103 | +* `ngwmn` — does return DataFrames but very narrow scope; could add |
| 104 | + later if needed. |
| 105 | +* `samples` — already covered by the waterdata connector via |
| 106 | + `wd.samples(...)` (the `samples.py` module is a deprecated shim |
| 107 | + that forwards to `waterdata.get_samples`). |
| 108 | + |
| 109 | +## DuckDB integration choices |
| 110 | + |
| 111 | +DuckDB ≥0.8 supports registering Python objects via `con.register(name, df)` |
| 112 | +which makes a pandas DataFrame queryable as a view. That's the simplest |
| 113 | +path and works with any DuckDB build — no compiled extension needed for a |
| 114 | +prototype. |
| 115 | + |
| 116 | +DuckDB also supports `create_function` for **scalar** UDFs, but **table** |
| 117 | +UDFs (table-valued functions callable as `FROM tvf(...)`) require either |
| 118 | +the in-progress python table-function API or a workaround. For a |
| 119 | +prototype the simpler approach is preferable: expose helper *methods* on |
| 120 | +the connection that take kwargs, fetch a DataFrame, register it under a |
| 121 | +caller-chosen name, and return a `duckdb.DuckDBPyRelation`. The user |
| 122 | +writes: |
| 123 | + |
| 124 | +```python |
| 125 | +con = waterdata_duckdb.connect() |
| 126 | +sites = con.monitoring_locations(state_name="Illinois") # relation |
| 127 | +con.sql("SELECT * FROM sites WHERE site_type = 'Stream'") |
| 128 | +``` |
| 129 | + |
| 130 | +This keeps it pythonic and lets users compose with arbitrary SQL, |
| 131 | +including joins across two registered relations. |
| 132 | + |
| 133 | +A second affordance: a `con.sql_table(name, fn, **kwargs)` that registers |
| 134 | +a one-shot DataFrame view by name, so: |
| 135 | + |
| 136 | +```python |
| 137 | +con.sql_table("daily", waterdata.get_daily, |
| 138 | + monitoring_location_id="USGS-05586100", |
| 139 | + parameter_code="00060", time="2023/2024") |
| 140 | +con.sql("SELECT date_trunc('month', time) AS m, avg(value) " |
| 141 | + "FROM daily GROUP BY 1 ORDER BY 1") |
| 142 | +``` |
| 143 | + |
| 144 | +## Geometry handling |
| 145 | + |
| 146 | +When geopandas is available, `get_monitoring_locations` returns a |
| 147 | +GeoDataFrame with a `geometry` column. DuckDB has a `spatial` extension |
| 148 | +that understands WKB/WKT, but it isn't loaded by default. Safe path for |
| 149 | +the prototype: convert the geometry to a WKT string and add `longitude` / |
| 150 | +`latitude` columns. That keeps the relation queryable from plain DuckDB |
| 151 | +without extension setup. |
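The conversion could look roughly like the following. This is a sketch under two assumptions: every geometry is a point (as for monitoring locations), and the WKT text is kept in the existing `geometry` column rather than under a new name. `_flatten_geometry` is the helper name listed for `_base.py`; its real behaviour may differ.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point


def _flatten_geometry(df):
    """Return a plain DataFrame with WKT text and lon/lat columns."""
    if not isinstance(df, gpd.GeoDataFrame):
        return df  # nothing to flatten
    geom = df.geometry
    # Drop the geometry column and downgrade to a plain DataFrame.
    out = pd.DataFrame(df.drop(columns=geom.name))
    out["geometry"] = geom.to_wkt()  # plain strings, no spatial extension needed
    out["longitude"] = geom.x  # assumes point geometries
    out["latitude"] = geom.y
    return out


sites = gpd.GeoDataFrame(
    {"monitoring_location_id": ["USGS-05586100"]},
    geometry=[Point(-90.5, 39.7)],  # made-up coordinates for illustration
    crs="EPSG:4326",
)
flat = _flatten_geometry(sites)
```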
| 152 | + |
| 153 | +## Tests |
| 154 | + |
| 155 | +Existing tests (`tests/waterdata_test.py`) use `requests-mock` against |
| 156 | +real URLs. For our connector we don't need to re-test the HTTP layer — |
| 157 | +we should mock the waterdata `get_*` functions directly with |
| 158 | +`unittest.mock.patch` (this is the pattern in |
| 159 | +`tests/waterdata_nearest_test.py`) and assert that: |
| 160 | + |
| 161 | +1. `connect()` raises a clean `ImportError` if duckdb isn't installed. |
| 162 | +2. Helper methods invoke the underlying `get_*` with the kwargs we passed. |
| 163 | +3. The returned object is a queryable DuckDB relation. |
| 164 | +4. `sql_table` registers a view that returns the same row count as the |
| 165 | + source DataFrame. |
| 166 | +5. Geometry conversion produces WKT + lon/lat columns and drops the |
| 167 | + GeoDataFrame `geometry` column (or keeps it as WKT) without needing |
| 168 | + the spatial extension. |
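Assertion 2 can be sketched with the standard library alone. Here `fake_waterdata` is a stand-in namespace for the real waterdata module and `daily` is a hypothetical helper, so the example shows only the patching pattern, not the connector's actual test:

```python
from types import SimpleNamespace
from unittest import mock

# Stand-in for the waterdata module; the real getter would hit the network.
fake_waterdata = SimpleNamespace(get_daily=lambda **kwargs: ([], None))


def daily(**kwargs):
    """Hypothetical helper: forward kwargs to get_daily, return the data part."""
    data, _metadata = fake_waterdata.get_daily(**kwargs)
    return data


def check_kwargs_are_forwarded():
    rows = [{"value": 1.0}]
    with mock.patch.object(
        fake_waterdata, "get_daily", return_value=(rows, None)
    ) as patched:
        result = daily(
            monitoring_location_id="USGS-05586100", parameter_code="00060"
        )
    # The getter must see exactly the kwargs the caller supplied.
    patched.assert_called_once_with(
        monitoring_location_id="USGS-05586100", parameter_code="00060"
    )
    return result


forwarded = check_kwargs_are_forwarded()
```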
| 169 | + |
| 170 | +## Notebook |
| 171 | + |
| 172 | +Goes in `demos/`. Should: |
| 173 | +- show a real query against `api.waterdata.usgs.gov` |
| 174 | +- demonstrate something easier in SQL than pandas (window function over |
| 175 | + daily flow, monthly aggregation, join of monitoring-location metadata |
| 176 | + to daily values) |
| 177 | +- note clearly that this needs `pip install dataretrieval[duckdb]` |
| 178 | + |
| 179 | +Demos are excluded from ruff (`extend-exclude = ["demos"]` in |
| 180 | +pyproject.toml) so we don't have to fight formatting there. |