From 1783f2c88dfb8b30d028d28016a15c2c76b404d3 Mon Sep 17 00:00:00 2001 From: thodson-usgs Date: Wed, 27 May 2026 08:08:34 -0500 Subject: [PATCH 1/2] docs: add Python ports of the new USGS Water Data API vignettes Port five new R dataRetrieval Water Data API vignettes to the Python `waterdata` module as executable demo notebooks, wired into the Sphinx docs under a new "USGS Water Data API vignettes" section: - USGS_WaterData_Introduction_Examples (read_waterdata_functions.Rmd) - USGS_WaterData_DiscreteSamples_Examples (samples_data.Rmd) - USGS_WaterData_DailyStatistics_Examples (daily_data_statistics.Rmd) - USGS_WaterData_ContinuousData_Examples (continuous_pr.Rmd) - USGS_WaterData_ReferenceLists_Examples (Reference_Lists.Rmd) Each notebook was executed end-to-end against the live USGS Water Data API during development; outputs are cleared per the repo convention (the Sphinx docs build re-executes notebooks at build time). Co-Authored-By: Claude Opus 4.7 (1M context) --- ...GS_WaterData_ContinuousData_Examples.ipynb | 257 +++++++ ...S_WaterData_DailyStatistics_Examples.ipynb | 437 ++++++++++++ ...S_WaterData_DiscreteSamples_Examples.ipynb | 541 ++++++++++++++ ...USGS_WaterData_Introduction_Examples.ipynb | 660 ++++++++++++++++++ ...GS_WaterData_ReferenceLists_Examples.ipynb | 138 ++++ ...S_WaterData_ContinuousData_Examples.nblink | 3 + ..._WaterData_DailyStatistics_Examples.nblink | 3 + ..._WaterData_DiscreteSamples_Examples.nblink | 3 + ...SGS_WaterData_Introduction_Examples.nblink | 3 + ...S_WaterData_ReferenceLists_Examples.nblink | 3 + docs/source/examples/index.rst | 17 + 11 files changed, 2065 insertions(+) create mode 100644 demos/USGS_WaterData_ContinuousData_Examples.ipynb create mode 100644 demos/USGS_WaterData_DailyStatistics_Examples.ipynb create mode 100644 demos/USGS_WaterData_DiscreteSamples_Examples.ipynb create mode 100644 demos/USGS_WaterData_Introduction_Examples.ipynb create mode 100644 demos/USGS_WaterData_ReferenceLists_Examples.ipynb create 
mode 100644 docs/source/examples/USGS_WaterData_ContinuousData_Examples.nblink create mode 100644 docs/source/examples/USGS_WaterData_DailyStatistics_Examples.nblink create mode 100644 docs/source/examples/USGS_WaterData_DiscreteSamples_Examples.nblink create mode 100644 docs/source/examples/USGS_WaterData_Introduction_Examples.nblink create mode 100644 docs/source/examples/USGS_WaterData_ReferenceLists_Examples.nblink diff --git a/demos/USGS_WaterData_ContinuousData_Examples.ipynb b/demos/USGS_WaterData_ContinuousData_Examples.ipynb new file mode 100644 index 00000000..735e5439 --- /dev/null +++ b/demos/USGS_WaterData_ContinuousData_Examples.ipynb @@ -0,0 +1,257 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "d664492b", + "metadata": {}, + "source": [ + "# Continuous Data\n", + "\n", + "Continuous data are collected by automated sensors, typically at a fixed\n", + "15-minute interval (you may also hear them called \"instantaneous values\" or\n", + "\"IV\"). They are described by parameter name and parameter code, and retrieved\n", + "with `get_continuous`.\n", + "\n", + "This notebook covers the two things that matter when a continuous pull gets\n", + "large: `dataretrieval` **chunks big requests for you** and can **resume** a pull\n", + "that was interrupted partway through, and the one case you still handle yourself\n", + "— the service's 3-year-per-request time limit." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e7e06e81", + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "\n", + "from dataretrieval import waterdata\n", + "\n", + "site = \"USGS-0208458892\"" + ] + }, + { + "cell_type": "markdown", + "id": "b0136bd1", + "metadata": {}, + "source": [ + "## What continuous data are available?\n", + "\n", + "Filter the combined metadata to `data_type=\"Continuous values\"` to see which\n", + "time series a site offers and how far back each goes:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6f8a9d87", + "metadata": {}, + "outputs": [], + "source": [ + "continuous_available, _ = waterdata.get_combined_metadata(\n", + " monitoring_location_id=site,\n", + " data_type=\"Continuous values\",\n", + ")\n", + "avail = continuous_available[[\"parameter_code\", \"parameter_name\", \"begin\", \"end\"]]\n", + "avail.sort_values(\"parameter_code\").reset_index(drop=True)" + ] + }, + { + "cell_type": "markdown", + "id": "fdaa8150", + "metadata": {}, + "source": [ + "## Large requests are chunked for you\n", + "\n", + "Any list-valued argument — a long list of monitoring locations, several parameter\n", + "codes, a complex CQL filter — can push a single request URL past the server's\n", + "~8 KB limit. `dataretrieval` handles this automatically: it splits the query into\n", + "URL-sized sub-requests, issues them, and recombines (and de-duplicates) the\n", + "results into one frame. 
**You never need to loop over sites yourself** — request\n", + "everything in one call.\n", + "\n", + "For example, asking for several parameter codes at once just returns one combined\n", + "long-format frame:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6bc05102", + "metadata": {}, + "outputs": [], + "source": [ + "multi, _ = waterdata.get_continuous(\n", + " monitoring_location_id=site,\n", + " parameter_code=[\"00095\", \"00010\"], # specific conductance + water temperature\n", + " time=\"2024-07-01/2024-07-02\",\n", + ")\n", + "multi.groupby(\"parameter_code\")[\"value\"].agg([\"count\", \"min\", \"max\"])" + ] + }, + { + "cell_type": "markdown", + "id": "353ad4ec", + "metadata": {}, + "source": [ + "## Resilient pulls: resume after an interruption\n", + "\n", + "A large request becomes many sub-requests under the hood, so a long pull can be\n", + "interrupted partway through by a rate limit (HTTP 429) or a transient server\n", + "error (HTTP 5xx). Rather than discard the work already done, `dataretrieval`\n", + "raises a `ChunkInterrupted` that **preserves the completed sub-requests** and\n", + "lets you continue:\n", + "\n", + "- `QuotaExhausted` (429) and `ServiceInterrupted` (5xx) both subclass\n", + " `ChunkInterrupted`.\n", + "- `exc.partial_frame` holds whatever completed before the failure.\n", + "- `exc.retry_after` is the server's suggested wait (when provided).\n", + "- `exc.call.resume()` re-issues **only the still-pending** sub-requests and\n", + " returns the full `(data, metadata)`.\n", + "\n", + "The pattern below waits out the interruption and resumes until the pull\n", + "finishes. 
(In normal conditions the request completes on the first try and the\n", + "`except` block never runs.)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e2e9ddff", + "metadata": {}, + "outputs": [], + "source": [ + "import time\n", + "\n", + "from dataretrieval.waterdata.chunking import ChunkInterrupted\n", + "\n", + "try:\n", + " sensor_data, _ = waterdata.get_continuous(\n", + " monitoring_location_id=site,\n", + " parameter_code=\"00095\",\n", + " time=\"2024-07-01/2024-07-08\",\n", + " )\n", + "except ChunkInterrupted as exc:\n", + " print(\n", + " f\"interrupted after {exc.completed_chunks}/{exc.total_chunks} chunks; resuming\"\n", + " )\n", + " while True:\n", + " time.sleep(exc.retry_after or 5 * 60) # honor Retry-After, else back off\n", + " try:\n", + " sensor_data, _ = exc.call.resume()\n", + " break\n", + " except ChunkInterrupted as again:\n", + " exc = again\n", + "\n", + "print(f\"{len(sensor_data):,} rows\")\n", + "sensor_data[[\"time\", \"parameter_code\", \"value\", \"approval_status\"]].head()" + ] + }, + { + "cell_type": "markdown", + "id": "397e87b5", + "metadata": {}, + "source": [ + "## The 3-year window: the one axis you split yourself\n", + "\n", + "There is one limit the library does **not** chunk for you: the continuous service\n", + "returns at most **3 years of data per request**, and a time window is not a\n", + "list-shaped axis it can fan out. (With no `time` argument the service returns the\n", + "latest year; continuous data also has no geometry column and ignores bounding-box\n", + "queries.)\n", + "\n", + "So a multi-year, single-site pull is the one place you still split by time. 
The\n", + "service is most efficient one calendar year at a time, so build a list of yearly\n", + "windows:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bd26d199", + "metadata": {}, + "outputs": [], + "source": [ + "# Split [start, end] into per-calendar-year (start, end) date strings.\n", + "def year_chunks(start, end):\n", + " start, end = pd.Timestamp(start), pd.Timestamp(end)\n", + " edges = pd.to_datetime([f\"{y}-01-01\" for y in range(start.year + 1, end.year + 1)])\n", + " starts = [start, *edges]\n", + " ends = [*(edges - pd.Timedelta(days=1)), end]\n", + " return [\n", + " (s.strftime(\"%Y-%m-%d\"), e.strftime(\"%Y-%m-%d\")) for s, e in zip(starts, ends)\n", + " ]\n", + "\n", + "\n", + "# Covering a full multi-year record (no data downloaded here):\n", + "pd.DataFrame(year_chunks(\"2012-10-01\", \"2025-09-30\"), columns=[\"start\", \"end\"])" + ] + }, + { + "cell_type": "markdown", + "id": "3bc4f40f", + "metadata": {}, + "source": [ + "Then request each window and concatenate. (We use a short two-window span here so\n", + "the notebook runs quickly; widen the dates for a full period of record.)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "01ebb4a0", + "metadata": {}, + "outputs": [], + "source": [ + "chunks = year_chunks(\"2023-10-01\", \"2024-03-31\")\n", + "\n", + "frames = []\n", + "for start, end in chunks:\n", + " part, _ = waterdata.get_continuous(\n", + " monitoring_location_id=site,\n", + " parameter_code=\"00095\",\n", + " time=f\"{start}/{end}\",\n", + " )\n", + " frames.append(part)\n", + "\n", + "por = pd.concat(frames, ignore_index=True)\n", + "print(\n", + " f\"{len(por):,} rows from {len(chunks)} windows, \"\n", + " f\"{por['time'].min()} -> {por['time'].max()}\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "e2487bf4", + "metadata": {}, + "source": [ + "Wrap each window's call in the resume pattern above for an unattended,\n", + "restart-safe pull. 
USGS also expects to offer a direct full-period-of-record\n", + "download before the legacy NWIS services are decommissioned, which may make\n", + "time-window splitting unnecessary — check the documentation for updates.\n", + "\n", + "## More help\n", + "\n", + "- Documentation: \n", + "- Chunking and resume internals: `dataretrieval.waterdata.chunking`\n", + "- Issues / questions: \n", + "- Equivalent R article: [Continuous Data](https://doi-usgs.github.io/dataRetrieval/articles/continuous_pr.html)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/demos/USGS_WaterData_DailyStatistics_Examples.ipynb b/demos/USGS_WaterData_DailyStatistics_Examples.ipynb new file mode 100644 index 00000000..f35f52c9 --- /dev/null +++ b/demos/USGS_WaterData_DailyStatistics_Examples.ipynb @@ -0,0 +1,437 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "fe73969b", + "metadata": {}, + "source": [ + "# Daily statistics: `get_stats_por` and `get_stats_date_range`\n", + "\n", + "`get_stats_por` and `get_stats_date_range` return pre-computed temporal\n", + "statistics from the [modernized statistics API](https://api.waterdata.usgs.gov/statistics/v0/docs),\n", + "the modern replacement for the legacy NWIS statistics service. 
The two functions wrap\n", + "endpoints that look similar but answer different questions:\n", + "\n", + "| Function | API endpoint | Returns |\n", + "| --- | --- | --- |\n", + "| `get_stats_por` | `observationNormals` | day-of-year and month-of-year statistics across the period of record |\n", + "| `get_stats_date_range` | `observationIntervals` | monthly and annual statistics within a requested date range |\n", + "\n", + "A couple of usage notes:\n", + "\n", + "- Pass `computation_type=` to choose the statistic — `arithmetic_mean`,\n", + " `median`, `minimum`, `maximum`, or `percentile`.\n", + "- There is no dedicated argument to return only day-of-year vs. month-of-year\n", + " (or only calendar vs. water year), so filter the returned `time_of_year_type`\n", + " / `interval_type` column in pandas, as shown below." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d6ab1ce4", + "metadata": {}, + "outputs": [], + "source": [ + "import matplotlib.dates as mdates\n", + "import matplotlib.pyplot as plt\n", + "import pandas as pd\n", + "\n", + "from dataretrieval import waterdata\n", + "\n", + "%matplotlib inline\n", + "\n", + "site1 = \"USGS-02037500\"" + ] + }, + { + "cell_type": "markdown", + "id": "cf8868ae", + "metadata": {}, + "source": [ + "## Fetching day-of-year and month-of-year statistics\n", + "\n", + "Day-of-year and month-of-year statistics aggregate observations for the same\n", + "calendar day or month across many years to describe typical seasonal conditions\n", + "(all Januarys, or all January 1sts). 
Below we request day-of-year discharge\n", + "averages for January 1 and 2 — note `start_date`/`end_date` are in `MM-DD`\n", + "format:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f0ab13bb", + "metadata": {}, + "outputs": [], + "source": [ + "jan_por_mean, _ = waterdata.get_stats_por(\n", + " monitoring_location_id=site1,\n", + " parameter_code=\"00060\",\n", + " computation_type=\"arithmetic_mean\",\n", + " start_date=\"01-01\",\n", + " end_date=\"01-02\",\n", + ")\n", + "jan_por_mean[[\"time_of_year\", \"time_of_year_type\", \"computation\", \"value\"]]" + ] + }, + { + "cell_type": "markdown", + "id": "3dc2b04f", + "metadata": {}, + "source": [ + "The first two rows are the day-of-year averages. What's the third row? Its\n", + "`time_of_year_type` is `month_of_year` — it's the average across all *Januarys*.\n", + "This is a quirk of the statistics API: whenever the `start_date`–`end_date` range\n", + "overlaps the first day of a month (here `01-01`), you also get the month-of-year\n", + "summary.\n", + "\n", + "To return only one type, filter the `time_of_year_type` column — here,\n", + "month-of-year only:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4d561aba", + "metadata": {}, + "outputs": [], + "source": [ + "moy = jan_por_mean[jan_por_mean[\"time_of_year_type\"] == \"month_of_year\"]\n", + "moy[[\"time_of_year\", \"time_of_year_type\", \"computation\", \"value\"]]" + ] + }, + { + "cell_type": "markdown", + "id": "43fe1eef", + "metadata": {}, + "source": [ + "### Percentile band plot\n", + "\n", + "Now an example that shows the power of the statistics API: we pull *all*\n", + "day-of-year discharge percentiles for the site. 
Computing these without the API\n", + "would mean downloading the entire daily period of record and computing\n", + "percentiles by hand.\n", + "\n", + "By default `get_stats_por` sets `expand_percentiles=True`, returning one row per\n", + "percentile with the value in `value` and the threshold in `percentile`\n", + "(minimum is reported as percentile 0, maximum as 100)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "18bd842c", + "metadata": {}, + "outputs": [], + "source": [ + "full_por_percentiles, _ = waterdata.get_stats_por(\n", + " monitoring_location_id=site1,\n", + " parameter_code=\"00060\",\n", + " computation_type=[\"minimum\", \"maximum\", \"percentile\"],\n", + " start_date=\"01-01\",\n", + " end_date=\"12-31\",\n", + ")\n", + "# The January 1 day-of-year percentiles (used on the WDFN state pages):\n", + "jan1 = full_por_percentiles[\n", + " (full_por_percentiles[\"time_of_year\"] == \"01-01\")\n", + " & (full_por_percentiles[\"time_of_year_type\"] == \"day_of_year\")\n", + "]\n", + "jan1.sort_values(\"percentile\")[[\"time_of_year\", \"computation\", \"percentile\", \"value\"]]" + ] + }, + { + "cell_type": "markdown", + "id": "c8fc4a28", + "metadata": {}, + "source": [ + "Pivoting the day-of-year rows so each percentile is a column lets us draw the\n", + "percentile \"ribbons\" — each band spans two adjacent percentiles (min–5th, 5th–10th, …):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "aaa72823", + "metadata": {}, + "outputs": [], + "source": [ + "doy = full_por_percentiles[\n", + " full_por_percentiles[\"time_of_year_type\"] == \"day_of_year\"\n", + "].copy()\n", + "doy[\"value\"] = pd.to_numeric(doy[\"value\"], errors=\"coerce\") # API returns strings\n", + "bands = doy.pivot_table(index=\"time_of_year\", columns=\"percentile\", values=\"value\")\n", + "bands.columns = [int(c) for c in bands.columns]\n", + "bands = bands.sort_index() # \"MM-DD\" strings sort chronologically within a year\n", + 
"\n", + "# x positions: map MM-DD onto a reference (leap) year so 02-29 is included\n", + "x = pd.to_datetime(\"2024-\" + bands.index, format=\"%Y-%m-%d\")\n", + "\n", + "# (lo, hi) percentile range, fill color, legend label\n", + "band_defs = [\n", + " ((95, 100), \"#292f6b\", \"95th Percentile - Max\"),\n", + " ((90, 95), \"#5699c0\", \"90th - 95th Percentile\"),\n", + " ((75, 90), \"#aacee0\", \"75th - 90th Percentile\"),\n", + " ((25, 75), \"#e9e9e9\", \"25th - 75th Percentile\"),\n", + " ((10, 25), \"#ebd6ab\", \"10th - 25th Percentile\"),\n", + " ((5, 10), \"#dcb668\", \"5th - 10th Percentile\"),\n", + " ((0, 5), \"#8f4f1f\", \"Min - 5th Percentile\"),\n", + "]\n", + "\n", + "fig, ax = plt.subplots(figsize=(9, 5))\n", + "for (lo, hi), color, label in band_defs:\n", + " ax.fill_between(x, bands[lo], bands[hi], facecolor=color, alpha=0.7, label=label)\n", + "ax.set_yscale(\"log\")\n", + "ax.xaxis.set_major_locator(mdates.MonthLocator())\n", + "ax.xaxis.set_major_formatter(mdates.DateFormatter(\"%b\"))\n", + "ax.set_xlabel(\"Month\")\n", + "ax.set_ylabel(\"Discharge, cubic feet per second\")\n", + "ax.set_title(\"Day-of-year percentile bands (USGS-02037500)\")\n", + "ax.legend(title=\"Historical percentiles\", fontsize=7, loc=\"upper right\")\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "7b4075bd", + "metadata": {}, + "source": [ + "Finally, overlay the actual daily mean discharge so we can see where recent\n", + "conditions fall relative to the historical bands, exactly the view on the\n", + "[Water Data for the Nation (WDFN) statistical graphs](https://waterdata.usgs.gov/monitoring-location/USGS-02037500/statistical-graphs/).\n", + "We pull two calendar years (2024 and 2025) of daily means and join them to the bands by month-day." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "961eea3a", + "metadata": {}, + "outputs": [], + "source": [ + "daily, _ = waterdata.get_daily(\n", + " monitoring_location_id=site1,\n", + " parameter_code=\"00060\",\n", + " statistic_id=\"00003\",\n", + " time=[\"2024-01-01\", \"2025-12-31\"],\n", + ")\n", + "daily = daily.sort_values(\"time\").reset_index(drop=True)\n", + "daily[\"md\"] = daily[\"time\"].dt.strftime(\"%m-%d\")\n", + "\n", + "# Repeat the day-of-year bands across each actual calendar date\n", + "b = bands.reindex(daily[\"md\"]).reset_index(drop=True)\n", + "\n", + "fig, ax = plt.subplots(figsize=(9, 5))\n", + "for (lo, hi), color, label in band_defs:\n", + " ax.fill_between(\n", + " daily[\"time\"], b[lo], b[hi], facecolor=color, alpha=0.7, label=label\n", + " )\n", + "ax.plot(daily[\"time\"], daily[\"value\"], color=\"black\", lw=0.9, label=\"Daily mean\")\n", + "prov = daily[daily[\"approval_status\"] == \"Provisional\"]\n", + "ax.scatter(prov[\"time\"], prov[\"value\"], color=\"red\", s=5, zorder=3, label=\"Provisional\")\n", + "ax.set_yscale(\"log\")\n", + "ax.xaxis.set_major_locator(mdates.MonthLocator(interval=3))\n", + "ax.xaxis.set_major_formatter(mdates.DateFormatter(\"%b %Y\"))\n", + "ax.set_ylabel(\"Discharge, cubic feet per second\")\n", + "ax.set_title(\"Daily mean discharge vs. historical percentile bands\")\n", + "ax.legend(fontsize=7, ncol=2, loc=\"upper right\")\n", + "fig.autofmt_xdate()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "e31f1726", + "metadata": {}, + "source": [ + "## Fetching monthly and annual statistics within a date range\n", + "\n", + "Unlike the day-/month-of-year normals, `get_stats_date_range` summarizes specific\n", + "months and years inside a requested window. 
Here we ask for the average discharge\n", + "for January 2024 — note the `YYYY-MM-DD` date format:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0bc8cd83", + "metadata": {}, + "outputs": [], + "source": [ + "jan_daterange_mean, _ = waterdata.get_stats_date_range(\n", + " monitoring_location_id=site1,\n", + " parameter_code=\"00060\",\n", + " computation_type=\"arithmetic_mean\",\n", + " start_date=\"2024-01-01\",\n", + " end_date=\"2024-01-31\",\n", + ")\n", + "jan_daterange_mean[[\"start_date\", \"end_date\", \"interval_type\", \"value\"]]" + ] + }, + { + "cell_type": "markdown", + "id": "7d915aed", + "metadata": {}, + "source": [ + "Instead of `time_of_year`, the output has `start_date`, `end_date`, and\n", + "`interval_type`. The first row is the monthly average; the API also returns the\n", + "**calendar year** and **water year** averages for any year intersecting the\n", + "range. A 93-day window can therefore touch two calendar and two water years:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cfe28029", + "metadata": {}, + "outputs": [], + "source": [ + "multiyear, _ = waterdata.get_stats_date_range(\n", + " monitoring_location_id=site1,\n", + " parameter_code=\"00060\",\n", + " computation_type=\"arithmetic_mean\",\n", + " start_date=\"2023-09-30\",\n", + " end_date=\"2024-01-01\",\n", + ")\n", + "multiyear[[\"start_date\", \"end_date\", \"interval_type\", \"value\"]]" + ] + }, + { + "cell_type": "markdown", + "id": "9c30978f", + "metadata": {}, + "source": [ + "Filter the `interval_type` column (values `month`, `calendar_year`,\n", + "`water_year`) to keep only certain intervals — here, the annual rows:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7ff90e81", + "metadata": {}, + "outputs": [], + "source": [ + "multiyear[multiyear[\"interval_type\"].isin([\"calendar_year\", \"water_year\"])][\n", + " [\"start_date\", \"end_date\", \"interval_type\", \"value\"]\n", + "]" + ] + 
}, + { + "cell_type": "markdown", + "id": "061b9cbe", + "metadata": {}, + "source": [ + "### Monthly mean table\n", + "\n", + "We can reproduce something like a Water Year Summary monthly-mean table. We pull\n", + "the full period of record (no dates), keep the monthly intervals, and aggregate\n", + "by calendar month in water-year order. (Values may differ slightly from the\n", + "official summaries due to rounding.)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1c705056", + "metadata": {}, + "outputs": [], + "source": [ + "monthly_raw, _ = waterdata.get_stats_date_range(\n", + " monitoring_location_id=site1,\n", + " parameter_code=\"00060\",\n", + " computation_type=\"arithmetic_mean\",\n", + ")\n", + "m = monthly_raw[monthly_raw[\"interval_type\"] == \"month\"].copy()\n", + "m[\"start_date\"] = pd.to_datetime(m[\"start_date\"])\n", + "m[\"value\"] = pd.to_numeric(m[\"value\"], errors=\"coerce\")\n", + "m = m[(m[\"start_date\"] >= \"2004-10-01\") & (m[\"start_date\"] < \"2025-09-01\")]\n", + "m = m.dropna(subset=[\"value\"])\n", + "m[\"month\"] = m[\"start_date\"].dt.strftime(\"%b\")\n", + "m[\"water_year\"] = (m[\"start_date\"] + pd.DateOffset(months=3)).dt.year\n", + "\n", + "\n", + "def summarize(g):\n", + " hi = g.loc[g[\"value\"].idxmax()]\n", + " lo = g.loc[g[\"value\"].idxmin()]\n", + " return pd.Series(\n", + " {\n", + " \"Mean\": round(g[\"value\"].mean()),\n", + " \"Max (WY)\": f\"{round(hi['value'])} ({int(hi['water_year'])})\",\n", + " \"Min (WY)\": f\"{round(lo['value'])} ({int(lo['water_year'])})\",\n", + " }\n", + " )\n", + "\n", + "\n", + "wy_order = [\n", + " \"Oct\",\n", + " \"Nov\",\n", + " \"Dec\",\n", + " \"Jan\",\n", + " \"Feb\",\n", + " \"Mar\",\n", + " \"Apr\",\n", + " \"May\",\n", + " \"Jun\",\n", + " \"Jul\",\n", + " \"Aug\",\n", + " \"Sep\",\n", + "]\n", + "table = m.groupby(\"month\")[[\"value\", \"water_year\"]].apply(summarize).reindex(wy_order)\n", + "table.T" + ] + }, + { + "cell_type": "markdown", + 
"id": "31cd8b14", + "metadata": {}, + "source": [ + "## Statistics API tips\n", + "\n", + "The statistics API does **not** follow the OGC standards used by the\n", + "`api.waterdata.usgs.gov/ogcapi/v0/` endpoints. A few things to keep in mind:\n", + "\n", + "- **Higher rate limits.** At the time of writing the statistics API allows ~4000\n", + " requests/hour per IP (per token if a token is supplied).\n", + "- **All columns, always.** There is no `skip_geometry` or `properties` argument —\n", + " the API returns the full column set.\n", + "- **Month-of-year normals.** To get month-of-year statistics from\n", + " `get_stats_por`, make the `start_date`–`end_date` range overlap the first of\n", + " the month (e.g. `01-01`–`03-01` returns the January, February, and March\n", + " month-of-year stats in addition to each day-of-year).\n", + "- **Monthly/annual intervals.** `get_stats_date_range` returns a summary for\n", + " every calendar month, calendar year, and water year that intersects the range.\n", + "- **Median = the 50th percentile.** Requesting both `median` and `percentile`\n", + " duplicates the median; you rarely need both.\n", + "- **Min/max are not percentiles.** Use\n", + " `computation_type=[\"minimum\", \"maximum\", \"percentile\"]` for a complete set of\n", + " order statistics (as we did for the band plot).\n", + "- **Fixed percentiles.** `percentile` only ever returns the 5th, 10th, 25th,\n", + " 50th, 75th, 90th, and 95th. 
For other percentiles, pull the daily record with\n", + " `get_daily` and compute them yourself.\n", + "- **Watch `sample_count`.** It's the number of observations behind a statistic;\n", + " there is no minimum, so a monthly/annual value can rest on a single daily\n", + " observation.\n", + "\n", + "## More help\n", + "\n", + "- Documentation: \n", + "- Statistics documentation: \n", + "- Equivalent R article: [daily statistics](https://doi-usgs.github.io/dataRetrieval/articles/daily_data_statistics.html)\n", + "- Issues / questions: " + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/demos/USGS_WaterData_DiscreteSamples_Examples.ipynb b/demos/USGS_WaterData_DiscreteSamples_Examples.ipynb new file mode 100644 index 00000000..a58e6f56 --- /dev/null +++ b/demos/USGS_WaterData_DiscreteSamples_Examples.ipynb @@ -0,0 +1,541 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "438bbb08", + "metadata": {}, + "source": [ + "# Discrete water-quality samples: `get_samples`\n", + "\n", + "As USGS retires the legacy NWIS discrete water-quality services, the new\n", + "*Water Data for the Nation* samples service takes their place. In Python it is\n", + "exposed through three functions in `dataretrieval.waterdata`:\n", + "\n", + "- `get_samples` — retrieve discrete water-quality results (or, with `service=`,\n", + " the matching locations, activities, projects, or organizations).\n", + "- `get_samples_summary` — summarize what data a single site has.\n", + "- `get_codes` — list the allowable values for the categorical query arguments.\n", + "\n", + "We'll cover retrieving data from a known site, using geographic filters, and\n", + "discovering what data are available. 
The interactive web UI is at\n", + " and the API docs are at\n", + ".\n", + "\n", + "> Column names: unlike the OGC `get_daily` / `get_monitoring_locations`\n", + "> functions, the samples service uses WQX3-style names such as\n", + "> `Location_Latitude`, `Activity_StartDateTime`, and `Result_Measure`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "257b6197", + "metadata": {}, + "outputs": [], + "source": [ + "import matplotlib.pyplot as plt\n", + "import pandas as pd\n", + "\n", + "from dataretrieval import waterdata\n", + "from dataretrieval.waterdata import PROFILE_LOOKUP\n", + "\n", + "%matplotlib inline\n", + "plt.rcParams[\"figure.figsize\"] = (7, 4)\n", + "\n", + "\n", + "# Scatter plot of sample-site locations (a static map; use folium for an\n", + "# interactive version).\n", + "def map_sites(df, title=\"\"):\n", + " lon = pd.to_numeric(df[\"Location_Longitude\"], errors=\"coerce\")\n", + " lat = pd.to_numeric(df[\"Location_Latitude\"], errors=\"coerce\")\n", + " _, ax = plt.subplots(figsize=(7, 5))\n", + " ax.scatter(lon, lat, s=10, color=\"red\", alpha=0.7)\n", + " ax.set_xlabel(\"Longitude\")\n", + " ax.set_ylabel(\"Latitude\")\n", + " ax.set_title(f\"{title} ({len(df)} sites)\")\n", + " plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "3c166c18", + "metadata": {}, + "source": [ + "## Retrieving data from a known site\n", + "\n", + "Given a USGS site, `get_samples_summary` reports what discrete-sample data are\n", + "available there: one row per (characteristic group, characteristic,\n", + "user-supplied characteristic) with result and activity counts." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "27e0d33a", + "metadata": {}, + "outputs": [], + "source": [ + "site = \"USGS-04183500\"\n", + "data_at_site, _ = waterdata.get_samples_summary(monitoringLocationIdentifier=site)\n", + "data_at_site.sort_values(\"resultCount\", ascending=False).head(10)" + ] + }, + { + "cell_type": "markdown", + "id": "e388b2d5", + "metadata": {}, + "source": [ + "Note the `characteristicUserSupplied` column: asking for a bare characteristic\n", + "like *Phosphorus* would return both filtered and unfiltered values mixed\n", + "together. `characteristicUserSupplied` is a very specific descriptor (similar to\n", + "a long-form USGS parameter code) that lets you isolate exactly the constituent\n", + "you want. To pull the underlying data, use `get_samples`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "86bfc2b5", + "metadata": {}, + "outputs": [], + "source": [ + "user_char = \"Phosphorus as phosphorus, water, unfiltered\"\n", + "phos_data, _ = waterdata.get_samples(\n", + " monitoringLocationIdentifier=site,\n", + " characteristicUserSupplied=user_char,\n", + ")\n", + "print(f\"default ('fullphyschem') profile -> {phos_data.shape[1]} columns\")" + ] + }, + { + "cell_type": "markdown", + "id": "593529c6", + "metadata": {}, + "source": [ + "The default profile (`fullphyschem`, the \"Full physical chemical\" profile) is\n", + "comprehensive, hence the very wide table. 
For plotting we usually only need a few\n", + "columns, so ask for the `narrow` profile instead:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "682226d1", + "metadata": {}, + "outputs": [], + "source": [ + "phos_narrow, _ = waterdata.get_samples(\n", + " monitoringLocationIdentifier=site,\n", + " characteristicUserSupplied=user_char,\n", + " profile=\"narrow\",\n", + ")\n", + "print(f\"'narrow' profile -> {phos_narrow.shape[1]} columns\")\n", + "phos_narrow[[\"Activity_StartDateTime\", \"Result_Measure\", \"Result_MeasureUnit\"]].head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "697e0827", + "metadata": {}, + "outputs": [], + "source": [ + "x = pd.to_datetime(phos_narrow[\"Activity_StartDateTime\"], errors=\"coerce\")\n", + "y = pd.to_numeric(phos_narrow[\"Result_Measure\"], errors=\"coerce\")\n", + "fig, ax = plt.subplots(figsize=(7, 4))\n", + "ax.scatter(x, y, s=10)\n", + "ax.set_xlabel(\"Date\")\n", + "ax.set_ylabel(user_char, wrap=True)\n", + "ax.set_title(phos_narrow[\"Location_Name\"].iloc[0])\n", + "fig.autofmt_xdate()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "6573353a", + "metadata": {}, + "source": [ + "## Return data types\n", + "\n", + "Two arguments control what comes back: `service` defines the *kind* of data and\n", + "`profile` defines which columns of that kind are returned. The valid combinations\n", + "are published in `PROFILE_LOOKUP`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "49ceacca", + "metadata": {}, + "outputs": [], + "source": [ + "PROFILE_LOOKUP" + ] + }, + { + "cell_type": "markdown", + "id": "380fb6d1", + "metadata": {}, + "source": [ + "## Geographic filters\n", + "\n", + "Often you don't know a site number but you do have an area of interest. 
Below we\n", + "keep the queries lightweight by setting `service=\"locations\"` and\n", + "`profile=\"site\"` (so we get *where* data exists, not the result values\n", + "themselves) and filter on our phosphorus characteristic.\n", + "\n", + "### Bounding box\n", + "\n", + "A bounding box is `[west, south, east, north]`, i.e. minimum longitude, minimum latitude, maximum longitude, maximum latitude:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d2d582ff", + "metadata": {}, + "outputs": [], + "source": [ + "bbox = [-90.8, 44.2, -89.9, 45.0]\n", + "bbox_sites, _ = waterdata.get_samples(\n", + " boundingBox=bbox,\n", + " characteristicUserSupplied=user_char,\n", + " service=\"locations\",\n", + " profile=\"site\",\n", + ")\n", + "map_sites(bbox_sites, \"Phosphorus sites in bounding box\")" + ] + }, + { + "cell_type": "markdown", + "id": "c05a5786", + "metadata": {}, + "source": [ + "### Hydrologic unit codes (HUCs)\n", + "\n", + "HUCs identify drainage areas; this filter accepts 2-, 4-, 6-, 8-, 10-, or\n", + "12-digit codes."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fbbf7898", + "metadata": {}, + "outputs": [], + "source": [ + "huc_sites, _ = waterdata.get_samples(\n", + " hydrologicUnit=\"070700\",\n", + " characteristicUserSupplied=user_char,\n", + " service=\"locations\",\n", + " profile=\"site\",\n", + ")\n", + "map_sites(huc_sites, \"Phosphorus sites in HUC 070700\")" + ] + }, + { + "cell_type": "markdown", + "id": "151d88ba", + "metadata": {}, + "source": [ + "### Distance from a point\n", + "\n", + "Supply a latitude, longitude, and radius in miles:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9711e26c", + "metadata": {}, + "outputs": [], + "source": [ + "point_sites, _ = waterdata.get_samples(\n", + " pointLocationLatitude=43.074680,\n", + " pointLocationLongitude=-89.428054,\n", + " pointLocationWithinMiles=20,\n", + " characteristicUserSupplied=user_char,\n", + " service=\"locations\",\n", + " profile=\"site\",\n", + ")\n", + "map_sites(point_sites, \"Phosphorus sites within 20 mi of Madison, WI\")" + ] + }, + { + "cell_type": "markdown", + "id": "ec22beac", + "metadata": {}, + "source": [ + "### County FIPS\n", + "\n", + "County FIPS codes take the form `US:SS:CCC`. Wisconsin's state code is available\n", + "from `dataretrieval.codes`, and Dane County's full FIPS is `US:55:025`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f07b210b", + "metadata": {}, + "outputs": [], + "source": [ + "from dataretrieval.codes import states\n", + "\n", + "wi = states.fips_codes[\"Wisconsin\"] # \"55\"\n", + "dane_county = f\"US:{wi}:025\"\n", + "county_sites, _ = waterdata.get_samples(\n", + " countyFips=dane_county,\n", + " characteristicUserSupplied=user_char,\n", + " service=\"locations\",\n", + " profile=\"site\",\n", + ")\n", + "map_sites(county_sites, \"Phosphorus sites in Dane County, WI\")" + ] + }, + { + "cell_type": "markdown", + "id": "8e43b993", + "metadata": {}, + "source": [ + "### State FIPS\n", + "\n", + "State FIPS codes take the form `US:SS`. A whole-state query can return a lot of\n", + "sites, so here we also constrain the activity start date to October–November 2024\n", + "(see *Additional query parameters* below):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "83519737", + "metadata": {}, + "outputs": [], + "source": [ + "state_fip = f\"US:{wi}\" # \"US:55\"\n", + "state_sites_recent, _ = waterdata.get_samples(\n", + " stateFips=state_fip,\n", + " characteristicUserSupplied=user_char,\n", + " service=\"locations\",\n", + " activityStartDateLower=\"2024-10-01\",\n", + " activityStartDateUpper=\"2024-11-30\",\n", + " profile=\"site\",\n", + ")\n", + "map_sites(state_sites_recent, \"WI phosphorus sites, Oct-Nov 2024\")" + ] + }, + { + "cell_type": "markdown", + "id": "0aab190b", + "metadata": {}, + "source": [ + "## Additional query parameters\n", + "\n", + "Several parameters narrow the results further. The allowable values for the\n", + "categorical ones come from `get_codes`. 
Note that `get_codes` returns a plain\n", + "`DataFrame` (no metadata tuple).\n", + "\n", + "### `siteTypeCode` / `siteTypeName`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f21e23e7", + "metadata": {}, + "outputs": [], + "source": [ + "site_type_info = waterdata.get_codes(code_service=\"sitetype\")\n", + "site_type_info[[\"typeCode\", \"typeLongName\"]].head(10)" + ] + }, + { + "cell_type": "markdown", + "id": "fcdf0025", + "metadata": {}, + "source": [ + "### `activityMediaName`\n", + "\n", + "The environmental medium that was sampled or analyzed:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "64369260", + "metadata": {}, + "outputs": [], + "source": [ + "waterdata.get_codes(code_service=\"samplemedia\")[\"activityMedia\"].tolist()" + ] + }, + { + "cell_type": "markdown", + "id": "647de77a", + "metadata": {}, + "source": [ + "### `characteristicGroup`\n", + "\n", + "A broad category describing the measurement (generally following the Water\n", + "Quality Portal groups):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d1b139a9", + "metadata": {}, + "outputs": [], + "source": [ + "waterdata.get_codes(code_service=\"characteristicgroup\")[\"characteristicGroup\"].tolist()" + ] + }, + { + "cell_type": "markdown", + "id": "cfa4bfbf", + "metadata": {}, + "source": [ + "### `characteristic` and `usgsPCode`\n", + "\n", + "The `characteristics` table lists specific constituents along with their USGS\n", + "parameter codes:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "72c32873", + "metadata": {}, + "outputs": [], + "source": [ + "characteristic_info = waterdata.get_codes(code_service=\"characteristics\")\n", + "print(\"unique characteristic names:\")\n", + "print(characteristic_info[\"characteristicName\"].drop_duplicates().head().tolist())\n", + "print(\"\\nexample USGS parameter codes:\")\n", + 
"print(characteristic_info[\"parameterCode\"].dropna().drop_duplicates().head().tolist())" + ] + }, + { + "cell_type": "markdown", + "id": "c0872a69", + "metadata": {}, + "source": [ + "### `characteristicUserSupplied`\n", + "\n", + "The USGS \"observed property\" — the detailed descriptor that replaces the old\n", + "parameter name / pcode for discrete data, and the value we filtered on above:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "236c0f76", + "metadata": {}, + "outputs": [], + "source": [ + "waterdata.get_codes(code_service=\"observedproperty\")[\"observedProperty\"].head().tolist()" + ] + }, + { + "cell_type": "markdown", + "id": "3caa694d", + "metadata": {}, + "source": [ + "Other filters worth knowing about: `projectIdentifier` (needs prior project\n", + "info), `recordIdentifierUserSupplied` (needs the supplier's record id), and\n", + "`activityStartDateLower` / `activityStartDateUpper` for date ranges (used above)." + ] + }, + { + "cell_type": "markdown", + "id": "6dfb0d2f", + "metadata": {}, + "source": [ + "## Data discovery\n", + "\n", + "Combining a geographic filter with site-type and characteristic filters lets you\n", + "zero in on candidate sites. 
For example, lakes in Dane County, WI that measured\n", + "our phosphorus characteristic:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8af3af88", + "metadata": {}, + "outputs": [], + "source": [ + "county_lake_sites, _ = waterdata.get_samples(\n", + " countyFips=dane_county,\n", + " characteristicUserSupplied=user_char,\n", + " siteTypeName=\"Lake, Reservoir, Impoundment\",\n", + " service=\"locations\",\n", + " profile=\"site\",\n", + ")\n", + "print(f\"{len(county_lake_sites)} lake sites measuring phosphorus in Dane County, WI\")" + ] + }, + { + "cell_type": "markdown", + "id": "87f31bda", + "metadata": {}, + "source": [ + "`get_samples_summary` accepts one site at a time, so we loop over the candidate\n", + "sites to tally how much phosphorus data each has — useful for deciding which\n", + "sites to actually pull results from." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "421b6982", + "metadata": {}, + "outputs": [], + "source": [ + "rows = []\n", + "for loc_id in county_lake_sites[\"Location_Identifier\"]:\n", + " avail, _ = waterdata.get_samples_summary(monitoringLocationIdentifier=loc_id)\n", + " rows.append(avail[avail[\"characteristicUserSupplied\"] == user_char])\n", + "\n", + "all_data = pd.concat(rows, ignore_index=True)\n", + "all_data.sort_values(\"resultCount\", ascending=False)[\n", + " [\n", + " \"monitoringLocationIdentifier\",\n", + " \"resultCount\",\n", + " \"activityCount\",\n", + " \"firstActivity\",\n", + " \"mostRecentActivity\",\n", + " ]\n", + "]" + ] + }, + { + "cell_type": "markdown", + "id": "d2614654", + "metadata": {}, + "source": [ + "This summary helps narrow down which sites to request data from — whether you\n", + "need sites with recent data, lots of data, or just any measurement at all.\n", + "\n", + "## More help\n", + "\n", + "- Documentation: \n", + "- Samples API docs: \n", + "- Equivalent R article: [Introducing 
read_waterdata_samples](https://doi-usgs.github.io/dataRetrieval/articles/samples_data.html)\n", + "- Issues / questions: " + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/demos/USGS_WaterData_Introduction_Examples.ipynb b/demos/USGS_WaterData_Introduction_Examples.ipynb new file mode 100644 index 00000000..3ca30420 --- /dev/null +++ b/demos/USGS_WaterData_Introduction_Examples.ipynb @@ -0,0 +1,660 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "a8818c89", + "metadata": {}, + "source": [ + "# Introduction to the USGS Water Data APIs\n", + "\n", + "The [USGS Water Data APIs](https://api.waterdata.usgs.gov/ogcapi/v0/) are the\n", + "modern, OGC-based replacement for the legacy NWIS web services. In Python they are\n", + "exposed through the `dataretrieval.waterdata` module, which will gradually replace\n", + "the older `dataretrieval.nwis` functions.\n", + "\n", + "This notebook tours each new function. 
The NWIS shut-down timeline is still\n", + "uncertain, so we recommend migrating to the `waterdata` functions sooner rather\n", + "than later.\n", + "\n", + "If you are coming from the R `dataRetrieval` package, the functions map across as\n", + "follows:\n", + "\n", + "| R `dataRetrieval` | Python `dataretrieval.waterdata` |\n", + "| --- | --- |\n", + "| `read_waterdata_monitoring_location` | `get_monitoring_locations` |\n", + "| `read_waterdata_ts_meta` / `read_waterdata_combined_meta` | `get_time_series_metadata` / `get_combined_metadata` |\n", + "| `read_waterdata_parameter_codes` | `get_reference_table(collection=\"parameter-codes\")` |\n", + "| `read_waterdata_daily` | `get_daily` |\n", + "| `read_waterdata_continuous` | `get_continuous` |\n", + "| `read_waterdata_field_measurements` | `get_field_measurements` |\n", + "| `read_waterdata_channel` | `get_channel` |\n", + "| `read_waterdata_latest_continuous` / `read_waterdata_latest_daily` | `get_latest_continuous` / `get_latest_daily` |\n", + "| `read_waterdata` (CQL) | the `filter` / `filter_lang` arguments on any function |\n", + "| `read_waterdata_metadata` | `get_reference_table` |\n", + "| `read_waterdata_samples` | `get_samples` |\n", + "| `read_waterdata_stats_por` / `read_waterdata_stats_daterange` | `get_stats_por` / `get_stats_date_range` |" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "03b51493", + "metadata": {}, + "outputs": [], + "source": [ + "import matplotlib.pyplot as plt\n", + "import pandas as pd\n", + "\n", + "from dataretrieval import waterdata\n", + "\n", + "%matplotlib inline\n", + "plt.rcParams[\"figure.figsize\"] = (7, 4)" + ] + }, + { + "cell_type": "markdown", + "id": "27cea444", + "metadata": {}, + "source": [ + "> **Return values.** Nearly every `dataretrieval.waterdata` function returns a\n", + "> `(data, metadata)` tuple; `get_codes` instead returns a plain `DataFrame`. 
The first element is a `pandas.DataFrame` (or a\n", + "> `geopandas.GeoDataFrame` when the service returns a geometry column); the\n", + "> second is a small metadata object describing the request. Throughout this\n", + "> notebook we unpack the tuple as `df, md = waterdata.get_...(...)`." + ] + }, + { + "cell_type": "markdown", + "id": "1e38880f", + "metadata": {}, + "source": [ + "## New features\n", + "\n", + "The new API endpoints each deliver a different type of USGS water data, and they\n", + "all share features the legacy services lacked.\n", + "\n", + "### Flexible queries\n", + "\n", + "The new functions expose **all** of the query parameters the API supports, each\n", + "defaulting to `None`. You do **not** need to (and usually should not) specify\n", + "them all. Filters are combined with a Boolean *AND*: passing both a list of\n", + "monitoring locations and a list of parameter codes returns only the\n", + "combinations of the two. Because every argument is named, your IDE can\n", + "autocomplete the options.\n", + "\n", + "### Flexible columns returned\n", + "\n", + "Use the `properties` argument to choose which columns come back. The full set of\n", + "available properties for a collection is published in that collection's schema,\n", + "e.g. ." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d59a461b", + "metadata": {}, + "outputs": [], + "source": [ + "# Ask for just a few columns instead of the full ~40-column record.\n", + "site_info, _ = waterdata.get_monitoring_locations(\n", + " monitoring_location_id=\"USGS-01491000\",\n", + " properties=[\n", + " \"monitoring_location_id\",\n", + " \"site_type\",\n", + " \"drainage_area\",\n", + " \"monitoring_location_name\",\n", + " ],\n", + ")\n", + "site_info.drop(columns=\"geometry\")" + ] + }, + { + "cell_type": "markdown", + "id": "54ace335", + "metadata": {}, + "source": [ + "### API tokens\n", + "\n", + "USGS now rate-limits requests per IP address per hour. 
If you hit the limit you\n", + "can request a free API token at . Keep it\n", + "out of shared scripts and version control. (At the time of writing the Python\n", + "`dataretrieval` package does not yet wire a token into these calls; the rate\n", + "limits are generous for the queries below.)\n", + "\n", + "### Common Query Language (CQL2)\n", + "\n", + "The APIs accept [CQL2](https://www.loc.gov/standards/sru/cql/) expressions for\n", + "complex queries through the `filter` / `filter_lang` arguments. See the\n", + "[General retrieval and CQL2](#General-retrieval-and-CQL2) section below.\n", + "\n", + "### Simple features\n", + "\n", + "Spatial collections return a `geometry` column, so `get_*` calls give you a\n", + "`geopandas.GeoDataFrame` that drops straight into geospatial workflows. Pass `skip_geometry=True` to get a plain `DataFrame`." + ] + }, + { + "cell_type": "markdown", + "id": "20c5dd03", + "metadata": {}, + "source": [ + "## Lessons learned\n", + "\n", + "### Request many sites in one call\n", + "\n", + "`dataretrieval` automatically splits a large request (many monitoring\n", + "locations, several parameter codes, or a complex filter) into URL-sized\n", + "sub-requests and recombines the results, and it can resume a long pull that hits\n", + "a rate limit or transient server error without refetching completed work. So\n", + "pass all your sites in one call rather than looping over them.\n", + "\n", + "The main exception is **continuous** data, which is capped at 3 years per\n", + "request. See the *Continuous Data* notebook for large continuous pulls." + ] + }, + { + "cell_type": "markdown", + "id": "e49f3ad0", + "metadata": {}, + "source": [ + "## New functions\n", + "\n", + "### Monitoring location\n", + "\n", + "`get_monitoring_locations` returns site metadata. 
To browse the service in a\n", + "web browser, visit\n", + ".\n", + "\n", + "A simple request for one known USGS site:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d42fc61a", + "metadata": {}, + "outputs": [], + "source": [ + "sites_information, _ = waterdata.get_monitoring_locations(\n", + " monitoring_location_id=\"USGS-01491000\"\n", + ")\n", + "print(f\"{sites_information.shape[1]} columns returned\")\n", + "sites_information.drop(columns=\"geometry\").T" + ] + }, + { + "cell_type": "markdown", + "id": "29d1be4d", + "metadata": {}, + "source": [ + "Any returned column can also be used as an input filter. For example, to find\n", + "every stream site in Wisconsin:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cf090884", + "metadata": {}, + "outputs": [], + "source": [ + "sites_wi, _ = waterdata.get_monitoring_locations(\n", + " state_name=\"Wisconsin\",\n", + " site_type=\"Stream\",\n", + ")\n", + "print(f\"{len(sites_wi)} Wisconsin stream sites\")\n", + "sites_wi[[\"monitoring_location_id\", \"monitoring_location_name\", \"geometry\"]].head()" + ] + }, + { + "cell_type": "markdown", + "id": "c4096bf3", + "metadata": {}, + "source": [ + "Because the result is a `GeoDataFrame`, plotting the locations is a one-liner.\n", + "For an interactive map, `folium` works well with the same data." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ce4c88a7", + "metadata": {}, + "outputs": [], + "source": [ + "ax = sites_wi.plot(markersize=4, figsize=(7, 5))\n", + "ax.set_title(\"USGS stream monitoring locations in Wisconsin\")\n", + "ax.set_xlabel(\"Longitude\")\n", + "ax.set_ylabel(\"Latitude\")\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "1e162fcf", + "metadata": {}, + "source": [ + "### Time series & combined metadata\n", + "\n", + "`get_combined_metadata` merges time-series metadata\n", + "(`get_time_series_metadata`) and field-measurement metadata by site, telling you\n", + "which time series a site offers and the span of each." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a593b5e8", + "metadata": {}, + "outputs": [], + "source": [ + "ts_available, _ = waterdata.get_combined_metadata(\n", + " monitoring_location_id=\"USGS-01491000\",\n", + " parameter_code=[\"00060\", \"00010\"],\n", + ")\n", + "cols = [\"parameter_name\", \"statistic_id\", \"begin\", \"end\", \"last_modified\"]\n", + "ts_available[cols]" + ] + }, + { + "cell_type": "markdown", + "id": "c294fee3", + "metadata": {}, + "source": [ + "### Parameter codes\n", + "\n", + "Parameter-code descriptions come from the `parameter-codes` reference table:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cc1601c0", + "metadata": {}, + "outputs": [], + "source": [ + "pcode_info, _ = waterdata.get_reference_table(\n", + " collection=\"parameter-codes\",\n", + " query={\"id\": \"00660\"},\n", + ")\n", + "pcode_info.T" + ] + }, + { + "cell_type": "markdown", + "id": "330064f2", + "metadata": {}, + "source": [ + "### Daily values\n", + "\n", + "`get_daily` returns daily values. Browse it at\n", + "." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d1fef3df", + "metadata": {}, + "outputs": [], + "source": [ + "daily_modern, _ = waterdata.get_daily(\n", + " monitoring_location_id=\"USGS-01491000\",\n", + " parameter_code=[\"00060\", \"00010\"],\n", + " statistic_id=\"00003\",\n", + " time=[\"2023-10-01\", \"2024-09-30\"],\n", + ")\n", + "daily_modern[[\"time\", \"parameter_code\", \"value\", \"approval_status\"]].head()" + ] + }, + { + "cell_type": "markdown", + "id": "4817c8c9", + "metadata": {}, + "source": [ + "Notice the data come back in **long** format — one observation per row. Long\n", + "data are usually easier to work with; here we facet by `parameter_code`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f0578529", + "metadata": {}, + "outputs": [], + "source": [ + "params = sorted(daily_modern[\"parameter_code\"].unique())\n", + "fig, axes = plt.subplots(len(params), 1, figsize=(7, 5), sharex=True, squeeze=False)\n", + "axes = axes[:, 0] # squeeze=False -> always a 2-D array, even for one param\n", + "for ax, pcode in zip(axes, params):\n", + " sub = daily_modern[daily_modern[\"parameter_code\"] == pcode]\n", + " ax.scatter(sub[\"time\"], sub[\"value\"], s=4)\n", + " ax.set_ylabel(pcode)\n", + "axes[0].set_title(\"Daily values at USGS-01491000 (water year 2024)\")\n", + "axes[-1].set_xlabel(\"time\")\n", + "fig.autofmt_xdate()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "8af565c7", + "metadata": {}, + "source": [ + "### Continuous\n", + "\n", + "`get_continuous` returns instantaneous (sensor) values. Browse it at\n", + ".\n", + "\n", + "This service currently allows at most **3 years** of data per request; with no\n", + "`time` argument it returns the latest year. Continuous data have no geometry\n", + "column and do not support bounding-box queries. For large pulls, see the\n", + "*Continuous Data* notebook." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2dbcdd47", + "metadata": {}, + "outputs": [], + "source": [ + "sensor_data, _ = waterdata.get_continuous(\n", + " monitoring_location_id=\"USGS-01491000\",\n", + " parameter_code=\"00060\",\n", + " time=\"2024-09-01/2024-09-03\",\n", + ")\n", + "sensor_data[[\"time\", \"parameter_code\", \"value\", \"approval_status\"]].head()" + ] + }, + { + "cell_type": "markdown", + "id": "6b4fa772", + "metadata": {}, + "source": [ + "### Field measurements\n", + "\n", + "`get_field_measurements` returns discrete field measurements, including\n", + "groundwater levels." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "12d4649a", + "metadata": {}, + "outputs": [], + "source": [ + "field_modern, _ = waterdata.get_field_measurements(\n", + " monitoring_location_id=[\n", + " \"USGS-451605097071701\",\n", + " \"USGS-263819081585801\",\n", + " ],\n", + " time=[\"2023-10-01\", \"2024-09-30\"],\n", + ")\n", + "field_modern[[\"time\", \"monitoring_location_id\", \"parameter_code\", \"value\"]].head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6a6f9ba4", + "metadata": {}, + "outputs": [], + "source": [ + "fig, ax = plt.subplots(figsize=(7, 4))\n", + "for site, sub in field_modern.groupby(\"monitoring_location_id\"):\n", + " ax.scatter(sub[\"time\"], sub[\"value\"], s=12, label=site)\n", + "ax.set_ylabel(\"value\")\n", + "ax.set_title(\"Field measurements\")\n", + "ax.legend(fontsize=7)\n", + "fig.autofmt_xdate()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "5a709774", + "metadata": {}, + "source": [ + "### Channel measurements\n", + "\n", + "`get_channel` returns channel-geometry measurements that accompany\n", + "`get_field_measurements`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fb3105ff", + "metadata": {}, + "outputs": [], + "source": [ + "channel, _ = waterdata.get_channel(monitoring_location_id=\"USGS-02238500\")\n", + "channel[[\"time\", \"channel_width\", \"channel_area\", \"channel_velocity\"]].head()" + ] + }, + { + "cell_type": "markdown", + "id": "0495e7ac", + "metadata": {}, + "source": [ + "### Latest continuous & latest daily\n", + "\n", + "`get_latest_continuous` and `get_latest_daily` have no NWIS equivalent: they\n", + "return the single most recent observation for each time series." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d82d74ba", + "metadata": {}, + "outputs": [], + "source": [ + "latest_uv, _ = waterdata.get_latest_continuous(\n", + " monitoring_location_id=\"USGS-01491000\",\n", + " parameter_code=\"00060\",\n", + ")\n", + "cols = [\"time\", \"value\", \"approval_status\", \"parameter_code\", \"unit_of_measure\"]\n", + "latest_uv[cols].T" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a624271d", + "metadata": {}, + "outputs": [], + "source": [ + "latest_dv, _ = waterdata.get_latest_daily(\n", + " monitoring_location_id=\"USGS-01491000\",\n", + " parameter_code=\"00060\",\n", + ")\n", + "latest_dv[cols].T" + ] + }, + { + "cell_type": "markdown", + "id": "65390398", + "metadata": {}, + "source": [ + "### General retrieval and CQL2\n", + "\n", + "The OGC `get_*` functions accept a CQL2 expression through the `filter` /\n", + "`filter_lang` arguments, so even complex queries run against these same\n", + "functions; there is no separate \"general retrieval\" call.\n", + "\n", + "CQL2 supports pattern matching via `LIKE`, where `%` matches any run of characters (so a trailing `%` is a prefix match). This\n", + "is handy for hydrologic unit codes, which may be stored as `02070010` or as a\n", + "longer code beginning with those digits. 
To get every site whose HUC starts with\n", + "`02070010`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "455de9d3", + "metadata": {}, + "outputs": [], + "source": [ + "what_huc_sites, _ = waterdata.get_monitoring_locations(\n", + " filter=\"hydrologic_unit_code LIKE '02070010%'\",\n", + " filter_lang=\"cql-text\",\n", + ")\n", + "print(f\"{len(what_huc_sites)} sites in HUC 02070010\")\n", + "ax = what_huc_sites.plot(markersize=2, figsize=(7, 5))\n", + "ax.set_title(\"Sites within HUC 02070010\")\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "3fb5d920", + "metadata": {}, + "source": [ + "> **Numeric filters.** Every queryable on the Water Data API is typed as a\n", + "> *string*, so an unquoted numeric comparison like `drainage_area > 1000` is\n", + "> rejected by the server (and quoting it gives a misleading lexicographic\n", + "> comparison). `dataretrieval` catches this and raises a `ValueError`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "82f8f1b5", + "metadata": {}, + "outputs": [], + "source": [ + "try:\n", + " waterdata.get_monitoring_locations(\n", + " filter=\"drainage_area > 1000\",\n", + " filter_lang=\"cql-text\",\n", + " )\n", + "except ValueError as e:\n", + " print(type(e).__name__, \"->\", str(e)[:120], \"...\")" + ] + }, + { + "cell_type": "markdown", + "id": "dd8ae008", + "metadata": {}, + "source": [ + "The recommended pattern is to filter on the string-valued attributes the server\n", + "understands (state, site type, HUC, …) and then do the **numeric** reduction in\n", + "pandas. 
For example, large-drainage stream sites in Wisconsin and Minnesota:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a13e984e", + "metadata": {}, + "outputs": [], + "source": [ + "sites, _ = waterdata.get_monitoring_locations(\n", + " state_name=[\"Wisconsin\", \"Minnesota\"],\n", + " site_type=\"Stream\",\n", + " properties=[\n", + " \"monitoring_location_id\",\n", + " \"monitoring_location_name\",\n", + " \"state_name\",\n", + " \"drainage_area\",\n", + " ],\n", + ")\n", + "big = sites[pd.to_numeric(sites[\"drainage_area\"], errors=\"coerce\") > 1000]\n", + "print(f\"{len(big)} of {len(sites)} WI/MN stream sites drain > 1000 sq mi\")\n", + "big.drop(columns=\"geometry\").head()" + ] + }, + { + "cell_type": "markdown", + "id": "10f0a74d", + "metadata": {}, + "source": [ + "### Reference tables\n", + "\n", + "`get_reference_table` exposes a variety of metadata tables. Any returned column\n", + "can be filtered on. See the\n", + "*USGS Reference Lists* notebook for the full list of collections.\n", + "\n", + "### Discrete samples\n", + "\n", + "Discrete USGS water-quality data are served from a separate (non-OGC) endpoint\n", + "via `get_samples`. See the *Discrete water-quality samples* notebook.\n", + "\n", + "### Daily data statistics\n", + "\n", + "Pre-computed temporal summary statistics are available through `get_stats_por`\n", + "(day-of-year / month-of-year) and `get_stats_date_range` (calendar month, calendar\n", + "year, water year). See the *Daily statistics* notebook." + ] + }, + { + "cell_type": "markdown", + "id": "21d144d3", + "metadata": {}, + "source": [ + "## More notes\n", + "\n", + "### `limit` and paging\n", + "\n", + "The `limit` argument sets how many rows come back **per page**, not the overall\n", + "total — by default `dataretrieval` pages through everything. 
You rarely need to\n", + "touch it; lowering it can help on a spotty connection.\n", + "\n", + "### The `id` column\n", + "\n", + "Each endpoint natively returns an `id` column, and that value is used as an input\n", + "to *other* endpoints under a different name (the monitoring-locations `id` is the\n", + "`monitoring_location_id` everywhere else). `dataretrieval` renames `id`\n", + "accordingly, but you can request the raw `id` column via `properties`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fa2f8528", + "metadata": {}, + "outputs": [], + "source": [ + "site = \"USGS-02238500\"\n", + "site_1, _ = waterdata.get_monitoring_locations(\n", + " monitoring_location_id=site,\n", + " properties=[\"monitoring_location_id\", \"state_name\", \"country_name\"],\n", + ")\n", + "site_2, _ = waterdata.get_monitoring_locations(\n", + " monitoring_location_id=site,\n", + " properties=[\"id\", \"state_name\", \"country_name\"],\n", + ")\n", + "print(\"renamed:\", [c for c in site_1.columns if c != \"geometry\"])\n", + "print(\"raw id :\", [c for c in site_2.columns if c != \"geometry\"])" + ] + }, + { + "cell_type": "markdown", + "id": "7dcc03a9", + "metadata": {}, + "source": [ + "## More help\n", + "\n", + "- Documentation: \n", + "- R package docs (source of these examples): \n", + "- Issues / questions: " + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/demos/USGS_WaterData_ReferenceLists_Examples.ipynb b/demos/USGS_WaterData_ReferenceLists_Examples.ipynb new file mode 100644 index 00000000..9799ba16 --- /dev/null +++ b/demos/USGS_WaterData_ReferenceLists_Examples.ipynb @@ -0,0 +1,138 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "aef324b2", + "metadata": {}, + "source": [ + "# USGS Reference Lists\n", + "\n", + "`get_reference_table` returns the metadata 
\"reference\" tables for the USGS Water\n", + "Data API. These tables enumerate the allowable values for the filter arguments\n", + "used elsewhere in the `waterdata` module — for example, the `site-types` table\n", + "lists every valid `site_type_code`, and `parameter-codes` lists every valid\n", + "`parameter_code`.\n", + "\n", + "`get_reference_table` returns a `(data, metadata)` tuple." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6f365047", + "metadata": {}, + "outputs": [], + "source": [ + "from typing import get_args\n", + "\n", + "from IPython.display import Markdown, display\n", + "\n", + "from dataretrieval import waterdata\n", + "from dataretrieval.waterdata.types import METADATA_COLLECTIONS\n", + "\n", + "collections = list(get_args(METADATA_COLLECTIONS))\n", + "collections" + ] + }, + { + "cell_type": "markdown", + "id": "af9731ff", + "metadata": {}, + "source": [ + "## A single reference table\n", + "\n", + "Fetch one table by name. The first column is the table's primary code (the\n", + "collection name, singularized, with hyphens turned into underscores — e.g.\n", + "`site-types` -> `site_type`):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9840b289", + "metadata": {}, + "outputs": [], + "source": [ + "site_types, _ = waterdata.get_reference_table(collection=\"site-types\")\n", + "print(f\"{len(site_types)} rows\")\n", + "site_types.head()" + ] + }, + { + "cell_type": "markdown", + "id": "ec3a00c8", + "metadata": {}, + "source": [ + "You can also pass a `query` to retrieve a subset — for instance specific\n", + "parameter codes by `id`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "09b5de2d", + "metadata": {}, + "outputs": [], + "source": [ + "some_pcodes, _ = waterdata.get_reference_table(\n", + " collection=\"parameter-codes\",\n", + " query={\"id\": \"00060,00065,00010\"},\n", + ")\n", + "some_pcodes[[\"parameter_code\", \"parameter_name\", \"unit_of_measure\"]]" + 
] + }, + { + "cell_type": "markdown", + "id": "b02202cf", + "metadata": {}, + "source": [ + "## All reference tables\n", + "\n", + "The full set of collections is enumerated in `METADATA_COLLECTIONS`. Below we\n", + "preview the first few rows of each. (Most are small lookup tables; a couple —\n", + "notably `parameter-codes` and `hydrologic-unit-codes` — are large, so we only\n", + "display the head.)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "514392c0", + "metadata": {}, + "outputs": [], + "source": [ + "for collection in collections:\n", + " df, _ = waterdata.get_reference_table(collection=collection)\n", + " preview = df.drop(columns=\"geometry\") if \"geometry\" in df.columns else df\n", + " display(Markdown(f\"### `{collection}` \\n{len(df):,} rows\"))\n", + " display(preview.head(3))" + ] + }, + { + "cell_type": "markdown", + "id": "9d820806", + "metadata": {}, + "source": [ + "## More help\n", + "\n", + "- Documentation: \n", + "- See the *Introduction to the USGS Water Data APIs* notebook for how these reference\n", + " values feed the `get_*` filter arguments.\n", + "- Equivalent R article: [USGS Reference Lists](https://doi-usgs.github.io/dataRetrieval/articles/Reference_Lists.html)\n", + "- Issues / questions: " + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/docs/source/examples/USGS_WaterData_ContinuousData_Examples.nblink b/docs/source/examples/USGS_WaterData_ContinuousData_Examples.nblink new file mode 100644 index 00000000..b169abdf --- /dev/null +++ b/docs/source/examples/USGS_WaterData_ContinuousData_Examples.nblink @@ -0,0 +1,3 @@ +{ + "path": "../../../demos/USGS_WaterData_ContinuousData_Examples.ipynb" +} diff --git a/docs/source/examples/USGS_WaterData_DailyStatistics_Examples.nblink 
b/docs/source/examples/USGS_WaterData_DailyStatistics_Examples.nblink new file mode 100644 index 00000000..c12f7840 --- /dev/null +++ b/docs/source/examples/USGS_WaterData_DailyStatistics_Examples.nblink @@ -0,0 +1,3 @@ +{ + "path": "../../../demos/USGS_WaterData_DailyStatistics_Examples.ipynb" +} diff --git a/docs/source/examples/USGS_WaterData_DiscreteSamples_Examples.nblink b/docs/source/examples/USGS_WaterData_DiscreteSamples_Examples.nblink new file mode 100644 index 00000000..4729fe36 --- /dev/null +++ b/docs/source/examples/USGS_WaterData_DiscreteSamples_Examples.nblink @@ -0,0 +1,3 @@ +{ + "path": "../../../demos/USGS_WaterData_DiscreteSamples_Examples.ipynb" +} diff --git a/docs/source/examples/USGS_WaterData_Introduction_Examples.nblink b/docs/source/examples/USGS_WaterData_Introduction_Examples.nblink new file mode 100644 index 00000000..9a442fe4 --- /dev/null +++ b/docs/source/examples/USGS_WaterData_Introduction_Examples.nblink @@ -0,0 +1,3 @@ +{ + "path": "../../../demos/USGS_WaterData_Introduction_Examples.ipynb" +} diff --git a/docs/source/examples/USGS_WaterData_ReferenceLists_Examples.nblink b/docs/source/examples/USGS_WaterData_ReferenceLists_Examples.nblink new file mode 100644 index 00000000..0600ecac --- /dev/null +++ b/docs/source/examples/USGS_WaterData_ReferenceLists_Examples.nblink @@ -0,0 +1,3 @@ +{ + "path": "../../../demos/USGS_WaterData_ReferenceLists_Examples.ipynb" +} diff --git a/docs/source/examples/index.rst b/docs/source/examples/index.rst index 6011fc4b..de6f1b25 100644 --- a/docs/source/examples/index.rst +++ b/docs/source/examples/index.rst @@ -15,6 +15,23 @@ covers a basic introduction to module functions and usage. WaterData_demo +USGS Water Data API vignettes +----------------------------- +These notebooks are Python ports of the new USGS Water Data API vignettes from +the R `dataRetrieval`_ package. Each introduces a family of ``waterdata`` +functions and is executed against the live USGS Water Data API. + +.. 
_dataRetrieval: https://doi-usgs.github.io/dataRetrieval/ + +.. toctree:: + :maxdepth: 1 + + USGS_WaterData_Introduction_Examples + USGS_WaterData_DiscreteSamples_Examples + USGS_WaterData_DailyStatistics_Examples + USGS_WaterData_ContinuousData_Examples + USGS_WaterData_ReferenceLists_Examples + Simple uses of the ``dataretrieval`` package -------------------------------------------- From 95c44d523f3eee16c5a7ce2a05f518b9bbfcf22b Mon Sep 17 00:00:00 2001 From: thodson-usgs Date: Thu, 28 May 2026 10:28:50 -0400 Subject: [PATCH 2/2] docs: clean up Water Data API vignettes MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Rename `daily_modern`/`field_modern` → `daily_data`/`field_data` (the "_modern" suffix was a leftover NWIS-vs-modern comparison artifact) - Rename `what_huc_sites` → `huc_sites`, `sites_information` → `sites_info`, `site_1`/`site_2` → `renamed`/`raw_id`, and `site1` → `site` in the daily statistics notebook (no `site2` ever existed) - Fix broken intra-notebook anchor `#General-retrieval-and-CQL2` → `#general-retrieval-and-cql2` - Simplify the daily-values facet plot by dropping the unnecessary `squeeze=False` + axes-indexing workaround - Clean up the `map_sites` helper in the samples notebook to use the conventional `fig, ax = plt.subplots(...)` unpacking and a docstring Co-Authored-By: Claude Opus 4.7 (1M context) --- ...S_WaterData_DailyStatistics_Examples.ipynb | 14 ++--- ...S_WaterData_DiscreteSamples_Examples.ipynb | 18 ++++-- ...USGS_WaterData_Introduction_Examples.ipynb | 56 +++++++++++-------- 3 files changed, 52 insertions(+), 36 deletions(-) diff --git a/demos/USGS_WaterData_DailyStatistics_Examples.ipynb b/demos/USGS_WaterData_DailyStatistics_Examples.ipynb index f35f52c9..ffe9647d 100644 --- a/demos/USGS_WaterData_DailyStatistics_Examples.ipynb +++ b/demos/USGS_WaterData_DailyStatistics_Examples.ipynb @@ -41,7 +41,7 @@ "\n", "%matplotlib inline\n", "\n", - "site1 = \"USGS-02037500\"" + "site = 
\"USGS-02037500\"" ] }, { @@ -66,7 +66,7 @@ "outputs": [], "source": [ "jan_por_mean, _ = waterdata.get_stats_por(\n", - " monitoring_location_id=site1,\n", + " monitoring_location_id=site,\n", " parameter_code=\"00060\",\n", " computation_type=\"arithmetic_mean\",\n", " start_date=\"01-01\",\n", @@ -126,7 +126,7 @@ "outputs": [], "source": [ "full_por_percentiles, _ = waterdata.get_stats_por(\n", - " monitoring_location_id=site1,\n", + " monitoring_location_id=site,\n", " parameter_code=\"00060\",\n", " computation_type=[\"minimum\", \"maximum\", \"percentile\"],\n", " start_date=\"01-01\",\n", @@ -210,7 +210,7 @@ "outputs": [], "source": [ "daily, _ = waterdata.get_daily(\n", - " monitoring_location_id=site1,\n", + " monitoring_location_id=site,\n", " parameter_code=\"00060\",\n", " statistic_id=\"00003\",\n", " time=[\"2024-01-01\", \"2025-12-31\"],\n", @@ -259,7 +259,7 @@ "outputs": [], "source": [ "jan_daterange_mean, _ = waterdata.get_stats_date_range(\n", - " monitoring_location_id=site1,\n", + " monitoring_location_id=site,\n", " parameter_code=\"00060\",\n", " computation_type=\"arithmetic_mean\",\n", " start_date=\"2024-01-01\",\n", @@ -287,7 +287,7 @@ "outputs": [], "source": [ "multiyear, _ = waterdata.get_stats_date_range(\n", - " monitoring_location_id=site1,\n", + " monitoring_location_id=site,\n", " parameter_code=\"00060\",\n", " computation_type=\"arithmetic_mean\",\n", " start_date=\"2023-09-30\",\n", @@ -338,7 +338,7 @@ "outputs": [], "source": [ "monthly_raw, _ = waterdata.get_stats_date_range(\n", - " monitoring_location_id=site1,\n", + " monitoring_location_id=site,\n", " parameter_code=\"00060\",\n", " computation_type=\"arithmetic_mean\",\n", ")\n", diff --git a/demos/USGS_WaterData_DiscreteSamples_Examples.ipynb b/demos/USGS_WaterData_DiscreteSamples_Examples.ipynb index a58e6f56..ea8deac2 100644 --- a/demos/USGS_WaterData_DiscreteSamples_Examples.ipynb +++ b/demos/USGS_WaterData_DiscreteSamples_Examples.ipynb @@ -43,12 +43,11 @@ 
"plt.rcParams[\"figure.figsize\"] = (7, 4)\n", "\n", "\n", - "# Scatter plot of sample-site locations (a static map; use folium for an\n", - "# interactive version).\n", "def map_sites(df, title=\"\"):\n", + " \"\"\"Static scatter plot of sample-site locations. Use folium for interactive.\"\"\"\n", " lon = pd.to_numeric(df[\"Location_Longitude\"], errors=\"coerce\")\n", " lat = pd.to_numeric(df[\"Location_Latitude\"], errors=\"coerce\")\n", - " ax = plt.subplots(figsize=(7, 5))[1]\n", + " fig, ax = plt.subplots(figsize=(7, 5))\n", " ax.scatter(lon, lat, s=10, color=\"red\", alpha=0.7)\n", " ax.set_xlabel(\"Longitude\")\n", " ax.set_ylabel(\"Latitude\")\n", @@ -528,12 +527,21 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3", + "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { - "name": "python" + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.10" } }, "nbformat": 4, diff --git a/demos/USGS_WaterData_Introduction_Examples.ipynb b/demos/USGS_WaterData_Introduction_Examples.ipynb index 3ca30420..4c8b9935 100644 --- a/demos/USGS_WaterData_Introduction_Examples.ipynb +++ b/demos/USGS_WaterData_Introduction_Examples.ipynb @@ -97,7 +97,7 @@ "outputs": [], "source": [ "# Ask for just a few columns instead of the full ~40-column record.\n", - "site_info, _ = waterdata.get_monitoring_locations(\n", + "sites_info, _ = waterdata.get_monitoring_locations(\n", " monitoring_location_id=\"USGS-01491000\",\n", " properties=[\n", " \"monitoring_location_id\",\n", @@ -106,7 +106,7 @@ " \"monitoring_location_name\",\n", " ],\n", ")\n", - "site_info.drop(columns=\"geometry\")" + "sites_info.drop(columns=\"geometry\")" ] }, { @@ -126,7 +126,7 @@ "\n", "The APIs accept [CQL2](https://www.loc.gov/standards/sru/cql/) expressions for\n", "complex 
queries through the `filter` / `filter_lang` arguments. See the\n", - "[General retrieval and CQL2](#General-retrieval-and-CQL2) section below.\n", + "[General retrieval and CQL2](#general-retrieval-and-cql2) section below.\n", "\n", "### Simple features\n", "\n", @@ -176,11 +176,11 @@ "metadata": {}, "outputs": [], "source": [ - "sites_information, _ = waterdata.get_monitoring_locations(\n", + "sites_info, _ = waterdata.get_monitoring_locations(\n", " monitoring_location_id=\"USGS-01491000\"\n", ")\n", - "print(f\"{sites_information.shape[1]} columns returned\")\n", - "sites_information.drop(columns=\"geometry\").T" + "print(f\"{sites_info.shape[1]} columns returned\")\n", + "sites_info.drop(columns=\"geometry\").T" ] }, { @@ -299,13 +299,13 @@ "metadata": {}, "outputs": [], "source": [ - "daily_modern, _ = waterdata.get_daily(\n", + "daily_data, _ = waterdata.get_daily(\n", " monitoring_location_id=\"USGS-01491000\",\n", " parameter_code=[\"00060\", \"00010\"],\n", " statistic_id=\"00003\",\n", " time=[\"2023-10-01\", \"2024-09-30\"],\n", ")\n", - "daily_modern[[\"time\", \"parameter_code\", \"value\", \"approval_status\"]].head()" + "daily_data[[\"time\", \"parameter_code\", \"value\", \"approval_status\"]].head()" ] }, { @@ -324,11 +324,10 @@ "metadata": {}, "outputs": [], "source": [ - "params = sorted(daily_modern[\"parameter_code\"].unique())\n", - "fig, axes = plt.subplots(len(params), 1, figsize=(7, 5), sharex=True, squeeze=False)\n", - "axes = axes[:, 0] # squeeze=False -> always a 2-D array, even for one param\n", + "params = sorted(daily_data[\"parameter_code\"].unique())\n", + "fig, axes = plt.subplots(len(params), 1, figsize=(7, 5), sharex=True)\n", "for ax, pcode in zip(axes, params):\n", - " sub = daily_modern[daily_modern[\"parameter_code\"] == pcode]\n", + " sub = daily_data[daily_data[\"parameter_code\"] == pcode]\n", " ax.scatter(sub[\"time\"], sub[\"value\"], s=4)\n", " ax.set_ylabel(pcode)\n", "axes[0].set_title(\"Daily values at USGS-01491000 
(water year 2024)\")\n", @@ -386,14 +385,14 @@ "metadata": {}, "outputs": [], "source": [ - "field_modern, _ = waterdata.get_field_measurements(\n", + "field_data, _ = waterdata.get_field_measurements(\n", " monitoring_location_id=[\n", " \"USGS-451605097071701\",\n", " \"USGS-263819081585801\",\n", " ],\n", " time=[\"2023-10-01\", \"2024-09-30\"],\n", ")\n", - "field_modern[[\"time\", \"monitoring_location_id\", \"parameter_code\", \"value\"]].head()" + "field_data[[\"time\", \"monitoring_location_id\", \"parameter_code\", \"value\"]].head()" ] }, { @@ -404,7 +403,7 @@ "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=(7, 4))\n", - "for site, sub in field_modern.groupby(\"monitoring_location_id\"):\n", + "for site, sub in field_data.groupby(\"monitoring_location_id\"):\n", " ax.scatter(sub[\"time\"], sub[\"value\"], s=12, label=site)\n", "ax.set_ylabel(\"value\")\n", "ax.set_title(\"Field measurements\")\n", @@ -499,12 +498,12 @@ "metadata": {}, "outputs": [], "source": [ - "what_huc_sites, _ = waterdata.get_monitoring_locations(\n", + "huc_sites, _ = waterdata.get_monitoring_locations(\n", " filter=\"hydrologic_unit_code LIKE '02070010%'\",\n", " filter_lang=\"cql-text\",\n", ")\n", - "print(f\"{len(what_huc_sites)} sites in HUC 02070010\")\n", - "ax = what_huc_sites.plot(markersize=2, figsize=(7, 5))\n", + "print(f\"{len(huc_sites)} sites in HUC 02070010\")\n", + "ax = huc_sites.plot(markersize=2, figsize=(7, 5))\n", "ax.set_title(\"Sites within HUC 02070010\")\n", "plt.show()" ] @@ -620,16 +619,16 @@ "outputs": [], "source": [ "site = \"USGS-02238500\"\n", - "site_1, _ = waterdata.get_monitoring_locations(\n", + "renamed, _ = waterdata.get_monitoring_locations(\n", " monitoring_location_id=site,\n", " properties=[\"monitoring_location_id\", \"state_name\", \"country_name\"],\n", ")\n", - "site_2, _ = waterdata.get_monitoring_locations(\n", + "raw_id, _ = waterdata.get_monitoring_locations(\n", " monitoring_location_id=site,\n", " properties=[\"id\", 
\"state_name\", \"country_name\"],\n", ")\n", - "print(\"renamed:\", [c for c in site_1.columns if c != \"geometry\"])\n", - "print(\"raw id :\", [c for c in site_2.columns if c != \"geometry\"])" + "print(\"renamed:\", [c for c in renamed.columns if c != \"geometry\"])\n", + "print(\"raw id :\", [c for c in raw_id.columns if c != \"geometry\"])" ] }, { @@ -647,12 +646,21 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3", + "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { - "name": "python" + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.10" } }, "nbformat": 4,