DOI-USGS
diff --git a/demos/USGS_WaterData_ContinuousData_Examples.ipynb b/demos/USGS_WaterData_ContinuousData_Examples.ipynb
@@ -0,0 +1,279 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "543bc3c5",
+   "metadata": {},
+   "source": [
+    "# Continuous Data\n",
+    "\n",
+    "This notebook is the Python (`dataretrieval`) equivalent of the R `dataRetrieval`\n",
+    "vignette [*Continuous Data*](https://doi-usgs.github.io/dataRetrieval/articles/continuous_pr.html).\n",
+    "\n",
+    "Continuous data are collected by automated sensors, typically at a fixed\n",
+    "15-minute interval (you may also hear them called \"instantaneous values\" or\n",
+    "\"IV\"). They are described by parameter name and parameter code.\n",
+    "\n",
+    "The service behind `get_continuous` currently allows **at most 3 years of data\n",
+    "per request**. `dataretrieval` chunks long multi-site list arguments for you, but\n",
+    "it does **not** automatically chunk a long *time* window. To assemble a long\n",
+    "period of record, split the date range into chunks and combine the results.\n",
+    "This notebook shows how.\n",
+    "\n",
+    "> The executed examples below deliberately use a short window so the notebook\n",
+    "> runs quickly; the same pattern scales to the full period of record by widening\n",
+    "> the date range."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9e2f4be5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from concurrent.futures import ThreadPoolExecutor\n",
+    "\n",
+    "import pandas as pd\n",
+    "\n",
+    "from dataretrieval import waterdata\n",
+    "\n",
+    "site = \"USGS-0208458892\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "19c2ac87",
+   "metadata": {},
+   "source": [
+    "## What continuous data are available?\n",
+    "\n",
+    "First, see which continuous time series exist at the site by filtering the\n",
+    "combined metadata to `data_type=\"Continuous values\"`:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d51b7ebd",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "continuous_available, _ = waterdata.get_combined_metadata(\n",
+    "    monitoring_location_id=site,\n",
+    "    data_type=\"Continuous values\",\n",
+    ")\n",
+    "avail = continuous_available[[\"parameter_code\", \"parameter_name\", \"begin\", \"end\"]]\n",
+    "avail.sort_values(\"parameter_code\").reset_index(drop=True)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "09d6f697",
+   "metadata": {},
+   "source": [
+    "Say we're interested in \"Specific cond at 25C\" (`00095`). Its record spans well\n",
+    "over a decade, so a full-period-of-record pull must be chunked.\n",
+    "\n",
+    "## Building the date chunks\n",
+    "\n",
+    "The services are most efficient when queried one **calendar year** at a time, so\n",
+    "we generate a list of `(start, end)` windows, each ending the day before the next\n",
+    "begins:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "da7db223",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def year_chunks(start, end):\n",
+    "    \"\"\"Split [start, end] into per-calendar-year (start, end) date strings.\"\"\"\n",
+    "    start, end = pd.Timestamp(start), pd.Timestamp(end)\n",
+    "    edges = pd.to_datetime([f\"{y}-01-01\" for y in range(start.year + 1, end.year + 1)])\n",
+    "    starts = [start, *edges]\n",
+    "    ends = [*(edges - pd.Timedelta(days=1)), end]\n",
+    "    return [\n",
+    "        (s.strftime(\"%Y-%m-%d\"), e.strftime(\"%Y-%m-%d\")) for s, e in zip(starts, ends)\n",
+    "    ]\n",
+    "\n",
+    "\n",
+    "# The chunks needed to cover the full period of record (no data downloaded here):\n",
+    "por_chunks = year_chunks(\"2012-10-01\", \"2025-09-30\")\n",
+    "pd.DataFrame(por_chunks, columns=[\"start\", \"end\"])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cfae6e1b",
+   "metadata": {},
+   "source": [
+    "That is 14 requests: partial years 2012 and 2025 plus the 12 full years between.\n",
+    "The executed examples use a short window that still crosses a year boundary (two chunks):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "864313d1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "chunks = year_chunks(\"2023-10-01\", \"2024-03-31\")\n",
+    "chunks"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5d04013a",
+   "metadata": {},
+   "source": [
+    "## Sequential pull (a `for` loop)\n",
+    "\n",
+    "The Python equivalent of the R `for` / `apply` / `purrr` examples: loop over the\n",
+    "chunks, collect each frame, and concatenate."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c1ffc3ad",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "frames = []\n",
+    "for start, end in chunks:\n",
+    "    sub, _ = waterdata.get_continuous(\n",
+    "        monitoring_location_id=site,\n",
+    "        parameter_code=\"00095\",\n",
+    "        time=f\"{start}/{end}\",\n",
+    "    )\n",
+    "    frames.append(sub)\n",
+    "\n",
+    "all_data = pd.concat(frames, ignore_index=True)\n",
+    "print(\n",
+    "    f\"{len(all_data):,} rows from {len(chunks)} chunks, \"\n",
+    "    f\"{all_data['time'].min()} -> {all_data['time'].max()}\"\n",
+    ")\n",
+    "all_data[[\"time\", \"parameter_code\", \"value\", \"approval_status\"]].head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "bf200ce5",
+   "metadata": {},
+   "source": [
+    "## Resilient pulls\n",
+    "\n",
+    "The loop above is fine if every request succeeds. In a long real-world pull,\n",
+    "network hiccups, service outages, or rate limits can interrupt it. R users reach for a\n",
+    "[`targets`](https://books.ropensci.org/targets/) pipeline to make the work\n",
+    "restartable; in Python you can get most of the benefit by recording which chunks\n",
+    "failed so you can retry only those."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9e2b4358",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def fetch_continuous(chunks, **kwargs):\n",
+    "    \"\"\"Fetch each chunk; return (combined_frame, list_of_failed_chunks).\"\"\"\n",
+    "    frames, failed = [], []\n",
+    "    for start, end in chunks:\n",
+    "        try:\n",
+    "            sub, _ = waterdata.get_continuous(time=f\"{start}/{end}\", **kwargs)\n",
+    "            frames.append(sub)\n",
+    "        except Exception as exc:  # network / service / rate-limit errors\n",
+    "            failed.append((start, end))\n",
+    "            print(f\"  failed {start}..{end}: {exc}\")\n",
+    "    combined = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()\n",
+    "    return combined, failed\n",
+    "\n",
+    "\n",
+    "all_data, failed = fetch_continuous(\n",
+    "    chunks, monitoring_location_id=site, parameter_code=\"00095\"\n",
+    ")\n",
+    "print(f\"{len(all_data):,} rows; {len(failed)} failed chunk(s) to retry: {failed}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "905017f5",
+   "metadata": {},
+   "source": [
+    "Re-run `fetch_continuous(failed, ...)` and append each result to `all_data`, capping the number of retries, until `failed` is empty.\n",
+    "\n",
+    "## Pulling in parallel\n",
+    "\n",
+    "On a standard laptop you can speed things up by issuing requests concurrently\n",
+    "with a thread pool (the equivalent of R's `future.apply` / `furrr`). **Do not**\n",
+    "run many parallel requests from shared infrastructure (HPC, CI runners): they\n",
+    "may be throttled or killed, and you can exceed the API rate limit. Keep the\n",
+    "pool small."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "974e28c6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def fetch_one(window):\n",
+    "    start, end = window\n",
+    "    df, _ = waterdata.get_continuous(\n",
+    "        monitoring_location_id=site,\n",
+    "        parameter_code=\"00095\",\n",
+    "        time=f\"{start}/{end}\",\n",
+    "    )\n",
+    "    return df\n",
+    "\n",
+    "\n",
+    "with ThreadPoolExecutor(max_workers=2) as pool:\n",
+    "    frames = list(pool.map(fetch_one, chunks))  # keeps chunk order; a failure re-raises here\n",
+    "\n",
+    "all_data_parallel = pd.concat(frames, ignore_index=True)\n",
+    "print(f\"{len(all_data_parallel):,} rows pulled in parallel\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "29d71d71",
+   "metadata": {},
+   "source": [
+    "For production-scale pipelines (the role `targets` plays in R), Python users\n",
+    "typically reach for a workflow tool such as\n",
+    "[Prefect](https://www.prefect.io/), [Dask](https://www.dask.org/), or\n",
+    "[Snakemake](https://snakemake.github.io/), wrapping the same chunk-and-combine\n",
+    "logic with caching and automatic retries.\n",
+    "\n",
+    "> **Heads up:** USGS expects to offer a direct full-period-of-record download for\n",
+    "> continuous data before the NWIS services are decommissioned, at which point\n",
+    "> these chunking workflows may become unnecessary. Check the docs for updates.\n",
+    "\n",
+    "## More help\n",
+    "\n",
+    "- Documentation: <https://doi-usgs.github.io/dataretrieval-python/>\n",
+    "- See the *USGS Water Data API Introduction* notebook for `get_continuous` basics.\n",
+    "- Issues / questions: <https://github.com/DOI-USGS/dataretrieval-python/issues>"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}