DOI-USGS
diff --git a/demos/USGS_WaterData_ContinuousData_Examples.ipynb b/demos/USGS_WaterData_ContinuousData_Examples.ipynb
@@ -0,0 +1,257 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "d664492b",
+   "metadata": {},
+   "source": [
+    "# Continuous Data\n",
+    "\n",
+    "Continuous data are collected by automated sensors, typically at a fixed\n",
+    "15-minute interval (you may also hear them called \"instantaneous values\" or\n",
+    "\"IV\"). They are described by parameter name and parameter code, and retrieved\n",
+    "with `get_continuous`.\n",
+    "\n",
+    "This notebook covers three things that matter when a continuous pull gets\n",
+    "large: `dataretrieval` **chunks big requests for you**, it can **resume** a\n",
+    "pull that was interrupted partway through, and there is one limit you still\n",
+    "handle yourself: the service's 3-year-per-request time limit."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e7e06e81",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "\n",
+    "from dataretrieval import waterdata\n",
+    "\n",
+    "site = \"USGS-0208458892\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b0136bd1",
+   "metadata": {},
+   "source": [
+    "## What continuous data are available?\n",
+    "\n",
+    "Filter the combined metadata to `data_type=\"Continuous values\"` to see which\n",
+    "time series a site offers and how far back each goes:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6f8a9d87",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "continuous_available, _ = waterdata.get_combined_metadata(\n",
+    "    monitoring_location_id=site,\n",
+    "    data_type=\"Continuous values\",\n",
+    ")\n",
+    "avail = continuous_available[[\"parameter_code\", \"parameter_name\", \"begin\", \"end\"]]\n",
+    "avail.sort_values(\"parameter_code\").reset_index(drop=True)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fdaa8150",
+   "metadata": {},
+   "source": [
+    "## Large requests are chunked for you\n",
+    "\n",
+    "Any list-valued argument — a long list of monitoring locations, several parameter\n",
+    "codes, a complex CQL filter — can push a single request URL past the server's\n",
+    "~8 KB limit. `dataretrieval` handles this automatically: it splits the query into\n",
+    "URL-sized sub-requests, issues them, and recombines (and de-duplicates) the\n",
+    "results into one frame. **You never need to loop over sites yourself** — request\n",
+    "everything in one call.\n",
+    "\n",
+    "For example, asking for several parameter codes at once just returns one combined\n",
+    "long-format frame:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6bc05102",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "multi, _ = waterdata.get_continuous(\n",
+    "    monitoring_location_id=site,\n",
+    "    parameter_code=[\"00095\", \"00010\"],  # specific conductance + water temperature\n",
+    "    time=\"2024-07-01/2024-07-02\",\n",
+    ")\n",
+    "multi.groupby(\"parameter_code\")[\"value\"].agg([\"count\", \"min\", \"max\"])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "353ad4ec",
+   "metadata": {},
+   "source": [
+    "## Resilient pulls: resume after an interruption\n",
+    "\n",
+    "A large request becomes many sub-requests under the hood, so a long pull can be\n",
+    "interrupted partway through by a rate limit (HTTP 429) or a transient server\n",
+    "error (HTTP 5xx). Rather than discard the work already done, `dataretrieval`\n",
+    "raises a `ChunkInterrupted` that **preserves the completed sub-requests** and\n",
+    "lets you continue:\n",
+    "\n",
+    "- `QuotaExhausted` (429) and `ServiceInterrupted` (5xx) both subclass\n",
+    "  `ChunkInterrupted`.\n",
+    "- `exc.partial_frame` holds whatever completed before the failure.\n",
+    "- `exc.retry_after` is the server's suggested wait (when provided).\n",
+    "- `exc.call.resume()` re-issues **only the still-pending** sub-requests and\n",
+    "  returns the full `(data, metadata)`.\n",
+    "\n",
+    "The pattern below waits out the interruption and resumes until the pull\n",
+    "finishes. (Under normal conditions the request completes on the first try and\n",
+    "the `except` block never runs.)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e2e9ddff",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import time\n",
+    "\n",
+    "from dataretrieval.waterdata.chunking import ChunkInterrupted\n",
+    "\n",
+    "try:\n",
+    "    sensor_data, _ = waterdata.get_continuous(\n",
+    "        monitoring_location_id=site,\n",
+    "        parameter_code=\"00095\",\n",
+    "        time=\"2024-07-01/2024-07-08\",\n",
+    "    )\n",
+    "except ChunkInterrupted as exc:\n",
+    "    print(\n",
+    "        f\"interrupted after {exc.completed_chunks}/{exc.total_chunks} chunks; resuming\"\n",
+    "    )\n",
+    "    while True:\n",
+    "        time.sleep(exc.retry_after or 5 * 60)  # honor Retry-After, else back off\n",
+    "        try:\n",
+    "            sensor_data, _ = exc.call.resume()\n",
+    "            break\n",
+    "        except ChunkInterrupted as again:\n",
+    "            exc = again\n",
+    "\n",
+    "print(f\"{len(sensor_data):,} rows\")\n",
+    "sensor_data[[\"time\", \"parameter_code\", \"value\", \"approval_status\"]].head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "397e87b5",
+   "metadata": {},
+   "source": [
+    "## The 3-year window: the one axis you split yourself\n",
+    "\n",
+    "There is one limit the library does **not** chunk for you: the continuous service\n",
+    "returns at most **3 years of data per request**, and a time window is not a\n",
+    "list-valued argument the library can split into sub-requests. (With no `time`\n",
+    "argument the service returns the latest year; continuous data also has no\n",
+    "geometry column and ignores bounding-box queries.)\n",
+    "\n",
+    "So a multi-year, single-site pull is the one place you still split by time. The\n",
+    "service handles one calendar year per request most efficiently, so build a list\n",
+    "of yearly windows:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "bd26d199",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Split [start, end] into per-calendar-year (start, end) date strings.\n",
+    "def year_chunks(start, end):\n",
+    "    start, end = pd.Timestamp(start), pd.Timestamp(end)\n",
+    "    edges = pd.to_datetime([f\"{y}-01-01\" for y in range(start.year + 1, end.year + 1)])\n",
+    "    starts = [start, *edges]\n",
+    "    ends = [*(edges - pd.Timedelta(days=1)), end]\n",
+    "    return [\n",
+    "        (s.strftime(\"%Y-%m-%d\"), e.strftime(\"%Y-%m-%d\")) for s, e in zip(starts, ends)\n",
+    "    ]\n",
+    "\n",
+    "\n",
+    "# Covering a full multi-year record (no data downloaded here):\n",
+    "pd.DataFrame(year_chunks(\"2012-10-01\", \"2025-09-30\"), columns=[\"start\", \"end\"])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3bc4f40f",
+   "metadata": {},
+   "source": [
+    "Then request each window and concatenate. (We use a short two-window span here so\n",
+    "the notebook runs quickly; widen the dates for a full period of record.)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "01ebb4a0",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "chunks = year_chunks(\"2023-10-01\", \"2024-03-31\")\n",
+    "\n",
+    "frames = []\n",
+    "for start, end in chunks:\n",
+    "    part, _ = waterdata.get_continuous(\n",
+    "        monitoring_location_id=site,\n",
+    "        parameter_code=\"00095\",\n",
+    "        time=f\"{start}/{end}\",\n",
+    "    )\n",
+    "    frames.append(part)\n",
+    "\n",
+    "por = pd.concat(frames, ignore_index=True)\n",
+    "print(\n",
+    "    f\"{len(por):,} rows from {len(chunks)} windows, \"\n",
+    "    f\"{por['time'].min()} -> {por['time'].max()}\"\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e2487bf4",
+   "metadata": {},
+   "source": [
+    "Wrap each window's call in the resume pattern above for an unattended,\n",
+    "restart-safe pull. USGS also expects to offer a direct full-period-of-record\n",
+    "download before the legacy NWIS services are decommissioned, which may make\n",
+    "time-window splitting unnecessary — check the documentation for updates.\n",
+    "\n",
+    "## More help\n",
+    "\n",
+    "- Documentation: <https://doi-usgs.github.io/dataretrieval-python/>\n",
+    "- Chunking and resume internals: `dataretrieval.waterdata.chunking`\n",
+    "- Issues / questions: <https://github.com/DOI-USGS/dataretrieval-python/issues>\n",
+    "- Equivalent R article: [Continuous Data](https://doi-usgs.github.io/dataRetrieval/articles/continuous_pr.html)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}