Commit 45dca96

thodson-usgs and claude committed
docs: add Python ports of the new USGS Water Data API vignettes
Port five new R dataRetrieval Water Data API vignettes to the Python `waterdata` module as executable demo notebooks, wired into the Sphinx docs under a new "USGS Water Data API vignettes" section:

- USGS_WaterData_Introduction_Examples (read_waterdata_functions.Rmd)
- USGS_WaterData_DiscreteSamples_Examples (samples_data.Rmd)
- USGS_WaterData_DailyStatistics_Examples (daily_data_statistics.Rmd)
- USGS_WaterData_ContinuousData_Examples (continuous_pr.Rmd)
- USGS_WaterData_ReferenceLists_Examples (Reference_Lists.Rmd)

Each notebook was executed end-to-end against the live USGS Water Data API during development; outputs are cleared per the repo convention (the Sphinx docs build re-executes notebooks at build time).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 7f64c2d commit 45dca96

11 files changed

Lines changed: 2097 additions & 0 deletions
Lines changed: 279 additions & 0 deletions
@@ -0,0 +1,279 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"id": "543bc3c5",
6+
"metadata": {},
7+
"source": [
8+
"# Continuous Data\n",
9+
"\n",
10+
"This notebook is the Python (`dataretrieval`) equivalent of the R `dataRetrieval`\n",
11+
"vignette [*Continuous Data*](https://doi-usgs.github.io/dataRetrieval/articles/continuous_pr.html).\n",
12+
"\n",
13+
"Continuous data are collected by automated sensors, typically at a fixed\n",
14+
"15-minute interval (you may also hear them called \"instantaneous values\" or\n",
15+
"\"IV\"). They are described by parameter name and parameter code.\n",
16+
"\n",
17+
"The service behind `get_continuous` currently allows **at most 3 years of data\n",
18+
"per request**. `dataretrieval` chunks long multi-site list arguments itself, but it does\n",
19+
"**not** automatically chunk a long *time* window for you. So to assemble a long\n",
20+
"period of record you split the date range into chunks and combine the results.\n",
21+
"This notebook shows how.\n",
22+
"\n",
23+
"> The executed examples below deliberately use a short window so the notebook\n",
24+
"> runs quickly; the same pattern scales to the full period of record by widening\n",
25+
"> the date range."
26+
]
27+
},
28+
{
29+
"cell_type": "code",
30+
"execution_count": null,
31+
"id": "9e2f4be5",
32+
"metadata": {},
33+
"outputs": [],
34+
"source": [
35+
"from concurrent.futures import ThreadPoolExecutor\n",
36+
"\n",
37+
"import pandas as pd\n",
38+
"\n",
39+
"from dataretrieval import waterdata\n",
40+
"\n",
41+
"site = \"USGS-0208458892\""
42+
]
43+
},
44+
{
45+
"cell_type": "markdown",
46+
"id": "19c2ac87",
47+
"metadata": {},
48+
"source": [
49+
"## What continuous data are available?\n",
50+
"\n",
51+
"First, see which continuous time series exist at the site by filtering the\n",
52+
"combined metadata to `data_type=\"Continuous values\"`:"
53+
]
54+
},
55+
{
56+
"cell_type": "code",
57+
"execution_count": null,
58+
"id": "d51b7ebd",
59+
"metadata": {},
60+
"outputs": [],
61+
"source": [
62+
"continuous_available, _ = waterdata.get_combined_metadata(\n",
63+
" monitoring_location_id=site,\n",
64+
" data_type=\"Continuous values\",\n",
65+
")\n",
66+
"avail = continuous_available[[\"parameter_code\", \"parameter_name\", \"begin\", \"end\"]]\n",
67+
"avail.sort_values(\"parameter_code\").reset_index(drop=True)"
68+
]
69+
},
70+
{
71+
"cell_type": "markdown",
72+
"id": "09d6f697",
73+
"metadata": {},
74+
"source": [
75+
"Say we're interested in \"Specific cond at 25C\" (`00095`). Its record spans well\n",
76+
"over a decade, so a full-period-of-record pull must be chunked.\n",
77+
"\n",
78+
"## Building the date chunks\n",
79+
"\n",
80+
"The services are most efficient when queried one **calendar year** at a time, so\n",
81+
"we generate a list of `(start, end)` windows, each ending the day before the next\n",
82+
"begins:"
83+
]
84+
},
85+
{
86+
"cell_type": "code",
87+
"execution_count": null,
88+
"id": "da7db223",
89+
"metadata": {},
90+
"outputs": [],
91+
"source": [
92+
"# Split [start, end] into per-calendar-year (start, end) date strings.\n",
93+
"def year_chunks(start, end):\n",
94+
" start, end = pd.Timestamp(start), pd.Timestamp(end)\n",
95+
" edges = pd.to_datetime([f\"{y}-01-01\" for y in range(start.year + 1, end.year + 1)])\n",
96+
" starts = [start, *edges]\n",
97+
" ends = [*(edges - pd.Timedelta(days=1)), end]\n",
98+
" return [\n",
99+
" (s.strftime(\"%Y-%m-%d\"), e.strftime(\"%Y-%m-%d\")) for s, e in zip(starts, ends)\n",
100+
" ]\n",
101+
"\n",
102+
"\n",
103+
"# The chunks needed to cover the full period of record (no data downloaded here):\n",
104+
"por_chunks = year_chunks(\"2012-10-01\", \"2025-09-30\")\n",
105+
"pd.DataFrame(por_chunks, columns=[\"start\", \"end\"])"
106+
]
107+
},
108+
{
109+
"cell_type": "markdown",
110+
"id": "cfae6e1b",
111+
"metadata": {},
112+
"source": [
113+
"That is 14 requests for the full record. For the executed examples we use a short\n",
114+
"window that still crosses a year boundary (so we get two chunks):"
115+
]
116+
},
117+
{
118+
"cell_type": "code",
119+
"execution_count": null,
120+
"id": "864313d1",
121+
"metadata": {},
122+
"outputs": [],
123+
"source": [
124+
"chunks = year_chunks(\"2023-10-01\", \"2024-03-31\")\n",
125+
"chunks"
126+
]
127+
},
128+
{
129+
"cell_type": "markdown",
130+
"id": "5d04013a",
131+
"metadata": {},
132+
"source": [
133+
"## Sequential pull (a `for` loop)\n",
134+
"\n",
135+
"The Python equivalent of the R `for` / `apply` / `purrr` examples: loop over the\n",
136+
"chunks, collect each frame, and concatenate."
137+
]
138+
},
139+
{
140+
"cell_type": "code",
141+
"execution_count": null,
142+
"id": "c1ffc3ad",
143+
"metadata": {},
144+
"outputs": [],
145+
"source": [
146+
"frames = []\n",
147+
"for start, end in chunks:\n",
148+
" sub, _ = waterdata.get_continuous(\n",
149+
" monitoring_location_id=site,\n",
150+
" parameter_code=\"00095\",\n",
151+
" time=f\"{start}/{end}\",\n",
152+
" )\n",
153+
" frames.append(sub)\n",
154+
"\n",
155+
"all_data = pd.concat(frames, ignore_index=True)\n",
156+
"print(\n",
157+
" f\"{len(all_data):,} rows from {len(chunks)} chunks, \"\n",
158+
" f\"{all_data['time'].min()} -> {all_data['time'].max()}\"\n",
159+
")\n",
160+
"all_data[[\"time\", \"parameter_code\", \"value\", \"approval_status\"]].head()"
161+
]
162+
},
163+
{
164+
"cell_type": "markdown",
165+
"id": "bf200ce5",
166+
"metadata": {},
167+
"source": [
168+
"## Resilient pulls\n",
169+
"\n",
170+
"The loop above is fine if every request succeeds. In a long real-world pull,\n",
171+
"network hiccups, service outages, or rate limits can interrupt it. R reaches for a\n",
172+
"[`targets`](https://books.ropensci.org/targets/) pipeline to make the work\n",
173+
"restartable; in Python you can get most of the benefit by recording which chunks\n",
174+
"failed so you can retry only those."
175+
]
176+
},
177+
{
178+
"cell_type": "code",
179+
"execution_count": null,
180+
"id": "9e2b4358",
181+
"metadata": {},
182+
"outputs": [],
183+
"source": [
184+
"# Fetch each chunk, returning (combined_frame, list_of_failed_chunks).\n",
185+
"def fetch_continuous(chunks, **kwargs):\n",
186+
" frames, failed = [], []\n",
187+
" for start, end in chunks:\n",
188+
" try:\n",
189+
" sub, _ = waterdata.get_continuous(time=f\"{start}/{end}\", **kwargs)\n",
190+
" frames.append(sub)\n",
191+
" except Exception as exc: # network / service / rate-limit errors\n",
192+
" failed.append((start, end))\n",
193+
" print(f\" failed {start}..{end}: {exc}\")\n",
194+
" combined = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()\n",
195+
" return combined, failed\n",
196+
"\n",
197+
"\n",
198+
"all_data, failed = fetch_continuous(\n",
199+
" chunks, monitoring_location_id=site, parameter_code=\"00095\"\n",
200+
")\n",
201+
"print(f\"{len(all_data):,} rows; {len(failed)} failed chunk(s) to retry: {failed}\")"
202+
]
203+
},
204+
{
205+
"cell_type": "markdown",
206+
"id": "905017f5",
207+
"metadata": {},
208+
"source": [
209+
"You would then re-run `fetch_continuous(failed, ...)` until `failed` is empty.\n",
210+
"\n",
211+
"## Pulling in parallel\n",
212+
"\n",
213+
"On a standard laptop you can speed things up by issuing requests concurrently\n",
214+
"with a thread pool (the equivalent of R's `future.apply` / `furrr`). **Do not**\n",
215+
"run many parallel requests from shared infrastructure (HPC, CI runners) — they\n",
216+
"may be throttled or killed, and you can blow through the API rate limit. Keep the\n",
217+
"pool small."
218+
]
219+
},
220+
{
221+
"cell_type": "code",
222+
"execution_count": null,
223+
"id": "974e28c6",
224+
"metadata": {},
225+
"outputs": [],
226+
"source": [
227+
"def fetch_one(window):\n",
228+
" start, end = window\n",
229+
" df, _ = waterdata.get_continuous(\n",
230+
" monitoring_location_id=site,\n",
231+
" parameter_code=\"00095\",\n",
232+
" time=f\"{start}/{end}\",\n",
233+
" )\n",
234+
" return df\n",
235+
"\n",
236+
"\n",
237+
"with ThreadPoolExecutor(max_workers=2) as pool:\n",
238+
" frames = list(pool.map(fetch_one, chunks))\n",
239+
"\n",
240+
"all_data_parallel = pd.concat(frames, ignore_index=True)\n",
241+
"print(f\"{len(all_data_parallel):,} rows pulled in parallel\")"
242+
]
243+
},
244+
{
245+
"cell_type": "markdown",
246+
"id": "29d71d71",
247+
"metadata": {},
248+
"source": [
249+
"For production-scale pipelines (the role `targets` plays in R), Python users\n",
250+
"typically reach for a workflow tool such as\n",
251+
"[Prefect](https://www.prefect.io/), [Dask](https://www.dask.org/), or\n",
252+
"[Snakemake](https://snakemake.github.io/), wrapping the same chunk-and-combine\n",
253+
"logic with caching and automatic retries.\n",
254+
"\n",
255+
"> **Heads up:** USGS expects to offer a direct full-period-of-record download for\n",
256+
"> continuous data before the NWIS services are decommissioned, at which point\n",
257+
"> these chunking workflows may become unnecessary. Check the docs for updates.\n",
258+
"\n",
259+
"## More help\n",
260+
"\n",
261+
"- Documentation: <https://doi-usgs.github.io/dataretrieval-python/>\n",
262+
"- See the *USGS Water Data API Introduction* notebook for `get_continuous` basics.\n",
263+
"- Issues / questions: <https://github.com/DOI-USGS/dataretrieval-python/issues>"
264+
]
265+
}
266+
],
267+
"metadata": {
268+
"kernelspec": {
269+
"display_name": "Python 3",
270+
"language": "python",
271+
"name": "python3"
272+
},
273+
"language_info": {
274+
"name": "python"
275+
}
276+
},
277+
"nbformat": 4,
278+
"nbformat_minor": 5
279+
}
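
The chunk counts quoted in the notebook (14 windows for the full record, two for the short demo window) can be checked offline. The sketch below is a restatement of `year_chunks` using only the standard library's `datetime` instead of pandas, so it is not the committed code; the contiguity check is an added illustration. It makes no API requests.

```python
from datetime import date, timedelta


def year_chunks(start, end):
    """Split [start, end] (ISO date strings) into per-calendar-year windows."""
    start, end = date.fromisoformat(start), date.fromisoformat(end)
    # One edge for every January 1 after the starting year, up to the end year.
    edges = [date(y, 1, 1) for y in range(start.year + 1, end.year + 1)]
    starts = [start, *edges]
    # Each window ends the day before the next one begins; the last ends at `end`.
    ends = [*(edge - timedelta(days=1) for edge in edges), end]
    return [(s.isoformat(), e.isoformat()) for s, e in zip(starts, ends)]


por_chunks = year_chunks("2012-10-01", "2025-09-30")
demo_chunks = year_chunks("2023-10-01", "2024-03-31")

# Added check (not in the commit): consecutive windows leave no gap and no overlap.
contiguous = all(
    date.fromisoformat(nxt[0]) - date.fromisoformat(prev[1]) == timedelta(days=1)
    for prev, nxt in zip(por_chunks, por_chunks[1:])
)

print(len(por_chunks), demo_chunks, contiguous)
```

Because the windows are expressed as whole dates, it may be worth confirming how the service treats a date-only end bound (inclusive of the full day or not) before relying on this for a complete record.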

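
The notebook says to re-run `fetch_continuous(failed, ...)` until `failed` is empty. One way to automate that is a bounded retry loop with a growing delay, sketched below. The names `retry_failed`, `max_rounds`, `base_delay`, and `flaky_fetch` are hypothetical and not part of `dataretrieval` or the commit; `flaky_fetch` is a stand-in that simulates failures so the example runs without network access.

```python
import time


def retry_failed(fetch, chunks, max_rounds=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(pending) -> (result, failed) and retry only the failed chunks.

    Returns (list_of_per_round_results, chunks_still_failing).
    """
    rounds, pending = [], list(chunks)
    for attempt in range(max_rounds):
        if not pending:
            break
        if attempt:
            # Wait longer before each retry round: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
        result, pending = fetch(pending)
        rounds.append(result)
    return rounds, pending


# Hypothetical stand-in for the notebook's fetch_continuous:
# every chunk fails the first time it is requested, then succeeds.
seen = set()


def flaky_fetch(chunks):
    ok, failed = [], []
    for chunk in chunks:
        if chunk in seen:
            ok.append(chunk)
        else:
            seen.add(chunk)
            failed.append(chunk)
    return ok, failed


chunks = [("2023-10-01", "2023-12-31"), ("2024-01-01", "2024-03-31")]
delays = []  # record the requested delays instead of actually sleeping
rounds, still_failed = retry_failed(flaky_fetch, chunks, sleep=delays.append)
print(rounds, still_failed, delays)
```

With the notebook's helper, `fetch` could be a small wrapper such as `lambda c: fetch_continuous(c, monitoring_location_id=site, parameter_code="00095")`, with the per-round frames combined afterwards via `pd.concat(rounds, ignore_index=True)`. Capping the number of rounds keeps a persistent outage from looping forever.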