Commit 569ff38

thodson-usgs and claude authored
docs: Python ports of the new USGS Water Data API vignettes (#291)
Port five new R dataRetrieval Water Data API vignettes to the Python `waterdata` module as executable demo notebooks, wired into the Sphinx docs under a new "USGS Water Data API vignettes" section:

- USGS_WaterData_Introduction_Examples (read_waterdata_functions.Rmd)
- USGS_WaterData_DiscreteSamples_Examples (samples_data.Rmd)
- USGS_WaterData_DailyStatistics_Examples (daily_data_statistics.Rmd)
- USGS_WaterData_ContinuousData_Examples (continuous_pr.Rmd)
- USGS_WaterData_ReferenceLists_Examples (Reference_Lists.Rmd)

Each notebook was executed end-to-end against the live USGS Water Data API during development; outputs are cleared per the repo convention (the Sphinx docs build re-executes notebooks at build time).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 9161fbd commit 569ff38

11 files changed

Lines changed: 2081 additions & 0 deletions
Lines changed: 257 additions & 0 deletions
@@ -0,0 +1,257 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "d664492b",
+   "metadata": {},
+   "source": [
+    "# Continuous Data\n",
+    "\n",
+    "Continuous data are collected by automated sensors, typically at a fixed\n",
+    "15-minute interval (you may also hear them called \"instantaneous values\" or\n",
+    "\"IV\"). They are described by parameter name and parameter code, and retrieved\n",
+    "with `get_continuous`.\n",
+    "\n",
+    "This notebook covers what matters when a continuous pull gets large:\n",
+    "`dataretrieval` **chunks big requests for you**, it can **resume** a pull that\n",
+    "was interrupted partway through, and there is one case you still handle\n",
+    "yourself: the service's 3-year-per-request time limit."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e7e06e81",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "\n",
+    "from dataretrieval import waterdata\n",
+    "\n",
+    "site = \"USGS-0208458892\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b0136bd1",
+   "metadata": {},
+   "source": [
+    "## What continuous data are available?\n",
+    "\n",
+    "Filter the combined metadata to `data_type=\"Continuous values\"` to see which\n",
+    "time series a site offers and how far back each goes:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6f8a9d87",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "continuous_available, _ = waterdata.get_combined_metadata(\n",
+    "    monitoring_location_id=site,\n",
+    "    data_type=\"Continuous values\",\n",
+    ")\n",
+    "avail = continuous_available[[\"parameter_code\", \"parameter_name\", \"begin\", \"end\"]]\n",
+    "avail.sort_values(\"parameter_code\").reset_index(drop=True)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fdaa8150",
+   "metadata": {},
+   "source": [
+    "## Large requests are chunked for you\n",
+    "\n",
+    "Any list-valued argument (a long list of monitoring locations, several parameter\n",
+    "codes, a complex CQL filter) can push a single request URL past the server's\n",
+    "~8 KB limit. `dataretrieval` handles this automatically: it splits the query into\n",
+    "URL-sized sub-requests, issues them, and recombines (and de-duplicates) the\n",
+    "results into one frame. **You never need to loop over sites yourself**; request\n",
+    "everything in one call.\n",
+    "\n",
+    "For example, asking for several parameter codes at once just returns one combined\n",
+    "long-format frame:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6bc05102",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "multi, _ = waterdata.get_continuous(\n",
+    "    monitoring_location_id=site,\n",
+    "    parameter_code=[\"00095\", \"00010\"],  # specific conductance + water temperature\n",
+    "    time=\"2024-07-01/2024-07-02\",\n",
+    ")\n",
+    "multi.groupby(\"parameter_code\")[\"value\"].agg([\"count\", \"min\", \"max\"])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "353ad4ec",
+   "metadata": {},
+   "source": [
+    "## Resilient pulls: resume after an interruption\n",
+    "\n",
+    "A large request becomes many sub-requests under the hood, so a long pull can be\n",
+    "interrupted partway through by a rate limit (HTTP 429) or a transient server\n",
+    "error (HTTP 5xx). Rather than discard the work already done, `dataretrieval`\n",
+    "raises a `ChunkInterrupted` that **preserves the completed sub-requests** and\n",
+    "lets you continue:\n",
+    "\n",
+    "- `QuotaExhausted` (429) and `ServiceInterrupted` (5xx) both subclass\n",
+    "  `ChunkInterrupted`.\n",
+    "- `exc.partial_frame` holds whatever completed before the failure.\n",
+    "- `exc.retry_after` is the server's suggested wait (when provided).\n",
+    "- `exc.call.resume()` re-issues **only the still-pending** sub-requests and\n",
+    "  returns the full `(data, metadata)`.\n",
+    "\n",
+    "The pattern below waits out the interruption and resumes until the pull\n",
+    "finishes. (In normal conditions the request completes on the first try and the\n",
+    "`except` block never runs.)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e2e9ddff",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import time\n",
+    "\n",
+    "from dataretrieval.waterdata.chunking import ChunkInterrupted\n",
+    "\n",
+    "try:\n",
+    "    sensor_data, _ = waterdata.get_continuous(\n",
+    "        monitoring_location_id=site,\n",
+    "        parameter_code=\"00095\",\n",
+    "        time=\"2024-07-01/2024-07-08\",\n",
+    "    )\n",
+    "except ChunkInterrupted as exc:\n",
+    "    print(\n",
+    "        f\"interrupted after {exc.completed_chunks}/{exc.total_chunks} chunks; resuming\"\n",
+    "    )\n",
+    "    while True:\n",
+    "        time.sleep(exc.retry_after or 5 * 60)  # honor Retry-After, else back off\n",
+    "        try:\n",
+    "            sensor_data, _ = exc.call.resume()\n",
+    "            break\n",
+    "        except ChunkInterrupted as again:\n",
+    "            exc = again\n",
+    "\n",
+    "print(f\"{len(sensor_data):,} rows\")\n",
+    "sensor_data[[\"time\", \"parameter_code\", \"value\", \"approval_status\"]].head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "397e87b5",
+   "metadata": {},
+   "source": [
+    "## The 3-year window: the one axis you split yourself\n",
+    "\n",
+    "There is one limit the library does **not** chunk for you: the continuous service\n",
+    "returns at most **3 years of data per request**, and a time window is not a\n",
+    "list-shaped axis it can fan out. (With no `time` argument the service returns the\n",
+    "latest year; continuous data also have no geometry column and ignore bounding-box\n",
+    "queries.)\n",
+    "\n",
+    "So a multi-year, single-site pull is the one place you still split by time. The\n",
+    "service is most efficient one calendar year at a time, so build a list of yearly\n",
+    "windows:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "bd26d199",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Split [start, end] into per-calendar-year (start, end) date strings.\n",
+    "def year_chunks(start, end):\n",
+    "    start, end = pd.Timestamp(start), pd.Timestamp(end)\n",
+    "    edges = pd.to_datetime([f\"{y}-01-01\" for y in range(start.year + 1, end.year + 1)])\n",
+    "    starts = [start, *edges]\n",
+    "    ends = [*(edges - pd.Timedelta(days=1)), end]\n",
+    "    return [\n",
+    "        (s.strftime(\"%Y-%m-%d\"), e.strftime(\"%Y-%m-%d\")) for s, e in zip(starts, ends)\n",
+    "    ]\n",
+    "\n",
+    "\n",
+    "# Covering a full multi-year record (no data downloaded here):\n",
+    "pd.DataFrame(year_chunks(\"2012-10-01\", \"2025-09-30\"), columns=[\"start\", \"end\"])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3bc4f40f",
+   "metadata": {},
+   "source": [
+    "Then request each window and concatenate. (We use a short two-window span here so\n",
+    "the notebook runs quickly; widen the dates for a full period of record.)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "01ebb4a0",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "chunks = year_chunks(\"2023-10-01\", \"2024-03-31\")\n",
+    "\n",
+    "frames = []\n",
+    "for start, end in chunks:\n",
+    "    part, _ = waterdata.get_continuous(\n",
+    "        monitoring_location_id=site,\n",
+    "        parameter_code=\"00095\",\n",
+    "        time=f\"{start}/{end}\",\n",
+    "    )\n",
+    "    frames.append(part)\n",
+    "\n",
+    "por = pd.concat(frames, ignore_index=True)\n",
+    "print(\n",
+    "    f\"{len(por):,} rows from {len(chunks)} windows, \"\n",
+    "    f\"{por['time'].min()} -> {por['time'].max()}\"\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e2487bf4",
+   "metadata": {},
+   "source": [
+    "Wrap each window's call in the resume pattern above for an unattended,\n",
+    "restart-safe pull. USGS also expects to offer a direct full-period-of-record\n",
+    "download before the legacy NWIS services are decommissioned, which may make\n",
+    "time-window splitting unnecessary; check the documentation for updates.\n",
+    "\n",
+    "## More help\n",
+    "\n",
+    "- Documentation: <https://doi-usgs.github.io/dataretrieval-python/>\n",
+    "- Chunking and resume internals: `dataretrieval.waterdata.chunking`\n",
+    "- Issues / questions: <https://github.com/DOI-USGS/dataretrieval-python/issues>\n",
+    "- Equivalent R article: [Continuous Data](https://doi-usgs.github.io/dataRetrieval/articles/continuous_pr.html)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
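
The "Large requests are chunked for you" cell in the notebook above describes splitting a query into URL-sized sub-requests and recombining the de-duplicated results. The following is a minimal, standard-library sketch of that general idea, not `dataretrieval`'s actual implementation. Everything in it is an assumption made for illustration: the placeholder `BASE_URL`, the `MAX_URL_BYTES` budget (chosen to sit near the ~8 KB limit the notebook mentions), the comma-joined query encoding, the `id` field used for de-duplication, and the helper names `url_for`, `split_to_fit`, and `recombine`.

```python
from urllib.parse import urlencode

# Assumed values for illustration only; not taken from dataretrieval.
MAX_URL_BYTES = 8000
BASE_URL = "https://example.invalid/collections/continuous/items"  # placeholder


def url_for(values, key="monitoring_location_id"):
    """Build a GET URL with the values comma-joined into one query parameter."""
    return f"{BASE_URL}?{urlencode({key: ','.join(values)})}"


def split_to_fit(values, max_bytes=MAX_URL_BYTES):
    """Greedily pack values into consecutive chunks whose URL fits in max_bytes.

    A single value that is too long on its own still gets its own chunk.
    """
    chunks, current = [], []
    for value in values:
        candidate = current + [value]
        if current and len(url_for(candidate).encode()) > max_bytes:
            chunks.append(current)  # current chunk is full; start a new one
            current = [value]
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def recombine(pages):
    """Concatenate per-chunk record lists, dropping repeats by record id."""
    seen, merged = set(), []
    for page in pages:
        for record in page:
            if record["id"] not in seen:
                seen.add(record["id"])
                merged.append(record)
    return merged


# Hypothetical site identifiers, enough of them to overflow one URL.
sites = [f"USGS-{n:08d}" for n in range(2000)]
chunks = split_to_fit(sites)
print(len(chunks), "sub-requests; longest URL:", max(len(url_for(c)) for c in chunks))
```

The greedy packing rebuilds the URL for every candidate value, which is simple rather than fast; a real client would more likely track the running length. Measuring the encoded URL, rather than counting values, is what keeps each sub-request under the byte budget when identifiers differ in length.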
