
Commit 0f90912

thodson-usgs and claude authored
Parse Date/Time/TimeZone triplets in samples and WQP (#272)
Add a shared utils.attach_datetime_columns helper that scans a CSV-derived DataFrame for <prefix>Date / <prefix>Time / <prefix>TimeZone triplets and appends a derived <prefix>DateTime UTC column for each one, leaving the original triplet columns intact.

Recognizes both the WQX3 / Samples naming (Activity_StartDate, Activity_StartTime, Activity_StartTimeZone) and the legacy WQP naming (ActivityStartDate, ActivityStartTime/Time, ActivityStartTime/TimeZoneCode).

Mirrors R dataRetrieval's create_dateTime. Wired into waterdata.get_samples and wqp.get_results.

Closes #266.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 18f831f commit 0f90912

11 files changed

Lines changed: 245 additions & 44 deletions


NEWS.md

Lines changed: 4 additions & 0 deletions
@@ -1,3 +1,7 @@
+**05/07/2026:** Bumped the declared minimum Python version from **3.8** to **3.9** (`pyproject.toml`'s `requires-python` and the ruff target). This brings the manifest in line with what was already being tested — CI's matrix has long covered only 3.9, 3.13, and 3.14, the `waterdata` test module already skipped itself on Python < 3.10, and several modules already use 3.9-only stdlib (e.g. `zoneinfo`). Users on 3.8 will no longer be able to install the package; please upgrade.
+
+**05/07/2026:** `waterdata.get_samples()` and `wqp.get_results()` now append a derived `<prefix>DateTime` UTC column for every Date/Time/TimeZone triplet in the response (e.g. `Activity_StartDate` + `Activity_StartTime` + `Activity_StartTimeZone` → `Activity_StartDateTime`). Both the WQX3 (`<X>Date`/`<X>Time`/`<X>TimeZone`) and legacy WQP (`<X>Date`/`<X>Time/Time`/`<X>Time/TimeZoneCode`) shapes are recognized; abbreviations like EST/EDT/CST/PST resolve to a UTC `Timestamp`, unknown codes resolve to `NaT`, and the original triplet columns are preserved. Returned rows are also now sorted by `Activity_StartDateTime` (or the legacy `ActivityStartDateTime`) — the underlying APIs return rows in an unstable order. Mirrors R's `create_dateTime` and end-of-pipeline sort. Closes #266.
+
 **05/06/2026:** Each remaining active function in `dataretrieval.nwis` now emits a per-function `DeprecationWarning` naming the `waterdata` replacement to migrate to (visible the first time users call each getter). The `nwis` module is scheduled for removal on or after **2027-05-06**.

 **05/06/2026:** Added `waterdata.get_ratings(...)` — wraps the new Water Data STAC catalog (`api.waterdata.usgs.gov/stac/v0/search`) for USGS stage-discharge rating curves. Returns parsed `exsa` / `base` / `corr` rating tables as a dict of DataFrames keyed by feature ID, or just the list of available STAC features when `download_and_parse=False`. Mirrors R's `read_waterdata_ratings`.
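The two column shapes named in the NEWS entry can be matched on column names alone. The following is a minimal, standard-library sketch of that matching step. `find_triplets` and `SUFFIX_PAIRS` are hypothetical names used only for illustration and are not part of the package; the suffix pairs are the ones listed in the entry.

```python
from typing import Dict, Iterable, Tuple

# (time-suffix, tz-suffix) pairs taken from the two shapes named in the entry.
SUFFIX_PAIRS = (
    ("Time", "TimeZone"),                # WQX3 / Samples
    ("Time/Time", "Time/TimeZoneCode"),  # legacy WQP
)


def find_triplets(columns: Iterable[str]) -> Dict[str, Tuple[str, str, str]]:
    """Map each derivable '<prefix>DateTime' name to its (date, time, tz) columns.

    Hypothetical helper: it only inspects names and never touches data.
    """
    names = list(columns)
    present = set(names)
    found: Dict[str, Tuple[str, str, str]] = {}
    for col in names:
        if not col.endswith("Date"):
            continue
        prefix = col[: -len("Date")]
        target = prefix + "DateTime"
        # Skip targets that already exist so nothing would be overwritten.
        if target in present or target in found:
            continue
        for time_suffix, tz_suffix in SUFFIX_PAIRS:
            time_col = prefix + time_suffix
            tz_col = prefix + tz_suffix
            if time_col in present and tz_col in present:
                found[target] = (col, time_col, tz_col)
                break
    return found
```

A date column with no matching time and timezone columns (for example a lone `LastChangeDate`) yields no entry, so only complete triplets produce a derived column.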

dataretrieval/utils.py

Lines changed: 102 additions & 0 deletions
@@ -94,6 +94,108 @@ def format_datetime(df, date_field, time_field, tz_field):
     return df


+# (time-suffix, tz-suffix) pairs that follow a "<prefix>Date" column.
+_TIME_TZ_SUFFIXES = (
+    # WQX3 / Samples, e.g.
+    # Activity_StartDate / Activity_StartTime / Activity_StartTimeZone
+    ("Time", "TimeZone"),
+    # Legacy WQP (slash-separated), e.g.
+    # ActivityStartDate / ActivityStartTime/Time / ActivityStartTime/TimeZoneCode
+    ("Time/Time", "Time/TimeZoneCode"),
+)
+
+
+def _build_utc_datetime(
+    date_series: pd.Series, time_series: pd.Series, tz_series: pd.Series
+) -> pd.Series:
+    """Combine date + time + tz-abbreviation columns into a UTC pandas Series.
+
+    Unknown timezone codes (and rows missing any of the three values) yield
+    ``NaT``. The input columns are not mutated.
+    """
+    offsets = tz_series.map(tz)
+    combined = (
+        date_series.astype("string")
+        + " "
+        + time_series.astype("string")
+        + " "
+        + offsets.astype("string")
+    )
+    return pd.to_datetime(
+        combined, format="%Y-%m-%d %H:%M:%S %z", utc=True, errors="coerce"
+    )
+
+
+def _attach_datetime_columns(df: pd.DataFrame) -> pd.DataFrame:
+    """Add ``<prefix>DateTime`` UTC columns for any Date/Time/TimeZone triplets
+    and sort the frame by the activity-start datetime.
+
+    Detects two naming patterns that appear in USGS Samples and Water Quality
+    Portal CSV responses:
+
+    * **WQX3** — ``<prefix>Date``, ``<prefix>Time``, ``<prefix>TimeZone``
+    * **Legacy WQP** — ``<prefix>Date``, ``<prefix>Time/Time``,
+      ``<prefix>Time/TimeZoneCode``
+
+    For every triplet present, a new ``<prefix>DateTime`` column is appended
+    holding a UTC ``Timestamp`` (offsets resolved via
+    :data:`dataretrieval.codes.tz`). The original Date/Time/TimeZone columns
+    are left intact, and an existing ``<prefix>DateTime`` column is never
+    overwritten.
+
+    Rows are sorted (and the index reset) by the canonical activity-start
+    datetime when present — ``Activity_StartDateTime`` (WQX3) or
+    ``ActivityStartDateTime`` (legacy WQP) — falling back to the first
+    detected ``*Date`` column. Mirrors R ``dataRetrieval``'s
+    end-of-pipeline sort in ``importWQP.R``.
+
+    Parameters
+    ----------
+    df : ``pandas.DataFrame``
+        DataFrame returned from a Samples or WQP CSV endpoint.
+
+    Returns
+    -------
+    df : ``pandas.DataFrame``
+        A new DataFrame with derivable ``<prefix>DateTime`` columns appended
+        and rows sorted by the activity-start datetime (if any date column
+        was detected).
+    """
+    columns = set(df.columns)
+    new_columns = {}
+    first_date_col = None
+    for col in df.columns:
+        if not col.endswith("Date"):
+            continue
+        if first_date_col is None:
+            first_date_col = col
+        prefix = col.removesuffix("Date")
+        target = prefix + "DateTime"
+        if target in columns or target in new_columns:
+            continue
+        for time_suffix, tz_suffix in _TIME_TZ_SUFFIXES:
+            time_col = prefix + time_suffix
+            tz_col = prefix + tz_suffix
+            if time_col in columns and tz_col in columns:
+                new_columns[target] = _build_utc_datetime(
+                    df[col], df[time_col], df[tz_col]
+                )
+                break
+    if new_columns:
+        # Concat in one shot — per-column assignment on a wide CSV-derived
+        # frame triggers pandas' fragmentation PerformanceWarning.
+        df = pd.concat([df, pd.DataFrame(new_columns, index=df.index)], axis=1)
+    if "Activity_StartDateTime" in df.columns:
+        sort_key = "Activity_StartDateTime"
+    elif "ActivityStartDateTime" in df.columns:
+        sort_key = "ActivityStartDateTime"
+    else:
+        sort_key = first_date_col
+    if sort_key is not None:
+        df = df.sort_values(by=sort_key, ignore_index=True)
+    return df
+
+
 class BaseMetadata:
     """Base class for metadata.
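The conversion in `_build_utc_datetime` can be illustrated for a single row with the standard library alone. This is a sketch under stated assumptions, not the package's implementation. `TZ_OFFSETS` and `to_utc` are hypothetical, and the `±HHMM` offset strings are an assumed encoding; the real contents of `dataretrieval.codes.tz` are not shown in this diff. Only the PST and EST offsets are implied by the tests in this commit.

```python
from datetime import datetime, timezone
from typing import Optional

# Hypothetical stand-in for dataretrieval.codes.tz (format assumed).
TZ_OFFSETS = {"PST": "-0800", "EST": "-0500"}


def to_utc(date: str, time: str, tz_code: str) -> Optional[datetime]:
    """Return the UTC datetime for one row, or None when it cannot be built.

    None plays the role that NaT plays in the pandas version.
    """
    offset = TZ_OFFSETS.get(tz_code)
    if offset is None:
        return None  # unknown timezone abbreviation
    try:
        local = datetime.strptime(
            f"{date} {time} {offset}", "%Y-%m-%d %H:%M:%S %z"
        )
    except ValueError:
        return None  # malformed date or time string
    return local.astimezone(timezone.utc)


print(to_utc("2024-01-09", "10:00:00", "PST"))  # 2024-01-09 18:00:00+00:00
```

Mapping the abbreviation to an offset before parsing keeps the format string fixed, which is the same idea the helper applies column-wise with `pd.to_datetime(..., format=..., utc=True, errors="coerce")`.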

dataretrieval/waterdata/api.py

Lines changed: 11 additions & 2 deletions
@@ -16,7 +16,7 @@
 import requests
 from requests.models import PreparedRequest

-from dataretrieval.utils import BaseMetadata, to_str
+from dataretrieval.utils import BaseMetadata, _attach_datetime_columns, to_str
 from dataretrieval.waterdata.filters import FILTER_LANG
 from dataretrieval.waterdata.types import (
     CODE_SERVICES,
@@ -2266,7 +2266,15 @@ def get_samples(
     Returns
     -------
     df : ``pandas.DataFrame``
-        Formatted data returned from the API query.
+        Formatted data returned from the API query. For each
+        ``<prefix>Date`` / ``<prefix>Time`` / ``<prefix>TimeZone`` triplet in
+        the response (e.g. ``Activity_StartDate``, ``Activity_StartTime``,
+        ``Activity_StartTimeZone``), an additional ``<prefix>DateTime`` column
+        is appended holding a UTC ``Timestamp`` derived from the three. The
+        original Date/Time/TimeZone columns are left intact; rows whose
+        timezone abbreviation is not recognized resolve to ``NaT``. Rows are
+        sorted by ``Activity_StartDateTime`` when present (the API's default
+        order is unstable).
     md : :obj:`dataretrieval.utils.Metadata`
         Custom ``dataretrieval`` metadata object pertaining to the query.

@@ -2323,6 +2331,7 @@ def get_samples(
     response.raise_for_status()

     df = pd.read_csv(StringIO(response.text), delimiter=",")
+    df = _attach_datetime_columns(df)

     return df, BaseMetadata(response)
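The docstring above combines two behaviors: unrecognized timezone codes become `NaT`, and rows are sorted by `Activity_StartDateTime`. The sketch below shows how those interact in plain pandas, assuming the default `na_position="last"` of `sort_values`. The frame and its `Result_Measure` values are made up for illustration, and the derived column is built by hand rather than by calling the package.

```python
import pandas as pd

# Made-up rows; the third one stands for a row whose timezone code was unknown.
df = pd.DataFrame(
    {
        "Activity_StartDate": ["2024-02-15", "2024-01-09", "2024-03-01"],
        "Result_Measure": [3, 1, 2],
    }
)

# Hand-built derived column: two UTC timestamps and one missing value (NaT).
derived = {
    "Activity_StartDateTime": pd.Series(
        pd.to_datetime(
            ["2024-02-15 19:30:00", "2024-01-09 18:00:00", None], utc=True
        )
    )
}

# Append all derived columns in one concat, then sort and reset the index.
out = pd.concat([df, pd.DataFrame(derived, index=df.index)], axis=1)
out = out.sort_values(by="Activity_StartDateTime", ignore_index=True)

print(out["Result_Measure"].tolist())
```

With the default settings, rows lacking a resolvable datetime are kept and placed after all rows that have one.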

dataretrieval/waterdata/ratings.py

Lines changed: 2 additions & 1 deletion
@@ -14,7 +14,8 @@

 import logging
 import os
-from typing import Any, Iterable, Literal, get_args
+from collections.abc import Iterable
+from typing import Any, Literal, get_args

 import pandas as pd
 import requests

dataretrieval/waterdata/utils.py

Lines changed: 1 addition & 1 deletion
@@ -6,10 +6,10 @@
 import re
 from datetime import datetime
 from typing import Any, get_args
+from zoneinfo import ZoneInfo

 import pandas as pd
 import requests
-from zoneinfo import ZoneInfo

 from dataretrieval import __version__
 from dataretrieval.utils import BaseMetadata

dataretrieval/wqp.py

Lines changed: 10 additions & 2 deletions
@@ -17,7 +17,7 @@

 import pandas as pd

-from .utils import BaseMetadata, query
+from .utils import BaseMetadata, _attach_datetime_columns, query

 if TYPE_CHECKING:
     from pandas import DataFrame
@@ -101,7 +101,14 @@ def get_results(
     Returns
     -------
     df : ``pandas.DataFrame``
-        Formatted data returned from the API query.
+        Formatted data returned from the API query. For each
+        ``<prefix>Date`` / ``<prefix>Time`` / ``<prefix>TimeZone`` triplet in
+        the response (legacy WQP uses ``<prefix>Time/Time`` and
+        ``<prefix>Time/TimeZoneCode``), an additional ``<prefix>DateTime``
+        column is appended holding a UTC ``Timestamp``. Original triplet
+        columns are preserved; unrecognized timezone codes yield ``NaT``.
+        Rows are sorted by ``ActivityStartDateTime`` (or ``Activity_StartDateTime``
+        for WQX3 responses) when present.
     md : :obj:`dataretrieval.utils.Metadata`
         Custom ``dataretrieval`` metadata object pertaining to the query.

@@ -147,6 +154,7 @@ def get_results(
     response = query(url, kwargs, delimiter=";", ssl_check=ssl_check)

     df = pd.read_csv(StringIO(response.text), delimiter=",", low_memory=False)
+    df = _attach_datetime_columns(df)
     return df, WQP_Metadata(response)

pyproject.toml

Lines changed: 2 additions & 2 deletions
@@ -6,7 +6,7 @@ build-backend = "setuptools.build_meta"
 name = "dataretrieval"
 description = "Discover and retrieve water data from U.S. federal hydrologic web services."
 readme = "README.md"
-requires-python = ">=3.8"
+requires-python = ">=3.9"
 keywords = ["USGS", "water data"]
 license = "CC0-1.0"
 license-files = ["LICENSE.md"]
@@ -63,7 +63,7 @@ repository = "https://github.com/DOI-USGS/dataretrieval-python.git"
 write_to = "dataretrieval/_version.py"

 [tool.ruff]
-target-version = "py38"
+target-version = "py39"
 extend-exclude = ["demos"]

 [tool.ruff.lint]

tests/utils_test.py

Lines changed: 58 additions & 0 deletions
@@ -97,3 +97,61 @@ def test_to_str_custom_delimiter(self):

     def test_to_str_non_iterable(self):
         assert utils.to_str(123) is None
+
+
+class Test_attach_datetime_columns:
+    """Tests of _attach_datetime_columns, which derives <prefix>DateTime UTC
+    columns from Date/Time/TimeZone triplets in Samples and WQP CSVs."""
+
+    def test_wqx3_triplet_resolves_to_utc(self):
+        df = pd.DataFrame(
+            {
+                "Activity_StartDate": ["2024-01-09", "2024-02-15"],
+                "Activity_StartTime": ["10:00:00", "14:30:00"],
+                "Activity_StartTimeZone": ["PST", "EST"],
+            }
+        )
+        df = utils._attach_datetime_columns(df)
+        assert df["Activity_StartDateTime"][0] == pd.Timestamp(
+            "2024-01-09 18:00:00", tz="UTC"
+        )
+        assert df["Activity_StartDateTime"][1] == pd.Timestamp(
+            "2024-02-15 19:30:00", tz="UTC"
+        )
+        assert df["Activity_StartTimeZone"].tolist() == ["PST", "EST"]
+
+    def test_legacy_wqp_triplet_resolves_to_utc(self):
+        df = pd.DataFrame(
+            {
+                "ActivityStartDate": ["2024-01-09"],
+                "ActivityStartTime/Time": ["10:00:00"],
+                "ActivityStartTime/TimeZoneCode": ["PST"],
+            }
+        )
+        df = utils._attach_datetime_columns(df)
+        assert df["ActivityStartDateTime"][0] == pd.Timestamp(
+            "2024-01-09 18:00:00", tz="UTC"
+        )
+
+    def test_unknown_timezone_is_NaT(self):
+        df = pd.DataFrame(
+            {
+                "Activity_StartDate": ["2024-01-09"],
+                "Activity_StartTime": ["10:00:00"],
+                "Activity_StartTimeZone": ["BOGUS"],
+            }
+        )
+        df = utils._attach_datetime_columns(df)
+        assert df["Activity_StartDateTime"].isna().all()
+
+    def test_existing_datetime_column_not_overwritten(self):
+        df = pd.DataFrame(
+            {
+                "Activity_StartDate": ["2024-01-09"],
+                "Activity_StartTime": ["10:00:00"],
+                "Activity_StartTimeZone": ["PST"],
+                "Activity_StartDateTime": ["preexisting"],
+            }
+        )
+        df = utils._attach_datetime_columns(df)
+        assert df["Activity_StartDateTime"].tolist() == ["preexisting"]

tests/waterdata_filters_test.py

Lines changed: 47 additions & 32 deletions
@@ -190,14 +190,18 @@ def fake_walk_pages(*_args, **_kwargs):
         frame = pd.DataFrame({"id": [f"chunk-{idx}"], "value": [idx]})
         return frame, _fake_response()

-    with mock.patch(
-        "dataretrieval.waterdata.utils._construct_api_requests",
-        side_effect=fake_construct_api_requests,
-    ), mock.patch(
-        "dataretrieval.waterdata.utils._walk_pages", side_effect=fake_walk_pages
-    ), mock.patch(
-        "dataretrieval.waterdata.filters._effective_filter_budget",
-        return_value=_CQL_FILTER_CHUNK_LEN,
+    with (
+        mock.patch(
+            "dataretrieval.waterdata.utils._construct_api_requests",
+            side_effect=fake_construct_api_requests,
+        ),
+        mock.patch(
+            "dataretrieval.waterdata.utils._walk_pages", side_effect=fake_walk_pages
+        ),
+        mock.patch(
+            "dataretrieval.waterdata.filters._effective_filter_budget",
+            return_value=_CQL_FILTER_CHUNK_LEN,
+        ),
     ):
         df, _ = get_continuous(
             monitoring_location_id="USGS-07374525",
@@ -239,14 +243,18 @@ def fake_walk_pages(*_args, **_kwargs):
         frame = pd.DataFrame({"id": ["shared-feature"], "value": [1]})
         return frame, _fake_response()

-    with mock.patch(
-        "dataretrieval.waterdata.utils._construct_api_requests",
-        return_value=_fake_prepared_request(),
-    ), mock.patch(
-        "dataretrieval.waterdata.utils._walk_pages", side_effect=fake_walk_pages
-    ), mock.patch(
-        "dataretrieval.waterdata.filters._effective_filter_budget",
-        return_value=_CQL_FILTER_CHUNK_LEN,
+    with (
+        mock.patch(
+            "dataretrieval.waterdata.utils._construct_api_requests",
+            return_value=_fake_prepared_request(),
+        ),
+        mock.patch(
+            "dataretrieval.waterdata.utils._walk_pages", side_effect=fake_walk_pages
+        ),
+        mock.patch(
+            "dataretrieval.waterdata.filters._effective_filter_budget",
+            return_value=_CQL_FILTER_CHUNK_LEN,
+        ),
     ):
         df, _ = get_continuous(
             monitoring_location_id="USGS-07374525",
@@ -293,14 +301,18 @@ def fake_walk_pages(*_args, **_kwargs):
         )
         return frame, _fake_response()

-    with mock.patch(
-        "dataretrieval.waterdata.utils._construct_api_requests",
-        return_value=_fake_prepared_request(),
-    ), mock.patch(
-        "dataretrieval.waterdata.utils._walk_pages", side_effect=fake_walk_pages
-    ), mock.patch(
-        "dataretrieval.waterdata.filters._effective_filter_budget",
-        return_value=_CQL_FILTER_CHUNK_LEN,
+    with (
+        mock.patch(
+            "dataretrieval.waterdata.utils._construct_api_requests",
+            return_value=_fake_prepared_request(),
+        ),
+        mock.patch(
+            "dataretrieval.waterdata.utils._walk_pages", side_effect=fake_walk_pages
+        ),
+        mock.patch(
+            "dataretrieval.waterdata.filters._effective_filter_budget",
+            return_value=_CQL_FILTER_CHUNK_LEN,
+        ),
     ):
         df, _ = get_continuous(
             monitoring_location_id="USGS-07374525",
@@ -434,14 +446,17 @@ def fake_construct_api_requests(**kwargs):
         sent_filters.append(kwargs.get("filter"))
         return _fake_prepared_request()

-    with mock.patch(
-        "dataretrieval.waterdata.utils._construct_api_requests",
-        side_effect=fake_construct_api_requests,
-    ), mock.patch(
-        "dataretrieval.waterdata.utils._walk_pages",
-        return_value=(
-            pd.DataFrame({"id": ["row-1"], "value": [1]}),
-            _fake_response(),
+    with (
+        mock.patch(
+            "dataretrieval.waterdata.utils._construct_api_requests",
+            side_effect=fake_construct_api_requests,
+        ),
+        mock.patch(
+            "dataretrieval.waterdata.utils._walk_pages",
+            return_value=(
+                pd.DataFrame({"id": ["row-1"], "value": [1]}),
+                _fake_response(),
+            ),
         ),
     ):
         get_continuous(
0 commit comments

Comments
 (0)