Commit df8943e

thodson-usgs and claude committed
refactor(waterdata): decompose the 2033-LOC utils.py god-module into cohesive modules
waterdata/utils.py had grown to 2033 LOC spanning ~6 unrelated domains (request building, response parsing, result finalization, pagination/async, stats post-processing, validation) plus constants and the public engines -- changing one domain meant navigating all of them.

Split it into six cohesive private modules under dataretrieval/waterdata/, moving every definition VERBATIM (no signature/logic changes):

  _constants.py   URLs, _OUTPUT_ID_BY_SERVICE, regexes, param sets
  _http.py        headers, _error_body, _raise_for_non_200, retry-after
  _validate.py    arg normalization/validation (_get_args, _check_*)
  _requests.py    request building (_construct_api_requests, CQL2, dates)
  _responses.py   geometry-agnostic parsing/finalization/stats shaping
  _engine.py      pagination/async driver (_paginate, _run_sync, ...)

utils.py (2033 -> 651 LOC) becomes a thin facade that re-exports the package's internal API (explicit __all__, 56 names) so every existing `from dataretrieval.waterdata.utils import ...` and `mock.patch("dataretrieval.waterdata.utils.<name>")` keeps working unchanged -- no import sites or tests were touched.

Seven functions stay physically defined in the facade because the test suite monkeypatches them (or `gpd`) by their module-global name (get_ogc_data, _fetch_once, get_stats_data, _get_resp_data, _ogc_parse_response, _walk_pages, _handle_stats_nesting): a function's global lookups resolve in its defining module, so they must live in utils.py for the patches to take effect. The geopandas import probe stays with them; the pagination logger keeps the name "dataretrieval.waterdata.utils".

Behavior-preserving: 56 top-level definitions moved AST-identically (none lost or duplicated); 469 tests pass, 2 skipped; ruff clean; chunking.py untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
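
The point about global lookups resolving in the defining module can be illustrated with a minimal, self-contained sketch. The module names `demo_impl` and `demo_facade` and the functions `helper` and `caller` are hypothetical stand-ins built in memory; they are not names from the package.

```python
import sys
import types
from unittest import mock

# Hypothetical "implementation" module: caller() looks up helper() as a global.
impl = types.ModuleType("demo_impl")
exec(
    "def helper():\n"
    "    return 'real'\n"
    "\n"
    "def caller():\n"
    "    return helper()\n",
    impl.__dict__,
)
sys.modules["demo_impl"] = impl

# Hypothetical facade module that only re-exports the two names.
facade = types.ModuleType("demo_facade")
facade.helper = impl.helper
facade.caller = impl.caller
sys.modules["demo_facade"] = facade

# Patching the facade's name rebinds demo_facade.helper only. caller() still
# resolves ``helper`` in demo_impl's globals, so the patch has no effect.
with mock.patch("demo_facade.helper", return_value="patched"):
    via_facade_patch = facade.caller()

# Patching the name in the defining module is what caller() actually sees.
with mock.patch("demo_impl.helper", return_value="patched"):
    via_impl_patch = facade.caller()

print(via_facade_patch, via_impl_patch)
```

Under these assumptions, a test that patches a name on the facade only influences functions whose globals are the facade's own namespace, which is consistent with the commit keeping the patched functions defined in utils.py.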
1 parent 1adf174 commit df8943e

7 files changed

Lines changed: 1861 additions & 1553 deletions

Lines changed: 97 additions & 0 deletions
@@ -0,0 +1,97 @@
+"""Dependency-free constants for the Water Data internals.
+
+URLs, the service ``id``-column map, datetime/duration regexes, and the
+parameter-classification frozensets shared across the request-building,
+response-parsing, validation, and pagination layers. This module imports
+nothing from the rest of the package so every other ``waterdata`` internal
+module can depend on it without risking an import cycle.
+"""
+
+from __future__ import annotations
+
+import re
+
+BASE_URL = "https://api.waterdata.usgs.gov"
+OGC_API_VERSION = "v0"
+OGC_API_URL = f"{BASE_URL}/ogcapi/{OGC_API_VERSION}"
+SAMPLES_URL = f"{BASE_URL}/samples-data"
+STATISTICS_API_VERSION = "v0"
+STATISTICS_API_URL = f"{BASE_URL}/statistics/{STATISTICS_API_VERSION}"
+
+# Maps each OGC waterdata service to its user-facing ``id`` column (the name the
+# typed getters rename the wire ``id`` to, e.g. ``daily`` -> ``daily_id``).
+# ``get_cql`` validates its ``service`` argument against these keys and
+# uses the value as the ``output_id`` for result shaping. Keep in sync with the
+# ``types.WATERDATA_SERVICES`` Literal (same keys).
+_OUTPUT_ID_BY_SERVICE: dict[str, str] = {
+    "channel-measurements": "channel_measurements_id",
+    "combined-metadata": "combined_meta_id",
+    "continuous": "continuous_id",
+    "daily": "daily_id",
+    "field-measurements": "field_measurement_id",
+    "field-measurements-metadata": "field_series_id",
+    "latest-continuous": "latest_continuous_id",
+    "latest-daily": "latest_daily_id",
+    "monitoring-locations": "monitoring_location_id",
+    "peaks": "peak_id",
+    "time-series-metadata": "time_series_id",
+}
+
+# Every service's output id EXCEPT the two that are genuinely user-facing
+# (``monitoring_location_id`` and ``time_series_id``). The rest are synthetic
+# per-record ids that ``_arrange_cols`` moves to the end of a result frame.
+# Derived from ``_OUTPUT_ID_BY_SERVICE`` so adding a service can't silently
+# leave a stray id column at the front again.
+_EXTRA_ID_COLS = set(_OUTPUT_ID_BY_SERVICE.values()) - {
+    "monitoring_location_id",
+    "time_series_id",
+}
+
+_DATETIME_FORMATS = (
+    "%Y-%m-%dT%H:%M:%S.%f%z",
+    "%Y-%m-%dT%H:%M:%S%z",
+    "%Y-%m-%dT%H:%M:%S.%f",
+    "%Y-%m-%dT%H:%M:%S",
+    "%Y-%m-%d %H:%M:%S.%f",
+    "%Y-%m-%d %H:%M:%S",
+    "%Y-%m-%d",
+)
+
+# Anchored to ``[Pp]\d`` so a normal word containing ``p`` (e.g. ``"Apr"``)
+# doesn't get mis-classified as an ISO 8601 duration; the optional ``T``
+# admits time-only forms like ``PT36H``.
+_DURATION_RE = re.compile(r"^[Pp]T?\d")
+
+# OGC API parameters that carry a date/datetime value (single string,
+# two-element range, or interval/duration string) rather than a multi-value
+# string list. Used by ``_construct_api_requests`` to keep them out of the
+# POST/CQL2 multi-value path and to route them through ``_format_api_dates``,
+# and by ``_NO_NORMALIZE_PARAMS`` to bypass string-iterable normalization.
+_DATE_RANGE_PARAMS = frozenset(
+    {"datetime", "last_modified", "begin", "begin_utc", "end", "end_utc", "time"}
+)
+
+# Services that don't support comma-separated values for multi-value GET
+# parameters and require POST with CQL2 JSON instead.
+_CQL2_REQUIRED_SERVICES = frozenset({"monitoring-locations"})
+
+_MONITORING_LOCATION_ID_RE = re.compile(r"[^-\s]+-[^-\s]+")
+
+
+# Iterable-shaped params that ``_get_args`` must NOT push through
+# ``_normalize_str_iterable`` (scalar non-string knobs are caught by runtime
+# type, so only iterables with special handling need to be named here):
+#   - date-range params may contain ``pd.NaT``/None or interval strings
+#   - ``bbox``/``boundingBox`` are ``list[float]``, sometimes ``numpy.ndarray``
+#   - ``get_peaks``'s int-valued filters (``water_year`` etc.) are ``list[int]``
+#   - ``get_combined_metadata``'s ``thresholds`` is ``list[float]``
+_NO_NORMALIZE_PARAMS = _DATE_RANGE_PARAMS | {
+    "bbox",
+    "boundingBox",
+    "water_year",
+    "year",
+    "month",
+    "day",
+    "peak_since",
+    "thresholds",
+}
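
To make the two regexes in the diff concrete, here is a small sketch of what they accept and reject. The patterns are copied from the diff; the helper functions `looks_like_duration` and `looks_like_location_id` are hypothetical, and requiring the whole string to match (`fullmatch`) for location ids is an assumption, since the diff does not show how the pattern is applied.

```python
import re

# Patterns copied verbatim from the diff above.
_DURATION_RE = re.compile(r"^[Pp]T?\d")
_MONITORING_LOCATION_ID_RE = re.compile(r"[^-\s]+-[^-\s]+")


def looks_like_duration(value: str) -> bool:
    """Hypothetical helper: True when the string starts like an ISO 8601 duration."""
    return _DURATION_RE.match(value) is not None


def looks_like_location_id(value: str) -> bool:
    """Hypothetical helper: assumes the entire string must match the pattern."""
    return _MONITORING_LOCATION_ID_RE.fullmatch(value) is not None


duration_checks = {
    text: looks_like_duration(text)
    for text in ("P7D", "PT36H", "p1m", "Apr", "2024-04-01")
}
location_checks = {
    text: looks_like_location_id(text)
    for text in ("USGS-01646500", "01646500", "USGS 01646500")
}

print(duration_checks)
print(location_checks)
```

The `^` anchor is what keeps `"Apr"` out: the string has to begin with `P` or `p`, optionally followed by `T`, and then a digit.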

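A tuple of `strptime` formats like `_DATETIME_FORMATS` is typically consumed by trying each format in order until one parses. The sketch below shows that pattern under stated assumptions: the tuple is copied from the diff, but `parse_first_match` is a hypothetical helper, and the diff does not show how the package actually uses the tuple.

```python
from __future__ import annotations

from datetime import datetime

# Format tuple copied verbatim from the diff above.
_DATETIME_FORMATS = (
    "%Y-%m-%dT%H:%M:%S.%f%z",
    "%Y-%m-%dT%H:%M:%S%z",
    "%Y-%m-%dT%H:%M:%S.%f",
    "%Y-%m-%dT%H:%M:%S",
    "%Y-%m-%d %H:%M:%S.%f",
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%d",
)


def parse_first_match(text: str) -> datetime | None:
    """Hypothetical helper: return the first successful parse, else None."""
    for fmt in _DATETIME_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            # This format does not fit; fall through to the next one.
            continue
    return None


date_only = parse_first_match("2024-04-01")
naive_iso = parse_first_match("2024-04-01T12:30:00")
aware_iso = parse_first_match("2024-04-01T12:30:00+00:00")
not_a_date = parse_first_match("Apr 1")
```

Ordering the tuple from most to least specific, as the diff does, means a timestamp with fractional seconds and an offset is tried against the richest format first, and a bare date only matches the last entry.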