Skip to content

Commit cca750e

Browse files
committed
refactor(waterdata): split the 2033-LOC utils.py god-module into a utils/ package
## Why `dataretrieval/waterdata/utils.py` had grown to 2033 LOC spanning ~6 unrelated domains -- request building, response parsing, result finalization, pagination/async, stats post-processing, and validation -- plus constants and the public engines. It was the package's one genuine god-module. (An architecture review found the package's OO is otherwise appropriate, so this is a modularization, not an OO-pattern refactor.) ## What Convert `utils.py` into a `utils/` package: the public surface stays in `utils/__init__.py` (a thin facade) and the implementation is split across six cohesive submodules, moving every definition verbatim (no signature/logic changes): | submodule | holds | |---|---| | `utils/constants.py` | URLs, `_OUTPUT_ID_BY_SERVICE`, regexes, param sets (dependency-free) | | `utils/http.py` | headers, `_error_body`, `_raise_for_non_200`, retry-after | | `utils/validate.py` | arg normalization/validation (`_get_args`, `_check_*`) | | `utils/requests.py` | request building (`_construct_api_requests`, CQL2, dates) | | `utils/responses.py` | geometry-agnostic parsing / finalization / stats shaping | | `utils/engine.py` | pagination/async driver (`_paginate`, `_run_sync`, ...) | `utils/__init__.py` re-exports the internal API (explicit `__all__`, 56 names), so every existing `from dataretrieval.waterdata.utils import ...` and `mock.patch("dataretrieval.waterdata.utils.<name>")` keeps working -- no import sites or tests were touched. `dataretrieval.waterdata.utils` resolves to the package's `__init__`, so the import path is unchanged from when it was a module. Seven functions remain physically defined in `utils/__init__.py` (`get_ogc_data`, `_fetch_once`, `get_stats_data`, `_get_resp_data`, `_ogc_parse_response`, `_walk_pages`, `_handle_stats_nesting`) because the test suite monkeypatches them (or `gpd`) by their `dataretrieval.waterdata.utils.*` name, and a function's global lookups resolve in its defining module. The geopandas probe stays with them, and the pagination logger keeps the name `dataretrieval.waterdata.utils` (a caplog test pins it). These could later move to the `engine`/`responses` submodules -- which do not import the package, so there is no cycle -- but that requires re-targeting the test patches; left as a follow-up. ## Behavior-preserving - 56 top-level definitions moved verbatim -- none lost, none duplicated. - 469 tests pass, 2 skipped; ruff clean; submodules import without cycles (`constants` <- `http`/`validate` <- `requests`/`responses` <- `engine` <- `__init__`); `chunking.py` untouched. ## Note Overlaps with the error-taxonomy (#313) and namespace (#315) PRs on `waterdata/` imports -- sequence on merge. 🤖 Generated with [Claude Code](https://claude.com/claude-code)
1 parent 1adf174 commit cca750e

8 files changed

Lines changed: 2346 additions & 2033 deletions

File tree

dataretrieval/waterdata/utils.py

Lines changed: 0 additions & 2033 deletions
This file was deleted.

dataretrieval/waterdata/utils/__init__.py

Lines changed: 655 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 97 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,97 @@
1+
"""Dependency-free constants for the Water Data internals.
2+
3+
URLs, the service ``id``-column map, datetime/duration regexes, and the
4+
parameter-classification frozensets shared across the request-building,
5+
response-parsing, validation, and pagination layers. This module imports
6+
nothing from the rest of the package so every other ``waterdata`` internal
7+
module can depend on it without risking an import cycle.
8+
"""
9+
10+
from __future__ import annotations
11+
12+
import re
13+
14+
BASE_URL = "https://api.waterdata.usgs.gov"
15+
OGC_API_VERSION = "v0"
16+
OGC_API_URL = f"{BASE_URL}/ogcapi/{OGC_API_VERSION}"
17+
SAMPLES_URL = f"{BASE_URL}/samples-data"
18+
STATISTICS_API_VERSION = "v0"
19+
STATISTICS_API_URL = f"{BASE_URL}/statistics/{STATISTICS_API_VERSION}"
20+
21+
# Maps each OGC waterdata service to its user-facing ``id`` column (the name the
22+
# typed getters rename the wire ``id`` to, e.g. ``daily`` -> ``daily_id``).
23+
# ``get_cql`` validates its ``service`` argument against these keys and
24+
# uses the value as the ``output_id`` for result shaping. Keep in sync with the
25+
# ``types.WATERDATA_SERVICES`` Literal (same keys).
26+
_OUTPUT_ID_BY_SERVICE: dict[str, str] = {
27+
"channel-measurements": "channel_measurements_id",
28+
"combined-metadata": "combined_meta_id",
29+
"continuous": "continuous_id",
30+
"daily": "daily_id",
31+
"field-measurements": "field_measurement_id",
32+
"field-measurements-metadata": "field_series_id",
33+
"latest-continuous": "latest_continuous_id",
34+
"latest-daily": "latest_daily_id",
35+
"monitoring-locations": "monitoring_location_id",
36+
"peaks": "peak_id",
37+
"time-series-metadata": "time_series_id",
38+
}
39+
40+
# Every service's output id EXCEPT the two that are genuinely user-facing
41+
# (``monitoring_location_id`` and ``time_series_id``). The rest are synthetic
42+
# per-record ids that ``_arrange_cols`` moves to the end of a result frame.
43+
# Derived from ``_OUTPUT_ID_BY_SERVICE`` so adding a service can't silently
44+
# leave a stray id column at the front again.
45+
_EXTRA_ID_COLS = set(_OUTPUT_ID_BY_SERVICE.values()) - {
46+
"monitoring_location_id",
47+
"time_series_id",
48+
}
49+
50+
_DATETIME_FORMATS = (
51+
"%Y-%m-%dT%H:%M:%S.%f%z",
52+
"%Y-%m-%dT%H:%M:%S%z",
53+
"%Y-%m-%dT%H:%M:%S.%f",
54+
"%Y-%m-%dT%H:%M:%S",
55+
"%Y-%m-%d %H:%M:%S.%f",
56+
"%Y-%m-%d %H:%M:%S",
57+
"%Y-%m-%d",
58+
)
59+
60+
# Anchored to ``[Pp]\d`` so a normal word containing ``p`` (e.g. ``"Apr"``)
61+
# doesn't get mis-classified as an ISO 8601 duration; the optional ``T``
62+
# admits time-only forms like ``PT36H``.
63+
_DURATION_RE = re.compile(r"^[Pp]T?\d")
64+
65+
# OGC API parameters that carry a date/datetime value (single string,
66+
# two-element range, or interval/duration string) rather than a multi-value
67+
# string list. Used by ``_construct_api_requests`` to keep them out of the
68+
# POST/CQL2 multi-value path and to route them through ``_format_api_dates``,
69+
# and by ``_NO_NORMALIZE_PARAMS`` to bypass string-iterable normalization.
70+
_DATE_RANGE_PARAMS = frozenset(
71+
{"datetime", "last_modified", "begin", "begin_utc", "end", "end_utc", "time"}
72+
)
73+
74+
# Services that don't support comma-separated values for multi-value GET
75+
# parameters and require POST with CQL2 JSON instead.
76+
_CQL2_REQUIRED_SERVICES = frozenset({"monitoring-locations"})
77+
78+
_MONITORING_LOCATION_ID_RE = re.compile(r"[^-\s]+-[^-\s]+")
79+
80+
81+
# Iterable-shaped params that ``_get_args`` must NOT push through
82+
# ``_normalize_str_iterable`` (scalar non-string knobs are caught by runtime
83+
# type, so only iterables with special handling need to be named here):
84+
# - date-range params may contain ``pd.NaT``/None or interval strings
85+
# - ``bbox``/``boundingBox`` are ``list[float]``, sometimes ``numpy.ndarray``
86+
# - ``get_peaks``'s int-valued filters (``water_year`` etc.) are ``list[int]``
87+
# - ``get_combined_metadata``'s ``thresholds`` is ``list[float]``
88+
_NO_NORMALIZE_PARAMS = _DATE_RANGE_PARAMS | {
89+
"bbox",
90+
"boundingBox",
91+
"water_year",
92+
"year",
93+
"month",
94+
"day",
95+
"peak_since",
96+
"thresholds",
97+
}

0 commit comments

Comments
 (0)