You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
feat(waterdata): Migrate to httpx and add async parallel chunker
Replaces ``requests`` with ``httpx`` package-wide and adds an async
parallel branch to the multi-value chunker, governed by the new
``API_USGS_CONCURRENT`` environment variable. Parallel mode is the
default; set ``API_USGS_CONCURRENT=1`` to force the legacy sequential
path. Benchmarked on a 52,753-site / 10-state ``get_daily`` query:
12.69s parallel vs 67.33s sequential (5.3x speedup), identical row
counts, comparable quota burn.
Why httpx: the parallel fan-out runs on a single shared
``httpx.AsyncClient`` so all sub-requests amortize one TCP+TLS
handshake — impossible under the requests stack without a thread pool.
Both modes share one ``_walk_pages_steps`` generator (pagination state
machine) driven by thin sync/async loops; future retry/backoff lands
in one place via the ``("wait", seconds)`` yield pattern the drivers
can translate to ``time.sleep`` / ``await asyncio.sleep``.
Production
- New ``httpx`` dependency, dropped ``requests``; ``pytest-httpx``
replaces ``requests-mock`` in test extras.
- Three httpx behavior diffs handled defensively: ``InvalidURL``
(URLs > 64 KB rejected client-side) via ``_safe_request_bytes`` /
``_safe_canonical_url``; ``Response.elapsed`` only populated on close
via ``_safe_elapsed``; ``Response.url`` is a read-only property
wrapped via ``_set_response_url``.
- ``BaseMetadata.url`` coerced to ``str`` to preserve the string-typed
contract. ``BaseMetadata.header`` is now ``httpx.Headers`` (see
backwards-compat note below).
- Chunker decorator gains ``walk_pages_async=`` and honors
``API_USGS_CONCURRENT``. Silently falls back to sequential (with INFO
log) when ``walk_pages_async`` isn't wired or when args trip the
inner filter chunker — preserves workloads that worked before
parallel-by-default.
- One shared ``httpx.Client`` / ``AsyncClient`` per chunked call via
``ContextVar``. Eliminates per-sub-request TCP+TLS handshakes in
both modes.
- Bug fix surfaced by the migration: ``filters.py`` was calling
``len(probe.url)`` which fails on ``httpx.URL`` — fixed.
- Unified pagination state machine: ``_walk_pages``,
``_walk_pages_async``, and ``get_stats_data`` all route through
``_walk_pages_steps`` parameterized on ``page_data_fn`` and
``next_req_fn``. ~80 lines of duplicated pagination logic collapsed
into one place; same source of truth across two pagination styles
(OGC link-header and stats next-token).
Tests
- All 8 test files migrated from ``requests-mock`` to ``pytest-httpx``.
- ``tests/conftest.py`` (new) centralizes ``mock_request``,
``assert_url_equivalent``, ``assert_mock_header`` (previously
triplicated across three test files).
- New tests for parallel-mode contract, env-var parsing
(explicit-low / malformed / unset), client contextvar publishing,
and the ``_safe_*`` defensive helpers.
- 194 mocked tests pass; live-API tests unaffected by the migration
(one pre-existing USGS column-drift failure unrelated to this PR).
Backwards-compat
- ``BaseMetadata.header`` type: ``requests.structures.CaseInsensitiveDict``
→ ``httpx.Headers``. Both behave like case-insensitive dicts for
reads, but ``httpx.Headers`` carries auto-added entries (``host``,
``content-length``) so ``md.header == {"key": "val"}`` literal
equality breaks. Use ``md.header.get(...)``.
- ``API_USGS_CONCURRENT`` now controls a feature that defaults on.
Workloads that combined long multi-value lists with chunkable CQL
filters (rare) automatically fall back to sequential — no caller
changes needed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
0 commit comments