fix(waterdata): gate the chunked fan-out with a semaphore, not the pool (#322)
ChunkedCall._run dispatched every pending sub-request into one
asyncio.gather and relied on the shared httpx.AsyncClient connection
pool as the only concurrency throttle (max_connections sized from
API_USGS_CONCURRENT). That collides with the client's pool-acquire
timeout (60 s, from HTTPX_DEFAULTS): a sub-request that can't get a
connection waits in httpx's pool queue, and that wait is bounded by the
pool-acquire timeout. So whenever every pooled connection stays busy
past that window with none freeing — a batch of large, slowly-streaming
pages is enough — the still-queued tail of the fan-out times out with
httpx.PoolTimeout. Because PoolTimeout is a TransportError, it burns the
per-sub-request retry budget and ultimately surfaces as a bogus
*resumable* ServiceInterrupted, telling the user to wait for an upstream
that never saw the request.
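
The failure mode can be reproduced in a toy model. The sketch below is not httpx and not the chunker's code: a plain asyncio.Semaphore stands in for the connection pool, asyncio.wait_for stands in for the pool-acquire timeout, and the names (sub_request, ungated_fan_out, demo) and the scaled-down timings are all assumptions chosen for illustration.

```python
import asyncio


async def sub_request(pool, acquire_timeout, hold):
    # Toy stand-in for a pooled HTTP call: wait for a "connection",
    # but give up once the acquire timeout elapses.
    await asyncio.wait_for(pool.acquire(), timeout=acquire_timeout)
    try:
        # A slow response keeps the connection busy past the timeout.
        await asyncio.sleep(hold)
        return "ok"
    finally:
        pool.release()


async def ungated_fan_out(total, pool_size, acquire_timeout, hold):
    # Every sub-request is dispatched at once; the pool is the only throttle.
    pool = asyncio.Semaphore(pool_size)
    results = await asyncio.gather(
        *(sub_request(pool, acquire_timeout, hold) for _ in range(total)),
        return_exceptions=True,
    )
    timed_out = sum(isinstance(r, asyncio.TimeoutError) for r in results)
    return len(results) - timed_out, timed_out


def demo():
    # Hypothetical numbers: 6 sub-requests, 2 connections, and responses
    # that take four times longer than the acquire timeout.
    return asyncio.run(
        ungated_fan_out(total=6, pool_size=2, acquire_timeout=0.05, hold=0.2)
    )
```

With these numbers the two sub-requests that get a connection should complete, while the four queued behind them time out before any connection frees. None of the four ever reached the "server".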
Gate each fetch attempt with an asyncio.Semaphore sized from
API_USGS_CONCURRENT instead; the connection pool is now merely sized to
match so in-flight sub-requests reuse keepalive connections. Parked
sub-requests wait on the semaphore before they touch the pool, so no
transport clock runs while they wait and the pool timeout reverts to
its protective role (a genuinely wedged checkout). The slot is acquired
per attempt, so a sub-request sleeping off a retry backoff doesn't hold
one. "unbounded" degenerates to a semaphore sized at the plan total, so
there is a single gated code path. Observable behavior is otherwise
unchanged: same plan, same sub-request order, same resume semantics.
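
A minimal sketch of the gating described above, under stated assumptions: fetch_with_retries, run_plan, TransientError, and the retry and backoff parameters are hypothetical names for illustration, not the actual ChunkedCall code, and the fetch callable stands in for the real HTTP call.

```python
import asyncio


class TransientError(Exception):
    """Hypothetical stand-in for a retryable transport failure."""


async def fetch_with_retries(fetch, item, gate, retries=3, backoff=0.01):
    """Run fetch(item), holding a gate slot only while an attempt is in flight."""
    for attempt in range(retries + 1):
        try:
            async with gate:  # slot is acquired per attempt
                return await fetch(item)
        except TransientError:
            if attempt == retries:
                raise
        # The slot has been released by now, so this backoff sleep
        # does not keep other sub-requests parked.
        await asyncio.sleep(backoff * 2 ** attempt)


async def run_plan(fetch, items, concurrency):
    # "unbounded" (modelled as None here) degenerates to a gate as wide
    # as the plan, so there is one gated code path.
    limit = len(items) if concurrency is None else concurrency
    gate = asyncio.Semaphore(max(1, limit))
    # gather returns results in the order the sub-requests were listed.
    return await asyncio.gather(
        *(fetch_with_retries(fetch, item, gate) for item in items)
    )
```

Parked sub-requests in this sketch wait on the semaphore rather than inside a transport, which is the property the change relies on: no timeout clock runs until a slot is held.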
Tests:
- in-flight high-water-mark probe (parametrized capped/unbounded) — the
fetch-level concurrency equals the cap, not the plan total; the capped
case fails on the pre-fix code.
- real-localhost-server end-to-end test — mock transports bypass the
pool, so this drives the chunker's shared client against a slow server
past a scaled-down pool timeout; reproduces the spurious resumable
ServiceInterrupted on the pre-fix code and completes on this branch.
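
The high-water-mark probe could look roughly like the sketch below. It is an illustration under assumptions, not the test in this commit: InFlightProbe, gated_fan_out, and measure_peak are hypothetical names, and gated_fan_out is a small stand-in for the code under test.

```python
import asyncio


class InFlightProbe:
    """Fake fetch that counts concurrent entries and records the peak."""

    def __init__(self, hold=0.01):
        self.current = 0
        self.peak = 0
        self.hold = hold

    async def fetch(self, item):
        self.current += 1
        self.peak = max(self.peak, self.current)
        try:
            # Stay "in flight" long enough for other fetches to overlap.
            await asyncio.sleep(self.hold)
            return item
        finally:
            self.current -= 1


async def gated_fan_out(fetch, items, cap):
    # Hypothetical stand-in for the chunker: cap=None models "unbounded".
    gate = asyncio.Semaphore(cap if cap is not None else max(1, len(items)))

    async def one(item):
        async with gate:
            return await fetch(item)

    return await asyncio.gather(*(one(item) for item in items))


def measure_peak(cap, total=12):
    """Return the highest number of fetches observed in flight at once."""
    probe = InFlightProbe()
    results = asyncio.run(gated_fan_out(probe.fetch, list(range(total)), cap))
    assert results == list(range(total))  # order is preserved
    return probe.peak
```

The probe sits at the fetch level on purpose: it measures how many sub-requests are actually running, which is what an ungated asyncio.gather would push up to the plan total regardless of the pool size.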
Also raise the default API_USGS_CONCURRENT from 16 to 32 and correct
the concurrency rationale: N caps how many of a chunked query's
sub-requests are in flight at once (a client-side connection/latency
knob), but does not affect the API rate limit -- a chunked call issues
the same number of sub-requests regardless of N. Live testing showed
the API serves 300 simultaneous requests without 5xx; heavy use is
rate-limited by request volume (HTTP 429), mitigated by an
API_USGS_PAT token.
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>