fix(waterdata): gate the chunked fan-out with a semaphore, not the connection pool by thodson-usgs · Pull Request #322 · DOI-USGS/dataretrieval-python

thodson-usgs · 2026-06-10T03:43:30Z

Problem

ChunkedCall._run dispatches every pending sub-request into one asyncio.gather and relies on the shared httpx.AsyncClient's connection pool as the only concurrency throttle (max_connections = API_USGS_CONCURRENT). That collides with the client's pool-acquire timeout (60 s, from HTTPX_DEFAULTS):

Time a sub-request spends queued waiting for a pooled connection counts against that timeout, and a queued waiter's clock only resets when some response completes and httpcore reassigns the freed connection. So whenever every pooled connection stays busy past the timeout with no completion in between — a batch of large, slowly-streaming pages is enough — the queued tail of the fan-out fails with httpx.PoolTimeout. Being a TransportError it burns the per-sub-request retry budget and ultimately surfaces as a bogus resumable ServiceInterrupted telling the user to wait for the upstream service to recover — for a failure the server never saw.
Pool queueing also thunders: httpcore assigns a freed connection to every queued waiter, which all wake and race for it, the losers re-queuing — O(pending) wakeups per page completion under a large fan-out.

Fix

Gate each fetch attempt with an asyncio.Semaphore sized from API_USGS_CONCURRENT; the connection pool is now merely sized to match so in-flight sub-requests reuse keepalive connections. Parked sub-requests wait on the semaphore before they touch the pool — no transport clock runs while they wait, semaphore releases wake one waiter each, and the pool timeout reverts to its protective role (a genuinely wedged checkout). The slot is acquired per attempt (inside the retry driver), so a sub-request sleeping off a retry backoff doesn't hold one. unbounded degenerates to a semaphore sized at the plan total, so there is a single gated code path.

Observable behavior is otherwise unchanged: same plan, same sub-request order, same resume semantics.

Why not just disable the pool timeout?

Setting pool=None on the client would suppress the spurious failure but lose the stuck-checkout protection, keep the thundering-herd churn, and leave dispatch breadth-first in one FIFO. The semaphore removes the failure mode, keeps the timeout meaningful, and bounds in-flight work explicitly.

Tests

test_fan_out_in_flight_high_water_mark_is_the_cap (parametrized capped/unbounded) — the fetch-level concurrency equals the cap, not the plan total; the capped case fails on the pre-fix code.
test_fan_out_outlives_pool_timeout_on_real_transport — mock transports bypass the pool, so this drives the chunker's shared client against a slow localhost server past a scaled-down pool timeout; it reproduces the spurious resumable ServiceInterrupted on the pre-fix code and completes on this branch.

Verified end-to-end against the live USGS API: with the pre-fix code 6/8 runs interrupted (in-flight peak 8, pool the only throttle); with the fix 0/8 (in-flight peak 2, semaphore gating). Full suite green, ruff and mypy --strict dataretrieval/ clean.

🤖 Generated with Claude Code

ChunkedCall._run dispatched every pending sub-request into one asyncio.gather and relied on the shared httpx.AsyncClient connection pool as the only concurrency throttle (max_connections sized from API_USGS_CONCURRENT). That collides with the client's pool-acquire timeout (60 s, from HTTPX_DEFAULTS): - Time a sub-request spends queued waiting for a pooled connection counts against that timeout, and a queued waiter's clock only resets when some response completes and httpcore reassigns the freed connection. So whenever every pooled connection stays busy past the timeout with no completion in between (a batch of large, slowly-streaming pages), the queued tail of the fan-out fails with httpx.PoolTimeout. Being a TransportError it burns the per-sub-request retry budget and ultimately surfaces as a bogus *resumable* ServiceInterrupted telling the user to wait for the upstream to recover — for a failure the server never saw. - Pool queueing also thunders: httpcore assigns a freed connection to every queued waiter, which all wake and race for it, the losers re-queuing — O(pending) wakeups per page completion under a large fan-out. Gate each fetch attempt with an asyncio.Semaphore sized from API_USGS_CONCURRENT instead; the connection pool is now merely sized to match so in-flight sub-requests reuse keepalive connections. Parked sub-requests wait on the semaphore before they touch the pool, so no transport clock runs while they wait and the pool timeout reverts to its protective role (a genuinely wedged checkout). The slot is acquired per attempt, so a sub-request sleeping off a retry backoff doesn't hold one. "unbounded" degenerates to a semaphore sized at the plan total, so there is a single gated code path. Observable behavior is otherwise unchanged: same plan, same sub-request order, same resume semantics. Tests: - in-flight high-water-mark probe (parametrized capped/unbounded) — the fetch-level concurrency equals the cap, not the plan total; the capped case fails on the pre-fix code. - real-localhost-server end-to-end test — mock transports bypass the pool, so this drives the chunker's shared client against a slow server past a scaled-down pool timeout; reproduces the spurious resumable ServiceInterrupted on the pre-fix code and completes on this branch. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

thodson-usgs force-pushed the fix/chunker-pool-timeout branch from f4ecc31 to 1c676b8 Compare June 10, 2026 14:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(waterdata): gate the chunked fan-out with a semaphore, not the connection pool#322

fix(waterdata): gate the chunked fan-out with a semaphore, not the connection pool#322
thodson-usgs wants to merge 1 commit into
DOI-USGS:mainfrom
thodson-usgs:fix/chunker-pool-timeout

thodson-usgs commented Jun 10, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

thodson-usgs commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix

Why not just disable the pool timeout?

Tests

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

thodson-usgs commented Jun 10, 2026 •

edited

Loading