fix(infra): replace httpx with aiohttp + requests for long-running asyncio stability #17

Open
GrigoryEvko wants to merge 6 commits into FusionBrainLab:main from GrigoryEvko:fix/aiohttp-network-layer

Conversation

@GrigoryEvko GrigoryEvko commented May 15, 2026

Closes #9.

Replaces direct httpx usage in this codebase with a two-stack network layer: aiohttp under the openai SDK for async paths, requests + urllib3 for sync paths. Drops import httpx from gigaevo/, problems/, and tests/ — httpx remains a transitive dependency through openai / langchain-openai / langfuse / litellm but is no longer a surface this codebase directly maintains. The migration targets the failure mode the issue documents: long-running asyncio processes accumulating connections that httpcore's async semaphore never releases, producing silent PoolTimeout cascades on the timescale of hours.

What landed

New gigaevo/infra/ modules

  • aiohttp_factory.py — make_aiohttp_session(role, ...) returns aiohttp.ClientSession; make_openai_http_client(role, *, proxy, **overrides) returns openai.DefaultAioHttpClient, the aiohttp-backed http_client adapter shipped in the openai[aiohttp] extra. A _KeepaliveConnector subclasses TCPConnector and applies DEFAULT_SOCKET_OPTIONS to each transport after _wrap_create_connection (aiohttp doesn't expose socket_options directly).
  • requests_factory.py — make_requests_session(role, ...) returns a requests.Session with _KeepaliveHTTPAdapter mounted on both http:// and https://. The adapter forwards socket_options and ssl_context through init_poolmanager to urllib3.PoolManager so every minted connection inherits the keepalive settings. _TimeoutSession injects a default per-request timeout split into (connect, read) because requests has no session-level timeout knob.
  • _net.py — shared DEFAULT_SOCKET_OPTIONS (TCP_NODELAY + SO_KEEPALIVE + per-OS probe timing; Linux drops dead-peer detection from net.ipv4.tcp_keepalive_time=7200 to ~120 seconds via TCP_KEEPIDLE=60s / KEEPINTVL=10s / KEEPCNT=6) and build_tls_context() (TLS 1.2+ minimum, hostname + CA verification on).
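
For illustration, the _net.py primitives in the last bullet boil down to roughly the following (a minimal sketch; the constant values follow the description above, the helper bodies are assumptions rather than the module's actual code):

    import socket
    import ssl
    import sys

    # TCP_NODELAY preserves urllib3's default; SO_KEEPALIVE plus the Linux
    # probe timing detects a dead peer in ~IDLE + INTVL * CNT = 60 + 10*6 = 120s
    # instead of the kernel's two-hour net.ipv4.tcp_keepalive_time default.
    DEFAULT_SOCKET_OPTIONS = [
        (socket.IPPROTO_TCP, socket.TCP_NODELAY, 1),
        (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),
    ]
    if sys.platform.startswith("linux"):
        DEFAULT_SOCKET_OPTIONS += [
            (socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60),
            (socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10),
            (socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 6),
        ]

    def build_tls_context() -> ssl.SSLContext:
        """TLS 1.2+ minimum with hostname checking and CA verification on."""
        ctx = ssl.create_default_context()  # CERT_REQUIRED, check_hostname=True
        ctx.minimum_version = ssl.TLSVersion.TLSv1_2
        return ctx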

Defaults targeting the bug profile

  • Bounded ClientTimeout(total=300s, connect=15s, sock_read=300s, sock_connect=15s) — no None pool waits, LLM-friendly read window.
  • keepalive_timeout=20s — shorter than common LB idle thresholds (AWS ALB 60s, NGINX 75s, CloudFlare 90s) so connections age out before the LB closes them asymmetrically.
  • limit_per_host=100 (was effectively unbounded on the old direct-httpx usage).
  • urllib3 Retry(total=5, connect=5, read=5, status=5, backoff_factor=0.5, backoff_jitter=0.5, backoff_max=30s). POST is excluded from allowed_methods (urllib3's safe default) — retrying a successfully-processed POST creates a duplicate resource. status_forcelist covers 408 / 425 / 429 / 5xx; respect_retry_after_header=True.
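
The retry policy in the last bullet maps onto urllib3 2.x keyword arguments roughly like this (a sketch, not the factory's literal builder):

    from urllib3.util.retry import Retry

    retry = Retry(
        total=5, connect=5, read=5, status=5,
        backoff_factor=0.5,
        backoff_jitter=0.5,          # urllib3 >= 2.0; de-synchronizes retry storms
        backoff_max=30.0,            # caps the worst-case sleep
        status_forcelist=(408, 425, 429, 500, 502, 503, 504),
        allowed_methods=Retry.DEFAULT_ALLOWED_METHODS,  # urllib3 default; no POST
        respect_retry_after_header=True,
    )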

Call site migration

  • problems/prompts/client.py, problems/chains/client.py — the LLM clients now construct AsyncOpenAI with http_client=make_openai_http_client(...) (sketched after this list). The problems/chains/client.py timeout=None (which let the pool wait indefinitely — the silent-forever-hang from the issue) is removed; the factory provides bounded defaults.
  • gigaevo/memory/shared_memory/concept_api.py — _ConceptApiClient switches from sync httpx.Client to a requests.Session from the factory. Sync httpx isn't subject to the documented async-pool bug, but removing it consolidates the codebase's network footprint.
  • gigaevo/infra/balanced_chat.py — BalancedChatOpenAI gains an explicit http_async_client kwarg, forwarded to every per-endpoint ChatOpenAI via kwargs.setdefault. Default None preserves prior behavior; callers opt in for shared aiohttp pool semantics across all endpoints.
  • gigaevo/memory/ideas_tracker/components/fabrics/llm_clients_fabric.py — the AsyncOpenAI client now passes http_client=make_openai_http_client("ideas_tracker_async"). The sibling sync OpenAI is left on the openai SDK default.
  • gigaevo/llm/models.py — MultiModelRouter._verify_models swaps urllib.request.urlopen for a make_requests_session("model_verify", timeout=(5.0, 10.0)) session. The probe now carries the same retry / keepalive / TLS settings as the rest of the sync layer.
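
The client construction from the first bullet looks roughly like this (values are placeholders; only the factory name and its role/proxy parameters come from this PR):

    from openai import AsyncOpenAI

    from gigaevo.infra.aiohttp_factory import make_openai_http_client

    client = AsyncOpenAI(
        api_key="...",                      # placeholder
        base_url="http://llm-endpoint/v1",  # placeholder
        # DefaultAioHttpClient with the bounded timeouts and keepalive defaults
        # above; replaces the old ad-hoc httpx.AsyncClient(timeout=None, ...).
        http_client=make_openai_http_client("llm_prompts", proxy=None),
    )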

Tests

  • tests/infra/test_aiohttp_factory.py (24 tests) — connector defaults, enable_cleanup_closed passed regardless of platform, TLS context pins TLSv1.2 + check_hostname + CERT_REQUIRED, socket options shape on Linux, factory delegation to DefaultAioHttpClient.
  • tests/infra/test_requests_factory.py (22 tests) — POST excluded from retry allowed_methods, status_forcelist contains 408/429/503, _TimeoutSession.request forwards extra kwargs verbatim, adapter pushes socket_options and ssl_context to PoolManager, default-timeout tuple normalization.
  • tests/infra/test_balanced_chat.py — new TestHttpAsyncClientForwarding covers explicit-kwarg propagation to every endpoint and the no-kwarg-no-injection default.
  • tests/memory/_fake_http.py — replaces httpx.MockTransport with a requests.Session.request patch keeping the previous handler interface (request.method, request.url, request.content) so the existing memory test handlers continue working unchanged.

Dependency changes

  • openai>=1.0.0 → openai[aiohttp]>=2.0.0. DefaultAioHttpClient requires the aiohttp extra in openai 2.x.
  • aiohttp>=3.10 and requests>=2.31 added explicitly — both are now first-party imports in this codebase.
  • Direct httpx>=0.27.0 dropped. Stays available transitively.

Deliberately not in this PR

  • Per-library transport patchers for langfuse / langsmith / litellm internals. Those libraries use their own httpx clients for telemetry; monkey-patching them carries upgrade risk and they aren't the hot path the issue documents. A separate effort if needed.
  • problems/dashboard/validate.py and gigaevo/memory/openai_inference.py — both use sync OpenAI. Sync httpx isn't the documented bug surface and the openai SDK's sync interface is built around httpx.Client; routing those through anything else requires architectural changes out of scope.
  • Replacing sync urllib.request usage outside the LLM router (gigaevo/utils/..., tools/...). Those are admin/CLI paths, not the hot path.
  • Hydra config updates to start passing http_async_client to BalancedChatOpenAI. The kwarg is added; wiring callers is a follow-up.

New gigaevo/infra/aiohttp_factory.py exposes two factories:

- make_aiohttp_session(role, ...) returns aiohttp.ClientSession with a
  hardened TCPConnector (limit=300, limit_per_host=50,
  keepalive_timeout=30s, enable_cleanup_closed=True) and a bounded
  ClientTimeout (total=120s, connect=10s, sock_read=120s,
  sock_connect=10s). No component defaults to None — the
  silent-forever-hang on the pool timeout is structurally unreachable.

- make_openai_http_client(role, ...) returns
  openai.DefaultAioHttpClient, the aiohttp-backed http_client adapter
  shipped in the openai[aiohttp] extra. Surfaces a clear ImportError
  with install instructions when the extra is missing.

The ``role`` argument is informational (memory_api, llm_prompts, ...).
Currently the factories only record it; it is reserved for per-role override hooks.

Build helpers (build_connector / build_timeout) are exported for callers
that need to customize one knob while keeping the rest of the defaults.
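
Hypothetical usage, assuming the session factory accepts connector and timeout overrides (the exact keyword names are not pinned down by this description):

    from gigaevo.infra.aiohttp_factory import (
        build_connector,
        build_timeout,
        make_aiohttp_session,
    )

    # Keep every hardened default except the per-host connection cap.
    connector = build_connector(limit_per_host=200)
    timeout = build_timeout()  # bounded defaults, no None components
    session = make_aiohttp_session("memory_api", connector=connector, timeout=timeout)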

14 unit tests covering: hardened defaults, override hatches, trust_env
behavior, proxy forwarding through to DefaultAioHttpClient, and
ImportError messaging.
problems/prompts/client.py was constructing a bare httpx.AsyncClient(proxy=...)
with no limits, no keepalive_expiry, and the openai-default 5s total timeout.

problems/chains/client.py was constructing httpx.AsyncClient(timeout=httpx.Timeout(timeout=None, connect=30.0)) —
the timeout=None covers read/write/pool, which means the pool can wait
indefinitely for a connection slot. That is the silent forever-hang failure
shape reported in upstream encode/httpx discussion threads about long-running
asyncio processes accumulating leaked connections from cancelled tasks.

Both modules now build their http_client via the centralized aiohttp factory
(make_openai_http_client). The factory returns openai.DefaultAioHttpClient
which speaks the httpx-compatible interface the openai SDK expects but
routes traffic through aiohttp's connection pool — pool semantics that
survive multi-hour asyncio loads. Bounded ClientTimeout (total=120s,
connect=10s, sock_read=120s, sock_connect=10s) plus
TCPConnector(limit=300, limit_per_host=50, keepalive_timeout=30,
enable_cleanup_closed=True) replace the previous ad-hoc configuration.

Tenacity retry config is unchanged — it has no exception-type filter, so
aiohttp.ClientError and friends are caught on the same retry path that
previously caught httpx errors.
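
To illustrate the point (decorator values here are illustrative, not the project's actual retry configuration): tenacity's default retry predicate matches any Exception, so swapping the transport needs no change to it.

    from tenacity import retry, stop_after_attempt, wait_exponential

    # No retry=... exception filter: aiohttp.ClientError is retried on the
    # same path that previously caught httpx errors.
    @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, max=10))
    async def call_llm(client, **kwargs):
        return await client.chat.completions.create(**kwargs)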
…3 pool

Removes the last direct httpx import from gigaevo/. The async-pool
deadlock documented in issue FusionBrainLab#9 is specific to httpx.AsyncClient over
httpcore; sync httpx.Client isn't the bug surface, but eliminating the
import keeps the codebase's network footprint consistent — aiohttp for
async paths, requests + urllib3 for sync paths.

New gigaevo/infra/requests_factory.py:
- make_requests_session(role, *, timeout, pool_connections, pool_maxsize,
  retry) returns a requests.Session backed by urllib3's connection pool.
- _TimeoutSession subclass injects a default per-request timeout because
  requests has no session-level timeout knob; callers can still override
  per-call.
- build_retry() exposes the urllib3.Retry policy with conservative
  defaults (total=3, backoff_factor=0.5, retry on 5xx, allow retry on
  HEAD/GET/OPTIONS/PUT/DELETE — POST excluded so a successfully-
  processed POST isn't duplicated by retry, raise_on_status=False so
  the client formats its own error message).
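
One plausible shape for the _TimeoutSession injection described above (the default tuple and attribute names are illustrative):

    import requests

    class _TimeoutSession(requests.Session):
        def __init__(self, default_timeout=(15.0, 60.0)):
            super().__init__()
            self._default_timeout = default_timeout  # (connect, read)

        def request(self, method, url, **kwargs):
            # Inject the session default only when the caller passed no
            # per-call timeout; explicit per-call values still win.
            kwargs.setdefault("timeout", self._default_timeout)
            return super().request(method, url, **kwargs)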

_ConceptApiClient:
- Uses make_requests_session("memory_api", timeout=...) instead of
  httpx.Client. ApiSync stays sync.
- requests.exceptions.ConnectionError / requests.exceptions.Timeout
  replace the previous httpx.ConnectError / httpx.TimeoutException
  branches with the same MemoryStorageError message text.

Tests:
- tests/memory/_fake_http.py replaces the httpx.MockTransport pattern:
  make_mocked_client(handler) builds a _ConceptApiClient whose
  _http.request is patched to dispatch into the test handler. The
  handler receives a small _FakeRequest with the same attribute surface
  (method / url / content / headers) the previous httpx-based handlers
  expected, so per-test handler code keeps working without rewriting.
- make_fake_response(status_code, *, json, text) builds a
  requests.Response with httpx.Response-compatible constructor shape.
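
A plausible sketch of make_fake_response (the real helper may differ in detail):

    import json as jsonlib

    import requests

    def make_fake_response(status_code, *, json=None, text=None):
        resp = requests.Response()
        resp.status_code = status_code
        body = jsonlib.dumps(json) if json is not None else (text or "")
        resp._content = body.encode("utf-8")
        resp.headers["Content-Type"] = (
            "application/json" if json is not None else "text/plain"
        )
        return resp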

Library footprint after this commit:
- aiohttp for async HTTP paths (via openai.DefaultAioHttpClient,
  introduced earlier in this branch)
- requests + urllib3 for sync HTTP paths (this commit)
- No `import httpx` remains in gigaevo/, problems/, or tests/.
  httpx stays as a transitive dependency through openai / langchain-
  openai / langfuse / litellm; it's gone from our own modules.
…acks

Adds gigaevo/infra/_net.py — single source of truth for the parts the
sync (requests) and async (aiohttp) factories must agree on:

- DEFAULT_SOCKET_OPTIONS: SO_KEEPALIVE plus per-OS probe timing
  (TCP_KEEPIDLE/INTVL/CNT on Linux; TCP_KEEPALIVE on macOS), shifted
  from the kernel default of net.ipv4.tcp_keepalive_time=7200 (two
  hours, useless for HTTP) down to IDLE=60s / INTVL=10s / CNT=6 — a
  dead peer is detected in ~120s rather than 2h. TCP_NODELAY=1 stays
  in the list so urllib3's default isn't lost when callers compose
  their own option list. apply_socket_options() applies each option
  best-effort and swallows individual OSErrors so kernel rejection of
  one tuning option never fails an otherwise-good connection.
- build_tls_context(): ssl.SSLContext pinned to TLS 1.2+ with
  hostname checking and CA verification on. Both libraries already
  default to similar contexts; this makes the properties explicit so
  they survive library upgrades and are visible in our code instead of
  implicit in ssl.create_default_context.
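
The best-effort application described above is roughly (a sketch; the real helper's signature may differ):

    import socket

    def apply_socket_options(sock: socket.socket, options) -> None:
        # Each option is applied independently; a kernel rejecting one tuning
        # knob must never fail an otherwise-good connection.
        for level, optname, value in options:
            try:
                sock.setsockopt(level, optname, value)
            except OSError:
                continue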

aiohttp_factory.py:
- _KeepaliveConnector subclasses aiohttp.TCPConnector. aiohttp doesn't
  expose socket_options on the connector — the only way to set per-
  connection SO_KEEPALIVE is to intercept transport creation. The
  subclass overrides _wrap_create_connection and applies the options
  to the underlying socket from transport.get_extra_info('socket')
  after the connection completes.
- Default keepalive_timeout=20s (was 30s) — short enough to age out
  before any common load balancer's idle timeout (AWS ALB 60s,
  NGINX 75s, CloudFlare 90s) closes us asymmetrically.
- Default limit_per_host=100 (was 50) — LLM endpoints saturate.
- Default ClientTimeout total=300s, sock_read=300s, connect=15s — LLM
  generations occasionally take several minutes; the previous 120s
  defaults were too tight.
- Default ssl_context = build_tls_context().
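
A minimal sketch of the transport interception from the first bullet of the aiohttp_factory.py list; _wrap_create_connection is private aiohttp API, so the real subclass presumably guards against signature drift across aiohttp versions:

    import aiohttp

    from gigaevo.infra._net import DEFAULT_SOCKET_OPTIONS, apply_socket_options

    class _KeepaliveConnector(aiohttp.TCPConnector):
        async def _wrap_create_connection(self, *args, **kwargs):
            # asyncio's create_connection returns (transport, protocol).
            transport, protocol = await super()._wrap_create_connection(*args, **kwargs)
            sock = transport.get_extra_info("socket")
            if sock is not None:
                # Best-effort keepalive tuning on the freshly minted socket.
                apply_socket_options(sock, DEFAULT_SOCKET_OPTIONS)
            return transport, protocol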

requests_factory.py:
- _KeepaliveHTTPAdapter subclasses HTTPAdapter to thread socket_options
  and ssl_context through init_poolmanager into the underlying
  PoolManager — required so the options reach every connection minted
  by the pool, not just the first.
- Default timeout split into (connect=15s, read=60s) tuple; a single
  float is normalized to (value, value) so legacy callsites continue
  working.
- Default Retry: separated connect / read / status counts (5 each),
  backoff_jitter=0.5 defeats synchronized retry storms,
  backoff_max=30s caps the worst-case sleep, status_forcelist now
  includes 408 and 425 along with 429 and 5xx. POST is excluded from
  allowed_methods (urllib3's safe default) — retrying a successfully-
  processed POST creates a duplicate resource.
- Default ssl_context = build_tls_context().
- Default pool_connections=20, pool_maxsize=100 (was 10, 50).
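
Sketch of the adapter from the first bullet of the requests_factory.py list; HTTPAdapter forwards extra pool kwargs straight into urllib3.PoolManager, which is what carries the options to every pooled connection:

    from requests.adapters import HTTPAdapter

    class _KeepaliveHTTPAdapter(HTTPAdapter):
        def __init__(self, *, socket_options, ssl_context, **kwargs):
            # Set before super().__init__(), which calls init_poolmanager().
            self._socket_options = socket_options
            self._ssl_context = ssl_context
            super().__init__(**kwargs)

        def init_poolmanager(self, connections, maxsize, block=False, **pool_kwargs):
            # Every connection the PoolManager mints inherits these options,
            # not just the first one the adapter opens.
            pool_kwargs["socket_options"] = self._socket_options
            pool_kwargs["ssl_context"] = self._ssl_context
            super().init_poolmanager(connections, maxsize, block=block, **pool_kwargs)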

Test coverage:
- DEFAULT_SOCKET_OPTIONS contains SO_KEEPALIVE + TCP_NODELAY on every
  platform; TCP_KEEPIDLE on Linux.
- apply_socket_options swallows individual setsockopt failures.
- ssl context default pins TLSv1_2 + check_hostname + CERT_REQUIRED.
- _KeepaliveHTTPAdapter pushes socket_options and ssl_context through
  to the PoolManager's connection_pool_kw.
- _KeepaliveConnector is the concrete TCPConnector subclass returned
  by build_connector(); enable_cleanup_closed and ttl_dns_cache are
  forwarded from build_connector kwargs (asserted via construction
  spy because aiohttp 3.13 doesn't expose ttl_dns_cache publicly).
- Retry excludes POST; status_forcelist contains 408/429/503.
- _TimeoutSession.request forwards extra kwargs verbatim.

168/168 tests pass across tests/infra/ + tests/memory/. ruff clean.
…sites

Three call sites in the LLM stack constructed openai SDK clients with no
http_client kwarg, leaving each instance to mint its own internal
connection pool. After this commit they can opt into the aiohttp-backed
factory built earlier in the branch:

- BalancedChatOpenAI gains an explicit ``http_async_client`` constructor
  kwarg. When provided, it's forwarded to every per-endpoint ChatOpenAI
  via kwargs.setdefault, so all N endpoints share a single underlying
  connection pool. Default ``None`` preserves the historical behavior of
  letting each ChatOpenAI mint its own client — opt-in, not enforced.

- The ideas_tracker AsyncOpenAI client now passes
  http_client=make_openai_http_client("ideas_tracker_async") at
  construction. The sibling sync OpenAI client is intentionally left on
  the openai SDK default (sync httpx) — sync httpx is not subject to
  the documented async-pool deadlock and the openai SDK's sync interface
  is built around httpx.Client.

- MultiModelRouter._verify_models was using urllib.request.urlopen
  directly with no retry/keepalive/TLS configuration. Swapped to a
  session minted by make_requests_session("model_verify",
  timeout=(5.0, 10.0)). The probe now inherits the same urllib3 retry
  policy and TCP keepalive socket options as the rest of the sync HTTP
  layer.
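
The forwarding pattern from the first bullet, reduced to its essence (the helper name and endpoint iteration are illustrative, not the real BalancedChatOpenAI internals; api_key and other ChatOpenAI settings are assumed to arrive via chat_kwargs):

    from langchain_openai import ChatOpenAI

    def _endpoint_clients(endpoints, http_async_client=None, **chat_kwargs):
        if http_async_client is not None:
            # setdefault: all N endpoints share one aiohttp-backed pool, while
            # an explicit per-endpoint value supplied by the caller still wins.
            chat_kwargs.setdefault("http_async_client", http_async_client)
        return [ChatOpenAI(base_url=url, **chat_kwargs) for url in endpoints]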

Not touched in this commit:
- problems/dashboard/validate.py — single ChatOpenAI driven via the
  sync .invoke() path, runs once per dashboard score, not on a hot
  loop. Sync httpx default is appropriate here.
- gigaevo/memory/openai_inference.py — sync OpenAI() only, no async
  hot path.

Test coverage adds ``TestHttpAsyncClientForwarding`` to
test_balanced_chat.py — verifies an explicit http_async_client reaches
every per-endpoint ChatOpenAI, and the default ``None`` doesn't inject.
…ohttp + requests

Reflects the actual runtime footprint after the network layer rewrite:

- openai>=1.0.0 → openai[aiohttp]>=2.0.0. DefaultAioHttpClient lives in
  openai 2.x's aiohttp extra (which transitively pulls aiohttp and
  httpx-aiohttp). 2.x is required for the async migration to function.

- httpx>=0.27.0 is dropped from direct dependencies. No `import httpx`
  remains in gigaevo/, problems/, or tests/; httpx stays available as
  a transitive dep through openai / langchain-openai / langfuse /
  litellm but is no longer something this codebase imports directly.

- aiohttp>=3.10 added explicitly. It arrives transitively via the
  openai aiohttp extra, but the codebase now imports aiohttp directly in
  gigaevo/infra/aiohttp_factory.py (the OpenAI call sites reach it through
  the SDK's DefaultAioHttpClient wrapping). Pinning a floor makes the
  dependency graph honest.

- requests>=2.31 added explicitly. _ConceptApiClient (the Memory API
  client) is now requests-based via gigaevo/infra/requests_factory.py;
  this codifies it as a direct dependency rather than relying on
  transitive presence through assorted other packages.
Development

Successfully merging this pull request may close these issues.

Stability of async httpx during long-running operation (Стабильность асинхронного httpx при длительной работе)
