fix(infra): replace httpx with aiohttp + requests for long-running asyncio stability#17
Open
GrigoryEvko wants to merge 6 commits into
Conversation
New gigaevo/infra/aiohttp_factory.py exposes two factories:

- make_aiohttp_session(role, ...) returns aiohttp.ClientSession with a hardened TCPConnector (limit=300, limit_per_host=50, keepalive_timeout=30s, enable_cleanup_closed=True) and a bounded ClientTimeout (total=120s, connect=10s, sock_read=120s, sock_connect=10s). No component defaults to None — the silent-forever-hang on the pool timeout is structurally unreachable.
- make_openai_http_client(role, ...) returns openai.DefaultAioHttpClient, the aiohttp-backed http_client adapter shipped in the openai[aiohttp] extra. It surfaces a clear ImportError with install instructions when the extra is missing.

The ``role`` argument is informational (memory_api, llm_prompts, ...). It is currently unused beyond receipt; reserved for per-role override hooks. Build helpers (build_connector / build_timeout) are exported for callers that need to customize one knob while keeping the rest of the defaults.

14 unit tests cover: hardened defaults, override hatches, trust_env behavior, proxy forwarding through to DefaultAioHttpClient, and ImportError messaging.
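A minimal sketch of the two factory shapes described above, assuming the connector and timeout values from this commit; the function bodies, the bare `DefaultAioHttpClient(**overrides)` construction, and the exception handling around the missing extra are illustrative rather than the repository's actual code.

```python
# Sketch only: values mirror the commit text, bodies are illustrative.
import aiohttp


def make_aiohttp_session(role: str, **overrides) -> aiohttp.ClientSession:
    """Hardened aiohttp session; `role` is informational (memory_api, llm_prompts, ...)."""
    connector = aiohttp.TCPConnector(
        limit=300,                # total connections across all hosts
        limit_per_host=50,        # per-host cap
        keepalive_timeout=30,     # seconds an idle connection stays pooled
        enable_cleanup_closed=True,
    )
    timeout = aiohttp.ClientTimeout(total=120, connect=10, sock_read=120, sock_connect=10)
    # Intended to be called from async code so the session binds to the running loop.
    return aiohttp.ClientSession(connector=connector, timeout=timeout, **overrides)


def make_openai_http_client(role: str, **overrides):
    """aiohttp-backed http_client for the openai SDK; needs the openai[aiohttp] extra."""
    try:
        from openai import DefaultAioHttpClient
        return DefaultAioHttpClient(**overrides)
    except (ImportError, RuntimeError) as exc:  # exact exception depends on SDK version
        raise ImportError(
            "DefaultAioHttpClient needs the aiohttp extra: pip install 'openai[aiohttp]'"
        ) from exc
```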
problems/prompts/client.py was constructing a bare httpx.AsyncClient(proxy=...) with no limits, no keepalive_expiry, and the openai-default 5s total timeout. problems/chains/client.py was constructing httpx.AsyncClient(timeout=httpx.Timeout(timeout=None, connect=30.0)) — the timeout=None covers read/write/pool, which means the pool can wait indefinitely for a connection slot. That is the silent forever-hang failure shape reported in upstream encode/httpx discussion threads about long-running asyncio processes accumulating leaked connections from cancelled tasks.

Both modules now build their http_client via the centralized aiohttp factory (make_openai_http_client). The factory returns openai.DefaultAioHttpClient, which speaks the httpx-compatible interface the openai SDK expects but routes traffic through aiohttp's connection pool — pool semantics that survive multi-hour asyncio loads. Bounded ClientTimeout (total=120s, connect=10s, sock_read=120s, sock_connect=10s) plus TCPConnector(limit=300, limit_per_host=50, keepalive_timeout=30, enable_cleanup_closed=True) replace the previous ad-hoc configuration.

Tenacity retry config is unchanged — it has no exception-type filter, so aiohttp.ClientError and friends are caught on the same retry path that previously caught httpx errors.
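An illustrative call-site shape after the migration; `make_openai_http_client` and the role string come from this branch, while the endpoint URL, key, and surrounding construction are placeholders.

```python
from openai import AsyncOpenAI

from gigaevo.infra.aiohttp_factory import make_openai_http_client  # factory from this branch

# Bounded timeouts and pool limits now come from the shared factory instead of
# an ad-hoc httpx.AsyncClient per module.
client = AsyncOpenAI(
    base_url="https://llm.example.internal/v1",  # placeholder endpoint
    api_key="sk-...",                            # placeholder key
    http_client=make_openai_http_client("llm_prompts"),
)
```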
…3 pool

Removes the last direct httpx import from gigaevo/. The async-pool deadlock documented in issue FusionBrainLab#9 is specific to httpx.AsyncClient over httpcore; sync httpx.Client isn't the bug surface, but eliminating the import keeps the codebase's network footprint consistent — aiohttp for async paths, requests + urllib3 for sync paths.

New gigaevo/infra/requests_factory.py:
- make_requests_session(role, *, timeout, pool_connections, pool_maxsize, retry) returns a requests.Session backed by urllib3's connection pool.
- _TimeoutSession subclass injects a default per-request timeout because requests has no session-level timeout knob; callers can still override per call (see the sketch below).
- build_retry() exposes the urllib3.Retry policy with conservative defaults (total=3, backoff_factor=0.5, retry on 5xx, allow retry on HEAD/GET/OPTIONS/PUT/DELETE — POST excluded so a successfully-processed POST isn't duplicated by retry, raise_on_status=False so the client formats its own error message).

_ConceptApiClient:
- Uses make_requests_session("memory_api", timeout=...) instead of httpx.Client. ApiSync stays sync.
- requests.exceptions.ConnectionError / requests.exceptions.Timeout replace the previous httpx.ConnectError / httpx.TimeoutException branches with the same MemoryStorageError message text.

Tests:
- tests/memory/_fake_http.py replaces the httpx.MockTransport pattern: make_mocked_client(handler) builds a _ConceptApiClient whose _http.request is patched to dispatch into the test handler. The handler receives a small _FakeRequest with the same attribute surface (method / url / content / headers) the previous httpx-based handlers expected, so per-test handler code keeps working without rewriting.
- make_fake_response(status_code, *, json, text) builds a requests.Response with an httpx.Response-compatible constructor shape.

Library footprint after this commit:
- aiohttp for async HTTP paths (via openai.DefaultAioHttpClient, introduced earlier in this branch)
- requests + urllib3 for sync HTTP paths (this commit)
- No `import httpx` remains in gigaevo/, problems/, or tests/. httpx stays as a transitive dependency through openai / langchain-openai / langfuse / litellm; it's gone from our own modules.
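A hedged sketch of the `_TimeoutSession` idea referenced above: requests has no session-level timeout, so a `Session` subclass injects a default per request while still letting per-call overrides win. The class body and the default value are assumptions, not the repository's exact code.

```python
import requests


class _TimeoutSession(requests.Session):
    """Session that injects a default timeout; an explicit per-call timeout= still wins."""

    def __init__(self, default_timeout: float = 30.0):  # placeholder default value
        super().__init__()
        self._default_timeout = default_timeout

    def request(self, method, url, **kwargs):
        kwargs.setdefault("timeout", self._default_timeout)
        return super().request(method, url, **kwargs)


# session = _TimeoutSession()
# session.get("https://memory-api.internal/concepts")              # bounded by the default
# session.get("https://memory-api.internal/concepts", timeout=5)   # explicit override wins
```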
…acks
Adds gigaevo/infra/_net.py — single source of truth for the parts the
sync (requests) and async (aiohttp) factories must agree on:
- DEFAULT_SOCKET_OPTIONS: SO_KEEPALIVE plus per-OS probe timing
(TCP_KEEPIDLE/INTVL/CNT on Linux; TCP_KEEPALIVE on macOS), shifted
from the kernel default of net.ipv4.tcp_keepalive_time=7200 (two
hours, useless for HTTP) down to IDLE=60s / INTVL=10s / CNT=6 — a
dead peer is detected in ~120s rather than 2h. TCP_NODELAY=1 stays
in the list so urllib3's default isn't lost when callers compose
their own option list. apply_socket_options() applies each option
best-effort and swallows individual OSErrors so kernel rejection of
one tuning option never fails an otherwise-good connection.
- build_tls_context(): ssl.SSLContext pinned to TLS 1.2+ with
hostname checking and CA verification on. Both libraries already
default to similar contexts; this makes the properties explicit so
they survive library upgrades and are visible in our code instead of
implicit in ssl.create_default_context.
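A minimal sketch of what these _net.py helpers could look like with the values above (IDLE=60s / INTVL=10s / CNT=6, TLS 1.2 floor); constant availability is platform-dependent, so the option list is built behind hasattr guards. The exact module contents are assumed.

```python
import socket
import ssl

DEFAULT_SOCKET_OPTIONS = [
    (socket.IPPROTO_TCP, socket.TCP_NODELAY, 1),   # keep urllib3's default
    (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),
]
if hasattr(socket, "TCP_KEEPIDLE"):  # Linux probe timing: dead peer detected in ~120s
    DEFAULT_SOCKET_OPTIONS += [
        (socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60),
        (socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10),
        (socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 6),
    ]
elif hasattr(socket, "TCP_KEEPALIVE"):  # macOS equivalent of TCP_KEEPIDLE
    DEFAULT_SOCKET_OPTIONS += [(socket.IPPROTO_TCP, socket.TCP_KEEPALIVE, 60)]


def apply_socket_options(sock: socket.socket, options=DEFAULT_SOCKET_OPTIONS) -> None:
    """Best-effort: a rejected option must not fail an otherwise-good connection."""
    for level, optname, value in options:
        try:
            sock.setsockopt(level, optname, value)
        except OSError:
            pass


def build_tls_context() -> ssl.SSLContext:
    ctx = ssl.create_default_context()          # hostname check + CA verification on
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx
```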
aiohttp_factory.py:
- _KeepaliveConnector subclasses aiohttp.TCPConnector. aiohttp doesn't
expose socket_options on the connector — the only way to set per-
connection SO_KEEPALIVE is to intercept transport creation. The
subclass overrides _wrap_create_connection and applies the options
to the underlying socket from transport.get_extra_info('socket')
after the connection completes.
- Default keepalive_timeout=20s (was 30s) — short enough to age out
before any common load balancer's idle timeout (AWS ALB 60s,
NGINX 75s, CloudFlare 90s) closes us asymmetrically.
- Default limit_per_host=100 (was 50) — LLM endpoints saturate.
- Default ClientTimeout total=300s, sock_read=300s, connect=15s — LLM
generations occasionally take several minutes; the previous 120s
defaults were too tight.
- Default ssl_context = build_tls_context().
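A hedged sketch of the connector idea: with no public socket_options knob, the subclass hooks transport creation. It leans on aiohttp's private _wrap_create_connection returning a (transport, protocol) pair, an implementation detail of current 3.x releases rather than a stable API, and this build_connector is only an approximation of the factory's real body.

```python
import aiohttp

from gigaevo.infra._net import DEFAULT_SOCKET_OPTIONS, apply_socket_options, build_tls_context


class _KeepaliveConnector(aiohttp.TCPConnector):
    """Apply keepalive socket options to each newly created transport."""

    async def _wrap_create_connection(self, *args, **kwargs):
        # Private aiohttp hook; returns (transport, protocol) in current 3.x releases.
        transport, proto = await super()._wrap_create_connection(*args, **kwargs)
        sock = transport.get_extra_info("socket")
        if sock is not None:
            apply_socket_options(sock, DEFAULT_SOCKET_OPTIONS)
        return transport, proto


def build_connector(**overrides) -> aiohttp.TCPConnector:
    # Intended to be called from async code, like the sessions it feeds.
    defaults = dict(
        limit=300,
        limit_per_host=100,       # LLM endpoints saturate; raised from 50
        keepalive_timeout=20,     # age out before LB idle cutoffs (ALB 60s, NGINX 75s)
        enable_cleanup_closed=True,
        ssl=build_tls_context(),  # explicit TLS 1.2+ context
    )
    defaults.update(overrides)
    return _KeepaliveConnector(**defaults)
```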
requests_factory.py:
- _KeepaliveHTTPAdapter subclasses HTTPAdapter to thread socket_options
and ssl_context through init_poolmanager into the underlying
PoolManager — required so the options reach every connection minted
by the pool, not just the first.
- Default timeout split into (connect=15s, read=60s) tuple; a single
float is normalized to (value, value) so legacy callsites continue
working.
- Default Retry: separated connect / read / status counts (5 each),
backoff_jitter=0.5 defeats synchronized retry storms,
backoff_max=30s caps the worst-case sleep, status_forcelist now
includes 408 and 425 along with 429 and 5xx. POST is excluded from
allowed_methods (urllib3's safe default) — retrying a successfully-
processed POST creates a duplicate resource.
- Default ssl_context = build_tls_context().
- Default pool_connections=20, pool_maxsize=100 (was 10, 50).
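A sketch of the adapter described above; note the attributes must be set before super().__init__(), which itself calls init_poolmanager. Names mirror the commit text, the body is an assumption.

```python
from requests.adapters import HTTPAdapter

from gigaevo.infra._net import DEFAULT_SOCKET_OPTIONS, build_tls_context


class _KeepaliveHTTPAdapter(HTTPAdapter):
    """Thread socket_options / ssl_context into urllib3 so every pooled connection gets them."""

    def __init__(self, *args, socket_options=None, ssl_context=None, **kwargs):
        # Must be set before super().__init__(), which calls init_poolmanager().
        self._socket_options = socket_options or list(DEFAULT_SOCKET_OPTIONS)
        self._ssl_context = ssl_context or build_tls_context()
        super().__init__(*args, **kwargs)

    def init_poolmanager(self, connections, maxsize, block=False, **pool_kwargs):
        # connection_pool_kw reaches every connection the PoolManager mints, not just the first.
        pool_kwargs.setdefault("socket_options", self._socket_options)
        pool_kwargs.setdefault("ssl_context", self._ssl_context)
        super().init_poolmanager(connections, maxsize, block=block, **pool_kwargs)
```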
Test coverage:
- DEFAULT_SOCKET_OPTIONS contains SO_KEEPALIVE + TCP_NODELAY on every
platform; TCP_KEEPIDLE on Linux.
- apply_socket_options swallows individual setsockopt failures.
- ssl context default pins TLSv1_2 + check_hostname + CERT_REQUIRED.
- _KeepaliveHTTPAdapter pushes socket_options and ssl_context through
to the PoolManager's connection_pool_kw.
- _KeepaliveConnector is the concrete TCPConnector subclass returned
by build_connector(); enable_cleanup_closed and ttl_dns_cache are
forwarded from build_connector kwargs (asserted via construction
spy because aiohttp 3.13 doesn't expose ttl_dns_cache publicly).
- Retry excludes POST; status_forcelist contains 408/429/503.
- _TimeoutSession.request forwards extra kwargs verbatim.
168/168 tests pass across tests/infra/ + tests/memory/. ruff clean.
…sites
Three call sites in the LLM stack constructed openai SDK clients with no
http_client kwarg, leaving each instance to mint its own internal
connection pool. After this commit they can opt into the aiohttp-backed
factory built earlier in the branch:
- BalancedChatOpenAI gains an explicit ``http_async_client`` constructor
kwarg. When provided, it's forwarded to every per-endpoint ChatOpenAI
via kwargs.setdefault, so all N endpoints share a single underlying
connection pool. Default ``None`` preserves the historical behavior of
letting each ChatOpenAI mint its own client — opt-in, not enforced (see the sketch after this list).
- The ideas_tracker AsyncOpenAI client now passes
http_client=make_openai_http_client("ideas_tracker_async") at
construction. The sibling sync OpenAI client is intentionally left on
the openai SDK default (sync httpx) — sync httpx is not subject to
the documented async-pool deadlock and the openai SDK's sync interface
is built around httpx.Client.
- MultiModelRouter._verify_models was using urllib.request.urlopen
directly with no retry/keepalive/TLS configuration. Swapped to a
session minted by make_requests_session("model_verify",
timeout=(5.0, 10.0)). The probe now inherits the same urllib3 retry
policy and TCP keepalive socket options as the rest of the sync HTTP
layer.
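A hedged sketch of the forwarding shape from the BalancedChatOpenAI item above; the http_async_client kwarg and the kwargs.setdefault mechanism come from the commit text, while the class skeleton, the endpoint loop, and the role string in the usage comment are illustrative.

```python
from langchain_openai import ChatOpenAI

from gigaevo.infra.aiohttp_factory import make_openai_http_client


class BalancedChatOpenAI:
    """Skeleton only: fans requests out over several OpenAI-compatible endpoints."""

    def __init__(self, endpoints, http_async_client=None, **kwargs):
        if http_async_client is not None:
            # setdefault: an explicit per-endpoint client already in kwargs still wins
            kwargs.setdefault("http_async_client", http_async_client)
        self._clients = [ChatOpenAI(base_url=url, **kwargs) for url in endpoints]


# Opt-in: all N endpoints share one aiohttp-backed connection pool.
# BalancedChatOpenAI(endpoints, http_async_client=make_openai_http_client("balanced_chat"))
```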
Not touched in this commit:
- problems/dashboard/validate.py — single ChatOpenAI driven via the
sync .invoke() path, runs once per dashboard score, not on a hot
loop. Sync httpx default is appropriate here.
- gigaevo/memory/openai_inference.py — sync OpenAI() only, no async
hot path.
Test coverage adds ``TestHttpAsyncClientForwarding`` to
test_balanced_chat.py — verifies an explicit http_async_client reaches
every per-endpoint ChatOpenAI, and the default ``None`` doesn't inject.
…ohttp + requests

Reflects the actual runtime footprint after the network layer rewrite:

- openai>=1.0.0 → openai[aiohttp]>=2.0.0. DefaultAioHttpClient lives in openai 2.x's aiohttp extra (which transitively pulls aiohttp and httpx-aiohttp). 2.x is required for the async migration to function.
- httpx>=0.27.0 is dropped from direct dependencies. No `import httpx` remains in gigaevo/, problems/, or tests/; httpx stays available as a transitive dep through openai / langchain-openai / langfuse / litellm but is no longer something this codebase imports directly.
- aiohttp>=3.10 added explicitly. It arrives transitively via the openai aiohttp extra, but the codebase now imports aiohttp directly (gigaevo/infra/aiohttp_factory.py + _net.py + concept_api callers via the openai SDK wrapping). Pinning a floor makes the dependency graph honest.
- requests>=2.31 added explicitly. _ConceptApiClient (the Memory API client) is now requests-based via gigaevo/infra/requests_factory.py; this codifies it as a direct dependency rather than relying on transitive presence through assorted other packages.
Closes #9.
Replaces direct httpx usage in this codebase with a two-stack network layer: aiohttp under the openai SDK for async paths, requests + urllib3 for sync paths. Drops `import httpx` from gigaevo/, problems/, and tests/ — httpx remains a transitive dependency through openai / langchain-openai / langfuse / litellm but is no longer a surface this codebase directly maintains. The migration targets the failure mode the issue documents: long-running asyncio processes accumulating connections that httpcore's async semaphore never releases, producing silent `PoolTimeout` cascades on the timescale of hours.

What landed
New gigaevo/infra/ modules

- `aiohttp_factory.py` — `make_aiohttp_session(role, ...)` returns `aiohttp.ClientSession`; `make_openai_http_client(role, *, proxy, **overrides)` returns `openai.DefaultAioHttpClient`, the aiohttp-backed http_client adapter shipped in the `openai[aiohttp]` extra. A `_KeepaliveConnector` subclasses `TCPConnector` and applies `DEFAULT_SOCKET_OPTIONS` to each transport after `_wrap_create_connection` (aiohttp doesn't expose `socket_options` directly).
- `requests_factory.py` — `make_requests_session(role, ...)` returns a `requests.Session` with `_KeepaliveHTTPAdapter` mounted on both `http://` and `https://`. The adapter forwards `socket_options` and `ssl_context` through `init_poolmanager` to `urllib3.PoolManager` so every minted connection inherits the keepalive settings. `_TimeoutSession` injects a default per-request timeout split into `(connect, read)` because requests has no session-level timeout knob.
- `_net.py` — shared `DEFAULT_SOCKET_OPTIONS` (TCP_NODELAY + SO_KEEPALIVE + per-OS probe timing; Linux drops dead-peer detection from `net.ipv4.tcp_keepalive_time=7200` to ~120 seconds via TCP_KEEPIDLE=60s / KEEPINTVL=10s / KEEPCNT=6) and `build_tls_context()` (TLS 1.2+ minimum, hostname + CA verification on).

Defaults targeting the bug profile
- `ClientTimeout(total=300s, connect=15s, sock_read=300s, sock_connect=15s)` — no `None` pool waits, LLM-friendly read window.
- `keepalive_timeout=20s` — shorter than common LB idle thresholds (AWS ALB 60s, NGINX 75s, Cloudflare 90s) so connections age out before the LB closes them asymmetrically.
- `limit_per_host=100` (was effectively unbounded on the old direct-httpx usage).
- `Retry(total=5, connect=5, read=5, status=5, backoff_factor=0.5, backoff_jitter=0.5, backoff_max=30s)`. POST is excluded from `allowed_methods` (urllib3's safe default) — retrying a successfully-processed POST creates a duplicate resource. `status_forcelist` covers 408 / 425 / 429 / 5xx; `respect_retry_after_header=True`.
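A hedged sketch of that retry policy using urllib3 2.x keyword names (`backoff_jitter` and `backoff_max` require urllib3 >= 2.0); the exact construction in `requests_factory.py` may differ.

```python
from urllib3.util.retry import Retry

retry = Retry(
    total=5, connect=5, read=5, status=5,
    backoff_factor=0.5,
    backoff_jitter=0.5,    # de-synchronize retry storms
    backoff_max=30,        # cap the worst-case sleep
    status_forcelist=(408, 425, 429, 500, 502, 503, 504),
    allowed_methods=Retry.DEFAULT_ALLOWED_METHODS,  # HEAD/GET/PUT/DELETE/OPTIONS/TRACE; POST excluded
    respect_retry_after_header=True,
)
```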
Call site migration

- `problems/prompts/client.py`, `problems/chains/client.py` — the LLM clients now construct `AsyncOpenAI` with `http_client=make_openai_http_client(...)`. The `problems/chains/client.py` `timeout=None` (which let the pool wait indefinitely — the silent-forever-hang from the issue) is removed; the factory provides bounded defaults.
- `gigaevo/memory/shared_memory/concept_api.py` — `_ConceptApiClient` switches from sync `httpx.Client` to a `requests.Session` from the factory. Sync httpx isn't subject to the documented async-pool bug, but removing it consolidates the codebase's network footprint.
- `gigaevo/infra/balanced_chat.py` — `BalancedChatOpenAI` gains an explicit `http_async_client` kwarg, forwarded to every per-endpoint `ChatOpenAI` via `kwargs.setdefault`. Default `None` preserves prior behavior; callers opt in for shared aiohttp pool semantics across all endpoints.
- `gigaevo/memory/ideas_tracker/components/fabrics/llm_clients_fabric.py` — the `AsyncOpenAI` client now passes `http_client=make_openai_http_client("ideas_tracker_async")`. The sibling sync `OpenAI` is left on the openai SDK default.
- `gigaevo/llm/models.py` — `MultiModelRouter._verify_models` swaps `urllib.request.urlopen` for a `make_requests_session("model_verify", timeout=(5.0, 10.0))` session. The probe now carries the same retry / keepalive / TLS settings as the rest of the sync layer.

Tests
- `tests/infra/test_aiohttp_factory.py` (24 tests) — connector defaults, `enable_cleanup_closed` passed regardless of platform, TLS context pins TLSv1.2 + check_hostname + CERT_REQUIRED, socket options shape on Linux, factory delegation to `DefaultAioHttpClient`.
- `tests/infra/test_requests_factory.py` (22 tests) — POST excluded from retry allowed_methods, status_forcelist contains 408/429/503, `_TimeoutSession.request` forwards extra kwargs verbatim, adapter pushes `socket_options` and `ssl_context` to `PoolManager`, default-timeout tuple normalization.
- `tests/infra/test_balanced_chat.py` — new `TestHttpAsyncClientForwarding` covers explicit-kwarg propagation to every endpoint and the no-kwarg-no-injection default.
- `tests/memory/_fake_http.py` — replaces `httpx.MockTransport` with a `requests.Session.request` patch keeping the previous handler interface (`request.method`, `request.url`, `request.content`) so the existing memory test handlers continue working unchanged.
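A sketch of the `make_fake_response` idea from the harness description above: a hand-built `requests.Response` that keeps the old httpx-style constructor shape. The signature comes from the text, the body is an assumption.

```python
import json as jsonlib

import requests


def make_fake_response(status_code, *, json=None, text=None):
    """Hand-built requests.Response mimicking the old httpx.Response(...) constructor shape."""
    resp = requests.Response()
    resp.status_code = status_code
    if json is not None:
        resp._content = jsonlib.dumps(json).encode("utf-8")  # .json()/.text read from _content
        resp.headers["Content-Type"] = "application/json"
    else:
        resp._content = (text or "").encode("utf-8")
        resp.headers["Content-Type"] = "text/plain"
    return resp


# make_fake_response(200, json={"ok": True}).json()  ->  {"ok": True}
```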
Dependency changes

- `openai>=1.0.0` → `openai[aiohttp]>=2.0.0`. `DefaultAioHttpClient` requires the aiohttp extra in openai 2.x.
- `aiohttp>=3.10` and `requests>=2.31` added explicitly — both are now first-party imports in this codebase.
- `httpx>=0.27.0` dropped. Stays available transitively.

Deliberately not in this PR
- `problems/dashboard/validate.py` and `gigaevo/memory/openai_inference.py` — both use sync `OpenAI`. Sync httpx isn't the documented bug surface and the openai SDK's sync interface is built around `httpx.Client`; routing those through anything else requires architectural changes out of scope.
- `urllib.request` usage outside the LLM router (`gigaevo/utils/...`, `tools/...`). Those are admin/CLI paths, not the hot path.
- Wiring callers to pass `http_async_client` to `BalancedChatOpenAI`. The kwarg is added; wiring callers is a follow-up.