Skip to content

Fix TCPConnector.close race nullifying resolver during in-flight resolve (#12497)#12688

Draft
jbbqqf wants to merge 2 commits into
aio-libs:masterfrom
jbbqqf:fix/12497-connector-close-race
Draft

Fix TCPConnector.close race nullifying resolver during in-flight resolve (#12497)#12688
jbbqqf wants to merge 2 commits into
aio-libs:masterfrom
jbbqqf:fix/12497-connector-close-race

Conversation

@jbbqqf
Copy link
Copy Markdown

@jbbqqf jbbqqf commented May 23, 2026

What do these changes do?

When TCPConnector.close() runs concurrently with an in-flight non-cached
resolve that is suspended inside a trace callback (e.g.
on_dns_resolvehost_start), the owned resolver used to be torn down
before the connector was marked closed. The suspended resolve then
resumed, called self._resolver.resolve(...) against a nullified backend,
and crashed with
AttributeError: 'NoneType' object has no attribute 'getaddrinfo'.

This PR:

  1. Closes the connector first in TCPConnector.close() (so _closed flips
    to True and any tracked resolve tasks are cancelled via
    _close_immediately), then awaits self._resolver.close().
  2. After the trace callback await in the non-cached _resolve_host path,
    short-circuits with ClientConnectionError("Connector is closed.")
    whenever self._closed is set. The non-cached path doesn't store its
    resolve task in _resolve_host_tasks, so this guard is what makes the
    teardown observable to the resuming coroutine.

Net effect: the resuming coroutine now sees the connector as closed and
surfaces ClientConnectionError to the caller, matching the cached-path
behavior described in the issue.

Are there changes in behavior for the user?

Yes, but only in the previously-broken edge case: callers that hit this
race now see ClientConnectionError("Connector is closed.") instead of
AttributeError. The new error matches the wording of the existing check
at aiohttp/connector.py:621 raised on the post-resolve fast path, so
it's consistent with how an already-closed connector is reported
elsewhere.

Is it a substantial burden for the maintainers to support this?

No. The fix is two narrow edits in aiohttp/connector.py
(TCPConnector.close swap + a self._closed guard right after the
trace-callback await) plus a regression test. No public-API change.

Related issue number

Fixes #12497.

Reproduce BEFORE/AFTER yourself (copy-paste)

# Setup (Python 3.11+, aiodns optional but reproduces the AttributeError variant)
git clone https://github.com/aio-libs/aiohttp.git /tmp/aiohttp-repro
cd /tmp/aiohttp-repro
git submodule update --init
python -m venv .venv && source .venv/bin/activate
AIOHTTP_NO_EXTENSIONS=1 pip install -e . aiodns

cat > /tmp/repro_12497.py <<'PY'
import asyncio, sys, aiohttp

async def main():
    async def on_dns_start(session, ctx, params):
        await asyncio.sleep(0)
    trace = aiohttp.TraceConfig()
    trace.on_dns_resolvehost_start.append(on_dns_start)
    session = aiohttp.ClientSession(
        trace_configs=[trace],
        connector=aiohttp.TCPConnector(use_dns_cache=False),
    )
    task = asyncio.create_task(session.ws_connect("wss://example.com"))
    await asyncio.sleep(0)
    await session.close()
    try:
        await task
    except AttributeError as e:
        print(f"REPRODUCED: {type(e).__name__}: {e}")
    except aiohttp.ClientConnectionError as e:
        print(f"FIXED: {type(e).__name__}: {e}")
    except Exception as e:
        print(f"OTHER: {type(e).__name__}: {e}")

print(f"Python {sys.version.split()[0]}, aiohttp {aiohttp.__version__}")
asyncio.run(main())
PY

# BEFORE — on origin/master
git checkout master
AIOHTTP_NO_EXTENSIONS=1 pip install -e . --quiet
AIOHTTP_NO_EXTENSIONS=1 python /tmp/repro_12497.py
# Expected: REPRODUCED: AttributeError: 'NoneType' object has no attribute 'getaddrinfo'

# AFTER — on this PR branch
git fetch https://github.com/jbbqqf/aiohttp.git fix/12497-connector-close-race
git checkout FETCH_HEAD
AIOHTTP_NO_EXTENSIONS=1 pip install -e . --quiet
AIOHTTP_NO_EXTENSIONS=1 python /tmp/repro_12497.py
# Expected: FIXED: ClientConnectionError: Connector is closed.

What I ran locally

$ AIOHTTP_NO_EXTENSIONS=1 pytest tests/test_connector.py --timeout=60 -q
================== 180 passed, 7 skipped, 1 xfailed in 1.19s ===================

$ AIOHTTP_NO_EXTENSIONS=1 pytest tests/test_connector.py::test_close_during_non_cached_resolve_raises_client_error \
    tests/test_connector.py::test_tcp_connector_close_resolver \
    tests/test_connector.py::test_close_cancels_resolve_host \
    tests/test_connector.py::test_close_twice \
    tests/test_connector.py::test_dns_error -v --timeout=60
============================== 5 passed in 0.06s ===============================

$ AIOHTTP_NO_EXTENSIONS=1 pytest tests/test_client_session.py tests/test_resolver.py --timeout=60 -q
======================== 113 passed, 1 skipped in 0.64s ========================

The new regression test test_close_during_non_cached_resolve_raises_client_error
fails on master with the exact AttributeError: 'NoneType' object has no attribute 'getaddrinfo' from the issue, and passes on this branch.

Edge cases

Scenario Before After
use_dns_cache=False + trace cb yielding before resolve AttributeError: 'NoneType'... ClientConnectionError("Connector is closed.")
use_dns_cache=True cached hit + close mid-flight already OK (cached path doesn't touch resolver after _cached_hosts.next_addrs) unchanged
use_dns_cache=True cache miss + tracked task tracked task cancelled by _close_immediately unchanged (cancellation still fires first because super().close() now precedes resolver.close())
Connector closed twice (close() reentry) OK (_close_immediately early-returns on _closed) unchanged
Resolver not owned by the connector (resolver_owner=False) resolver kept alive unchanged: we still skip resolver.close()

Checklist

  • I think the code is well written
  • Unit tests for the changes exist
  • Documentation reflects the changes (N/A — bug fix, no public API change)
  • If you provide code modification, please add yourself to CONTRIBUTORS.txt
  • Add a new news fragment into the CHANGES/ folder (CHANGES/12497.bugfix.rst)

Drafted with Claude Code (claude-opus-4-7, Anthropic); to be reviewed by @jbbqqf.

Close the connector before the owned resolver so an in-flight non-cached
resolve suspended inside a trace callback (e.g. on_dns_resolvehost_start)
sees the closed connector instead of dereferencing a nullified backend.
The non-cached resolve path now raises ClientConnectionError("Connector
is closed.") after the trace callback await, matching the behavior the
cached path already gets from _close_immediately() cancelling tracked
resolve tasks.

Fixes aio-libs#12497.
@psf-chronographer psf-chronographer Bot added the bot:chronographer:provided There is a change note present in this PR label May 23, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented May 23, 2026

❌ 1 Tests Failed:

Tests completed Failed Passed Skipped
4470 1 4469 142
View the top 1 failed test(s) by shortest run time
tests.test_cookie_helpers::test_cookie_pattern_performance
Stack Traces | 0.164s run time
#x1B[0m#x1B[94mdef#x1B[39;49;00m#x1B[90m #x1B[39;49;00m#x1B[92mtest_cookie_pattern_performance#x1B[39;49;00m() -> #x1B[94mNone#x1B[39;49;00m:#x1B[90m#x1B[39;49;00m
    #x1B[90m    #x1B[39;49;00m#x1B[33m"""Test that the cookie pattern doesn't suffer from ReDoS issues."""#x1B[39;49;00m#x1B[90m#x1B[39;49;00m
        COOKIE_PATTERN_TIME_THRESHOLD_SECONDS = #x1B[94m0.08#x1B[39;49;00m#x1B[90m#x1B[39;49;00m
        value = #x1B[33m"#x1B[39;49;00m#x1B[33ma#x1B[39;49;00m#x1B[33m"#x1B[39;49;00m + #x1B[33m"#x1B[39;49;00m#x1B[33m=#x1B[39;49;00m#x1B[33m"#x1B[39;49;00m * #x1B[94m21651#x1B[39;49;00m + #x1B[33m"#x1B[39;49;00m#x1B[33m\x00#x1B[39;49;00m#x1B[33m"#x1B[39;49;00m#x1B[90m#x1B[39;49;00m
        start = time.perf_counter()#x1B[90m#x1B[39;49;00m
        match = helpers._COOKIE_PATTERN.match(value)#x1B[90m#x1B[39;49;00m
        elapsed = time.perf_counter() - start#x1B[90m#x1B[39;49;00m
    #x1B[90m#x1B[39;49;00m
        #x1B[90m# If this is taking more time, there's probably a performance/ReDoS issue.#x1B[39;49;00m#x1B[90m#x1B[39;49;00m
>       #x1B[94massert#x1B[39;49;00m elapsed < COOKIE_PATTERN_TIME_THRESHOLD_SECONDS, (#x1B[90m#x1B[39;49;00m
            #x1B[33mf#x1B[39;49;00m#x1B[33m"#x1B[39;49;00m#x1B[33mPattern took #x1B[39;49;00m#x1B[33m{#x1B[39;49;00melapsed#x1B[90m #x1B[39;49;00m*#x1B[90m #x1B[39;49;00m#x1B[94m1000#x1B[39;49;00m#x1B[33m:#x1B[39;49;00m#x1B[33m.1f#x1B[39;49;00m#x1B[33m}#x1B[39;49;00m#x1B[33mms, #x1B[39;49;00m#x1B[33m"#x1B[39;49;00m#x1B[90m#x1B[39;49;00m
            #x1B[33mf#x1B[39;49;00m#x1B[33m"#x1B[39;49;00m#x1B[33mexpected <#x1B[39;49;00m#x1B[33m{#x1B[39;49;00mCOOKIE_PATTERN_TIME_THRESHOLD_SECONDS#x1B[90m #x1B[39;49;00m*#x1B[90m #x1B[39;49;00m#x1B[94m1000#x1B[39;49;00m#x1B[33m:#x1B[39;49;00m#x1B[33m.0f#x1B[39;49;00m#x1B[33m}#x1B[39;49;00m#x1B[33mms - potential ReDoS issue#x1B[39;49;00m#x1B[33m"#x1B[39;49;00m#x1B[90m#x1B[39;49;00m
        )#x1B[90m#x1B[39;49;00m
#x1B[1m#x1B[31mE       AssertionError: Pattern took 128.2ms, expected <80ms - potential ReDoS issue#x1B[0m
#x1B[1m#x1B[31mE       assert 0.12816452800001343 < 0.08#x1B[0m

COOKIE_PATTERN_TIME_THRESHOLD_SECONDS = 0.08
elapsed    = 0.12816452800001343
match      = None
start      = 261.513801902
value      = 'a====================================================================================================================...==================================================================================================================\x00'

#x1B[1m#x1B[31mtests/test_cookie_helpers.py#x1B[0m:649: AssertionError

To view more test analytics, go to the Test Analytics Dashboard
📋 Got 3 mins? Take this short survey to help us improve Test Analytics.

@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented May 23, 2026

Merging this PR will not alter performance

✅ 72 untouched benchmarks
⏩ 72 skipped benchmarks1


Comparing jbbqqf:fix/12497-connector-close-race (8064b88) with master (d2c203f)

Open in CodSpeed

Footnotes

  1. 72 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bot:chronographer:provided There is a change note present in this PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

TCPConnector.close() race: AttributeError on in-flight resolve with aiodns

1 participant