Skip to content

[GMLP-9206] Backport redis-py PR #3557 recursion fix#3

Open
waiho-gumloop wants to merge 4 commits into
gumloop:mainfrom
waiho-gumloop:waiho-gmlp-9206-redis-py-recursion-workaround
Open

[GMLP-9206] Backport redis-py PR #3557 recursion fix#3
waiho-gumloop wants to merge 4 commits into
gumloop:mainfrom
waiho-gumloop:waiho-gmlp-9206-redis-py-recursion-workaround

Conversation

@waiho-gumloop

@waiho-gumloop waiho-gumloop commented Jun 30, 2026

Copy link
Copy Markdown

What

Adds celery/_redis_py_recursion_workaround.py, imported from celery/__init__.py, that backports redis-py PR #3557 for redis-py < 6.0.0b2. Faithful port of the upstream fix:

  • Adds Connection.connect_check_health and Connection.on_connect_check_health.
  • Threads check_health through every send_command in the on_connect chain.
  • Changes send_packed_command's stale-socket reconnect from self.connect() to self.connect_check_health(check_health=False).
  • Preserves the upstream transparent-reconnect contract.
  • Adds a single WARNING-per-process observability log the first time the suppressed-check_health reconnect fires (subsequent reconnects log at DEBUG).

Why

On 2026-06-26T03:46:35Z–03:46:50Z, 171 Celery worker pods in production crashed with RecursionError('maximum recursion depth exceeded while calling a Python object') during a Redis Deployment Recreate rollout. Confirmed via prod cluster inspection: Redis ReplicaSet redis-6f9c8ffc9f was created at 2026-06-26T03:45:56Z, matching the first worker "Connection to broker lost" log at 03:45:55.820Z.

The bug is in redis-py 5.2.1's reconnect path (upstream issue #3745, fixed in PR #3557):

connect() -> on_connect()
  -> send_command("CLIENT","SETINFO","LIB-NAME", lib_name)
  -> send_packed_command(check_health=True)
    -> check_health() -> retry.call_with_retry(_send_ping, _ping_failed)
      -> _send_ping()                             # fails (any retryable error)
      -> _ping_failed() -> disconnect()           # _sock = None
      -> _send_ping()                             # retry
        -> send_command("PING", check_health=False)
        -> send_packed_command(check_health=False)
        -> if not self._sock: self.connect()      # <-- re-enters top of chain
        -> on_connect() -> CLIENT SETINFO -> check_health -> ...
        -> recursion (~166 levels until Python stack limit)

Trigger conditions (all set in every Celery service's broker_transport_options):

  • health_check_interval = 30
  • retry_on_timeout = True
  • lib_name non-empty (redis-py defaults it to "redis-py")

Failure-mode agnostic: the recursion fires regardless of whether the initial _send_ping failure is TimeoutError, ConnectionError, BusyLoadingError, or a raw socket error. This fix breaks the cycle at the reconnect point, so all these variants surface as normal handleable errors.

Why not just upgrade redis-py

kombu==5.5.2 pins redis<=5.2.1. celery 5.5.x pins kombu<5.6. A redis-py bump alone is rejected by uv sync. Proper fix is a coordinated celery 5.6+ / kombu 5.6+ / redis-py 6.0+ upgrade, which also requires rebasing this fork onto upstream celery 5.6+ (forward-porting rbehal's spawn pool work, PR #1 --disable-prefetch, PR #2 GMLP-9012). That is tracked strategically. This PR is the tactical mitigation.

Idempotency / no-op conditions

  • redis-py not installed → no-op
  • redis-py ≥ 6.0.0b2 (detected via hasattr(Connection, 'connect_check_health')) → no-op
  • Guards against double-patching on re-import

Validation

Two reproduction methods worked on the beta cluster:

  • Deterministic in-pod repro: kubectl exec a Python script into a worker pod that monkeypatches redis.connection.Connection._send_ping to always raise TimeoutError after sending PING. Then client.ping() against live beta Redis. Unpatched → 5/5 RecursionError. Patched → 5/5 non-recursion.
  • Live Redis DEBUG SLEEP + tracer script: temporarily enabled --enable-debug-command yes on the beta Redis Deployment, then redis-cli DEBUG SLEEP 180 (backgrounded) + a tracer script inside a worker pod using socket_timeout=0.2 so each recursion level ≈ 200 ms. Unpatched → RecursionError max_depth=996 in 13.42 s. Patched → non-recursion TimeoutError at max_depth=29 in 0.81 s. Redis deployment reverted after.
Test Image Result
Synthetic in-pod, unpatched 5.5.3+gumloop.0.1.5 5/5 RecursionError
Synthetic in-pod, patched (inline) unfixed + inline monkeypatch 5/5 non-recursion
Synthetic in-pod, patched (via wheel) 5.5.3+gumloop.0.1.5.gmlp9206.test1 5/5 non-recursion, WARNING log fires
Live Redis, unpatched (tracer, DEBUG SLEEP 180 + socket_timeout=0.2) unfixed RecursionError at max_depth=996 in 13.42s
Live Redis, patched (tracer, same setup) fixed non-recursion TimeoutError at max_depth=29 in 0.81s
Prod incident (2026-06-26) unfixed 171 RecursionError crashes in 15s window

Test wheel 5.5.3+gumloop.0.1.5.gmlp9206.test1 is on beta cluster now. Do not promote that tag to staging/prod. Cut a clean 5.5.3+gumloop.0.1.6 on merge.

Caveats

  • Only takes effect when celery is imported (celery/__init__.py applies the patch as import side-effect). For our Celery broker code paths this is universally true.
  • App code that uses redis-py directly without importing celery first will NOT get the workaround; that path is covered by the coordinated redis-py upgrade tracked separately.

References

Workaround for the on_connect-recursion bug in redis-py < 6.0.0b2 that
crashed ~261 Celery workers across all services during the production
Redis broker pod replacement on 2026-06-25 (incident GMLP-9206).

Root cause: when a redis-py Connection is mid-handshake and the inner
PING from check_health fails, send_packed_command calls self.connect()
to recover, which re-enters on_connect -> CLIENT SETINFO -> check_health
-> PING -> send_packed_command -> connect(), recursing until
RecursionError crashes the worker.

The upstream fix (redis-py PR celery#3557) splits connect into
connect_check_health and threads check_health=False through the inner
reconnect. It is in redis-py 6.0.0b2+ only and is NOT backported to
the 5.x line.

We cannot adopt the upstream fix without a coordinated upgrade:
  - kombu 5.5.x pins redis<=5.2.1
  - celery 5.5.x pins kombu<5.6
  - the gumloop-celery fork would need to be rebased on celery 5.6+
    (which also requires forward-porting rbehal/celery's spawn pool
    work and our --disable-prefetch / GMLP-9012 PRs)

Until that upgrade lands, this patches Connection.send_packed_command
to raise ConnectionError instead of silently calling self.connect()
when the socket is gone. Higher-level callers (kombu Channel,
redis.client, application pools) already handle ConnectionError with
proper retry, so the failure surfaces cleanly.

Patch is applied via import side effect from celery/__init__.py and is
a no-op when:
  - redis-py is not installed
  - redis-py >= 6.0.0b2 (detected via hasattr Connection,
    'connect_check_health')

References:
  - redis-py PR celery#3557: redis/redis-py#3557
  - redis-py issue celery#3745: redis/redis-py#3745
  - Linear GMLP-9206
The imported module's own docstring already explains what it does.
…vability log

Replaces the minimal patch (which raised ConnectionError to break the
cycle) with a faithful port of redis-py PR celery#3557's actual logic:

  * Adds Connection.connect_check_health(check_health: bool = True)
  * Adds Connection.on_connect_check_health(check_health: bool = True)
  * Rebinds Connection.connect() and Connection.on_connect() as
    backwards-compatible wrappers
  * In Connection.send_packed_command, replaces the reconnect branch
    'if not self._sock: self.connect()' with
    'self.connect_check_health(check_health=False)' so the inner
    reconnect does not re-fire check_health -> _send_ping ->
    send_packed_command -> connect -> ... (the recursion cycle)

Preserves the upstream contract: a transient broker blip is recovered
transparently inside send_packed_command, the outer caller's command
succeeds against the freshly-established connection. Application code
using redis-py directly (e.g. cache clients) no longer sees a
ConnectionError on every blip, matching the behavior callers expect
from redis-py.

Adds one operational concession to upstream: a single WARNING per
worker process the first time the suppressed-check_health reconnect
fires, with subsequent reconnects logged at DEBUG. Restores a small
observability signal for incident triage without log spam.

Validated on the beta cluster (deterministic in-pod reproducer, 5/5
runs show recursion broken, observability WARNING fires correctly,
control flow matches the expected 4-attempt retry shape).
@waiho-gumloop waiho-gumloop reopened this Jul 1, 2026
@waiho-gumloop waiho-gumloop marked this pull request as ready for review July 1, 2026 01:23
Point to the specific line ranges of the AbstractConnection.connect and
AbstractConnection.on_connect methods that this module ports, so reviewers
can diff the port against the exact upstream source.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants