[GMLP-9206] Backport redis-py PR #3557 recursion fix by waiho-gumloop · Pull Request #3 · gumloop/gumloop-celery

waiho-gumloop · 2026-06-30T17:41:36Z

What

Adds celery/_redis_py_recursion_workaround.py, imported from celery/__init__.py, that backports redis-py PR #3557 for redis-py < 6.0.0b2. Faithful port of the upstream fix:

Adds Connection.connect_check_health and Connection.on_connect_check_health.
Threads check_health through every send_command in the on_connect chain.
Changes send_packed_command's stale-socket reconnect from self.connect() to self.connect_check_health(check_health=False).
Preserves the upstream transparent-reconnect contract.
Adds a single WARNING-per-process observability log the first time the suppressed-check_health reconnect fires (subsequent reconnects log at DEBUG).

Why

On 2026-06-26T03:46:35Z–03:46:50Z, 171 Celery worker pods in production crashed with RecursionError('maximum recursion depth exceeded while calling a Python object') during a Redis Deployment Recreate rollout. Confirmed via prod cluster inspection: Redis ReplicaSet redis-6f9c8ffc9f was created at 2026-06-26T03:45:56Z, matching the first worker "Connection to broker lost" log at 03:45:55.820Z.

The bug is in redis-py 5.2.1's reconnect path (upstream issue #3745, fixed in PR #3557):

connect() -> on_connect()
  -> send_command("CLIENT","SETINFO","LIB-NAME", lib_name)
  -> send_packed_command(check_health=True)
    -> check_health() -> retry.call_with_retry(_send_ping, _ping_failed)
      -> _send_ping()                             # fails (any retryable error)
      -> _ping_failed() -> disconnect()           # _sock = None
      -> _send_ping()                             # retry
        -> send_command("PING", check_health=False)
        -> send_packed_command(check_health=False)
        -> if not self._sock: self.connect()      # <-- re-enters top of chain
        -> on_connect() -> CLIENT SETINFO -> check_health -> ...
        -> recursion (~166 levels until Python stack limit)

Trigger conditions (all set in every Celery service's broker_transport_options):

health_check_interval = 30
retry_on_timeout = True
lib_name non-empty (redis-py defaults it to "redis-py")

Failure-mode agnostic: the recursion fires regardless of whether the initial _send_ping failure is TimeoutError, ConnectionError, BusyLoadingError, or a raw socket error. This fix breaks the cycle at the reconnect point, so all these variants surface as normal handleable errors.

Why not just upgrade redis-py

kombu==5.5.2 pins redis<=5.2.1. celery 5.5.x pins kombu<5.6. A redis-py bump alone is rejected by uv sync. Proper fix is a coordinated celery 5.6+ / kombu 5.6+ / redis-py 6.0+ upgrade, which also requires rebasing this fork onto upstream celery 5.6+ (forward-porting rbehal's spawn pool work, PR #1 --disable-prefetch, PR #2 GMLP-9012). That is tracked strategically. This PR is the tactical mitigation.

Idempotency / no-op conditions

redis-py not installed → no-op
redis-py ≥ 6.0.0b2 (detected via hasattr(Connection, 'connect_check_health')) → no-op
Guards against double-patching on re-import

Validation

Two reproduction methods worked on the beta cluster:

Deterministic in-pod repro: kubectl exec a Python script into a worker pod that monkeypatches redis.connection.Connection._send_ping to always raise TimeoutError after sending PING. Then client.ping() against live beta Redis. Unpatched → 5/5 RecursionError. Patched → 5/5 non-recursion.
Live Redis DEBUG SLEEP + tracer script: temporarily enabled --enable-debug-command yes on the beta Redis Deployment, then redis-cli DEBUG SLEEP 180 (backgrounded) + a tracer script inside a worker pod using socket_timeout=0.2 so each recursion level ≈ 200 ms. Unpatched → RecursionError max_depth=996 in 13.42 s. Patched → non-recursion TimeoutError at max_depth=29 in 0.81 s. Redis deployment reverted after.

Test	Image	Result
Synthetic in-pod, unpatched	`5.5.3+gumloop.0.1.5`	5/5 `RecursionError`
Synthetic in-pod, patched (inline)	unfixed + inline monkeypatch	5/5 non-recursion
Synthetic in-pod, patched (via wheel)	`5.5.3+gumloop.0.1.5.gmlp9206.test1`	5/5 non-recursion, WARNING log fires
Live Redis, unpatched (tracer, `DEBUG SLEEP 180` + `socket_timeout=0.2`)	unfixed	`RecursionError` at `max_depth=996` in 13.42s
Live Redis, patched (tracer, same setup)	fixed	non-recursion `TimeoutError` at `max_depth=29` in 0.81s
Prod incident (2026-06-26)	unfixed	171 RecursionError crashes in 15s window

Test wheel 5.5.3+gumloop.0.1.5.gmlp9206.test1 is on beta cluster now. Do not promote that tag to staging/prod. Cut a clean 5.5.3+gumloop.0.1.6 on merge.

Caveats

Only takes effect when celery is imported (celery/__init__.py applies the patch as import side-effect). For our Celery broker code paths this is universally true.
App code that uses redis-py directly without importing celery first will NOT get the workaround; that path is covered by the coordinated redis-py upgrade tracked separately.

References

redis-py PR documentation for 3.1 is not available on rtd celery/celery#3557: Fixed infinitely recursive health checks redis/redis-py#3557
redis-py issue Consumer does not shutdown properly when embedded in gevent application celery/celery#3745: RecursionError in redis.asyncio when using retry_on_timeout or health_check_interval redis/redis-py#3745
redis-py issue ignore_result=True in add_unlock_chord_task preventing all backend reporting? celery/celery#2563 (same recursion via max clients reached): Infinity retry on the exception of max number of clients reached redis/redis-py#2563
Linear: GMLP-9206

Workaround for the on_connect-recursion bug in redis-py < 6.0.0b2 that crashed ~261 Celery workers across all services during the production Redis broker pod replacement on 2026-06-25 (incident GMLP-9206). Root cause: when a redis-py Connection is mid-handshake and the inner PING from check_health fails, send_packed_command calls self.connect() to recover, which re-enters on_connect -> CLIENT SETINFO -> check_health -> PING -> send_packed_command -> connect(), recursing until RecursionError crashes the worker. The upstream fix (redis-py PR celery#3557) splits connect into connect_check_health and threads check_health=False through the inner reconnect. It is in redis-py 6.0.0b2+ only and is NOT backported to the 5.x line. We cannot adopt the upstream fix without a coordinated upgrade: - kombu 5.5.x pins redis<=5.2.1 - celery 5.5.x pins kombu<5.6 - the gumloop-celery fork would need to be rebased on celery 5.6+ (which also requires forward-porting rbehal/celery's spawn pool work and our --disable-prefetch / GMLP-9012 PRs) Until that upgrade lands, this patches Connection.send_packed_command to raise ConnectionError instead of silently calling self.connect() when the socket is gone. Higher-level callers (kombu Channel, redis.client, application pools) already handle ConnectionError with proper retry, so the failure surfaces cleanly. Patch is applied via import side effect from celery/__init__.py and is a no-op when: - redis-py is not installed - redis-py >= 6.0.0b2 (detected via hasattr Connection, 'connect_check_health') References: - redis-py PR celery#3557: redis/redis-py#3557 - redis-py issue celery#3745: redis/redis-py#3745 - Linear GMLP-9206

The imported module's own docstring already explains what it does.

…vability log Replaces the minimal patch (which raised ConnectionError to break the cycle) with a faithful port of redis-py PR celery#3557's actual logic: * Adds Connection.connect_check_health(check_health: bool = True) * Adds Connection.on_connect_check_health(check_health: bool = True) * Rebinds Connection.connect() and Connection.on_connect() as backwards-compatible wrappers * In Connection.send_packed_command, replaces the reconnect branch 'if not self._sock: self.connect()' with 'self.connect_check_health(check_health=False)' so the inner reconnect does not re-fire check_health -> _send_ping -> send_packed_command -> connect -> ... (the recursion cycle) Preserves the upstream contract: a transient broker blip is recovered transparently inside send_packed_command, the outer caller's command succeeds against the freshly-established connection. Application code using redis-py directly (e.g. cache clients) no longer sees a ConnectionError on every blip, matching the behavior callers expect from redis-py. Adds one operational concession to upstream: a single WARNING per worker process the first time the suppressed-check_health reconnect fires, with subsequent reconnects logged at DEBUG. Restores a small observability signal for incident triage without log spam. Validated on the beta cluster (deterministic in-pod reproducer, 5/5 runs show recursion broken, observability WARNING fires correctly, control flow matches the expected 4-attempt retry shape).

Point to the specific line ranges of the AbstractConnection.connect and AbstractConnection.on_connect methods that this module ports, so reviewers can diff the port against the exact upstream source.

waiho-gumloop added 3 commits June 29, 2026 14:38

Remove redundant comment in celery/__init__.py

7a2d497

The imported module's own docstring already explains what it does.

waiho-gumloop closed this Jun 30, 2026

waiho-gumloop reopened this Jul 1, 2026

waiho-gumloop marked this pull request as ready for review July 1, 2026 01:23

Add exact redis-py v5.2.1 source permalinks in docstrings

a1d47cd

Point to the specific line ranges of the AbstractConnection.connect and AbstractConnection.on_connect methods that this module ports, so reviewers can diff the port against the exact upstream source.

rbehal approved these changes Jul 1, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[GMLP-9206] Backport redis-py PR #3557 recursion fix#3

[GMLP-9206] Backport redis-py PR #3557 recursion fix#3
waiho-gumloop wants to merge 4 commits into
gumloop:mainfrom
waiho-gumloop:waiho-gmlp-9206-redis-py-recursion-workaround

waiho-gumloop commented Jun 30, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

waiho-gumloop commented Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Why

Why not just upgrade redis-py

Idempotency / no-op conditions

Validation

Caveats

References

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

waiho-gumloop commented Jun 30, 2026 •

edited

Loading