
Auto-reconnect subscribe connection on backend failure. Refs cable-cr/cable#105 #1

Open
russ wants to merge 4 commits into cable-cr:main from russ:reconnect-subscribe-on-backend-failure


@russ russ commented May 7, 2026


Summary

Fixes the WebSocket close-1006 cascade in cable-cr/cable#105 from the
backend side: when the Redis pubsub TCP connection dies (idle reap,
CLIENT KILL TYPE pubsub, network blip), open_subscribe_connection now
rebuilds the subscribe connection on a backoff and replays tracked
subscriptions onto it, instead of permanently killing message dispatch.

  • @subscribed_channels set tracks every active stream identifier so a
    fresh connection can replay them. The internal control channel is
    re-subscribed implicitly when the loop re-enters the pubsub block.
  • @shutting_down flag, set by close_subscribe_connection, lets a
    clean teardown exit the recovery loop instead of looping forever.
  • subscribe / unsubscribe swallow IO::Error because the recovery
    loop's replay is the source of truth.
  • Fall through to the reconnect path when the pubsub block returns
    cleanly, not only on IO::Error. jgaskins/redis exits its read
    loop normally when read? returns nil, which is what server-side
    closes (the exact CLIENT KILL TYPE pubsub repro) look like.
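The control flow in the bullets above can be sketched without Redis. This is a minimal illustration under stated assumptions, not the code in this PR: FakeConnection and RecoveringSubscriber are hypothetical names, and listen stands in for the blocking pubsub block. It shows the tracked set being replayed onto each new connection, and both a clean return and an IO::Error landing on the same reconnect path.

```crystal
# Hypothetical stand-in for a pubsub connection; this is not the jgaskins/redis API.
class FakeConnection
  getter channels = [] of String

  def initialize(@outcome : Symbol)
  end

  def subscribe(channel : String) : Nil
    @channels << channel
  end

  # Stands in for the blocking pubsub block. :io_error models a transport
  # failure; any other outcome returns normally, like a read loop that saw nil.
  def listen : Nil
    raise IO::Error.new("connection reset") if @outcome == :io_error
  end
end

# Hypothetical sketch of the recovery loop described above.
class RecoveringSubscriber
  getter subscribed_channels = Set(String).new
  getter reconnects = 0

  @connection : FakeConnection?
  @shutting_down = false

  def initialize(@backoff : Time::Span)
  end

  def subscribe(channel : String) : Nil
    @subscribed_channels << channel
    @connection.try &.subscribe(channel)
  rescue IO::Error
    # Swallowed: the replay in #run is the source of truth.
  end

  def close : Nil
    @shutting_down = true
  end

  def run(&connect : -> FakeConnection) : Nil
    loop do
      connection = @connection = connect.call
      # Replay every tracked stream onto the fresh connection.
      @subscribed_channels.each { |channel| connection.subscribe(channel) }
      begin
        connection.listen
      rescue IO::Error
        # Transport failure: fall through to the reconnect path.
      end
      # A clean return lands here as well, so a server-side close reconnects too.
      break if @shutting_down
      sleep @backoff
      @reconnects += 1
    end
  end
end

# Scripted run: a server-side close, then a transport error, then a teardown.
outcomes = [:closed, :io_error, :teardown]
connections = [] of FakeConnection
subscriber = RecoveringSubscriber.new(backoff: 10.milliseconds)
subscriber.subscribe("chat_1")

subscriber.run do
  outcome = outcomes.shift
  # Simulate a clean shutdown requested while the last connection is live.
  subscriber.close if outcome == :teardown
  connection = FakeConnection.new(outcome)
  connections << connection
  connection
end
```

In this scripted run the loop opens three connections, replays chat_1 onto each, and exits only once close has been called.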

Companion PR on the cable side: cable-cr/cable#106 — widens the
rescue in Cable::Connection#initialize so a dead-backend IO::Error
during subscribe_to_internal_channel becomes a clean 1011
InternalServerError close instead of a bare TCP teardown.
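As a rough illustration of the companion change, the sketch below maps an IO::Error raised during connection setup to the standard library's 1011 close code. The helper name close_code_for_setup is hypothetical and this is not Cable's actual code; only HTTP::WebSocket::CloseCode comes from Crystal's standard library.

```crystal
require "http/web_socket"

# Hypothetical helper: runs the setup block and returns the close code to send
# if the backend is unreachable, or nil when setup succeeded.
def close_code_for_setup(&block) : HTTP::WebSocket::CloseCode?
  yield
  nil
rescue ex : IO::Error
  HTTP::WebSocket::CloseCode::InternalServerError
end

ok = close_code_for_setup { }
dead = close_code_for_setup { raise IO::Error.new("Closed stream") }
```

Here ok is nil, and dead is the InternalServerError member, whose integer value is 1011.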

Also adds a compose.yaml so contributors can run the suite without
a local Crystal install (docker compose run --rm app crystal spec).

Test plan

  • crystal spec passes on Crystal 1.10.0 (CI floor) and latest (1.20.1)
  • crystal tool format --check clean on both
  • ./bin/ameba clean on both
  • New integration spec issues CLIENT KILL TYPE pubsub against a
    live Redis and verifies a fresh publish reaches the client
    through the new connection
  • CI matrix (1.10.0 / latest / nightly) green on this PR

russ and others added 3 commits May 7, 2026 15:05
…/cable#105

Wraps open_subscribe_connection's pubsub block in a reconnect loop with backoff
so a transient Redis death (restart, network blip, idle reap) no longer kills
message dispatch permanently.

Tracks active stream subscriptions in @subscribed_channels and replays them
onto the fresh connection inside a spawned fiber, so per-stream channels
(e.g. chat_1) keep flowing after a reconnect — not just the internal control
channel.

subscribe / unsubscribe now swallow IO::Error and rely on the recovery loop's
replay; close_subscribe_connection sets a @shutting_down flag so a clean
Cable.restart does not get interpreted as a transient failure and trigger
reconnect spam.

Adds an integration spec that issues CLIENT KILL TYPE pubsub against a real
Redis (mirroring the issue's reproduction) and asserts message dispatch
resumes after the kill. The spec is timing-sensitive by design — the comment
flags where to bump the post-kill sleep if CI flakes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Brings up Redis with a healthcheck and a Crystal container with the
source bind-mounted, so contributors can run `shards install`,
`crystal spec`, format, and ameba without a local Crystal install.
CRYSTAL_VERSION overrides the image tag to match the CI matrix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CLIENT KILL TYPE pubsub (the cable#105 repro) closes the socket
server-side. jgaskins/redis Subscription#call sees `read?` return
nil and exits its loop cleanly — no IO::Error is raised — so the
old `break # subscribe returned cleanly` path treated it as success
and never reconnected. Move the reconnect/sleep/new-connection code
out of the rescue and run it whenever the block returns, gated on
@shutting_down so a real teardown still exits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread src/cable-redis.cr
The previous reconnect loop wrapped only the pubsub block in a rescue.
The reopen call -- Redis::Connection.new(...) -- ran unprotected, so the
first DNS/connect failure during an outage (e.g. Redis container down,
network blip) crashed the subscribe fiber via Socket::Addrinfo::Error
and dispatch never recovered, even after the backend came back.

Widen the inner rescue to cover Socket::Error as well, and wrap the
reopen call in its own rescue that swallows DNS/connect failures and
falls through to the next backoff cycle. on_error is deliberately not
fired for reopen failures so a multi-minute outage doesn't flood the
operator's error tracker (the loop ticks once per second).

Adds a regression spec that flips Cable.settings.url to an unreachable
port across multiple backoff cycles, then restores it and verifies the
subscribe fiber resumed and replay_tracked_subscriptions re-registered
the test channel.

russ commented May 11, 2026

Author

Follow-up: subscribe fiber still crashes on a full backend outage (fixed in 35cf906)

Integration-tested this PR against a real Crystal/Lucky deployment (Crystal 1.14.1, two browser tabs subscribed to a live chat room, backend holding 10 active pubsub channels). The recovery loop survives CLIENT KILL TYPE pubsub cleanly — exactly as the spec asserts — but does not survive a full Redis outage where the reconnect attempt itself raises before any I/O.

Failure mode

The rescue e : IO::Error at src/cable-redis.cr:83 only covers the subscribe(channel) do … end block. The bare Redis::Connection.new(URI.parse(Cable.settings.url)) at line 92 runs after that rescue closes, so any exception there escapes the loop and kills the fiber:

Unhandled exception in spawn(name: Cable::Server - subscribe):
  Hostname lookup for redis failed: No address found (Socket::Addrinfo::Error)
  from lib/redis/src/connection.cr:34:7 in 'initialize'
  from lib/redis/src/connection.cr:31:5 in 'new'
  from lib/cable-redis/src/cable-redis.cr:92:28 in 'open_subscribe_connection'

Once that fiber is dead, no further reconnect attempts ever fire — even after the backend comes back. Browser-side WS clients reconnect, in-server bot messages still transmit, but cross-process Cable::Server.publish writes land on a pubsub channel with no subscriber. From the operator's perspective: chat goes permanently dead until the process is restarted — same end state as the bug this PR was meant to eliminate.

Fix

Two-part patch in 35cf906:

  1. Widen the existing rescue from IO::Error to IO::Error | Socket::Error so a connection-level error mid-pubsub-block also routes through the reconnect path.
  2. Wrap the reopen line itself in a begin/rescue that swallows IO::Error | Socket::Error and falls through to the next loop iteration. The dead @redis_subscribe is intentionally left in place: the next iteration's subscribe call raises IO::Error immediately, is caught by the widened rescue from step 1, and the loop backs off and retries the reopen.
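The second part of the patch can be sketched in isolation. This is a simplified illustration, not the code in 35cf906: Conn and reopen_or_keep are hypothetical names, and the raised Socket::Error stands in for the DNS failure in the trace above. A failed reopen keeps the old connection and returns control to the loop, so the loop itself never dies.

```crystal
require "socket"

# Hypothetical connection handle used only for this sketch.
record Conn, label : String

# Returns the freshly opened connection, or the old one if reopening fails.
def reopen_or_keep(current : Conn, &block) : Conn
  yield
rescue ex : IO::Error | Socket::Error
  # Swallowed on purpose: the caller backs off and retries on its next iteration.
  current
end

failures_left = 3 # simulate three failed reopen attempts during an outage
attempts = 0
conn = Conn.new("dead")

until conn.label == "fresh"
  attempts += 1
  conn = reopen_or_keep(conn) do
    if failures_left > 0
      failures_left -= 1
      raise Socket::Error.new("Hostname lookup for redis failed")
    end
    Conn.new("fresh")
  end
  sleep 10.milliseconds # stands in for the reconnect backoff
end
```

With three simulated failures, the loop makes four attempts and ends up holding the fresh connection without any exception escaping.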

on_error is deliberately not invoked on reopen failures. The loop ticks once per second, so routing every tick to Bugsnag/Sentry would flood operators' error trackers for the whole outage. Retry attempts are logged via Cable::Logger.warn instead.

Regression spec

The new spec flips Cable.settings.url to redis://127.0.0.1:1 (a port nothing listens on), kills the live pubsub socket, stays in the failing-reopen state for SUBSCRIBE_RECONNECT_BACKOFF * 3 + 500ms, then restores the real URL and asserts that a fresh publish reaches the test socket.
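For a sense of scale, the dwell time works out as follows. This assumes the constant is a Time::Span of one second, which is inferred from the "reconnecting in 1.0s" log line and is not confirmed by the source.

```crystal
# Assumed value: taken from the "reconnecting in 1.0s" log line, not from the source.
SUBSCRIBE_RECONNECT_BACKOFF = 1.second

# Time the spec spends in the failing-reopen state: three backoff cycles plus slack.
dwell = SUBSCRIBE_RECONNECT_BACKOFF * 3 + 500.milliseconds

puts dwell.total_seconds
```

Under that assumption the spec spends 3.5 seconds with the backend unreachable, which covers at least three failed reopen attempts.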

Red → green proof — same spec, both runs against this PR's branch:

# Without the patch (PR @ 022526d):
3 examples, 1 failures, 0 errors, 0 pending
Failures:
  1) Cable::RedisBackend subscribe-connection recovery (cable-cr/cable#105)
     survives reopen failures and recovers when the backend comes back
     Failure/Error: socket.messages.any?(&.includes?("after-recovery")).should be_true
       Expected: true
            got: false

# With the patch (PR @ 35cf906):
3 examples, 0 failures, 0 errors, 0 pending

Live integration test

Setup: a Crystal/Lucky app using Cable with Cable.settings.url = redis://redis:6379, two browser tabs (user A and user B) both subscribed to the same chat room. Pre-stop, the backend held 10 active pubsub subscriptions covering chat, stream-event, whisper, and Cable's internal control channels.

Timeline (clock = host wall time):

14:55:39   docker compose stop redis
           Pre-stop: PUBSUB CHANNELS shows 10 channels, 1 subscriber TCP

14:55:40 - 14:56:17  reconnect loop ticks every 1s for ~37s:
             ▸ Cable::RedisBackend subscribe disconnected; reconnecting in 1.0s
             ▸ Socket::Addrinfo::Error: Hostname lookup for redis failed: No address found
               …lib/cable-redis/src/cable-redis.cr:95:30 in 'open_subscribe_connection'
           15+ iterations caught by the new rescue.
           Zero "Unhandled exception" lines. Fiber stays alive.

14:56:17   docker compose start redis

14:56:22   On Redis: new subscriber client id=10, age=10s, sub=10, cmd=subscribe
           PUBSUB CHANNELS shows all 10 original channels
           (replay_tracked_subscriptions re-registered each one)

14:56:30   Fresh message from user A → user B receives in real-time

End-to-end recovery took about 5 seconds after Redis returned, with no client-side reconnect, no backend restart, and no operator action beyond starting Redis. Per the timeline, the chat outage is 38 seconds of Redis downtime (14:55:39 to 14:56:17) plus roughly 5 seconds of reconnect and replay.

Against this PR's previous tip (022526d), the same scenario crashed the subscribe fiber on the first iteration and chat never recovered — bringing Redis back did nothing. The unfixed spec output above mirrors this behavior.
