Auto-reconnect subscribe connection on backend failure. Refs cable-cr/cable#105 by russ · Pull Request #1 · cable-cr/cable-redis

russ · 2026-05-07T22:32:17Z

Summary

Fixes the WebSocket close-1006 cascade in cable-cr/cable#105 from the
backend side: when the Redis pubsub TCP connection dies (idle reap,
CLIENT KILL TYPE pubsub, network blip), open_subscribe_connection now
rebuilds the subscribe connection after a backoff and replays tracked
subscriptions onto it, instead of permanently killing message dispatch.

- @subscribed_channels set tracks every active stream identifier so a
  fresh connection can replay them. The internal control channel is
  re-subscribed implicitly when the loop re-enters the pubsub block.
- @shutting_down flag, set by close_subscribe_connection, lets a
  clean teardown exit the recovery loop instead of looping forever.
- subscribe / unsubscribe swallow IO::Error because the recovery
  loop's replay is the source of truth.
- Fall through to the reconnect path when the pubsub block returns
  cleanly, not only on IO::Error; jgaskins/redis exits its read
  loop normally when read? returns nil, which is what server-side
  closes (the exact CLIENT KILL TYPE pubsub repro) look like.
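
The four points above can be modeled in a few lines. The following is a minimal Python sketch of the control flow only, not the shard's Crystal code: `FakeConnection`, `Backend`, `run_pubsub_block`, and the `outcomes` script are hypothetical names invented for illustration, and `ConnectionError` stands in for IO::Error.

```python
import threading


class FakeConnection:
    """Hypothetical stand-in for a pubsub connection (not the jgaskins/redis API)."""

    def __init__(self, outcome):
        self.outcome = outcome   # "eof", "error", or a callable run inside the block
        self.subscribed = []

    def subscribe(self, channel):
        self.subscribed.append(channel)

    def run_pubsub_block(self):
        if self.outcome == "error":
            raise ConnectionError("socket died mid-read")
        if callable(self.outcome):
            self.outcome()
        # "eof": the read loop simply ends and the block returns cleanly


class Backend:
    def __init__(self, connect, backoff=0.0):
        self.connect = connect
        self.backoff = backoff
        self.subscribed_channels = set()          # every active stream identifier
        self.shutting_down = threading.Event()    # set only by close()
        self.connection = connect()

    def subscribe(self, channel):
        self.subscribed_channels.add(channel)     # track first, so replay sees it
        try:
            self.connection.subscribe(channel)
        except ConnectionError:
            pass                                  # replay is the source of truth

    def close(self):
        self.shutting_down.set()

    def run(self):
        while not self.shutting_down.is_set():
            try:
                self.connection.run_pubsub_block()
            except ConnectionError:
                pass                              # same path as a clean return
            if self.shutting_down.is_set():
                break                             # clean teardown, no reconnect
            self.shutting_down.wait(self.backoff)  # backoff that wakes on shutdown
            self.connection = self.connect()
            for channel in sorted(self.subscribed_channels):
                self.connection.subscribe(channel)  # replay tracked subscriptions


# Scripted demo: clean EOF, then an error, then a teardown.
outcomes = ["eof", "error"]
made = []


def connect():
    outcome = outcomes.pop(0) if outcomes else backend.close
    conn = FakeConnection(outcome)
    made.append(conn)
    return conn


backend = Backend(connect)
backend.subscribe("chat_1")
backend.run()
```

In this model the clean-return case and the error case reach the same reconnect code, and only the `shutting_down` check distinguishes a teardown from a failure. Each replacement connection ends up subscribed to `chat_1` without the caller doing anything.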

Companion PR on the cable side: cable-cr/cable#106 — widens the
rescue in Cable::Connection#initialize so a dead-backend IO::Error
during subscribe_to_internal_channel becomes a clean 1011
InternalServerError close instead of a bare TCP teardown.

Also adds a compose.yaml so contributors can run the suite without
a local Crystal install (docker compose run --rm app crystal spec).

Test plan

- crystal spec passes on Crystal 1.10.0 (CI floor) and latest (1.20.1)
- crystal tool format --check clean on both
- ./bin/ameba clean on both
- New integration spec issues CLIENT KILL TYPE pubsub against a
  live Redis and verifies a fresh publish reaches the client
  through the new connection
- CI matrix (1.10.0 / latest / nightly) green on this PR

…/cable#105

Wraps open_subscribe_connection's pubsub block in a reconnect loop with backoff so a transient Redis death (restart, network blip, idle reap) no longer kills message dispatch permanently.

Tracks active stream subscriptions in @subscribed_channels and replays them onto the fresh connection inside a spawned fiber, so per-stream channels (e.g. chat_1) keep flowing after a reconnect, not just the internal control channel.

subscribe / unsubscribe now swallow IO::Error and rely on the recovery loop's replay; close_subscribe_connection sets a @shutting_down flag so a clean Cable.restart does not get interpreted as a transient failure and trigger reconnect spam.

Adds an integration spec that issues CLIENT KILL TYPE pubsub against a real Redis (mirroring the issue's reproduction) and asserts message dispatch resumes after the kill. The spec is timing-sensitive by design; the comment flags where to bump the post-kill sleep if CI flakes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Brings up Redis with a healthcheck and a Crystal container with the source bind-mounted, so contributors can run `shards install`, `crystal spec`, format, and ameba without a local Crystal install. CRYSTAL_VERSION overrides the image tag to match the CI matrix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CLIENT KILL TYPE pubsub (the cable#105 repro) closes the socket server-side. jgaskins/redis Subscription#call sees `read?` return nil and exits its loop cleanly — no IO::Error is raised — so the old `break # subscribe returned cleanly` path treated it as success and never reconnected. Move the reconnect/sleep/new-connection code out of the rescue and run it whenever the block returns, gated on @shutting_down so a real teardown still exits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
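
The behavior described in this commit (a peer-side close shows up as a normal end of the read loop, not as an exception) is general socket behavior and can be reproduced outside Crystal. The sketch below is a Python analogy using a local socket pair. `read_loop` is a hypothetical helper, and the empty `recv` result plays the role that a nil `read?` plays in jgaskins/redis.

```python
import socket


def read_loop(sock):
    """Collect chunks until the peer closes; an orderly close is not an error."""
    chunks = []
    while True:
        chunk = sock.recv(1024)
        if not chunk:        # b"" signals EOF: the peer closed its end
            return chunks    # the loop exits normally, nothing is raised
        chunks.append(chunk)


server_side, client_side = socket.socketpair()
server_side.sendall(b"message-1")
server_side.close()          # stands in for the server killing the client

try:
    received = read_loop(client_side)
    ended_cleanly = True     # reached even though the connection is gone
except OSError:
    received = []
    ended_cleanly = False
finally:
    client_side.close()
```

A reconnect that lives only in the exception handler never runs in this situation, which is why the reconnect code has to execute whenever the block returns.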

The previous reconnect loop wrapped only the pubsub block in a rescue. The reopen call -- Redis::Connection.new(...) -- ran unprotected, so the first DNS/connect failure during an outage (e.g. Redis container down, network blip) crashed the subscribe fiber via Socket::Addrinfo::Error and dispatch never recovered, even after the backend came back. Widen the inner rescue to cover Socket::Error as well, and wrap the reopen call in its own rescue that swallows DNS/connect failures and falls through to the next backoff cycle. on_error is deliberately not fired for reopen failures so a multi-minute outage doesn't flood the operator's error tracker (the loop ticks once per second). Adds a regression spec that flips Cable.settings.url to an unreachable port across multiple backoff cycles, then restores it and verifies the subscribe fiber resumed and replay_tracked_subscriptions re-registered the test channel.

russ · 2026-05-11T22:03:06Z

Follow-up: subscribe fiber still crashes on a full backend outage (fixed in `35cf906`)

Integration-tested this PR against a real Crystal/Lucky deployment (Crystal 1.14.1, two browser tabs subscribed to a live chat room, backend holding 10 active pubsub channels). The recovery loop survives CLIENT KILL TYPE pubsub cleanly — exactly as the spec asserts — but does not survive a full Redis outage where the reconnect attempt itself raises before any I/O.

Failure mode

The rescue e : IO::Error at src/cable-redis.cr:83 only covers the subscribe(channel) do … end block. The bare Redis::Connection.new(URI.parse(Cable.settings.url)) at line 92 runs after that rescue closes, so any exception there escapes the loop and kills the fiber:

Unhandled exception in spawn(name: Cable::Server - subscribe):
  Hostname lookup for redis failed: No address found (Socket::Addrinfo::Error)
  from lib/redis/src/connection.cr:34:7 in 'initialize'
  from lib/redis/src/connection.cr:31:5 in 'new'
  from lib/cable-redis/src/cable-redis.cr:92:28 in 'open_subscribe_connection'

Once that fiber is dead, no further reconnect attempts ever fire — even after the backend comes back. Browser-side WS clients reconnect, in-server bot messages still transmit, but cross-process Cable::Server.publish writes land on a pubsub channel with no subscriber. From the operator's perspective: chat goes permanently dead until the process is restarted — same end state as the bug this PR was meant to eliminate.

Fix

Two-part patch in 35cf906:

1. Widen the existing rescue from IO::Error to IO::Error | Socket::Error so a connection-level error mid-pubsub-block also routes through the reconnect path.
2. Wrap the reopen line itself in begin/rescue that swallows IO::Error | Socket::Error and falls through to the next loop iteration. The dead @redis_subscribe is intentionally left in place: the next iteration's subscribe call will raise IO::Error immediately, get caught by the outer rescue, and we back off + retry the reopen.

on_error is deliberately not invoked on reopen failures. The loop ticks once per second for as long as the backend is down, so routing every tick to Bugsnag/Sentry would flood operators' error trackers for the duration of the outage. Retry attempts are visible via Cable::Logger.warn instead.
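
The shape of the two-part fix can be sketched as follows. This is a hedged Python model, not the code in 35cf906: `DeadConnection`, `LiveConnection`, `run_loop`, and the `state` dict are hypothetical. `OSError` (the parent of both `ConnectionError` and `socket.gaierror` in Python) stands in for IO::Error | Socket::Error, and the standard `logging` module stands in for Cable::Logger.

```python
import logging
import socket

log = logging.getLogger("reconnect-sketch")


class DeadConnection:
    """Hypothetical connection whose socket is already closed."""

    def run_pubsub_block(self):
        raise ConnectionError("connection is closed")


class LiveConnection:
    """Hypothetical healthy connection; signals the demo to stop once used."""

    def __init__(self, on_ready):
        self.on_ready = on_ready

    def run_pubsub_block(self):
        self.on_ready()


def run_loop(connection, connect, sleep, stopped):
    reopen_failures = 0
    while not stopped():
        try:
            connection.run_pubsub_block()
        except OSError as exc:               # part 1: widened rescue
            log.warning("subscribe disconnected: %s", exc)
        if stopped():
            break
        sleep()                              # backoff between attempts
        try:
            connection = connect()           # part 2: reopen has its own rescue
        except OSError as exc:
            # The dead connection stays in place; the next iteration fails fast
            # and retries. Only a warning is logged, no error-tracker hook runs.
            reopen_failures += 1
            log.warning("reopen failed, retrying after backoff: %s", exc)
    return reopen_failures


state = {"attempts": 0, "stopped": False}


def connect():
    state["attempts"] += 1
    if state["attempts"] <= 3:               # backend unreachable for three cycles
        raise socket.gaierror("Hostname lookup for redis failed")
    return LiveConnection(lambda: state.update(stopped=True))


failures = run_loop(
    DeadConnection(),
    connect,
    sleep=lambda: None,                      # no real delay in the demo
    stopped=lambda: state["stopped"],
)
```

The loop survives three failed reopen attempts and picks up the live connection on the fourth. If the reopen were left unprotected, the first `gaierror` would propagate out of `run_loop` and end it.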

Regression spec

The new spec flips Cable.settings.url to redis://127.0.0.1:1 (a port nothing listens on), kills the live pubsub socket, sits in the failing-reopen state for SUBSCRIBE_RECONNECT_BACKOFF * 3 + 500ms, then restores the real URL and asserts that a fresh publish reaches the test socket.

Red → green proof — same spec, both runs against this PR's branch:

# Without the patch (PR @ 022526d):
3 examples, 1 failures, 0 errors, 0 pending
Failures:
  1) Cable::RedisBackend subscribe-connection recovery (cable-cr/cable#105)
     survives reopen failures and recovers when the backend comes back
     Failure/Error: socket.messages.any?(&.includes?("after-recovery")).should be_true
       Expected: true
            got: false

# With the patch (PR @ 35cf906):
3 examples, 0 failures, 0 errors, 0 pending

Live integration test

Setup: a Crystal/Lucky app using Cable with Cable.settings.url = redis://redis:6379, two browser tabs (user A and user B) both subscribed to the same chat room. Pre-stop, the backend held 10 active pubsub subscriptions covering chat, stream-event, whisper, and Cable's internal control channels.

Timeline (clock = host wall time):

14:55:39   docker compose stop redis
           Pre-stop: PUBSUB CHANNELS shows 10 channels, 1 subscriber TCP

14:55:40 - 14:56:17  reconnect loop ticks every 1s for 30s:
             ▸ Cable::RedisBackend subscribe disconnected; reconnecting in 1.0s
             ▸ Socket::Addrinfo::Error: Hostname lookup for redis failed: No address found
               …lib/cable-redis/src/cable-redis.cr:95:30 in 'open_subscribe_connection'
           15+ iterations caught by the new rescue.
           Zero "Unhandled exception" lines. Fiber stays alive.

14:56:17   docker compose start redis

14:56:22   On Redis: new subscriber client id=10, age=10s, sub=10, cmd=subscribe
           PUBSUB CHANNELS shows all 10 original channels
           (replay_tracked_subscriptions re-registered each one)

14:56:30   Fresh message from user A → user B receives in real-time

End-to-end recovery: ~5 seconds after Redis returned, with no client-side reconnect, no backend restart, no operator action beyond starting Redis. Redis itself was down for 38 seconds (14:55:39 to 14:56:17), so the chat outage maps cleanly to Redis downtime + reconnect/replay.

Against this PR's previous tip (022526d), the same scenario crashed the subscribe fiber on the first iteration and chat never recovered — bringing Redis back did nothing. The unfixed spec output above mirrors this behavior.

russ and others added 3 commits May 7, 2026 15:05

russ wants to merge 4 commits into cable-cr:main from russ:reconnect-subscribe-on-backend-failure
