Auto-reconnect subscribe connection on backend failure. Refs cable-cr/cable#105 #1
russ wants to merge 4 commits
…/cable#105
Wraps open_subscribe_connection's pubsub block in a reconnect loop with backoff so a transient Redis death (restart, network blip, idle reap) no longer kills message dispatch permanently.
Tracks active stream subscriptions in @subscribed_channels and replays them onto the fresh connection inside a spawned fiber, so per-stream channels (e.g. chat_1) keep flowing after a reconnect, not just the internal control channel.
subscribe / unsubscribe now swallow IO::Error and rely on the recovery loop's replay; close_subscribe_connection sets a @shutting_down flag so a clean Cable.restart is not interpreted as a transient failure and does not trigger reconnect spam.
Adds an integration spec that issues CLIENT KILL TYPE pubsub against a real Redis (mirroring the issue's reproduction) and asserts message dispatch resumes after the kill. The spec is timing-sensitive by design; the comment flags where to bump the post-kill sleep if CI flakes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
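The track-and-replay idea described in this commit can be illustrated with a minimal, self-contained Crystal sketch. `FakeConnection` and `SubscriptionTracker` are hypothetical names invented for illustration; only `@subscribed_channels` comes from the commit message. The real change may differ in detail (for example, it replays inside a spawned fiber and talks to an actual Redis connection).

```crystal
# Hypothetical stand-in for a pubsub connection; not the jgaskins/redis API.
class FakeConnection
  getter subscribed = [] of String

  def subscribe(channel : String) : Nil
    @subscribed << channel
  end
end

# Remembers every active stream identifier so it can be re-registered later.
class SubscriptionTracker
  def initialize
    @subscribed_channels = Set(String).new
    @lock = Mutex.new
  end

  def track(channel : String) : Nil
    @lock.synchronize { @subscribed_channels << channel }
  end

  def untrack(channel : String) : Nil
    @lock.synchronize { @subscribed_channels.delete(channel) }
  end

  # Re-subscribes every tracked channel on a freshly opened connection.
  def replay(connection : FakeConnection) : Nil
    # Copy under the lock, then subscribe outside it.
    channels = @lock.synchronize { @subscribed_channels.to_a }
    channels.each { |channel| connection.subscribe(channel) }
  end
end

tracker = SubscriptionTracker.new
tracker.track("chat_1")
tracker.track("chat_2")
tracker.untrack("chat_2")

fresh = FakeConnection.new
tracker.replay(fresh)
p fresh.subscribed # => ["chat_1"]
```

Keeping the set as the source of truth means a failed `subscribe` call during an outage does not lose the channel: the next successful reconnect replays it.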
Brings up Redis with a healthcheck and a Crystal container with the source bind-mounted, so contributors can run `shards install`, `crystal spec`, format, and ameba without a local Crystal install. CRYSTAL_VERSION overrides the image tag to match the CI matrix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CLIENT KILL TYPE pubsub (the cable#105 repro) closes the socket server-side. jgaskins/redis Subscription#call sees `read?` return nil and exits its loop cleanly — no IO::Error is raised — so the old `break # subscribe returned cleanly` path treated it as success and never reconnected. Move the reconnect/sleep/new-connection code out of the rescue and run it whenever the block returns, gated on @shutting_down so a real teardown still exits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
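The control-flow change in this commit can be sketched as follows, under stated assumptions. `FakeSubscription` and `Subscriber` are hypothetical stand-ins, not the PR's code or the jgaskins/redis API. The point is only that the reconnect step sits after the `begin`/`rescue` block, so it runs whether the listen call raised or simply returned.

```crystal
# Hypothetical stand-in for a pubsub subscription. `listen` returns normally
# once its scripted messages run out, imitating a read loop that sees the
# socket close and exits without raising.
class FakeSubscription
  def initialize(@messages : Array(String))
  end

  def listen(&) : Nil
    @messages.each { |message| yield message }
  end
end

class Subscriber
  getter received = [] of String
  getter connects = 0

  def initialize(@batches : Array(Array(String)), @backoff : Time::Span = 10.milliseconds)
    @shutting_down = false
  end

  def run : Nil
    loop do
      begin
        open_connection.listen { |message| @received << message }
        # Reaching this point means the block returned cleanly.
      rescue IO::Error
        # Error path: falls through to the same recovery step below.
      end
      # Recovery runs on both paths, gated on the shutdown flag.
      break if @shutting_down
      sleep @backoff
    end
  end

  private def open_connection : FakeSubscription
    @connects += 1
    batch = @batches.shift? || [] of String
    # Demo only: request shutdown once the scripted batches are used up.
    @shutting_down = true if @batches.empty?
    FakeSubscription.new(batch)
  end
end

subscriber = Subscriber.new([["a"], ["b"]])
subscriber.run
p subscriber.received # => ["a", "b"]
p subscriber.connects # => 2
```

In the old shape, a clean return hit a `break` and the loop ended after the first batch; here the second connection is opened and `"b"` is still delivered.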
The previous reconnect loop wrapped only the pubsub block in a rescue. The reopen call -- Redis::Connection.new(...) -- ran unprotected, so the first DNS/connect failure during an outage (e.g. Redis container down, network blip) crashed the subscribe fiber via Socket::Addrinfo::Error and dispatch never recovered, even after the backend came back. Widen the inner rescue to cover Socket::Error as well, and wrap the reopen call in its own rescue that swallows DNS/connect failures and falls through to the next backoff cycle. on_error is deliberately not fired for reopen failures so a multi-minute outage doesn't flood the operator's error tracker (the loop ticks once per second). Adds a regression spec that flips Cable.settings.url to an unreachable port across multiple backoff cycles, then restores it and verifies the subscribe fiber resumed and replay_tracked_subscriptions re-registered the test channel.
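A minimal sketch of the second part of this fix, assuming a hypothetical `Reopener` class whose `connect` method fails a fixed number of times before succeeding. It is not the PR's code. It only shows the reopen call wrapped in its own rescue so a connect failure falls through to the next backoff cycle instead of crashing the fiber.

```crystal
require "socket"

class Reopener
  getter attempts = 0

  def initialize(@failures_before_success : Int32, @backoff : Time::Span = 10.milliseconds)
  end

  # Retries until a connection opens. Failures are swallowed silently here,
  # mirroring the commit's choice not to fire on_error for reopen failures.
  def reopen_with_backoff : String
    loop do
      begin
        return connect
      rescue Socket::Error
        # Covers Socket::Addrinfo::Error and Socket::ConnectError too,
        # since both inherit from Socket::Error.
        sleep @backoff
      end
    end
  end

  # Hypothetical stand-in for opening a backend connection.
  private def connect : String
    @attempts += 1
    raise Socket::Error.new("connection refused") if @attempts <= @failures_before_success
    "connection ##{@attempts}"
  end
end

reopener = Reopener.new(3)
p reopener.reopen_with_backoff # => "connection #4"
p reopener.attempts            # => 4
```

The sketch uses a 10 ms backoff so it finishes quickly; the commit message describes the real loop as ticking once per second.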
Follow-up: subscribe fiber still crashes on a full backend outage (fixed in 35cf906)
Integration-tested this PR against a real Crystal/Lucky deployment (Crystal 1.14.1, two browser tabs subscribed to a live chat room, backend holding 10 active pubsub channels). The recovery loop survives …
Failure mode
The …
Once that fiber is dead, no further reconnect attempts ever fire, even after the backend comes back. Browser-side WS clients reconnect, in-server bot messages still transmit, but cross-process …
Fix
Two-part patch in 35cf906:
Regression spec
The new spec flips …
Red → green proof (same spec, both runs against this PR's branch):
Live integration test
Setup: a Crystal/Lucky app using Cable with …
Timeline (clock = host wall time):
End-to-end recovery: ~5 seconds after Redis returned, with no client-side reconnect, no backend restart, no operator action beyond starting Redis. The 38-second total chat outage maps cleanly to Redis downtime + reconnect/replay.
Against this PR's previous tip (022526d), the same scenario crashed the subscribe fiber on the first iteration and chat never recovered; bringing Redis back did nothing. The unfixed spec output above mirrors this behavior.
Summary
Fixes the WebSocket close-1006 cascade in cable-cr/cable#105 from the backend side: when the Redis pubsub TCP dies (idle reap, `CLIENT KILL TYPE pubsub`, network blip), `open_subscribe_connection` now rebuilds the subscribe connection on a backoff and replays tracked subscriptions onto it, instead of permanently killing message dispatch.
- `@subscribed_channels` set tracks every active stream identifier so a fresh connection can replay them. The internal control channel is re-subscribed implicitly when the loop re-enters the pubsub block.
- `@shutting_down` flag, set by `close_subscribe_connection`, lets a clean teardown exit the recovery loop instead of looping forever.
- `subscribe` / `unsubscribe` swallow `IO::Error` because the recovery loop's replay is the source of truth.
- The recovery loop also runs when the pubsub block returns cleanly, not only on `IO::Error`: `jgaskins/redis` exits its read loop normally when `read?` returns nil, which is what server-side closes (the exact `CLIENT KILL TYPE pubsub` repro) look like.
Companion PR on the cable side: cable-cr/cable#106, which widens the rescue in `Cable::Connection#initialize` so a dead-backend `IO::Error` during `subscribe_to_internal_channel` becomes a clean 1011 `InternalServerError` close instead of a bare TCP teardown.
Also adds a `compose.yaml` so contributors can run the suite without a local Crystal install (`docker compose run --rm app crystal spec`).
Test plan
- `crystal spec` passes on Crystal 1.10.0 (CI floor) and latest (1.20.1)
- `crystal tool format --check` clean on both
- `./bin/ameba` clean on both
- Integration spec issues `CLIENT KILL TYPE pubsub` against a live Redis and verifies a fresh publish reaches the client through the new connection