Backport: Core: Add pipeline send timeout and fail pending requests during recovery (#5755) #5960
Open
jduo wants to merge 1 commit into
…uring recovery
Backport of #5755 to release-2.2. Adds a 100ms timeout on pipeline.send() to detect dead connections and fails pending requests immediately when entering recovery.
Key changes:
- pipeline.send() times out after 100ms with FatalSendError instead of blocking forever when the bounded channel is full
- poll_recover() properly polls JoinHandle instead of now_or_never()
- poll_flush() returns Pending during recovery instead of busy-spinning
- fail_pending_requests() drains pending requests with ClientError on recovery entry
- Reconnection operations are non-blocking (tokio::spawn)
Without this fix, a single dead shard blocks the pipeline indefinitely, preventing recovery and causing OOM under sustained load.
Validated on EC2: iptables block test shows client recovers within 5s of block being lifted (previously stuck indefinitely).
Signed-off-by: James Duong <duong.james@gmail.com>
yipin-chen
approved these changes
May 19, 2026
alexr-bq
approved these changes
May 19, 2026
Collaborator
alexr-bq
left a comment
Approved pending fixing CI
jeremyprime
approved these changes
May 20, 2026
Summary
Backport of #5755 to release-2.2. Adds a 100ms timeout on pipeline.send() to detect dead connections (half-open TCP) and fails pending requests immediately when entering recovery.
Issue link
Closes #5715
Closes #5716
Features / Behaviour Changes
- pipeline.send() now times out after 100ms with FatalSendError instead of blocking forever when the bounded channel (50 slots) is full due to a dead connection
- poll_recover() properly polls JoinHandle instead of using now_or_never(), fixing waker registration
- poll_flush() returns Pending during recovery instead of busy-spinning in a loop {}
- fail_pending_requests() drains the pending request queue with ClientError("Connection in recovery") when recovery is in progress
- Reconnection operations are non-blocking (tokio::spawn)
Limitations
Without this fix, a single dead shard blocks the pipeline indefinitely, preventing recovery from triggering and causing OOM under sustained load.
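The two behaviours that address this limitation, a bounded send that gives up after a deadline and a drain that fails every queued request, can be sketched roughly as follows. This is a synchronous, std-only analogue and not the PR's code: the real change is async and built on tokio, and `SketchError`, `send_with_timeout`, and the `VecDeque` of responders are hypothetical names and shapes chosen for illustration. Only the 100ms timeout, the 50-slot channel, and the "Connection in recovery" message come from the description above.

```rust
use std::collections::VecDeque;
use std::sync::mpsc::{Sender, SyncSender, TrySendError};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the PR's `FatalSendError` and `ClientError`,
/// whose real definitions are not shown in this description.
#[derive(Debug, PartialEq)]
enum SketchError {
    FatalSend,
    Client(String),
}

/// Try to enqueue `item` on a bounded channel, giving up once `timeout`
/// has elapsed instead of blocking for as long as the channel stays full.
fn send_with_timeout<T>(
    tx: &SyncSender<T>,
    mut item: T,
    timeout: Duration,
) -> Result<(), SketchError> {
    let deadline = Instant::now() + timeout;
    loop {
        match tx.try_send(item) {
            Ok(()) => return Ok(()),
            // The receiving side is gone: no point in retrying.
            Err(TrySendError::Disconnected(_)) => return Err(SketchError::FatalSend),
            Err(TrySendError::Full(returned)) => {
                if Instant::now() >= deadline {
                    return Err(SketchError::FatalSend);
                }
                // Take the item back and retry after a short pause.
                item = returned;
                thread::sleep(Duration::from_millis(1));
            }
        }
    }
}

/// Drain every queued responder and complete it with an error, so callers
/// stop waiting on a connection that is being rebuilt.
/// Returns how many requests were failed.
fn fail_pending_requests(
    pending: &mut VecDeque<Sender<Result<String, SketchError>>>,
) -> usize {
    let mut failed = 0;
    while let Some(responder) = pending.pop_front() {
        // A send error here only means the caller already dropped its receiver.
        let _ = responder.send(Err(SketchError::Client(
            "Connection in recovery".to_string(),
        )));
        failed += 1;
    }
    failed
}
```

With a 50-slot `sync_channel` whose receiver never drains, the first 50 calls to `send_with_timeout` succeed and the 51st returns `Err(SketchError::FatalSend)` once the 100ms deadline passes. In async code the polling loop would presumably be replaced by wrapping the channel send in a timer future, but that is an assumption about the implementation rather than something this description shows.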
Testing
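Validated on EC2 against ElastiCache (3-shard CRR cluster):
- iptables block test: the client recovers within 5s of the block being lifted (previously stuck indefinitely)
cargo check, cargo clippy, and cargo fmt all pass.
Checklist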
Before submitting the PR make sure the following are checked:
make *-lint targets) and Prettier has been run (make prettier-fix).