
Backport: Core: Add pipeline send timeout and fail pending requests during recovery (#5755) #5960

Open
jduo wants to merge 1 commit into release-2.2 from jduo/pipeline-timeout-r22

@jduo (Collaborator) commented May 15, 2026

Summary

Backport of #5755 to release-2.2. Adds a 100ms timeout on pipeline.send() to detect dead connections (half-open TCP) and fails pending requests immediately when entering recovery.

Issue link

Closes #5715
Closes #5716

Features / Behaviour Changes

  • pipeline.send() now times out after 100ms with FatalSendError instead of blocking forever when the bounded channel (50 slots) is full due to a dead connection
  • poll_recover() properly polls JoinHandle instead of using now_or_never(), fixing waker registration
  • poll_flush() returns Pending during recovery instead of busy-spinning in a loop{}
  • New fail_pending_requests() drains pending request queue with ClientError("Connection in recovery") when recovery is in progress
  • Reconnection operations are non-blocking (tokio::spawn)
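The first bullet above can be illustrated with a minimal, synchronous sketch using only the standard library. It is not the code from this PR: the actual change applies to an async pipeline, and the `SendError` type and `send_with_timeout` helper below are hypothetical names chosen for illustration (the PR's `FatalSendError` definition is not shown here). The idea is the same: a send into a full bounded channel gives up after a deadline instead of blocking forever.

```rust
use std::sync::mpsc::{SyncSender, TrySendError};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical error type standing in for the PR's `FatalSendError`.
#[derive(Debug, PartialEq)]
pub enum SendError {
    /// The channel stayed full for the whole timeout; the peer is presumed dead.
    TimedOut,
    /// The receiving half of the channel was dropped.
    Disconnected,
}

/// Try to enqueue `item` on a bounded channel, giving up once `timeout` elapses.
/// Polls with `try_send` so the caller is never blocked indefinitely.
pub fn send_with_timeout<T>(
    tx: &SyncSender<T>,
    mut item: T,
    timeout: Duration,
) -> Result<(), SendError> {
    let deadline = Instant::now() + timeout;
    loop {
        match tx.try_send(item) {
            Ok(()) => return Ok(()),
            Err(TrySendError::Disconnected(_)) => return Err(SendError::Disconnected),
            Err(TrySendError::Full(returned)) => {
                if Instant::now() >= deadline {
                    return Err(SendError::TimedOut);
                }
                // Take the item back and retry after a short pause.
                item = returned;
                thread::sleep(Duration::from_millis(1));
            }
        }
    }
}
```

With a channel created by `std::sync::mpsc::sync_channel(50)` and a 100 ms timeout, a caller would see `SendError::TimedOut` once all 50 slots are occupied and nothing is draining them, which is the signal that could be used to enter recovery.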

Limitations

Without this fix, a single dead shard blocks the pipeline indefinitely, preventing recovery from triggering and causing OOM under sustained load.
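As a rough illustration of why failing queued requests on recovery entry bounds memory, the sketch below drains a pending queue and answers every waiter with an error. The `PendingRequest`, `RequestError`, and `fail_pending` names are hypothetical and the types are simplified; the PR's own `fail_pending_requests()` is not reproduced here.

```rust
use std::collections::VecDeque;
use std::sync::mpsc::Sender;

/// Hypothetical error type; the PR reports a `ClientError` with a message.
#[derive(Debug, PartialEq)]
pub enum RequestError {
    Client(String),
}

/// Hypothetical pending request: only the reply channel matters for this sketch.
pub struct PendingRequest {
    pub reply: Sender<Result<String, RequestError>>,
}

/// Drain the queue, failing every pending request with `reason`.
/// Returns how many requests were failed.
pub fn fail_pending(queue: &mut VecDeque<PendingRequest>, reason: &str) -> usize {
    let mut failed = 0;
    for req in queue.drain(..) {
        // A send error only means the caller stopped waiting; ignore it.
        let _ = req.reply.send(Err(RequestError::Client(reason.to_string())));
        failed += 1;
    }
    failed
}
```

Because the queue is emptied as soon as recovery starts, callers get a prompt error rather than waiting on a dead connection, and the queue cannot keep growing while the shard is unreachable.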

Testing

Validated on EC2 against ElastiCache (3-shard CRR cluster):

  • iptables block on one primary for 30s → client recovers within 5s of block being lifted
  • Previously (without fix): client stuck indefinitely, never recovered

cargo check, cargo clippy, cargo fmt all pass.

Checklist

Before submitting the PR make sure the following are checked:

  • This Pull Request is related to one issue.
  • Commit message has a detailed description of what changed and why.
  • Tests are added or updated.
  • CHANGELOG.md and documentation files are updated.
  • Linters have been run (make *-lint targets) and Prettier has been run (make prettier-fix).
  • Destination branch is correct - main or release
  • Create merge commit if merging release branch into main, squash otherwise.

…uring recovery

Backport of #5755 to release-2.2. Adds a 100ms timeout on pipeline.send()
to detect dead connections and fails pending requests immediately when
entering recovery.

Key changes:
- pipeline.send() times out after 100ms with FatalSendError instead of
  blocking forever when the bounded channel is full
- poll_recover() properly polls JoinHandle instead of now_or_never()
- poll_flush() returns Pending during recovery instead of busy-spinning
- fail_pending_requests() drains pending requests with ClientError on
  recovery entry
- Reconnection operations are non-blocking (tokio::spawn)

Without this fix, a single dead shard blocks the pipeline indefinitely,
preventing recovery and causing OOM under sustained load.

Validated on EC2: iptables block test shows client recovers within 5s
of block being lifted (previously stuck indefinitely).

Signed-off-by: James Duong <duong.james@gmail.com>

@jduo requested a review from a team as a code owner May 15, 2026 21:51

@alexr-bq (Collaborator) left a comment

Approved pending fixing CI
