Backport: Core: Add pipeline send timeout and fail pending requests during recovery (#5755) by jduo · Pull Request #5960 · valkey-io/valkey-glide

jduo · 2026-05-15T21:51:02Z

Summary

Backport of #5755 to release-2.2. Adds a 100ms timeout on pipeline.send() to detect dead connections (half-open TCP) and fails pending requests immediately when entering recovery.
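As a rough illustration of the send timeout described above, here is a minimal std-only sketch. It is not the code from this PR: the actual change lives in the tokio-based core, and the names `send_with_timeout`, `SendError`, and `demo_stalled_consumer` are hypothetical. The sketch only shows the general idea of bounding how long a producer waits on a full bounded channel, using the 50-slot capacity and 100ms limit mentioned in this description.

```rust
use std::sync::mpsc::{sync_channel, SyncSender, TrySendError};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical error type for this sketch. The PR reports a `FatalSendError`,
/// whose real definition is not shown here.
#[derive(Debug, PartialEq)]
enum SendError {
    /// The channel stayed full until the deadline passed.
    Timeout,
    /// The receiving side was dropped.
    Disconnected,
}

/// Try to push `item` into a bounded channel, giving up after `timeout`
/// instead of blocking forever when the consumer has stalled.
fn send_with_timeout<T>(tx: &SyncSender<T>, item: T, timeout: Duration) -> Result<(), SendError> {
    let deadline = Instant::now() + timeout;
    let mut item = item;
    loop {
        match tx.try_send(item) {
            Ok(()) => return Ok(()),
            Err(TrySendError::Disconnected(_)) => return Err(SendError::Disconnected),
            Err(TrySendError::Full(returned)) => {
                if Instant::now() >= deadline {
                    return Err(SendError::Timeout);
                }
                // Take the item back and retry after a short pause.
                item = returned;
                thread::sleep(Duration::from_millis(1));
            }
        }
    }
}

/// Simulate a dead connection: fill all 50 slots while nothing drains the
/// receiver, then attempt one more send with a 100ms limit.
fn demo_stalled_consumer() -> Result<(), SendError> {
    let (tx, _rx) = sync_channel::<u32>(50);
    for i in 0..50 {
        tx.try_send(i).expect("channel still has free slots");
    }
    send_with_timeout(&tx, 50, Duration::from_millis(100))
}
```

In this sketch, `demo_stalled_consumer()` returns `Err(SendError::Timeout)` after roughly 100ms, whereas a plain blocking `send` would never return. An async implementation would more likely wrap the send future in a timer than poll in a loop.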

Issue link

Closes #5715
Closes #5716

Features / Behaviour Changes

pipeline.send() now times out after 100ms with FatalSendError instead of blocking forever when the bounded channel (50 slots) is full due to a dead connection
poll_recover() now polls the JoinHandle directly instead of calling now_or_never(); now_or_never() polls with a no-op waker, so the task was never woken when reconnection finished
poll_flush() returns Pending during recovery instead of busy-spinning in a loop {}
New fail_pending_requests() drains the pending request queue, failing each request with ClientError("Connection in recovery") when recovery is in progress
Reconnection operations are non-blocking (tokio::spawn)
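The fail-fast behaviour in the list above can be sketched as follows. This is a simplified, std-only illustration under stated assumptions, not the implementation from this PR: the `Pipeline`, `PendingRequest`, and `Reply` types and the `enqueue` and `enter_recovery` helpers are hypothetical, and a std `mpsc` channel stands in for whatever reply mechanism the real code uses. Only the method name `fail_pending_requests` and the error text come from the description.

```rust
use std::collections::VecDeque;
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical stand-in for the client error type named in the PR.
#[derive(Debug, Clone, PartialEq)]
struct ClientError(String);

type Reply = Result<String, ClientError>;

/// A request that has been queued and is waiting for its response.
struct PendingRequest {
    reply_to: Sender<Reply>,
}

struct Pipeline {
    pending: VecDeque<PendingRequest>,
    in_recovery: bool,
}

impl Pipeline {
    fn new() -> Self {
        Pipeline { pending: VecDeque::new(), in_recovery: false }
    }

    /// Queue a request and hand back the receiver the caller waits on.
    fn enqueue(&mut self) -> Receiver<Reply> {
        let (tx, rx) = channel();
        self.pending.push_back(PendingRequest { reply_to: tx });
        rx
    }

    /// Drain the queue while in recovery, answering every waiting caller
    /// with an error. Returns how many requests were failed.
    fn fail_pending_requests(&mut self) -> usize {
        if !self.in_recovery {
            return 0;
        }
        let mut failed = 0;
        while let Some(req) = self.pending.pop_front() {
            // The caller may already have given up; a failed send is ignored.
            let _ = req
                .reply_to
                .send(Err(ClientError("Connection in recovery".to_string())));
            failed += 1;
        }
        failed
    }

    /// Mark the connection as recovering and fail everything already queued.
    fn enter_recovery(&mut self) -> usize {
        self.in_recovery = true;
        self.fail_pending_requests()
    }
}
```

The point of draining eagerly is that callers get an error right away rather than waiting on a connection that may never answer, which also keeps the queue from growing under sustained load.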

Limitations

Without this fix, a single dead shard blocks the pipeline indefinitely, preventing recovery from triggering and causing OOM under sustained load.

Testing

Validated on EC2 against ElastiCache (3-shard CRR cluster):

iptables block on one primary for 30s → client recovers within 5s of block being lifted
Previously (without fix): client stuck indefinitely, never recovered

cargo check, cargo clippy, cargo fmt all pass.

Checklist

Before submitting the PR make sure the following are checked:

This Pull Request is related to one issue.
Commit message has a detailed description of what changed and why.
Tests are added or updated.
CHANGELOG.md and documentation files are updated.
Linters have been run (make *-lint targets) and Prettier has been run (make prettier-fix).
Destination branch is correct - main or release
Create merge commit if merging release branch into main, squash otherwise.

…uring recovery

Backport of #5755 to release-2.2. Adds a 100ms timeout on pipeline.send() to detect dead connections and fails pending requests immediately when entering recovery.

Key changes:
- pipeline.send() times out after 100ms with FatalSendError instead of blocking forever when the bounded channel is full
- poll_recover() properly polls JoinHandle instead of now_or_never()
- poll_flush() returns Pending during recovery instead of busy-spinning
- fail_pending_requests() drains pending requests with ClientError on recovery entry
- Reconnection operations are non-blocking (tokio::spawn)

Without this fix, a single dead shard blocks the pipeline indefinitely, preventing recovery and causing OOM under sustained load.

Validated on EC2: iptables block test shows client recovers within 5s of block being lifted (previously stuck indefinitely).

Signed-off-by: James Duong <duong.james@gmail.com>

alexr-bq

Approved pending fixing CI

jduo requested a review from a team as a code owner May 15, 2026 21:51

jduo requested a deployment to AWS_ACTIONS May 15, 2026 21:51 — with GitHub Actions Waiting

yipin-chen approved these changes May 19, 2026

alexr-bq approved these changes May 19, 2026

jeremyprime approved these changes May 20, 2026

jduo wants to merge 1 commit into release-2.2 from jduo/pipeline-timeout-r22

4 participants