Skip to content

util: Graceful switch to new LB when leaving CONNECTING#11983

Merged
ejona86 merged 1 commit intogrpc:masterfrom
ejona86:graceful-leaveconnecting
Mar 28, 2025
Merged

util: Graceful switch to new LB when leaving CONNECTING#11983
ejona86 merged 1 commit intogrpc:masterfrom
ejona86:graceful-leaveconnecting

Conversation

@ejona86
Copy link
Copy Markdown
Member

@ejona86 ejona86 commented Mar 27, 2025

Previously it would wait for the new LB to enter READY. However, that prevents there being an upper-bound on how long the old policy will continue to be used. The point of graceful switch is to avoid RPCs seeing increased latency when we swap config. We don't want it to prevent the system from becoming eventually consistent.

Previously it would wait for the new LB to enter READY. However, that
prevents there being an upper-bound on how long the old policy will
continue to be used. The point of graceful switch is to avoid RPCs
seeing increased latency when we swap config. We don't want it to
prevent the system from becoming eventually consistent.
@ejona86 ejona86 requested a review from larry-safran March 27, 2025 20:55
@larry-safran
Copy link
Copy Markdown
Contributor

I don't understand your reasoning here. If we have something working, switching away from it because the new configuration is explicitly not working seems wrong. The goal should be to have RPCs succeed, not to do a mathematical proof of consistency.

@ejona86
Copy link
Copy Markdown
Member Author

ejona86 commented Mar 28, 2025

The other languages already do this and, again, the only purpose for graceful switch at this point is to hide latency. It isn't a resiliency feature and having it persist configuration indefinitely can lead to problems. Without a feedback loop to inform the user something is wrong, it may be a month before the client is restarted and at that point it is very hard to figure out what broke it.

@ejona86
Copy link
Copy Markdown
Member Author

ejona86 commented Mar 28, 2025

Copy link
Copy Markdown
Contributor

@larry-safran larry-safran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that we are generally overly aggressive in failing things. In this case, you could put a timer and say if it has been more than a few hours then you'll switch and add to the message that the configuration has been broken since whenever the update time was. That would keep a temporary failure from blocking activity.

However, since everyone wants to do the simple break the user as soon as possible I'll approve it.

@ejona86
Copy link
Copy Markdown
Member Author

ejona86 commented Mar 28, 2025

Delaying it by an hour serves no purpose if the rollout of the bad config is not stopped. That just means more clients will have the bad config before the problem is noticed. At best you have the same number of clients impacted. (As that is determined by (time of detection) → (time of resolution).)

Since this is swapping the LB policy being used, that is only going to happen with a config push. It isn't as transient as endpoint lookup.

@ejona86 ejona86 merged commit 2e260a4 into grpc:master Mar 28, 2025
15 of 16 checks passed
@ejona86 ejona86 deleted the graceful-leaveconnecting branch March 28, 2025 15:18
@github-actions github-actions Bot locked as resolved and limited conversation to collaborators Jun 27, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants