util: Graceful switch to new LB when leaving CONNECTING #11983
ejona86 merged 1 commit into grpc:master
Previously it would wait for the new LB to enter READY. However, that means there is no upper bound on how long the old policy will continue to be used. The point of graceful switch is to avoid RPCs seeing increased latency when we swap config. We don't want it to prevent the system from becoming eventually consistent.
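For readers skimming the thread, the change in the swap condition can be sketched as below. This is a minimal illustration, not the actual grpc-java code: the class `GracefulSwitchSketch`, its `State` enum, and both method names are hypothetical, and the sketch only models the state reported by the pending (new) policy, ignoring everything else the real balancer tracks.

```java
// Hypothetical sketch; names are illustrative and not taken from grpc-java.
final class GracefulSwitchSketch {
  // Simplified stand-in for the connectivity states a policy can report.
  enum State { IDLE, CONNECTING, READY, TRANSIENT_FAILURE }

  /** Old rule: keep using the old policy until the pending one reports READY. */
  static boolean shouldSwapOld(State pendingState) {
    return pendingState == State.READY;
  }

  /** New rule: swap as soon as the pending policy reports anything other than CONNECTING. */
  static boolean shouldSwapNew(State pendingState) {
    return pendingState != State.CONNECTING;
  }

  public static void main(String[] args) {
    // Print both decisions for every state to show where the rules differ.
    for (State s : State.values()) {
      System.out.println(s + ": old=" + shouldSwapOld(s) + " new=" + shouldSwapNew(s));
    }
  }
}
```

Under the old rule a pending policy stuck in TRANSIENT_FAILURE never triggers a swap, which is the unbounded case described above; under the new rule it does.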
I don't understand your reasoning here. If we have something working, switching away from it because the new configuration is explicitly not working seems wrong. The goal should be to have RPCs succeed, not to do a mathematical proof of consistency.
The other languages already do this and, again, the only purpose of graceful switch at this point is to hide latency. It isn't a resiliency feature, and having it persist configuration indefinitely can lead to problems. Without a feedback loop to inform the user that something is wrong, it may be a month before the client is restarted, and at that point it is very hard to figure out what broke it.
https://github.com/grpc/grpc/blob/bdf170a3151ee429f7b42a62b7e86c9b0d3e7a6d/src/core/load_balancing/child_policy_handler.cc#L65 (You can take a look at the header to see that it is the same graceful switch policy.)
larry-safran left a comment:
I believe that we are generally overly aggressive in failing things. In this case, you could add a timer: if it has been more than a few hours, switch, and add to the message that the configuration has been broken since the time of the update. That would keep a temporary failure from blocking activity.
However, since everyone wants to do the simple thing and break the user as soon as possible, I'll approve it.
Delaying it by an hour serves no purpose if the rollout of the bad config is not stopped. That just means more clients will have the bad config before the problem is noticed. At best you have the same number of clients impacted. (As that is determined by (time of detection) → (time of resolution).) Since this is swapping the LB policy being used, that is only going to happen with a config push. It isn't as transient as endpoint lookup.