rls: only reset backoff on recovery from TRANSIENT_FAILURE#9137
Conversation
Fix control channel connectivity monitoring to track TRANSIENT_FAILURE state explicitly. Only reset backoff timers when transitioning from TRANSIENT_FAILURE to READY, not for benign state changes like READY → IDLE → READY. Fixes grpc#8693
- Add testOnlyInitialReadyDone channel for proper test synchronization - Signal when monitoring goroutine processes initial READY state - Tests wait for this signal instead of using time.Sleep - All synchronization now uses channels/callbacks - no arbitrary sleeps - Tests pass consistently with race detector Addresses review feedback about removing time.Sleep for state transitions.
…el-state-monitoring
…rt timeouts - Replace goto/label patterns with labeled for loops and break - Replace default cases with time.After(10ms) for proper timing - Remove impossible TRANSIENT_FAILURE handling in IDLE test
Replace 5 identical labeled-for-select loops with a shared helper function that waits for a specific connectivity state on the channel.
…l-channel-state-monitoring
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #9137 +/- ##
==========================================
- Coverage 83.21% 83.09% -0.12%
==========================================
Files 414 417 +3
Lines 33489 33643 +154
==========================================
+ Hits 27868 27956 +88
- Misses 4207 4261 +54
- Partials 1414 1426 +12
🚀 New features to boost your workflow:
|
|
@eshitachandwani : Since you were the first reviewer on the original PR, I'm assigning this to you again for review. Please do take a look my open comments from the original PR and ensure that they are handled here. Thanks. |
eshitachandwani
left a comment
There was a problem hiding this comment.
LGTM modulo some minor nits. Adding @easwars as a second reviewer.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request refactors the RLS control channel connectivity state monitoring by removing the background monitoring goroutine and handling state transitions directly within the OnMessage callback. This change ensures that backoff resets are only triggered when recovering from a TRANSIENT_FAILURE state, avoiding unnecessary resets during benign transitions like READY -> IDLE -> READY or the initial connection. The test suite has been updated and expanded to verify these scenarios. The review comments identify a potential panic in a test helper when a channel is closed, and potential race conditions in the test subscriber setup where a shared instance's delegate is overwritten.
Fixes #8693
This PR is a continuation and finalization of the stale/closed PR #8720.
It addresses the remaining feedback from the maintainers and the code review tools to resolve control channel state monitoring issues in the RLS balancer.
RELEASE NOTES: N/A