Remove grpc.WaitForReady to make eviction handler more robust#1069
Merged
Conversation
✅ Deploy Preview for kpt-porch ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
Signed-off-by: Kushal Harish Naidu <kushal.harish.naidu@ericsson.com>
9b9c314 to
0a07281
Compare
Contributor
There was a problem hiding this comment.
Pull request overview
This PR aims to eliminate long hangs and reduce false-positive evictions in the function-runner pod cache by removing grpc.WaitForReady(true) usage and instead gating pod availability on gRPC connection readiness, while making eviction behavior consult actual Kubernetes Pod state.
Changes:
- Added a gRPC readiness wait (
waitForGrpcReady) during pod creation, with a test-only skip flag. - Removed
grpc.WaitForReady(true)fromEvaluateFunctionRPCs. - Updated eviction handling to check the Kubernetes Pod state before removing from cache and to delete only Pods (not Services); updated/added related tests.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
func/internal/podmanager.go |
Adds waitForGrpcReady and uses it during pod creation (with test skip flag). |
func/internal/podevaluator.go |
Removes WaitForReady and updates retry rationale around Unavailable. |
func/internal/podcachemanager.go |
Changes eviction to consult Kubernetes Pod state and delete only Pods. |
func/internal/podevaluator_podmanager_test.go |
Sets skipGrpcReadyCheck for tests. |
func/internal/podevaluator_podcachemanager_test.go |
Sets skipGrpcReadyCheck for tests. |
func/internal/podcachemanager_eventloop_test.go |
Refactors eviction tests into a table-driven test and sets skipGrpcReadyCheck. |
Signed-off-by: Kushal Harish Naidu <kushal.harish.naidu@ericsson.com>
Signed-off-by: Kushal Harish Naidu <kushal.harish.naidu@ericsson.com>
|
Signed-off-by: Kushal Harish Naidu <kushal.harish.naidu@ericsson.com>
Signed-off-by: Kushal Harish Naidu <kushal.harish.naidu@ericsson.com>
rendre-greyling
approved these changes
Jun 24, 2026
efiacor
approved these changes
Jun 25, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.


Title
Remove grpc.WaitForReady to make eviction handler more robust
Description
What changed:
Removed WaitForReady(true) from EvaluateFunction call options — no longer needed since pods in cache are guaranteed reachable.
Added waitForGrpcReady function in podmanager.go that calls cc.Connect() and waits for the gRPC connection state to reach READY before marking a pod as available.
Added skipGrpcReadyCheck field to podManager struct for test skipping.
Eviction handler now checks k8s pod state (Get + status check) instead of unconditionally removing — keeps healthy Running pods, only evicts if pod is gone/Failed/deleting.
Eviction deletes only the Pod, not the Service. Cleaned up by GC.
Why it’s needed:
WaitForReady(true) on retries blocked indefinitely on dead pods (whose IP was unreachable), causing 4+ minute hangs until parent context timeout
Without waitForGrpcReady, pods were added to cache before their gRPC server was listening, causing immediate "connection refused" on first use and triggering spurious evictions of healthy starting pods
Unconditional eviction (old handler) deleted services during burst scale-up, causing cascading "service not found" failures for new pods trying to reuse the same service name
How it works:
During pod creation, waitForGrpcReady triggers cc.Connect() and polls connection state transitions until READY or podReadyTimeout (60s) expires — pod only enters cache once gRPC is confirmed reachable
Since all cached pods have verified connections, Unavailable means the pod genuinely died — no ambiguity with starting pods
Eviction handler does a single k8s Get on the specific pod: if not found, Failed, or DeletionTimestamp set → removes from cache + deletes pod in background. If still Running → keeps it (false positive from transient issue, GC handles within 1 scan interval(default))
Only the Pod is deleted during eviction — Service persists for reuse by retrieveOrCreateService on the next scale-up
Related Issue(s)
Type of Change
Checklist
Testing Instructions (Optional)
Additional Notes (Optional)
AI Disclosure
If so, please describe how: