Skip to content

Remove grpc.WaitForReady to make eviction handler more robust#1069

Merged
efiacor merged 5 commits into
kptdev:mainfrom
Nordix:evict-pod-on-demand2
Jun 25, 2026
Merged

Remove grpc.WaitForReady to make eviction handler more robust#1069
efiacor merged 5 commits into
kptdev:mainfrom
Nordix:evict-pod-on-demand2

Conversation

@kushnaidu

@kushnaidu kushnaidu commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Title

Remove grpc.WaitForReady to make eviction handler more robust


Description

  • What changed:
    Removed WaitForReady(true) from EvaluateFunction call options — no longer needed since pods in cache are guaranteed reachable.
    Added waitForGrpcReady function in podmanager.go that calls cc.Connect() and waits for the gRPC connection state to reach READY before marking a pod as available.
    Added skipGrpcReadyCheck field to podManager struct for test skipping.
    Eviction handler now checks k8s pod state (Get + status check) instead of unconditionally removing — keeps healthy Running pods, only evicts if pod is gone/Failed/deleting.
    Eviction deletes only the Pod, not the Service. Cleaned up by GC.

  • Why it’s needed:
    WaitForReady(true) on retries blocked indefinitely on dead pods (whose IP was unreachable), causing 4+ minute hangs until parent context timeout
    Without waitForGrpcReady, pods were added to cache before their gRPC server was listening, causing immediate "connection refused" on first use and triggering spurious evictions of healthy starting pods
    Unconditional eviction (old handler) deleted services during burst scale-up, causing cascading "service not found" failures for new pods trying to reuse the same service name

  • How it works:
    During pod creation, waitForGrpcReady triggers cc.Connect() and polls connection state transitions until READY or podReadyTimeout (60s) expires — pod only enters cache once gRPC is confirmed reachable
    Since all cached pods have verified connections, Unavailable means the pod genuinely died — no ambiguity with starting pods
    Eviction handler does a single k8s Get on the specific pod: if not found, Failed, or DeletionTimestamp set → removes from cache + deletes pod in background. If still Running → keeps it (false positive from transient issue, GC handles within 1 scan interval(default))
    Only the Pod is deleted during eviction — Service persists for reuse by retrieveOrCreateService on the next scale-up


Related Issue(s)

  • Closes/Fixes #

Type of Change

  • Bug fix
  • New feature
  • Enhancement
  • Refactor
  • Documentation
  • Tests
  • Other: ________

Checklist

  • Code follows project style guidelines
  • Self-reviewed changes
  • Tests added/updated
  • Documentation added/updated
  • All tests and gating checks pass

Testing Instructions (Optional)


Additional Notes (Optional)

  • Known issues:
  • Further improvements:
  • Review notes:

AI Disclosure

  • I have used AI in the creation of this PR.

If so, please describe how:

  • Microsoft Copilot to analyse the code.
  • Kiro to generate unit tests and code analysis.

@kushnaidu kushnaidu requested review from a team and Copilot June 23, 2026 07:28
@netlify

netlify Bot commented Jun 23, 2026

Copy link
Copy Markdown

Deploy Preview for kpt-porch ready!

Name Link
🔨 Latest commit 5d98dc1
🔍 Latest deploy log https://app.netlify.com/projects/kpt-porch/deploys/6a3ab231b4014e0008ebf1d4
😎 Deploy Preview https://deploy-preview-1069--kpt-porch.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.
🤖 Make changes Run an agent on this branch

To edit notification comments on pull requests, go to your Netlify project configuration.

@dosubot dosubot Bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Jun 23, 2026
Signed-off-by: Kushal Harish Naidu <kushal.harish.naidu@ericsson.com>
@kushnaidu kushnaidu force-pushed the evict-pod-on-demand2 branch from 9b9c314 to 0a07281 Compare June 23, 2026 07:29

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to eliminate long hangs and reduce false-positive evictions in the function-runner pod cache by removing grpc.WaitForReady(true) usage and instead gating pod availability on gRPC connection readiness, while making eviction behavior consult actual Kubernetes Pod state.

Changes:

  • Added a gRPC readiness wait (waitForGrpcReady) during pod creation, with a test-only skip flag.
  • Removed grpc.WaitForReady(true) from EvaluateFunction RPCs.
  • Updated eviction handling to check the Kubernetes Pod state before removing from cache and to delete only Pods (not Services); updated/added related tests.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
func/internal/podmanager.go Adds waitForGrpcReady and uses it during pod creation (with test skip flag).
func/internal/podevaluator.go Removes WaitForReady and updates retry rationale around Unavailable.
func/internal/podcachemanager.go Changes eviction to consult Kubernetes Pod state and delete only Pods.
func/internal/podevaluator_podmanager_test.go Sets skipGrpcReadyCheck for tests.
func/internal/podevaluator_podcachemanager_test.go Sets skipGrpcReadyCheck for tests.
func/internal/podcachemanager_eventloop_test.go Refactors eviction tests into a table-driven test and sets skipGrpcReadyCheck.

Comment thread func/internal/podevaluator.go
Comment thread func/internal/podcachemanager.go
Comment thread func/internal/podmanager.go
Signed-off-by: Kushal Harish Naidu <kushal.harish.naidu@ericsson.com>
Copilot AI review requested due to automatic review settings June 23, 2026 07:55

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

Comment thread func/internal/podcachemanager.go
Comment thread func/internal/podcachemanager.go
Comment thread func/internal/podcachemanager.go
Comment thread func/internal/podevaluator.go
Comment thread func/internal/podevaluator_podmanager_test.go Outdated
Comment thread func/internal/podcachemanager_eventloop_test.go
Signed-off-by: Kushal Harish Naidu <kushal.harish.naidu@ericsson.com>
@kushnaidu kushnaidu self-assigned this Jun 23, 2026
@sonarqubecloud

Copy link
Copy Markdown

Quality Gate Failed Quality Gate failed

Failed conditions
75.0% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube Cloud

Signed-off-by: Kushal Harish Naidu <kushal.harish.naidu@ericsson.com>
Copilot AI review requested due to automatic review settings June 23, 2026 16:09
@kushnaidu kushnaidu removed the do-not-merge/hold #ededed label Jun 23, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Comment thread func/internal/podcachemanager.go
Comment thread func/internal/podcachemanager.go
Signed-off-by: Kushal Harish Naidu <kushal.harish.naidu@ericsson.com>
Comment thread func/internal/podcachemanager.go
@dosubot dosubot Bot added the lgtm #ededed label Jun 24, 2026
@kushnaidu kushnaidu added do-not-merge/hold #ededed and removed do-not-merge/hold #ededed labels Jun 24, 2026
@efiacor efiacor merged commit 610aff9 into kptdev:main Jun 25, 2026
25 checks passed
@efiacor efiacor deleted the evict-pod-on-demand2 branch June 25, 2026 10:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

lgtm #ededed size:M This PR changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants