Gate compute-service deletion on agent pods being evicted#320
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (11)
✅ Files skipped from review due to trivial changes (1)
🚧 Files skipped from review as they are similar to previous changes (10)
📝 WalkthroughWalkthroughAdds a new ChangesAgent Pods Eviction Gate
Sequence Diagram(s)sequenceDiagram
rect rgba(173, 216, 230, 0.5)
Note over GardenerNodeLifecycleController: VM eviction completes
end
GardenerNodeLifecycleController->>Node: Patch NoExecute taintKeyOffboarding taint
rect rgba(255, 223, 186, 0.5)
Note over HypervisorController: Triggered by taint / periodic reconcile
end
HypervisorController->>Node: Verify offboarding taint present
loop per agent namespace
HypervisorController->>PodAPI: List pods (spec.nodeName field selector, paginated)
end
PodAPI-->>HypervisorController: Running pods without indefinite toleration
HypervisorController->>HypervisorStatus: Patch AgentPodsEvicted condition (True/False)
HypervisorController-->>HypervisorController: RequeueAfter if False
rect rgba(200, 230, 200, 0.5)
Note over OffboardingController: AgentPodsEvicted must be True
end
OffboardingController->>HypervisorStatus: Read AgentPodsEvicted condition
alt AgentPodsEvicted is False
OffboardingController-->>HypervisorStatus: Update waiting message, return early
else AgentPodsEvicted is True
OffboardingController->>NovaComputeService: Delete compute service and resources
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Pull request overview
This PR prevents nova-compute service deletion during offboarding from being undone by still-running agent pods (e.g., nova-compute/neutron) re-registering the service. It introduces an explicit “agent pods evicted” gate backed by a taint-guarded, per-namespace, paginated pod listing that avoids starting a cluster-wide Pod informer.
Changes:
- Gate compute-service deletion in the offboarding controller on
AgentPodsEvicted=True. - Compute
AgentPodsEvictedin the Hypervisor controller by listing pods per configured namespace(s), only after the node has the offboardingNoExecutetaint, and with paged listing. - Add
--agent-namespaces(required) and disable the Pod cache to avoid cluster-wide Pod informers; update Helm chart wiring and tests accordingly.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
internal/global/global.go |
Adds global configuration for agent pod namespaces. |
cmd/main.go |
Adds required --agent-namespaces flag parsing and disables Pod caching. |
internal/controller/offboarding_controller.go |
Blocks service deletion until AgentPodsEvicted=True. |
internal/controller/offboarding_controller_test.go |
Updates/extends tests to cover the new gate behavior. |
internal/controller/hypervisor_controller.go |
Computes AgentPodsEvicted via taint-guarded, per-namespace, paginated pod listing. |
internal/controller/hypervisor_controller_test.go |
Adds tests for taint guard, namespace scoping, and deletion-pending pods. |
internal/controller/gardener_node_lifecycle_controller.go |
Applies offboarding taint after VM eviction; adds watch predicate for Evicting status changes. |
internal/controller/gardener_node_lifecycle_controller_test.go |
Adds tests for taint application and idempotency. |
internal/controller/constants.go |
Introduces the kvm.cloud.sap/offboarding taint key constant. |
api/v1/hypervisor_types.go |
Adds ConditionTypeAgentPodsEvicted constant. |
go.mod |
Promotes go-logr/logr to a direct dependency. |
charts/openstack-hypervisor-operator/values.yaml |
Adds agentNamespaces value and passes it to the manager args. |
charts/openstack-hypervisor-operator/templates/deployment.yaml |
Wires AGENT_NAMESPACES env var into the manager deployment. |
charts/openstack-hypervisor-operator/templates/role.yaml |
Grants Pod permissions needed for listing agent pods. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Gate on Evicting=False to avoid racing with live-migration. Watch Hypervisor status changes (evictingConditionChangedPredicate) so the taint is applied promptly when the condition transitions.
6b461ab to
1fe6bd7
Compare
1fe6bd7 to
a20a49f
Compare
a20a49f to
4ef3f1a
Compare
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/controller/offboarding_controller.go (1)
96-101:⚠️ Potential issue | 🟠 Major | ⚡ Quick winMove the AgentPodsEvicted gate before the OpenStack lookup paths.
On Line 96,
GetHypervisorByName()runs before the new gate, and Line 100 can mark offboarding complete onErrNoHypervisorbefore Line 132 is evaluated. That bypasses the eviction gate and can finalize offboarding while agent pods may still be running.💡 Suggested fix
// If the service id is set, there might be VMs either from onboarding or even from normal operation // In that case we need to wait until those are evicted if hv.Status.ServiceID != "" && !meta.IsStatusConditionFalse(hv.Status.Conditions, kvmv1.ConditionTypeEvicting) { // Either has not evicted yet, or is still evicting VMs, so we have to wait for that to finish return ctrl.Result{}, nil } + + // A still-running nova-compute agent would re-register the service we + // are about to delete, undoing the offboarding. + if !meta.IsStatusConditionTrue(hv.Status.Conditions, kvmv1.ConditionTypeAgentPodsEvicted) { + msg := "Waiting for agent pods (nova-compute, neutron) to be evicted from node" + if cond := meta.FindStatusCondition(hv.Status.Conditions, kvmv1.ConditionTypeAgentPodsEvicted); cond != nil && cond.Message != "" { + msg = "Waiting for agent pods to be evicted: " + cond.Message + } + return r.setOffboardingCondition(ctx, hv, msg) + } hypervisor, err := openstack.GetHypervisorByName(ctx, r.computeClient, hostname, true) if err != nil { if errors.Is(err, openstack.ErrNoHypervisor) { // We are (hopefully) done return ctrl.Result{}, r.markOffboarded(ctx, hv) } return r.setOffboardingCondition(ctx, hv, fmt.Sprintf("cannot get hypervisor by name %s due to %s", hostname, err)) } @@ - // A still-running nova-compute agent would re-register the service we - // are about to delete, undoing the offboarding. - if !meta.IsStatusConditionTrue(hv.Status.Conditions, kvmv1.ConditionTypeAgentPodsEvicted) { - msg := "Waiting for agent pods (nova-compute, neutron) to be evicted from node" - if cond := meta.FindStatusCondition(hv.Status.Conditions, kvmv1.ConditionTypeAgentPodsEvicted); cond != nil && cond.Message != "" { - msg = "Waiting for agent pods to be evicted: " + cond.Message - } - return r.setOffboardingCondition(ctx, hv, msg) - }Also applies to: 130-138
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/controller/offboarding_controller.go` around lines 96 - 101, The AgentPodsEvicted gate check needs to be evaluated before the GetHypervisorByName call on line 96. Currently, GetHypervisorByName can return ErrNoHypervisor and mark offboarding complete via r.markOffboarded before the eviction gate is evaluated, allowing offboarding to finalize while agent pods may still be running. Move the AgentPodsEvicted gate check to execute before calling GetHypervisorByName so that the eviction gate is verified first, ensuring agent pods are evicted before any path can mark offboarding as complete.
🧹 Nitpick comments (1)
internal/controller/offboarding_controller_test.go (1)
418-456: ⚡ Quick winAssert that OpenStack hypervisor lookup is not called when AgentPodsEvicted is False.
This case validates the status message, but it doesn’t enforce that the gate short-circuits before external lookup. Add a call counter on
GET /os-hypervisors/detailand assert it stays0in the spec.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/controller/offboarding_controller_test.go` around lines 418 - 456, The test validates that the reconciler waits for agent pods to be evicted when AgentPodsEvicted is False, but it does not verify that the external OpenStack hypervisor lookup is short-circuited. Add a call counter variable in the BeforeEach hook to track how many times the GET /os-hypervisors/detail handler is invoked, increment it within the handler, and then add an assertion in the It block that this counter remains 0 after the Reconcile call completes, ensuring the gate prevents the external API call from being made.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@internal/controller/offboarding_controller.go`:
- Around line 96-101: The AgentPodsEvicted gate check needs to be evaluated
before the GetHypervisorByName call on line 96. Currently, GetHypervisorByName
can return ErrNoHypervisor and mark offboarding complete via r.markOffboarded
before the eviction gate is evaluated, allowing offboarding to finalize while
agent pods may still be running. Move the AgentPodsEvicted gate check to execute
before calling GetHypervisorByName so that the eviction gate is verified first,
ensuring agent pods are evicted before any path can mark offboarding as
complete.
---
Nitpick comments:
In `@internal/controller/offboarding_controller_test.go`:
- Around line 418-456: The test validates that the reconciler waits for agent
pods to be evicted when AgentPodsEvicted is False, but it does not verify that
the external OpenStack hypervisor lookup is short-circuited. Add a call counter
variable in the BeforeEach hook to track how many times the GET
/os-hypervisors/detail handler is invoked, increment it within the handler, and
then add an assertion in the It block that this counter remains 0 after the
Reconcile call completes, ensuring the gate prevents the external API call from
being made.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: c28fdcae-cfd4-4a64-93f4-9c6e5835b5e5
📒 Files selected for processing (14)
api/v1/hypervisor_types.gocharts/openstack-hypervisor-operator/templates/deployment.yamlcharts/openstack-hypervisor-operator/templates/role.yamlcharts/openstack-hypervisor-operator/values.yamlcmd/main.gogo.modinternal/controller/constants.gointernal/controller/gardener_node_lifecycle_controller.gointernal/controller/gardener_node_lifecycle_controller_test.gointernal/controller/hypervisor_controller.gointernal/controller/hypervisor_controller_test.gointernal/controller/offboarding_controller.gointernal/controller/offboarding_controller_test.gointernal/global/global.go
4ef3f1a to
9217eec
Compare
A running nova-compute pod would re-register the service immediately after deletion, undoing the offboarding. Introduce AgentPodsEvicted condition computed by HypervisorController and block service deletion in the offboarding controller until it is True. The condition is only evaluated once VM eviction has finished (Evicting=False) and the offboarding taint is present on the node. Pods are listed per-namespace via --agent-namespaces (chart default: monsoon3; the operator refuses to start without it) in pages of 100, with the pod cache disabled so the operator does not start a cluster-wide pod informer. RequeueAfter is used to poll since pods are not watched.
9217eec to
2f3157a
Compare
Merging this branch will increase overall coverage
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. Changed unit test files
|
| pods := &corev1.PodList{} | ||
| Expect(uncachedClient.List(ctx, pods, | ||
| client.MatchingFieldsSelector{ | ||
| Selector: fields.OneTermEqualSelector("spec.nodeName", "target-node"), | ||
| }, | ||
| )).To(Succeed()) |
A running nova-compute pod would re-register the service immediately
after deletion, undoing the offboarding. This extends the existing
AgentPodsEvictedcondition (already in main) with:--agent-namespaces(chartdefault:
monsoon3; the operator refuses to start without it).Pods are listed in pages of 100 per namespace, scoped by
spec.nodeNamefield selector.kvm.cloud.sap/offboardingNoExecute taint is present on the node,avoiding a spurious
Trueon a fresh node before agents are scheduled.pod informer that would cause OOM on startup.
deletion-pending pods counting as still running.
Summary by CodeRabbit
--agent-namespaces) to scope which agent pods are evaluated during termination/offboarding.