Skip to content

Gate compute-service deletion on agent pods being evicted#320

Open
fwiesel wants to merge 2 commits into
mainfrom
stop-agents-before-delete-v2
Open

Gate compute-service deletion on agent pods being evicted#320
fwiesel wants to merge 2 commits into
mainfrom
stop-agents-before-delete-v2

Conversation

@fwiesel

@fwiesel fwiesel commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

A running nova-compute pod would re-register the service immediately
after deletion, undoing the offboarding. This extends the existing
AgentPodsEvicted condition (already in main) with:

  • Per-namespace paged pod listing via --agent-namespaces (chart
    default: monsoon3; the operator refuses to start without it).
    Pods are listed in pages of 100 per namespace, scoped by
    spec.nodeName field selector.
  • Offboarding taint guard: the pod list is only issued once the
    kvm.cloud.sap/offboarding NoExecute taint is present on the node,
    avoiding a spurious True on a fresh node before agents are scheduled.
  • Pod cache disabled so the operator does not start a cluster-wide
    pod informer that would cause OOM on startup.
  • Updated tests covering the taint guard, per-namespace scoping, and
    deletion-pending pods counting as still running.

Summary by CodeRabbit

  • New Features
    • Enhanced offboarding workflow now verifies that non-terminal agent pods are safely evicted before proceeding with hypervisor/service cleanup.
    • Added support for tracking an “agent pods evicted” status condition and rechecking automatically until eviction is complete.
  • Configuration
    • Introduced configurable agent namespaces (via --agent-namespaces) to scope which agent pods are evaluated during termination/offboarding.
  • Operational Improvements
    • Updated controller behavior and permissions to reliably list pods for accurate status determination during offboarding.

@coderabbitai

coderabbitai Bot commented Jun 19, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1f4188ee-066d-42ca-9ec5-1beeac275a01

📥 Commits

Reviewing files that changed from the base of the PR and between 4ef3f1a and 2f3157a.

📒 Files selected for processing (11)
  • api/v1/hypervisor_types.go
  • charts/openstack-hypervisor-operator/templates/deployment.yaml
  • charts/openstack-hypervisor-operator/templates/role.yaml
  • charts/openstack-hypervisor-operator/values.yaml
  • cmd/main.go
  • go.mod
  • internal/controller/hypervisor_controller.go
  • internal/controller/hypervisor_controller_test.go
  • internal/controller/offboarding_controller.go
  • internal/controller/offboarding_controller_test.go
  • internal/global/global.go
✅ Files skipped from review due to trivial changes (1)
  • api/v1/hypervisor_types.go
🚧 Files skipped from review as they are similar to previous changes (10)
  • charts/openstack-hypervisor-operator/templates/role.yaml
  • go.mod
  • charts/openstack-hypervisor-operator/values.yaml
  • internal/global/global.go
  • internal/controller/offboarding_controller.go
  • internal/controller/offboarding_controller_test.go
  • charts/openstack-hypervisor-operator/templates/deployment.yaml
  • cmd/main.go
  • internal/controller/hypervisor_controller_test.go
  • internal/controller/hypervisor_controller.go

📝 Walkthrough

Walkthrough

Adds a new ConditionTypeAgentPodsEvicted hypervisor status condition and a taintKeyOffboarding NoExecute node taint to gate compute-service deletion during offboarding. The gardener node lifecycle controller applies the taint when VM eviction completes; the hypervisor controller lists agent pods per namespace to compute the evicted condition; the offboarding controller blocks deletion until that condition is true. A new --agent-namespaces flag (and Helm wiring) configures which namespaces are scanned.

Changes

Agent Pods Eviction Gate

Layer / File(s) Summary
Shared contracts: condition type, global config, and taint key
api/v1/hypervisor_types.go, internal/global/global.go, internal/controller/constants.go
ConditionTypeAgentPodsEvicted constant, AgentNamespaces exported variable, and taintKeyOffboarding unexported constant are defined as the shared data contracts for the new offboarding gate.
CLI flag, pod cache config, and Helm chart wiring
cmd/main.go, charts/openstack-hypervisor-operator/..., go.mod
--agent-namespaces flag is parsed and validated into global.AgentNamespaces at startup; pod caching is disabled in the manager client; the Helm chart deployment, role, and values add the AGENT_NAMESPACES env var and list RBAC verb for pods; logr moves from indirect to direct dependency.
Node offboarding taint application
internal/controller/gardener_node_lifecycle_controller.go, internal/controller/gardener_node_lifecycle_controller_test.go
Reconcile calls a new ensureOffboardingTaint helper to patch the NoExecute taint onto the node when maintenance is termination and Evicting is false; SetupWithManager adds a Hypervisor watch with an evictingConditionChangedPredicate. Tests verify taint addition and idempotency.
AgentPodsEvicted condition computation in hypervisor controller
internal/controller/hypervisor_controller.go, internal/controller/hypervisor_controller_test.go
Reconcile computes AgentPodsEvicted by calling computeAgentPodsEvictedCondition, which paginates pod listing across AgentNamespaces via spec.nodeName field selector, evaluates podToleratesTaint for indefinite tolerations, and patches the condition into hypervisor status with a RequeueAfter when false. Tests cover varied pod phases, toleration types, deletion-pending pods, and field selector correctness.
Offboarding controller gate
internal/controller/offboarding_controller.go, internal/controller/offboarding_controller_test.go
A new guard in Reconcile returns early with a waiting status message when AgentPodsEvicted is not yet true, blocking compute-service deletion. Tests add the condition to happy-path and failure-mode setups and verify the waiting behavior when pods are still running.

Sequence Diagram(s)

sequenceDiagram
    rect rgba(173, 216, 230, 0.5)
        Note over GardenerNodeLifecycleController: VM eviction completes
    end
    GardenerNodeLifecycleController->>Node: Patch NoExecute taintKeyOffboarding taint

    rect rgba(255, 223, 186, 0.5)
        Note over HypervisorController: Triggered by taint / periodic reconcile
    end
    HypervisorController->>Node: Verify offboarding taint present
    loop per agent namespace
        HypervisorController->>PodAPI: List pods (spec.nodeName field selector, paginated)
    end
    PodAPI-->>HypervisorController: Running pods without indefinite toleration
    HypervisorController->>HypervisorStatus: Patch AgentPodsEvicted condition (True/False)
    HypervisorController-->>HypervisorController: RequeueAfter if False

    rect rgba(200, 230, 200, 0.5)
        Note over OffboardingController: AgentPodsEvicted must be True
    end
    OffboardingController->>HypervisorStatus: Read AgentPodsEvicted condition
    alt AgentPodsEvicted is False
        OffboardingController-->>HypervisorStatus: Update waiting message, return early
    else AgentPodsEvicted is True
        OffboardingController->>NovaComputeService: Delete compute service and resources
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • notandy
  • mchristianl

Poem

🐇 Hop, hop, the pods must go,
Before the service we can't delete, you know!
A taint is placed, a list is checked,
No nova-compute slips past, unchecked.
The rabbit stamps the condition True
Now offboarding proceeds, all clean and new! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title directly matches the main objective: gating compute-service deletion on agent pods being evicted, which is the primary change across all modified files.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch stop-agents-before-delete-v2

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR prevents nova-compute service deletion during offboarding from being undone by still-running agent pods (e.g., nova-compute/neutron) re-registering the service. It introduces an explicit “agent pods evicted” gate backed by a taint-guarded, per-namespace, paginated pod listing that avoids starting a cluster-wide Pod informer.

Changes:

  • Gate compute-service deletion in the offboarding controller on AgentPodsEvicted=True.
  • Compute AgentPodsEvicted in the Hypervisor controller by listing pods per configured namespace(s), only after the node has the offboarding NoExecute taint, and with paged listing.
  • Add --agent-namespaces (required) and disable the Pod cache to avoid cluster-wide Pod informers; update Helm chart wiring and tests accordingly.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
internal/global/global.go Adds global configuration for agent pod namespaces.
cmd/main.go Adds required --agent-namespaces flag parsing and disables Pod caching.
internal/controller/offboarding_controller.go Blocks service deletion until AgentPodsEvicted=True.
internal/controller/offboarding_controller_test.go Updates/extends tests to cover the new gate behavior.
internal/controller/hypervisor_controller.go Computes AgentPodsEvicted via taint-guarded, per-namespace, paginated pod listing.
internal/controller/hypervisor_controller_test.go Adds tests for taint guard, namespace scoping, and deletion-pending pods.
internal/controller/gardener_node_lifecycle_controller.go Applies offboarding taint after VM eviction; adds watch predicate for Evicting status changes.
internal/controller/gardener_node_lifecycle_controller_test.go Adds tests for taint application and idempotency.
internal/controller/constants.go Introduces the kvm.cloud.sap/offboarding taint key constant.
api/v1/hypervisor_types.go Adds ConditionTypeAgentPodsEvicted constant.
go.mod Promotes go-logr/logr to a direct dependency.
charts/openstack-hypervisor-operator/values.yaml Adds agentNamespaces value and passes it to the manager args.
charts/openstack-hypervisor-operator/templates/deployment.yaml Wires AGENT_NAMESPACES env var into the manager deployment.
charts/openstack-hypervisor-operator/templates/role.yaml Grants Pod permissions needed for listing agent pods.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread cmd/main.go
Comment thread internal/controller/hypervisor_controller.go
Comment thread charts/openstack-hypervisor-operator/templates/role.yaml Outdated
Gate on Evicting=False to avoid racing with live-migration. Watch
Hypervisor status changes (evictingConditionChangedPredicate) so the
taint is applied promptly when the condition transitions.
@fwiesel fwiesel force-pushed the stop-agents-before-delete-v2 branch from 6b461ab to 1fe6bd7 Compare June 19, 2026 10:17
@fwiesel fwiesel requested a review from Copilot June 19, 2026 10:18

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.

Comment thread cmd/main.go
@fwiesel fwiesel force-pushed the stop-agents-before-delete-v2 branch from 1fe6bd7 to a20a49f Compare June 19, 2026 10:34
@fwiesel fwiesel requested a review from Copilot June 19, 2026 10:35

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.

Comment thread internal/controller/hypervisor_controller.go
@fwiesel fwiesel force-pushed the stop-agents-before-delete-v2 branch from a20a49f to 4ef3f1a Compare June 19, 2026 10:51

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/controller/offboarding_controller.go (1)

96-101: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Move the AgentPodsEvicted gate before the OpenStack lookup paths.

On Line 96, GetHypervisorByName() runs before the new gate, and Line 100 can mark offboarding complete on ErrNoHypervisor before Line 132 is evaluated. That bypasses the eviction gate and can finalize offboarding while agent pods may still be running.

💡 Suggested fix
 	// If the service id is set, there might be VMs either from onboarding or even from normal operation
 	// In that case we need to wait until those are evicted
 	if hv.Status.ServiceID != "" && !meta.IsStatusConditionFalse(hv.Status.Conditions, kvmv1.ConditionTypeEvicting) {
 		// Either has not evicted yet, or is still evicting VMs, so we have to wait for that to finish
 		return ctrl.Result{}, nil
 	}
+
+	// A still-running nova-compute agent would re-register the service we
+	// are about to delete, undoing the offboarding.
+	if !meta.IsStatusConditionTrue(hv.Status.Conditions, kvmv1.ConditionTypeAgentPodsEvicted) {
+		msg := "Waiting for agent pods (nova-compute, neutron) to be evicted from node"
+		if cond := meta.FindStatusCondition(hv.Status.Conditions, kvmv1.ConditionTypeAgentPodsEvicted); cond != nil && cond.Message != "" {
+			msg = "Waiting for agent pods to be evicted: " + cond.Message
+		}
+		return r.setOffboardingCondition(ctx, hv, msg)
+	}
 
 	hypervisor, err := openstack.GetHypervisorByName(ctx, r.computeClient, hostname, true)
 	if err != nil {
 		if errors.Is(err, openstack.ErrNoHypervisor) {
 			// We are (hopefully) done
 			return ctrl.Result{}, r.markOffboarded(ctx, hv)
 		}
 		return r.setOffboardingCondition(ctx, hv, fmt.Sprintf("cannot get hypervisor by name %s due to %s", hostname, err))
 	}
@@
-	// A still-running nova-compute agent would re-register the service we
-	// are about to delete, undoing the offboarding.
-	if !meta.IsStatusConditionTrue(hv.Status.Conditions, kvmv1.ConditionTypeAgentPodsEvicted) {
-		msg := "Waiting for agent pods (nova-compute, neutron) to be evicted from node"
-		if cond := meta.FindStatusCondition(hv.Status.Conditions, kvmv1.ConditionTypeAgentPodsEvicted); cond != nil && cond.Message != "" {
-			msg = "Waiting for agent pods to be evicted: " + cond.Message
-		}
-		return r.setOffboardingCondition(ctx, hv, msg)
-	}

Also applies to: 130-138

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/controller/offboarding_controller.go` around lines 96 - 101, The
AgentPodsEvicted gate check needs to be evaluated before the GetHypervisorByName
call on line 96. Currently, GetHypervisorByName can return ErrNoHypervisor and
mark offboarding complete via r.markOffboarded before the eviction gate is
evaluated, allowing offboarding to finalize while agent pods may still be
running. Move the AgentPodsEvicted gate check to execute before calling
GetHypervisorByName so that the eviction gate is verified first, ensuring agent
pods are evicted before any path can mark offboarding as complete.
🧹 Nitpick comments (1)
internal/controller/offboarding_controller_test.go (1)

418-456: ⚡ Quick win

Assert that OpenStack hypervisor lookup is not called when AgentPodsEvicted is False.

This case validates the status message, but it doesn’t enforce that the gate short-circuits before external lookup. Add a call counter on GET /os-hypervisors/detail and assert it stays 0 in the spec.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/controller/offboarding_controller_test.go` around lines 418 - 456,
The test validates that the reconciler waits for agent pods to be evicted when
AgentPodsEvicted is False, but it does not verify that the external OpenStack
hypervisor lookup is short-circuited. Add a call counter variable in the
BeforeEach hook to track how many times the GET /os-hypervisors/detail handler
is invoked, increment it within the handler, and then add an assertion in the It
block that this counter remains 0 after the Reconcile call completes, ensuring
the gate prevents the external API call from being made.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@internal/controller/offboarding_controller.go`:
- Around line 96-101: The AgentPodsEvicted gate check needs to be evaluated
before the GetHypervisorByName call on line 96. Currently, GetHypervisorByName
can return ErrNoHypervisor and mark offboarding complete via r.markOffboarded
before the eviction gate is evaluated, allowing offboarding to finalize while
agent pods may still be running. Move the AgentPodsEvicted gate check to execute
before calling GetHypervisorByName so that the eviction gate is verified first,
ensuring agent pods are evicted before any path can mark offboarding as
complete.

---

Nitpick comments:
In `@internal/controller/offboarding_controller_test.go`:
- Around line 418-456: The test validates that the reconciler waits for agent
pods to be evicted when AgentPodsEvicted is False, but it does not verify that
the external OpenStack hypervisor lookup is short-circuited. Add a call counter
variable in the BeforeEach hook to track how many times the GET
/os-hypervisors/detail handler is invoked, increment it within the handler, and
then add an assertion in the It block that this counter remains 0 after the
Reconcile call completes, ensuring the gate prevents the external API call from
being made.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c28fdcae-cfd4-4a64-93f4-9c6e5835b5e5

📥 Commits

Reviewing files that changed from the base of the PR and between 441452b and 4ef3f1a.

📒 Files selected for processing (14)
  • api/v1/hypervisor_types.go
  • charts/openstack-hypervisor-operator/templates/deployment.yaml
  • charts/openstack-hypervisor-operator/templates/role.yaml
  • charts/openstack-hypervisor-operator/values.yaml
  • cmd/main.go
  • go.mod
  • internal/controller/constants.go
  • internal/controller/gardener_node_lifecycle_controller.go
  • internal/controller/gardener_node_lifecycle_controller_test.go
  • internal/controller/hypervisor_controller.go
  • internal/controller/hypervisor_controller_test.go
  • internal/controller/offboarding_controller.go
  • internal/controller/offboarding_controller_test.go
  • internal/global/global.go

@fwiesel fwiesel force-pushed the stop-agents-before-delete-v2 branch from 4ef3f1a to 9217eec Compare June 19, 2026 11:18
@fwiesel fwiesel requested a review from Copilot June 19, 2026 11:19

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.

Comment thread internal/controller/hypervisor_controller.go Outdated
Comment thread cmd/main.go
A running nova-compute pod would re-register the service immediately
after deletion, undoing the offboarding. Introduce AgentPodsEvicted
condition computed by HypervisorController and block service deletion
in the offboarding controller until it is True.

The condition is only evaluated once VM eviction has finished
(Evicting=False) and the offboarding taint is present on the node.
Pods are listed per-namespace via --agent-namespaces (chart default:
monsoon3; the operator refuses to start without it) in pages of 100,
with the pod cache disabled so the operator does not start a
cluster-wide pod informer. RequeueAfter is used to poll since pods
are not watched.
@fwiesel fwiesel force-pushed the stop-agents-before-delete-v2 branch from 9217eec to 2f3157a Compare June 19, 2026 12:13
@github-actions

Copy link
Copy Markdown

Merging this branch will increase overall coverage

Impacted Packages Coverage Δ 🤖
github.com/cobaltcore-dev/openstack-hypervisor-operator/api/v1 0.00% (ø)
github.com/cobaltcore-dev/openstack-hypervisor-operator/cmd 0.00% (ø)
github.com/cobaltcore-dev/openstack-hypervisor-operator/internal/controller 69.34% (+0.36%) 👍
github.com/cobaltcore-dev/openstack-hypervisor-operator/internal/global 0.00% (ø)

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/cobaltcore-dev/openstack-hypervisor-operator/api/v1/hypervisor_types.go 0.00% (ø) 0 0 0
github.com/cobaltcore-dev/openstack-hypervisor-operator/cmd/main.go 0.00% (ø) 0 0 0
github.com/cobaltcore-dev/openstack-hypervisor-operator/internal/controller/constants.go 0.00% (ø) 0 0 0
github.com/cobaltcore-dev/openstack-hypervisor-operator/internal/controller/gardener_node_lifecycle_controller.go 63.95% (-8.46%) 86 (+28) 55 (+13) 31 (+15) 👎
github.com/cobaltcore-dev/openstack-hypervisor-operator/internal/controller/hypervisor_controller.go 80.00% (+4.32%) 120 (+46) 96 (+40) 24 (+6) 👍
github.com/cobaltcore-dev/openstack-hypervisor-operator/internal/controller/offboarding_controller.go 73.91% (+2.04%) 69 (+5) 51 (+5) 18 👍
github.com/cobaltcore-dev/openstack-hypervisor-operator/internal/global/global.go 0.00% (ø) 0 0 0

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/cobaltcore-dev/openstack-hypervisor-operator/internal/controller/gardener_node_lifecycle_controller_test.go
  • github.com/cobaltcore-dev/openstack-hypervisor-operator/internal/controller/hypervisor_controller_test.go
  • github.com/cobaltcore-dev/openstack-hypervisor-operator/internal/controller/offboarding_controller_test.go

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.

Comment on lines +736 to +741
pods := &corev1.PodList{}
Expect(uncachedClient.List(ctx, pods,
client.MatchingFieldsSelector{
Selector: fields.OneTermEqualSelector("spec.nodeName", "target-node"),
},
)).To(Succeed())
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants