
OCPBUGS-84308: fix(cpo) delete terminated MCD pods to retry in-place upgrades #8434

Draft
PoornimaSingour wants to merge 4 commits into openshift:main from PoornimaSingour:OCPBUGS-84308

Conversation

@PoornimaSingour
Contributor

@PoornimaSingour commented May 6, 2026

What this PR does / why we need it:

When an in-place MachineConfig daemon pod is prematurely terminated (e.g., by a forced node drain), it may transition to Succeeded or Failed phase without having completed the configuration update. Previously, reconcileUpgradePods did not check the pod's phase when it already existed, leaving the terminated pod in place and causing the upgrade to stall indefinitely.

Now, when an MCD pod exists in a terminal phase (Succeeded or Failed) on a node that still requires upgrading, the controller deletes the pod so it is recreated on the next reconciliation cycle.
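
For illustration, a minimal sketch of the check described above, in the spirit of the walkthrough below (the surrounding loop and the names podKey, node, and newUpgradePod are assumptions for this sketch, not the exact upstream code; apierrors is k8s.io/apimachinery/pkg/api/errors):

    // Inside reconcileUpgradePods, for each node that still needs an upgrade:
    pod := &corev1.Pod{}
    err := hostedClusterClient.Get(ctx, podKey, pod)
    switch {
    case apierrors.IsNotFound(err):
        // No pod yet: create one so the upgrade can start.
        if err := hostedClusterClient.Create(ctx, newUpgradePod(node)); err != nil {
            return fmt.Errorf("error creating upgrade MCD pod for node %s: %w", node.Name, err)
        }
    case err != nil:
        return fmt.Errorf("error getting upgrade MCD pod for node %s: %w", node.Name, err)
    case pod.Status.Phase == corev1.PodSucceeded || pod.Status.Phase == corev1.PodFailed:
        // The pod terminated without finishing the update: delete it so the
        // next reconciliation recreates it. Tolerate NotFound races.
        if pod.DeletionTimestamp == nil {
            if err := hostedClusterClient.Delete(ctx, pod); err != nil && !apierrors.IsNotFound(err) {
                return fmt.Errorf("error deleting terminated upgrade MCD pod for node %s: %w", node.Name, err)
            }
        }
    default:
        // Pending/Running: leave the pod alone.
    }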

Which issue(s) this PR fixes:

Fixes: https://redhat.atlassian.net/browse/OCPBUGS-84308

Special notes for your reviewer:

Checklist:

  • Subject and description added to both the commit and the PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • Bug Fixes

    • Upgrade flow now removes terminated upgrade pods (Succeeded/Failed) so retries can proceed and in-place upgrades continue after prior attempts finish.
    • Improved logging around upgrade pod creation and reconciliation for clearer operational visibility.
  • Tests

    • Added unit tests covering upgrade pod lifecycle: deletion of terminated pods, retention of running pods, creation when missing, and cleanup of idle pods on fully updated nodes.

When an in-place MachineConfig daemon pod is prematurely terminated
(e.g., by a forced node drain), it may transition to Succeeded or
Failed phase without having completed the configuration update.
Previously, reconcileUpgradePods did not check the pod's phase when
it already existed, leaving the terminated pod in place and causing
the upgrade to stall indefinitely.

Now, when an MCD pod exists in a terminal phase (Succeeded or Failed)
on a node that still requires upgrading, the controller deletes the
pod so it is recreated on the next reconciliation cycle.

Signed-off-by: Poornima Singour <psingour@redhat.com>
Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
@openshift-merge-bot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests are triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller automatically detects which contexts are required and uses /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci Bot added the do-not-merge/work-in-progress label May 6, 2026
@openshift-ci
Contributor

openshift-ci Bot commented May 6, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai
Contributor

coderabbitai Bot commented May 6, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 3a3eae36-8f76-4951-8286-ff8ca2e92bb4

📥 Commits

Reviewing files that changed from the base of the PR and between c82c543 and 3a092a3.

📒 Files selected for processing (2)
  • control-plane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader_test.go

📝 Walkthrough

ReconcileInPlaceUpgrade was updated to detect upgrade MCD pods in Succeeded or Failed phases and delete them (tolerating NotFound) so a new upgrade pod can be retried. Running pods are left unchanged; if no pod exists the controller creates one (with added creation logging). The caller’s error return message was updated to reflect reconciling upgrade pods. A unit test TestReconcileUpgradePods was added to cover deleted terminated pods, retained running pods, created missing pods, and removed idle pods on fully updated nodes.

Sequence Diagram(s)

sequenceDiagram
    participant Controller
    participant API_Server
    participant Pod

    Controller->>API_Server: Get upgrade Pod for node
    API_Server-->>Controller: Return Pod (Running | Succeeded | Failed | NotFound)

    alt Pod is Running
        Controller->>Controller: Leave Pod unchanged
    else Pod is Succeeded or Failed
        Controller->>API_Server: Delete Pod (log termination retry)
        API_Server-->>Controller: Delete response (Success / NotFound / Error)
    else Pod NotFound
        Controller->>API_Server: Create upgrade Pod (log creation result)
        API_Server-->>Controller: Create response (Success / Error)
    end
🚥 Pre-merge checks: ✅ 11 passed | ❌ 1 failed

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (11 passed)

  • Description Check: Check skipped - CodeRabbit’s high-level summary is enabled.
  • Title Check: The title clearly summarizes the main change: deleting terminated MCD pods to enable retry of in-place upgrades, which is the primary objective of the pull request.
  • Linked Issues Check: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check: Check skipped because no linked issues were found for this pull request.
  • Stable and Deterministic Test Names: Standard Go tests are used (not Ginkgo), and all test case names are static strings with no dynamic content. Check not applicable.
  • Test Structure and Quality: The custom check targets Ginkgo test code; TestReconcileUpgradePods uses standard Go table-driven testing with Gomega assertions, not Ginkgo. Not applicable.
  • MicroShift Test Compatibility: TestReconcileUpgradePods is a standard Go unit test (*testing.T), not a Ginkgo e2e test; the check applies only to Ginkgo tests (It(), Describe()).
  • Single Node OpenShift (SNO) Test Compatibility: The PR adds standard Go unit tests with fake clients, not Ginkgo e2e tests; the SNO compatibility check applies only to Ginkgo e2e tests.
  • Topology-Aware Scheduling Compatibility: The PR adds pod deletion logic for terminated MCD upgrade pods; no affinity, topology spread constraints, or control-plane targeting is introduced, and the per-node approach is topology-agnostic.
  • OTE Binary Stdout Contract: No stdout contract violations found. The code uses ctrl.LoggerFrom structured logging, has no direct stdout writes (fmt.Print/Println/Printf) or process-level entry points, and all errors use fmt.Errorf.
  • IPv6 and Disconnected Network Test Compatibility: New tests are standard Go unit tests, not Ginkgo e2e tests; the custom check targets Ginkgo e2e tests with IPv6/disconnected concerns. Not applicable.



Warning

Review ran into problems

🔥 Problems

Timed out fetching pipeline failures after 30000ms


@openshift-ci
Contributor

openshift-ci Bot commented May 6, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: PoornimaSingour
Once this PR has been reviewed and has the lgtm label, please assign sjenning for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci Bot added the area/control-plane-operator label and removed the do-not-merge/needs-area label May 6, 2026
@PoornimaSingour changed the title from "fix(cpo): delete terminated MCD pods to retry in-place upgrades" to "OCPBUGS-84308: fix(cpo) delete terminated MCD pods to retry in-place upgrades" May 6, 2026
@openshift-ci-robot added the jira/severity-moderate and jira/valid-reference labels May 6, 2026
@openshift-ci-robot

@PoornimaSingour: This pull request references Jira Issue OCPBUGS-84308, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.


@openshift-ci-robot added the jira/invalid-bug label May 6, 2026

@codecov

codecov Bot commented May 6, 2026

Codecov Report

❌ Patch coverage is 50.00000% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 37.44%. Comparing base (640ed89) to head (3a092a3).
⚠️ Report is 11 commits behind head on main.

Files with missing lines | Patch % | Lines
...tor/controllers/inplaceupgrader/inplaceupgrader.go | 50.00% | 4 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8434      +/-   ##
==========================================
+ Coverage   37.39%   37.44%   +0.04%     
==========================================
  Files         751      751              
  Lines       91806    91978     +172     
==========================================
+ Hits        34333    34441     +108     
- Misses      54838    54894      +56     
- Partials     2635     2643       +8     
Files with missing lines | Coverage Δ
...tor/controllers/inplaceupgrader/inplaceupgrader.go | 57.22% <50.00%> (+0.15%) ⬆️

... and 3 files with indirect coverage changes

Flag | Coverage Δ
cmd-support | 32.63% <ø> (+0.07%) ⬆️
cpo-hostedcontrolplane | 36.48% <ø> (ø)
cpo-other | 37.75% <50.00%> (+0.02%) ⬆️
hypershift-operator | 47.93% <ø> (+0.07%) ⬆️
other | 27.77% <ø> (ø)

Flags with carried forward coverage won't be shown.


Contributor

@coderabbitai Bot left a comment

🧹 Nitpick comments (1)
control-plane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader_test.go (1)

736-738: ⚡ Quick win

Tighten deleted-pod assertion to NotFound instead of any error.

HaveOccurred() can pass for unrelated failures. Asserting IsNotFound makes the test intent explicit and failures clearer.

Proposed test hardening
+import apierrors "k8s.io/apimachinery/pkg/api/errors"
...
 			if tc.expectPodDeleted {
 				g.Expect(getErr).To(HaveOccurred(), "expected pod to be deleted")
+				g.Expect(apierrors.IsNotFound(getErr)).To(BeTrue(), "expected pod get to return NotFound after deletion")
 			}
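
For reference, the tightened check in full (a sketch; getErr and tc come from the table-driven test described above, and apierrors is the alias suggested in the diff):

    import apierrors "k8s.io/apimachinery/pkg/api/errors"

    if tc.expectPodDeleted {
        // A deleted pod must come back as NotFound, not just any error.
        g.Expect(apierrors.IsNotFound(getErr)).To(BeTrue(),
            "expected pod get to return NotFound after deletion, got: %v", getErr)
    }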

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 21221ab1-79e2-4d7c-8429-c9fb954b5229

📥 Commits

Reviewing files that changed from the base of the PR and between 7ac2953 and 6fbc013.

📒 Files selected for processing (2)
  • control-plane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader_test.go

Contributor

@coderabbitai Bot left a comment

Actionable comments posted: 1

Inline comments:

In control-plane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader_test.go, around lines 762-764: The test currently checks for a deleted pod with a broad error assertion (g.Expect(getErr).To(HaveOccurred()) when tc.expectPodDeleted is true). Change this to assert specifically that the error is NotFound using the Kubernetes API errors helper (apierrors.IsNotFound(getErr)), so the deleted-pod branch passes only for NotFound and not for other client/read errors.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 8b8c3bd2-d52a-403e-a20b-a5bbb89688c9

📥 Commits

Reviewing files that changed from the base of the PR and between 6fbc013 and b5637a4.

📒 Files selected for processing (1)
  • control-plane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader_test.go

…dePods

Add a test case for when a terminated MCD pod already has a
DeletionTimestamp set, verifying the controller skips the delete
and continues without error.

Signed-off-by: Poornima Singour <psingour@redhat.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use apierrors.IsNotFound instead of broad HaveOccurred matcher so
delete assertions only pass for the expected NotFound error.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@PoornimaSingour
Contributor Author

/jira refresh

@openshift-ci-robot added the jira/valid-bug label and removed the jira/invalid-bug label May 6, 2026
@openshift-ci-robot

@PoornimaSingour: This pull request references Jira Issue OCPBUGS-84308, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (dhuynh@redhat.com), skipping review request.

Details

In response to this:

/jira refresh


@PoornimaSingour
Contributor Author

@coderabbitai full review

@coderabbitai
Contributor

coderabbitai Bot commented May 6, 2026

✅ Actions performed

Full review triggered.

@openshift-ci-robot

@PoornimaSingour: This pull request references Jira Issue OCPBUGS-84308, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (dhuynh@redhat.com), skipping review request.


Contributor

@coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
control-plane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader.go (1)

352-363: ⚡ Quick win

Deleted-pod retry has no requeue guarantee — upgrade may stall.

After the terminated pod is deleted, reconcileUpgradePods returns nil, reconcileInPlaceUpgrade returns nil, and Reconcile returns ctrl.Result{} (no requeue). Because the deletion doesn't mutate any node annotation, no node-watch event fires to trigger a follow-up reconciliation. If no other MachineSet event arrives, the replacement pod is never created and the upgrade stalls indefinitely — which is exactly the problem this PR is fixing.

Consider either propagating a boolean "needs requeue" flag back up through reconcileInPlaceUpgrade to Reconcile, or returning ctrl.Result{RequeueAfter: ...} whenever at least one pod was deleted:

💡 Sketch of the fix
-func (r *Reconciler) reconcileUpgradePods(...) error {
+func (r *Reconciler) reconcileUpgradePods(...) (bool, error) {
     ...
+    podDeleted := false
     ...
     } else if pod.Status.Phase == corev1.PodSucceeded || pod.Status.Phase == corev1.PodFailed {
         ...
         if err := hostedClusterClient.Delete(ctx, pod); err != nil {
             ...
-            return fmt.Errorf("error deleting terminated upgrade MCD pod for node %s: %w", node.Name, err)
+            return false, fmt.Errorf("error deleting terminated upgrade MCD pod for node %s: %w", node.Name, err)
         }
+        podDeleted = true
     }
     ...
-    return nil
+    return podDeleted, nil
 }

And in reconcileInPlaceUpgrade / Reconcile, propagate the flag to return ctrl.Result{RequeueAfter: 5 * time.Second}.
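
Concretely, the caller-side propagation might look like this (a sketch under the suggestion's assumptions; the real function signatures in the controller may differ):

    // In reconcileInPlaceUpgrade (sketch): bubble the flag up.
    podDeleted, err := r.reconcileUpgradePods(ctx, hostedClusterClient /* ... */)
    if err != nil {
        return false, fmt.Errorf("failed to reconcile upgrade pods: %w", err)
    }
    needsRequeue := podDeleted

    // In Reconcile (sketch): no node annotation changed, so no watch event
    // will fire; requeue explicitly so the replacement pod gets created.
    if needsRequeue {
        return ctrl.Result{RequeueAfter: 5 * time.Second}, nil
    }
    return ctrl.Result{}, nil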

Inline comments:

In control-plane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader_test.go, around lines 692-715: Update the test case that sets existingPod with a DeletionTimestamp and Finalizers so it actually verifies the "skip" behavior instead of just checking getErr. In the assertion block, either assert that the retrieved pod's DeletionTimestamp is non-nil (e.g., pod.DeletionTimestamp != nil) to prove the skip path was hit, or add a fake-client interceptor (WithInterceptorFuncs) to spy on Delete and assert Delete was never called for that pod; do not rely solely on getErr.

In control-plane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader.go, around lines 352-363: reconcileUpgradePods now deletes both idle and terminated pods, but the error wrap at the caller still says "failed to delete idle upgrade pods", which is misleading. Update the error wrapping at the call site to use a neutral message like "failed to reconcile upgrade pods", or include the pod phase/node context, so failures deleting terminated pods are accurately described.
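
One way to realize the Delete spy from the first comment with the controller-runtime fake client (a sketch; the pod fixture, names, and namespace are assumptions for illustration):

    import (
        "context"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "sigs.k8s.io/controller-runtime/pkg/client"
        "sigs.k8s.io/controller-runtime/pkg/client/fake"
        "sigs.k8s.io/controller-runtime/pkg/client/interceptor"
    )

    now := metav1.Now()
    existingPod := &corev1.Pod{
        ObjectMeta: metav1.ObjectMeta{
            Name:              "machine-config-daemon-node-a", // hypothetical name
            Namespace:         "test-ns",                      // hypothetical namespace
            DeletionTimestamp: &now,
            // The fake client requires a finalizer on objects that carry a
            // deletion timestamp.
            Finalizers: []string{"test.example.com/keep"},
        },
    }

    deleteCalled := false
    c := fake.NewClientBuilder().
        WithObjects(existingPod).
        WithInterceptorFuncs(interceptor.Funcs{
            Delete: func(ctx context.Context, cl client.WithWatch, obj client.Object, opts ...client.DeleteOption) error {
                deleteCalled = true
                return cl.Delete(ctx, obj, opts...)
            },
        }).
        Build()

    // ... run reconcileUpgradePods against c ...
    g.Expect(deleteCalled).To(BeFalse(), "expected Delete to be skipped for a pod already being deleted")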


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 7df03c82-4975-43fe-9170-34a23bcc9534

📥 Commits

Reviewing files that changed from the base of the PR and between 7ac2953 and c82c543.

📒 Files selected for processing (2)
  • control-plane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader_test.go

…iation

Update error message at call site from "failed to delete idle upgrade
pods" to "failed to reconcile upgrade pods" to accurately reflect that
the function now handles both idle and terminated pod deletion.

Strengthen the DeletionTimestamp skip test by asserting that the
retrieved pod's DeletionTimestamp is non-nil, proving the skip path
was actually taken rather than relying solely on existence checks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
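
The strengthened assertion could read roughly as follows (a sketch; pod is the object fetched back from the fake client in the test):

    // Prove the skip path was taken: the pod still exists and is still
    // marked for deletion, i.e. the controller left it untouched.
    g.Expect(getErr).NotTo(HaveOccurred())
    g.Expect(pod.DeletionTimestamp).NotTo(BeNil(), "expected pod to still carry its DeletionTimestamp")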

Labels

  • area/control-plane-operator: Indicates the PR includes changes for the control plane operator - in an OCP release
  • do-not-merge/work-in-progress: Indicates that a PR should not merge because it is a work in progress.
  • jira/severity-moderate: Referenced Jira bug's severity is moderate for the branch this PR is targeting.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
