OCPBUGS-84577: clear stale EtcdRecoveryActive failure condition when etcd is healthy (#8406)
Conversation
Pipeline controller notification — for optional jobs, comment. This repository is configured in: LGTM mode.

Skipping CI for Draft Pull Request.
@vsolanki12: This pull request references Jira Issue OCPBUGS-84577, which is invalid.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the settings. Use the following commands to manage reviews, or the checkboxes below for quick actions.
📝 Walkthrough

When an existing etcd recovery Job finishes unsuccessfully, the reconciler now fetches the etcd StatefulSet; if the StatefulSet reports ReadyReplicas==3 and AvailableReplicas==3, it deletes the recovery Job and related objects, clears the EtcdRecoveryActive condition (Status=False, Reason=AsExpectedReason) on the HostedCluster, updates the HostedCluster status, and returns without marking manual intervention. If the StatefulSet is not fully ready, the reconciler retains the prior failure handling (EtcdRecoveryJobFailedReason). Separately, when no failing etcd pod is detected and the StatefulSet is fully available, the reconciler clears a stale EtcdRecoveryActive failure condition only if its prior Reason was EtcdRecoveryJobFailedReason, updating status. A unit test verifies these behaviors.

Sequence Diagram

```mermaid
sequenceDiagram
    participant Reconciler
    participant ETCDJob as ETCD Recovery Job
    participant ETCDSet as ETCD StatefulSet
    participant HostedCluster as HostedCluster Status
    Reconciler->>ETCDJob: Check if Job exists and failed
    alt Job Failed
        Reconciler->>ETCDSet: Fetch StatefulSet readiness
        alt ReadyReplicas==3 && AvailableReplicas==3
            Reconciler->>ETCDJob: Delete recovery Job and objects
            Reconciler->>HostedCluster: Set EtcdRecoveryActive=False (Reason: AsExpectedReason)
            Reconciler->>HostedCluster: Update status
        else StatefulSet Not Fully Ready
            Reconciler->>HostedCluster: Set EtcdRecoveryActive=True/Failed (Reason: EtcdRecoveryJobFailedReason)
            Reconciler->>HostedCluster: Update status
        end
    else No Failing Pod Detected
        Reconciler->>ETCDSet: Check if fully available
        alt StatefulSet Fully Available
            Reconciler->>HostedCluster: If prior Reason==EtcdRecoveryJobFailedReason, clear EtcdRecoveryActive (Status=False, Reason=AsExpectedReason)
            Reconciler->>HostedCluster: Update status
        end
    end
```
🚥 Pre-merge checks: ✅ 10 passed, ❌ 2 failed (2 warnings)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: vsolanki12. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #8406      +/-   ##
==========================================
+ Coverage   37.44%   37.61%   +0.17%
==========================================
  Files         751      751
  Lines       91969    92036      +67
==========================================
+ Hits        34435    34617     +182
+ Misses      54894    54759     -135
- Partials     2640     2660      +20
```

... and 4 files with indirect coverage changes

Flags with carried forward coverage won't be shown. Click here to find out more.
Actionable comments posted: 1
🧹 Nitpick comments (1)
hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go (1)
6622-6627: ⚡ Quick win — Assert the failed recovery job is gone in the cleanup case.

The recovered-after-failed-job case only checks the condition reason. If cleanup regresses and the stale Job stays behind, this test still passes even though one of the main behaviors in this PR is broken.

Suggested assertion:

```diff
 testCases := []struct {
 	name            string
 	objects         []crclient.Object
 	conditions      []metav1.Condition
 	expectedReason  string
 	conditionExists bool
+	expectJobDeleted bool
 }{
 	{
 		name:            "When failed job exists but etcd recovered it should cleanup job and clear condition",
 		conditions:      []metav1.Condition{staleCondition},
 		objects:         append(healthyEtcdPods(), healthyStatefulSet, failedJob),
 		expectedReason:  hyperv1.AsExpectedReason,
 		conditionExists: true,
+		expectJobDeleted: true,
 	},
 }
@@
 if tc.conditionExists {
 	g.Expect(condition).ToNot(BeNil())
 	g.Expect(condition.Reason).To(Equal(tc.expectedReason))
 } else {
 	g.Expect(condition).To(BeNil())
 }
+
+if tc.expectJobDeleted {
+	job := etcdrecoverymanifests.EtcdRecoveryJob(hcpNS)
+	err := client.Get(t.Context(), crclient.ObjectKeyFromObject(job), job)
+	g.Expect(errors2.IsNotFound(err)).To(BeTrue())
+}
```

Also applies to: 6677-6686
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go` around lines 6622 - 6627, The test cases that use the failedJob fixture (the case with name "When failed job exists but etcd recovered it should cleanup job and clear condition" and the similar case around lines 6677-6686) only assert condition reasons; add an assertion after reconciliation that the failed Job (failedJob) has been removed—e.g., attempt to Get the Job by failedJob.Name in the test namespace and assert the client returns NotFound or that listing Jobs returns zero matching entries. Locate the assertions around expectedReason/conditionExists in hostedcluster_controller_test.go and add the cleanup check for failedJob for both test cases so a lingering Job causes the test to fail.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: d69011b7-ad73-45d5-aa19-6f7fc30bdb73
📒 Files selected for processing (4)

- control-plane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader.go
- control-plane-operator/hostedclusterconfigoperator/controllers/inplaceupgrader/inplaceupgrader_test.go
- hypershift-operator/controllers/hostedcluster/etcd_recovery.go
- hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go
@vsolanki12: This pull request references Jira Issue OCPBUGS-84577, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Force-pushed: d28aec7 to a9c44fb (Compare)
@vsolanki12: This pull request references Jira Issue OCPBUGS-84577, which is valid. 3 validation(s) were run on this bug.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Force-pushed: 9462d10 to 5873d33 (Compare)

/auto-cc
@coderabbitai full review

✅ Actions performed: full review triggered.
🧹 Nitpick comments (1)
hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go (1)
6661-6666: ⚡ Quick win — Assert recovery Job deletion in the “failed job but etcd recovered” case.

The test validates condition reason changes, but it doesn’t verify that cleanup actually removed the failed Job. Adding that assertion would directly protect the main behavior this PR introduces.

Suggested test assertion:

```diff
 condition := meta.FindStatusCondition(updatedHC.Status.Conditions, string(hyperv1.EtcdRecoveryActive))
 if tc.conditionExists {
 	g.Expect(condition).ToNot(BeNil())
 	g.Expect(condition.Reason).To(Equal(tc.expectedReason))
 } else {
 	g.Expect(condition).To(BeNil())
 }
+
+if tc.name == "When failed job exists but etcd recovered it should cleanup job and clear condition" {
+	job := etcdrecoverymanifests.EtcdRecoveryJob(hcpNS)
+	err := client.Get(t.Context(), crclient.ObjectKeyFromObject(job), job)
+	g.Expect(errors2.IsNotFound(err)).To(BeTrue(), "expected failed recovery job to be deleted")
+}
```

Also applies to: 6719-6725
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go` around lines 6661 - 6666, Add assertions to the two test cases that simulate "failed job but etcd recovered" to ensure the failed Job resource was actually deleted: after invoking the reconcile/test execution for the case where objects include failedJob (referencing the test case with name "When failed job exists but etcd recovered it should cleanup job and clear condition" and the analogous case at the other location), query the fake client for the Job (the same failedJob name/namespace used in the test) and assert it returns NotFound; use the test's existing fake client/helper methods used elsewhere in hostedcluster_controller_test.go so the assertion verifies cleanup of failedJob in addition to the condition change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 1c4bd90e-22a0-4d2f-8a27-8b46bb7a4b92
📒 Files selected for processing (2)

- hypershift-operator/controllers/hostedcluster/etcd_recovery.go
- hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go
Force-pushed: 5873d33 to cafae49 (Compare)

@coderabbitai full review

✅ Actions performed: full review triggered.
🧹 Nitpick comments (2)
hypershift-operator/controllers/hostedcluster/etcd_recovery.go (1)
65-66: ⚡ Quick win — Avoid hardcoding the expected etcd replica count to 3.

Both health checks gate behavior on ReadyReplicas == 3 && AvailableReplicas == 3. Using the StatefulSet desired replicas keeps this logic correct if replica sizing changes in the future.

♻️ Suggested change:

```diff
-} else if etcdStatefulSet.Status.ReadyReplicas == 3 && etcdStatefulSet.Status.AvailableReplicas == 3 {
+} else {
+	expectedReplicas := int32(1)
+	if etcdStatefulSet.Spec.Replicas != nil {
+		expectedReplicas = *etcdStatefulSet.Spec.Replicas
+	}
+	if etcdStatefulSet.Status.ReadyReplicas == expectedReplicas && etcdStatefulSet.Status.AvailableReplicas == expectedReplicas {
 		log.Info("etcd recovered despite failed recovery job, cleaning up")
 		...
 		return nil, nil
+	}
+}
```

```diff
-fullyAvailable := etcdStatefulSet.Status.ReadyReplicas == 3 && etcdStatefulSet.Status.AvailableReplicas == 3
+expectedReplicas := int32(1)
+if etcdStatefulSet.Spec.Replicas != nil {
+	expectedReplicas = *etcdStatefulSet.Spec.Replicas
+}
+fullyAvailable := etcdStatefulSet.Status.ReadyReplicas == expectedReplicas &&
+	etcdStatefulSet.Status.AvailableReplicas == expectedReplicas
```

Also applies to: 125-126
🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `hypershift-operator/controllers/hostedcluster/etcd_recovery.go` around lines 65-66: Replace the hardcoded "3" replica checks with the StatefulSet's desired replica count: read the intended replica value from etcdStatefulSet.Spec.Replicas (dereference the *int32) and use that for comparisons instead of 3 in the blocks that check etcdStatefulSet.Status.ReadyReplicas and etcdStatefulSet.Status.AvailableReplicas. Do the same replacement for the similar check later in the file (the other ReadyReplicas/AvailableReplicas condition).

hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go (1)
6719-6730: ⚡ Quick win — Harden assertions to avoid false positives and brittle name-coupling.

The test currently keys a cleanup assertion off a case-name string and only checks Reason. Add explicit per-case flags (e.g., expectJobDeleted) and assert condition.Status as well to make failures more precise.

✅ Suggested test tightening:

```diff
 testCases := []struct {
 	name            string
 	objects         []crclient.Object
 	conditions      []metav1.Condition
 	expectedReason  string
 	conditionExists bool
+	expectJobDeleted bool
 }{
 	{
 		name: "When failed job exists but etcd recovered it should cleanup job and clear condition",
 		...
 		conditionExists: true,
+		expectJobDeleted: true,
 	},
 }
 ...
 if tc.conditionExists {
 	g.Expect(condition).ToNot(BeNil())
+	g.Expect(condition.Status).To(Equal(metav1.ConditionFalse))
 	g.Expect(condition.Reason).To(Equal(tc.expectedReason))
 } else {
 	g.Expect(condition).To(BeNil())
 }
-if tc.name == "When failed job exists but etcd recovered it should cleanup job and clear condition" {
+if tc.expectJobDeleted {
 	job := etcdrecoverymanifests.EtcdRecoveryJob(hcpNS)
 	err := client.Get(t.Context(), crclient.ObjectKeyFromObject(job), job)
 	g.Expect(errors2.IsNotFound(err)).To(BeTrue(), "expected failed recovery job to be deleted")
 }
```
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go` around lines 6719 - 6730, The test currently couples a job-deletion assertion to the case name and only checks condition.Reason; add a boolean test-case field (e.g., tc.expectJobDeleted) and use that flag instead of comparing tc.name to decide whether to fetch etcdrecoverymanifests.EtcdRecoveryJob and assert deletion via client.Get; also tighten the condition check by asserting condition.Status equals an explicit tc.expectedStatus (in addition to tc.expectedReason) when tc.conditionExists, referencing the existing meta.FindStatusCondition lookup for hyperv1.EtcdRecoveryActive and the tc.conditionExists / tc.expectedReason symbols to locate where to add the new assertions.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: ff88e339-71fb-4394-b3b5-f412fbfc2743
📒 Files selected for processing (2)

- hypershift-operator/controllers/hostedcluster/etcd_recovery.go
- hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go
Force-pushed: cc510cb to a2a7e34 (Compare)

/retest

@coderabbitai full review

✅ Actions performed: full review triggered.
🧹 Nitpick comments (1)
hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go (1)
6640-6677: ⚡ Quick win — Add a regression case for StatefulSet read failure in the failed-job path.

Current cases validate healthy/unhealthy outcomes, but not the new non-NotFound error branch (production lines 61-64 in reconcileETCDMemberRecovery). A dedicated case would lock in that behavior.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go` around lines 6640 - 6677, Add a regression test case in the testCases slice that covers the non-NotFound StatefulSet read error path in reconcileETCDMemberRecovery: include failedJob and the etcd pod objects (e.g., use healthyEtcdPods()), ensure the test harness simulates a GET error for the StatefulSet (not a NotFound) instead of returning unhealthyStatefulSet/healthyStatefulSet, and assert the behavior you expect for that branch (e.g., expectedReason remains hyperv1.EtcdRecoveryJobFailedReason and expectJobDeleted is false). Reference the existing identifiers failedJob, reconcileETCDMemberRecovery, healthyEtcdPods, unhealthyStatefulSet/healthyStatefulSet and add the new case to the testCases array so the fake client error path is exercised.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 6fb89972-970a-4f3f-b1e7-2cadf40e46f3
📒 Files selected for processing (2)

- hypershift-operator/controllers/hostedcluster/etcd_recovery.go
- hypershift-operator/controllers/hostedcluster/hostedcluster_controller_test.go
Force-pushed: a2a7e34 to 01953c1 (Compare)
```go
etcdRecoveryActiveCondition.Status = metav1.ConditionFalse
etcdRecoveryActiveCondition.Reason = hyperv1.AsExpectedReason
etcdRecoveryActiveCondition.LastTransitionTime = r.now()
meta.SetStatusCondition(&hcluster.Status.Conditions, etcdRecoveryActiveCondition)
```
What about etcdRecoveryActiveCondition.Message in this path?
Thanks, I have made all the changes suggested on the etcd_recovery.go & hostedcluster_controller_test.go file.
```go
Type:               string(hyperv1.EtcdRecoveryActive),
Status:             metav1.ConditionFalse,
Reason:             hyperv1.AsExpectedReason,
ObservedGeneration: hcluster.Generation,
```
What about etcdRecoveryActiveCondition.Message in this path?
Thanks, I have made all the changes suggested on the etcd_recovery.go & hostedcluster_controller_test.go file.
```go
var pods []crclient.Object
for i := 0; i < 3; i++ {
	pods = append(pods, &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
```
Could you add a case where RestartCount > 0 and the pod is in the Running state, to cover the case when the pod recovers?
Thanks, I have made all the changes suggested on the etcd_recovery.go & hostedcluster_controller_test.go file.
/cc @sdminonne
…when etcd is healthy

When the etcd recovery job fails but etcd self-heals, the EtcdRecoveryJobFailed condition was never cleared. This caused the OpenShift Console to display a stale error message even when the cluster was fully healthy.

This fix adds two checks:

- When a failed recovery job exists but the etcd StatefulSet is fully available (3/3), clean up the job and clear the condition. Transient API errors are propagated instead of silently falling through to the failure path.
- When no failing etcd pods exist and etcd is healthy, clear any stale EtcdRecoveryJobFailed condition.

Signed-off-by: Vimal Solanki <vsolanki@redhat.com>
Force-pushed: 01953c1 to 7788d6b (Compare)
I have the complete root cause for both failures. Both are caused by the same upstream infrastructure issue.

Test Failure Analysis Complete

Job Information

Error Summary

Both job failures are caused by a transient infrastructure outage.

Root Cause

The root cause is a transient outage of an upstream service, which caused two distinct but related failure modes.

These failures are entirely infrastructure-related and have nothing to do with the PR's code changes.

Recommendations

Evidence
/retest

@vsolanki12: all tests passed! Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
What this PR does / why we need it:

When the etcd recovery job fails but etcd self-heals, the EtcdRecoveryJobFailed condition was never cleared. This caused the OpenShift Console to display a stale error message ("Error in Etcd Recovery job: the Etcd cluster requires manual intervention.") on the HostedCluster overview page, even when the cluster was fully healthy (Available=True, Degraded=False, EtcdAvailable=True).

This fix adds two checks in reconcileETCDMemberRecovery:

- When a failed recovery job exists but the etcd StatefulSet is fully available, clean up the job and clear the condition.
- When no failing etcd pods exist and etcd is healthy, clear any stale EtcdRecoveryJobFailed condition.

Which issue(s) this PR fixes:

Fixes https://issues.redhat.com/browse/OCPBUGS-84577
Special notes for your reviewer:

- This behavior is gated by the ENABLE_ETCD_RECOVERY env var and only applies to managed, highly-available etcd clusters.
- The etcd_manual_intervention_required metric (which reads this condition) will also correctly reset.

Checklist:
Summary by CodeRabbit
Bug Fixes
Tests