OCPBUGS-65896: controller/workload: Rework Progressing condition #2203
tchap wants to merge 1 commit into
Conversation
@tchap: This pull request references Jira Issue OCPBUGS-65896, which is valid. 3 validations were run on this bug.
No GitHub users were found matching the public email listed for the QA contact in Jira (ksiddiqu@redhat.com), skipping review request. The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Skipping CI for Draft Pull Request.
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
Walkthrough
Adds deployment progression helpers that derive operator Progressing conditions and detect progress and timeouts, integrates them into the workload controller (replacing generation-based checks), and expands tests and fake pod listers to cover rollout, timeout, and pod-listing error scenarios.
Changes: Deployment progression + workload integration
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Controller
    participant KubeAPI as "Kubernetes API (Deployment)"
    participant PodLister as "Pod Lister"
    participant OperatorStatus as "OperatorStatus"
    Controller->>KubeAPI: Get Deployment object / status
    KubeAPI-->>Controller: DeploymentStatus (replicas, conditions, revision, observedGeneration)
    Controller->>PodLister: List pods for the Deployment
    PodLister-->>Controller: Pod list or error
    alt Pod list error
        Controller->>OperatorStatus: Mark Degraded / Progressing False (pod-list error)
    else Normal path
        Controller->>Controller: DeploymentProgressingCondition(deployment)
        Controller->>Controller: HasDeploymentProgressed(status) / HasDeploymentTimedOutProgressing(status)
        alt Timed out
            Controller->>OperatorStatus: Set Progressing False (timeout) and Degraded
        else Still progressing
            Controller->>OperatorStatus: Set Progressing True (updating) or Progressing False (as expected)
        end
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 11 passed | ❌ 1 failed (1 warning)
🧹 Nitpick comments (2)
pkg/apps/deployment/progressing.go (1)
15-43: 💤 Low value — Inconsistent Type field assignment in the default case.
Line 38 uses `string(operatorv1.OperatorStatusTypeProgressing)` while lines 22 and 30 assign `operatorv1.OperatorStatusTypeProgressing` directly. For consistency, remove the unnecessary `string()` conversion.
♻️ Suggested fix

```diff
 default:
 	return operatorv1.OperatorCondition{
-		Type:   string(operatorv1.OperatorStatusTypeProgressing),
+		Type:   operatorv1.OperatorStatusTypeProgressing,
 		Status: operatorv1.ConditionFalse,
 		Reason: "AsExpected",
 	}
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `pkg/apps/deployment/progressing.go` around lines 15-43: The default branch of DeploymentProgressingCondition sets Type using string(operatorv1.OperatorStatusTypeProgressing) while the other cases use operatorv1.OperatorStatusTypeProgressing directly; change the default case to assign Type: operatorv1.OperatorStatusTypeProgressing (remove the unnecessary string() conversion) so all branches use the same type value and maintain consistency.

pkg/operator/apiserver/controller/workload/workload_test.go (1)
60-60: 💤 Low value — The `podListErr` field is added but never exercised in any test scenario.
This field enables testing error handling in `PodContainersStatus` (workload.go line 330), but no scenario sets it to a non-nil value. Consider adding a test scenario that verifies the error message fallback when pod listing fails, or remove the field if it's not needed.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/operator/apiserver/controller/workload/workload_test.go` at line 60, The test adds a podListErr field but never exercises it; add a test case in workload_test.go that sets podListErr to a non-nil error and invokes PodContainersStatus (the function in workload.go around line 330) to verify the code returns the expected fallback/error message path, asserting the returned status or error string matches the fallback; alternatively, if you intentionally don't need error-path coverage, remove podListErr from the test fixture to avoid dead test fields.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 45e26b26-ec86-46f2-b1a7-ef89c6815adf
📒 Files selected for processing (4)
pkg/apps/deployment/progressing.go, pkg/apps/deployment/progressing_test.go, pkg/operator/apiserver/controller/workload/workload.go, pkg/operator/apiserver/controller/workload/workload_test.go
Force-pushed cb354ea to 94e5f12
🧹 Nitpick comments (1)
pkg/operator/apiserver/controller/workload/workload_test.go (1)
902-904: 💤 Low value — Consider supporting error simulation in fakePodNamespaceLister.
The `List` method always returns `nil` for the error. If future tests need to simulate pod listing failures (e.g., to test error handling paths), this would need to be extended.
♻️ Suggested enhancement for error simulation

```diff
 type fakePodLister struct {
 	pods []*corev1.Pod
+	err  error
 }

 type fakePodNamespaceLister struct {
 	lister *fakePodLister
 }

 func (f *fakePodNamespaceLister) List(selector labels.Selector) (ret []*corev1.Pod, err error) {
+	if f.lister.err != nil {
+		return nil, f.lister.err
+	}
 	return f.lister.pods, nil
 }
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/operator/apiserver/controller/workload/workload_test.go` around lines 902 - 904, The fakePodNamespaceLister.List currently always returns f.lister.pods and nil error; update the fake to allow simulating failures by adding an error field or function hook to the fakePodNamespaceLister (e.g., err error or listFunc func(labels.Selector) ([]*corev1.Pod, error)) and change List(selector labels.Selector) to return either the configured error or delegate to the hook/return the pods; update tests to inject errors where needed using the new field/hook.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: b7afd2ac-4fd1-4b8d-a012-341124ddf6a3
📒 Files selected for processing (4)
pkg/apps/deployment/progressing.go, pkg/apps/deployment/progressing_test.go, pkg/operator/apiserver/controller/workload/workload.go, pkg/operator/apiserver/controller/workload/workload_test.go
Force-pushed 4af5bd6 to b32dd27
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/operator/apiserver/controller/workload/workload.go`:
- Around line 307-308: The degraded-status path must consider rollout timeouts
even when AvailableReplicas haven't dropped: change the logic that computes
workloadIsBeingUpdatedTooLong (call to
deployment.HasDeploymentTimedOutProgressing) and workloadIsBeingUpdated
(currently !deployment.HasDeploymentProgressed &&
!workloadIsBeingUpdatedTooLong) so that a ProgressDeadlineExceeded timeout can
set DeploymentDegraded and produce a ProgressDeadlineExceeded reason instead of
only UnavailablePod; update the status-setting code that uses
workloadIsBeingUpdated and the degraded branch (the block referencing
AvailableReplicas and UnavailablePod) to check workloadIsBeingUpdatedTooLong
first and surface ProgressDeadlineExceeded as the condition reason; apply the
same change to the similar code in the other block noted (lines referenced in
review) so timeout cases always produce DeploymentDegraded with the timeout
reason.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 835cd7fd-5330-41b0-aa4a-604609fc43ac
📒 Files selected for processing (4)
pkg/apps/deployment/progressing.go, pkg/apps/deployment/progressing_test.go, pkg/operator/apiserver/controller/workload/workload.go, pkg/operator/apiserver/controller/workload/workload_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/apps/deployment/progressing_test.go
♻️ Duplicate comments (1)
pkg/operator/apiserver/controller/workload/workload.go (1)
307-329: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win — Surface rollout timeouts through DeploymentDegraded.
`ProgressDeadlineExceeded` is detected here, but the degraded path still only triggers when `AvailableReplicas < desiredReplicas` and always reports `UnavailablePod`. That means a timed-out rollout can leave `DeploymentDegraded=False` while old replicas are still serving, and even when it does degrade it masks the timeout behind the generic unavailable-pod reason. This still misses the failure mode called out in the PR description.
Suggested direction

```diff
-	_, workloadIsBeingUpdatedTooLong := deployment.HasDeploymentTimedOutProgressing(workload.Status)
+	timedOutMessage, workloadIsBeingUpdatedTooLong := deployment.HasDeploymentTimedOutProgressing(workload.Status)
 	workloadIsBeingUpdated := !deployment.HasDeploymentProgressed(workload.Status) && !workloadIsBeingUpdatedTooLong
@@
-	if !workloadHasAllPodsAvailable && (!workloadIsBeingUpdated || workloadIsBeingUpdatedTooLong) {
+	if workloadIsBeingUpdatedTooLong {
+		deploymentDegradedCondition = deploymentDegradedCondition.
+			WithStatus(operatorv1.ConditionTrue).
+			WithReason("ProgressDeadlineExceeded").
+			WithMessage(fmt.Sprintf("deployment/%s.%s has timed out progressing: %s", workload.Name, c.targetNamespace, timedOutMessage))
+	} else if !workloadHasAllPodsAvailable && !workloadIsBeingUpdated {
 		numNonAvailablePods := desiredReplicas - workload.Status.AvailableReplicas
 		deploymentDegradedCondition = deploymentDegradedCondition.
 			WithStatus(operatorv1.ConditionTrue).
 			WithReason("UnavailablePod")
```

The timeout scenarios in `pkg/operator/apiserver/controller/workload/workload_test.go` should follow this once fixed.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/operator/apiserver/controller/workload/workload.go` around lines 307 - 329, The current logic only sets DeploymentDegraded for unavailable pods and masks rollout timeouts; update the condition handling so that when deployment.HasDeploymentTimedOutProgressing(workload.Status) is true you set deploymentDegradedCondition to ConditionTrue with reason "ProgressDeadlineExceeded" and a message describing the timeout (include workload.Name, c.targetNamespace and any podContainersStatus if available) instead of/alongside the generic "UnavailablePod" path; specifically, after computing workloadIsBeingUpdatedTooLong and workloadIsBeingUpdated, branch on the timed-out case (use deployment.HasDeploymentTimedOutProgressing) to set deploymentDegradedCondition.WithStatus(operatorv1.ConditionTrue).WithReason("ProgressDeadlineExceeded").WithMessage(...), and only use the "UnavailablePod" reason when the failure is strictly due to AvailableReplicas < desiredReplicas and not due to the timeout.
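The precedence this comment asks for can be sketched as a small, standalone decision function. This is a hypothetical simplification of the suggested branching (the real code builds operatorv1 condition objects; `degradedReason` and its plain string reasons are stand-ins for illustration only):

```go
package main

import "fmt"

// degradedReason sketches the suggested branch ordering: a rollout
// timeout (ProgressDeadlineExceeded) takes precedence over the generic
// UnavailablePod reason, even when all replicas currently report available.
func degradedReason(timedOut bool, availableReplicas, desiredReplicas int32) string {
	switch {
	case timedOut:
		return "ProgressDeadlineExceeded"
	case availableReplicas < desiredReplicas:
		return "UnavailablePod"
	default:
		return "AsExpected"
	}
}

func main() {
	// Timeout wins even with all replicas available (old pods still serving).
	fmt.Println(degradedReason(true, 3, 3))
	fmt.Println(degradedReason(false, 1, 3))
	fmt.Println(degradedReason(false, 3, 3))
}
```

The point of the ordering is the first case: checking the timeout before the replica count is what keeps a stalled rollout from hiding behind `AsExpected` or the generic unavailable-pod reason.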
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 0dfdc39b-eff3-4603-971a-9ddfa4e4f103
📒 Files selected for processing (4)
pkg/apps/deployment/progressing.go, pkg/apps/deployment/progressing_test.go, pkg/operator/apiserver/controller/workload/workload.go, pkg/operator/apiserver/controller/workload/workload_test.go
Force-pushed b32dd27 to 233f3e6
Replace the generation-based progressing check and the 15-minute v1helpers.IsUpdatingTooLong timer with the deployment controller's native ProgressDeadlineExceeded condition. This avoids false Progressing=True signals on scaling operations (where generation changes but no rollout occurs) and uses a more accurate stall detection mechanism. Extract progressing condition logic into a new DeploymentProgressingCondition helper in pkg/apps/deployment, along with exported HasDeploymentProgressed and HasDeploymentTimedOutProgressing helpers.
Force-pushed 233f3e6 to 6741f3e
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: tchap. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files.
@tchap: all tests passed! Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Replace the generation-based progressing check and the 15-minute v1helpers.IsUpdatingTooLong timer with the deployment controller's native ProgressDeadlineExceeded condition. This avoids false Progressing=True signals on scaling operations (where generation changes but no rollout occurs) and uses a more accurate stall detection mechanism.
Extract progressing condition logic into a new DeploymentProgressingCondition helper in pkg/apps/deployment, along with exported HasDeploymentProgressed and HasDeploymentTimedOutProgressing helpers. The degraded condition now reports ProgressDeadlineExceeded when the deployment has timed out, instead of the generic UnavailablePod reason.
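The stall-detection idea described above (trusting the deployment controller's own `Progressing=False` / `ProgressDeadlineExceeded` condition instead of a generation comparison plus a fixed timer) can be sketched without any k8s dependencies. The struct types and the helper name below are simplified stand-ins mirroring the shape of `appsv1.DeploymentCondition` and the PR's `HasDeploymentTimedOutProgressing`, not the actual implementation:

```go
package main

import "fmt"

// DeploymentCondition is a simplified stand-in for appsv1.DeploymentCondition.
type DeploymentCondition struct {
	Type, Status, Reason, Message string
}

// DeploymentStatus is a simplified stand-in for appsv1.DeploymentStatus.
type DeploymentStatus struct {
	Conditions []DeploymentCondition
}

// hasDeploymentTimedOutProgressing reports a stalled rollout exactly when
// the deployment controller itself has flipped Progressing to False with
// reason ProgressDeadlineExceeded. It returns the controller's message so
// callers can surface it in operator conditions.
func hasDeploymentTimedOutProgressing(status DeploymentStatus) (string, bool) {
	for _, c := range status.Conditions {
		if c.Type == "Progressing" && c.Status == "False" && c.Reason == "ProgressDeadlineExceeded" {
			return c.Message, true
		}
	}
	return "", false
}

func main() {
	stalled := DeploymentStatus{Conditions: []DeploymentCondition{{
		Type:    "Progressing",
		Status:  "False",
		Reason:  "ProgressDeadlineExceeded",
		Message: "deployment has exceeded its progress deadline",
	}}}
	if msg, ok := hasDeploymentTimedOutProgressing(stalled); ok {
		fmt.Println("stalled:", msg)
	}
}
```

Because scaling changes the generation without starting a rollout, a generation-based check fires spuriously there; the condition-based check above only triggers once the deployment controller's `progressDeadlineSeconds` has actually elapsed without progress.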
Testing PR for CAO
Notes for Review
I published the helpers in a separate package because there are multiple other components that will be able to use them later on. I also reviewed the test cases as much as I could. Claude says all combinations are covered.
I also made sure not to touch Degraded in this PR, it's gonna come next.
Summary by CodeRabbit
Bug Fixes
Refactor
Tests