OCPBUGS-79418: ctrl: sched: add topologySpreadConstraints to deployment#4177
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughScheduler pod placement shifts from affinity rules to hostname-based topology spread constraints. Explicit scheduler replica counts are now validated against auto-detected eligible nodes. Deployment-completion waiting is refactored to accept explicit replica counts and validate generation and per-field replica readiness independently with detailed logging. ChangesScheduler topology spread and replica validation
Sequence Diagram(s)sequenceDiagram
participant NUMAResourcesScheduler
participant Controller as computeSchedulerReplicas
participant NodeAPI
participant KubeAPI
NUMAResourcesScheduler->>Controller: spec.Replicas (may be nil)
Controller->>NodeAPI: List nodes by platform role
NodeAPI-->>Controller: eligible node count
alt explicit replicas set
Controller->>Controller: validate spec.Replicas ≤ node count
alt validation fails
Controller-->>NUMAResourcesScheduler: error
else validation passes
Controller-->>NUMAResourcesScheduler: spec.Replicas
end
else autodetect
Controller-->>NUMAResourcesScheduler: detected node count
end
NUMAResourcesScheduler->>KubeAPI: patch Deployment with TopologySpreadConstraints
KubeAPI-->>NUMAResourcesScheduler: Deployment updated
🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/controller/numaresourcesscheduler_controller.go`:
- Around line 279-285: The current replica validation rejects spec.replicas: 0
when replicaCount is 0; update the logic in the block handling
schedSpec.Replicas so that an explicit zero is always accepted: first check if
*schedSpec.Replicas == 0 and if so return schedSpec.Replicas, nil; otherwise
keep the existing guard that errors when *schedSpec.Replicas > replicaCount
(using the existing replicaCount and nodeRoleDescription symbols) and return the
validated value. This preserves the existing maximum check but allows
intentional scale-to-zero.
In `@test/e2e/sched/sched_test.go`:
- Around line 392-396: The current test compares Deployment.ResourceVersion
immediately (using schedDpPostUpdate.ResourceVersion ==
schedDp.ResourceVersion), which is flaky; replace that assertion with a
Consistently check that the deployment's spec and generation do not change over
a short interval. Specifically, fetch the deployment via
podlist.With(...).DeploymentByOwnerReference(ctx, nroSchedObj.UID) inside a
Gomega Consistently block and assert that schedDpPostUpdate.Spec (or relevant
spec fields) and schedDpPostUpdate.Generation remain equal to the original
schedDp.Spec and schedDp.Generation for the duration; remove the ResourceVersion
equality assertion. Apply the same replacement for the similar check around
schedDpPostUpdate/schedDp at lines 405-409.
- Around line 485-493: The cleanup uses an uninitialized pointer `var node
*corev1.Node`, causing nil deref when calling `e2eclient.Client.Get` and
accessing `node.Spec.Unschedulable`; allocate a concrete Node before use (e.g.,
declare `node := &corev1.Node{}`) in the `DeferCleanup` closure, then call
`e2eclient.Client.Get(ctx, client.ObjectKey{Name: targetNodeName}, node)` and
update `node.Spec.Unschedulable = false` followed by
`e2eclient.Client.Update(ctx, node)` inside the `Eventually` block so
`needToUncordon`, `targetNodeName`, `DeferCleanup`, and the get/update calls
work safely.
- Around line 503-518: In the Eventually polling loop that calls
podlist.With(...).ByDeployment and assigns to recreatedPod, guard against nil
before dereferencing recreatedPod by moving or conditionalizing the klog.InfoS
call and any access to recreatedPod.Namespace/Name until after you assert
recreatedPod is not nil (or wrap the logging in an if recreatedPod != nil
block); ensure g.Expect(recreatedPod).ToNot(BeNil()) still runs in the loop so
the loop can retry when no pending pod is found and only after that check use
recreatedPod fields for logging or further assertions.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: bb00a70c-f80a-4212-b730-569aa2e631d4
📒 Files selected for processing (5)
internal/controller/numaresourcesscheduler_controller.gointernal/controller/numaresourcesscheduler_controller_test.gopkg/objectupdate/sched/sched.gopkg/objectupdate/sched/sched_test.gotest/e2e/sched/sched_test.go
|
/cc |
9a45f1f to
aad497d
Compare
|
requires #4193 |
aad497d to
b3da392
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/objectupdate/sched/sched.go`:
- Around line 107-110: The code only rejects nil dp.Spec.Template.Labels but
allows an empty map which yields an empty MatchLabels selector; update the
validation to treat empty maps as invalid by checking both nil and len(labels)
== 0 (i.e., if labels == nil || len(labels) == 0) and return an error before
constructing the selector/MatchLabels so you never build a LabelSelector with an
empty MatchLabels.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: 738f118f-0f41-492e-9e1e-9f31e89f752d
📒 Files selected for processing (5)
internal/controller/numaresourcesscheduler_controller.gointernal/controller/numaresourcesscheduler_controller_test.gopkg/objectupdate/sched/sched.gopkg/objectupdate/sched/sched_test.gotest/e2e/sched/sched_test.go
ffromani
left a comment
There was a problem hiding this comment.
Good work. Do we have a test which actually verify the happy path? 3 control-plane nodes, check we autodetect 3 replicas and each one lands on a different node.
we have that as a step in one of the tests: |
2207af4 to
c2e260d
Compare
|
/retest |
|
@shajmakh: This pull request references Jira Issue OCPBUGS-79418, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact: The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Tal-or
left a comment
There was a problem hiding this comment.
LGTM overall, left minor comment
ffromani
left a comment
There was a problem hiding this comment.
let's add one more e2e test (suggested inline) and we can merge
c2e260d to
098333c
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/wait/deployment.go`:
- Around line 53-63: areDeploymentReplicasAvailable currently always returns
true, which makes IsDeploymentComplete succeed as soon as ObservedGeneration
advances; change areDeploymentReplicasAvailable to perform a real availability
check: compare newStatus.Replicas, newStatus.UpdatedReplicas and
newStatus.AvailableReplicas against the desired replicas value and return false
if any required condition is not met (log the mismatches with klog.InfoS as
already used), returning true only when the availability predicate is satisfied;
if strict equality is too strict for your rollout semantics, implement a clear
predicate (e.g., UpdatedReplicas >= replicas and AvailableReplicas >=
minDesired) instead of unconditional true so IsDeploymentComplete behaves
correctly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: a48fb4bf-6aa6-4125-8967-f83a3a69196d
📒 Files selected for processing (6)
internal/controller/numaresourcesscheduler_controller.gointernal/controller/numaresourcesscheduler_controller_test.gointernal/wait/deployment.gopkg/objectupdate/sched/sched.gopkg/objectupdate/sched/sched_test.gotest/e2e/sched/sched_test.go
🚧 Files skipped from review as they are similar to previous changes (5)
- internal/controller/numaresourcesscheduler_controller.go
- pkg/objectupdate/sched/sched.go
- test/e2e/sched/sched_test.go
- internal/controller/numaresourcesscheduler_controller_test.go
- pkg/objectupdate/sched/sched_test.go
PodAntiAffinity is a legacy way to handle HA scenarios and in project state it caused deadlocks and regressions. This reverts commit b4e7fd1. Signed-off-by: Shereen Haj <shajmakh@redhat.com>
It's reasonable to have high-availability enabled to have scheduler pods number matches the control-plane nodes' number or workers on hypershift. but a number greater than that makes no point. This commit degrades the scheduler status in case a higher number of replicas is requested. Signed-off-by: Shereen Haj <shajmakh@redhat.com>
cc90aeb to
01d534d
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/wait/deployment.go (1)
27-40:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftThis waiter can still report success before a no-scale rollout is observed.
IsDeploymentComplete(dp.Generation, ...)only proves the status is not older than the pre-existing Deployment generation. When the rollout is triggered indirectly and keeps the same replica count, callers liketest/e2e/sched/sched_test.goLine 80 and Line 135 pass a Deployment fetched before the CR update, soObservedGeneration == dp.Generationis already true and this can return before the new image/spec is actually rolled out.That weakens the new “generation-aware” contract for exactly the same-replica updates this PR adds. The waiter needs a way to observe a newer Deployment generation for indirect rollouts, or a separate helper for “wait for reconcile after parent object change” instead of reusing the stale generation from the caller.
Also applies to: 66-72
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/wait/deployment.go` around lines 27 - 40, The waiter currently can return success using the caller's stale dp.Generation; change ForDeploymentCompleteWithReplicas (and the similar block at 66-72) so it will not treat the Deployment as complete until a newer generation has been observed: require that either updatedDp.Generation > dp.Generation or updatedDp.Status.ObservedGeneration > dp.Generation (i.e., the controller has observed a newer generation) before allowing IsDeploymentComplete to return success; in practice, check the updatedDp.Generation/updatedDp.Status.ObservedGeneration and only call/accept IsDeploymentComplete when the newer generation has been observed, otherwise keep polling.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/e2e/sched/sched_test.go`:
- Around line 653-660: The helper infoUponDeploymentCompleteFailure logs
deployment status but only iterates pods when
podlist.With(e2eclient.Client).ByDeployment(...) returns an error, which is
inverted; change the control flow so that after calling
podlist.With(...).ByDeployment(ctx, *schedDp) you iterate and log each pod
(using pods range and klog.InfoS with "podName"/"podStatus") when err == nil,
and handle the error case by logging the retrieval error (include err) instead;
update the branches around the call to ByDeployment to reflect this corrected
logic.
In `@test/e2e/serial/tests/scheduler_removal.go`:
- Around line 162-163: The wait call mistakenly calls Interval twice which
overwrites the first interval; update the chain that calls
wait.With(fxt.Client).Interval(2*time.Second).Interval(30*time.Second).ForDeploymentCompleteWithReplicas(...)
so the second Interval becomes a Timeout (e.g., Timeout(30*time.Second)) to
bound the overall wait while keeping a 2s poll cadence; target the wait.With /
Interval / Timeout / ForDeploymentCompleteWithReplicas chain in
scheduler_removal.go and ensure the poll interval remains 2s and the overall
timeout is 30s.
In `@test/e2e/serial/tests/workload_placement.go`:
- Line 286: After triggering a rollout and assigning updatedDp, change calls
that pass the original dp into the generation-aware waiter to instead pass the
updated deployment object returned from the update; specifically update the
three calls to
wait.With(fxt.Client).Interval(...).Timeout(...).ForDeploymentCompleteWithReplicas(...)
so they use updatedDp (the result of the update) rather than dp to ensure the
waiter observes the new generation when waiting for replicas, keeping references
to function ForDeploymentCompleteWithReplicas and variables dp/updatedDp to
locate the sites.
---
Outside diff comments:
In `@internal/wait/deployment.go`:
- Around line 27-40: The waiter currently can return success using the caller's
stale dp.Generation; change ForDeploymentCompleteWithReplicas (and the similar
block at 66-72) so it will not treat the Deployment as complete until a newer
generation has been observed: require that either updatedDp.Generation >
dp.Generation or updatedDp.Status.ObservedGeneration > dp.Generation (i.e., the
controller has observed a newer generation) before allowing IsDeploymentComplete
to return success; in practice, check the
updatedDp.Generation/updatedDp.Status.ObservedGeneration and only call/accept
IsDeploymentComplete when the newer generation has been observed, otherwise keep
polling.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: 41902836-39ef-489b-a0b6-58348bb7bdfb
📒 Files selected for processing (14)
internal/controller/numaresourcesscheduler_controller.gointernal/controller/numaresourcesscheduler_controller_test.gointernal/wait/deployment.gopkg/objectupdate/sched/sched.gopkg/objectupdate/sched/sched_test.gotest/e2e/sched/sched_test.gotest/e2e/serial/tests/configuration.gotest/e2e/serial/tests/non_regression.gotest/e2e/serial/tests/resource_accounting.gotest/e2e/serial/tests/scheduler_removal.gotest/e2e/serial/tests/workload_overhead.gotest/e2e/serial/tests/workload_placement.gotest/e2e/serial/tests/workload_placement_nodelabel.gotest/e2e/serial/tests/workload_placement_tmpol.go
🚧 Files skipped from review as they are similar to previous changes (4)
- pkg/objectupdate/sched/sched_test.go
- pkg/objectupdate/sched/sched.go
- internal/controller/numaresourcesscheduler_controller.go
- internal/controller/numaresourcesscheduler_controller_test.go
b1ee4d6 to
ef1ff13
Compare
| func areDeploymentReplicasAvailable(newStatus *appsv1.DeploymentStatus, replicas int32) bool { | ||
| return newStatus.UpdatedReplicas == replicas && | ||
| newStatus.Replicas == replicas && | ||
| newStatus.AvailableReplicas == replicas | ||
| if newStatus.Replicas != replicas { | ||
| klog.InfoS("newStatus.Replicas", "expected", replicas, "found", newStatus.Replicas) | ||
| return false | ||
| } | ||
| if newStatus.UpdatedReplicas != replicas { | ||
| klog.InfoS("newStatus.UpdatedReplicas", "expected", replicas, "found", newStatus.UpdatedReplicas) | ||
| return false | ||
| } | ||
| if newStatus.AvailableReplicas != replicas { | ||
| klog.InfoS("newStatus.AvailableReplicas", "expected", replicas, "found", newStatus.AvailableReplicas) | ||
| return false | ||
| } | ||
| return true | ||
| } | ||
|
|
||
| func IsDeploymentComplete(dp *appsv1.Deployment, newStatus *appsv1.DeploymentStatus) bool { | ||
| return areDeploymentReplicasAvailable(newStatus, *(dp.Spec.Replicas)) && | ||
| newStatus.ObservedGeneration >= dp.Generation | ||
| if newStatus.ObservedGeneration < dp.Generation { | ||
| klog.InfoS("generation is older than expected to indicate the deployment is complete", "old", dp.Generation, "new", newStatus.ObservedGeneration) | ||
| return false | ||
| } | ||
|
|
||
| return areDeploymentReplicasAvailable(newStatus, *(dp.Spec.Replicas)) |
There was a problem hiding this comment.
This is a bit heavy handed. Currently this is called only in test code, but the moment we pull in production code we will have log spam. There are options, but all require more work to fix. Can we simplify this commit or drop it entirely?
There was a problem hiding this comment.
this was needed for debugging, it essentially describes why the functions fail which is why we have the last fixing commit so this is very helpful details for troubleshooting the tests.
If ever we decide to use that in production, we need to be aware of the old use which is initially meant for tests. we can anyway create a production version and improve it then as we see fit.
Thinking about the backport, I think it's better to keep the old function ForDeploymentComplete to reduce any tests conflicts which will then simplify the commit a bit in terms of the adaptations across the codebase. would that sound good enough?
There was a problem hiding this comment.
discouraging production use could be as simple as move this code in a subtree of test/ perhaps test/internal/wait/... but I won't do in this PR.
A way forward would be post a followup PR with first the trivial move and then adding the logs
There was a problem hiding this comment.
actually let's move forward and let's revisit later
| // not enough to guarantee that the deployment is ready with the NEW replica, thus need to cover that by | ||
| // additional checks as the context requires | ||
| // ForDeploymentCompleteWithReplicas waits for the deployment to be complete and have the specified number of replicas. | ||
| func (wt Waiter) ForDeploymentCompleteWithReplicas(ctx context.Context, dp *appsv1.Deployment, newExpectedReplicas int32) (*appsv1.Deployment, error) { |
There was a problem hiding this comment.
good addition. If most of the codebase actually uses the pattern
ForDeploymentCompleteWithReplicas(ctx, dp, *dp.Spec.Replicas)
and at glance it seems the case, it's worth keeping the old API ForDeploymentComplete
There was a problem hiding this comment.
I intentionally updated the codebase because ForDeploymentComplete without having the correct expected replicas count does not sound complete and can be confusing. Internally in the function, replicas check is a core one and in order to avoid using the incorrect function for deployment waiter I overrode the old API. This was actually an old bug as noted in the omitted comment; the new version is expected to handle all deployments now.
There was a problem hiding this comment.
ok, so let's just rename ForDeploymentCompleteWithReplicas to ForDeploymentComplete
There was a problem hiding this comment.
in the latest push I restored ForDeploymentComplete to keep the diff as less as possible to reduce conflicts while backporting. I can address changing the whole code base in a followup PR including #4177 (comment)
ffromani
left a comment
There was a problem hiding this comment.
/approve
/lgtm
we can defer the remaining comments to followups
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ffromani, shajmakh The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/hold |
ef1ff13 to
e3a5016
Compare
To evenly spread the pods across the nodes in a a balanced way taking into account the new replicaset pods while rollout. Signed-off-by: Shereen Haj <shajmakh@redhat.com>
The check includes many verification steps that only when all passed the function pass. Sometimes the check fails without extra details of what exactly was the step that failed. This commit adds debugging logs to provide these details. Signed-off-by: Shereen Haj <shajmakh@redhat.com>
So far the old function was waiting for the deployment to be ready with the same replicas count as the old one. It is however possible that the new deployment version includes a different replicas field hence should expect to a different replicas count. The old version would fail on such case where the old count vs the new one are not the same. This commit fixes this issue by introducing new function that accepts the expected new replicas count and compare against it. The old function API still exists but is redirected to use the new function with same old behavior. Existing tests are not affected by this. Signed-off-by: Shereen Haj <shajmakh@redhat.com>
e3a5016 to
f30a924
Compare
|
Actionable comments posted: 0 |
|
@shajmakh: This pull request references Jira Issue OCPBUGS-79418, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/unhold |
|
/cherry-pick release-4.22,release-4.21, release-4.20 |
|
@shajmakh: once the present PR merges, I will cherry-pick it on top of DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@shajmakh: Jira Issue OCPBUGS-79418: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-79418 has been moved to the MODIFIED state. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@shajmakh: cannot checkout DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/cherry-pick release-4.22 release-4.21 release-4.20 |
|
@shajmakh: new pull request created: #4265 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
To evenly spread the pods across the nodes in a balanced way taking
into account the new replicaset pods while rollout.
previous discussions:
(reverted) #4021
(cancelled due to regressions) #4028