CNF-23239: controller: nrop: per-tree processing and progress#3998
CNF-23239: controller: nrop: per-tree processing and progress#3998ffromani wants to merge 12 commits into
Conversation
|
/hold |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ffromani The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
b9f2280 to
6aeebce
Compare
📝 WalkthroughWalkthroughThis PR refactors the NUMAResourcesOperator controller from all-trees-at-once reconciliation to per-tree iteration. It separates tree-agnostic object state (RBAC, metrics) from per-tree state derivation, adds Step state predicates and message updates, and replaces multi-phase reconciliation with a single orchestration that applies tree-agnostic resources, then reconciles each tree and aggregates results. ChangesPer-tree reconciliation refactor with tree-agnostic object state
🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
6aeebce to
7a66a48
Compare
Tal-or
left a comment
There was a problem hiding this comment.
I know it's still WIP but already adding my comments so they won't slip.
| result.mcpStatuses = syncMachineConfigPoolNodeGroupConfigStatuses(result.mcpStatuses, singleTreeSlice) | ||
|
|
||
| return intreconcile.StepSuccess() | ||
| result.step = intreconcile.StepSuccess() |
There was a problem hiding this comment.
if we have pausedMCPNames, we should not set this StepSuccess() but StepOngoing() so controller wont continue to update DS for this specific tree
There was a problem hiding this comment.
StepOngoing() wants to requeue the request though. I'm not 100% sure this is the right approach.
There was a problem hiding this comment.
StepSuccess continues to update the DS, and doing that while the MCP paused, might causes to misalignment with security context/ SELinux policy.
Requeue the request for later is the lesser evil option IMO.
Another option is not continue and not requeue, a new request will come in anyway once the MCP moved to paused: false because we're watching it.
Not sure if we have a step that does that though.
There was a problem hiding this comment.
we can probably use StepOngoing with 0 value; this should report the right condition without causing a requeue
|
@ffromani: This pull request references CNF-23239 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
7a66a48 to
bc3c6fe
Compare
bc3c6fe to
9628673
Compare
9628673 to
c508fc8
Compare
|
@ffromani: This pull request references CNF-23239 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/hold cancel |
Tal-or
left a comment
There was a problem hiding this comment.
Usually it's helpful to break the PR into smaller commits, but here (especially in the few last commits) some of the newer commits are overriding/updating the logic of their previous ones.
I know it was made to convey the steps clearly but it's bit confusing because now I see that I commented on lines of code that were deleted/moved in later commits.
Anyway lets address what we have right now and I'll have another round later on this week.
| } | ||
|
|
||
| func (em *ExistingManifests) MachineConfigsState(mf Manifests) ([]MachineConfigObjectState, sets.Set[string]) { | ||
| var allRet []MachineConfigObjectState |
There was a problem hiding this comment.
Since this area is going through a major refactoring anyway, it would be nice to make the names of variables and functions more accurate - it's the MachineConfigPool* that we're checking/updating and not MachineConfig*
as we all aware those are related though different objects.
There was a problem hiding this comment.
I agree about improving, thing is this function want to compute the existing state of MachineConfig objects, and compute the desired state of MachineConfig, which are indeed related to MachineConfigPools, but the pool is the scoping mechanism.
IOW, we want to reconcile MachineConfigs, and return ObjectState about these.
| daemonSets: make(map[string]daemonSetManifest), | ||
| machineConfigs: make(map[string]machineConfigManifest), |
There was a problem hiding this comment.
If that's PerTree how come more than a single DamonSet and/or MachineConfig?
There was a problem hiding this comment.
Mostly, because I'm reusing ExistingManifests which predates the split and had 1:N relationship because the horizontal partitioning.
We can probably simplify further as followup of this PR
| return nropv1.MachineConfigPool{Name: name} | ||
| } | ||
|
|
||
| type treeReconcileResult struct { |
There was a problem hiding this comment.
slice of DSs and MCPs implies that this is for all trees so should be treesReconcileResult in plural or returning a slice of treeReconcileResult one per tree
There was a problem hiding this comment.
this is meant to be a reconcile result of a single tree (therefore perTreeReconcileResult is probably better), but the problem here is we can't yet enforce the 1:1 mapping between trees and MCPs (https://github.com/openshift-kni/numaresources-operator/blob/main/internal/api/annotations/annotations.go#L25).
Well, there's an argument about doing that in 5.0 - we had 3 EUS version past 4.18 already, but then should be done as prerequisite of this PR, and would make backporting significantly more complex.
This is why I'm still carrying this forward and postponing to 5.2.
| mcResult := treeReconcileResult{} | ||
| if r.Platform == platform.OpenShift { | ||
| mcResult = r.reconcileResourceMachineConfig(ctx, instance, treeExisting, tree) | ||
| if mcResult.step.EarlyStop() /* || mcResult.pausedMCPNames.Len() > 0*/ { |
There was a problem hiding this comment.
no, this is is a gap on my side. Let me have another cleanup pass. Restoring WIP for the time being.
|
/retitle WIP: CNF-23239: controller: nrop: per-tree processing and progress thanks for the review. I'll need to do more cleanup |
|
@ffromani: This pull request references CNF-23239 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@ffromani: This pull request references CNF-23239 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@ffromani: This pull request references CNF-23239 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
a847972 to
ee816e3
Compare
|
@ffromani: This pull request references CNF-23239 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
ee816e3 to
da33cd8
Compare
|
@ffromani: This pull request references CNF-23239 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Exposing the name is useful for logging purposes; no logic is planned upon the name of the objects. Signed-off-by: Francesco Romani <fromani@redhat.com>
this allow to update an existing message, because, by design, WithMessage() want to avoid overwriting. Will be used to summarize the status in the upcoming "vertical" split of the NodeGroups reconciliation AI-attribution: AIA PAI CeNc Hin R claude-4.6-opus-1M v1.0 Signed-off-by: Francesco Romani <fromani@redhat.com>
add helpers which we will use in the "vertical split" of nodegroup reconciliation. Naming explanation: the split is about how the reconciliation makes progress. In the model we had till now, retroactively called "horizontal split", each NodeGroup must make progress on each reconciliation stage before any of these can progress to the next stage. With the upcoming proposed approach, the "vertical split", each NodeGroup will advance independently on the reconciliation step, and only the global reported status is set pulled to the slowest state. AI-attribution: AIA PAI Ce Hin R claude-4.6-opus-1M v1.0 Signed-off-by: Francesco Romani <fromani@redhat.com>
to reduce the diff noise. No changes in behavior, trivial code movement. AI-attribution: AIA PAI Ce Hin R claude-4.6-opus-1M v1.0 Signed-off-by: Francesco Romani <fromani@redhat.com>
Add per-tree variants of the key ExistingManifests methods: - TreeAgnostic(mf): returns tree-independent object states - PerTreeState(mf, tree): returns object states for a single tree - FromClientTreeAgnostic(): creates ExistingManifests without tree data - PerTree(ctx, cli, tree): creates a per-tree ExistingManifests clone - MachineConfigsStateForTree(mf, tree): per-tree machine config state The existing State(), FromClient(), and MachineConfigsState() are kept as compatibility wrappers that delegate to the new per-tree functions, so the controller remains unchanged. AI-attribution: AIA PAI CeNc Hin R claude-4.6-opus-1M v1.0 Signed-off-by: Francesco Romani <fromani@redhat.com>
trivial signature change to enable upcoming refactoring. No intended change in behavior. AI-attribution: AIA PAI Ce Hin R claude-4.6-opus-1M v1.0 Signed-off-by: Francesco Romani <fromani@redhat.com>
Add the building blocks for per-tree reconciliation without changing existing code paths: - perTreeResult type to capture per-tree outcomes - reconcilePerTreeMachineConfig: per-tree machine config reconciliation - reconcilePerTreeDaemonSet: per-tree daemonset reconciliation - setupTreeAgnosticManifests: extracted tree-agnostic manifest setup - applyObjects: generic object state applier - collectDaemonSets, reduceTreeResults, shouldReplaceStep, treeSummaryMessage: aggregation helpers These functions are not yet wired into the reconcile loop; they will replace the existing all-trees-at-once functions in a follow-up. AI-attribution: AIA PAI CeNc Hin R claude-4.6-opus-1M v1.0 Signed-off-by: Francesco Romani <fromani@redhat.com>
Rewrite reconcileResource to use the per-tree reconcile functions introduced in the previous commit instead of the old horizontal functions. Use FromClientTreeAgnostic for tree-agnostic resource setup and PerTree for per-tree processing. Move dangling resource cleanup into reconcileResource and use horizontal orchestration: all trees complete machine config step first, then all trees complete daemonset step. AI-attribution: AIA PAI Ce Hin R claude-4.6-opus-1M v1.0 Signed-off-by: Francesco Romani <fromani@redhat.com>
Change syncMachineConfigPoolsStatuses and syncMachineConfigPoolNodeGroupConfigStatuses from variadic trees ...nodegroupv1.Tree to single tree nodegroupv1.Tree, removing the outer tree iteration loop from each. These functions are now only called from reconcilePerTreeMachineConfig which already passes a single tree. AI-attribution: AIA PAI Ce Hin R claude-4.6-opus-1M v1.0 Signed-off-by: Francesco Romani <fromani@redhat.com>
…eConfig Rename syncMachineConfigs to syncMachineConfig and change its signature from variadic trees ...nodegroupv1.Tree to single tree nodegroupv1.Tree. Switch the internal call from MachineConfigsState to MachineConfigsStateForTree. Also narrow validateMachineConfigLabels from []nodegroupv1.Tree to single nodegroupv1.Tree, removing the outer tree loop. AI-attribution: AIA PAI Ce Hin R claude-4.6-opus-1M v1.0 Signed-off-by: Francesco Romani <fromani@redhat.com>
Remove the horizontal compat wrappers (State, FromClient, MachineConfigsState without tree param) and the trees field from ExistingManifests, now that the controller uses the per-tree API directly. Rename MachineConfigsStateForTree to MachineConfigsState and update the controller call site and tests accordingly. AI-attribution: AIA PAI CeNc Hin R claude-4.6-opus-1M v1.0 Signed-off-by: Francesco Romani <fromani@redhat.com>
Change reconcileResource from horizontal orchestration (all trees complete machine config step, then all trees complete daemonset step) to vertical processing (each tree progresses through all steps independently). This means a single slow nodegroup no longer blocks progress for the others. Move tree normalization from the pre-processing loop in Reconcile into the per-tree loop in reconcileResource, where it naturally belongs alongside the rest of per-tree processing. Add a test verifying that a tree whose MCP is ready creates its DaemonSet even while another tree's MCP is still pending. AI-attribution: AIA PAI CeNc Hin R claude-4.6-opus-1M v1.0 Signed-off-by: Francesco Romani <fromani@redhat.com>
da33cd8 to
ef94726
Compare
|
reviewable again |
|
@ffromani: This pull request references CNF-23239 which is a valid jira issue. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
internal/reconcile/step_test.go (1)
45-61: ⚡ Quick winAdd coverage for
StepOngoing(0)semanticsPlease add a regression test asserting that
StepOngoing(0)is still considered ongoing (and not failed). This is a critical path for paused MCP handling in the controller.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/reconcile/step_test.go` around lines 45 - 61, Add a regression test that calls StepOngoing(0) and asserts it is ongoing and not failed: create a new test (e.g., TestStepOngoingZeroIsOngoing) that constructs st := StepOngoing(0) and uses assert.True(t, st.Ongoing()) and assert.False(t, st.Failed()) to verify the semantics; place it alongside the existing TestStepOngoingIsOngoing/TestStepFailedIsFailed/TestStepSuccessIsNeitherOngoingNorFailed tests.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/controller/numaresourcesoperator_controller_test.go`:
- Around line 2285-2297: Add assertions that the pending tree's DaemonSet is
still absent and the global Available condition is false: attempt to Get the
DaemonSet for mcp1 (using objectnames.GetComponentName(nro.Name, mcp1.Name) and
testNamespace) and expect a NotFound error, then fetch nro and verify
getConditionByType(nro.Status.Conditions, status.ConditionAvailable) is not nil
and its Status equals metav1.ConditionFalse. This complements the existing
checks that pool2 DS exists and Progressing is true.
In `@internal/controller/numaresourcesoperator_controller.go`:
- Around line 822-824: The code currently only logs objState.Error and
continues; change this to fail fast: when objState.Error != nil, log the error
with context and immediately stop processing by returning or propagating the
error from the enclosing reconcile/processing function (do not continue to apply
manifests). Ensure you wrap or annotate objState.Error with a clear message
(e.g., "failed building manifest/state for <object>") before returning so
callers can handle requeueing or surface the failure.
In `@internal/reconcile/step.go`:
- Around line 46-49: The Ongoing() method currently returns true only when
Result.RequeueAfter > 0 which incorrectly treats StepOngoing(0) as not ongoing;
update Step.Ongoing() to treat zero as an ongoing/ progressing value (e.g.,
return rs.Result.RequeueAfter >= 0) or otherwise compare against the defined
sentinel for "no requeue" so that StepOngoing(0) is classified as ongoing;
change the condition in the Ongoing() implementation that references
Result.RequeueAfter so zero is considered progressing (and ensure this aligns
with the StepOngoing(0) semantics).
---
Nitpick comments:
In `@internal/reconcile/step_test.go`:
- Around line 45-61: Add a regression test that calls StepOngoing(0) and asserts
it is ongoing and not failed: create a new test (e.g.,
TestStepOngoingZeroIsOngoing) that constructs st := StepOngoing(0) and uses
assert.True(t, st.Ongoing()) and assert.False(t, st.Failed()) to verify the
semantics; place it alongside the existing
TestStepOngoingIsOngoing/TestStepFailedIsFailed/TestStepSuccessIsNeitherOngoingNorFailed
tests.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Enterprise
Run ID: cd1941fa-fd9e-4e86-911b-8ba816943886
📒 Files selected for processing (11)
api/v1/helper/nodegroup/nodegroup.goapi/v1/helper/nodegroup/nodegroup_test.gointernal/controller/numaresourcesoperator_controller.gointernal/controller/numaresourcesoperator_controller_test.gointernal/reconcile/step.gointernal/reconcile/step_test.gopkg/objectstate/rte/machineconfigpool.gopkg/objectstate/rte/machineconfigpool_test.gopkg/objectstate/rte/rte.gopkg/status/conditioninfo/conditioninfo.gopkg/status/conditioninfo/conditioninfo_test.go
| By("Verify pool2 DaemonSet exists even though pool1 is still pending") | ||
| ds := &appsv1.DaemonSet{} | ||
| mcp1DSKey := client.ObjectKey{ | ||
| Name: objectnames.GetComponentName(nro.Name, mcp1.Name), | ||
| mcp2DSKey := client.ObjectKey{ | ||
| Name: objectnames.GetComponentName(nro.Name, mcp2.Name), | ||
| Namespace: testNamespace, | ||
| } | ||
| Expect(reconciler.Client.Get(ctx, mcp1DSKey, ds)).To(Succeed()) | ||
| Expect(reconciler.Client.Get(ctx, mcp2DSKey, ds)).To(Succeed()) | ||
|
|
||
| mcp2DSKey := client.ObjectKey{ | ||
| Name: objectnames.GetComponentName(nro.Name, mcp2.Name), | ||
| By("Verify global status is Progressing, not Available") | ||
| Expect(reconciler.Client.Get(ctx, client.ObjectKeyFromObject(nro), nro)).To(Succeed()) | ||
| progressingCond := getConditionByType(nro.Status.Conditions, status.ConditionProgressing) | ||
| Expect(progressingCond).ToNot(BeNil()) | ||
| Expect(progressingCond.Status).To(Equal(metav1.ConditionTrue)) |
There was a problem hiding this comment.
Assert the pending tree is still blocked in the first reconcile
This test currently proves only the positive path (ready tree DS exists, Progressing=true). It does not assert that the pending tree DS is still absent and Available is false at that point, so a regression to eager DS creation could still pass.
Suggested test hardening
By("Verify pool2 DaemonSet exists even though pool1 is still pending")
ds := &appsv1.DaemonSet{}
mcp2DSKey := client.ObjectKey{
Name: objectnames.GetComponentName(nro.Name, mcp2.Name),
Namespace: testNamespace,
}
Expect(reconciler.Client.Get(ctx, mcp2DSKey, ds)).To(Succeed())
+
+ By("Verify pool1 DaemonSet is not created yet")
+ mcp1DSKeyPending := client.ObjectKey{
+ Name: objectnames.GetComponentName(nro.Name, mcp1.Name),
+ Namespace: testNamespace,
+ }
+ err = reconciler.Client.Get(ctx, mcp1DSKeyPending, &appsv1.DaemonSet{})
+ Expect(apierrors.IsNotFound(err)).To(BeTrue(), "pool1 daemonset should still be pending")
By("Verify global status is Progressing, not Available")
Expect(reconciler.Client.Get(ctx, client.ObjectKeyFromObject(nro), nro)).To(Succeed())
progressingCond := getConditionByType(nro.Status.Conditions, status.ConditionProgressing)
Expect(progressingCond).ToNot(BeNil())
Expect(progressingCond.Status).To(Equal(metav1.ConditionTrue))
+ availableCond := getConditionByType(nro.Status.Conditions, status.ConditionAvailable)
+ Expect(availableCond).ToNot(BeNil())
+ Expect(availableCond.Status).To(Equal(metav1.ConditionFalse))📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| By("Verify pool2 DaemonSet exists even though pool1 is still pending") | |
| ds := &appsv1.DaemonSet{} | |
| mcp1DSKey := client.ObjectKey{ | |
| Name: objectnames.GetComponentName(nro.Name, mcp1.Name), | |
| mcp2DSKey := client.ObjectKey{ | |
| Name: objectnames.GetComponentName(nro.Name, mcp2.Name), | |
| Namespace: testNamespace, | |
| } | |
| Expect(reconciler.Client.Get(ctx, mcp1DSKey, ds)).To(Succeed()) | |
| Expect(reconciler.Client.Get(ctx, mcp2DSKey, ds)).To(Succeed()) | |
| mcp2DSKey := client.ObjectKey{ | |
| Name: objectnames.GetComponentName(nro.Name, mcp2.Name), | |
| By("Verify global status is Progressing, not Available") | |
| Expect(reconciler.Client.Get(ctx, client.ObjectKeyFromObject(nro), nro)).To(Succeed()) | |
| progressingCond := getConditionByType(nro.Status.Conditions, status.ConditionProgressing) | |
| Expect(progressingCond).ToNot(BeNil()) | |
| Expect(progressingCond.Status).To(Equal(metav1.ConditionTrue)) | |
| By("Verify pool2 DaemonSet exists even though pool1 is still pending") | |
| ds := &appsv1.DaemonSet{} | |
| mcp2DSKey := client.ObjectKey{ | |
| Name: objectnames.GetComponentName(nro.Name, mcp2.Name), | |
| Namespace: testNamespace, | |
| } | |
| Expect(reconciler.Client.Get(ctx, mcp2DSKey, ds)).To(Succeed()) | |
| By("Verify pool1 DaemonSet is not created yet") | |
| mcp1DSKeyPending := client.ObjectKey{ | |
| Name: objectnames.GetComponentName(nro.Name, mcp1.Name), | |
| Namespace: testNamespace, | |
| } | |
| err = reconciler.Client.Get(ctx, mcp1DSKeyPending, &appsv1.DaemonSet{}) | |
| Expect(apierrors.IsNotFound(err)).To(BeTrue(), "pool1 daemonset should still be pending") | |
| By("Verify global status is Progressing, not Available") | |
| Expect(reconciler.Client.Get(ctx, client.ObjectKeyFromObject(nro), nro)).To(Succeed()) | |
| progressingCond := getConditionByType(nro.Status.Conditions, status.ConditionProgressing) | |
| Expect(progressingCond).ToNot(BeNil()) | |
| Expect(progressingCond.Status).To(Equal(metav1.ConditionTrue)) | |
| availableCond := getConditionByType(nro.Status.Conditions, status.ConditionAvailable) | |
| Expect(availableCond).ToNot(BeNil()) | |
| Expect(availableCond.Status).To(Equal(metav1.ConditionFalse)) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/controller/numaresourcesoperator_controller_test.go` around lines
2285 - 2297, Add assertions that the pending tree's DaemonSet is still absent
and the global Available condition is false: attempt to Get the DaemonSet for
mcp1 (using objectnames.GetComponentName(nro.Name, mcp1.Name) and testNamespace)
and expect a NotFound error, then fetch nro and verify
getConditionByType(nro.Status.Conditions, status.ConditionAvailable) is not nil
and its Status equals metav1.ConditionFalse. This complements the existing
checks that pool2 DS exists and Progressing is true.
| if objState.Error != nil { | ||
| klog.Warningf("error loading object: %v", objState.Error) | ||
| } |
There was a problem hiding this comment.
Do not continue when objState.Error is set
At Line 822-824, logging and continuing can mask manifest/state construction failures and may cascade into invalid apply attempts. Fail fast here to avoid partial/inconsistent reconciliation.
Suggested fix
if objState.Error != nil {
- klog.Warningf("error loading object: %v", objState.Error)
+ return fmt.Errorf("failed to load object state: %w", objState.Error)
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if objState.Error != nil { | |
| klog.Warningf("error loading object: %v", objState.Error) | |
| } | |
| if objState.Error != nil { | |
| return fmt.Errorf("failed to load object state: %w", objState.Error) | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/controller/numaresourcesoperator_controller.go` around lines 822 -
824, The code currently only logs objState.Error and continues; change this to
fail fast: when objState.Error != nil, log the error with context and
immediately stop processing by returning or propagating the error from the
enclosing reconcile/processing function (do not continue to apply manifests).
Ensure you wrap or annotate objState.Error with a clear message (e.g., "failed
building manifest/state for <object>") before returning so callers can handle
requeueing or surface the failure.
| // Ongoing returns true if the reconciliation is still ongoing/progressing | ||
| func (rs Step) Ongoing() bool { | ||
| return rs.Result.RequeueAfter > 0 | ||
| } |
There was a problem hiding this comment.
Ongoing() misclassifies valid progressing steps
At Line 48, RequeueAfter > 0 treats StepOngoing(0) as not ongoing, even though it is explicitly a progressing step. This breaks per-tree aggregation semantics (progress counters/selection) for paused MCP flows.
Suggested fix
func (rs Step) Ongoing() bool {
- return rs.Result.RequeueAfter > 0
+ return rs.ConditionInfo.Type == status.ConditionProgressing && rs.Error == nil
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // Ongoing returns true if the reconciliation is still ongoing/progressing | |
| func (rs Step) Ongoing() bool { | |
| return rs.Result.RequeueAfter > 0 | |
| } | |
| // Ongoing returns true if the reconciliation is still ongoing/progressing | |
| func (rs Step) Ongoing() bool { | |
| return rs.ConditionInfo.Type == status.ConditionProgressing && rs.Error == nil | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/reconcile/step.go` around lines 46 - 49, The Ongoing() method
currently returns true only when Result.RequeueAfter > 0 which incorrectly
treats StepOngoing(0) as not ongoing; update Step.Ongoing() to treat zero as an
ongoing/ progressing value (e.g., return rs.Result.RequeueAfter >= 0) or
otherwise compare against the defined sentinel for "no requeue" so that
StepOngoing(0) is classified as ongoing; change the condition in the Ongoing()
implementation that references Result.RequeueAfter so zero is considered
progressing (and ensure this aligns with the StepOngoing(0) semantics).
ffromani
left a comment
There was a problem hiding this comment.
I acknowledge the review difficulties. Not sure how to help further unless we want to completely pivot direction and make a clean-as-possible main-only, minimal PR. That would make for a trivial backport down to 4.21 or so. Past that, it will both become hard and touch branches deeper and deeper in stable state. Which alternatives do we have?
| } | ||
|
|
||
| func (em *ExistingManifests) MachineConfigsState(mf Manifests) ([]MachineConfigObjectState, sets.Set[string]) { | ||
| var allRet []MachineConfigObjectState |
There was a problem hiding this comment.
I agree about improving, thing is this function want to compute the existing state of MachineConfig objects, and compute the desired state of MachineConfig, which are indeed related to MachineConfigPools, but the pool is the scoping mechanism.
IOW, we want to reconcile MachineConfigs, and return ObjectState about these.
| daemonSets: make(map[string]daemonSetManifest), | ||
| machineConfigs: make(map[string]machineConfigManifest), |
There was a problem hiding this comment.
Mostly, because I'm reusing ExistingManifests which predates the split and had 1:N relationship because the horizontal partitioning.
We can probably simplify further as followup of this PR
| return nropv1.MachineConfigPool{Name: name} | ||
| } | ||
|
|
||
| type treeReconcileResult struct { |
There was a problem hiding this comment.
this is meant to be a reconcile result of a single tree (therefore perTreeReconcileResult is probably better), but the problem here is we can't yet enforce the 1:1 mapping between trees and MCPs (https://github.com/openshift-kni/numaresources-operator/blob/main/internal/api/annotations/annotations.go#L25).
Well, there's an argument about doing that in 5.0 - we had 3 EUS version past 4.18 already, but then should be done as prerequisite of this PR, and would make backporting significantly more complex.
This is why I'm still carrying this forward and postponing to 5.2.
|
/test ci-e2e-install-hypershift |
implement "vertical split" vs current "horizontal split"
the split is about how the reconciliation makes progress. In the model we had till now, retroactively called "horizontal split", each NodeGroup must make progress on each reconciliation stage before any of these can progress to the next stage. IOW, the slowest NodeGroup blocks all the other NodeGroups at the stage it is stuck at.
With the approach proposed here, "vertical split", each NodeGroup advances independently on the reconciliation step, and only the global reported status is set pulled to the slowest state; IOW, with N nodegroups, N-1 can be Available and settled and 1 still Progressing; the global state will still be Progressing, but changes to the other NodeGroups can happen independently.
To simplify the implementation, a single goroutine handles all the NodeGroup reconciliation, so factually a NodeGroup can still influence the others, but this (bar more serious bugs we should fix anyway) can only create reconciliation delays, a far better stance than the issues caused by the horizontal split.
backportability
This PR is broken down in commits which are intentionally shaped to make the backport down to the distant 4.18 as low risk as possible.
Compatibility layers are added and then removed in the context of the same PR explicitly and only in order
to enable gradual testing, not because their own merit.
review notice
Adding and removing code in the same PR makes the review harder, even though the final tree state is identical.
It may be useful to review the final tree state (e.g. git diff ..HEAD) in addition to individual commits, since the end result is identical to a cleaner, main-only series.
The most independent commits are intentionally pushed towards the beginning of the series to enable this review mode as much as possible.
LLM-assisted backport effort summary
rte.godiverges more (after PR #3801 backport lands)MachineConfigsState,machineconfigpool.goheavily diverged (see below)4.18 effort breakdown
api/v1/helper/→api/numaresourcesoperator/v1/helper/step.goidentical;conditioninfo.gonearly identicalinternal/controller/→controllers/machineconfigpool.gois 134 lines (vs 244 on main), noMachineConfigObjectStatetype, noMachineConfigsStatemethod;rte.gois 436 lines (vs 379) with 288 diff linesmachineconfigpool.gobaseline4.21/4.20/4.19 prerequisite
These branches require PR #3801 ("OCPBUGS-84226: per-pool MachineConfig state
with paused MCP awareness") to be backported first. PR #4025 covers 4.21;
4.20 and 4.19 need equivalent backports. Once that prerequisite lands,
these branches have the same
MachineConfigObjectStatetype andMachineConfigsStatesignature as main, making the cherry-picksstraightforward.
LLM-assisted risk assessment
The proposed split should significantly reduce backport risk compared to
a simpler, monolithic approach:
cherry-picks on all branches. Done and verified quickly.
the objectstate layer but adds compat wrappers so the controller continues
working unchanged. Existing controller tests validate the compat wrappers
without any controller changes.
it is correct.
By this point all building blocks (1-7) are proven and verified.
narrowing, safe after commit 8 removed the old callers.
switching from horizontal to vertical orchestration.
The compat wrappers in commit 5 are the key risk reduction: they allow
validating the rte.go refactoring on any branch (including 4.18, the
hardest target) without touching the controller at all. Once that is
proven, the controller changes build on a solid foundation.