WIP: OCPBUGS-61756: ctrl: kubeletConfig: avoid updating ConfigMap on Paused MCP#2261
WIP: OCPBUGS-61756: ctrl: kubeletConfig: avoid updating ConfigMap on Paused MCP#2261shajmakh wants to merge 5 commits into
Conversation
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: shajmakh The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
92eaeed to
80910ba
Compare
80910ba to
29846b9
Compare
|
/hold |
Tal-or
left a comment
There was a problem hiding this comment.
Good change overall.
In case the MCP is paused, the final update to the configmap will be triggered when the MCP changes back to unpause?
29846b9 to
1e86d9a
Compare
|
/retest |
|
relies on #2324 |
|
depends on #2426 |
b413287 to
e133797
Compare
|
@shajmakh: This pull request references Jira Issue OCPBUGS-61756, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
|
||
| func (em *ExistingManifests) MachineConfigsState(mf Manifests) ([]objectstate.ObjectState, MCPWaitForUpdatedFunc) { | ||
| func (em *ExistingManifests) MachineConfigsState(mf Manifests) ([]objectstate.ObjectState, MCPWaitForUpdatedFunc, sets.Set[string]) { | ||
| pausedMCPs := sets.New[string]() |
There was a problem hiding this comment.
we may want to call this "mcpsToSkip" to include cases where the MCP is empty (solves https://issues.redhat.com/browse/OCPBUGS-52859)
There was a problem hiding this comment.
tested locally and it made no difference to include empty MCPs, besides the bug no longer reproduces.
I don't tend to include the empty MCPs in this patch given this data, but if this bug is to hit again, the fix should be straightforward to include the zero-machine-count MCPs as followes:
// those are either paused MCPs or empty MCPs (no machine count)
ignoredMCPs := sets.New[string]()
var ret []objectstate.ObjectState
if mf.Core.MachineConfig == nil {
return ret, nullMachineConfigPoolUpdated, ignoredMCPs
}
enabledMCCount := 0
for _, tree := range em.trees {
for _, mcp := range tree.MachineConfigPools {
if mcp.Spec.Paused || mcp.Status.MachineCount == 0 {
klog.Warningf("the machine config pool %q is paused or have zero machine count", mcp.Name)
ignoredMCPs.Insert(mcp.Name)
continue
}
e133797 to
1b503cf
Compare
|
/retest |
e25bcee to
f5ad21e
Compare
|
/unhold |
f5ad21e to
5e477f1
Compare
c129242 to
22533cb
Compare
|
/retest |
This is a preliminary step to expand the scenarios where the reconciliation can return more possible errors where some are tolerable. Signed-off-by: Shereen Haj <shajmakh@redhat.com>
Having a paused MCP should prevent updating the corresponding config map for the specified node group. So far, the code wasn't considering the case of paused MCPs, which lead to creating/updating the config map to the newest kubeletconfig CR updates,a thing that caused a mismatch between the configuration in the config map vs the one reflected on the NRTs. In this commit, we modify the kubeletconfig controller to handle paused MCPs such that it skips updating existing RTE config maps; and for new node groups whose MCP is paused, the controller will fetch the old machineConfig (before the pause) and creates RTE config map based on the decoded kubeletconfig data from it. Signed-off-by: Shereen Haj <shajmakh@redhat.com>
We want to know when a paused MCP goes unpaused and do the needed updates to the config map when that happens. In previous commit, when a paused MCP was detected, the reconcile would requeue after some time; that was intended to catch the unpause operation eventually. In this commit, we don't requeue on detected paused MCP but rather watch for pause field updates on targeted MCPs. Signed-off-by: Shereen Haj <shajmakh@redhat.com>
The situtaion is mainly mitigated in the kubeletconfig controller, however we have a bug in the NRO controller such that it puts the NRO CR in progressing state because paused MCP is not in updated condition. This commit ignores checking whether a paused MCP is up-to-date, and introduces a new status condition to report paused MCPs if exist. Signed-off-by: Shereen Haj <shajmakh@redhat.com>
Node group can be set with either MCPselector or a poolName. Validation of the nodegroup configuraion is checked in the reconcile logic, and because poolName is set and MCPSelector is not a valid nodeGroup we need to consider this and reconcile. This was missed when nodeGroup.PoolName was first introduced and didn't cause any observed reconcile failures because: 1. any change on the nodeGroups would anyway trigger reconciliation 2. so far the MCP related updates were caused by Kubeletconfig updates which wasn't blocked until the paused-MCP case got handled (which is why this appeared now). Signed-off-by: Shereen Haj <shajmakh@redhat.com>
|
/retest |
1 similar comment
|
/retest |
|
@shajmakh: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
I apologize for missing this question; yes that is the correct behavior, and is covered in the controller tests. |
ffromani
left a comment
There was a problem hiding this comment.
the direction seems fine, I have implementation comments.
The last commit would need another review round from me, need more time to grasp it.
| setCtrlRef func(owner, controlled metav1.Object, scheme *runtime.Scheme, opts ...controllerutil.OwnerReferenceOption) error | ||
| } | ||
|
|
||
| type reconcileErrorHandler struct { |
There was a problem hiding this comment.
this is not a handler. a Handler manages something. This seems more a reconcileResult
There was a problem hiding this comment.
can we reuse a reconcile.Step, extend it, or even create a variant of?
| } | ||
|
|
||
| if mcp.Spec.Paused { | ||
| klog.InfoS("detected paused MCP", "name", mcp.Name) |
There was a problem hiding this comment.
this will be printed at every reconciliation step it seems. Let's make it V(4)
| } | ||
| } | ||
|
|
||
| // use local version of github.com/openshift/machine-config-operator/pkg/controller/common.ParseAndConvertConfig |
There was a problem hiding this comment.
this is just narrating in english what the code does. Either give context about why or remove the comment
| }, reconcileErrorHandler{result: ctrl.Result{Requeue: true, RequeueAfter: MachineConfigPoolPausedRetryPeriod}} | ||
| } | ||
|
|
||
| klog.InfoS("MachineConfigPool of KubeletConfig %s is paused and configMap %s exists", kcKey.Name, existingCM.Name) |
There was a problem hiding this comment.
please check verbosity to avoid log spam
| Namespace: testNamespace, | ||
| Name: objectnames.GetComponentName(nro.Name, poolName2), | ||
| } | ||
| Expect(reconciler.Client.Get(context.TODO(), key, cm)).To(HaveOccurred()) |
| fakeRecorder, ok := reconciler.Recorder.(*record.FakeRecorder) | ||
| Expect(ok).To(BeTrue()) | ||
| event := <-fakeRecorder.Events | ||
| Expect(event).To(ContainSubstring("ProcessOK")) | ||
| Expect(event).To(ContainSubstring("Updated RTE config")) |
| fakeRecorder, ok := reconciler.Recorder.(*record.FakeRecorder) | ||
| Expect(ok).To(BeTrue()) | ||
| event := <-fakeRecorder.Events | ||
| Expect(event).To(ContainSubstring("ProcessOK")) | ||
| Expect(event).To(ContainSubstring(mcoKCPaused.Name)) |
| return requests | ||
| } | ||
|
|
||
| func (r *KubeletConfigReconciler) nodeGroupToMachineConfigPool(ctx context.Context, object client.Object) []reconcile.Request { |
|
|
||
| ngstatus := nro.Status.NodeGroups | ||
| targetMCPs := sets.New[string]() | ||
| for _, ngstatus := range ngstatus { |
There was a problem hiding this comment.
please don't use a loop variable named like the collection it iterates on
There was a problem hiding this comment.
that I missed. Thanks for catching.
| klog.ErrorS(err, "failed to list KubeletConfigs %v") | ||
| } | ||
| //map mcp to kubeletconfig requests | ||
| for _, mcpName := range targetMCPs.UnsortedList() { |
There was a problem hiding this comment.
can't we just do
for _, ngstatus := range nro.Status.NodeGroups {
mcpObj, ok := mcpMap[ngstatus.PoolName]
if !ok {
continue
}
mcpLabels := labels.Set(mcpObj.Labels)
...
Tal-or
left a comment
There was a problem hiding this comment.
Some comments inside.
One thing that bother me the most is the tolerateError concept.
if it's tolerable, I wouldn't consider it an error.
can't we simply log the error in place, and Requeue the request as we usually do?
|
|
||
| r.Recorder.Event(instance, "Normal", "ProcessOK", fmt.Sprintf("Updated RTE config %s/%s from kubelet config %s", cm.Namespace, cm.Name, req.NamespacedName.String())) | ||
| return ctrl.Result{}, nil | ||
| //return ctrl.Result{}, nil |
| if err != nil { | ||
| return nil, reconcileErrorHandler{err: err} | ||
| } | ||
| return cm, errHandler // FIXME use predicate |
There was a problem hiding this comment.
Is there a reason why not to implement predicate in this PR already? it's not suppose to be lot of code, so I guess we can squeeze it here as another small commit.
| if !apierrors.IsNotFound(err) { | ||
| return nil, reconcileErrorHandler{ | ||
| err: err, | ||
| tolerateError: true, |
There was a problem hiding this comment.
If we need to support the operation even when MCP are paused, why these errors are tolerate?
Probably I picked the wrong name for the struct and the fields inside. Introducing the new struct was meant to hold the reconciliation result and interpret it as an event which can be successful complete reconciliation, failure, or skipped reconciliation due to an unblocking error. |
|
PR needs rebase. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Having a paused MCP should prevent updating/creating the corresponding
config map for the specified node group. So far, the code wasn't
considering the case of paused MCPs, which leads to creating/updating the config map
a thing that caused a mismatch between the configuration in the config
map and the one reflected on the NRTs.
In this commit, we modify the kubeletconfig controller to skip on
updating RTE config maps that belong to paused MCPs.