[manual] [release-4.21] OCPBUGS-85647: objectstate/rte: per-pool MachineConfig state with paused MCP awareness#4025
Previously, MachineConfigsState returned a single wait function for all pools - either IsMachineConfigPoolUpdated or IsMachineConfigPoolUpdatedAfterDeletion - chosen globally based on whether any pool had custom SELinux policy enabled. This broke mixed configurations where some pools use custom policy and others use the built-in default.

Each MachineConfigObjectState now carries its own WaitForUpdated function and pool name. The controller builds a per-pool wait map so each pool is checked with the correct wait logic independently.

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
Co-Authored-By: Francesco Romani <fromani@redhat.com>
AIA: Primarily AI, New content, Human-initiated, Reviewed, Claude Opus 4.6 v1.0
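The per-pool wait map described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the operator's code: the `Pool` type, the two stand-in wait functions, `buildWaitMap`, `notUpdated`, and the pairing of each wait function with a policy type are all assumptions made for the example. Only the `MachineConfigObjectState` idea (a pool name plus its own `WaitForUpdated` function) comes from the commit message.

```go
package main

import (
	"fmt"
	"sort"
)

// Pool is a hypothetical, trimmed-down view of a MachineConfigPool.
type Pool struct {
	Name     string
	Updated  bool
	Updating bool
	Sources  []string // names of MachineConfigs rendered into the pool (assumed field)
}

// WaitFunc reports whether a pool has converged.
type WaitFunc func(p Pool) bool

// MachineConfigObjectState carries the pool name and its own wait function.
// All other fields of the real type are omitted here.
type MachineConfigObjectState struct {
	PoolName       string
	WaitForUpdated WaitFunc
}

// waitUpdated is a stand-in check: the pool is updated and not updating.
func waitUpdated(p Pool) bool { return p.Updated && !p.Updating }

// waitUpdatedAfterDeletion is a stand-in check for a pool whose MachineConfig
// was deleted: the pool must no longer reference it and must be updated.
func waitUpdatedAfterDeletion(deleted string) WaitFunc {
	return func(p Pool) bool {
		for _, s := range p.Sources {
			if s == deleted {
				return false
			}
		}
		return waitUpdated(p)
	}
}

// buildWaitMap indexes each state's wait function by pool name.
func buildWaitMap(states []MachineConfigObjectState) map[string]WaitFunc {
	waits := make(map[string]WaitFunc, len(states))
	for _, st := range states {
		waits[st.PoolName] = st.WaitForUpdated
	}
	return waits
}

// notUpdated returns the sorted names of pools whose own wait function
// reports that they have not converged. Pools without an entry are ignored.
func notUpdated(pools []Pool, waits map[string]WaitFunc) []string {
	var pending []string
	for _, p := range pools {
		wait, ok := waits[p.Name]
		if !ok {
			continue
		}
		if !wait(p) {
			pending = append(pending, p.Name)
		}
	}
	sort.Strings(pending)
	return pending
}

func main() {
	states := []MachineConfigObjectState{
		{PoolName: "worker-custom", WaitForUpdated: waitUpdated},
		{PoolName: "worker-default", WaitForUpdated: waitUpdatedAfterDeletion("selinux-policy-mc")},
	}
	pools := []Pool{
		{Name: "worker-custom", Updated: true},
		{Name: "worker-default", Updated: true, Sources: []string{"selinux-policy-mc"}},
	}
	fmt.Println("still waiting on:", notUpdated(pools, buildWaitMap(states)))
}
```

With a single global wait function, both pools in `main` would be judged by the same rule. With the map, each pool is judged by the rule attached to its own state.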
When an MCP has spec.paused=true, MCO will not apply pending MachineConfig changes to its nodes. This leaves the MCP in an UPDATING=true state indefinitely. The controller expects UPDATING=false and UPDATED=true before proceeding, so it keeps requeueing, leaving RTE DaemonSets in a half-baked state where NROP never finishes configuring them.

This is especially critical during 4.16 → 4.18 upgrades: the operator deletes the MachineConfig that provided the old SELinux policy, which triggers MCO to roll out the change. On a paused MCP that rollout never starts, so UPDATED stays false and the controller requeues forever.

Skip paused MCPs so the operator can converge for all non-paused pools, and surface the paused pool names for status reporting.

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
Co-Authored-By: Francesco Romani <fromani@redhat.com>
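The skip logic described above might look roughly like the sketch below. It assumes a simplified `Pool` struct with boolean fields in place of the real MachineConfigPool spec and conditions, and `partitionPools` is a hypothetical helper name, not a function from the operator.

```go
package main

import "fmt"

// Pool is a hypothetical, simplified stand-in for a MachineConfigPool:
// Paused mirrors spec.paused, Updated/Updating mirror the status conditions.
type Pool struct {
	Name     string
	Paused   bool
	Updated  bool
	Updating bool
}

// partitionPools separates paused pools from pools that are still rolling out.
// Paused pools never count as pending, so they cannot block reconciliation;
// their names are returned separately so they can be reported in status.
func partitionPools(pools []Pool) (pending, paused []string) {
	for _, p := range pools {
		if p.Paused {
			paused = append(paused, p.Name)
			continue
		}
		if p.Updating || !p.Updated {
			pending = append(pending, p.Name)
		}
	}
	return pending, paused
}

func main() {
	pools := []Pool{
		{Name: "worker", Updated: true},
		{Name: "worker-paused", Paused: true, Updating: true},
	}
	pending, paused := partitionPools(pools)
	// Requeue only while a non-paused pool is still rolling out.
	fmt.Printf("requeue=%v paused=%v\n", len(pending) > 0, paused)
}
```

In this sketch the paused pool is stuck in UPDATING=true, yet it does not force a requeue because it is filtered out before the readiness check.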
When an MCP is paused, MCO will not apply the custom SELinux policy MachineConfig to its nodes. Without this policy, RTE pods get stuck forever trying to connect to kubelet's podresources socket, blocked by SELinux AVC denials (container_device_plugin_t denied write on container_var_lib_t sock_file). The operator keeps reporting Progressing/DaemonSetIsUpdating with no indication of root cause.

Surface paused MCP state as a dedicated operator condition so users can identify the problem directly from the CR status. Backfill the condition on upgrade from older versions that lack it.

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
Co-Authored-By: Shereen Haj <shajmakh@redhat.com>
AIA: Primarily AI, New content, Human-initiated, Reviewed, Claude Opus 4.6 v1.0
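As an illustration of surfacing and backfilling such a condition, the sketch below uses a local `Condition` struct instead of the Kubernetes `metav1.Condition` type so that it stays self-contained. The condition type name, the reasons, the message format, and the helper names `pausedCondition` and `setCondition` are all hypothetical. The actual names used by the operator are not given in this conversation.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Condition is a simplified, local stand-in for a status condition.
type Condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// condPausedPools is a hypothetical condition type name.
const condPausedPools = "MachineConfigPoolPaused"

// pausedCondition builds the condition from the list of paused pool names.
func pausedCondition(paused []string) Condition {
	if len(paused) == 0 {
		return Condition{Type: condPausedPools, Status: "False", Reason: "NoPausedPools"}
	}
	names := append([]string(nil), paused...) // copy so the caller's slice is not reordered
	sort.Strings(names)
	return Condition{
		Type:    condPausedPools,
		Status:  "True",
		Reason:  "PoolsPaused",
		Message: "paused MachineConfigPools: " + strings.Join(names, ", "),
	}
}

// setCondition replaces the condition of the same type, or appends it when
// the status was written by an older version that lacks it (the backfill case).
func setCondition(conds []Condition, c Condition) []Condition {
	for i := range conds {
		if conds[i].Type == c.Type {
			conds[i] = c
			return conds
		}
	}
	return append(conds, c)
}

func main() {
	// A status as an older version might have left it: no paused-pool condition yet.
	conds := []Condition{
		{Type: "Available", Status: "False"},
		{Type: "Progressing", Status: "True", Reason: "DaemonSetIsUpdating"},
	}
	conds = setCondition(conds, pausedCondition([]string{"worker-b", "worker-a"}))
	for _, c := range conds {
		fmt.Printf("%s=%s %s %s\n", c.Type, c.Status, c.Reason, c.Message)
	}
}
```

Keeping the paused state in its own condition means Progressing/DaemonSetIsUpdating can stay as it is, while the new condition names the pools responsible.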
Add unit tests for MachineConfigsState covering custom/default/mixed policies, paused pools, and edge cases. Add e2e test for mixed SELinux policy across node groups.

Co-Authored-By: Francesco Romani <fromani@redhat.com>
AIA: Primarily AI, New content, Human-initiated, Reviewed, Claude Opus 4.6 v1.0
Signed-off-by: Talor Itzhak <titzhak@redhat.com>
@Tal-or: This pull request references Jira Issue OCPBUGS-85647, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker.
Review skipped: auto reviews are disabled on base/target branches other than the default branch.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Tal-or
When the PR becomes final and ready for review, please make sure to remove
- Replace DefaultBaseConditions with newBaseConditions (unexported in 4.21)
- Replace metahelper.FindStatusCondition with FindCondition (4.21 local helper)
- Add isNROOperSyncedAt helper using 4.21 status.FindCondition
- Fix errors import shadowing (alias k8s apierrors, keep stdlib errors)

AIA: Primarily AI, New content, Human-initiated, Reviewed, Claude Opus 4.6 v1.0
Signed-off-by: Talor Itzhak <titzhak@redhat.com>
f236a87 to d50b242
/retest
manual cherry-pick of #3843