[manual] [release-4.21] OCPBUGS-85647: objectstate/rte: per-pool MachineConfig state with paused MCP awareness #4025

Open
Tal-or wants to merge 5 commits into openshift-kni:release-4.21 from Tal-or:4.21_cherry_pick_manual_OCPBUGS-84690

@Tal-or

@Tal-or Tal-or commented May 14, 2026

manual cherry-pick of #3843

Tal-or and others added 4 commits May 14, 2026 16:42
Previously, MachineConfigsState returned a single wait function for all
pools - either IsMachineConfigPoolUpdated or
IsMachineConfigPoolUpdatedAfterDeletion - chosen globally based on
whether any pool had custom SELinux policy enabled. This broke mixed
configurations where some pools use custom policy and others use the
built-in default.

Each MachineConfigObjectState now carries its own WaitForUpdated
function and pool name. The controller builds a per-pool wait map so
each pool is checked with the correct wait logic independently.

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
Co-Authored-By: Francesco Romani <fromani@redhat.com>
AIA: Primarily AI, New content, Human-initiated, Reviewed, Claude Opus 4.6 v1.0
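
A minimal sketch of the per-pool wait selection described in this commit message, not the actual implementation. `Pool`, `NodeGroup`, `PoolState`, `isUpdated` and `isUpdatedAfterDeletion` are hypothetical stand-ins (the last two for IsMachineConfigPoolUpdated and IsMachineConfigPoolUpdatedAfterDeletion), and the mapping "custom policy waits for rollout, built-in default waits for deletion" is an assumption drawn from the description above.

```go
package main

import "fmt"

// Pool is a hypothetical, trimmed-down stand-in for a MachineConfigPool.
type Pool struct {
	Name          string
	Updated       bool // mirrors the UPDATED condition
	Updating      bool // mirrors the UPDATING condition
	HasOperatorMC bool // assumption: pool still renders the operator's SELinux MachineConfig
}

// WaitFunc reports whether a pool has reached the state the controller expects.
type WaitFunc func(Pool) bool

// isUpdated stands in for IsMachineConfigPoolUpdated (semantics assumed).
func isUpdated(p Pool) bool {
	return p.Updated && !p.Updating && p.HasOperatorMC
}

// isUpdatedAfterDeletion stands in for IsMachineConfigPoolUpdatedAfterDeletion (semantics assumed).
func isUpdatedAfterDeletion(p Pool) bool {
	return p.Updated && !p.Updating && !p.HasOperatorMC
}

// NodeGroup is a hypothetical view of one node group and its SELinux policy mode.
type NodeGroup struct {
	PoolName     string
	CustomPolicy bool
}

// PoolState mimics the idea of a per-pool object state carrying its own wait function.
type PoolState struct {
	PoolName       string
	WaitForUpdated WaitFunc
}

// machineConfigsState picks the wait function per pool instead of once globally.
func machineConfigsState(groups []NodeGroup) []PoolState {
	states := make([]PoolState, 0, len(groups))
	for _, g := range groups {
		wait := WaitFunc(isUpdatedAfterDeletion)
		if g.CustomPolicy {
			wait = isUpdated
		}
		states = append(states, PoolState{PoolName: g.PoolName, WaitForUpdated: wait})
	}
	return states
}

// buildWaitMap indexes the wait functions by pool name for the controller loop.
func buildWaitMap(states []PoolState) map[string]WaitFunc {
	waits := make(map[string]WaitFunc, len(states))
	for _, s := range states {
		waits[s.PoolName] = s.WaitForUpdated
	}
	return waits
}

func main() {
	groups := []NodeGroup{
		{PoolName: "worker-custom", CustomPolicy: true},
		{PoolName: "worker-default", CustomPolicy: false},
	}
	waits := buildWaitMap(machineConfigsState(groups))

	pools := []Pool{
		{Name: "worker-custom", Updated: true, HasOperatorMC: true},
		{Name: "worker-default", Updated: true, HasOperatorMC: false},
	}
	for _, p := range pools {
		fmt.Printf("%s ready=%v\n", p.Name, waits[p.Name](p))
	}
}
```

With a single global wait function, one of the two pools in `main` could never be reported as ready; keying the map by pool name lets the mixed configuration converge.
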
When an MCP has spec.paused=true, MCO will not apply pending
MachineConfig changes to its nodes. This leaves the MCP in an
UPDATING=true state indefinitely. The controller expects
UPDATING=false and UPDATED=true before proceeding, so it keeps
requeueing — leaving RTE DaemonSets in a half-baked state where
NROP never finishes configuring them.

This is especially critical during 4.16 → 4.18 upgrades: the
operator deletes the MachineConfig that provided the old SELinux
policy, which triggers MCO to roll out the change. On a paused
MCP that rollout never starts, so UPDATED stays false and the
controller requeues forever.

Skip paused MCPs so the operator can converge for all non-paused
pools and surface the paused pool names for status reporting.

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
Co-Authored-By: Francesco Romani <fromani@redhat.com>
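
The skip logic from this commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the operator's code: `Pool` and `checkPools` are hypothetical names, and the readiness check is reduced to UPDATED=true and UPDATING=false as described above.

```go
package main

import (
	"fmt"
	"sort"
)

// Pool is a hypothetical, trimmed-down stand-in for a MachineConfigPool.
type Pool struct {
	Name     string
	Paused   bool // mirrors spec.paused
	Updated  bool // mirrors the UPDATED condition
	Updating bool // mirrors the UPDATING condition
}

// checkPools reports whether every non-paused pool is ready and returns the
// sorted names of the paused pools so they can be surfaced in status.
func checkPools(pools []Pool) (allReady bool, paused []string) {
	allReady = true
	for _, p := range pools {
		if p.Paused {
			// MCO will not roll out changes here, so waiting would requeue forever.
			paused = append(paused, p.Name)
			continue
		}
		if !p.Updated || p.Updating {
			allReady = false
		}
	}
	sort.Strings(paused)
	return allReady, paused
}

func main() {
	pools := []Pool{
		{Name: "worker-a", Updated: true},
		{Name: "worker-b", Paused: true, Updating: true},
	}
	ready, paused := checkPools(pools)
	fmt.Printf("non-paused pools ready=%v, paused pools=%v\n", ready, paused)
}
```

The paused pool is left out of the readiness decision but still reported, which matches the intent of converging on the non-paused pools while keeping the paused ones visible.
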
When an MCP is paused, MCO will not apply the custom SELinux policy
MachineConfig to its nodes. Without this policy, RTE pods get stuck
forever trying to connect to kubelet's podresources socket — blocked
by SELinux AVC denials (container_device_plugin_t denied write on
container_var_lib_t sock_file). The operator keeps reporting
Progressing/DaemonSetIsUpdating with no indication of root cause.

Surface paused MCP state as a dedicated operator condition so users
can identify the problem directly from the CR status. Backfill the
condition on upgrade from older versions that lack it.

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
Co-Authored-By: Shereen Haj <shajmakh@redhat.com>
AIA: Primarily AI, New content, Human-initiated, Reviewed, Claude Opus 4.6 v1.0
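
One way to picture the dedicated condition and its backfill is the sketch below. Everything here is an assumption for illustration: the `Condition` struct is a local simplification of a Kubernetes status condition, and the condition type `MachineConfigPoolPaused` and the reason strings are hypothetical, not the names used by the operator.

```go
package main

import (
	"fmt"
	"strings"
)

// Condition is a simplified, local stand-in for a Kubernetes status condition.
type Condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// condMCPPaused is a hypothetical condition type name.
const condMCPPaused = "MachineConfigPoolPaused"

// findCondition returns a pointer to the condition of the given type, or nil.
func findCondition(conds []Condition, condType string) *Condition {
	for i := range conds {
		if conds[i].Type == condType {
			return &conds[i]
		}
	}
	return nil
}

// setPausedCondition updates the paused-pool condition in place, or appends it
// when the status was written by an older version that lacks it (backfill).
func setPausedCondition(conds []Condition, paused []string) []Condition {
	c := Condition{Type: condMCPPaused, Status: "False", Reason: "NoPausedPools"}
	if len(paused) > 0 {
		c.Status = "True"
		c.Reason = "PoolsPaused"
		c.Message = "paused MachineConfigPools: " + strings.Join(paused, ", ")
	}
	if existing := findCondition(conds, condMCPPaused); existing != nil {
		*existing = c
		return conds
	}
	return append(conds, c)
}

func main() {
	// Status as an older version might have left it: no paused-pool condition.
	conds := []Condition{{Type: "Progressing", Status: "True", Reason: "DaemonSetIsUpdating"}}
	conds = setPausedCondition(conds, []string{"worker-b"})
	for _, c := range conds {
		fmt.Printf("%s=%s (%s) %s\n", c.Type, c.Status, c.Reason, c.Message)
	}
}
```

Keeping the paused state in its own condition leaves Progressing untouched while giving users a direct pointer to the pool that blocks the rollout.
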
Add unit tests for MachineConfigsState covering custom/default/mixed
policies, paused pools, and edge cases. Add e2e test for mixed SELinux
policy across node groups.

Co-Authored-By: Francesco Romani <fromani@redhat.com>
AIA: Primarily AI, New content, Human-initiated, Reviewed, Claude Opus 4.6 v1.0
Signed-off-by: Talor Itzhak <titzhak@redhat.com>
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) May 14, 2026
@openshift-ci-robot

@Tal-or: This pull request references Jira Issue OCPBUGS-85647, which is invalid:

  • expected dependent Jira Issue OCPBUGS-84690 to be in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA), but it is ON_QA instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

@coderabbitai

coderabbitai Bot commented May 14, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

@openshift-ci openshift-ci Bot requested review from mrniranjan and shajmakh May 14, 2026 14:21
@openshift-ci

openshift-ci Bot commented May 14, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Tal-or

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) May 14, 2026
@ffromani

When the PR becomes final and ready for review, please make sure to remove the Co-authored-by tags and use AI-attribution tags instead (https://aiattribution.github.io/create-attribution)

@Tal-or Tal-or changed the title from "WIP: [manual] [release-4.21] OCPBUGS-85647: objectstate/rte: per-pool MachineConfig state with paused MCP awareness" to "[manual] [release-4.21] OCPBUGS-85647: objectstate/rte: per-pool MachineConfig state with paused MCP awareness" May 18, 2026
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) May 18, 2026
- Replace DefaultBaseConditions with newBaseConditions (unexported in 4.21)
- Replace metahelper.FindStatusCondition with FindCondition (4.21 local helper)
- Add isNROOperSyncedAt helper using 4.21 status.FindCondition
- Fix errors import shadowing (alias k8s apierrors, keep stdlib errors)

AIA: Primarily AI, New content, Human-initiated, Reviewed, Claude Opus 4.6 v1.0
Signed-off-by: Talor Itzhak <titzhak@redhat.com>
@Tal-or Tal-or force-pushed the 4.21_cherry_pick_manual_OCPBUGS-84690 branch from f236a87 to d50b242 Compare May 18, 2026 06:43
@Tal-or

Tal-or commented May 18, 2026

/retest
