Skip to content

CNF-23818: machineconfigpools: avoid RTE pods duplication when single node matching multiple MCPs#3995

Open
Tal-or wants to merge 6 commits into
openshift-kni:mainfrom
Tal-or:primary_pool_selection
Open

CNF-23818: machineconfigpools: avoid RTE pods duplication when single node matching multiple MCPs#3995
Tal-or wants to merge 6 commits into
openshift-kni:mainfrom
Tal-or:primary_pool_selection

Conversation

@Tal-or
Copy link
Copy Markdown
Collaborator

@Tal-or Tal-or commented May 12, 2026

Problem Statement
When a node's labels match multiple MachineConfigPools (e.g. a custom pool that extends the worker role), the current DaemonSet's NodeSelector — which copies the MCP's own nodeSelector — causes multiple RTE DaemonSets to target the same node, resulting in duplicate pods.

Root Cause
The MCP nodeSelectors are inherently overlapping: for example, a node with both node-role.kubernetes.io/worker and node-role.kubernetes.io/workerB labels matches the worker MCP and the workerB MCP simultaneously. Using those selectors directly as DaemonSet's NodeSelectors guarantees
collisions.

Suggested Approach
Place a single, NRO-owned label on each node that encodes its primary pool. The DaemonSet's NodeSelector then targets this label instead of the MCP's nodeSelector, making placement 1:1 by construction.

We reuse MCO's own priority logic (master > custom > worker) to determine each node's primary pool, so the behavior is consistent with how MCO itself decides under what MCP the node belongs.

NOTE:
This is not relevant for HyperShift since on HyperShift we're setting the DaemonSet's NodeSelector as the NodePool name which are exclusive for each node, meaning a node can't have more that a single NodePool label.

AIA Human-AI blend, New content, Human-initiated, Reviewed, Claude Opus 4.6 v1.0
Signed-off-by: Talor Itzhak titzhak@redhat.com

@openshift-ci openshift-ci Bot requested review from shajmakh and swatisehgal May 12, 2026 07:50
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 12, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Tal-or

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 12, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 12, 2026

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

This PR introduces a primary pool labeling system for nodes and integrates it into the NUMA resources operator controller. Nodes receive a label identifying their primary MachineConfigPool using MCO priority rules, DaemonSet selectors migrate to use that label, and the controller watches nodes to maintain labels during reconciliation while cleaning them up on CR deletion.

Changes

Node Primary Pool Labeling System

Layer / File(s) Summary
Label key definitions
internal/api/labels/labels.go
Exports constants NodePrimaryPool and NodeFinalizer used throughout the labeling and finalizer system.
MachineConfigPool classification library
pkg/machineconfigpools/pools.go, pkg/machineconfigpools/pools_test.go
New package exports functions to classify nodes against MachineConfigPools using label selectors and MCO priority rules (master > custom > worker). Tests cover pool matching, priority precedence, multiple custom pool rejection, and edge cases.
Node labeling implementation
internal/nodes/labels.go, internal/nodes/labels_test.go
Implements LabelForTrees to label nodes based on their primary MachineConfigPool and remove stale labels, plus RemoveAllLabels for finalizer cleanup. Tests verify primary pool selection, stale label removal, unmanaged pool filtering, platform-specific behavior (HyperShift no-op), and idempotence.
DaemonSet pod selector migration
pkg/objectstate/rte/machineconfigpool.go
DaemonSet pod nodeSelector now unconditionally uses the primary pool label instead of copying MCP node selector, removing dependency on MCP having a valid selector.
Controller node watching and reconciliation
internal/controller/numaresourcesoperator_controller.go, config/rbac/role.yaml
Controller watches Node resources and triggers reconciliation on node label changes. During reconciliation, nodes are labeled via LabelForTrees; during CR deletion, the finalizer handler calls RemoveAllLabels. RBAC updated to grant node get/list/watch/patch permissions.
E2E assertions
test/e2e/install/install_test.go
Adds OpenShift-only e2e checks validating the DaemonSet NodeSelector uses the primary-pool label, labeled nodes exist and have non-empty values, and labels are removed after operator deletion; includes helper to list nodes by the label.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 21.43% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main objective: avoiding RTE pod duplication when nodes match multiple MCPs by using primary pool labeling.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The pull request description clearly relates to the changeset, explaining the problem of duplicate RTE pods when nodes match multiple MachineConfigPools and detailing the proposed solution of adding NRO-owned labels to encode primary pool membership.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
internal/controller/numaresourcesoperator_controller_test.go (1)

77-97: ⚡ Quick win

Consider enhancing the fake-node helper to handle NodeSelector.MatchExpressions as a defensive measure.

The helper currently only copies MatchLabels from MCPs. While all current test fixtures use only MatchLabels, the function would miss MatchExpressions if they're added in the future, potentially making tests brittle. Adding support for both selector types would make the helper more resilient to changes in MCP definitions.

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/controller/numaresourcesoperator_controller.go`:
- Around line 867-883: In filterTreesForPrimaryPools, avoid calling
mcpools.HasNodesForPool repeatedly for the same MachineConfigPool across trees:
precompute a map[string]bool (and map[string]error if you want to preserve error
logging) by iterating all unique MCPs from the input (use mcp.Name as the key)
and calling HasNodesForPool once per unique pool, then in the existing loop over
tree.MachineConfigPools consult that map to decide whether to append or log;
keep the current error logging behavior (klog.V(...).InfoS) when the precomputed
call returned an error and reuse the stored bool result for has/no-has decisions
instead of re-invoking HasNodesForPool.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 55d876f5-586b-48be-b43d-bc26fe315d2b

📥 Commits

Reviewing files that changed from the base of the PR and between bd67f71 and 9b1247f.

📒 Files selected for processing (4)
  • internal/controller/numaresourcesoperator_controller.go
  • internal/controller/numaresourcesoperator_controller_test.go
  • pkg/machineconfigpools/pools.go
  • pkg/machineconfigpools/pools_test.go

Comment thread internal/controller/numaresourcesoperator_controller.go Outdated
Copy link
Copy Markdown
Member

@ffromani ffromani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

skimmed over it, it seems a good idea worth looking into

Comment thread pkg/machineconfigpools/pools.go
Comment thread internal/controller/numaresourcesoperator_controller_test.go Outdated
@Tal-or Tal-or changed the title machineconfigpools: avoid RTE pods duplication when single node matching multiple MCPs CNF-23818: machineconfigpools: avoid RTE pods duplication when single node matching multiple MCPs May 14, 2026
@openshift-ci-robot
Copy link
Copy Markdown

openshift-ci-robot commented May 14, 2026

@Tal-or: This pull request references CNF-23818 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the ticket to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Add a new package that replicates the Machine Config Operator's logic for
determining the primary MachineConfigPool for a given node. MCO uses a
priority-based algorithm (master > custom > worker) to assign exactly one
primary pool per node, even when a node's labels match multiple MCP selectors.

This is important to avoid RTE pods duplication when the node has labels that are matching multiple MCPs.

AIA Human-AI blend, New content, Human-initiated, Reviewed, Claude Opus 4.6 v1.0
Signed-off-by: Talor Itzhak titzhak@redhat.com

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@Tal-or
Copy link
Copy Markdown
Collaborator Author

Tal-or commented May 19, 2026

/retest

@Tal-or
Copy link
Copy Markdown
Collaborator Author

Tal-or commented May 28, 2026

/hold

This requires more thinking, the existing approach is not good enough

@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 28, 2026
@Tal-or Tal-or force-pushed the primary_pool_selection branch from 9b1247f to 64907fe Compare May 31, 2026 16:22
@openshift-ci-robot
Copy link
Copy Markdown

openshift-ci-robot commented May 31, 2026

@Tal-or: This pull request references CNF-23818 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the ticket to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Problem Statement
When a node's labels match multiple MachineConfigPools (e.g. a custom pool that extends the worker role), the current DaemonSet's NodeSelector — which copies the MCP's own nodeSelector — causes multiple RTE DaemonSets to target the same node, resulting in duplicate pods.

Root Cause
The MCP nodeSelectors are inherently overlapping: for example, a node with both node-role.kubernetes.io/worker and node-role.kubernetes.io/workerB labels matches the worker MCP and the workerB MCP simultaneously. Using those selectors directly as DaemonSet's NodeSelectors guarantees
collisions.

Suggested Approach
Place a single, NRO-owned label on each node that encodes its primary pool. The DaemonSet's NodeSelector then targets this label instead of the MCP's nodeSelector, making placement 1:1 by construction.

We reuse MCO's own priority logic (master > custom > worker) to determine each node's primary pool, so the behavior is consistent with how MCO itself decides under what MCP the node belongs.

AIA Human-AI blend, New content, Human-initiated, Reviewed, Claude Opus 4.6 v1.0
Signed-off-by: Talor Itzhak titzhak@redhat.com

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Introduce internal/api/labels with the NRO-owned node label
(numa-operator.openshift.io/primary-pool) and finalizer constants.

Add pkg/machineconfigpools with MCO-compatible pool-selection logic
(ListPools, GetPoolsForNode, GetPrimaryPoolForNode, GetNodesForPool,
HasNodesForPool) that determines each node's primary MCP following
MCO's master > custom > worker priority.

These building blocks are used by the node-labeling logic (next commit)
to ensure each node gets exactly one DaemonSet.

AIA: Human-AI blend, New content, Human-initiated, Reviewed, Claude Opus 4.6 v1.0
Signed-off-by: Talor Itzhak <titzhak@redhat.com>
@Tal-or Tal-or force-pushed the primary_pool_selection branch 3 times, most recently from ea23eca to e608a4f Compare May 31, 2026 17:21
@Tal-or
Copy link
Copy Markdown
Collaborator Author

Tal-or commented May 31, 2026

/jira refresh

@openshift-ci-robot
Copy link
Copy Markdown

openshift-ci-robot commented May 31, 2026

@Tal-or: This pull request references CNF-23818 which is a valid jira issue.

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot
Copy link
Copy Markdown

openshift-ci-robot commented May 31, 2026

@Tal-or: This pull request references CNF-23818 which is a valid jira issue.

Details

In response to this:

Problem Statement
When a node's labels match multiple MachineConfigPools (e.g. a custom pool that extends the worker role), the current DaemonSet's NodeSelector — which copies the MCP's own nodeSelector — causes multiple RTE DaemonSets to target the same node, resulting in duplicate pods.

Root Cause
The MCP nodeSelectors are inherently overlapping: for example, a node with both node-role.kubernetes.io/worker and node-role.kubernetes.io/workerB labels matches the worker MCP and the workerB MCP simultaneously. Using those selectors directly as DaemonSet's NodeSelectors guarantees
collisions.

Suggested Approach
Place a single, NRO-owned label on each node that encodes its primary pool. The DaemonSet's NodeSelector then targets this label instead of the MCP's nodeSelector, making placement 1:1 by construction.

We reuse MCO's own priority logic (master > custom > worker) to determine each node's primary pool, so the behavior is consistent with how MCO itself decides under what MCP the node belongs.

NOTE:
This is not relevant for HyperShift since on HyperShift we're setting the DaemonSet's NodeSelector as the NodePool name which are exclusive for each node, meaning a node can't have more that a single NodePool label.

AIA Human-AI blend, New content, Human-initiated, Reviewed, Claude Opus 4.6 v1.0
Signed-off-by: Talor Itzhak titzhak@redhat.com

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@Tal-or Tal-or force-pushed the primary_pool_selection branch from e608a4f to 15fb24e Compare June 1, 2026 08:53
@Tal-or
Copy link
Copy Markdown
Collaborator Author

Tal-or commented Jun 2, 2026

/retest

@Tal-or Tal-or force-pushed the primary_pool_selection branch from 15fb24e to 049d1a4 Compare June 2, 2026 08:18
Tal-or added 5 commits June 2, 2026 12:10
AIA: Human-AI blend, New content, Human-initiated, Reviewed, Claude Opus 4.6 v1.0
Signed-off-by: Talor Itzhak <titzhak@redhat.com>
When a node's MCO primary pool is not in the NRO NodeGroups (e.g. a
custom MCP like workerB), fall back to the best matching managed pool
(e.g. worker) instead of silently dropping the node from RTE coverage.
This preserves the previous behavior where all nodes matching a managed
MCP's NodeSelector receive an RTE pod.

AIA: Human-AI blend, New content, Human-initiated, Reviewed, Claude Opus 4.6 v1.0
Signed-off-by: Talor Itzhak <titzhak@redhat.com>
…aemonSet targeting

AIA: Human-AI blend, New content, Human-initiated, Reviewed, Claude Opus 4.6 v1.0
Signed-off-by: Talor Itzhak <titzhak@redhat.com>
…labeling

AIA: Human-AI blend, New content, Human-initiated, Reviewed, Claude Opus 4.6 v1.0
Signed-off-by: Talor Itzhak <titzhak@redhat.com>
Signed-off-by: Talor Itzhak <titzhak@redhat.com>
AIA: Human-AI blend, New content, Human-initiated, Reviewed, Claude Opus 4.6 v1.0
Signed-off-by: Talor Itzhak <titzhak@redhat.com>
@Tal-or Tal-or force-pushed the primary_pool_selection branch from 049d1a4 to 183d387 Compare June 2, 2026 10:24
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Jun 2, 2026

@Tal-or: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/ci-install-e2e-compact 183d387 link false /test ci-install-e2e-compact

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@Tal-or
Copy link
Copy Markdown
Collaborator Author

Tal-or commented Jun 2, 2026

CI E2E - for tools depends on #4311

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants