OCPBUGS-92817: delete MAPI MachineSets before CAPI in e2e cleanup #611

Open
pmeida wants to merge 1 commit into openshift:main from pmeida:OCPBUGS-92817
Conversation

@pmeida

@pmeida pmeida commented Jun 26, 2026

Contributor

Summary

Fixes a deadlock in cleanupMachineSetTestResources that causes e2e-aws-capi-techpreview to fail with a 15-minute timeout.

When a test creates a CAPI MachineSet and a MAPI MachineSet with the same name and authoritativeAPI: ClusterAPI, deleting the CAPI MachineSet first causes the sync controller to loop in reconcileCAPItoMAPIMachineSetDeletionNormal. The controller waits for the CAPI-specific finalizer (cluster.x-k8s.io/machineset) to be removed, but its own constant requeues conflict with the CAPI controller's finalizer removal patch, so cleanup deadlocks.

Deleting the MAPI MachineSet first instead triggers reconcileCAPItoMAPIMachineSetDeletionCAPINotDeleting, which removes the sync finalizer from the CAPI MachineSet immediately. The CAPI MachineSet can then be deleted cleanly with no sync interference.
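
The ordering described above can be sketched as follows. This is a minimal, dependency-free illustration, not the code in e2e/machineset_migration_helpers.go: the MachineSetRef type, the API constants, and the DeletionOrder function are hypothetical names introduced here, and a real cleanup helper would issue the deletes through the Kubernetes client and wait for each one to complete.

```go
package main

import (
	"fmt"
	"sort"
)

// API identifies which API a MachineSet belongs to (hypothetical type).
type API int

const (
	MAPI API = iota // Machine API MachineSet: deleted first
	CAPI            // Cluster API MachineSet: deleted second
)

func (a API) String() string {
	if a == MAPI {
		return "MAPI"
	}
	return "CAPI"
}

// MachineSetRef is a hypothetical stand-in for a reference to a MachineSet.
type MachineSetRef struct {
	API  API
	Name string
}

// DeletionOrder returns a copy of refs in which every MAPI MachineSet
// precedes every CAPI MachineSet. The sort is stable, so the relative
// order within each API is preserved.
func DeletionOrder(refs []MachineSetRef) []MachineSetRef {
	ordered := make([]MachineSetRef, len(refs))
	copy(ordered, refs)
	sort.SliceStable(ordered, func(i, j int) bool {
		return ordered[i].API < ordered[j].API
	})
	return ordered
}

func main() {
	// Same-named pair, registered CAPI-first as a test might create them.
	refs := []MachineSetRef{
		{API: CAPI, Name: "example-ms"},
		{API: MAPI, Name: "example-ms"},
	}
	for _, r := range DeletionOrder(refs) {
		// A real helper would delete the object here and wait for it to go away.
		fmt.Printf("delete %s MachineSet %s\n", r.API, r.Name)
	}
}
```

Sorting a copy rather than hard-coding two delete calls keeps the rule (MAPI before CAPI) in one place if a test registers several MachineSets for cleanup.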

Test plan

  • e2e-aws-capi-techpreview passes without the "Should have deleted MachineSet openshift-cluster-api/capi-ms-auth-capi-*" timeout

Fixes: https://issues.redhat.com/browse/OCPBUGS-92817

When a test creates both a CAPI MachineSet and a MAPI MachineSet with the
same name and authoritativeAPI: ClusterAPI, the sync controller manages
deletion through reconcileCAPItoMAPIMachineSetDeletionNormal. Deleting CAPI
first causes the sync controller to issue deletion to MAPI and then loop
waiting for the CAPI-specific finalizer (cluster.x-k8s.io/machineset) to
be removed. The sync controller's constant requeues conflict with the CAPI
controller's finalizer removal patch, causing a deadlock.

Deleting MAPI first triggers reconcileCAPItoMAPIMachineSetDeletionCAPINotDeleting
which removes the sync finalizer from CAPI immediately. The CAPI MachineSet
can then be deleted cleanly with only its own finalizer to manage.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@openshift-merge-bot

Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci-robot added the following labels on Jun 26, 2026:
  • jira/severity-low: Referenced Jira bug's severity is low for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
@openshift-ci-robot


@pmeida: This pull request references Jira Issue OCPBUGS-92817, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.


@openshift-ci-robot added the following label on Jun 26, 2026:
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
@openshift-ci

openshift-ci Bot commented Jun 26, 2026

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign mdbooth for approval. For more information see the Code Review Process.


@openshift-ci-robot


@pmeida: This pull request references Jira Issue OCPBUGS-92817, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

@pmeida

pmeida commented Jun 26, 2026

Contributor Author

This fixes the test's cleanup order to use the supported deletion path (MAPI first). When the MAPI MachineSet is deleted first, reconcileCAPItoMAPIMachineSetDeletionCAPINotDeleting removes the sync finalizer from the CAPI MachineSet immediately and steps aside, allowing clean deletion.

The root cause of the deadlock is in reconcileCAPItoMAPIMachineSetDeletionNormal: when the CAPI MachineSet is deleted first, the sync controller requeues indefinitely while waiting for the CAPI-specific finalizer to be removed, and those requeues can conflict with the CAPI controller's own finalizer removal patch.
This fix avoids triggering that path in tests, but it doesn't solve the core issue.

@pmeida changed the title from "OCPBUGS-92817: delete MAPI MachineSets before CAPI in cleanup" to "OCPBUGS-92817: delete MAPI MachineSets before CAPI in e2e cleanup" on Jun 26, 2026
@pmeida

pmeida commented Jun 26, 2026

Contributor Author

/test e2e-aws-capi-techpreview

@damdo

damdo commented Jun 26, 2026

Member

/assign @theobarberbany

@openshift-ci

openshift-ci Bot commented Jun 26, 2026

Contributor

@pmeida: all tests passed!


@openshift-ci-robot


@pmeida: This pull request references Jira Issue OCPBUGS-92817, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.


Labels

  • jira/severity-low: Referenced Jira bug's severity is low for the branch this PR is targeting.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.


4 participants