Skip to content

Adding toggle that removes finalizers post-delete#1347

Open
chapmanc wants to merge 16 commits intomainfrom
chappie/finalizerRemoval
Open

Adding toggle that removes finalizers post-delete#1347
chapmanc wants to merge 16 commits intomainfrom
chappie/finalizerRemoval

Conversation

@chapmanc
Copy link
Copy Markdown
Contributor

@chapmanc chapmanc commented Mar 26, 2026

Two related fixes for orphaned CRs with finalizers blocking namespace/operator teardown:

  1. Pre-delete finalizer removal job (finalizerRemoval.enabled=false by default)
    Adds an opt-in pre-delete helm hook that strips operator.redpanda.com/finalizer from all operator-managed CRs before helm uninstall completes. Discovers CR types dynamically from the scheme — no code changes needed when new types are added.

  2. Skip finalizer addition in terminating namespaces
    The reconciler now checks if the CR's namespace has DeletionTimestamp set before adding the operator finalizer. Prevents the controller from re-adding finalizers to objects in a namespace that's already being deleted, which was the root cause of the blocking behavior.

@chapmanc chapmanc changed the title Adding finalizer removal toggle that removes finalizer pre-delete Adding toggle that removes finalizers pre-delete Mar 26, 2026
@david-yu
Copy link
Copy Markdown
Contributor

@chapmanc How urgent is this to review could this wait until next week?

@andrewstucki
Copy link
Copy Markdown
Contributor

@david-yu - I talked to Chappie about this already, it's not super urgent, he's just trying to self-serve on this feature to make things in cloud a bit easier in case something goes awry in cleaning up clusters. Seems like in the past they have had issues where the resources they've created haven't been deleted properly before uninstalling the operator and when they delete the namespace it sees orphaned resources and can't cascade delete them due to the finalizers.

Since finalizers are a safe-guard against leaving around side-effects that really should be cleaned up, I'm going to call this additional flag a "feature" not a "bug" since the behavior of leaving around CR instances with finalizers post uninstall of the operator is intended (and, at least in my experience, fairly common). Definitely fine with making an option for finalizer removal (or even CR + CRD removal as well) for those who really want it, but think it should definitely be opt-in.

Comment thread operator/chart/pre_delete_finalizer_removal_job.go Outdated
@david-yu
Copy link
Copy Markdown
Contributor

Fixed some lint issues for you, hopefully should pass all checks.

Copy link
Copy Markdown
Contributor

@andrewstucki andrewstucki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few changes, also, if possible I'd like an acceptance test to verify the finalizer removal/namespace deletion unblock. LMK if you need help with that.

Comment thread operator/chart/templates/_pre-delete-finalizer-removal-job.go.tpl Outdated
Comment thread operator/cmd/finalizerremoval/finalizer_removal.go Outdated
Comment thread operator/internal/controller/redpanda/resource_controller.go Outdated
Comment thread operator/internal/controller/redpanda/resource_controller_test.go Outdated
Comment thread operator/chart/values.yaml Outdated
@andrewstucki andrewstucki changed the title Adding toggle that removes finalizers pre-delete Adding toggle that removes finalizers post-delete Mar 31, 2026
@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 6, 2026

This PR is stale because it has been open 5 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the stale label Apr 6, 2026
@david-yu david-yu removed the stale label Apr 7, 2026
@david-yu
Copy link
Copy Markdown
Contributor

david-yu commented Apr 7, 2026

Addressed all review feedback from @andrewstucki in 379591f:

  • Deleted stale template: Removed _pre-delete-finalizer-removal-job.go.tpl that gotohelm left behind after the pre→post-delete rename.
  • Use crds package: Replaced scheme reflection (UnifiedScheme.AllKnownTypes() + group filtering) with crds.All() to discover operator CRD types directly.
  • Removed terminating namespace check: Dropped the namespace DeletionTimestamp guard from the resource controller — the post-delete hook is the proper mechanism for orphaned CR cleanup.
  • Fixed all pre-delete → post-delete references: values.yaml, README, changelog, package doc.
  • Added envtest-based tests for finalizer removal: Cross-referenced with the cloud codebase to understand scenarios where orphaned CRs occur during the teardown flow (operator uninstalled before CRs are cleaned up, finalizers block namespace deletion). Tests cover:
    • Orphaned CRs have their operator finalizer stripped
    • Multiple CR types (Topic, User) handled independently
    • CRs without the operator finalizer are left untouched
    • Unknown/uninstalled GVKs are skipped gracefully
    • End-to-end: CRs are deletable after finalizer removal, unblocking namespace teardown

@david-yu
Copy link
Copy Markdown
Contributor

Added a TestTemplate test case for finalizerRemoval.enabled: true in 231674d5. This ensures the post-delete Job, ServiceAccount, ClusterRole, and ClusterRoleBinding are rendered and snapshot-tested in the golden file — matching the convention used by all other feature flags (connectController.enabled, crds.enabled, etc.).

Without this, CI's git diff --exit-code step would fail because task generate regenerates the golden file with the new case, but the committed golden file wouldn't have it.

chapmanc and others added 11 commits April 9, 2026 21:51
Adds a new `finalizerRemoval.enabled` boolean (default false) to the
operator chart values. This will gate a pre-delete helm hook job that
strips finalizers from all operator-managed CRs on uninstall.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds SA name helper and SA object for the upcoming pre-delete finalizer
removal job, following the same pattern as MigrationJobServiceAccount.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a pre-delete helm hook Job that runs the finalizer-removal
subcommand on helm uninstall when finalizerRemoval.enabled=true.
Mirrors the pattern from PostUpgradeMigrationJob (f1112cb).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds RBAC bundle (reusing operator's v2-manager rules) for the pre-delete
finalizer removal job and wires the job + SA into the chart render function.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds the finalizer-removal CLI subcommand that lists all operator-managed
CR types (RedpandaRole, User, Group, Topic, Schema, ShadowLink, Redpanda,
NodePool) across all namespaces and removes the operator finalizer from
each. If a CRD is not installed, the error is logged and skipped gracefully.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds the finalizer-removal CLI subcommand that discovers all CR types in
the operator's API groups (cluster.redpanda.com, redpanda.vectorized.io)
from the scheme at runtime, then lists and removes the operator finalizer
from each. New types are automatically picked up without code changes.

If a CRD is not installed the error is logged and that type is skipped
gracefully so the job succeeds even in partial-install scenarios.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Regenerates templates, schema, and partial values struct to include
the new finalizerRemoval opt-in field and pre-delete hook job.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
If a namespace is being deleted, the controller no longer adds the
operator finalizer to CRs. Previously a re-created CR in a terminating
namespace would have a finalizer added, blocking namespace deletion.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Verifies that the reconciler does not add the operator finalizer when the
object's namespace has DeletionTimestamp set. envtest has no namespace
controller so deleting a namespace leaves its objects intact, exactly
modelling the window the check guards against.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Running post-delete means the operator Deployment is already gone before
finalizers are stripped, eliminating the race where the controller could
re-add them. A failed job no longer blocks the uninstall since chart
resources are already deleted by the time the hook runs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
david-yu and others added 5 commits April 9, 2026 21:51
Swap import order of corev1 and apierrors in resource_controller.go
to satisfy goimports, and add missing finalizerRemoval entries to the
operator chart README.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Switch to post-delete hook (already done) and delete stale
  _pre-delete-finalizer-removal-job.go.tpl template
- Use crds.All() instead of scheme reflection to discover operator
  CRD types for finalizer removal
- Remove terminating namespace check from resource controller as the
  post-delete hook is the proper mechanism for orphaned CR cleanup
- Remove corresponding terminating namespace test
- Fix all pre-delete -> post-delete references in values.yaml, README,
  changelog, and package doc
- Add envtest-based tests for finalizer removal covering orphaned CRs,
  multiple CR types, missing CRDs, and end-to-end deletion unblock

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Set spec.cluster.clusterRef.name on test CRs so they pass CRD
  validation (Topic/User require cluster or kafkaApiSpec)
- Remove trailing blank line in resource_controller_test.go that
  caused git diff --exit-code lint failure

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add a TestTemplate test case that exercises the finalizerRemoval
toggle, ensuring the post-delete Job, ServiceAccount, ClusterRole,
and ClusterRoleBinding are rendered correctly when enabled.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@david-yu david-yu force-pushed the chappie/finalizerRemoval branch from 231674d to 63ffeb6 Compare April 10, 2026 04:52
@github-actions
Copy link
Copy Markdown

This PR is stale because it has been open 5 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the stale label Apr 16, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants