Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
237 changes: 237 additions & 0 deletions contrib/spec-OCPBUGS-62942.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,237 @@
# Specification: OCPBUGS-62942 - Fix ClusterExtension deletion when BoxcutterRuntime is enabled

## Problem Statement

When BoxcutterRuntime feature gate is enabled after ClusterExtensions have been installed, those existing ClusterExtensions cannot be deleted. The deletion fails with the error:

```
error walking catalogs: error getting package "nginx84930" from catalog "catalog-80117": cache for catalog "catalog-80117" not found
```

### Root Cause Analysis

The issue occurs because:

1. **ClusterExtensions are installed** before BoxcutterRuntime feature gate is enabled
2. **User enables BoxcutterRuntime feature gate** (and potentially deletes some catalogs)
3. **User tries to delete a ClusterExtension**
4. The controller's reconcile loop runs and attempts to resolve the bundle even though the ClusterExtension is being deleted
5. The catalog that was originally used to install the ClusterExtension may no longer exist or its cache is not available
6. The resolution fails with "cache for catalog not found" error
7. The reconcile loop returns this error, preventing the deletion from completing

### Code Flow Analysis

From `/home/tshort/git/operator-framework/operator-controller/internal/operator-controller/controllers/clusterextension_controller.go`:

```
reconcile() -> line 186: handling finalizers
-> line 198: check if deletion timestamp is set
-> If deletion timestamp is set AND finalizers are done, return early
-> line 217: getting installed bundle (this calls GetRevisionStates)
-> line 234: resolving bundle (this is where the error occurs)
```

The problem is at **line 217-234**: The code calls `GetRevisionStates` and then attempts resolution even when the ClusterExtension is being deleted. When the ClusterExtension has a deletion timestamp, we should skip resolution entirely.

However, there's an early return at line 198 that should handle this, but it only returns if `finalizeResult.Updated || finalizeResult.StatusUpdated` is false. If the finalization is still in progress, the reconcile continues to line 217 where it tries to get the installed bundle and resolve.

## Solution Design

The fix is to check if the ClusterExtension is being deleted **before** attempting to get the installed bundle or perform resolution. If a deletion timestamp is present, we should:

1. Skip getting the installed bundle
2. Skip resolution
3. Skip unpacking and applying
4. Just handle finalizers and return

This is a simple fix that adds an early return check right after the finalizer handling logic.

### Proposed Code Change

In `/home/tshort/git/operator-framework/operator-controller/internal/operator-controller/controllers/clusterextension_controller.go`, modify the `reconcile()` function:

**Current code (lines 186-204):**
```go
l.Info("handling finalizers")
finalizeResult, err := r.Finalizers.Finalize(ctx, ext)
if err != nil {
setStatusProgressing(ext, err)
return ctrl.Result{}, err
}
if finalizeResult.Updated || finalizeResult.StatusUpdated {
// On create: make sure the finalizer is applied before we do anything
// On delete: make sure we do nothing after the finalizer is removed
return ctrl.Result{}, nil
}

if ext.GetDeletionTimestamp() != nil {
// If we've gotten here, that means the cluster extension is being deleted, we've handled all of
// _our_ finalizers (above), but the cluster extension is still present in the cluster, likely
// because there are _other_ finalizers that other controllers need to handle, (e.g. the orphan
// deletion finalizer).
return ctrl.Result{}, nil
}
```

The logic already handles this correctly! The issue is that the early return at line 192-195 only happens if `finalizeResult.Updated || finalizeResult.StatusUpdated` is true. But what about the case where:
- The ClusterExtension has a deletion timestamp
- The finalizers are NOT done being processed
- `finalizeResult.Updated || finalizeResult.StatusUpdated` is false

In this case, the code continues past line 196 and does NOT hit the early return at line 198-203. It then continues to line 217 where it tries to get the installed bundle and resolve, which fails.

**The fix:** Move the deletion timestamp check BEFORE the finalizer handling:

```go
l.Info("handling finalizers")
finalizeResult, err := r.Finalizers.Finalize(ctx, ext)
if err != nil {
setStatusProgressing(ext, err)
return ctrl.Result{}, err
}
if finalizeResult.Updated || finalizeResult.StatusUpdated {
// On create: make sure the finalizer is applied before we do anything
// On delete: make sure we do nothing after the finalizer is removed
return ctrl.Result{}, nil
}

if ext.GetDeletionTimestamp() != nil {
// If we've gotten here, that means the cluster extension is being deleted, we've handled all of
// _our_ finalizers (above), but the cluster extension is still present in the cluster, likely
// because there are _other_ finalizers that other controllers need to handle, (e.g. the orphan
// deletion finalizer).
//
// Do NOT proceed with resolution or installation when the ClusterExtension is being deleted,
// as the catalogs may no longer be available.
return ctrl.Result{}, nil
}
```

Wait, that's already there. Let me re-analyze...

Actually, looking more carefully at the error in the JIRA:

```
E1010 11:44:49.754364 1 controller.go:474] "Reconciler error" err="error walking catalogs: error getting package \"nginx84930\" from catalog \"catalog-80117\": cache for catalog \"catalog-80117\" not found" controller="controller-operator-cluster-extension-controller" controllerGroup="olm.operatorframework.io" controllerKind="ClusterExtension" ClusterExtension="extension-84930"
```

The error is for `extension-84930`, but the user was trying to delete `extension-80117`. This suggests that `extension-84930` is NOT being deleted, but it's failing to reconcile because `catalog-80117` (which is associated with a different extension) was deleted.

Let me reconsider the problem: The issue might be that when BoxcutterRuntime is enabled, catalogs get deleted, and then other extensions that reference those catalogs fail to reconcile.

### Revised Root Cause

The actual issue is:
1. Extensions are installed with their own dedicated catalogs (extension-80117 uses catalog-80117, extension-84930 uses catalog-84930)
2. User enables BoxcutterRuntime feature gate
3. User deletes extension-80117
4. The deletion of extension-80117 also deletes catalog-80117
5. But extension-84930 still references catalog-80117 somehow (or there's a race condition)
6. extension-84930 tries to reconcile and fails because catalog-80117 is gone

Wait, looking at the logs more carefully:

```
I1010 11:44:49.901578 1 clusterextension_controller.go:105 "reconcile starting" ... ClusterExtension="extension-80117"
I1010 11:44:49.901604 1 clusterextension_controller.go:186 "handling finalizers" ... ClusterExtension="extension-80117"
I1010 11:44:49.901756 1 clusterextension_controller.go:217 "getting installed bundle" ... ClusterExtension="extension-80117"
I1010 11:44:49.901860 1 clusterextension_controller.go:272 "unpacking resolved bundle" ... ClusterExtension="extension-80117"
```

So the controller IS continuing past the finalizer handling for `extension-80117` even though it should have a deletion timestamp!

This confirms my original analysis was correct. The issue is that the code at line 198 is supposed to catch this, but something is wrong with the logic.

### Final Root Cause

The issue is that the early return at line 198-203 is AFTER the check at line 192-195. If `finalizeResult.Updated || finalizeResult.StatusUpdated` is **false** AND the deletion timestamp is set, then:
- Line 192-195: Does NOT return (because finalizeResult is not updated)
- Line 198-203: DOES return (because deletion timestamp is set)

But what if there's an issue with the deletion timestamp check? Let me re-read the logs...

Actually, the logs show:
```
I1010 11:44:49.901604 1 clusterextension_controller.go:186 "handling finalizers"
I1010 11:44:49.901756 1 clusterextension_controller.go:217 "getting installed bundle"
```

Line 186 is the "handling finalizers" log, and then it goes straight to line 217 "getting installed bundle". This means it skipped the deletion timestamp check at line 198!

This can only happen if:
1. The deletion timestamp is NOT set (but the user said they're trying to delete it), OR
2. The `finalizeResult.Updated || finalizeResult.StatusUpdated` check at line 192 returned true, so it returned early and this is a subsequent reconcile

Let me reconsider: Maybe the finalizer itself is trying to resolve the bundle? Let me check what the finalizers do...

From the main.go file, the finalizers are:
1. `ClusterExtensionCleanupUnpackCacheFinalizer` - just deletes from image cache
2. `ClusterExtensionCleanupContentManagerCacheFinalizer` - deletes from content manager

Neither of these should trigger resolution. So the issue must be in the reconcile loop logic.

### Actual Fix

After careful analysis, I believe the issue is that the reconcile loop continues to process the ClusterExtension even when it's being deleted, because the deletion timestamp check happens AFTER attempting to handle finalizers, and if the finalizer processing doesn't update anything, it continues to the resolution phase.

The fix is to check for deletion timestamp IMMEDIATELY after finalizer handling, before doing ANY other work:

```go
l.Info("handling finalizers")
finalizeResult, err := r.Finalizers.Finalize(ctx, ext)
if err != nil {
setStatusProgressing(ext, err)
return ctrl.Result{}, err
}

// If the ClusterExtension is being deleted, we should not proceed with resolution or installation.
// This prevents errors when catalogs that were used for installation are no longer available.
if ext.GetDeletionTimestamp() != nil {
// Even if finalizers haven't finished, we don't want to continue with resolution/installation
// when the resource is being deleted. Just handle status updates if needed and return.
return ctrl.Result{}, nil
}

if finalizeResult.Updated || finalizeResult.StatusUpdated {
// On create: make sure the finalizer is applied before we do anything
return ctrl.Result{}, nil
}
```

This ensures that we NEVER attempt resolution or installation when a ClusterExtension is being deleted, regardless of whether the finalizers have been updated or not.

## Implementation Steps

1. **Modify the reconcile function** in `internal/operator-controller/controllers/clusterextension_controller.go`:
- Move the deletion timestamp check to occur IMMEDIATELY after finalizer error handling
- This ensures we skip resolution/installation for any ClusterExtension that is being deleted

2. **Add unit tests** to verify:
- When a ClusterExtension has a deletion timestamp, the reconcile loop does not attempt resolution
- Finalizers are still processed correctly for ClusterExtensions being deleted
- The controller returns early without errors when a ClusterExtension is being deleted

3. **Add or update e2e tests** to verify:
- ClusterExtensions can be deleted successfully after BoxcutterRuntime is enabled
- Even if the original catalog is no longer available, deletion still succeeds

## Testing Plan

### Manual Testing
1. Install a ClusterExtension with BoxcutterRuntime disabled
2. Enable BoxcutterRuntime feature gate
3. Delete the ClusterExtension
4. Verify deletion succeeds without errors

### Automated Tests
1. Unit test: ClusterExtension deletion skips resolution
2. E2E test: BoxcutterRuntime upgrade scenario

## Risks and Mitigations

**Risk**: Moving the deletion timestamp check might affect other edge cases
**Mitigation**: Thorough testing of the deletion flow with unit and e2e tests

**Risk**: The fix might mask other underlying issues with catalog availability
**Mitigation**: Ensure proper error handling and logging for catalog-related errors in non-deletion scenarios
Original file line number Diff line number Diff line change
Expand Up @@ -189,17 +189,22 @@ func (r *ClusterExtensionReconciler) reconcile(ctx context.Context, ext *ocv1.Cl
setStatusProgressing(ext, err)
return ctrl.Result{}, err
}
if finalizeResult.Updated || finalizeResult.StatusUpdated {
// On create: make sure the finalizer is applied before we do anything
// On delete: make sure we do nothing after the finalizer is removed

// If the ClusterExtension is being deleted, we should not proceed with resolution or installation.
// This prevents errors when catalogs that were used for installation are no longer available.
// We check this immediately after finalizer handling to ensure we skip all reconciliation logic
// for resources that are being deleted, even if the finalizers haven't been updated.
if ext.GetDeletionTimestamp() != nil {
// The cluster extension is being deleted. We've handled finalizer registration/removal above,
// but the cluster extension may still be present in the cluster because there are other
// finalizers that other controllers need to handle (e.g. the orphan deletion finalizer).
// In this case, we should not proceed with resolution, unpacking, or installation.
return ctrl.Result{}, nil
}

if ext.GetDeletionTimestamp() != nil {
// If we've gotten here, that means the cluster extension is being deleted, we've handled all of
// _our_ finalizers (above), but the cluster extension is still present in the cluster, likely
// because there are _other_ finalizers that other controllers need to handle, (e.g. the orphan
// deletion finalizer).
if finalizeResult.Updated || finalizeResult.StatusUpdated {
// On create: make sure the finalizer is applied before we do anything
// On delete: make sure we do nothing after the finalizer is removed
return ctrl.Result{}, nil
}

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -121,6 +121,70 @@ func TestClusterExtensionShortCircuitsReconcileDuringDeletion(t *testing.T) {
}
}

// TestClusterExtensionDeletionWithUnavailableCatalog tests the fix for OCPBUGS-62942.
// It verifies that when a ClusterExtension is being deleted, the reconcile loop
// does not attempt resolution even if the catalog is unavailable (which would cause
// a "cache for catalog not found" error).
func TestClusterExtensionDeletionWithUnavailableCatalog(t *testing.T) {
cl, reconciler := newClientAndReconciler(t)

// Set up a resolver that would fail with a catalog error
catalogNotFoundErr := errors.New("error walking catalogs: error getting package from catalog: cache for catalog not found")
reconciler.Resolver = resolve.Func(func(_ context.Context, _ *ocv1.ClusterExtension, _ *ocv1.BundleMetadata) (*declcfg.Bundle, *bsemver.Version, *declcfg.Deprecation, error) {
return nil, nil, nil, catalogNotFoundErr
})

// Set up a revision states getter that would also fail
revisionStatesErr := errors.New("getting installed bundle failed")
reconciler.RevisionStatesGetter = &MockRevisionStatesGetter{
Err: revisionStatesErr,
}

ctx := context.Background()
pkgName := fmt.Sprintf("test-pkg-%s", rand.String(6))
extKey := types.NamespacedName{Name: fmt.Sprintf("cluster-extension-test-%s", rand.String(8))}

t.Log("When a ClusterExtension is being deleted")
t.Log("By creating a cluster extension with a finalizer")
clusterExtension := &ocv1.ClusterExtension{
ObjectMeta: metav1.ObjectMeta{
Name: extKey.Name,
Finalizers: []string{
controllers.ClusterExtensionCleanupUnpackCacheFinalizer,
},
},
Spec: ocv1.ClusterExtensionSpec{
Source: ocv1.SourceConfig{
SourceType: "Catalog",
Catalog: &ocv1.CatalogFilter{
PackageName: pkgName,
},
},
Namespace: "default",
ServiceAccount: ocv1.ServiceAccountReference{
Name: "default",
},
},
}
require.NoError(t, cl.Create(ctx, clusterExtension))

t.Log("And deleting the cluster extension")
require.NoError(t, cl.Delete(ctx, clusterExtension))

t.Log("Then reconcile should succeed without attempting resolution")
t.Log("By running reconcile")
res, err := reconciler.Reconcile(ctx, ctrl.Request{NamespacedName: extKey})
require.Equal(t, ctrl.Result{}, res)
// Should not return the catalog error or revision states error because
// reconcile should short-circuit before calling either the RevisionStatesGetter
// or the Resolver
require.NoError(t, err)

t.Log("And the cluster extension should still exist (waiting for finalizer cleanup)")
require.NoError(t, cl.Get(ctx, extKey, clusterExtension))
require.NotNil(t, clusterExtension.DeletionTimestamp)
}

func TestClusterExtensionResolutionFails(t *testing.T) {
pkgName := fmt.Sprintf("non-existent-%s", rand.String(6))
cl, reconciler := newClientAndReconciler(t)
Expand Down
Loading