
Commit a36f82c

feat(jobs): atomic team-deletion teardown + orphan-sweep reconciler
Make team/account deletion eventually-consistent across every system the provisioner touches. Two changes:

team_deletion_executor:
- Step 0 flips deletion_requested -> deletion_pending the instant teardown begins, so a mid-pipeline crash leaves the team in deletion_pending (recoverable) rather than indistinguishable from a grace-window team or half-tombstoned. The tombstone flip now guards on deletion_pending.
- New step 3: delete every instant-deploy-<appID> k8s namespace the team owns. Previously a deleted Pro/Team customer's pods ran (and cost) forever — only customer DBs + S3 backups were torn down.
- fetchCandidates also re-sweeps deletion_pending rows for retry; every step is idempotent so a re-run after a partial failure completes cleanly.

orphan_sweep_reconciler (new, every 15m):
- PASS 1 finishes stuck deletion_pending teams via the executor seam.
- PASS 2 cancels Razorpay subscriptions still live on tombstoned / deletion_pending teams — the "stop the money" backstop so a deleted customer is never billed.
- PASS 3 deletes instant-deploy-* namespaces with no live owner.
- Each reclaimed orphan emits team.orphan_reclaimed; each failure emits team.orphan_sweep_failed for operator alerting.

K8sNamespaceClient + razorpayOrphanCanceler provide the concrete deleter and canceler; both fail open (nil) when the cluster / Razorpay is unreachable.

Tests (sqlmock, no DB needed, run under go test ./...):
- TestTeamDeletionExecutor_TxFailure_StaysPending — mid-deletion failure leaves the team in deletion_pending, not half-tombstoned.
- TestTeamDeletionExecutor_PendingRetry_Idempotent — re-running over a deletion_pending team completes cleanly.
- TestTeamDeletionExecutor_DeletesK8sNamespaces — namespace teardown.
- TestOrphanSweep_Pass1_CompletesStuckPendingTeam / _TeardownFailure_*
- TestOrphanSweep_Pass2_CancelsOrphanedSubscription / _CancelFailure_*
- TestOrphanSweep_Pass3_DeletesOrphanedNamespaceKeepsLive
- TestOrphanSweep_AllPasses_NoOrphans_Idempotent

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 0fe5989 commit a36f82c

7 files changed

Lines changed: 1502 additions & 79 deletions
Lines changed: 111 additions & 0 deletions
@@ -0,0 +1,111 @@
package jobs

// k8s_namespace_client.go — concrete K8sNamespaceDeleter backed by a
// kubernetes.Clientset. Shared by the team_deletion_executor (which deletes
// a deleted team's deploy namespaces) and the orphan_sweep_reconciler
// (which detects + reclaims orphaned namespaces).
//
// WHY ITS OWN FILE
//
// deploy_status_reconcile.go already builds a kubernetes.Clientset for its
// read-only GetDeployment surface. The deletion path needs a *write*
// surface (Namespaces().Delete) plus a List surface for orphan detection.
// Keeping the deleter in its own file makes the blast radius of the write
// capability explicit and gives the orphan-sweep reconciler something to
// import without dragging in the status-reconciler's autopsy machinery.
//
// FAIL-OPEN POSTURE
//
// NewK8sNamespaceClient returns (nil, err) when no cluster is reachable
// (CI, docker-compose). Callers pass nil to the executor / reconciler,
// which then skip the k8s steps with a WARN log — the same posture as
// deployStatusK8sProvider. A worker that cannot reach k8s still runs every
// other periodic job.
//
// IDEMPOTENCY
//
// DeleteNamespace swallows apierrors.IsNotFound — a namespace that is
// already gone (a previous partially-failed teardown deleted it, or the
// deploy-expirer beat us to it) is success, not an error. This is the
// property that makes re-running a failed team deletion safe.

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// k8sNamespaceClient is the concrete K8sNamespaceDeleter. Wraps a
// kubernetes.Clientset.
type k8sNamespaceClient struct {
	cs *kubernetes.Clientset
}

// NewK8sNamespaceClient builds a K8sNamespaceDeleter from in-cluster config,
// falling back to the default kubeconfig for local dev. Returns (nil, err)
// when neither is reachable — the caller logs and passes nil to the
// executor / reconciler.
func NewK8sNamespaceClient() (K8sNamespaceDeleter, error) {
	cs, err := newDeployK8sClientset()
	if err != nil {
		return nil, err
	}
	return &k8sNamespaceClient{cs: cs}, nil
}

// DeleteNamespace removes the namespace and everything it contains. A
// NotFound namespace is treated as success — that is the idempotency
// contract the executor and the reconciler depend on for safe re-runs.
//
// The Delete call is asynchronous on the k8s side (the namespace enters
// Terminating and the control plane garbage-collects its contents); we do
// not block on completion. The orphan-sweep reconciler's NamespaceExists
// check will still report true for a Terminating namespace, so a namespace
// mid-termination is simply re-observed on the next sweep and re-Delete'd
// (also a no-op) — eventually consistent, never wrong.
func (c *k8sNamespaceClient) DeleteNamespace(ctx context.Context, namespace string) error {
	err := c.cs.CoreV1().Namespaces().Delete(ctx, namespace, metav1.DeleteOptions{})
	if err != nil && !apierrors.IsNotFound(err) {
		return fmt.Errorf("k8sNamespaceClient.DeleteNamespace %q: %w", namespace, err)
	}
	return nil
}

// NamespaceExists reports whether the namespace is still present (including
// the Terminating phase). NotFound → (false, nil). Any other error is
// surfaced so the reconciler does not mistake an API outage for "orphan
// already cleaned".
func (c *k8sNamespaceClient) NamespaceExists(ctx context.Context, namespace string) (bool, error) {
	_, err := c.cs.CoreV1().Namespaces().Get(ctx, namespace, metav1.GetOptions{})
	if err == nil {
		return true, nil
	}
	if apierrors.IsNotFound(err) {
		return false, nil
	}
	return false, fmt.Errorf("k8sNamespaceClient.NamespaceExists %q: %w", namespace, err)
}

// ListDeployNamespaces returns the names of every instant-deploy-* namespace
// currently in the cluster. The orphan-sweep reconciler uses this to find
// namespaces whose owning deployment row / team is gone. Returns an error
// on any API failure so the reconciler skips the k8s-orphan phase rather
// than acting on a truncated list.
func (c *k8sNamespaceClient) ListDeployNamespaces(ctx context.Context) ([]string, error) {
	list, err := c.cs.CoreV1().Namespaces().List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, fmt.Errorf("k8sNamespaceClient.ListDeployNamespaces: %w", err)
	}
	var out []string
	for i := range list.Items {
		name := list.Items[i].Name
		if len(name) >= len(deployNamespacePrefixTDE) &&
			name[:len(deployNamespacePrefixTDE)] == deployNamespacePrefixTDE {
			out = append(out, name)
		}
	}
	return out, nil
}
Lines changed: 114 additions & 0 deletions
@@ -0,0 +1,114 @@
package jobs

// orphan_sweep_canceler.go — the production OrphanSubscriptionCanceler.
//
// The orphan-sweep reconciler's PASS 2 cancels Razorpay subscriptions that
// are still live for a team that has been tombstoned. The worker already
// links the Razorpay Go SDK (billing_reconciler.go uses it for the
// subscription fetcher), so we add a thin Cancel wrapper here rather than
// routing a cancel call through the api over HTTP — fewer hops, and the
// reconciler is the system of record for "this subscription must die".
//
// IDEMPOTENCY
//
// Razorpay's cancel endpoint returns an error for a subscription that is
// already in a terminal state (cancelled / completed / expired). The
// reconciler must treat "already cancelled" as success or it would loop on
// the same orphan forever. razorpayOrphanCanceler.CancelSubscription
// inspects the error text and returns nil for the terminal-state cases —
// the same already-gone-is-success contract DeleteNamespace uses.

import (
	"context"
	"fmt"
	"os"
	"strings"

	razorpay "github.com/razorpay/razorpay-go"
)

// razorpaySubCancelClient is the subset of razorpay.Client the canceler
// needs. Narrow interface so tests inject a fake without an httptest
// server. Mirrors razorpaySDKClient in billing_reconciler.go.
type razorpaySubCancelClient interface {
	CancelSubscription(subID string, data map[string]interface{}, extraHeaders map[string]string) (map[string]interface{}, error)
}

// razorpaySubCancelAdapter wraps razorpay.Client to satisfy the interface.
type razorpaySubCancelAdapter struct{ c *razorpay.Client }

func (a *razorpaySubCancelAdapter) CancelSubscription(subID string, data map[string]interface{}, extraHeaders map[string]string) (map[string]interface{}, error) {
	return a.c.Subscription.Cancel(subID, data, extraHeaders)
}

// razorpayOrphanCanceler implements OrphanSubscriptionCanceler.
type razorpayOrphanCanceler struct {
	client razorpaySubCancelClient
}

// NewRazorpayOrphanCanceler constructs the canceler from RAZORPAY_KEY_ID /
// RAZORPAY_KEY_SECRET. Returns (nil, nil) when Razorpay is unconfigured so
// the caller can pass nil into the reconciler — PASS 2 is then skipped with
// a WARN. Same unconfigured contract as NewRazorpaySubFetcher.
func NewRazorpayOrphanCanceler() (OrphanSubscriptionCanceler, error) {
	keyID := os.Getenv("RAZORPAY_KEY_ID")
	keySecret := os.Getenv("RAZORPAY_KEY_SECRET")
	if keyID == "" || keySecret == "" {
		return nil, nil // unconfigured — reconciler skips PASS 2
	}
	c := razorpay.NewClient(keyID, keySecret)
	return &razorpayOrphanCanceler{client: &razorpaySubCancelAdapter{c: c}}, nil
}

// newRazorpayOrphanCancelerFromClient builds a canceler from a pre-built
// client. Used by tests to inject a fake.
func newRazorpayOrphanCancelerFromClient(c razorpaySubCancelClient) *razorpayOrphanCanceler {
	return &razorpayOrphanCanceler{client: c}
}

// CancelSubscription issues an immediate (cancel_at_cycle_end=0) cancel.
//
// A subscription already in a terminal state is treated as success — the
// reconciler's goal is "this subscription is not charging the card", and an
// already-cancelled subscription satisfies that. Without this, the orphan
// would be re-detected and re-attempted on every sweep forever.
func (rc *razorpayOrphanCanceler) CancelSubscription(_ context.Context, subscriptionID string) error {
	if strings.TrimSpace(subscriptionID) == "" {
		// Nothing to cancel — vacuously satisfied.
		return nil
	}
	_, err := rc.client.CancelSubscription(subscriptionID,
		map[string]interface{}{"cancel_at_cycle_end": 0}, nil)
	if err == nil {
		return nil
	}
	if isRazorpayTerminalCancelError(err) {
		// Already cancelled / completed / expired — the money is already
		// stopped, which is exactly the post-condition we want.
		return nil
	}
	return fmt.Errorf("razorpayOrphanCanceler.CancelSubscription %q: %w", subscriptionID, err)
}

// isRazorpayTerminalCancelError reports whether a Razorpay cancel error
// means the subscription is already in a non-charging terminal state.
// Razorpay returns a 400 with a message like "subscription is not in a
// valid state to perform this operation" / "...already been cancelled".
// We match on the substrings rather than parse the JSON body because the
// SDK surfaces the error as an opaque error value.
func isRazorpayTerminalCancelError(err error) bool {
	msg := strings.ToLower(err.Error())
	for _, frag := range []string{
		"already been cancelled",
		"already cancelled",
		"not in a valid state",
		"cannot be cancelled",
		"completed",
		"expired",
	} {
		if strings.Contains(msg, frag) {
			return true
		}
	}
	return false
}
