
Commit a0f4014

feat(worker): audit-only orphan-customer-DB / redis-namespace sweep (flag-gated OFF) (#102)
* feat(worker): audit-only orphan-customer-DB / redis-namespace sweep (flag-gated OFF)

New River periodic job orphan_db_sweep (hourly, reconcile queue, UniqueOpts) that addresses the ~25 orphaned customer DB / redis namespace drain-backlog, in DETECTION / DRY-RUN mode only. It lists instant-customer-* namespaces, flags the ones whose token has NO non-terminal (pending/active/paused/suspended) resources row and is past the provisioning grace window, then LOGS each candidate (token masked via logsafe.Token) and emits the candidate metrics. It DROPS NOTHING in audit-only mode.

truehomie-2026-06-03 safety: there is NO manual / raw DROP anywhere in this job. The destructive teardown sits behind a SECOND flag and, when (and only when) enabled, routes through the AUDITED provisioner DeprovisionResource chokepoint, the same path the TTL reaper (expire.go) uses. For this PR that path is intentionally unreachable-by-default.

Two flag gates, BOTH default OFF / fail-closed:

  ORPHAN_DB_SWEEP_ENABLED: master flag; off → Work is a DEBUG no-op (no namespace List, no DB read, no metric).
  ORPHAN_DB_SWEEP_DESTRUCTIVE_ENABLED: destructive flag; meaningless unless the master is also on AND a provisioner is wired. Routes through the audited chokepoint only.

Fail-safe / fail-open:

  - a namespace-List error or a live-token DB-read error degrades to ZERO candidates (never an empty-set that a destructive caller could read as "drop everything");
  - a candidate whose token reappears live at the destructive re-confirm is SKIPPED;
  - a generic (unmapped-kind) orphan is SKIPPED (no proven backing type → no guessed DROP).

When in doubt, skip + log.

Metrics (lazy *Vec, both labels primed in metrics_test):

  instant_orphan_db_sweep_candidates_total{kind}: counter, kind in {customer_namespace, redis_namespace}
  instant_orphan_db_sweep_candidates_current{kind}: gauge, current backlog

Alert + dashboard + catalog live in the infra repo (not owned here); see PR body for the exact metric names + suggested alerts (rule 25 follow-up).

Tests: candidate detection (orphan vs live vs pending vs within-grace), kind classification, both flag gates (off → no-op), masking, fail-open paths, and that the destructive deprovisioner is NEVER called in audit-only mode. New file at 100% statement coverage. make gate green (build + vet + go test -short).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(worker): 100% patch coverage: extract typed-nil-safe deprovisioner helper

diff-cover flagged the `if provClient != nil` branch in the StartWorkers wiring (integration-only, not unit-reachable). Extract the typed-nil-safe conversion into orphanDBSweepDeprovisionerFor (mirrors NewExpireAnonymousWorker's handling) and unit-test both arms directly, so the wiring call site is a single non-branching expression covered by TestStartWorkers_FullBoot and the branch logic is covered by the new unit test. New code back to 100% patch coverage.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent 4eb1207 commit a0f4014
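
The detection rule described in the commit message (prefix match, no live token, past the grace window, and zero candidates on any read error) can be sketched roughly as follows. This is an illustration only, not the code from orphan_db_sweep.go, which is not shown on this page: the `namespace` type, `findOrphans`, `demoCandidates`, the one-hour grace window, and the sample data are all hypothetical names and values chosen for the example.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// nsPrefix is the namespace prefix named in the commit message.
const nsPrefix = "instant-customer-"

// namespace is a hypothetical, minimal view of one listed namespace.
type namespace struct {
	Name      string
	CreatedAt time.Time
}

// findOrphans returns the namespaces whose token has no live resources row
// and that are older than the grace window. If either upstream read failed,
// it returns zero candidates, so a failed read is never mistaken for
// "nothing is live".
func findOrphans(list []namespace, listErr error, live map[string]bool, liveErr error, grace time.Duration, now time.Time) []namespace {
	if listErr != nil || liveErr != nil {
		return nil // fail toward doing nothing
	}
	var out []namespace
	for _, ns := range list {
		if !strings.HasPrefix(ns.Name, nsPrefix) {
			continue // not a customer namespace
		}
		token := strings.TrimPrefix(ns.Name, nsPrefix)
		if token == "" || live[token] {
			continue // unparseable, or still has a non-terminal row
		}
		if now.Sub(ns.CreatedAt) < grace {
			continue // may still be provisioning
		}
		out = append(out, ns)
	}
	return out
}

// demoCandidates runs the sketch on fixed sample data and returns the
// candidate names. failList simulates a namespace-List error.
func demoCandidates(failList bool) []string {
	now := time.Date(2026, 1, 1, 12, 0, 0, 0, time.UTC)
	list := []namespace{
		{Name: nsPrefix + "orphan1", CreatedAt: now.Add(-48 * time.Hour)},
		{Name: nsPrefix + "live1", CreatedAt: now.Add(-48 * time.Hour)},
		{Name: nsPrefix + "fresh1", CreatedAt: now.Add(-5 * time.Minute)},
		{Name: "other-namespace", CreatedAt: now.Add(-48 * time.Hour)},
	}
	live := map[string]bool{"live1": true}
	var listErr error
	if failList {
		listErr = errors.New("list failed")
	}
	names := []string{}
	for _, ns := range findOrphans(list, listErr, live, nil, time.Hour, now) {
		names = append(names, ns.Name)
	}
	return names
}

func main() {
	fmt.Println(demoCandidates(false)) // only the old, non-live customer namespace
	fmt.Println(demoCandidates(true))  // list error: no candidates
}
```

Returning no candidates on error, rather than proceeding with an empty live-token set, is the property the commit message stresses: an empty live set would make every namespace look orphaned.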

7 files changed

Lines changed: 1350 additions & 0 deletions


internal/config/config.go

Lines changed: 32 additions & 0 deletions
@@ -206,6 +206,31 @@ type Config struct {
 	// pathological flapping). Enabling is an operator action after a canary.
 	DeployScaleToZeroEnabled     bool // DEPLOY_SCALE_TO_ZERO_ENABLED — master flag (default false)
 	DeployScaleToZeroIdleMinutes int  // DEPLOY_SCALE_TO_ZERO_IDLE_MINUTES — idle threshold (default 30)
+
+	// Audit-only orphan-customer-DB / orphan-redis-namespace sweep
+	// (orphan_db_sweep.go). TWO independent flags, both default OFF / fail-closed
+	// (project_feature_flag_decision: every new feature ships flag-gated,
+	// default-off, fail-closed).
+	//
+	// OrphanDBSweepEnabled — the MASTER flag. When false (the default) the sweep
+	// Work method no-ops immediately: no namespace List, no DB read, no metric,
+	// no log beyond a single DEBUG. A single env flip turns the whole detection
+	// layer on. In audit-only mode (the destructive flag below OFF) an enabled
+	// sweep ONLY logs masked orphan candidates + emits the candidate metrics — it
+	// drops NOTHING.
+	//
+	// OrphanDBSweepDestructiveEnabled — the SECOND, destructive flag. Default
+	// OFF. It is meaningless unless the master flag is ALSO on. When (and only
+	// when) BOTH are true does the sweep route a confirmed orphan through the
+	// AUDITED provisioner DeprovisionResource chokepoint (the SAME path the TTL
+	// reaper uses) — NEVER a manual/raw DROP. This is the truehomie-2026-06-03
+	// safety posture: an active Pro customer's DB+role were dropped by an
+	// unaudited path, so this job must never improvise a DROP. For THIS PR the
+	// destructive path is wired but intentionally left UNREACHABLE-BY-DEFAULT:
+	// we ship audit-only, review the dry-run candidate list, and only then (in a
+	// later, deliberate operator action) consider lighting the destructive flag.
+	OrphanDBSweepEnabled            bool // ORPHAN_DB_SWEEP_ENABLED — master flag (default false)
+	OrphanDBSweepDestructiveEnabled bool // ORPHAN_DB_SWEEP_DESTRUCTIVE_ENABLED — destructive flag (default false; requires master on; routes through audited provisioner only)
 }
 
 // ErrMissingConfig is returned when a required env var is absent.
@@ -340,6 +365,13 @@ func Load() *Config {
 		// Scale-to-zero idle-scaler (Task #54). Default OFF; idle threshold
 		// default 30 min (parsed below).
 		DeployScaleToZeroEnabled: os.Getenv("DEPLOY_SCALE_TO_ZERO_ENABLED") == "true",
+
+		// Audit-only orphan-DB sweep. BOTH default OFF / fail-closed. The
+		// destructive flag is inert unless the master flag is also on (the sweep
+		// enforces that ordering at runtime). Shipping audit-only: the
+		// destructive flag stays unset until we've reviewed the dry-run list.
+		OrphanDBSweepEnabled:            os.Getenv("ORPHAN_DB_SWEEP_ENABLED") == "true",
+		OrphanDBSweepDestructiveEnabled: os.Getenv("ORPHAN_DB_SWEEP_DESTRUCTIVE_ENABLED") == "true",
 	}
 
 	// DEPLOY_SCALE_TO_ZERO_IDLE_MINUTES: minutes of no-activity before an app is
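
The config comments say the destructive flag is inert unless the master flag is on, and the commit message adds that a provisioner must also be wired. A minimal sketch of how that ordering could be resolved at runtime is below. It is an assumption about the shape of the check, not the worker's actual code: `Deprovisioner`, `provClient`, `deprovisionerFor`, `sweepMode`, and the `mode` constants are hypothetical names. The helper also illustrates the typed-nil concern from the second commit bullet: assigning a nil `*provClient` directly to an interface yields a non-nil interface value, so the conversion returns an untyped nil explicitly.

```go
package main

import "fmt"

// Deprovisioner is a hypothetical stand-in for the audited teardown path.
type Deprovisioner interface {
	DeprovisionResource(token string) error
}

// provClient is a hypothetical concrete client; it may be nil at wiring time.
type provClient struct{}

func (c *provClient) DeprovisionResource(token string) error { return nil }

// deprovisionerFor converts a possibly-nil *provClient into an interface
// value. Returning the pointer directly would produce a non-nil interface
// wrapping a nil pointer, which would defeat the `d != nil` check below.
func deprovisionerFor(c *provClient) Deprovisioner {
	if c == nil {
		return nil
	}
	return c
}

type mode string

const (
	modeOff         mode = "off"
	modeAuditOnly   mode = "audit-only"
	modeDestructive mode = "destructive"
)

// sweepMode resolves the two flags. The master flag is checked first, so the
// destructive flag alone can never enable anything; destructive mode also
// requires a wired deprovisioner.
func sweepMode(master, destructive bool, d Deprovisioner) mode {
	if !master {
		return modeOff
	}
	if destructive && d != nil {
		return modeDestructive
	}
	return modeAuditOnly
}

func main() {
	var unwired *provClient // nil pointer, as when no provisioner is configured
	wired := &provClient{}

	fmt.Println(sweepMode(false, true, deprovisionerFor(wired)))  // master off wins
	fmt.Println(sweepMode(true, false, deprovisionerFor(wired)))  // detection only
	fmt.Println(sweepMode(true, true, deprovisionerFor(unwired))) // no provisioner wired
	fmt.Println(sweepMode(true, true, deprovisionerFor(wired)))   // both flags plus provisioner
}
```

Every combination other than "both flags on with a wired provisioner" resolves to off or audit-only, which matches the fail-closed posture the comments describe.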

internal/config/config_test.go

Lines changed: 38 additions & 0 deletions
@@ -124,6 +124,44 @@ func TestLoad_Defaults(t *testing.T) {
 	if cfg.SESTemplateNames == nil || len(cfg.SESTemplateNames) != 0 {
 		t.Errorf("SESTemplateNames = %v", cfg.SESTemplateNames)
 	}
+	// Audit-only orphan-DB sweep flags both default OFF / fail-closed.
+	if cfg.OrphanDBSweepEnabled {
+		t.Error("OrphanDBSweepEnabled should default false (fail-closed)")
+	}
+	if cfg.OrphanDBSweepDestructiveEnabled {
+		t.Error("OrphanDBSweepDestructiveEnabled should default false (fail-closed)")
+	}
+}
+
+// TestLoad_OrphanDBSweepFlags pins the two env-driven flags of the audit-only
+// orphan-DB sweep: each is set ONLY by its exact env var being literally "true".
+func TestLoad_OrphanDBSweepFlags(t *testing.T) {
+	t.Run("both true", func(t *testing.T) {
+		clearEnv(t)
+		t.Setenv("DATABASE_URL", "postgres://localhost/db")
+		t.Setenv("ORPHAN_DB_SWEEP_ENABLED", "true")
+		t.Setenv("ORPHAN_DB_SWEEP_DESTRUCTIVE_ENABLED", "true")
+		cfg := Load()
+		if !cfg.OrphanDBSweepEnabled {
+			t.Error("OrphanDBSweepEnabled should be true when env is 'true'")
+		}
+		if !cfg.OrphanDBSweepDestructiveEnabled {
+			t.Error("OrphanDBSweepDestructiveEnabled should be true when env is 'true'")
+		}
+	})
+	t.Run("non-true is off", func(t *testing.T) {
+		clearEnv(t)
+		t.Setenv("DATABASE_URL", "postgres://localhost/db")
+		t.Setenv("ORPHAN_DB_SWEEP_ENABLED", "1") // not exactly "true"
+		t.Setenv("ORPHAN_DB_SWEEP_DESTRUCTIVE_ENABLED", "yes")
+		cfg := Load()
+		if cfg.OrphanDBSweepEnabled {
+			t.Error("OrphanDBSweepEnabled should be false for non-'true' value")
+		}
+		if cfg.OrphanDBSweepDestructiveEnabled {
+			t.Error("OrphanDBSweepDestructiveEnabled should be false for non-'true' value")
+		}
+	})
 }
 
 // TestLoad_DeployScaleToZeroIdleMinutes exercises the env-parse branch for
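
The "non-true is off" case pins a deliberately strict parse: only the literal string "true" enables a flag. The sketch below contrasts that comparison with the standard library's `strconv.ParseBool`, which accepts several other spellings. `strictFlag` and `lenientFlag` are hypothetical helper names for illustration; the diff itself inlines the `== "true"` comparison and does not use `ParseBool`.

```go
package main

import (
	"fmt"
	"strconv"
)

// strictFlag mirrors the comparison used in Load(): only the literal
// string "true" enables the flag.
func strictFlag(v string) bool { return v == "true" }

// lenientFlag is shown for contrast only; it is not what the diff does.
// strconv.ParseBool also accepts values such as "1", "t", "T", "TRUE", "True".
func lenientFlag(v string) bool {
	b, err := strconv.ParseBool(v)
	return err == nil && b
}

func main() {
	for _, v := range []string{"true", "TRUE", "1", "yes", ""} {
		fmt.Printf("%q strict=%t lenient=%t\n", v, strictFlag(v), lenientFlag(v))
	}
}
```

With the strict form, a typo or an alternate spelling in the environment leaves the flag off, which is the fail-closed behaviour the tests assert.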
