Commit 7d2ff0d
feat(jobs): orphan_sweep PASS 3 enhanced reasons + PASS 6 stuck-build
PROBLEM. The prod orphan instant-deploy-04dc0b31 (2026-05-14) sat in
ImagePullBackOff for 9h+. PASS 3 left it alone because the deployments
row was status='deploying', not 'deleted', and its team was active.
deploy_status_reconcile flips a row to 'failed' only when k8s reports
DeploymentReplicaFailure=True — a stuck pod never trips that.
PASS 3 ENHANCEMENTS
- Per-namespace reason labels (team_tombstoned, no_db_row,
failed_old_deployment) drive new Prometheus metric
instant_orphan_sweep_reaped_total{reason}.
- no_db_row now applies a 1h grace via GetNamespaceAge to avoid racing
with in-flight provisions.
- failed_old_deployment reaps instant-deploy-* whose row is status='failed'
AND whose created_at is more than 6h old (autopsy stays in deployment_events).
- Proposed-reap structured log lands BEFORE every delete with full
evidence (constraint #3: operator must see what is about to happen).
PASS 6 (NEW) — STUCK-BUILD DETECTION
Catches deployments stuck in 'building'/'deploying' for >30min whose
only pod is in ImagePullBackOff/ErrImagePull/CrashLoopBackOff.
Flips the row to 'failed' + sets error_message. The autopsy is captured
by the next deploy_status_reconcile tick (one source of truth).
SAFETY
- Whitelist on the three prefixes (instant-deploy-*, instant-customer-*,
instant-stack-*) enforced by the ListNamespacesWithPrefix seam.
- All grace thresholds (1h/6h/30min) conservative on purpose.
- Per-namespace fail-open posture matches the existing passes.
- Pure classifier extracted (classifyDeployOrphan) for direct table-driven testing.
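The prefix whitelist can be illustrated with a small guard. This is a sketch under assumptions: hasAllowedPrefix is a hypothetical helper, and how ListNamespacesWithPrefix enforces the rule is not shown on this page. Requiring a non-empty suffix after the prefix is an extra precaution added here, not something the message states.

```go
package main

import (
	"fmt"
	"strings"
)

// The three namespace prefixes named in the commit message.
var allowedPrefixes = []string{
	"instant-deploy-",
	"instant-customer-",
	"instant-stack-",
}

// hasAllowedPrefix reports whether ns starts with a whitelisted prefix
// and has at least one character after it.
func hasAllowedPrefix(ns string) bool {
	for _, p := range allowedPrefixes {
		if strings.HasPrefix(ns, p) && len(ns) > len(p) {
			return true
		}
	}
	return false
}

func main() {
	for _, ns := range []string{"instant-deploy-04dc0b31", "kube-system", "instant-deploy-"} {
		fmt.Printf("%s allowed=%v\n", ns, hasAllowedPrefix(ns))
	}
}
```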
TESTS
- TestOrphanSweep_NamespaceWithoutDBRow_ReapsAfterGrace
- TestOrphanSweep_FailedDeployment_ReapedAfter6h
- TestOrphanSweep_Pass6_StuckBuild_FlipsToFailed
- TestOrphanSweep_Pass6_RunningPod_DoesNotFlip
- TestOrphanSweep_PrefixWhitelist_RefusesUnknownNamespace
- TestOrphanSweep_ClassifyDeployOrphan_TableDriven (10 row shapes)
- TestOrphanSweep_StuckBuildWaitingReasons_Registry (registry-iterating
per CLAUDE.md rule 18)
METRICS
- instant_orphan_sweep_reaped_total{reason} — PASS 3/4/5/6 reaps
- instant_orphan_sweep_reap_failed_total{reason} — k8s/DB failures
Companion infra PR adds the Prom alerts (no_db_row > 0 over 1h → P0).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent b3f093c commit 7d2ff0d
5 files changed
Lines changed: 1052 additions & 53 deletions
File tree
- internal
- jobs
- metrics