Commit b3f093c
fix(chaos)+test(integration): F1-F4 chaos fixes + propagation runner integration test layer (#35)
* fix(jobs): CHAOS F1 — unexpected_skip no longer silently marks propagation APPLIED
Pre-fix bug (CHAOS-DRILL-2026-05-20 finding #1, propagation_runner.go
lines 756–771):
handleTierElevation treated `(Applied=false, SkipReason=<any string
not in the allowed-skip whitelist>)` as success. A WARN log fired,
firstErr stayed nil, the runner stamped applied_at on the row, and
the entitlement_reconciler (5-min backstop) saw no drift to correct
because applied_at was set. A paying customer's tier-elevation
regrade never landed — no retry, no dead-letter, no alert.
Real prod trigger: customer's postgres pod missing postgres-admin
Secret (legacy free-tier pods, mid-deprovisioning races). The chaos
drill confirmed the failure mode end-to-end.
Fix: any non-allowed SkipReason now returns propagationUnexpectedSkipErr
(implements errors.Is on errPropagationUnexpectedSkipSentinel). The
runner's markRetry path detects the sentinel and emits a distinct
propagation.unexpected_skip audit row (NOT propagation.applied). The
row retries per the existing backoff schedule (1m, 5m, 15m, ...) and
dead-letters at propagationMaxAttempts (10 attempts ≈ 24h33m), going
through the standard markDeadLettered path with the canonical
propagation.dead_lettered audit kind that operators already alert on.
New Prometheus counter:
instant_propagation_unexpected_skip_total{kind,resource_type,skip_reason}
with bounded skip_reason cardinality via bucketSkipReason() —
postgres_admin_secret_missing, redis_auth_secret_missing,
namespace_not_found, pod_not_found, resource_not_reachable,
legacy_resource, other. Leading indicator for the dead-letter alert
that already exists.
Audit kinds the runner now emits (mirrors api/models/audit_kinds.go):
- propagation.applied (success; unchanged)
- propagation.retrying (routine retry; unchanged)
- propagation.dead_lettered (terminal failure; unchanged)
- propagation.unexpected_skip (NEW: F1 retry signal)
Coverage block (CLAUDE.md rule 17):
Symptom: propagation.applied audit row + applied_at stamp on a
row whose regrade never landed
Enumeration: rg -F 'unexpected_skip'
(worker, provisioner, api repos)
Sites found: 1 emit site (handleTierElevation only)
Sites touched: 1
Coverage test: TestIsPropagationAllowedSkip_Coverage iterates
propagationAllowedSkipSubstrings + a known-failure
string set; TestPropagation_UnexpectedSkip_DoesNotMarkApplied
fails the second a future PR re-routes unexpected_skip
through markApplied
Live verified: pending — will verify post-deploy via synthetic
pending_propagations row pointing at non-existent
team_id with kind=tier_elevation
Tests pass:
TestPropagation_UnexpectedSkip_DoesNotMarkApplied PASS
TestPropagation_UnexpectedSkip_DeadLettersAtMaxAttempts PASS
TestIsPropagationAllowedSkip_Coverage PASS
TestPropagationUnexpectedSkipErr_IsMatches PASS
TestBucketSkipReason_BoundsCardinality PASS
make gate green (build + vet + go test ./... -short -count=1).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(jobs): CHAOS F2/F3/F4 — bound unknown_kind retries, add dead-letter counter, pin River RescueStuckJobsAfter
Ships the three follow-ups from CHAOS-DRILL-2026-05-20 on top of F1's
unexpected_skip fix. F1 already pulled in the F2/F3 helpers (the
propagation_runner.go markUnknownKindDeadLettered path + the
PropagationDeadLetteredTotal / PropagationUnknownKindTotal counters);
this commit adds the F4 River config pin and the registry-iterating tests
that lock the three fixes in.
F2 (P1, CHAOS-DRILL-2026-05-20):
pending_propagations rows whose kind has no handler now respect the same
propagationMaxAttempts ceiling as a real-failure row. Pre-fix they retried
forever — confirmed live during the drill (chaos_test_unknown_kind reached
attempts=10 in 4 minutes without ever transitioning to failed_at). New
markUnknownKindDeadLettered path emits a distinct
propagation.unknown_kind_dead_lettered audit row + bumps
instant_propagation_dead_lettered_total{reason="unknown_kind",kind="unknown_kind"}
(the second label is a bounded BUCKET, NOT the raw row.kind, so an
attacker-controlled enqueue cannot blow up Prom cardinality).
F3 (P2, CHAOS-DRILL-2026-05-20):
Adds instant_propagation_dead_lettered_total{reason,kind} counter incremented
on every transition to failed_at. reason="max_attempts" covers the modal
path (real RegradeResource failures, F1's unexpected_skip-as-failure, and
markApplied DB failures once they exhaust the backoff schedule);
reason="unknown_kind" covers F2's image-skew path. Also adds the per-tick
instant_propagation_unknown_kind_total{kind} counter as a leading indicator
so the operator sees "worker is older than api" in seconds rather than
waiting ~24h for the dead-letter to land.
F4 (P1, CHAOS-DRILL-2026-05-20):
River's default RescueStuckJobsAfter = JobTimeout + JobRescuerRescueAfterDefault
= 20m + 1h = 1h20m. That is an 80-minute RTO ceiling on any catastrophic
worker death (OOMKill / pod eviction / segfault) where River's client never
gets to mark the job back to 'available' itself. Pin it explicitly to
25 minutes — JobTimeout (20m) + 5m of jitter headroom. Every job in this
worker is idempotent, so a duplicate rescue is a no-op rather than a
double-effect. The rescue_stuck_jobs_after value is now also stamped into
the jobs.workers.started log line so a kubectl-logs grep after a roll
confirms the pinned RTO is live.
TESTS (registry-iterating per CLAUDE.md rule 18):
TestPropagation_UnknownKind_DeadLettersAtMaxAttempts:
synthesises a guaranteed-not-in-registry kind (chaos_unknown_kind_<unix_nano>),
drives a row at propagationMaxAttempts-1 through Work(), asserts (a)
failed_at-stamping UPDATE landed and (b) PropagationDeadLetteredTotal
{reason=unknown_kind,kind=unknown_kind} delta == 1. A future PR adding the
synthetic kind to propagationHandlers cannot turn this test into a no-op
because the kind is freshly generated each run.
TestPropagation_UnknownKind_RetriesBelowMaxAttempts:
companion guard so the F2 fix can't regress into the opposite bug
(immediate dead-letter at attempts=0).
TestPropagation_DeadLetter_IncrementsMetric:
pins the F3 contract on the modal max_attempts path (tier_elevation kind,
persistent gRPC failure at attempts=propagationMaxAttempts-1, asserts the
Prom counter incremented).
TestWorker_RiverConfig_RescueStuckJobsAfterIs25Min:
pins rescueStuckJobsAfter == 25m exactly, AND > globalJobTimeout (so the
rescuer doesn't race River's own timeout), AND < River's default 1h20m
(so the explicit pin remains an actual reduction).
GATES: make gate green (build + vet + go test ./... -short -count=1 all green).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(integration): propagation_runner integration test layer (Track 3)
Adds the next layer up from worker/internal/jobs/propagation_runner_test.go
(sqlmock unit drift guards). No build tag — runs under regular make gate.
- TestPropagation_BackoffIntegration_ExactScheduleViaMarkRetry
Drives w.markRetry directly with a deterministic clock and pins
the persisted next_attempt_at SQL UPDATE arg for every position
in propagationBackoffSchedule + the clamp arm. Catches a refactor
that changes the next-attempt formula without updating the
schedule.
- TestPropagation_DeadLetterIntegration_AtMaxAttempts
Drives w.markDeadLettered directly. Pins the SQL UPDATE setting
failed_at AND the propagation.dead_lettered audit row emission.
Catches a conditional skip of the audit emission.
- TestPropagation_UnknownKindIntegration_BoundedRetries (F2 P1 guard)
A pending_propagation with kind='garbage_kind_nobody_handles'
must flow through markRetry (attempts++), not a silent skip.
Catches a refactor that bypasses attempts++ in the unknown_kind
branch.
- TestPropagation_ForUpdateSkipLockedIntegration
Live-DB concurrent picker test. Two workers pickEligible the
same row; total picks must be <= 1. Gated on TEST_DATABASE_URL
+ absence of -short. Catches a SKIP LOCKED removal that lets
sibling pods double-dispatch.
- TestPropagation_RegistryWalkIntegration_EnumVsHandlerMap
Rule 18 registry walk against the pending_propagations.kind PG
enum. Catches a migration that adds an enum value without a
matching handler in propagationHandlers.
CLAUDE.md rule 17 coverage block per test, rule 18 registry walk
in two of the five tests.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>1 parent 8ecab5c commit b3f093c
3 files changed
Lines changed: 785 additions & 0 deletions
File tree
- internal/jobs
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
| 1 | + | |
| 2 | + | |
| 3 | + | |
| 4 | + | |
| 5 | + | |
| 6 | + | |
| 7 | + | |
| 8 | + | |
| 9 | + | |
| 10 | + | |
| 11 | + | |
| 12 | + | |
| 13 | + | |
| 14 | + | |
| 15 | + | |
| 16 | + | |
| 17 | + | |
| 18 | + | |
| 19 | + | |
| 20 | + | |
| 21 | + | |
| 22 | + | |
| 23 | + | |
| 24 | + | |
| 25 | + | |
| 26 | + | |
| 27 | + | |
| 28 | + | |
| 29 | + | |
| 30 | + | |
| 31 | + | |
| 32 | + | |
| 33 | + | |
| 34 | + | |
| 35 | + | |
| 36 | + | |
| 37 | + | |
| 38 | + | |
| 39 | + | |
| 40 | + | |
| 41 | + | |
| 42 | + | |
| 43 | + | |
| 44 | + | |
| 45 | + | |
| 46 | + | |
| 47 | + | |
| 48 | + | |
| 49 | + | |
| 50 | + | |
| 51 | + | |
| 52 | + | |
| 53 | + | |
| 54 | + | |
| 55 | + | |
| 56 | + | |
| 57 | + | |
| 58 | + | |
| 59 | + | |
| 60 | + | |
| 61 | + | |
| 62 | + | |
| 63 | + | |
| 64 | + | |
| 65 | + | |
| 66 | + | |
| 67 | + | |
| 68 | + | |
| 69 | + | |
| 70 | + | |
| 71 | + | |
| 72 | + | |
| 73 | + | |
| 74 | + | |
| 75 | + | |
| 76 | + | |
| 77 | + | |
| 78 | + | |
| 79 | + | |
| 80 | + | |
| 81 | + | |
| 82 | + | |
| 83 | + | |
| 84 | + | |
| 85 | + | |
| 86 | + | |
| 87 | + | |
| 88 | + | |
| 89 | + | |
| 90 | + | |
| 91 | + | |
| 92 | + | |
| 93 | + | |
| 94 | + | |
| 95 | + | |
| 96 | + | |
| 97 | + | |
| 98 | + | |
| 99 | + | |
| 100 | + | |
| 101 | + | |
| 102 | + | |
| 103 | + | |
| 104 | + | |
| 105 | + | |
| 106 | + | |
| 107 | + | |
| 108 | + | |
| 109 | + | |
| 110 | + | |
| 111 | + | |
| 112 | + | |
| 113 | + | |
| 114 | + | |
| 115 | + | |
| 116 | + | |
| 117 | + | |
| 118 | + | |
| 119 | + | |
| 120 | + | |
| 121 | + | |
| 122 | + | |
| 123 | + | |
| 124 | + | |
| 125 | + | |
| 126 | + | |
| 127 | + | |
| 128 | + | |
| 129 | + | |
| 130 | + | |
| 131 | + | |
| 132 | + | |
| 133 | + | |
| 134 | + | |
| 135 | + | |
| 136 | + | |
| 137 | + | |
| 138 | + | |
| 139 | + | |
| 140 | + | |
| 141 | + | |
| 142 | + | |
| 143 | + | |
| 144 | + | |
| 145 | + | |
| 146 | + | |
| 147 | + | |
| 148 | + | |
| 149 | + | |
| 150 | + | |
| 151 | + | |
| 152 | + | |
| 153 | + | |
| 154 | + | |
| 155 | + | |
| 156 | + | |
| 157 | + | |
| 158 | + | |
| 159 | + | |
| 160 | + | |
| 161 | + | |
| 162 | + | |
| 163 | + | |
| 164 | + | |
| 165 | + | |
| 166 | + | |
| 167 | + | |
| 168 | + | |
| 169 | + | |
| 170 | + | |
| 171 | + | |
| 172 | + | |
| 173 | + | |
| 174 | + | |
| 175 | + | |
| 176 | + | |
| 177 | + | |
| 178 | + | |
| 179 | + | |
| 180 | + | |
| 181 | + | |
| 182 | + | |
| 183 | + | |
| 184 | + | |
| 185 | + | |
| 186 | + | |
| 187 | + | |
| 188 | + | |
| 189 | + | |
| 190 | + | |
| 191 | + | |
| 192 | + | |
| 193 | + | |
| 194 | + | |
| 195 | + | |
| 196 | + | |
| 197 | + | |
| 198 | + | |
| 199 | + | |
| 200 | + | |
| 201 | + | |
| 202 | + | |
| 203 | + | |
| 204 | + | |
| 205 | + | |
| 206 | + | |
| 207 | + | |
| 208 | + | |
| 209 | + | |
| 210 | + | |
| 211 | + | |
| 212 | + | |
| 213 | + | |
| 214 | + | |
| 215 | + | |
| 216 | + | |
| 217 | + | |
| 218 | + | |
| 219 | + | |
| 220 | + | |
| 221 | + | |
| 222 | + | |
| 223 | + | |
| 224 | + | |
| 225 | + | |
| 226 | + | |
| 227 | + | |
| 228 | + | |
| 229 | + | |
| 230 | + | |
| 231 | + | |
| 232 | + | |
| 233 | + | |
| 234 | + | |
| 235 | + | |
| 236 | + | |
| 237 | + | |
| 238 | + | |
| 239 | + | |
| 240 | + | |
| 241 | + | |
| 242 | + | |
| 243 | + | |
| 244 | + | |
| 245 | + | |
| 246 | + | |
| 247 | + | |
| 248 | + | |
| 249 | + | |
| 250 | + | |
| 251 | + | |
| 252 | + | |
| 253 | + | |
| 254 | + | |
| 255 | + | |
| 256 | + | |
| 257 | + | |
| 258 | + | |
| 259 | + | |
| 260 | + | |
| 261 | + | |
| 262 | + | |
| 263 | + | |
| 264 | + | |
| 265 | + | |
| 266 | + | |
| 267 | + | |
| 268 | + | |
| 269 | + | |
| 270 | + | |
| 271 | + | |
0 commit comments