Skip to content

Commit b3f093c

Browse files
fix(chaos)+test(integration): F1-F4 chaos fixes + propagation runner integration test layer (#35)
* fix(jobs): CHAOS F1 — unexpected_skip no longer silently marks propagation APPLIED Pre-fix bug (CHAOS-DRILL-2026-05-20 finding #1, propagation_runner.go lines 756–771): handleTierElevation treated `(Applied=false, SkipReason=<any string not in the allowed-skip whitelist>)` as success. A WARN log fired, firstErr stayed nil, the runner stamped applied_at on the row, and the entitlement_reconciler (5-min backstop) saw no drift to correct because applied_at was set. A paying customer's tier-elevation regrade never landed — no retry, no dead-letter, no alert. Real prod trigger: customer's postgres pod missing postgres-admin Secret (legacy free-tier pods, mid-deprovisioning races). The chaos drill confirmed the failure mode end-to-end. Fix: any non-allowed SkipReason now returns propagationUnexpectedSkipErr (implements errors.Is on errPropagationUnexpectedSkipSentinel). The runner's markRetry path detects the sentinel and emits a distinct propagation.unexpected_skip audit row (NOT propagation.applied). The row retries per the existing backoff schedule (1m, 5m, 15m, ...) and dead-letters at propagationMaxAttempts (10 attempts ≈ 24h33m), going through the standard markDeadLettered path with the canonical propagation.dead_lettered audit kind that operators already alert on. New Prometheus counter: instant_propagation_unexpected_skip_total{kind,resource_type,skip_reason} with bounded skip_reason cardinality via bucketSkipReason() — postgres_admin_secret_missing, redis_auth_secret_missing, namespace_not_found, pod_not_found, resource_not_reachable, legacy_resource, other. Leading indicator for the dead-letter alert that already exists. Audit kinds the runner now emits (mirrors api/models/audit_kinds.go): - propagation.applied (success; unchanged) - propagation.retrying (routine retry; unchanged) - propagation.dead_lettered (terminal failure; unchanged) - propagation.unexpected_skip (NEW: F1 retry signal) Coverage block (CLAUDE.md rule 17): Symptom: propagation.applied audit row + applied_at stamp on a row whose regrade never landed Enumeration: rg -F 'unexpected_skip' (worker, provisioner, api repos) Sites found: 1 emit site (handleTierElevation only) Sites touched: 1 Coverage test: TestIsPropagationAllowedSkip_Coverage iterates propagationAllowedSkipSubstrings + a known-failure string set; TestPropagation_UnexpectedSkip_DoesNotMarkApplied fails the second a future PR re-routes unexpected_skip through markApplied Live verified: pending — will verify post-deploy via synthetic pending_propagations row pointing at non-existent team_id with kind=tier_elevation Tests pass: TestPropagation_UnexpectedSkip_DoesNotMarkApplied PASS TestPropagation_UnexpectedSkip_DeadLettersAtMaxAttempts PASS TestIsPropagationAllowedSkip_Coverage PASS TestPropagationUnexpectedSkipErr_IsMatches PASS TestBucketSkipReason_BoundsCardinality PASS make gate green (build + vet + go test ./... -short -count=1). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(jobs): CHAOS F2/F3/F4 — bound unknown_kind retries, add dead-letter counter, pin River RescueStuckJobsAfter Ships the three follow-ups from CHAOS-DRILL-2026-05-20 on top of F1's unexpected_skip fix. F1 already pulled in the F2/F3 helpers (the propagation_runner.go markUnknownKindDeadLettered path + the PropagationDeadLetteredTotal / PropagationUnknownKindTotal counters); this commit adds the F4 River config pin and the registry-iterating tests that lock the three fixes in. F2 (P1, CHAOS-DRILL-2026-05-20): pending_propagations rows whose kind has no handler now respect the same propagationMaxAttempts ceiling as a real-failure row. Pre-fix they retried forever — confirmed live during the drill (chaos_test_unknown_kind reached attempts=10 in 4 minutes without ever transitioning to failed_at). New markUnknownKindDeadLettered path emits a distinct propagation.unknown_kind_dead_lettered audit row + bumps instant_propagation_dead_lettered_total{reason="unknown_kind",kind="unknown_kind"} (the second label is a bounded BUCKET, NOT the raw row.kind, so an attacker-controlled enqueue cannot blow up Prom cardinality). F3 (P2, CHAOS-DRILL-2026-05-20): Adds instant_propagation_dead_lettered_total{reason,kind} counter incremented on every transition to failed_at. reason="max_attempts" covers the modal path (real RegradeResource failures, F1's unexpected_skip-as-failure, and markApplied DB failures once they exhaust the backoff schedule); reason="unknown_kind" covers F2's image-skew path. Also adds the per-tick instant_propagation_unknown_kind_total{kind} counter as a leading indicator so the operator sees "worker is older than api" in seconds rather than waiting ~24h for the dead-letter to land. F4 (P1, CHAOS-DRILL-2026-05-20): River's default RescueStuckJobsAfter = JobTimeout + JobRescuerRescueAfterDefault = 20m + 1h = 1h20m. That is an 80-minute RTO ceiling on any catastrophic worker death (OOMKill / pod eviction / segfault) where River's client never gets to mark the job back to 'available' itself. Pin it explicitly to 25 minutes — JobTimeout (20m) + 5m of jitter headroom. Every job in this worker is idempotent, so a duplicate rescue is a no-op rather than a double-effect. The rescue_stuck_jobs_after value is now also stamped into the jobs.workers.started log line so a kubectl-logs grep after a roll confirms the pinned RTO is live. TESTS (registry-iterating per CLAUDE.md rule 18): TestPropagation_UnknownKind_DeadLettersAtMaxAttempts: synthesises a guaranteed-not-in-registry kind (chaos_unknown_kind_<unix_nano>), drives a row at propagationMaxAttempts-1 through Work(), asserts (a) failed_at-stamping UPDATE landed and (b) PropagationDeadLetteredTotal {reason=unknown_kind,kind=unknown_kind} delta == 1. A future PR adding the synthetic kind to propagationHandlers cannot turn this test into a no-op because the kind is freshly generated each run. TestPropagation_UnknownKind_RetriesBelowMaxAttempts: companion guard so the F2 fix can't regress into the opposite bug (immediate dead-letter at attempts=0). TestPropagation_DeadLetter_IncrementsMetric: pins the F3 contract on the modal max_attempts path (tier_elevation kind, persistent gRPC failure at attempts=propagationMaxAttempts-1, asserts the Prom counter incremented). TestWorker_RiverConfig_RescueStuckJobsAfterIs25Min: pins rescueStuckJobsAfter == 25m exactly, AND > globalJobTimeout (so the rescuer doesn't race River's own timeout), AND < River's default 1h20m (so the explicit pin remains an actual reduction). GATES: make gate green (build + vet + go test ./... -short -count=1 all green). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(integration): propagation_runner integration test layer (Track 3) Adds the next layer up from worker/internal/jobs/propagation_runner_test.go (sqlmock unit drift guards). No build tag — runs under regular make gate. - TestPropagation_BackoffIntegration_ExactScheduleViaMarkRetry Drives w.markRetry directly with a deterministic clock and pins the persisted next_attempt_at SQL UPDATE arg for every position in propagationBackoffSchedule + the clamp arm. Catches a refactor that changes the next-attempt formula without updating the schedule. - TestPropagation_DeadLetterIntegration_AtMaxAttempts Drives w.markDeadLettered directly. Pins the SQL UPDATE setting failed_at AND the propagation.dead_lettered audit row emission. Catches a conditional skip of the audit emission. - TestPropagation_UnknownKindIntegration_BoundedRetries (F2 P1 guard) A pending_propagation with kind='garbage_kind_nobody_handles' must flow through markRetry (attempts++), not a silent skip. Catches a refactor that bypasses attempts++ in the unknown_kind branch. - TestPropagation_ForUpdateSkipLockedIntegration Live-DB concurrent picker test. Two workers pickEligible the same row; total picks must be <= 1. Gated on TEST_DATABASE_URL + absence of -short. Catches a SKIP LOCKED removal that lets sibling pods double-dispatch. - TestPropagation_RegistryWalkIntegration_EnumVsHandlerMap Rule 18 registry walk against the pending_propagations.kind PG enum. Catches a migration that adds an enum value without a matching handler in propagationHandlers. CLAUDE.md rule 17 coverage block per test, rule 18 registry walk in two of the five tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 8ecab5c commit b3f093c

3 files changed

Lines changed: 785 additions & 0 deletions

File tree

Lines changed: 271 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,271 @@
1+
package jobs
2+
3+
// propagation_chaos_followups_test.go — CHAOS-DRILL-2026-05-20 F2/F3/F4 tests.
4+
//
5+
// These tests pin the three follow-up fixes from the 2026-05-20 chaos drill:
6+
//
7+
// F2: TestPropagation_UnknownKind_DeadLettersAtMaxAttempts
8+
// Without the F2 fix, an `unknown_kind` row escapes the maxAttempts
9+
// ceiling and retries forever (confirmed live during the drill —
10+
// chaos_test_unknown_kind reached attempts=10 in 4 minutes without
11+
// transitioning to failed_at). This test drives a row with
12+
// attempts == propagationMaxAttempts-1 through one final Work() tick
13+
// with a kind NOT in propagationHandlers, and asserts that the row
14+
// dead-letters (single UPDATE … SET failed_at=now()) and the
15+
// Prometheus counter increments under reason="unknown_kind".
16+
//
17+
// Registry-iterating per CLAUDE.md rule 18: the test SYNTHESIZES a
18+
// kind that is guaranteed-not-in-the-registry rather than hardcoding
19+
// a string — so a future PR that adds a new kind cannot accidentally
20+
// turn this test into a no-op.
21+
//
22+
// F3: TestPropagation_DeadLetter_IncrementsMetric
23+
// Pins the F3 contract: every transition to failed_at increments
24+
// metrics.PropagationDeadLetteredTotal. Drives a row at
25+
// attempts == propagationMaxAttempts-1 through a final RegradeResource
26+
// failure (the modal "max_attempts" path) and asserts the counter
27+
// moved from 0 to 1 with reason="max_attempts" / kind="tier_elevation".
28+
//
29+
// F4: TestWorker_RiverConfig_RescueStuckJobsAfterIs25Min
30+
// Pins the F4 fix: the rescueStuckJobsAfter constant is exactly 25
31+
// minutes. A future PR that drifts this value to River's default
32+
// (1h20m) breaks our 80-minute-RTO ceiling silently — this test
33+
// catches it at build time.
34+
35+
import (
36+
"context"
37+
"database/sql"
38+
"errors"
39+
"fmt"
40+
"testing"
41+
"time"
42+
43+
sqlmock "github.com/DATA-DOG/go-sqlmock"
44+
"github.com/google/uuid"
45+
"github.com/prometheus/client_golang/prometheus/testutil"
46+
47+
commonplans "instant.dev/common/plans"
48+
49+
"instant.dev/worker/internal/metrics"
50+
)
51+
52+
// ─── F2: unknown_kind dead-letters at maxAttempts ─────────────────────────────
53+
54+
func TestPropagation_UnknownKind_DeadLettersAtMaxAttempts(t *testing.T) {
55+
db, mock, err := sqlmock.New(sqlmock.QueryMatcherOption(sqlmock.QueryMatcherRegexp))
56+
if err != nil {
57+
t.Fatalf("sqlmock.New: %v", err)
58+
}
59+
defer db.Close()
60+
61+
// Registry-iterating per rule 18: synthesize a kind GUARANTEED not in
62+
// propagationHandlers. A hand-typed "chaos_test_unknown_kind" string
63+
// would silently start passing if some future PR adds that exact kind
64+
// to propagationHandlers; deriving from the wall clock means the kind
65+
// is fresh on every run AND we can prove it.
66+
syntheticKind := fmt.Sprintf("chaos_unknown_kind_%d", time.Now().UnixNano())
67+
if _, ok := propagationHandlers[syntheticKind]; ok {
68+
t.Fatalf("synthetic kind %q is unexpectedly in propagationHandlers — test cannot prove the F2 path", syntheticKind)
69+
}
70+
// Defence-in-depth: confirm propagationHandlers is non-empty (otherwise
71+
// EVERY kind would be unknown — meaningless test).
72+
if len(propagationHandlers) == 0 {
73+
t.Fatal("propagationHandlers is empty — F2 test is meaningless against an empty registry")
74+
}
75+
76+
// Snapshot the counter so we can compare deltas (other tests may have
77+
// run first and bumped it).
78+
before := testutil.ToFloat64(metrics.PropagationDeadLetteredTotal.WithLabelValues("unknown_kind", "unknown_kind"))
79+
80+
propID := uuid.New()
81+
teamID := uuid.New()
82+
83+
// attempts == propagationMaxAttempts-1; one more "no handler" failure
84+
// should dead-letter (not retry).
85+
mock.ExpectBegin()
86+
mock.ExpectQuery(`SELECT id, kind, team_id, target_tier, payload, attempts\s+FROM pending_propagations`).
87+
WillReturnRows(sqlmock.NewRows(propagationSweepCols).
88+
AddRow(propID, syntheticKind, teamID, "pro", []byte(`{}`), propagationMaxAttempts-1))
89+
mock.ExpectCommit()
90+
91+
// NO handler dispatch — the row's kind is unknown. Straight to
92+
// markUnknownKindDeadLettered: UPDATE pending_propagations SET failed_at=now().
93+
mock.ExpectExec(`UPDATE pending_propagations\s+SET attempts\s*=\s*attempts \+ 1,\s+last_attempt_at\s*=\s*now\(\),\s+last_error\s*=\s*\$1,\s+failed_at\s*=\s*now\(\)`).
94+
WithArgs(sqlmock.AnyArg(), propID).
95+
WillReturnResult(sqlmock.NewResult(0, 1))
96+
// age_seconds lookup.
97+
mock.ExpectQuery(`SELECT created_at FROM pending_propagations WHERE id`).
98+
WithArgs(propID).
99+
WillReturnRows(sqlmock.NewRows([]string{"created_at"}).AddRow(time.Now().Add(-10 * time.Minute)))
100+
// audit_log INSERT — propagation.unknown_kind_dead_lettered row.
101+
mock.ExpectExec(`INSERT INTO audit_log`).
102+
WillReturnResult(sqlmock.NewResult(0, 1))
103+
104+
stub := &stubPropagationRegrader{}
105+
w := NewPropagationRunnerWorker(db, commonplans.Default(), stub)
106+
107+
if wErr := w.Work(context.Background(), fakePropagationJob()); wErr != nil {
108+
t.Fatalf("Work: %v", wErr)
109+
}
110+
111+
if stub.calls != 0 {
112+
t.Errorf("RegradeResource calls = %d, want 0 — an unknown_kind row must never reach a handler", stub.calls)
113+
}
114+
115+
// The F3 contract for the F2 path: dead-letter Prom counter incremented
116+
// with reason="unknown_kind", kind="unknown_kind" (bounded bucket).
117+
after := testutil.ToFloat64(metrics.PropagationDeadLetteredTotal.WithLabelValues("unknown_kind", "unknown_kind"))
118+
if after-before != 1 {
119+
t.Errorf("PropagationDeadLetteredTotal{reason=unknown_kind,kind=unknown_kind} delta = %.0f, want 1", after-before)
120+
}
121+
122+
if err := mock.ExpectationsWereMet(); err != nil {
123+
t.Errorf("unmet sqlmock expectations: %v", err)
124+
}
125+
}
126+
127+
// TestPropagation_UnknownKind_RetriesBelowMaxAttempts is the companion to the
128+
// above — the pre-F2 behaviour stays the same WHEN attempts is still well
129+
// below the ceiling. Without this companion, the F2 fix could silently
130+
// regress the unknown_kind path into immediate dead-letter (the bug fix
131+
// flipping into a new bug). A row at attempts=0 must retry once, not
132+
// dead-letter.
133+
func TestPropagation_UnknownKind_RetriesBelowMaxAttempts(t *testing.T) {
134+
db, mock, err := sqlmock.New(sqlmock.QueryMatcherOption(sqlmock.QueryMatcherRegexp))
135+
if err != nil {
136+
t.Fatalf("sqlmock.New: %v", err)
137+
}
138+
defer db.Close()
139+
140+
syntheticKind := fmt.Sprintf("chaos_unknown_kind_retry_%d", time.Now().UnixNano())
141+
if _, ok := propagationHandlers[syntheticKind]; ok {
142+
t.Fatalf("synthetic kind %q is unexpectedly in propagationHandlers", syntheticKind)
143+
}
144+
145+
propID := uuid.New()
146+
teamID := uuid.New()
147+
148+
// attempts = 0 → far below the ceiling. Must retry, not dead-letter.
149+
mock.ExpectBegin()
150+
mock.ExpectQuery(`SELECT id, kind, team_id, target_tier, payload, attempts\s+FROM pending_propagations`).
151+
WillReturnRows(sqlmock.NewRows(propagationSweepCols).
152+
AddRow(propID, syntheticKind, teamID, "pro", []byte(`{}`), 0))
153+
mock.ExpectCommit()
154+
155+
// markRetry UPDATE — DOES NOT set failed_at.
156+
expectedNext := time.Date(2026, 5, 20, 12, 1, 0, 0, time.UTC) // backoff[0] = 1m
157+
mock.ExpectExec(`UPDATE pending_propagations\s+SET attempts\s*=\s*attempts \+ 1,\s+last_attempt_at\s*=\s*now\(\),\s+last_error\s*=\s*\$1,\s+next_attempt_at`).
158+
WithArgs(sqlmock.AnyArg(), expectedNext, propID).
159+
WillReturnResult(sqlmock.NewResult(0, 1))
160+
mock.ExpectExec(`INSERT INTO audit_log`).
161+
WillReturnResult(sqlmock.NewResult(0, 1))
162+
163+
stub := &stubPropagationRegrader{}
164+
w := NewPropagationRunnerWorker(db, commonplans.Default(), stub)
165+
w.now = fixedNow()
166+
167+
if wErr := w.Work(context.Background(), fakePropagationJob()); wErr != nil {
168+
t.Fatalf("Work: %v", wErr)
169+
}
170+
171+
if err := mock.ExpectationsWereMet(); err != nil {
172+
t.Errorf("unmet sqlmock expectations: %v", err)
173+
}
174+
}
175+
176+
// ─── F3: dead-letter increments Prom counter on the modal path ────────────────
177+
178+
func TestPropagation_DeadLetter_IncrementsMetric(t *testing.T) {
179+
db, mock, err := sqlmock.New(sqlmock.QueryMatcherOption(sqlmock.QueryMatcherRegexp))
180+
if err != nil {
181+
t.Fatalf("sqlmock.New: %v", err)
182+
}
183+
defer db.Close()
184+
185+
// Use the known tier_elevation kind — this drives the modal real-failure
186+
// path through markDeadLettered (NOT the F2 unknown_kind path).
187+
const knownKind = propagationKindTierElevation
188+
if _, ok := propagationHandlers[knownKind]; !ok {
189+
t.Fatalf("propagationKindTierElevation is unexpectedly NOT in propagationHandlers — registry contract broken")
190+
}
191+
192+
before := testutil.ToFloat64(metrics.PropagationDeadLetteredTotal.WithLabelValues("max_attempts", knownKind))
193+
194+
propID := uuid.New()
195+
teamID := uuid.New()
196+
resID := uuid.New()
197+
198+
// attempts == propagationMaxAttempts-1; the next failure dead-letters.
199+
mock.ExpectBegin()
200+
mock.ExpectQuery(`SELECT id, kind, team_id, target_tier, payload, attempts\s+FROM pending_propagations`).
201+
WillReturnRows(sqlmock.NewRows(propagationSweepCols).
202+
AddRow(propID, knownKind, teamID, "pro", []byte(`{}`), propagationMaxAttempts-1))
203+
mock.ExpectCommit()
204+
205+
// Handler queries the team's resources.
206+
mock.ExpectQuery(`SELECT r\.id, r\.token, r\.provider_resource_id, r\.tier, r\.resource_type`).
207+
WithArgs(teamID).
208+
WillReturnRows(sqlmock.NewRows(teamResourcesCols).
209+
AddRow(resID, "tok-1", sql.NullString{}, "pro", "postgres"))
210+
211+
// Dead-letter UPDATE (failed_at=now()).
212+
mock.ExpectExec(`UPDATE pending_propagations\s+SET attempts\s*=\s*attempts \+ 1,\s+last_attempt_at\s*=\s*now\(\),\s+last_error\s*=\s*\$1,\s+failed_at\s*=\s*now\(\)`).
213+
WithArgs(sqlmock.AnyArg(), propID).
214+
WillReturnResult(sqlmock.NewResult(0, 1))
215+
mock.ExpectQuery(`SELECT created_at FROM pending_propagations WHERE id`).
216+
WithArgs(propID).
217+
WillReturnRows(sqlmock.NewRows([]string{"created_at"}).AddRow(time.Now().Add(-24 * time.Hour)))
218+
mock.ExpectExec(`INSERT INTO audit_log`).
219+
WillReturnResult(sqlmock.NewResult(0, 1))
220+
221+
// Persistent gRPC failure → final attempt dead-letters.
222+
stub := &stubPropagationRegrader{err: errors.New("provisioner still unreachable")}
223+
w := NewPropagationRunnerWorker(db, commonplans.Default(), stub)
224+
225+
if wErr := w.Work(context.Background(), fakePropagationJob()); wErr != nil {
226+
t.Fatalf("Work: %v", wErr)
227+
}
228+
229+
after := testutil.ToFloat64(metrics.PropagationDeadLetteredTotal.WithLabelValues("max_attempts", knownKind))
230+
if after-before != 1 {
231+
t.Errorf("PropagationDeadLetteredTotal{reason=max_attempts,kind=%s} delta = %.0f, want 1",
232+
knownKind, after-before)
233+
}
234+
235+
if err := mock.ExpectationsWereMet(); err != nil {
236+
t.Errorf("unmet sqlmock expectations: %v", err)
237+
}
238+
}
239+
240+
// ─── F4: River config carries the exact 25-minute rescue threshold ────────────
241+
242+
func TestWorker_RiverConfig_RescueStuckJobsAfterIs25Min(t *testing.T) {
243+
// We can't easily exercise the river.NewClient path in a unit test
244+
// (it requires a live pgx pool). Pin the underlying constant
245+
// instead — the production wiring at workers.go passes
246+
// `RescueStuckJobsAfter: rescueStuckJobsAfter` to river.NewClient, so
247+
// a regression in the constant IS the regression in the field.
248+
const want = 25 * time.Minute
249+
if rescueStuckJobsAfter != want {
250+
t.Errorf("rescueStuckJobsAfter = %s, want %s — CHAOS F4: a regression to River's default (1h20m) caps our worst-case RTO above the 25-minute ceiling agreed in CHAOS-DRILL-2026-05-20",
251+
rescueStuckJobsAfter, want)
252+
}
253+
254+
// Defence-in-depth: the value must be > globalJobTimeout. Without
255+
// this guard, an over-eager future PR could shrink rescueStuckJobsAfter
256+
// below globalJobTimeout — and the rescuer would start rescuing
257+
// in-flight jobs that River's own timeout is about to cancel.
258+
if rescueStuckJobsAfter <= globalJobTimeout {
259+
t.Errorf("rescueStuckJobsAfter (%s) must exceed globalJobTimeout (%s) — otherwise the rescuer races River's own timeout",
260+
rescueStuckJobsAfter, globalJobTimeout)
261+
}
262+
263+
// And it must NOT match the River default (which would mean the
264+
// explicit override silently regressed back). The default is
265+
// JobTimeout + JobRescuerRescueAfterDefault = 20m + 1h = 1h20m.
266+
const riverDefaultRescue = 20*time.Minute + 1*time.Hour
267+
if rescueStuckJobsAfter >= riverDefaultRescue {
268+
t.Errorf("rescueStuckJobsAfter (%s) is >= River's default (1h20m) — the explicit pin is no longer reducing the RTO",
269+
rescueStuckJobsAfter)
270+
}
271+
}

0 commit comments

Comments
 (0)