Commit 8ecab5c

fix(jobs): CHAOS F1 — unexpected_skip no longer silently marks propagation APPLIED (#34)
Pre-fix bug (CHAOS-DRILL-2026-05-20 finding #1, propagation_runner.go lines 756–771): handleTierElevation treated `(Applied=false, SkipReason=<any string not in the allowed-skip whitelist>)` as success. A WARN log fired, firstErr stayed nil, the runner stamped applied_at on the row, and the entitlement_reconciler (5-min backstop) saw no drift to correct because applied_at was set. A paying customer's tier-elevation regrade never landed: no retry, no dead-letter, no alert.

Real prod trigger: customer's postgres pod missing the postgres-admin Secret (legacy free-tier pods, mid-deprovisioning races). The chaos drill confirmed the failure mode end-to-end.

Fix: any non-allowed SkipReason now returns propagationUnexpectedSkipErr (its Is method matches errPropagationUnexpectedSkipSentinel under errors.Is). The runner's markRetry path detects the sentinel and emits a distinct propagation.unexpected_skip audit row (NOT propagation.applied). The row retries per the existing backoff schedule (1m, 5m, 15m, ...) and dead-letters at propagationMaxAttempts (10 attempts ≈ 24h33m), going through the standard markDeadLettered path with the canonical propagation.dead_lettered audit kind that operators already alert on.

New Prometheus counter: instant_propagation_unexpected_skip_total{kind,resource_type,skip_reason}. skip_reason cardinality is bounded via bucketSkipReason() to these values: postgres_admin_secret_missing, redis_auth_secret_missing, namespace_not_found, pod_not_found, resource_not_reachable, legacy_resource, other. The counter is a leading indicator for the dead-letter alert that already exists.
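Illustrative sketch of the pattern described above, not the code in this commit. It assumes a sentinel matched through an Is method, a substring-based allow-list, and substring-based bucketing. Every identifier (errUnexpectedSkip, unexpectedSkipErr, checkResult, retryAuditKind, bucketReason), the allow-list entry, and the match strings are hypothetical stand-ins. Only the audit kinds and bucket names come from the message.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errUnexpectedSkip is a hypothetical stand-in for
// errPropagationUnexpectedSkipSentinel.
var errUnexpectedSkip = errors.New("propagation: unexpected skip")

// unexpectedSkipErr is a hypothetical stand-in for
// propagationUnexpectedSkipErr; it carries the raw skip reason.
type unexpectedSkipErr struct {
	SkipReason string
}

func (e *unexpectedSkipErr) Error() string {
	return "propagation skipped unexpectedly: " + e.SkipReason
}

// Is makes errors.Is(err, errUnexpectedSkip) succeed, even when the
// error has been wrapped with %w further up the call chain.
func (e *unexpectedSkipErr) Is(target error) bool {
	return target == errUnexpectedSkip
}

// allowedSkipSubstrings is an assumed allow-list; the real entries
// are not shown in this commit message.
var allowedSkipSubstrings = []string{"already at target tier"}

// checkResult returns nil for an applied change or an allowed skip,
// and an unexpected-skip error for every other skip reason.
func checkResult(applied bool, skipReason string) error {
	if applied {
		return nil
	}
	for _, s := range allowedSkipSubstrings {
		if strings.Contains(skipReason, s) {
			return nil
		}
	}
	return &unexpectedSkipErr{SkipReason: skipReason}
}

// retryAuditKind picks the audit kind a retry path would emit.
func retryAuditKind(err error) string {
	if errors.Is(err, errUnexpectedSkip) {
		return "propagation.unexpected_skip"
	}
	return "propagation.retrying"
}

// bucketReason maps free-form reasons onto a fixed label set so the
// metric's skip_reason label cannot grow without bound. Only a subset
// of the buckets is shown; the match strings are assumptions.
func bucketReason(reason string) string {
	switch {
	case strings.Contains(reason, "postgres-admin"):
		return "postgres_admin_secret_missing"
	case strings.Contains(reason, "namespace"):
		return "namespace_not_found"
	default:
		return "other"
	}
}

func main() {
	reason := "postgres-admin Secret missing"
	err := fmt.Errorf("tier elevation: %w", checkResult(false, reason))
	fmt.Println(retryAuditKind(err))  // propagation.unexpected_skip
	fmt.Println(bucketReason(reason)) // postgres_admin_secret_missing
}
```

Matching on a sentinel rather than on the error string keeps the retry path working when callers wrap the error. The default bucket means a never-before-seen reason still lands under a known label value.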
Audit kinds the runner now emits (mirrors api/models/audit_kinds.go):
- propagation.applied (success; unchanged)
- propagation.retrying (routine retry; unchanged)
- propagation.dead_lettered (terminal failure; unchanged)
- propagation.unexpected_skip (NEW: F1 retry signal)

Coverage block (CLAUDE.md rule 17):
  Symptom: propagation.applied audit row + applied_at stamp on a row whose regrade never landed
  Enumeration: rg -F 'unexpected_skip' (worker, provisioner, api repos)
  Sites found: 1 emit site (handleTierElevation only)
  Sites touched: 1
  Coverage test: TestIsPropagationAllowedSkip_Coverage iterates propagationAllowedSkipSubstrings + a known-failure string set; TestPropagation_UnexpectedSkip_DoesNotMarkApplied fails the second a future PR re-routes unexpected_skip through markApplied
  Live verified: pending; will verify post-deploy via synthetic pending_propagations row pointing at non-existent team_id with kind=tier_elevation
  Tests pass:
    TestPropagation_UnexpectedSkip_DoesNotMarkApplied PASS
    TestPropagation_UnexpectedSkip_DeadLettersAtMaxAttempts PASS
    TestIsPropagationAllowedSkip_Coverage PASS
    TestPropagationUnexpectedSkipErr_IsMatches PASS
    TestBucketSkipReason_BoundsCardinality PASS

make gate green (build + vet + go test ./... -short -count=1).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 7871685 commit 8ecab5c

3 files changed

Lines changed: 797 additions & 35 deletions
