Commit 8ecab5c
fix(jobs): CHAOS F1 — unexpected_skip no longer silently marks propagation APPLIED (#34)
Pre-fix bug (CHAOS-DRILL-2026-05-20 finding #1, propagation_runner.go
lines 756–771):
handleTierElevation treated `(Applied=false, SkipReason=<any string
not in the allowed-skip whitelist>)` as success. A WARN log fired,
firstErr stayed nil, the runner stamped applied_at on the row, and
the entitlement_reconciler (5-min backstop) saw no drift to correct
because applied_at was set. A paying customer's tier-elevation
regrade never landed — no retry, no dead-letter, no alert.
Real prod trigger: a customer's postgres pod is missing its
postgres-admin Secret (legacy free-tier pods, mid-deprovisioning
races). The chaos drill confirmed the failure mode end-to-end.
Fix: any non-allowed SkipReason now returns propagationUnexpectedSkipErr
(matches errPropagationUnexpectedSkipSentinel under errors.Is). The
runner's markRetry path detects the sentinel and emits a distinct
propagation.unexpected_skip audit row (NOT propagation.applied). The
row retries per the existing backoff schedule (1m, 5m, 15m, ...) and
dead-letters at propagationMaxAttempts (10 attempts ≈ 24h33m), going
through the standard markDeadLettered path with the canonical
propagation.dead_lettered audit kind that operators already alert on.
New Prometheus counter:
instant_propagation_unexpected_skip_total{kind,resource_type,skip_reason}
with bounded skip_reason cardinality via bucketSkipReason() —
postgres_admin_secret_missing, redis_auth_secret_missing,
namespace_not_found, pod_not_found, resource_not_reachable,
legacy_resource, other. Leading indicator for the dead-letter alert
that already exists.
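
One way to bound the skip_reason label is a fixed substring-to-bucket table with an "other" fallback. The sketch below assumes that approach; the bucket names are the ones listed above, but the substrings being matched and the table layout are hypothetical and may differ from the real bucketSkipReason().

```go
package main

import (
	"fmt"
	"strings"
)

// skipReasonBuckets maps a substring of the raw reason to a bounded label.
// Order matters: more specific needles come first. The needles are
// assumptions for illustration only.
var skipReasonBuckets = []struct{ needle, bucket string }{
	{"postgres-admin", "postgres_admin_secret_missing"},
	{"redis-auth", "redis_auth_secret_missing"},
	{"namespace", "namespace_not_found"},
	{"pod not found", "pod_not_found"},
	{"not reachable", "resource_not_reachable"},
	{"legacy", "legacy_resource"},
}

// bucketSkipReason collapses a free-form skip reason into one of a fixed
// set of label values so the metric's cardinality stays bounded.
func bucketSkipReason(reason string) string {
	r := strings.ToLower(reason)
	for _, b := range skipReasonBuckets {
		if strings.Contains(r, b.needle) {
			return b.bucket
		}
	}
	return "other"
}

func main() {
	fmt.Println(bucketSkipReason("Secret postgres-admin missing in namespace team-42")) // postgres_admin_secret_missing
	fmt.Println(bucketSkipReason("dial tcp 10.0.3.7:5432: timeout"))                    // other
}
```

Because every unrecognized reason falls into "other", identifiers embedded in the raw string (team IDs, pod names, addresses) never reach the label set.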
Audit kinds the runner now emits (mirrors api/models/audit_kinds.go):
- propagation.applied (success; unchanged)
- propagation.retrying (routine retry; unchanged)
- propagation.dead_lettered (terminal failure; unchanged)
- propagation.unexpected_skip (NEW: F1 retry signal)
Coverage block (CLAUDE.md rule 17):
  Symptom:       propagation.applied audit row + applied_at stamp on a
                 row whose regrade never landed
  Enumeration:   rg -F 'unexpected_skip'
                 (worker, provisioner, api repos)
  Sites found:   1 emit site (handleTierElevation only)
  Sites touched: 1
  Coverage test: TestIsPropagationAllowedSkip_Coverage iterates
                 propagationAllowedSkipSubstrings + a known-failure
                 string set; TestPropagation_UnexpectedSkip_DoesNotMarkApplied
                 fails the second a future PR re-routes unexpected_skip
                 through markApplied
  Live verified: pending; will verify post-deploy via synthetic
                 pending_propagations row pointing at non-existent
                 team_id with kind=tier_elevation
Tests pass:
  TestPropagation_UnexpectedSkip_DoesNotMarkApplied        PASS
  TestPropagation_UnexpectedSkip_DeadLettersAtMaxAttempts  PASS
  TestIsPropagationAllowedSkip_Coverage                    PASS
  TestPropagationUnexpectedSkipErr_IsMatches               PASS
  TestBucketSkipReason_BoundsCardinality                   PASS
make gate green (build + vet + go test ./... -short -count=1).
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 7871685 commit 8ecab5c
3 files changed
Lines changed: 797 additions & 35 deletions