Skip to content

fix(chaos)+test(integration): F1-F4 chaos fixes + propagation runner integration test layer#35

Merged
mastermanas805 merged 3 commits into
masterfrom
chaos-f234-on-f1
May 20, 2026
Merged

fix(chaos)+test(integration): F1-F4 chaos fixes + propagation runner integration test layer#35
mastermanas805 merged 3 commits into
masterfrom
chaos-f234-on-f1

Conversation

@mastermanas805
Copy link
Copy Markdown
Member

CHAOS fixes (F1/F2/F3/F4, base commits)

  • F1: unexpected_skip no longer silently marks propagation APPLIED — handler returns propagationUnexpectedSkipErr, runner takes the markRetry path.
  • F2: unknown_kind propagation rows are bounded by maxAttempts + emit propagation.unknown_kind_dead_lettered audit kind.
  • F3: instant_propagation_dead_lettered_total Prometheus counter.
  • F4: rescueStuckJobsAfter = 25min pinned in River config (was defaulting to 1h20m).

NEW: Track 3 — Propagation runner integration tests (this commit)

Adds the next layer up from propagation_runner_test.go (sqlmock unit drift guards). No build tag — runs under regular make gate.

worker/internal/jobs/propagation_runner_integration_test.go:

  • TestPropagation_BackoffIntegration_ExactScheduleViaMarkRetry — markRetry end-to-end persistence for every backoff position + clamp arm.
  • TestPropagation_DeadLetterIntegration_AtMaxAttempts — markDeadLettered direct + audit row emission pin.
  • TestPropagation_UnknownKindIntegration_BoundedRetries — F2 P1 guard: unknown_kind hits attempts++ via markRetry, NOT a silent skip.
  • TestPropagation_ForUpdateSkipLockedIntegration — live-DB concurrent picker test, TEST_DATABASE_URL-gated.
  • TestPropagation_RegistryWalkIntegration_EnumVsHandlerMap — rule-18 enum-vs-handler-map walk, TEST_DATABASE_URL-gated.

CLAUDE.md rule 17 coverage block per test, rule 18 registry walk in two of the five tests.

Local verified

$ make gate
ok      instant.dev/worker/internal/circuit      1.332s
ok      instant.dev/worker/internal/email        1.074s
ok      instant.dev/worker/internal/handlers     2.216s
ok      instant.dev/worker/internal/jobs        22.515s
gate: green — matches deploy.yml test step

🤖 Generated with Claude Code

mastermanas805 and others added 3 commits May 20, 2026 15:09
…ation APPLIED

Pre-fix bug (CHAOS-DRILL-2026-05-20 finding #1, propagation_runner.go
lines 756–771):

  handleTierElevation treated `(Applied=false, SkipReason=<any string
  not in the allowed-skip whitelist>)` as success. A WARN log fired,
  firstErr stayed nil, the runner stamped applied_at on the row, and
  the entitlement_reconciler (5-min backstop) saw no drift to correct
  because applied_at was set. A paying customer's tier-elevation
  regrade never landed — no retry, no dead-letter, no alert.

  Real prod trigger: customer's postgres pod missing postgres-admin
  Secret (legacy free-tier pods, mid-deprovisioning races). The chaos
  drill confirmed the failure mode end-to-end.

Fix: any non-allowed SkipReason now returns propagationUnexpectedSkipErr
(implements errors.Is on errPropagationUnexpectedSkipSentinel). The
runner's markRetry path detects the sentinel and emits a distinct
propagation.unexpected_skip audit row (NOT propagation.applied). The
row retries per the existing backoff schedule (1m, 5m, 15m, ...) and
dead-letters at propagationMaxAttempts (10 attempts ≈ 24h33m), going
through the standard markDeadLettered path with the canonical
propagation.dead_lettered audit kind that operators already alert on.

New Prometheus counter:

  instant_propagation_unexpected_skip_total{kind,resource_type,skip_reason}

with bounded skip_reason cardinality via bucketSkipReason() —
postgres_admin_secret_missing, redis_auth_secret_missing,
namespace_not_found, pod_not_found, resource_not_reachable,
legacy_resource, other. Leading indicator for the dead-letter alert
that already exists.

Audit kinds the runner now emits (mirrors api/models/audit_kinds.go):
  - propagation.applied         (success; unchanged)
  - propagation.retrying        (routine retry; unchanged)
  - propagation.dead_lettered   (terminal failure; unchanged)
  - propagation.unexpected_skip (NEW: F1 retry signal)

Coverage block (CLAUDE.md rule 17):

  Symptom:        propagation.applied audit row + applied_at stamp on a
                  row whose regrade never landed
  Enumeration:    rg -F 'unexpected_skip'
                  (worker, provisioner, api repos)
  Sites found:    1 emit site (handleTierElevation only)
  Sites touched:  1
  Coverage test:  TestIsPropagationAllowedSkip_Coverage iterates
                  propagationAllowedSkipSubstrings + a known-failure
                  string set; TestPropagation_UnexpectedSkip_DoesNotMarkApplied
                  fails the second a future PR re-routes unexpected_skip
                  through markApplied
  Live verified:  pending — will verify post-deploy via synthetic
                  pending_propagations row pointing at non-existent
                  team_id with kind=tier_elevation

Tests pass:
  TestPropagation_UnexpectedSkip_DoesNotMarkApplied             PASS
  TestPropagation_UnexpectedSkip_DeadLettersAtMaxAttempts       PASS
  TestIsPropagationAllowedSkip_Coverage                         PASS
  TestPropagationUnexpectedSkipErr_IsMatches                    PASS
  TestBucketSkipReason_BoundsCardinality                        PASS

make gate green (build + vet + go test ./... -short -count=1).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…er counter, pin River RescueStuckJobsAfter

Ships the three follow-ups from CHAOS-DRILL-2026-05-20 on top of F1's
unexpected_skip fix. F1 already pulled in the F2/F3 helpers (the
propagation_runner.go markUnknownKindDeadLettered path + the
PropagationDeadLetteredTotal / PropagationUnknownKindTotal counters);
this commit adds the F4 River config pin and the registry-iterating tests
that lock the three fixes in.

F2 (P1, CHAOS-DRILL-2026-05-20):
  pending_propagations rows whose kind has no handler now respect the same
  propagationMaxAttempts ceiling as a real-failure row. Pre-fix they retried
  forever — confirmed live during the drill (chaos_test_unknown_kind reached
  attempts=10 in 4 minutes without ever transitioning to failed_at). New
  markUnknownKindDeadLettered path emits a distinct
  propagation.unknown_kind_dead_lettered audit row + bumps
  instant_propagation_dead_lettered_total{reason="unknown_kind",kind="unknown_kind"}
  (the second label is a bounded BUCKET, NOT the raw row.kind, so an
  attacker-controlled enqueue cannot blow up Prom cardinality).

F3 (P2, CHAOS-DRILL-2026-05-20):
  Adds instant_propagation_dead_lettered_total{reason,kind} counter incremented
  on every transition to failed_at. reason="max_attempts" covers the modal
  path (real RegradeResource failures, F1's unexpected_skip-as-failure, and
  markApplied DB failures once they exhaust the backoff schedule);
  reason="unknown_kind" covers F2's image-skew path. Also adds the per-tick
  instant_propagation_unknown_kind_total{kind} counter as a leading indicator
  so the operator sees "worker is older than api" in seconds rather than
  waiting ~24h for the dead-letter to land.

F4 (P1, CHAOS-DRILL-2026-05-20):
  River's default RescueStuckJobsAfter = JobTimeout + JobRescuerRescueAfterDefault
  = 20m + 1h = 1h20m. That is an 80-minute RTO ceiling on any catastrophic
  worker death (OOMKill / pod eviction / segfault) where River's client never
  gets to mark the job back to 'available' itself. Pin it explicitly to
  25 minutes — JobTimeout (20m) + 5m of jitter headroom. Every job in this
  worker is idempotent, so a duplicate rescue is a no-op rather than a
  double-effect. The rescue_stuck_jobs_after value is now also stamped into
  the jobs.workers.started log line so a kubectl-logs grep after a roll
  confirms the pinned RTO is live.

TESTS (registry-iterating per CLAUDE.md rule 18):
  TestPropagation_UnknownKind_DeadLettersAtMaxAttempts:
    synthesises a guaranteed-not-in-registry kind (chaos_unknown_kind_<unix_nano>),
    drives a row at propagationMaxAttempts-1 through Work(), asserts (a)
    failed_at-stamping UPDATE landed and (b) PropagationDeadLetteredTotal
    {reason=unknown_kind,kind=unknown_kind} delta == 1. A future PR adding the
    synthetic kind to propagationHandlers cannot turn this test into a no-op
    because the kind is freshly generated each run.

  TestPropagation_UnknownKind_RetriesBelowMaxAttempts:
    companion guard so the F2 fix can't regress into the opposite bug
    (immediate dead-letter at attempts=0).

  TestPropagation_DeadLetter_IncrementsMetric:
    pins the F3 contract on the modal max_attempts path (tier_elevation kind,
    persistent gRPC failure at attempts=propagationMaxAttempts-1, asserts the
    Prom counter incremented).

  TestWorker_RiverConfig_RescueStuckJobsAfterIs25Min:
    pins rescueStuckJobsAfter == 25m exactly, AND > globalJobTimeout (so the
    rescuer doesn't race River's own timeout), AND < River's default 1h20m
    (so the explicit pin remains an actual reduction).

GATES: make gate green (build + vet + go test ./... -short -count=1 all green).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the next layer up from worker/internal/jobs/propagation_runner_test.go
(sqlmock unit drift guards). No build tag — runs under regular make gate.

  - TestPropagation_BackoffIntegration_ExactScheduleViaMarkRetry
    Drives w.markRetry directly with a deterministic clock and pins
    the persisted next_attempt_at SQL UPDATE arg for every position
    in propagationBackoffSchedule + the clamp arm. Catches a refactor
    that changes the next-attempt formula without updating the
    schedule.

  - TestPropagation_DeadLetterIntegration_AtMaxAttempts
    Drives w.markDeadLettered directly. Pins the SQL UPDATE setting
    failed_at AND the propagation.dead_lettered audit row emission.
    Catches a conditional skip of the audit emission.

  - TestPropagation_UnknownKindIntegration_BoundedRetries (F2 P1 guard)
    A pending_propagation with kind='garbage_kind_nobody_handles'
    must flow through markRetry (attempts++), not a silent skip.
    Catches a refactor that bypasses attempts++ in the unknown_kind
    branch.

  - TestPropagation_ForUpdateSkipLockedIntegration
    Live-DB concurrent picker test. Two workers pickEligible the
    same row; total picks must be <= 1. Gated on TEST_DATABASE_URL
    + absence of -short. Catches a SKIP LOCKED removal that lets
    sibling pods double-dispatch.

  - TestPropagation_RegistryWalkIntegration_EnumVsHandlerMap
    Rule 18 registry walk against the pending_propagations.kind PG
    enum. Catches a migration that adds an enum value without a
    matching handler in propagationHandlers.

CLAUDE.md rule 17 coverage block per test, rule 18 registry walk
in two of the five tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mastermanas805 mastermanas805 merged commit b3f093c into master May 20, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant