Skip to content

Commit 0d7e46c

Browse files
worker: Wave 3 P2 fixes — BugBash 2026-05-20
MR-P1-5 (T5 P0-3): reaper SELECT-then-deprovision races the upgrade webhook clearing expires_at. Wrap the per-row check + deprovision in a SERIALIZED FOR UPDATE tx so a concurrent subscription.charged that clears expires_at + sets tier='pro' between batch SELECT and per-row work cannot result in DROP DATABASE on the customer's just-paid DB. The fix opens BeginTx per row, re-confirms the reaper predicate under 'SELECT EXISTS … FOR UPDATE OF r', and runs the deprovision + UPDATE inside the same tx. The upgrade webhook either blocks on the row lock (and then finds a 'deleted' row — its UPDATE is a no-op because ElevateResourceTiersByTeam filters non-deleted statuses) or completes before our SELECT (in which case the EXISTS returns false and the reaper skips with metrics.ExpireRaceSkippedTotal.Inc()). MR-P1-7 (T5 P1-7): reaper free-tier predicate ignored team deletion state — a 'free' resource of a team in deletion_requested 30-day grace was being reaped (DROP DATABASE) before the restore window elapsed. Add LEFT JOIN teams + (r.team_id IS NULL OR t.status = 'active') to the batch SELECT and to the per-row FOR UPDATE re-confirm (defense in depth). The team-deletion executor is the authorized destructor for that data path; the reaper stays out of it. T8 worker followup: entitlement_reconciler Mongo arm — verify the unsupported-MONGODB skip (provisioner returns {Applied:false, SkipReason:'unsupported resource type for regrade'}) is logged at DEBUG not WARN. Existing code already does this; added a regression test (TestEntitlementReconciler_MongoArm_UnsupportedSkip_LogsAtDebug_NotWARN) that captures slog records and fails loudly if mongo.regrade_skipped ever fires at WARN — that path runs once per Mongo resource per 5min tick indefinitely until provisioner.regradeMongo lands, so a WARN would be ~12/h/resource of alert-fatigue spam. T22 P2 (DEPLOY_DOMAIN/R2_ENDPOINT defaults) + T21 P1-5 (e2e_bypass INFO spam): SKIPPED — both findings are in the api repo (api/internal/config/config.go and api/internal/middleware/fingerprint.go respectively), not in the worker module. The api repo has uncommitted in-flight work (OTel tracing restoration, ~5 files); a worker-scoped brief should not drag those in. Flagged in the report so a follow-up api-repo Wave can claim them. Coverage block (per CLAUDE.md rule 17): Symptom: reaper race vs upgrade webhook → DROP paid DB; reaper reaps free-row of deletion_requested team → restore returns active account with no data Enumeration: rg -n 'tier=.free.\|expires_at' worker/internal/jobs/ rg -n 'DeprovisionResource' worker/internal/jobs/expire.go Sites found: 1 reaper SQL + 1 per-row deprovision call Sites touched: both — batch SELECT now LEFT JOINs teams, per-row deprovision wrapped in FOR UPDATE tx Coverage test: TestExpireAnonymousWorker_T5_P0_3_UpgradeWebhookWinsRace TestExpireAnonymousWorker_T5_P1_7_TeamInDeletion- RequestedIsExcluded TestExpireAnonymousWorker_T5_P1_7_PerRowGuard- RedundantlyChecksTeamStatus TestEntitlementReconciler_MongoArm_UnsupportedSkip- _LogsAtDebug_NotWARN Live verified: pending — pushes to origin/master auto-deploy; the on-cluster effect is per-tick metric instant_expire_race_skipped_total ≥ 0 (positive signal when fires); behavioural verification waits for a real subscription.charged collision (rare). No regression in the existing P0-1a / paused-suspended / free-tier expiry / storage-expiry test coverage. Build/vet/test all green: go build ./... → clean go vet ./... → clean go test ./... → ok (all packages, including 22.6s jobs suite) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent ebbdfef commit 0d7e46c

4 files changed

Lines changed: 606 additions & 156 deletions

File tree

internal/jobs/entitlement_reconciler_test.go

Lines changed: 83 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -999,3 +999,86 @@ func TestEntitlementReconciler_MongoArm_SweepsMongoResources(t *testing.T) {
999999
t.Errorf("unmet sqlmock expectations: %v", err)
10001000
}
10011001
}
1002+
1003+
// TestEntitlementReconciler_MongoArm_UnsupportedSkip_LogsAtDebug_NotWARN is the
1004+
// regression guard for T8 worker follow-up (BugBash 2026-05-20): until the
1005+
// provisioner's regradeMongo is implemented, every Mongo regrade call returns
1006+
// `{Applied: false, SkipReason: "unsupported resource type for regrade"}`.
1007+
// This is the EXPECTED steady state — once per Mongo resource per 5-minute
1008+
// tick, indefinitely. If that path logs at WARN it spams the production
1009+
// log stream (~12/h/resource × N Mongo customers) and triggers alert fatigue.
1010+
//
1011+
// THE CONTRACT: the unsupported-Mongo skip path logs at DEBUG; the operator-
1012+
// visible signal is the rollup `mongo_skipped` counter on the
1013+
// `.completed` INFO line, not a per-resource per-tick WARN.
1014+
//
1015+
// THE ASSERTION: with a regrader that returns Applied:false for MONGODB, the
1016+
// captured slog stream contains ZERO WARN records whose message starts with
1017+
// `jobs.entitlement_reconciler.mongo.` (the family the unsupported skip
1018+
// belongs to), and the per-row skip event is recorded at DEBUG with the
1019+
// expected skip_reason.
1020+
func TestEntitlementReconciler_MongoArm_UnsupportedSkip_LogsAtDebug_NotWARN(t *testing.T) {
1021+
records := captureSlog(t)
1022+
1023+
db, mock, err := sqlmock.New(sqlmock.QueryMatcherOption(sqlmock.QueryMatcherRegexp))
1024+
if err != nil {
1025+
t.Fatalf("sqlmock.New: %v", err)
1026+
}
1027+
defer db.Close()
1028+
1029+
id := uuid.New()
1030+
// Postgres + Redis empty; Mongo has one row. The unsupported regrader
1031+
// below returns Applied:false / SkipReason="unsupported...".
1032+
mock.ExpectQuery(`SELECT`).WillReturnRows(sqlmock.NewRows(entitlementSweepCols))
1033+
mock.ExpectQuery(`SELECT`).WillReturnRows(emptyRedisRows())
1034+
mock.ExpectQuery(`SELECT`).WillReturnRows(sqlmock.NewRows(mongoSweepCols).
1035+
AddRow(id, "tok-mongo-unsup", nil, "pro", "pro"))
1036+
1037+
// Regrader returns the exact shape the real provisioner produces today
1038+
// for MONGODB — Applied:false + SkipReason matching server.go:749.
1039+
unsupRegrader := &funcRegrader{
1040+
fn: func(_ context.Context, _, _ string, resType commonv1.ResourceType, _, _ string) (regradeOutcome, error) {
1041+
if resType == commonv1.ResourceType_RESOURCE_TYPE_MONGODB {
1042+
return regradeOutcome{Applied: false, SkipReason: "unsupported resource type for regrade"}, nil
1043+
}
1044+
return regradeOutcome{Applied: true}, nil
1045+
},
1046+
}
1047+
1048+
reg := liveRegistry(t)
1049+
w := NewEntitlementReconcilerWorker(db, reg, unsupRegrader)
1050+
if err := w.Work(context.Background(), fakeEntitlementJob()); err != nil {
1051+
t.Fatalf("Work() returned unexpected error: %v", err)
1052+
}
1053+
1054+
// THE LOAD-BEARING ASSERTION: scan the captured records for any WARN
1055+
// (or higher) record whose message belongs to the Mongo arm family.
1056+
// The unsupported-skip path must use DEBUG; any WARN here is the
1057+
// per-tick spam this regression guards against.
1058+
for _, rec := range *records {
1059+
if rec.level < slog.LevelWarn {
1060+
continue
1061+
}
1062+
// We tolerate non-Mongo WARNs (other arms unaffected) — only the
1063+
// `mongo.` family is in-scope for this regression.
1064+
if msg := rec.msg; len(msg) >= len("jobs.entitlement_reconciler.mongo.") &&
1065+
msg[:len("jobs.entitlement_reconciler.mongo.")] == "jobs.entitlement_reconciler.mongo." {
1066+
// The mongo.query_failed / mongo.scan_failed / mongo.rows_failed /
1067+
// mongo.regrade_failed lines are legitimate WARNs (they signal a
1068+
// real fault). The bug we guard is mongo.regrade_skipped firing
1069+
// at WARN — pin its level to DEBUG specifically.
1070+
if rec.msg == "jobs.entitlement_reconciler.mongo.regrade_skipped" {
1071+
t.Errorf("T8 worker followup regression: "+
1072+
"the unsupported-Mongo skip path emitted a %s record — "+
1073+
"this fires once per Mongo resource per 5-min tick, "+
1074+
"indefinitely (until provisioner.regradeMongo lands), "+
1075+
"so it must be DEBUG, not WARN/ERROR. Record: msg=%q attrs=%v",
1076+
rec.level, rec.msg, rec.attrs)
1077+
}
1078+
}
1079+
}
1080+
1081+
if err := mock.ExpectationsWereMet(); err != nil {
1082+
t.Errorf("unmet sqlmock expectations: %v", err)
1083+
}
1084+
}

0 commit comments

Comments
 (0)