| 1 | +# instant-worker — Claude Code Project Config
| 2 | + |
| 3 | +Background-jobs service. Runs River (Postgres-native queue) on top of the |
| 4 | +platform DB. Owns every async / scheduled side-effect that the agent-facing |
| 5 | +API would block on if it tried to do them in-request. |
| 6 | + |
| 7 | +## Quick reference |
| 8 | + |
| 9 | +- `make build` — compile all packages |
| 10 | +- `make test` — full test run (race, count=1) |
| 11 | +- `make gate` — **pre-push/pre-commit gate**. Runs the EXACT command |
| 12 | + sequence CI's `.github/workflows/deploy.yml` runs as its test step |
| 13 | + (`go build ./... && go vet ./... && go test ./... -short -count=1`). |
| 14 | + A green `make gate` locally == a green CI test step. Per CLAUDE.md |
| 15 | + (root) rule 23, this must pass before commit/push. |
| 16 | +- `make docker-build` — Docker image with `GIT_SHA` / `BUILD_TIME` / |
| 17 | + `VERSION` build args wired into `instant.dev/common/buildinfo`. |
| 18 | +- `make smoke-buildinfo` — proves the `-ldflags` injection path still |
| 19 | + works end-to-end (CI runs this on every PR). |
| 20 | +- `make chaostest-propagation` — propagation_runner retry / dead-letter |
| 21 | + drill (live cluster; see `CHAOS-DRILL-2026-05-20.md` at repo root). |
| 22 | +- `make chaostest-lease-recovery` — worker pod OOMKill / lease-takeover |
| 23 | + drill (see same doc). |
| 24 | + |
| 25 | +## What lives where |
| 26 | + |
| 27 | +``` |
| 28 | +worker/ |
| 29 | +├── cmd/ ← per-binary mains (smoke-buildinfo helper, etc.) |
| 30 | +├── docs/ ← per-job design notes |
| 31 | +├── internal/ |
| 32 | +│ ├── jobs/ ← ~50 source files. Every job ends in *_job.go-style |
| 33 | +│ │ with a worker.River pattern + companion *_test.go. |
| 34 | +│ ├── ... ← (logctx wrappers, k8s client, db helpers) |
| 35 | +│ └── ... |
| 36 | +├── sql/ ← internal SQL helpers (worker-side, not migrations; |
| 37 | +│ migrations live in api/internal/db/migrations/) |
| 38 | +└── main.go ← River boot + job registration |
| 39 | +``` |
| 40 | + |
| 41 | +Source-of-truth headline jobs (in scope for routine edits): |
| 42 | + |
| 43 | +| Job | File | Purpose | |
| 44 | +|---|---|---| |
| 45 | +| `propagation_runner` | `internal/jobs/propagation_runner.go` | Out-of-band side-effect runner with maxAttempts + dead-letter table. **Chaos-tested 2026-05-20.** `instant_propagation_dead_lettered_total` Prometheus counter wired. | |
| 46 | +| `expire` / `expire_imminent` / `expire_stacks` | `internal/jobs/expire*.go` | Anonymous + Free-tier 24h TTL reaper + 6/2/1h warning emails. Comprehensive Go-rendered email per Rule 12 (no Brevo templates). | |
| 47 | +| `entitlement_reconciler` | `internal/jobs/entitlement_reconciler.go` | Re-grade resources after tier change. Reads from `resource.tier`, NOT `team.plan_tier` (the PR #175 fix). All arms (Postgres / Redis / Mongo) implemented. | |
| 48 | +| `quota` / `quota_infra` / `quota_redis_eviction` / `quota_wall_nudge` | `internal/jobs/quota*.go` | Per-resource usage scans against shared infra. The 80% upsell nudge fires once per resource. | |
| 49 | +| `billing_reconciler` / `checkout_reconcile` | `internal/jobs/billing*.go` `internal/jobs/checkout_reconcile.go` | Razorpay subscription state-machine repair + pending-checkout reconciliation. | |
| 50 | +| `deploy_status_reconcile` / `deploy_notify_webhook` / `deploy_failure_autopsy` / `deployment_expirer` / `deployment_reminder` | `internal/jobs/deploy*.go` `internal/jobs/deployment*.go` | Deploy lifecycle: status sync, webhook fan-out, failure-cause classification, app expiry. | |
| 51 | +| `orphan_sweep_reconciler` / `orphan_sweep_canceler` | `internal/jobs/orphan_sweep*.go` | k8s namespace reaper. PASS 3 enhanced reasons + PASS 6 stuck-build state (covers ImagePullBackOff at 9h). | |
| 52 | +| `event_email_forwarder` / `event_email_mapping` / `lifecycle_emails` / `expiry_reminder` | `internal/jobs/event_email*.go` `internal/jobs/lifecycle_emails.go` `internal/jobs/expiry_reminder.go` | Comprehensive Go-rendered transactional email. All 18+ kinds. See `expiry_reminder.brevo-template.md` for the historical (now-deprecated) Brevo template. | |
| 53 | +| `customer_backup_runner` / `customer_backup_scheduler` / `customer_restore_runner` / `backup_audit` / `backup_s3` / `platform_db_backup` | `internal/jobs/customer_backup*.go` `internal/jobs/backup*.go` `internal/jobs/platform_db_backup*.go` | Per-tenant backup ladder + platform DB backup. | |
| 54 | +| `team_deletion_executor` (+ audit_kinds, s3_adapter) / `pending_deletion_expirer` | `internal/jobs/team_deletion*.go` `internal/jobs/pending_deletion_expirer.go` | Purge orchestration. k8s namespace teardown lives here (PR #135). | |
| 55 | +| `magic_link_reconciler` / `payment_grace_reminder` / `payment_grace_terminator` | `internal/jobs/magic_link_reconciler.go` `internal/jobs/payment_grace*.go` | Auth + payment-failure flows. | |
| 56 | +| `provisioner_reconciler` / `razorpay_webhook_prune` / `chaos_lease_recovery` / `resource_heartbeat` | `internal/jobs/provisioner_reconciler.go` `internal/jobs/razorpay_webhook_prune.go` `internal/jobs/chaos_lease_recovery.go` `internal/jobs/resource_heartbeat.go` | Operational reconciliation. | |
| 57 | +| `prober` / `real_prober` / `uptime_prober` | `internal/jobs/prober.go` `internal/jobs/real_prober.go` `internal/jobs/uptime_prober.go` | Synthetic health checks. | |
| 58 | +| `churn_predictor` | `internal/jobs/churn_predictor.go` | Heuristic risk scoring. | |
| 59 | +| `geodb` | `internal/jobs/geodb.go` | MaxMind GeoLite2 refresh. | |
| 60 | +| `custom_domain_reconcile` | `internal/jobs/custom_domain_reconcile.go` | cert-manager / ingress reconciliation for Pro+ custom hostnames. | |
| 61 | +| `storage` / `storage_minio` | `internal/jobs/storage.go` `internal/jobs/storage_minio.go` | Quota scan against object-store backend (DO Spaces live; MinIO legacy). | |
| 62 | + |
| 63 | +## Conventions (worker-specific, on top of root CLAUDE.md) |
| 64 | + |
| 65 | +1. **Idle ticks are DEBUG.** A job that wakes up on its schedule, finds |
| 66 | + nothing to do, and exits should log at DEBUG. INFO is reserved for |
| 67 | + work performed. See `worker/internal/jobs/quota.go` and the W4 ticket |
| 68 | + pattern (`entitlement_reconciler.Mongo arm` was the historical noisy |
| 69 | + surface — silenced 2026-05-19). |
| 70 | + |
| 71 | +2. **Resource bearer tokens MUST be masked in logs.** Worker T21 P1-2. |
| 72 | + Use the masking helper, never raw `resource.token`. |
| 73 | + |
| 74 | +3. **Job timeout via River JobTimeout.** Worker T20 P1. Every job gets a |
| 75 | + timeout so a stuck Postgres/HTTP call cannot wedge the River pool. |
| 76 | + |
| 77 | +4. **Drain before cancel.** Worker T20 P0-2 / P1-3. `Workers.Stop` calls |
| 78 | + `Drain` first so in-flight work commits before the context cancels. |
| 79 | + |
| 80 | +5. **`resource.tier`, not `team.plan_tier`.** Entitlement reconciler reads |
| 81 | + per-resource tier (Worker T8 P1-1). A user whose team has been
| 82 | + downgraded to Hobby keeps the Pro grade on their older Pro resources
| 83 | + until the next provision.
| 84 | + |
| 85 | +6. **Forwarder claim-after-2xx.** Worker MR-P1-16. The forwarder claims |
| 86 | + the row only AFTER a 2xx from Brevo, not before. This avoids the |
| 87 | + "ledger marked sent, network call failed" hole. |
| 88 | + |
| 89 | +7. **Email masking in worker providers.** Worker T22 P1-1. Stub the |
| 90 | + right side of the `@` (the domain part).
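
Conventions 2 and 7 could be sketched roughly as follows. This is a minimal illustration only: `maskToken` and `maskEmail` are hypothetical names, not the worker's actual masking helper, and the "keep 4 characters" choice and the `****` / `***` markers are assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// maskToken is a hypothetical helper (not the worker's real one). It keeps a
// short prefix for log correlation and replaces the rest with a fixed marker,
// so neither the secret nor its length reaches the logs.
func maskToken(token string) string {
	const keep = 4 // assumed prefix length, chosen for illustration
	if len(token) <= keep {
		return "****"
	}
	return token[:keep] + "****"
}

// maskEmail is a hypothetical helper that stubs everything to the right of
// the last '@'. Input without an '@' is masked completely.
func maskEmail(addr string) string {
	at := strings.LastIndex(addr, "@")
	if at < 0 {
		return "***"
	}
	return addr[:at+1] + "***"
}

func main() {
	fmt.Println(maskToken("tok_live_0123456789abcdef")) // tok_****
	fmt.Println(maskEmail("dev@example.com"))           // dev@***
}
```

Using a fixed-width marker rather than one `*` per hidden character is a deliberate choice in this sketch: it avoids leaking the token length.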
| 91 | + |
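The claim-after-2xx ordering from convention 6 can be illustrated with a small, self-contained sketch. The `sendFunc` / `claimFunc` types and `forwardOne` are hypothetical stand-ins for the provider call and the ledger update; the real forwarder's types and SQL are not shown in this file.

```go
package main

import (
	"errors"
	"fmt"
)

// sendFunc stands in for the outbound provider call (hypothetical signature).
type sendFunc func(rowID int64) (statusCode int, err error)

// claimFunc stands in for marking the ledger row as sent (hypothetical).
type claimFunc func(rowID int64) error

// forwardOne sends first and claims the row only after a 2xx response.
// On a transport error or a non-2xx status the row stays unclaimed, so a
// later tick can retry it.
func forwardOne(rowID int64, send sendFunc, claim claimFunc) error {
	status, err := send(rowID)
	if err != nil {
		return fmt.Errorf("send row %d: %w", rowID, err)
	}
	if status < 200 || status > 299 {
		return fmt.Errorf("send row %d: provider returned %d", rowID, status)
	}
	return claim(rowID)
}

func main() {
	claimed := map[int64]bool{}
	claim := func(id int64) error { claimed[id] = true; return nil }

	_ = forwardOne(1, func(int64) (int, error) { return 201, nil }, claim)
	_ = forwardOne(2, func(int64) (int, error) { return 0, errors.New("timeout") }, claim)
	_ = forwardOne(3, func(int64) (int, error) { return 500, nil }, claim)

	fmt.Println(claimed[1], claimed[2], claimed[3]) // true false false
}
```

One consequence of this ordering, at least in the sketch, is at-least-once delivery: if the claim itself fails after a successful send, a retry may send the same message again.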
| 92 | +## Chaos drills (run on real cluster) |
| 93 | + |
| 94 | +- `make chaostest-propagation` — flushes the propagation queue with |
| 95 | + malformed kinds, asserts dead-letter ceiling + the `unknown_kind` |
| 96 | + escape route (CHAOS F2 fix). |
| 97 | +- `make chaostest-lease-recovery` — SIGKILL a worker pod, measure RTO |
| 98 | + for River lease takeover by the surviving pod. The live RTO
| 99 | + measurement is gated on the next image rebuild; see the CHAOS F5 ticket.
| 100 | + |
| 101 | +## Auto-deploy |
| 102 | + |
| 103 | +Worker auto-deploys on push to `master` via `.github/workflows/deploy.yml`. |
| 104 | +Verify with `kubectl get pod -n instant-infra -l app=instant-worker -o jsonpath='{.items[0].spec.containers[0].image}'` after rollout. The image |
| 105 | +tag's `<sha>` suffix (e.g. `master-<sha>`) must match the output of `git rev-parse --short HEAD`.
| 106 | + |
| 107 | +## When in doubt |
| 108 | + |
| 109 | +The root `/Users/manassrivastava/Documents/InstaNode/CLAUDE.md` covers |
| 110 | +shared conventions, agent-reliability rules (1–23), and the four-pass |
| 111 | +deploy ritual. **This file is the worker-specific delta only.** When a |
| 112 | +rule appears in both, root wins. |