Commit f0cdbd5

docs(.claude): add worker-specific Claude Code project config
Worker repo had no .claude/CLAUDE.md, so Claude Code sessions opened in the worker root only saw the project-root CLAUDE.md and missed worker-specific conventions.

This adds the worker delta over root rules:
- Idle ticks are DEBUG (not INFO), following the quota.go pattern.
- Resource bearer tokens MUST be masked in logs (T21 P1-2).
- River JobTimeout on every job (T20 P1).
- Drain-before-cancel in Workers.Stop (T20 P0-2 / P1-3).
- resource.tier (not team.plan_tier) for entitlement reconciliation.
- Forwarder claim-after-2xx (MR-P1-16).
- Email masking in worker email providers (T22 P1-1).

Includes a comprehensive jobs table (~50 source files in internal/jobs/) so an agent can find the right file without grepping. Documents the chaos drill make targets and the auto-deploy contract.

Coverage block (per CLAUDE.md rule 17):
  Symptom: worker repo missing .claude/CLAUDE.md while api/ has one
  Enumeration: ls api/.claude vs ls worker/.claude
  Sites found: 1 (worker root needed .claude/ dir + CLAUDE.md)
  Sites touched: 1
  Coverage test: none (doc-only).
  Live verified: next Claude Code session opened in worker/ will pick up the file via the default discovery path.

Closes P2 from DOC-REALITY-DELTA-2026-05-20.md §3 (worker repo CLAUDE.md gap).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 7d2ff0d commit f0cdbd5

1 file changed

Lines changed: 112 additions & 0 deletions

.claude/CLAUDE.md

@@ -0,0 +1,112 @@
# instant-worker: Claude Code Project Config

Background-jobs service. Runs River (Postgres-native queue) on top of the
platform DB. Owns every async / scheduled side-effect that the agent-facing
API would block on if it tried to do them in-request.

## Quick reference

- `make build`: compile all packages
- `make test`: full test run (race, count=1)
- `make gate`: **pre-push/pre-commit gate**. Runs the EXACT command
  sequence CI's `.github/workflows/deploy.yml` runs as its test step
  (`go build ./... && go vet ./... && go test ./... -short -count=1`).
  A green `make gate` locally == a green CI test step. Per CLAUDE.md
  (root) rule 23, this must pass before commit/push.
- `make docker-build`: Docker image with `GIT_SHA` / `BUILD_TIME` /
  `VERSION` build args wired into `instant.dev/common/buildinfo`.
- `make smoke-buildinfo`: proves the `-ldflags` injection path still
  works end-to-end (CI runs this on every PR).
- `make chaostest-propagation`: propagation_runner retry / dead-letter
  drill (live cluster; see `CHAOS-DRILL-2026-05-20.md` at repo root).
- `make chaostest-lease-recovery`: worker pod OOMKill / lease-takeover
  drill (see same doc).

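For orientation, here is a minimal sketch of the `-ldflags` injection path that `make smoke-buildinfo` exercises. The variable names and the `main` package placement are assumptions for illustration; the real identifiers in `instant.dev/common/buildinfo` may differ.

```go
package main

import "fmt"

// Hypothetical stand-ins for the build metadata variables. The Go linker can
// overwrite package-level string variables at build time via -ldflags "-X ...".
// Without injection they keep these defaults.
var (
	Version   = "dev"
	GitSHA    = "unknown"
	BuildTime = "unknown"
)

// buildSummary renders the metadata as a single log-friendly line.
func buildSummary() string {
	return fmt.Sprintf("version=%s sha=%s built=%s", Version, GitSHA, BuildTime)
}

func main() {
	fmt.Println(buildSummary())
}
```

Building with `go build -ldflags "-X main.GitSHA=$(git rev-parse --short HEAD)"` would replace the `unknown` default for `GitSHA`. A smoke test can then fail if the binary still reports a default value.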
## What lives where

```
worker/
├── cmd/            ← per-binary mains (smoke-buildinfo helper, etc.)
├── docs/           ← per-job design notes
├── internal/
│   ├── jobs/       ← ~50 source files. Every job ends in *_job.go-style
│   │                 with a worker.River pattern + companion *_test.go.
│   ├── ...         ← (logctx wrappers, k8s client, db helpers)
│   └── ...
├── sql/            ← internal SQL helpers (worker-side, not migrations;
│                     migrations live in api/internal/db/migrations/)
└── main.go         ← River boot + job registration
```

Source-of-truth headline jobs (in scope for routine edits):

| Job | File | Purpose |
|---|---|---|
| `propagation_runner` | `internal/jobs/propagation_runner.go` | Out-of-band side-effect runner with maxAttempts + dead-letter table. **Chaos-tested 2026-05-20.** `instant_propagation_dead_lettered_total` Prometheus counter wired. |
| `expire` / `expire_imminent` / `expire_stacks` | `internal/jobs/expire*.go` | Anonymous + Free-tier 24h TTL reaper + 6/2/1h warning emails. Comprehensive Go-rendered email per Rule 12 (no Brevo templates). |
| `entitlement_reconciler` | `internal/jobs/entitlement_reconciler.go` | Re-grade resources after tier change. Reads from `resource.tier`, NOT `team.plan_tier` (the PR #175 fix). All arms (Postgres / Redis / Mongo) implemented. |
| `quota` / `quota_infra` / `quota_redis_eviction` / `quota_wall_nudge` | `internal/jobs/quota*.go` | Per-resource usage scans against shared infra. The 80% upsell nudge fires once per resource. |
| `billing_reconciler` / `checkout_reconcile` | `internal/jobs/billing*.go`, `internal/jobs/checkout_reconcile.go` | Razorpay subscription state-machine repair + pending-checkout reconciliation. |
| `deploy_status_reconcile` / `deploy_notify_webhook` / `deploy_failure_autopsy` / `deployment_expirer` / `deployment_reminder` | `internal/jobs/deploy*.go`, `internal/jobs/deployment*.go` | Deploy lifecycle: status sync, webhook fan-out, failure-cause classification, app expiry. |
| `orphan_sweep_reconciler` / `orphan_sweep_canceler` | `internal/jobs/orphan_sweep*.go` | k8s namespace reaper. PASS 3 enhanced reasons + PASS 6 stuck-build state (covers ImagePullBackOff at 9h). |
| `event_email_forwarder` / `event_email_mapping` / `lifecycle_emails` / `expiry_reminder` | `internal/jobs/event_email*.go`, `internal/jobs/lifecycle_emails.go`, `internal/jobs/expiry_reminder.go` | Comprehensive Go-rendered transactional email. All 18+ kinds. See `expiry_reminder.brevo-template.md` for the historical (now-deprecated) Brevo template. |
| `customer_backup_runner` / `customer_backup_scheduler` / `customer_restore_runner` / `backup_audit` / `backup_s3` / `platform_db_backup` | `internal/jobs/customer_backup*.go`, `internal/jobs/backup*.go`, `internal/jobs/platform_db_backup*.go` | Per-tenant backup ladder + platform DB backup. |
| `team_deletion_executor` (+ audit_kinds, s3_adapter) / `pending_deletion_expirer` | `internal/jobs/team_deletion*.go`, `internal/jobs/pending_deletion_expirer.go` | Purge orchestration. k8s namespace teardown lives here (PR #135). |
| `magic_link_reconciler` / `payment_grace_reminder` / `payment_grace_terminator` | `internal/jobs/magic_link_reconciler.go`, `internal/jobs/payment_grace*.go` | Auth + payment-failure flows. |
| `provisioner_reconciler` / `razorpay_webhook_prune` / `chaos_lease_recovery` / `resource_heartbeat` | `internal/jobs/provisioner_reconciler.go`, `internal/jobs/razorpay_webhook_prune.go`, `internal/jobs/chaos_lease_recovery.go`, `internal/jobs/resource_heartbeat.go` | Operational reconciliation. |
| `prober` / `real_prober` / `uptime_prober` | `internal/jobs/prober.go`, `internal/jobs/real_prober.go`, `internal/jobs/uptime_prober.go` | Synthetic health checks. |
| `churn_predictor` | `internal/jobs/churn_predictor.go` | Heuristic risk scoring. |
| `geodb` | `internal/jobs/geodb.go` | MaxMind GeoLite2 refresh. |
| `custom_domain_reconcile` | `internal/jobs/custom_domain_reconcile.go` | cert-manager / ingress reconciliation for Pro+ custom hostnames. |
| `storage` / `storage_minio` | `internal/jobs/storage.go`, `internal/jobs/storage_minio.go` | Quota scan against object-store backend (DO Spaces live; MinIO legacy). |

## Conventions (worker-specific, on top of root CLAUDE.md)

1. **Idle ticks are DEBUG.** A job that wakes up on its schedule, finds
   nothing to do, and exits should log at DEBUG. INFO is reserved for
   work performed. See `worker/internal/jobs/quota.go` and the W4 ticket
   pattern (`entitlement_reconciler.Mongo arm` was the historical noisy
   surface, silenced 2026-05-19).

2. **Resource bearer tokens MUST be masked in logs.** Worker T21 P1-2.
   Use the masking helper, never raw `resource.token`.

3. **Job timeout via River JobTimeout.** Worker T20 P1. Every job gets a
   timeout so a stuck Postgres/HTTP call cannot wedge the River pool.

4. **Drain before cancel.** Worker T20 P0-2 / P1-3. `Workers.Stop` calls
   `Drain` first so in-flight work commits before the context cancels.

5. **`resource.tier`, not `team.plan_tier`.** Entitlement reconciler reads
   per-resource tier (Worker T8 P1-1). A user with old Pro resources
   whose team tier was downgraded to Hobby keeps the Pro grade on those
   resources until the next provision.

6. **Forwarder claim-after-2xx.** Worker MR-P1-16. The forwarder claims
   the row only AFTER a 2xx from Brevo, not before. This avoids the
   "ledger marked sent, network call failed" hole.

7. **Email masking in worker providers.** Worker T22 P1-1. Stub the
   right side of the `@` (the domain).

92+
## Chaos drills (run on real cluster)
93+
94+
- `make chaostest-propagation` — flushes the propagation queue with
95+
malformed kinds, asserts dead-letter ceiling + the `unknown_kind`
96+
escape route (CHAOS F2 fix).
97+
- `make chaostest-lease-recovery` — SIGKILL a worker pod, measure RTO
98+
for River lease takeover by the surviving pod. See CHAOS F5 ticket
99+
for live RTO measurement gating on the next image rebuild.
100+
## Auto-deploy

Worker auto-deploys on push to `master` via `.github/workflows/deploy.yml`.
Verify with `kubectl get pod -n instant-infra -l app=instant-worker -o jsonpath='{.items[0].spec.containers[0].image}'`
after rollout. The `<sha>` part of the image tag (e.g. `master-<sha>`)
must match `git rev-parse --short HEAD`.

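The tag comparison above can be scripted. This sketch assumes the tag always has the `master-<sha>` form given as the example and that the image reference carries a tag rather than a digest; `deployedSHA` and `matchesHead` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"strings"
)

// deployedSHA extracts <sha> from an image reference tagged master-<sha>.
// A colon that appears before the last slash is a registry port, not a tag.
func deployedSHA(image string) (string, bool) {
	slash := strings.LastIndex(image, "/")
	colon := strings.LastIndex(image, ":")
	if colon <= slash {
		return "", false // no tag present
	}
	tag := image[colon+1:]
	const prefix = "master-"
	if !strings.HasPrefix(tag, prefix) || len(tag) == len(prefix) {
		return "", false
	}
	return strings.TrimPrefix(tag, prefix), true
}

// matchesHead reports whether the deployed image was built from head, where
// head is the output of `git rev-parse --short HEAD`.
func matchesHead(image, head string) bool {
	sha, ok := deployedSHA(image)
	return ok && sha == head
}

func main() {
	image := "registry.example.com:5000/instant-worker:master-f0cdbd5"
	fmt.Println(matchesHead(image, "f0cdbd5")) // true
	fmt.Println(matchesHead(image, "7d2ff0d")) // false
}
```

The comparison is exact, so it only works if the deploy workflow and the local `git` produce short SHAs of the same length.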
## When in doubt

The root `/Users/manassrivastava/Documents/InstaNode/CLAUDE.md` covers
shared conventions, agent-reliability rules (1–23), and the four-pass
deploy ritual. **This file is the worker-specific delta only.** When a
rule appears in both, root wins.
