|
| 1 | +--- |
| 2 | +name: db-migrate |
| 3 | +description: Author or review a Drizzle DB migration for zero-downtime safety — expand/contract phasing, backward-compatibility with the deployed app version, and writing the `-- migration-safe` acknowledgment the check:migrations lint requires. Use when adding/editing files under `packages/db/migrations/` or changing `packages/db/schema.ts`. |
| 4 | +--- |
| 5 | + |
| 6 | +# DB Migrate Skill |
| 7 | + |
| 8 | +You make schema changes that survive a deploy without downtime. The `check:migrations` lint (`scripts/check-migrations-safety.ts`) is the deterministic gate; you are the judgment that decides whether a flagged change is actually safe and writes the annotation that satisfies it. |
| 9 | + |
| 10 | +## The window (why this matters) |
| 11 | + |
| 12 | +A deploy runs the migration, then rolls out the new app image via blue/green. The two are **not atomic and cannot be** — during cutover the old task set keeps serving against the **already-migrated** schema. So: |
| 13 | + |
| 14 | +> Every migration must be backward-compatible with the app version that is *already deployed*. |
| 15 | +
|
| 16 | +If a migration drops a column the old code still reads, renames one, or adds a `NOT NULL` the old inserts don't populate, the old code throws until traffic fully shifts — the downtime we're guarding against. You can't fix this by reordering the pipeline; the only fix is discipline. |
| 17 | + |
| 18 | +## Expand / contract |
| 19 | + |
| 20 | +Split every breaking change across **two deploys**: |
| 21 | + |
| 22 | +1. **Expand** (this PR): additive, backward-compatible schema + code that tolerates *both* the old and new shape. |
| 23 | +2. **Contract** (a later PR, after expand is fully deployed): remove the old thing, now that nothing reads it. |
| 24 | + |
| 25 | +Never put expand and contract in the same PR. If this PR both removes the code that used a column *and* drops the column, the old code is still live during cutover — split it. |
| 26 | + |
| 27 | +### Per-operation playbook |
| 28 | + |
| 29 | +| You want to | Do (deploy 1 = expand) | Do (deploy 2 = contract) | |
| 30 | +|---|---|---| |
| 31 | +| Add a required column | `ADD COLUMN` nullable or `DEFAULT`; code writes it | backfill, then `SET NOT NULL` | |
| 32 | +| Rename a column/table | add the new name; code dual-writes / reads new-then-old | drop the old name | |
| 33 | +| Drop a column/table | stop all reads/writes in code; ship it | `DROP` (annotate) | |
| 34 | +| Change a column type | add a new column of the new type; dual-write | backfill, swap reads, drop old | |
| 35 | +| Add FK / CHECK | `ADD CONSTRAINT ... NOT VALID` | `VALIDATE CONSTRAINT` separately | |
| 36 | +| Index an existing table | `COMMIT;` breakpoint → `SET lock_timeout = 0` → `CREATE INDEX CONCURRENTLY IF NOT EXISTS` (see `packages/db/scripts/migrate.ts`) | — | |
| 37 | +| Drop an index | `COMMIT;` breakpoint → `DROP INDEX CONCURRENTLY` — plain `DROP INDEX` takes ACCESS EXCLUSIVE on the table | — | |
| 38 | +| Backfill data | batched + idempotent `UPDATE` (keyset/`WHERE`, bounded) | — | |
| 39 | + |
| 40 | +A `CREATE INDEX`, `ADD COLUMN`, or `ADD CONSTRAINT` against a table **created in the same migration** is always safe (no rows, no live traffic) — the lint already suppresses those. |
| 41 | + |
| 42 | +## Tracking the contract (don't let it rot) |
| 43 | + |
| 44 | +The contract half is deferred to a later deploy — and that is exactly when it gets forgotten, leaving dead columns, orphaned tables, and `NOT NULL`s that never land. Every deferred contract must become a durable, greppable TODO. |
| 45 | + |
| 46 | +When an expand defers a drop, leave a **`contract-pending`** marker on the legacy column/table in `packages/db/schema.ts` — that is the file you will be editing when you finally do the drop, so the reminder lives where the work happens: |
| 47 | + |
| 48 | +```ts |
| 49 | +// contract-pending(after #5035 is fully deployed): drop once permission-check.ts stops reading it |
| 50 | +workspaceId: text('workspace_id'), |
| 51 | +``` |
| 52 | + |
| 53 | +Format: `contract-pending(<precondition>): <what to drop> — <why it's safe once the precondition holds>`. The precondition names the PR/release that removes the last reader and **must be fully deployed** before the contract ships. |
| 54 | + |
| 55 | +- **The TODO list is a grep** — always accurate, never drifts: `grep -rn "contract-pending" packages/db apps/sim`. Run it when starting migration work to see what is owed. |
| 56 | +- For anything with a real owner or schedule, also open a tracking issue and put its number in the marker. |
| 57 | +- **Close the loop in the contract PR:** the contract migration's `-- migration-safe:` annotation references the expand, and you **delete the `contract-pending` marker** in the same PR: |
| 58 | + ```sql |
| 59 | + -- migration-safe: contract of #5035 — workspace_id readers removed there, deployed 2026-06-10 |
| 60 | + ALTER TABLE "permission_group" DROP COLUMN "workspace_id"; |
| 61 | + ``` |
| 62 | +- An expand merged **without** a marker for the drop it defers, or a contract merged **without** removing its marker, is a bug — flag it in review. |
| 63 | + |
| 64 | +## The judgment the lint can't do |
| 65 | + |
| 66 | +The lint flags risky *shapes*; it cannot know whether a given drop is *safe right now*. For each flagged statement, do the work it can't: |
| 67 | + |
| 68 | +1. **Is the dependency gone?** Grep the app for the table/column: search `apps/sim` and `packages` for the column name, the Drizzle field (camelCase), and the table object. If any live read/write remains, it is **not** safe — fix the code first. |
| 69 | +2. **Did the expand already ship?** The removal of that read/write must be in a deploy that is *already out*, not this same PR. If it's in this PR, split: land the code change now, do the destructive migration in a follow-up after it deploys. |
| 70 | +3. **Backfills:** confirm the `UPDATE`/`DELETE` is batched (bounded `WHERE`/keyset, not a single whole-table statement), idempotent (safe to replay — a failed migration re-runs unjournaled files from the top), and safe under concurrent writes from the still-live old app. |
| 71 | + |
| 72 | +## Workflow |
| 73 | + |
| 74 | +1. Edit `packages/db/schema.ts`, then `cd packages/db && bunx drizzle-kit generate` to produce the SQL. If this is an expand that defers a drop, leave a `contract-pending` marker on the legacy column (see "Tracking the contract"). If this is the contract, delete the marker it resolves. |
| 75 | +2. Hand-edit the generated SQL where the playbook requires it: `CONCURRENTLY` + `COMMIT;` breakpoint for indexes on existing tables, `NOT VALID` for constraints, batching for backfills. |
| 76 | +3. Run `bun run check:migrations` (base defaults to `origin/staging`). |
| 77 | + - **Hard errors** (`add-not-null-no-default`, `rename`, `index-not-concurrent`, `constraint-not-valid`, …): rewrite into expand/contract. Do **not** try to annotate them away — the lint won't accept it. |
| 78 | + - **Annotate tier** (`drop-table`, `drop-column`, `drop-default`, `set-not-null`, `alter-type`, `drop-index`): only after you've confirmed steps 1–3 above, add a comment on the line directly above the statement: |
| 79 | + ```sql |
| 80 | + -- migration-safe: `secret` read removed in v0.6.1 (#1234), shipped two deploys ago |
| 81 | + ALTER TABLE "webhook" DROP COLUMN "secret"; |
| 82 | + ``` |
| 83 | + The reason must be specific and name the PR/version that removed the dependency. An empty reason fails the lint. |
| 84 | + - **Warnings** (`data-backfill`): non-blocking, but confirm the batching/idempotency before merging. |
| 85 | +4. Verify locally: `cd packages/db && bun run db:migrate` against a dev DB. |
| 86 | + |
| 87 | +## Hard rule |
| 88 | + |
| 89 | +Never annotate a destructive statement just to make the lint pass. The annotation is a claim that you verified the old code no longer depends on it. If you can't make that claim truthfully, the change belongs in a later deploy — tell the user to split it. |
0 commit comments