|
| 1 | +# Database Deployment Design Decisions |
| 2 | + |
| 3 | +Detailed technical rationale behind the database deployment architecture. Consult this when modifying the pipeline or when the reasoning behind a decision is unclear. |
| 4 | + |
| 5 | +## Why Migrations Run in CI/CD (Not at App Startup) |
| 6 | + |
| 7 | +Four approaches were evaluated: |
| 8 | + |
| 9 | +| Approach | Runs Once? | Blocks Deploy on Failure? | Safe for Multi-Instance? | |
| 10 | +|----------|-----------|--------------------------|-------------------------| |
| 11 | +| CI/CD step in deploy.yml | Yes | Yes | Yes | |
| 12 | +| Cloud Run Job | Yes | Yes | Yes | |
| 13 | +| FastAPI lifespan | No | No | No | |
| 14 | +| Docker entrypoint | No | No | No | |
| 15 | + |
| 16 | +**CI/CD step was chosen** because it's the simplest correct approach for a small team. Cloud Run Jobs would add Terraform complexity (a separate `google_cloud_run_v2_job` resource) for no practical benefit at current scale. |
| 17 | + |
| 18 | +The lifespan and entrypoint approaches are dangerous because Cloud Run may start 2-10 instances simultaneously during a deployment. Each instance would attempt to run migrations concurrently, causing: |
| 19 | +- Lock contention on DDL statements |
| 20 | +- Race conditions where multiple instances try to create the same table |
| 21 | +- Duplicate entries in `alembic_version` |
| 22 | + |
| 23 | +The Alembic maintainer has confirmed that app-startup migrations are a fallback for environments that don't support arbitrary deploy commands — not the preferred approach. |
| 24 | + |
| 25 | +## Why `create_all()` Was Removed |
| 26 | + |
| 27 | +`SQLModel.metadata.create_all()` was previously called in the FastAPI lifespan via `init_db()`. This was removed because: |
| 28 | + |
| 29 | +1. **It conflicts with Alembic.** `create_all()` creates tables that don't exist but does not modify existing tables (no column adds, type changes, renames, or drops). This means schema evolution silently fails — the app starts successfully but with a stale schema. |
| 30 | +2. **It masks missing migrations.** If a developer forgets to generate a migration, `create_all()` might create the table anyway in some environments, hiding the problem until production. |
| 31 | +3. **It has the same multi-instance problem.** Multiple Cloud Run instances calling `create_all()` simultaneously can cause conflicts. |
| 32 | + |
| 33 | +With `create_all()` removed, a missing migration causes an immediate, visible failure — the app tries to query a table or column that doesn't exist, rather than silently operating against a partial schema. |
| 34 | + |
| 35 | +## Why RLS Policies Are Not in Alembic |
| 36 | + |
| 37 | +### The Core Insight: RLS Doesn't Protect This API |
| 38 | + |
| 39 | +The API connects to Supabase as the `postgres` user. In Supabase, `postgres` has the `BYPASSRLS` privilege. Every query from the FastAPI app and Modal workers bypasses RLS entirely. |
| 40 | + |
| 41 | +RLS policies only take effect when access goes through Supabase's PostgREST proxy — i.e., when using the Supabase client library with `anon` or `authenticated` JWT keys. The API uses the Supabase client **only for storage** (uploading/downloading HDF5 dataset files), not for table queries. |
| 42 | + |
| 43 | +### Why init.py Is the Right Home |
| 44 | + |
| 45 | +Given that RLS is defense-in-depth only: |
| 46 | + |
| 47 | +1. **Idempotency is a feature, not a limitation.** The `DROP POLICY IF EXISTS` + `CREATE POLICY` pattern can be re-run safely. Alembic migrations run once and are tracked — re-running RLS setup is actually desirable for configuration. |
| 48 | +2. **No tooling benefit from Alembic.** `alembic revision --autogenerate` cannot detect RLS changes. Every policy would be manual `op.execute()` raw SQL — identical to what `init.py` already does. |
| 49 | +3. **Supabase's own tooling has RLS gaps.** `supabase db diff` cannot track `ALTER POLICY` statements. Even Supabase doesn't have a clean migration story for RLS. |
| 50 | +4. **Simpler table creation.** Adding a new table doesn't require a separate migration file just for its RLS policy. |
| 51 | + |
| 52 | +### When to Reconsider |
| 53 | + |
| 54 | +Move RLS into Alembic if the architecture changes such that: |
| 55 | +- The API connects as a non-superuser role (not `postgres`) |
| 56 | +- RLS becomes the primary access control mechanism (not just defense-in-depth) |
| 57 | +- Tables need different policies at different points in time (versioned evolution) |
| 58 | + |
| 59 | +In that case, include RLS policies in the same Alembic migration that creates the table: |
| 60 | + |
| 61 | +```python |
| 62 | +def upgrade(): |
| 63 | + op.create_table("some_table", ...) |
| 64 | + op.execute("ALTER TABLE some_table ENABLE ROW LEVEL SECURITY") |
| 65 | + op.execute(""" |
| 66 | + CREATE POLICY "Service role full access" ON some_table |
| 67 | + FOR ALL TO service_role USING (true) WITH CHECK (true) |
| 68 | + """) |
| 69 | + |
| 70 | +def downgrade(): |
| 71 | + op.execute('DROP POLICY IF EXISTS "Service role full access" ON some_table') |
| 72 | + op.drop_table("some_table") |
| 73 | +``` |
| 74 | + |
| 75 | +## Connection Types for Supabase |
| 76 | + |
| 77 | +Supabase exposes two connection endpoints. Use the correct one for each operation: |
| 78 | + |
| 79 | +| Operation | Connection | Port | Why | |
| 80 | +|-----------|-----------|------|-----| |
| 81 | +| Alembic migrations | Direct (`db.project.supabase.co`) | 5432 | DDL needs full Postgres features | |
| 82 | +| FastAPI application | Pooler (`pooler.supabase.com`) | 6543 | Efficient connection reuse | |
| 83 | +| `scripts/init.py` | Pooler | 6543 | Mostly DML (RLS policies are DDL but lightweight) | |
| 84 | + |
| 85 | +The direct connection uses IPv6 by default. GitHub Actions runners support IPv6, so this works for CI/CD. |
| 86 | + |
| 87 | +`alembic/env.py` uses `NullPool` because Supabase manages connection pooling on its side. Creating a client-side pool on top of server-side pooling wastes connections. |
| 88 | + |
| 89 | +## Zero-Downtime Migration Pattern |
| 90 | + |
| 91 | +Cloud Run uses rolling updates: during deployment, both old and new revisions serve traffic simultaneously. Schema changes must be backwards-compatible with the old code. |
| 92 | + |
| 93 | +### Additive Changes (Safe in One Deploy) |
| 94 | + |
| 95 | +- Adding a new column (nullable or with a default) |
| 96 | +- Adding a new table |
| 97 | +- Adding an index |
| 98 | + |
| 99 | +These are safe because old code simply ignores the new column/table. |
| 100 | + |
| 101 | +### Destructive Changes (Require Expand-Contract) |
| 102 | + |
| 103 | +Dropping columns, renaming columns, or changing column types require three separate deployments: |
| 104 | + |
| 105 | +**Deploy 1 — Expand:** Add the new column/table. Old code ignores it. |
| 106 | + |
| 107 | +```python |
| 108 | +def upgrade(): |
| 109 | + op.add_column("users", sa.Column("full_name", sa.String(), nullable=True)) |
| 110 | +``` |
| 111 | + |
| 112 | +**Deploy 2 — Migrate:** New code writes to both old and new columns. Run a backfill for existing data. |
| 113 | + |
| 114 | +**Deploy 3 — Contract:** Drop the old column after all instances use the new code. |
| 115 | + |
| 116 | +```python |
| 117 | +def upgrade(): |
| 118 | + op.drop_column("users", "first_name") |
| 119 | + op.drop_column("users", "last_name") |
| 120 | +``` |
| 121 | + |
| 122 | +### Index Creation on Large Tables |
| 123 | + |
| 124 | +Standard `CREATE INDEX` acquires an exclusive lock, blocking all reads and writes. For large tables, use `CREATE INDEX CONCURRENTLY`: |
| 125 | + |
| 126 | +```python |
| 127 | +def upgrade(): |
| 128 | + op.execute("CREATE INDEX CONCURRENTLY idx_users_email ON users (email)") |
| 129 | +``` |
| 130 | + |
| 131 | +Note: `CONCURRENTLY` cannot run inside a transaction. Use `op.get_context().autocommit_block()` or set the migration to non-transactional. |
| 132 | + |
| 133 | +## Anti-Patterns to Avoid |
| 134 | + |
| 135 | +### Editing Applied Migration Files |
| 136 | + |
| 137 | +Never modify a migration that has already been applied to production. Alembic tracks migrations by revision ID — editing a file does not re-run it. Create a new migration to fix issues. |
| 138 | + |
| 139 | +### Mixing Schema and Data Migrations |
| 140 | + |
| 141 | +A single migration that both alters the schema and backfills millions of rows holds locks for extended periods. Schema changes should be deploy-time migrations. Data backfills should be separate scripts or background jobs. |
| 142 | + |
| 143 | +### Running Migrations Without lock_timeout |
| 144 | + |
| 145 | +Without `lock_timeout`, a migration waits indefinitely for a lock on a table with long-running queries. All new queries queue behind the migration's lock request, cascading into a full outage. The 5-second timeout in `alembic/env.py` ensures the migration fails fast instead. |
| 146 | + |
| 147 | +### Using the Pooler for Migrations |
| 148 | + |
| 149 | +Transaction-mode pooling (port 6543) is incompatible with DDL statements and prepared statements. Always use the direct connection (port 5432) for Alembic operations. The `deploy.yml` workflow uses `secrets.SUPABASE_DB_URL` (direct), while `db-reset.yml` uses `secrets.SUPABASE_POOLER_URL` — note this difference if modifying the workflows. |
0 commit comments