PolicyEngine
diff --git a/.claude/skills/database-deployment-pipeline/SKILL.md b/.claude/skills/database-deployment-pipeline/SKILL.md
Lines changed: 105 additions & 0 deletions
diff --git a/.claude/skills/database-deployment-pipeline/references/design-decisions.md b/.claude/skills/database-deployment-pipeline/references/design-decisions.md
Lines changed: 149 additions & 0 deletions
diff --git a/.claude/skills/database-migrations.md b/.claude/skills/database-migrations.md
Lines changed: 5 additions & 60 deletions
@@ -0,0 +1,105 @@
+---
+name: Database Deployment Pipeline
+description: >
+  This skill should be used when the user asks about "database deployment",
+  "how migrations run in production", "deploy.yml database step", "db-reset workflow",
+  "why no create_all", "RLS policies deployment", "alembic in CI/CD",
+  "lock_timeout", "seeding production", "init.py vs alembic",
+  or needs to understand how schema changes reach the production Supabase database.
+  Also relevant when modifying deploy.yml, db-reset.yml, alembic/env.py, or scripts/init.py.
+version: 0.1.0
+---
+
+# Database Deployment Pipeline
+
+This skill covers how database schema changes are deployed to the production Supabase (Postgres) database in the PolicyEngine API v2 project, and the design rationale behind the current architecture.
+
+## Pipeline Overview
+
+Two GitHub Actions workflows touch the production database. They serve different purposes and should never be confused:
+
+| Workflow | Trigger | Effect | Destructive? |
+|----------|---------|--------|--------------|
+| `deploy.yml` | Merge to `main` (automatic) | Applies pending Alembic migrations | No |
+| `db-reset.yml` | Manual dispatch only | Drops all tables, recreates, reseeds | Yes |
+
+### deploy.yml — The Standard Path
+
+On every merge to `main`, the deploy job runs `alembic upgrade head` against the production Supabase database **before** building the Docker image or updating Cloud Run. This ordering is critical: if the migration fails, the deploy stops and the old code continues running against the old schema.
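The ordering described above can be modelled as a fail-fast sequence. This is only an illustrative Python sketch, not the contents of `deploy.yml`: the `docker` and `gcloud` arguments are hypothetical placeholders, and `run_steps` and `run_in_shell` are names invented here.

```python
import subprocess
from typing import Callable, List, Sequence

Step = Sequence[str]

# Migration first. The build and deploy commands are hypothetical placeholders.
DEPLOY_STEPS: List[Step] = [
    ["uv", "run", "alembic", "upgrade", "head"],
    ["docker", "build", "-t", "api-image", "."],
    ["gcloud", "run", "deploy", "api-service"],
]


def run_in_shell(step: Step) -> int:
    """Run one command and return its exit code."""
    return subprocess.run(list(step)).returncode


def run_steps(steps: Sequence[Step],
              run: Callable[[Step], int] = run_in_shell) -> List[Step]:
    """Run steps in order and stop at the first failure.

    Returns the steps that completed successfully, so a failed migration
    leaves the build and deploy steps unexecuted.
    """
    completed: List[Step] = []
    for step in steps:
        if run(step) != 0:
            break
        completed.append(step)
    return completed
```

Passing a fake `run` callable makes it easy to check that nothing after a failed migration step is executed.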
+
+The migration step uses `secrets.SUPABASE_DB_URL` (the direct connection on port 5432, not the pooler on port 6543) because transaction-mode connection pooling does not reliably support the session-level features migrations depend on, such as the `lock_timeout` setting and prepared statements.
+
+The workflow also triggers on changes under `alembic/**`, so migration-only PRs still deploy.
+
+### db-reset.yml — The Nuclear Option
+
+A manual-only workflow that drops the entire `public` schema, recreates it via Alembic, applies RLS policies, and reseeds data. Requires typing `reset-prod` to confirm and has a `production` environment approval gate.
+
+Use this only when the database needs to be rebuilt from scratch (e.g., after a major schema redesign or data corruption). Never use it for routine schema changes.
+
+## Key Design Decisions
+
+### Migrations Run in CI/CD, Not at App Startup
+
+The FastAPI app does **not** run migrations on startup. There is no `create_all()` or `alembic upgrade head` in the lifespan. This is intentional:
+
+- Cloud Run may start multiple instances simultaneously. Concurrent DDL causes lock contention, race conditions, or duplicate migration attempts.
+- Running migrations in CI/CD means they execute exactly once per deployment, from a single runner.
+- If a migration fails, the deployment stops cleanly — no partially-migrated production state.
+
+### RLS Policies Stay in scripts/init.py, Not Alembic
+
+RLS policies are applied via `scripts/init.py` as idempotent raw SQL, not through Alembic migrations. This is intentional for this project's architecture:
+
+- The API connects as the `postgres` role via SQLAlchemy. In Supabase this role has the `BYPASSRLS` privilege, so every API query **bypasses RLS entirely**. RLS only protects the Supabase PostgREST surface (anon/authenticated roles).
+- Since RLS is defense-in-depth (not load-bearing), coupling it to schema migrations adds complexity with no runtime benefit.
+- The idempotent `DROP POLICY IF EXISTS` + `CREATE POLICY` pattern in `init.py` can be re-run safely — a property that's useful for configuration but awkward in Alembic's run-once model.
+- Alembic cannot autogenerate RLS changes, so every policy would be manual `op.execute()` SQL anyway.
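The idempotent pattern in the bullets above can be sketched as a small statement generator. This is a minimal sketch, not the actual code in `scripts/init.py`: `policy_statements` is a hypothetical helper, and the policy name and role are borrowed from the example in `references/design-decisions.md`.

```python
from typing import List


def policy_statements(table: str,
                      policy: str = "Service role full access",
                      role: str = "service_role") -> List[str]:
    """Return SQL that is safe to re-run: drop the policy if present, then recreate it.

    Identifiers are interpolated directly, so only pass trusted, hard-coded names.
    """
    return [
        f"ALTER TABLE {table} ENABLE ROW LEVEL SECURITY",
        f'DROP POLICY IF EXISTS "{policy}" ON {table}',
        f'CREATE POLICY "{policy}" ON {table} '
        f"FOR ALL TO {role} USING (true) WITH CHECK (true)",
    ]
```

Running the same statements twice leaves the database in the same state, which is what makes this approach suitable for configuration that is reapplied on every reset.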
+
+If the project ever moves to an architecture where RLS is load-bearing (e.g., connecting as a non-superuser role), RLS policies should move into Alembic migrations alongside the table they protect. See `references/design-decisions.md` for the full rationale.
+
+### lock_timeout Prevents Cascading Outages
+
+`alembic/env.py` sets `lock_timeout=5000` (5 seconds) on the migration connection. Without it, a migration that cannot acquire a lock waits indefinitely, and new queries on the same table queue behind its pending lock request, which can cascade into a full outage. With the timeout, the migration fails fast and the deploy stops cleanly.
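One way such a timeout could be wired up is through libpq startup options. This is a sketch under assumptions: the real `alembic/env.py` may set the timeout differently (for example with a `SET` statement), and both function names here are hypothetical.

```python
def migration_connect_args(lock_timeout_ms: int = 5000) -> dict:
    """Build driver connect args that apply lock_timeout for the whole session."""
    return {"options": f"-c lock_timeout={lock_timeout_ms}"}


def make_migration_engine(url: str):
    """Create an engine with no client-side pool and a lock timeout (assumed setup)."""
    # Imported lazily so the helper above stays usable without SQLAlchemy installed.
    from sqlalchemy import create_engine
    from sqlalchemy.pool import NullPool

    return create_engine(
        url,
        poolclass=NullPool,
        connect_args=migration_connect_args(),
    )
```

With this in place, a DDL statement that cannot get its lock within 5 seconds raises an error instead of waiting, so the migration step exits non-zero.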
+
+## Common Operations
+
+### Adding a new table
+
+1. Create the SQLModel model in `src/policyengine_api/models/`
+2. Export it in `models/__init__.py`
+3. Generate migration: `uv run alembic revision --autogenerate -m "Add table_name table"`
+4. **Read and verify** the generated migration file
+5. Apply locally: `uv run alembic upgrade head`
+6. If the table needs RLS protection on the PostgREST surface, add policies to `scripts/init.py`
+7. Merge to `main` — migration runs automatically in `deploy.yml`
+
+### Making a destructive schema change
+
+Cloud Run uses rolling updates — old and new revisions serve traffic simultaneously. Destructive changes (drops, renames, type changes) require the expand-contract pattern across multiple deployments. See `references/design-decisions.md` for details.
+
+### Full production database reset
+
+1. Go to GitHub Actions > "Reset production database"
+2. Select seeding mode (`lite` or `full`)
+3. Type `reset-prod` in the confirmation field
+4. Wait for the `production` environment approval
+5. The workflow drops the schema, runs all migrations, applies RLS, and seeds data
+
+## File Map
+
+| File | Purpose |
+|------|---------|
+| `.github/workflows/deploy.yml` | Runs `alembic upgrade head` before Cloud Run update |
+| `.github/workflows/db-reset.yml` | Manual destructive reset + reseed |
+| `alembic/env.py` | Alembic config: reads `DATABASE_URL`, sets `NullPool` + `lock_timeout` |
+| `alembic/versions/` | Migration files (source of truth for schema) |
+| `scripts/init.py` | RLS policies, storage bucket setup, calls `alembic upgrade head` |
+| `scripts/seed.py` | Populates models, variables, parameters, datasets |
+| `src/policyengine_api/services/database.py` | SQLAlchemy engine + session factory (no `create_all`) |
+
+## Additional Resources
+
+### Reference Files
+
+- **`references/design-decisions.md`** — Full rationale for RLS placement, connection types, zero-downtime migrations, and anti-patterns to avoid
@@ -0,0 +1,149 @@
+# Database Deployment Design Decisions
+
+Detailed technical rationale behind the database deployment architecture. Consult this when modifying the pipeline or when the reasoning behind a decision is unclear.
+
+## Why Migrations Run in CI/CD (Not at App Startup)
+
+Four approaches were evaluated:
+
+| Approach | Runs Once? | Blocks Deploy on Failure? | Safe for Multi-Instance? |
+|----------|-----------|--------------------------|-------------------------|
+| CI/CD step in deploy.yml | Yes | Yes | Yes |
+| Cloud Run Job | Yes | Yes | Yes |
+| FastAPI lifespan | No | No | No |
+| Docker entrypoint | No | No | No |
+
+**CI/CD step was chosen** because it's the simplest correct approach for a small team. Cloud Run Jobs would add Terraform complexity (a separate `google_cloud_run_v2_job` resource) for no practical benefit at current scale.
+
+The lifespan and entrypoint approaches are dangerous because Cloud Run may start 2-10 instances simultaneously during a deployment. Each instance would attempt to run migrations concurrently, causing:
+- Lock contention on DDL statements
+- Race conditions where multiple instances try to create the same table
+- Duplicate entries in `alembic_version`
+
+The Alembic maintainer has confirmed that app-startup migrations are a fallback for environments that don't support arbitrary deploy commands — not the preferred approach.
+
+## Why `create_all()` Was Removed
+
+`SQLModel.metadata.create_all()` was previously called in the FastAPI lifespan via `init_db()`. This was removed because:
+
+1. **It conflicts with Alembic.** `create_all()` creates tables that don't exist but does not modify existing tables (no column adds, type changes, renames, or drops). This means schema evolution silently fails — the app starts successfully but with a stale schema.
+2. **It masks missing migrations.** If a developer forgets to generate a migration, `create_all()` might create the table anyway in some environments, hiding the problem until production.
+3. **It has the same multi-instance problem.** Multiple Cloud Run instances calling `create_all()` simultaneously can cause conflicts.
+
+With `create_all()` removed, a missing migration causes an immediate, visible failure — the app tries to query a table or column that doesn't exist, rather than silently operating against a partial schema.
+
+## Why RLS Policies Are Not in Alembic
+
+### The Core Insight: RLS Doesn't Protect This API
+
+The API connects to Supabase as the `postgres` user. In Supabase, `postgres` has the `BYPASSRLS` privilege. Every query from the FastAPI app and Modal workers bypasses RLS entirely.
+
+RLS policies only take effect when access goes through Supabase's PostgREST proxy — i.e., when using the Supabase client library with `anon` or `authenticated` JWT keys. The API uses the Supabase client **only for storage** (uploading/downloading HDF5 dataset files), not for table queries.
+
+### Why init.py Is the Right Home
+
+Given that RLS is defense-in-depth only:
+
+1. **Idempotency is a feature, not a limitation.** The `DROP POLICY IF EXISTS` + `CREATE POLICY` pattern can be re-run safely. Alembic migrations run once and are tracked — re-running RLS setup is actually desirable for configuration.
+2. **No tooling benefit from Alembic.** `alembic revision --autogenerate` cannot detect RLS changes. Every policy would be manual `op.execute()` raw SQL — identical to what `init.py` already does.
+3. **Supabase's own tooling has RLS gaps.** `supabase db diff` cannot track `ALTER POLICY` statements. Even Supabase doesn't have a clean migration story for RLS.
+4. **Simpler table creation.** Adding a new table doesn't require a separate migration file just for its RLS policy.
+
+### When to Reconsider
+
+Move RLS into Alembic if the architecture changes such that:
+- The API connects as a non-superuser role (not `postgres`)
+- RLS becomes the primary access control mechanism (not just defense-in-depth)
+- Tables need different policies at different points in time (versioned evolution)
+
+In that case, include RLS policies in the same Alembic migration that creates the table:
+
+```python
+def upgrade():
+    op.create_table("some_table", ...)
+    op.execute("ALTER TABLE some_table ENABLE ROW LEVEL SECURITY")
+    op.execute("""
+        CREATE POLICY "Service role full access" ON some_table
+        FOR ALL TO service_role USING (true) WITH CHECK (true)
+    """)
+
+def downgrade():
+    op.execute('DROP POLICY IF EXISTS "Service role full access" ON some_table')
+    op.drop_table("some_table")
+```
+
+## Connection Types for Supabase
+
+Supabase exposes two connection endpoints. Use the correct one for each operation:
+
+| Operation | Connection | Port | Why |
+|-----------|-----------|------|-----|
+| Alembic migrations | Direct (`db.project.supabase.co`) | 5432 | DDL needs full Postgres features |
+| FastAPI application | Pooler (`pooler.supabase.com`) | 6543 | Efficient connection reuse |
+| `scripts/init.py` | Pooler | 6543 | Mostly DML (RLS policies are DDL but lightweight) |
+
+The direct connection uses IPv6 by default. GitHub Actions runners support IPv6, so this works for CI/CD.
+
+`alembic/env.py` uses `NullPool` because Supabase manages connection pooling on its side. Creating a client-side pool on top of server-side pooling wastes connections.
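A guard like the following could catch a pooler URL before any migration runs. It is a hypothetical helper, not something the project is known to contain. It relies only on the port numbers from the table above.

```python
from urllib.parse import urlsplit

DIRECT_PORT = 5432  # direct Postgres connection
POOLER_PORT = 6543  # transaction-mode pooler


def require_direct_connection(db_url: str) -> str:
    """Return db_url unchanged, or raise if it points at the pooler port."""
    port = urlsplit(db_url).port
    if port == POOLER_PORT:
        raise ValueError(
            f"Migrations must use the direct connection (port {DIRECT_PORT}), "
            f"not the pooler (port {POOLER_PORT})"
        )
    return db_url
```

Calling it at the top of the migration entry point turns a misconfigured secret into an immediate, readable error.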
+
+## Zero-Downtime Migration Pattern
+
+Cloud Run uses rolling updates: during deployment, both old and new revisions serve traffic simultaneously. Schema changes must be backwards-compatible with the old code.
+
+### Additive Changes (Safe in One Deploy)
+
+- Adding a new column (nullable or with a default)
+- Adding a new table
+- Adding an index
+
+These are safe because old code simply ignores the new column/table.
+
+### Destructive Changes (Require Expand-Contract)
+
+Dropping columns, renaming columns, or changing column types require three separate deployments:
+
+**Deploy 1 — Expand:** Add the new column/table. Old code ignores it.
+
+```python
+def upgrade():
+    op.add_column("users", sa.Column("full_name", sa.String(), nullable=True))
+```
+
+**Deploy 2 — Migrate:** New code writes to both old and new columns. Run a backfill for existing data.
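The backfill in this step can run in small batches so each transaction stays short. The sketch below uses an in-memory-friendly `sqlite3` connection as a stand-in for Postgres. It assumes a `users` table with an `id` primary key alongside the `first_name`, `last_name`, and `full_name` columns from the surrounding examples, and `backfill_full_name` is a hypothetical name.

```python
import sqlite3


def backfill_full_name(conn: sqlite3.Connection, batch_size: int = 1000) -> int:
    """Fill users.full_name in small batches and return the number of rows updated."""
    total = 0
    while True:
        cursor = conn.execute(
            """
            UPDATE users
            SET full_name = TRIM(COALESCE(first_name, '') || ' ' || COALESCE(last_name, ''))
            WHERE id IN (SELECT id FROM users WHERE full_name IS NULL LIMIT ?)
            """,
            (batch_size,),
        )
        conn.commit()  # commit per batch so locks are held only briefly
        if cursor.rowcount == 0:
            return total
        total += cursor.rowcount
```

`COALESCE` ensures every processed row ends up non-NULL, so the loop always terminates even when one of the name parts is missing.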
+
+**Deploy 3 — Contract:** Drop the old column after all instances use the new code.
+
+```python
+def upgrade():
+    op.drop_column("users", "first_name")
+    op.drop_column("users", "last_name")
+```
+
+### Index Creation on Large Tables
+
+A standard `CREATE INDEX` takes a `SHARE` lock on the table: reads continue, but all writes (inserts, updates, deletes) block until the index build finishes. For large tables, use `CREATE INDEX CONCURRENTLY`:
+
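+```python
+def upgrade():
+    # CONCURRENTLY cannot run inside a transaction, so step outside Alembic's.
+    with op.get_context().autocommit_block():
+        op.execute("CREATE INDEX CONCURRENTLY idx_users_email ON users (email)")
+```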
+
+Note: `CONCURRENTLY` cannot run inside a transaction. Use `op.get_context().autocommit_block()` or set the migration to non-transactional.
+
+## Anti-Patterns to Avoid
+
+### Editing Applied Migration Files
+
+Never modify a migration that has already been applied to production. Alembic tracks migrations by revision ID — editing a file does not re-run it. Create a new migration to fix issues.
+
+### Mixing Schema and Data Migrations
+
+A single migration that both alters the schema and backfills millions of rows holds locks for extended periods. Schema changes should be deploy-time migrations. Data backfills should be separate scripts or background jobs.
+
+### Running Migrations Without lock_timeout
+
+Without `lock_timeout`, a migration waits indefinitely for a lock on a table that has long-running queries. New queries on that table then queue behind the migration's pending lock request, which can cascade into a full outage. The 5-second timeout in `alembic/env.py` ensures the migration fails fast instead.
+
+### Using the Pooler for Migrations
+
+Transaction-mode pooling (port 6543) does not reliably support session-level settings or prepared statements, both of which migrations may rely on. Always use the direct connection (port 5432) for Alembic operations. The `deploy.yml` workflow uses `secrets.SUPABASE_DB_URL` (direct), while `db-reset.yml` uses `secrets.SUPABASE_POOLER_URL`. Keep this difference in mind when modifying the workflows.
@@ -36,9 +36,9 @@ This project uses **Alembic** for database migrations with **SQLModel** models.
 
 ## Essential Rules
 
-### 1. NEVER use SQLModel.metadata.create_all() for schema creation
+### 1. NEVER use SQLModel.metadata.create_all()
 
-The old pattern of using `SQLModel.metadata.create_all()` is deprecated. All tables are created via Alembic migrations.
+`create_all()` is not used anywhere in this project. It was removed because it conflicts with Alembic (creates tables but can't modify them, masking missing migrations). For how migrations reach production, see the `database-deployment-pipeline` skill.
 
 ### 2. Every schema change requires a migration
 
@@ -200,66 +200,11 @@ uv run alembic upgrade head
 
 ## Production Considerations
 
-### Applying migrations to production
+Migrations are automatically applied in `deploy.yml` (runs `alembic upgrade head` before updating Cloud Run). For full details on the production pipeline, connection types, lock_timeout, RLS policy handling, and zero-downtime patterns, see the `database-deployment-pipeline` skill.
 
-1. Migrations are automatically applied when deploying
-2. Always test migrations locally first
-3. For data migrations, consider running during low-traffic periods
+### alembic stamp (for one-time transitions)
 
-### Transitioning production from old system to Alembic
-
-Production databases that were created before Alembic (using the old `SQLModel.metadata.create_all()` approach or raw Supabase migrations) need special handling. Running `alembic upgrade head` would fail because the tables already exist.
-
-**The solution: `alembic stamp`**
-
-The `alembic stamp` command marks a migration as "already applied" without actually running it. This tells Alembic "the database is already at this state, start tracking from here."
-
-**How it works:**
-
-1. `alembic stamp <revision_id>` inserts a row into the `alembic_version` table with the specified revision ID
-2. Alembic now thinks that migration (and all migrations before it) have been applied
-3. Future migrations will run normally starting from that point
-
-**Step-by-step production transition:**
-
-```bash
-# 1. Connect to production database
-# (set SUPABASE_DB_URL or other connection env vars)
-
-# 2. Check if alembic_version table exists
-# If not, Alembic will create it automatically
-
-# 3. Verify production schema matches the initial migration
-# Compare tables/columns in production against alembic/versions/20260204_d6e30d3b834d_initial_schema.py
-
-# 4. Stamp the initial migration as applied
-uv run alembic stamp d6e30d3b834d
-
-# 5. If production also has the indexes from the second migration, stamp that too
-uv run alembic stamp a17ac554f4aa
-
-# 6. Verify the stamp worked
-uv run alembic current
-# Should show: a17ac554f4aa (head)
-
-# 7. From now on, new migrations will apply normally
-uv run alembic upgrade head
-```
-
-**Handling partially applied migrations:**
-
-If production has some but not all changes from a migration:
-
-1. Manually apply the missing changes via SQL
-2. Then stamp that migration as complete
-3. Or: create a new migration that only adds the missing pieces
-
-**After stamping:**
-
-- All future schema changes go through Alembic migrations
-- Developers generate migrations with `alembic revision --autogenerate`
-- Deployments run `alembic upgrade head` to apply pending migrations
-- The `alembic_version` table tracks what's been applied
+If production has tables that predate Alembic, use `alembic stamp <revision_id>` to mark migrations as already applied without running them. This tells Alembic to start tracking from that point forward.
 
 ## File Structure