Commit 3403a75

Merge pull request #107 from PolicyEngine/feat/scaffolding
CI/CD: Staging pipeline, release automation, and PR quality checks
2 parents 2099677 + 271629e commit 3403a75

49 files changed

Lines changed: 2809 additions & 218 deletions

Lines changed: 105 additions & 0 deletions
@@ -0,0 +1,105 @@
1+
---
2+
name: Database Deployment Pipeline
3+
description: >
4+
This skill should be used when the user asks about "database deployment",
5+
"how migrations run in production", "deploy.yml database step", "db-reset workflow",
6+
"why no create_all", "RLS policies deployment", "alembic in CI/CD",
7+
"lock_timeout", "seeding production", "init.py vs alembic",
8+
or needs to understand how schema changes reach the production Supabase database.
9+
Also relevant when modifying deploy.yml, db-reset.yml, alembic/env.py, or scripts/init.py.
10+
version: 0.1.0
11+
---
12+
13+
# Database Deployment Pipeline
14+
15+
This skill covers how database schema changes are deployed to the production Supabase (Postgres) database in the PolicyEngine API v2 project, and the design rationale behind the current architecture.
16+
17+
## Pipeline Overview
18+
19+
Two GitHub Actions workflows touch the production database. They serve different purposes and should never be confused:
20+
21+
| Workflow | Trigger | Effect | Destructive? |
|----------|---------|--------|--------------|
| `deploy.yml` | Merge to `main` (automatic) | Applies pending Alembic migrations | No |
| `db-reset.yml` | Manual dispatch only | Drops all tables, recreates, reseeds | Yes |
25+
26+
### deploy.yml — The Standard Path
27+
28+
On every merge to `main`, the deploy job runs `alembic upgrade head` against the production Supabase database **before** building the Docker image or updating Cloud Run. This ordering is critical: if the migration fails, the deploy stops and the old code continues running against the old schema.
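The fail-fast ordering described above can be sketched in plain Python. This is only an illustration of the control flow, not the contents of `deploy.yml`: the step names and the `run_in_order` and `record` helpers are hypothetical. In the real workflow, GitHub Actions provides this behaviour by default, because a failed step skips the steps after it.

```python
from typing import Callable, List, Sequence, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_in_order(steps: Sequence[Step]) -> List[str]:
    """Run steps in order and stop at the first failure.

    Returns the names of the steps that completed successfully.
    """
    completed: List[str] = []
    for name, step in steps:
        if not step():
            break  # later steps (build, deploy) never run
        completed.append(name)
    return completed


# Record which steps were actually invoked (hypothetical step names).
calls: List[str] = []


def record(name: str, ok: bool) -> Callable[[], bool]:
    def step() -> bool:
        calls.append(name)
        return ok
    return step


# A failing migration prevents the image build and the Cloud Run update.
result = run_in_order([
    ("migrate", record("migrate", ok=False)),
    ("build_image", record("build_image", ok=True)),
    ("update_cloud_run", record("update_cloud_run", ok=True)),
])
```

Because the migration is the first step, a failure leaves the previous revision serving traffic against the unchanged schema.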
29+
30+
The migration step uses `secrets.SUPABASE_DB_URL` (the direct connection on port 5432, not the pooler on port 6543) because DDL statements are incompatible with transaction-mode connection pooling.
31+
32+
The deploy is also triggered by changes to `alembic/**`, ensuring migration-only PRs trigger a deploy.
33+
34+
### db-reset.yml — The Nuclear Option
35+
36+
A manual-only workflow that drops the entire `public` schema, recreates it via Alembic, applies RLS policies, and reseeds data. Requires typing `reset-prod` to confirm and has a `production` environment approval gate.
37+
38+
Use this only when the database needs to be rebuilt from scratch (e.g., after a major schema redesign or data corruption). Never use it for routine schema changes.
39+
40+
## Key Design Decisions
41+
42+
### Migrations Run in CI/CD, Not at App Startup
43+
44+
The FastAPI app does **not** run migrations on startup. There is no `create_all()` or `alembic upgrade head` in the lifespan. This is intentional:
45+
46+
- Cloud Run may start multiple instances simultaneously. Concurrent DDL causes lock contention, race conditions, or duplicate migration attempts.
47+
- Running migrations in CI/CD means they execute exactly once per deployment, from a single runner.
48+
- If a migration fails, the deployment stops cleanly — no partially-migrated production state.
49+
50+
### RLS Policies Stay in scripts/init.py, Not Alembic
51+
52+
RLS policies are applied via `scripts/init.py` as idempotent raw SQL, not through Alembic migrations. This is intentional for this project's architecture:
53+
54+
- The API connects as the `postgres` superuser via SQLAlchemy, which **bypasses RLS entirely**. RLS only protects the Supabase PostgREST surface (anon/authenticated roles).
55+
- Since RLS is defense-in-depth (not load-bearing), coupling it to schema migrations adds complexity with no runtime benefit.
56+
- The idempotent `DROP POLICY IF EXISTS` + `CREATE POLICY` pattern in `init.py` can be re-run safely — a property that's useful for configuration but awkward in Alembic's run-once model.
57+
- Alembic cannot autogenerate RLS changes, so every policy would be manual `op.execute()` SQL anyway.
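A minimal sketch of the idempotent pattern described in the list above, assuming a hypothetical helper named `idempotent_policy_sql`. It is not the code in `scripts/init.py`; the policy name and role are borrowed from the example in `references/design-decisions.md`. The helper only builds the SQL strings, so it can be inspected without a database.

```python
from typing import List


def idempotent_policy_sql(
    table: str,
    policy: str,
    role: str,
    using: str = "true",
    check: str = "true",
) -> List[str]:
    """Build statements that can be re-run safely against the same table.

    Identifiers are interpolated directly, so pass trusted constants only.
    """
    return [
        # Enabling RLS on a table that already has it enabled is a no-op.
        f"ALTER TABLE {table} ENABLE ROW LEVEL SECURITY",
        # Drop-then-create makes the policy definition converge on re-run.
        f'DROP POLICY IF EXISTS "{policy}" ON {table}',
        f'CREATE POLICY "{policy}" ON {table} '
        f"FOR ALL TO {role} USING ({using}) WITH CHECK ({check})",
    ]


statements = idempotent_policy_sql(
    "some_table", "Service role full access", "service_role"
)
```

Running the three statements twice leaves the table in the same state as running them once, which is the property that Alembic's run-once model does not need and does not reward.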
58+
59+
If the project ever moves to an architecture where RLS is load-bearing (e.g., connecting as a non-superuser role), RLS policies should move into Alembic migrations alongside the table they protect. See `references/design-decisions.md` for the full rationale.
60+
61+
### lock_timeout Prevents Cascading Outages
62+
63+
`alembic/env.py` sets `lock_timeout=5000` (5 seconds) on the migration connection. Without this, a migration that can't acquire a lock waits indefinitely, and every new query on the same table queues behind the pending lock request, which can cascade into a full outage. With the timeout, the migration fails fast and the deploy stops cleanly.
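One way to apply such a timeout is through the driver's connection options. The sketch below is an assumption about how this could be done, not a copy of `alembic/env.py`: the helper names are hypothetical, and it assumes a libpq-based driver such as psycopg2, which accepts an `options` connect argument. `lock_timeout` is interpreted in milliseconds when no unit is given.

```python
from typing import Dict


def lock_timeout_connect_args(timeout_ms: int = 5000) -> Dict[str, str]:
    """Return connect_args that set lock_timeout for the whole session."""
    return {"options": f"-c lock_timeout={timeout_ms}"}


def make_migration_engine(database_url: str):
    """Create a short-lived engine for migrations (hypothetical helper)."""
    # Imported lazily so the pure helper above works without SQLAlchemy.
    from sqlalchemy import create_engine
    from sqlalchemy.pool import NullPool

    return create_engine(
        database_url,
        poolclass=NullPool,  # no client-side pool for a one-off migration run
        connect_args=lock_timeout_connect_args(),
    )
```

An alternative with the same effect is to execute `SET lock_timeout = '5s'` on the connection before running migrations.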
64+
65+
## Common Operations
66+
67+
### Adding a new table
68+
69+
1. Create the SQLModel model in `src/policyengine_api/models/`
70+
2. Export it in `models/__init__.py`
71+
3. Generate migration: `uv run alembic revision --autogenerate -m "Add table_name table"`
72+
4. **Read and verify** the generated migration file
73+
5. Apply locally: `uv run alembic upgrade head`
74+
6. If the table needs RLS protection on the PostgREST surface, add policies to `scripts/init.py`
75+
7. Merge to `main` — migration runs automatically in `deploy.yml`
76+
77+
### Making a destructive schema change
78+
79+
Cloud Run uses rolling updates — old and new revisions serve traffic simultaneously. Destructive changes (drops, renames, type changes) require the expand-contract pattern across multiple deployments. See `references/design-decisions.md` for details.
80+
81+
### Full production database reset
82+
83+
1. Go to GitHub Actions > "Reset production database"
84+
2. Select seeding mode (`lite` or `full`)
85+
3. Type `reset-prod` in the confirmation field
86+
4. Wait for the `production` environment approval
87+
5. The workflow drops the schema, runs all migrations, applies RLS, and seeds data
88+
89+
## File Map
90+
91+
| File | Purpose |
|------|---------|
| `.github/workflows/deploy.yml` | Runs `alembic upgrade head` before Cloud Run update |
| `.github/workflows/db-reset.yml` | Manual destructive reset + reseed |
| `alembic/env.py` | Alembic config: reads `DATABASE_URL`, sets `NullPool` + `lock_timeout` |
| `alembic/versions/` | Migration files (source of truth for schema) |
| `scripts/init.py` | RLS policies, storage bucket setup, calls `alembic upgrade head` |
| `scripts/seed.py` | Populates models, variables, parameters, datasets |
| `src/policyengine_api/services/database.py` | SQLAlchemy engine + session factory (no `create_all`) |
100+
101+
## Additional Resources
102+
103+
### Reference Files
104+
105+
- **`references/design-decisions.md`** — Full rationale for RLS placement, connection types, zero-downtime migrations, and anti-patterns to avoid
Lines changed: 149 additions & 0 deletions
@@ -0,0 +1,149 @@
1+
# Database Deployment Design Decisions
2+
3+
Detailed technical rationale behind the database deployment architecture. Consult this when modifying the pipeline or when the reasoning behind a decision is unclear.
4+
5+
## Why Migrations Run in CI/CD (Not at App Startup)
6+
7+
Four approaches were evaluated:
8+
9+
| Approach | Runs Once? | Blocks Deploy on Failure? | Safe for Multi-Instance? |
|----------|-----------|--------------------------|-------------------------|
| CI/CD step in deploy.yml | Yes | Yes | Yes |
| Cloud Run Job | Yes | Yes | Yes |
| FastAPI lifespan | No | No | No |
| Docker entrypoint | No | No | No |
15+
16+
**CI/CD step was chosen** because it's the simplest correct approach for a small team. Cloud Run Jobs would add Terraform complexity (a separate `google_cloud_run_v2_job` resource) for no practical benefit at current scale.
17+
18+
The lifespan and entrypoint approaches are dangerous because Cloud Run may start 2-10 instances simultaneously during a deployment. Each instance would attempt to run migrations concurrently, causing:
19+
- Lock contention on DDL statements
20+
- Race conditions where multiple instances try to create the same table
21+
- Duplicate entries in `alembic_version`
22+
23+
The Alembic maintainer has confirmed that app-startup migrations are a fallback for environments that don't support arbitrary deploy commands — not the preferred approach.
24+
25+
## Why `create_all()` Was Removed
26+
27+
`SQLModel.metadata.create_all()` was previously called in the FastAPI lifespan via `init_db()`. This was removed because:
28+
29+
1. **It conflicts with Alembic.** `create_all()` creates tables that don't exist but does not modify existing tables (no column adds, type changes, renames, or drops). This means schema evolution silently fails — the app starts successfully but with a stale schema.
30+
2. **It masks missing migrations.** If a developer forgets to generate a migration, `create_all()` might create the table anyway in some environments, hiding the problem until production.
31+
3. **It has the same multi-instance problem.** Multiple Cloud Run instances calling `create_all()` simultaneously can cause conflicts.
32+
33+
With `create_all()` removed, a missing migration causes an immediate, visible failure — the app tries to query a table or column that doesn't exist, rather than silently operating against a partial schema.
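The "creates missing tables but never alters existing ones" behaviour can be illustrated with an analogy in plain SQL. The sketch below uses the standard-library `sqlite3` module and `CREATE TABLE IF NOT EXISTS` as a stand-in for `create_all()`. It is not SQLModel code, and the `users` table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Initial schema, as first created.
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, first_name TEXT)")

# The model later gains a full_name column. A "create if missing" call sees
# that the table already exists and does nothing, without raising an error.
conn.execute(
    "CREATE TABLE IF NOT EXISTS users "
    "(id INTEGER PRIMARY KEY, first_name TEXT, full_name TEXT)"
)

# Column 1 of each PRAGMA table_info row is the column name.
columns = [row[1] for row in conn.execute("PRAGMA table_info(users)")]
print(columns)  # ['id', 'first_name']
```

The second statement succeeds silently, yet `full_name` never appears. That is the stale-schema failure mode a migration tool exists to prevent.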
34+
35+
## Why RLS Policies Are Not in Alembic
36+
37+
### The Core Insight: RLS Doesn't Protect This API
38+
39+
The API connects to Supabase as the `postgres` user. In Supabase, `postgres` has the `BYPASSRLS` privilege. Every query from the FastAPI app and Modal workers bypasses RLS entirely.
40+
41+
RLS policies only take effect when access goes through Supabase's PostgREST proxy — i.e., when using the Supabase client library with `anon` or `authenticated` JWT keys. The API uses the Supabase client **only for storage** (uploading/downloading HDF5 dataset files), not for table queries.
42+
43+
### Why init.py Is the Right Home
44+
45+
Given that RLS is defense-in-depth only:
46+
47+
1. **Idempotency is a feature, not a limitation.** The `DROP POLICY IF EXISTS` + `CREATE POLICY` pattern can be re-run safely. Alembic migrations run once and are tracked — re-running RLS setup is actually desirable for configuration.
48+
2. **No tooling benefit from Alembic.** `alembic revision --autogenerate` cannot detect RLS changes. Every policy would be manual `op.execute()` raw SQL — identical to what `init.py` already does.
49+
3. **Supabase's own tooling has RLS gaps.** `supabase db diff` cannot track `ALTER POLICY` statements. Even Supabase doesn't have a clean migration story for RLS.
50+
4. **Simpler table creation.** Adding a new table doesn't require a separate migration file just for its RLS policy.
51+
52+
### When to Reconsider
53+
54+
Move RLS into Alembic if the architecture changes such that:
55+
- The API connects as a non-superuser role (not `postgres`)
56+
- RLS becomes the primary access control mechanism (not just defense-in-depth)
57+
- Tables need different policies at different points in time (versioned evolution)
58+
59+
In that case, include RLS policies in the same Alembic migration that creates the table:
60+
61+
```python
from alembic import op


def upgrade():
    # Column definitions are elided here, as in the original example.
    op.create_table("some_table", ...)
    op.execute("ALTER TABLE some_table ENABLE ROW LEVEL SECURITY")
    op.execute("""
        CREATE POLICY "Service role full access" ON some_table
        FOR ALL TO service_role USING (true) WITH CHECK (true)
    """)


def downgrade():
    # Undo in reverse order: the policy first, then the table.
    op.execute('DROP POLICY IF EXISTS "Service role full access" ON some_table')
    op.drop_table("some_table")
```
74+
75+
## Connection Types for Supabase
76+
77+
Supabase exposes two connection endpoints. Use the correct one for each operation:
78+
79+
| Operation | Connection | Port | Why |
|-----------|-----------|------|-----|
| Alembic migrations | Direct (`db.project.supabase.co`) | 5432 | DDL needs full Postgres features |
| FastAPI application | Pooler (`pooler.supabase.com`) | 6543 | Efficient connection reuse |
| `scripts/init.py` | Pooler | 6543 | Mostly DML (RLS policies are DDL but lightweight) |
84+
85+
The direct connection uses IPv6 by default. GitHub Actions runners support IPv6, so this works for CI/CD.
86+
87+
`alembic/env.py` uses `NullPool` because Supabase manages connection pooling on its side. Creating a client-side pool on top of server-side pooling wastes connections.
88+
89+
## Zero-Downtime Migration Pattern
90+
91+
Cloud Run uses rolling updates: during deployment, both old and new revisions serve traffic simultaneously. Schema changes must be backwards-compatible with the old code.
92+
93+
### Additive Changes (Safe in One Deploy)
94+
95+
- Adding a new column (nullable or with a default)
96+
- Adding a new table
97+
- Adding an index
98+
99+
These are safe because old code simply ignores the new column/table.
100+
101+
### Destructive Changes (Require Expand-Contract)
102+
103+
Dropping columns, renaming columns, or changing column types require three separate deployments:
104+
105+
**Deploy 1 — Expand:** Add the new column/table. Old code ignores it.
106+
107+
```python
import sqlalchemy as sa
from alembic import op


def upgrade():
    op.add_column("users", sa.Column("full_name", sa.String(), nullable=True))
```
111+
112+
**Deploy 2 — Migrate:** New code writes to both old and new columns. Run a backfill for existing data.
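A backfill for this step might look like the sketch below, which updates rows in small batches so that no single statement holds locks for long. It is an illustration under stated assumptions: it uses the standard-library `sqlite3` module so it runs anywhere, the `users` table is assumed to have an integer `id` primary key, and the `backfill_full_name` helper is hypothetical. A production version would run against Postgres as a separate script or background job.

```python
import sqlite3


def backfill_full_name(conn: sqlite3.Connection, batch_size: int = 1000) -> int:
    """Populate users.full_name in batches; return the number of rows updated."""
    total = 0
    while True:
        cursor = conn.execute(
            """
            UPDATE users
            SET full_name = TRIM(
                COALESCE(first_name, '') || ' ' || COALESCE(last_name, '')
            )
            WHERE id IN (
                SELECT id FROM users WHERE full_name IS NULL LIMIT ?
            )
            """,
            (batch_size,),
        )
        conn.commit()  # commit per batch so each transaction stays short
        if cursor.rowcount == 0:
            break  # nothing left to backfill
        total += cursor.rowcount
    return total
```

`COALESCE` guarantees that the computed value is never `NULL`, so every processed row leaves the `full_name IS NULL` set and the loop terminates.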
113+
114+
**Deploy 3 — Contract:** Drop the old column after all instances use the new code.
115+
116+
```python
from alembic import op


def upgrade():
    # Safe only once no running revision reads or writes these columns.
    op.drop_column("users", "first_name")
    op.drop_column("users", "last_name")
```
121+
122+
### Index Creation on Large Tables
123+
124+
Standard `CREATE INDEX` takes a `SHARE` lock on the table, which blocks writes (inserts, updates, deletes) for the duration of the build, although reads continue. For large tables, use `CREATE INDEX CONCURRENTLY`:
125+
126+
```python
from alembic import op


def upgrade():
    with op.get_context().autocommit_block():
        op.execute("CREATE INDEX CONCURRENTLY idx_users_email ON users (email)")
```
130+
131+
Note: `CONCURRENTLY` cannot run inside a transaction. Use `op.get_context().autocommit_block()` or set the migration to non-transactional.
132+
133+
## Anti-Patterns to Avoid
134+
135+
### Editing Applied Migration Files
136+
137+
Never modify a migration that has already been applied to production. Alembic tracks migrations by revision ID — editing a file does not re-run it. Create a new migration to fix issues.
138+
139+
### Mixing Schema and Data Migrations
140+
141+
A single migration that both alters the schema and backfills millions of rows holds locks for extended periods. Schema changes should be deploy-time migrations. Data backfills should be separate scripts or background jobs.
142+
143+
### Running Migrations Without lock_timeout
144+
145+
Without `lock_timeout`, a migration waits indefinitely for a lock on a table with long-running queries. All new queries on that table queue behind the migration's pending lock request, which can cascade into a full outage. The 5-second timeout in `alembic/env.py` ensures the migration fails fast instead.
146+
147+
### Using the Pooler for Migrations
148+
149+
Transaction-mode pooling (port 6543) is incompatible with DDL statements and prepared statements. Always use the direct connection (port 5432) for Alembic operations. The `deploy.yml` workflow uses `secrets.SUPABASE_DB_URL` (direct), while `db-reset.yml` uses `secrets.SUPABASE_POOLER_URL` — note this difference if modifying the workflows.
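A small guard can catch a pooler URL before migrations start. The sketch below is hypothetical and not part of the repository: `uses_pooler_port` and the example hostnames are illustrative, and it assumes the port numbers stated above (5432 direct, 6543 pooler).

```python
from urllib.parse import urlsplit

POOLER_PORT = 6543  # transaction-mode pooler, per the section above


def uses_pooler_port(database_url: str) -> bool:
    """Return True if the URL explicitly targets the pooler port."""
    # urlsplit().port is None when the URL has no explicit port.
    return urlsplit(database_url).port == POOLER_PORT


def require_direct_connection(database_url: str) -> None:
    """Raise before running migrations if the URL points at the pooler."""
    if uses_pooler_port(database_url):
        raise ValueError(
            "Migrations must use the direct connection (port 5432), "
            "not the pooler (port 6543)."
        )
```

Calling `require_direct_connection(url)` at the top of a migration entry point would turn a misconfigured secret into an immediate, descriptive failure.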

.claude/skills/database-migrations.md

Lines changed: 5 additions & 60 deletions
@@ -36,9 +36,9 @@ This project uses **Alembic** for database migrations with **SQLModel** models.
3636

3737
## Essential Rules
3838

39-
### 1. NEVER use SQLModel.metadata.create_all() for schema creation
39+
### 1. NEVER use SQLModel.metadata.create_all()
4040

41-
The old pattern of using `SQLModel.metadata.create_all()` is deprecated. All tables are created via Alembic migrations.
41+
`create_all()` is not used anywhere in this project. It was removed because it conflicts with Alembic (creates tables but can't modify them, masking missing migrations). For how migrations reach production, see the `database-deployment-pipeline` skill.
4242

4343
### 2. Every schema change requires a migration
4444

@@ -200,66 +200,11 @@ uv run alembic upgrade head
200200

201201
## Production Considerations
202202

203-
### Applying migrations to production
203+
Migrations are automatically applied in `deploy.yml` (runs `alembic upgrade head` before updating Cloud Run). For full details on the production pipeline, connection types, lock_timeout, RLS policy handling, and zero-downtime patterns, see the `database-deployment-pipeline` skill.
204204

205-
1. Migrations are automatically applied when deploying
206-
2. Always test migrations locally first
207-
3. For data migrations, consider running during low-traffic periods
205+
### alembic stamp (for one-time transitions)
208206

209-
### Transitioning production from old system to Alembic
210-
211-
Production databases that were created before Alembic (using the old `SQLModel.metadata.create_all()` approach or raw Supabase migrations) need special handling. Running `alembic upgrade head` would fail because the tables already exist.
212-
213-
**The solution: `alembic stamp`**
214-
215-
The `alembic stamp` command marks a migration as "already applied" without actually running it. This tells Alembic "the database is already at this state, start tracking from here."
216-
217-
**How it works:**
218-
219-
1. `alembic stamp <revision_id>` inserts a row into the `alembic_version` table with the specified revision ID
220-
2. Alembic now thinks that migration (and all migrations before it) have been applied
221-
3. Future migrations will run normally starting from that point
222-
223-
**Step-by-step production transition:**
224-
225-
```bash
226-
# 1. Connect to production database
227-
# (set SUPABASE_DB_URL or other connection env vars)
228-
229-
# 2. Check if alembic_version table exists
230-
# If not, Alembic will create it automatically
231-
232-
# 3. Verify production schema matches the initial migration
233-
# Compare tables/columns in production against alembic/versions/20260204_d6e30d3b834d_initial_schema.py
234-
235-
# 4. Stamp the initial migration as applied
236-
uv run alembic stamp d6e30d3b834d
237-
238-
# 5. If production also has the indexes from the second migration, stamp that too
239-
uv run alembic stamp a17ac554f4aa
240-
241-
# 6. Verify the stamp worked
242-
uv run alembic current
243-
# Should show: a17ac554f4aa (head)
244-
245-
# 7. From now on, new migrations will apply normally
246-
uv run alembic upgrade head
247-
```
248-
249-
**Handling partially applied migrations:**
250-
251-
If production has some but not all changes from a migration:
252-
253-
1. Manually apply the missing changes via SQL
254-
2. Then stamp that migration as complete
255-
3. Or: create a new migration that only adds the missing pieces
256-
257-
**After stamping:**
258-
259-
- All future schema changes go through Alembic migrations
260-
- Developers generate migrations with `alembic revision --autogenerate`
261-
- Deployments run `alembic upgrade head` to apply pending migrations
262-
- The `alembic_version` table tracks what's been applied
207+
If production has tables that predate Alembic, use `alembic stamp <revision_id>` to mark migrations as already applied without running them. This tells Alembic to start tracking from that point forward.
263208

264209
## File Structure
265210
