Add database management docs and trim PG timeout comments

subpath · subpath · commit bd31069a503f · 2026-06-25T15:00:59.000+02:00
diff --git a/docs/database-management.md b/docs/database-management.md
@@ -0,0 +1,198 @@
+# Database management
+
+MLPA talks to two PostgreSQL databases. This doc covers what lives where, how the
+connection pools are configured, and the query timeout budgets (why we have a few
+of them and which query uses which).
+
+## The two databases
+
+| DB | Owner | What MLPA does with it |
+|----|-------|------------------------|
+| `litellm` (`LITELLM_DB_NAME`) | LiteLLM | Reads/writes a couple of tables directly for things the free-tier LiteLLM API doesn't expose (block/unblock, budget tier change, user listing/counts). |
+| `app_attest` (`APP_ATTEST_DB_NAME`) | MLPA (via Alembic) | App Attest challenges + keys, and the signup capacity state. |
+
+Each DB gets its own asyncpg pool, wrapped in a `PGService`
+(`src/mlpa/core/pg_services/`):
+
+- `LiteLLMPGService` → `litellm`
+- `AppAttestPGService` → `app_attest` (also holds a reference to the litellm
+  service, because the capacity gate reads from both)
+
+## Tables
+
+### litellm DB (LiteLLM owns the schema)
+
+**`LiteLLM_EndUserTable`** - one row per end user.
+
+- `user_id` is `{base_identity}:{service_type}`, e.g. `fxa_uid:ai`. The colon is
+  load-bearing: we `split_part(user_id, ':', ...)` all over the place.
+- MLPA uses: `user_id`, `budget_id` (which tier the user is on), `blocked`.
+- Touched by: `get_user`, `list_users`, `update_user_budget`, `block_user`,
+  `count_users_by_service_type`, `list_managed_base_identities`,
+  `has_managed_user_rows`.
+
+**`LiteLLM_BudgetTable`** - the budget tiers (one per service type).
+
+- `budget_id`, `max_budget`, `rpm_limit`, `tpm_limit`, `budget_duration`.
+- MLPA upserts all tiers from config on startup (`create_budget()`), so changing
+  a limit in `config.py` takes effect on next restart, not live.
+
+### app_attest DB (MLPA owns the schema, managed by Alembic)
+
+**`challenges`** - App Attest challenge nonce.
+
+| Column | Type | Notes |
+|--------|------|-------|
+| `key_id_b64` | `VARCHAR(255)` PK | the attested key id |
+| `challenge` | `VARCHAR(255)` | the nonce we issued |
+| `created_at` | `TIMESTAMPTZ` | expires after `CHALLENGE_EXPIRY_SECONDS` (300s) |
+
+**`public_keys`** - the iOS attested key + replay counter.
+
+| Column | Type | Notes |
+|--------|------|-------|
+| `key_id_b64` | `VARCHAR(255)` PK | |
+| `public_key_pem` | `TEXT` | attested public key |
+| `counter` | `BIGINT` | assertion counter, only goes up (replay protection) |
+| `created_at` / `updated_at` | `TIMESTAMPTZ` | |
+
+**`mlpa_user_capacity`** - the signup cap counter. Single row.
+
+| Column | Type | Notes |
+|--------|------|-------|
+| `id` | `SMALLINT` PK `CHECK (id = 1)` | singleton, always 1 |
+| `max_identities` | `BIGINT` | the cap (`MLPA_MAX_SIGNED_IN_USERS`) |
+| `current_identities` | `BIGINT` | how many distinct identities are claimed |
+| `updated_at` | `TIMESTAMPTZ` | |
+
+**`mlpa_user_capacity_identities`** - one row per claimed identity.
+
+| Column | Type | Notes |
+|--------|------|-------|
+| `base_identity` | `TEXT` PK | the `{base_identity}` part of `user_id` |
+| `created_at` | `TIMESTAMPTZ` | |
+
+The two capacity tables are reconciled from `LiteLLM_EndUserTable` on startup
+(`ensure_capacity_state()`), so they don't drift from the real user base.
+
+## Connection pool
+
+Set up in `PGService.connect()`, configured from `config.py`:
+
+| Setting | Default | What |
+|---------|---------|------|
+| `PG_POOL_MIN_SIZE` | 1 | min connections |
+| `PG_POOL_MAX_SIZE` | 10 | max connections |
+| `PG_PREPARED_STMT_CACHE_MAX_SIZE` | 100 | prepared statement cache |
+
+On connect we set these server-side (per session, so they apply to every query
+on the pool):
+
+- `statement_timeout` = `PG_STATEMENT_TIMEOUT_MS`
+- `idle_in_transaction_session_timeout` = `PG_IDLE_IN_TX_TIMEOUT_MS`
+- `application_name` = `mlpa:{db_name}` (handy for `pg_stat_activity`)
+
+## Timeout budgets
+
+The idea: keep a tight default so a runaway query gets killed by Postgres even if
+the client or event loop hangs (no connection pile-up). Then raise the budget
+only for the few queries that legitimately need longer.
+
+| Budget | Default | Used for |
+|--------|---------|----------|
+| `PG_STATEMENT_TIMEOUT_MS` | 3000 (3s) | pool default, every query unless raised |
+| `PG_IDLE_IN_TX_TIMEOUT_MS` | 10000 (10s) | reaps sessions left idle mid-transaction |
+| `PG_ADMIN_READ_TIMEOUT_MS` | 15000 (15s) | admin reads that full-scan the user table |
+| `PG_MAINTENANCE_STATEMENT_TIMEOUT_MS` | 30000 (30s) | startup reconciliation (bigger scans) |
+| `MLPA_ADMISSION_LOCK_TIMEOUT_MS` | 5000 (5s) | `lock_timeout` for the capacity row `FOR UPDATE` |
+| `PG_COMMAND_TIMEOUT_S` | None | optional asyncpg client-side backstop, off by default |
+
+All values are ms, where 0 = unlimited, except `PG_COMMAND_TIMEOUT_S` (seconds, `None` = off).
+
+### How a budget gets applied
+
+Two ways:
+
+1. **Pool-wide** via `server_settings` (the 3s `statement_timeout` and 10s
+   idle-in-tx). This is the baseline for everything.
+2. **Per-transaction** via `SET LOCAL`, using two context managers in
+   `PGService`. `SET LOCAL` only lasts for the transaction, so the connection
+   goes back to the pool defaults on release.
+
+   - `statement_timeout(ms)` - raises `statement_timeout` AND idle-in-tx to the
+     same `ms`. idle-in-tx has to match, otherwise the 10s reaper could kill a
+     transaction we deliberately gave a longer budget.
+   - `admission_transaction()` - the capacity gate path. Sets `lock_timeout` =
+     `MLPA_ADMISSION_LOCK_TIMEOUT_MS`, and `statement_timeout` = `lock_timeout +
+     PG_STATEMENT_TIMEOUT_MS` (so 5s + 3s = 8s). The statement budget has to sit
+     above the lock budget, because Postgres counts lock-wait time toward
+     `statement_timeout`. Left at the 3s pool default, it would kill the
+     statement mid-wait, before the 5s `lock_timeout` ever fired.
+
+### Which query uses which budget
+
+| Budget | Queries |
+|--------|---------|
+| default 3s | challenge + key CRUD, `get_user`, `update_user_budget`, `block_user`, `create_budget` upsert |
+| admin-read 15s | `list_users` (COUNT(*) + deep OFFSET), `count_users_by_service_type` (GROUP BY `split_part`), `has_managed_user_rows` (EXISTS) |
+| maintenance 30s | `list_managed_base_identities` (DISTINCT scan), `_reconcile_capacity_claims` (bulk DELETE + INSERT) |
+| admission 8s | `admit_managed_base_identity`, `maybe_release_managed_base_identity_if_no_managed_users` |
+
+Most of the admin-read and maintenance queries hit the same problem: the
+`user_id` is `base:service_type`, so any filter or group on the service-type
+part uses `split_part`/`position`, which is unindexable. That means a full-table
+scan that grows with the user base and can blow past 3s. `list_users` and the
+reconcile bulk write scale with the table too, so all of them get a bigger budget.
+
+### Cross-pool read ordering
+
+The capacity reconcile and release paths read from the litellm pool and then open
+a transaction on the app_attest pool. That read always happens BEFORE the
+app_attest transaction opens. If you did it inside, the app_attest session would
+sit idle-in-transaction across the cross-pool `await`, and the idle-in-tx reaper
+could kill it (aborting the work and leaking a capacity claim). See
+`_reconcile_capacity_claims` and `maybe_release_managed_base_identity_if_no_managed_users`.
+
+### Client-side backstop
+
+`PG_COMMAND_TIMEOUT_S` is asyncpg's own client-side cancel and it's off by
+default. Careful: it is NOT relaxed by the per-transaction `SET LOCAL` budgets. If
+you turn it on, set it above `PG_MAINTENANCE_STATEMENT_TIMEOUT_MS` (so more than
+30, since this one is in seconds) or it will cancel the maintenance/admin reads.
+
+### These timeouts only apply to MLPA
+
+All of the above is set on MLPA's own connections: the pool defaults as asyncpg
+`server_settings` at connect time, the raised budgets via `SET LOCAL`. None of it
+is database-wide or on the DB role. Nothing in the migrations or scripts sets
+`statement_timeout` at the `ALTER DATABASE` / `ALTER ROLE` level.
+
+So anything else that connects to these databases on its own session is NOT
+affected. That includes the cleanup cron job in the llm-proxy infra, LiteLLM
+itself, and the Cloud SQL console. They run with the Postgres default (usually
+unlimited) unless someone sets a timeout for that role separately. The cron job
+can take as long as it needs; the 3s default won't touch it.
+
+## Migrations
+
+Alembic manages the `app_attest` DB only. LiteLLM manages its own schema.
+
+```bash
+uv run alembic upgrade head      # apply
+uv run alembic downgrade -1      # roll back one
+uv run alembic revision -m "..." # new migration
+```
+
+The `mlpa_user_capacity*` tables are created by migration, then reconciled on
+every startup via `ensure_capacity_state()`. Deploy runs
+`scripts/migrate-app-attest-database.sh` with `-x sqlalchemy.url=...`.
+
+## Startup work
+
+The `lifespan` in `run.py` does two DB things on boot:
+
+1. `litellm_pg.create_budget()` - upsert all budget tiers from config.
+2. `app_attest_pg.ensure_capacity_state()` - seed the singleton capacity row
+   (fatal if it fails: without the row every admission 500s), then reconcile the
+   claim table (best-effort: if it fails the row keeps a stale count and
+   admissions still work).
diff --git a/src/mlpa/core/config.py b/src/mlpa/core/config.py
@@ -295,24 +295,19 @@ def valid_service_type_for_model(self, service_type: str, model: str) -> bool:
     PG_POOL_MAX_SIZE: int = 10
     PG_PREPARED_STMT_CACHE_MAX_SIZE: int = 100
     READINESS_CHECK_TIMEOUT_S: float = 2.0
-    # Postgres query timeouts. Server-enforced (statement_timeout) so a runaway
-    # query is killed even if the client/event loop hangs, preventing connection
-    # pile-up. Values are milliseconds; 0 = unlimited (Postgres semantics).
+    # Server-enforced query timeout (ms, 0 = unlimited): kills a runaway query
+    # even if the client or event loop hangs, so connections don't pile up.
     PG_STATEMENT_TIMEOUT_MS: int = 3000
-    # Reaps transactions left idle between statements (releases held locks).
-    # Should be >= statement_timeout since a tx legitimately spans round-trips.
+    # Reaps sessions left idle mid-transaction (releasing their locks). Keep this
+    # >= statement_timeout, since a transaction can legitimately span round-trips.
     PG_IDLE_IN_TX_TIMEOUT_MS: int = 10000
-    # Raised budget for heavy startup work (capacity reconciliation), applied
-    # per-transaction via SET LOCAL. 0 = unlimited.
+    # Raised budget for heavy startup reconciliation, applied per-transaction via SET LOCAL.
     PG_MAINTENANCE_STATEMENT_TIMEOUT_MS: int = 30000
-    # Raised budget for client-facing admin reads that do unindexable full-table
-    # scans (user listing, counts-by-service-type). 0 = unlimited.
+    # Raised budget for admin reads that full-scan the user table (listing, counts).
     PG_ADMIN_READ_TIMEOUT_MS: int = 15000
-    # Optional asyncpg client-side backstop (seconds). None = disabled.
-    # WARNING: this is a pool-level client-side cancel that is NOT relaxed by the
-    # per-transaction SET LOCAL statement_timeout. If enabled, set it above the
-    # largest per-statement budget (PG_MAINTENANCE_STATEMENT_TIMEOUT_MS), or it
-    # will silently cancel the maintenance/admin-read queries.
+    # Optional asyncpg client-side timeout (seconds, None = off). The SET LOCAL
+    # budgets above don't relax it, so keep it above the largest of them
+    # (PG_MAINTENANCE_STATEMENT_TIMEOUT_MS, in seconds) or it'll cancel those queries.
     PG_COMMAND_TIMEOUT_S: float | None = None
 
     # LLM request default values
diff --git a/src/mlpa/core/pg_services/app_attest_pg_service.py b/src/mlpa/core/pg_services/app_attest_pg_service.py
@@ -104,12 +104,10 @@ async def ensure_capacity_state(self) -> None:
         """
         Seed the singleton capacity row, then reconcile the claim table.
 
-        The seed is critical and fatal on failure: without the row every
-        admission 500s, so a failure should crash startup rather than serve
-        broken. Reconciliation is best-effort (see _reconcile_capacity_claims):
-        if it fails the row still exists with a stale count and admissions work.
+        The seed is fatal on failure: without the row every admission 500s, so we
+        let it crash startup. Reconciliation is best-effort: if it fails the row
+        still holds a stale count and admissions keep working.
         """
-        # Seed the singleton row (fatal on failure).
         async with self.pool.acquire() as conn:
             async with conn.transaction():
                 await conn.execute(
@@ -130,7 +128,6 @@ async def ensure_capacity_state(self) -> None:
                     env.MLPA_MAX_SIGNED_IN_USERS,
                 )
 
-        # Reconcile the claim table (best-effort).
         try:
             await self._reconcile_capacity_claims()
         except Exception as e:
@@ -143,16 +140,15 @@ async def _reconcile_capacity_claims(self) -> None:
         """Rebuild the claim table from LiteLLM and refresh current_identities."""
         managed_service_types = list(env.MLPA_CAPPED_SERVICE_TYPES)
 
-        # Read from the litellm pool before opening the app_attest transaction:
-        # doing it inside would leave the session idle-in-transaction across a
+        # Read the litellm pool before opening the app_attest transaction: doing
+        # it inside would leave the session idle-in-transaction across the
         # cross-pool await, where idle_in_transaction_session_timeout could reap it.
         base_identities = await self.litellm_pg.list_managed_base_identities(
             managed_service_types
         )
 
-        # Bulk delete + insert scales with the user base and can exceed the tight
-        # pool-wide statement_timeout. Statements run back-to-back (no inter-
-        # statement await), so the raised statement_timeout alone suffices.
+        # The bulk delete + insert grows with the user base, so run it under the
+        # raised maintenance budget rather than the tight pool default.
         async with self.statement_timeout(
             env.PG_MAINTENANCE_STATEMENT_TIMEOUT_MS
         ) as conn:
@@ -258,10 +254,9 @@ async def maybe_release_managed_base_identity_if_no_managed_users(
 
         managed_service_types = list(env.MLPA_CAPPED_SERVICE_TYPES)
 
-        # Read the litellm state before opening the app_attest transaction: doing
-        # it inside would hold the FOR UPDATE lock idle-in-transaction across a
-        # cross-pool await, where idle_in_transaction_session_timeout could reap
-        # it and abort the release, leaking the claim (mirrors ensure_capacity_state).
+        # Read the litellm state before opening the app_attest transaction (same
+        # cross-pool idle-in-transaction risk as _reconcile_capacity_claims); reaping
+        # the session here would abort the release and leak the claim.
         has_managed_user_rows = await self.litellm_pg.has_managed_user_rows(
             base_identity,
             managed_service_types,
diff --git a/src/mlpa/core/pg_services/litellm_pg_service.py b/src/mlpa/core/pg_services/litellm_pg_service.py
@@ -77,8 +77,7 @@ async def block_user(self, user_id: str, blocked: bool = True) -> dict:
 
     async def list_users(self, limit: int = 50, offset: int = 0) -> dict:
         try:
-            # COUNT(*) + deep OFFSET scan the full table; admin-read budget
-            # rather than the tight pool-wide default.
+            # COUNT(*) + deep OFFSET full-scan the table, so use the admin-read budget.
             async with self.statement_timeout(env.PG_ADMIN_READ_TIMEOUT_MS) as conn:
                 total = await conn.fetchval(
                     'SELECT COUNT(*) FROM "LiteLLM_EndUserTable"'
@@ -109,8 +108,8 @@ async def count_users_by_service_type(self) -> dict:
         `{base_user_id}:{service_type}`.
         """
         try:
-            # GROUP BY split_part(...) is unindexable, so always a full-table
-            # scan; admin-read budget rather than the tight pool-wide default.
+            # GROUP BY split_part(...) is unindexable, so always a full scan: use
+            # the admin-read budget.
             async with self.statement_timeout(env.PG_ADMIN_READ_TIMEOUT_MS) as conn:
                 rows = await conn.fetch(
                     """
@@ -148,9 +147,8 @@ async def list_managed_base_identities(
         """
         Return distinct base identities for cap-managed service types.
 
-        The DISTINCT scan over the full end-user table can exceed the tight
-        pool-wide statement_timeout on a large user base, so it runs under the
-        maintenance budget (startup reconciliation work).
+        The DISTINCT full-scan can outgrow the tight pool default, and it's startup
+        reconciliation work, so it runs under the maintenance budget.
         """
         async with self.statement_timeout(
             env.PG_MAINTENANCE_STATEMENT_TIMEOUT_MS
@@ -172,12 +170,9 @@ async def has_managed_user_rows(
         """
         Return True if the base identity has any cap-managed LiteLLM end-user rows.
         """
-        # The split_part/position predicate is unindexable, so a no-match EXISTS
-        # scans the full end-user table and can exceed the tight pool-wide
-        # statement_timeout on a large user base (same pattern as
-        # count_users_by_service_type / list_managed_base_identities). Run under
-        # the admin-read budget so a legitimate slow scan on the LiteLLM table is
-        # not killed at 3s, which would leak a capacity claim on the release path.
+        # Unindexable predicate, so a no-match EXISTS full-scans the table. Use
+        # the admin-read budget: killing a slow scan at 3s would leak a capacity
+        # claim on the release path.
         async with self.statement_timeout(env.PG_ADMIN_READ_TIMEOUT_MS) as conn:
             return bool(
                 await conn.fetchval(
@@ -206,8 +201,8 @@ async def create_budget(self):
 
         for service_type, budget_config in user_feature_budgets.items():
             try:
-                # Fast single-row PK upsert: stays a plain autocommit call (it
-                # cannot realistically hit the pool-wide statement_timeout).
+                # Fast single-row PK upsert, so it stays a plain autocommit call: it
+                # won't realistically exceed the pool statement_timeout.
                 await self.pool.fetchrow(
                     """
                     INSERT INTO "LiteLLM_BudgetTable"
diff --git a/src/mlpa/core/pg_services/pg_service.py b/src/mlpa/core/pg_services/pg_service.py
@@ -66,8 +66,7 @@ async def _timed_transaction(
         """
         Yield a connection in a transaction with statement_timeout (and
         optionally idle_in_transaction_session_timeout / lock_timeout) set via
-        SET LOCAL, scoped to the transaction so the connection reverts to the
-        pool-wide defaults on release.
+        SET LOCAL, so the connection reverts to the pool defaults on release.
         """
         async with self.pool.acquire() as conn:
             async with conn.transaction():
@@ -87,12 +86,10 @@ async def _timed_transaction(
     @asynccontextmanager
     async def statement_timeout(self, timeout_ms: int):
         """
-        Raise statement_timeout for statements that legitimately exceed the
-        tight pool-wide default (e.g. unindexable full-table scans).
-
-        idle_in_transaction_session_timeout is lifted to the same budget so the
-        pool-wide reaper (10s) cannot abort a transaction we deliberately granted
-        a longer statement budget if an await ever lands between its statements.
+        Raise statement_timeout for a transaction that legitimately exceeds the
+        tight pool default (e.g. unindexable full-table scans). idle-in-tx is
+        lifted to match, so the pool reaper can't abort it if an await lands
+        between statements.
         """
         async with self._timed_transaction(
             timeout_ms, idle_in_tx_timeout_ms=timeout_ms
@@ -102,10 +99,10 @@ async def statement_timeout(self, timeout_ms: int):
     @asynccontextmanager
     async def admission_transaction(self):
         """
-        Signup-capacity admission path: a bounded lock_timeout for the FOR UPDATE
-        on the singleton capacity row, plus a statement_timeout set above it so
-        the lock wait is governed by lock_timeout rather than silently capped by
-        the pool-wide statement_timeout (Postgres counts lock-wait toward it).
+        Signup-capacity admission path. Bounds the FOR UPDATE wait on the
+        capacity row with lock_timeout, and sets statement_timeout above it so
+        the wait is governed by lock_timeout rather than the tight pool default
+        (Postgres counts lock-wait toward statement_timeout).
         """
         lock_ms = env.MLPA_ADMISSION_LOCK_TIMEOUT_MS
         stmt_ms = lock_ms + env.PG_STATEMENT_TIMEOUT_MS
diff --git a/src/tests/unit/test_pg_timeouts.py b/src/tests/unit/test_pg_timeouts.py