[schemas] Smart ingest pipeline tables#196
Open
alanshurafa wants to merge 8 commits into
The SECURITY DEFINER function `append_thought_evidence` was granted to authenticated/anon, allowing RLS bypass. It is now restricted to service_role only. Added FOR UPDATE to prevent concurrent evidence appends from losing writes.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
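A minimal sketch of what a service-role-only grant could look like. It assumes the Supabase roles `anon`, `authenticated`, and `service_role` exist, that the function name is unique (so PostgreSQL 10+ accepts it without an argument list), and that it resolves via the current `search_path`. The variable and parameter names in the trailing comment are hypothetical, not taken from the PR.

```sql
-- Remove execute rights from everyone, including the implicit PUBLIC grant.
REVOKE ALL ON FUNCTION append_thought_evidence FROM PUBLIC, anon, authenticated;

-- Allow only the trusted backend role to call the SECURITY DEFINER function.
GRANT EXECUTE ON FUNCTION append_thought_evidence TO service_role;

-- Inside the function body, the row would be locked before the
-- read-modify-write so that concurrent appends serialize, e.g.:
--   SELECT metadata INTO v_metadata          -- v_metadata: hypothetical variable
--   FROM thoughts WHERE id = p_thought_id    -- p_thought_id: hypothetical parameter
--   FOR UPDATE;
```

Revoking from `PUBLIC` matters because PostgreSQL grants `EXECUTE` on new functions to `PUBLIC` by default.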
Drop the stale reference to `schemas/enhanced-thoughts/` (deleted on this branch and not actually used by the SQL; the function only touches `thoughts.id` and `thoughts.metadata`). Also update Expected Outcome to reflect the service-role-only grant on `append_thought_evidence` so users don't re-grant it to anon/authenticated by accident.
Why: the README claimed a prerequisite that 404s on the repo and misstated the RPC's trust boundary. Both were latent footguns for users.
Add nullable `user_id uuid` to `ingestion_jobs` and `ingestion_items` via idempotent `ALTER TABLE ... ADD COLUMN IF NOT EXISTS`. A DO block conditionally adds FKs to `auth.users(id) ON DELETE CASCADE` only when Supabase's `auth` schema and `users` table exist, so the migration stays safe on non-Supabase Postgres.
Why: without `user_id`, multi-tenant deployments leak ingestion history across users. Nullable keeps single-tenant stock OB1 working with no data migration and lets RLS policies (added separately) key off `auth.uid() = user_id` once populated.
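The migration described above might look roughly like the following sketch. The constraint name is an assumption made for illustration, and only `ingestion_jobs` is shown in the DO block; `ingestion_items` would follow the same shape.

```sql
-- Idempotent: a re-run is a no-op once the column exists.
ALTER TABLE ingestion_jobs  ADD COLUMN IF NOT EXISTS user_id uuid;
ALTER TABLE ingestion_items ADD COLUMN IF NOT EXISTS user_id uuid;

DO $$
BEGIN
  -- to_regclass returns NULL (rather than raising) when the auth schema
  -- or the users table is missing, so plain Postgres skips this branch.
  IF to_regclass('auth.users') IS NOT NULL
     AND NOT EXISTS (
       SELECT 1
       FROM pg_constraint
       WHERE conname  = 'ingestion_jobs_user_id_fkey'   -- hypothetical name
         AND conrelid = 'ingestion_jobs'::regclass
     )
  THEN
    ALTER TABLE ingestion_jobs
      ADD CONSTRAINT ingestion_jobs_user_id_fkey
      FOREIGN KEY (user_id) REFERENCES auth.users (id) ON DELETE CASCADE;
  END IF;
END
$$;
```

The `pg_constraint` check is needed because `ADD CONSTRAINT` has no `IF NOT EXISTS` form for foreign keys.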
Turn on row level security for `ingestion_jobs` and `ingestion_items`, add a `service_role ALL` policy on each (so worker writes still flow), and, conditionally (only when `auth.uid()` exists), add an `authenticated SELECT` policy scoped to `user_id = auth.uid()`. Policies are wrapped in DROP POLICY IF EXISTS / CREATE so the file is still idempotent on re-run.
Why: the grant block was already service-role-only, but without RLS there was no backstop if Supabase's schema-level defaults quietly granted `USAGE`/`SELECT` to `anon` or `authenticated`. RLS closes that door. Giving authenticated users a SELECT scoped to their own rows matches the pattern used by the rest of the Open Brain extensions and is a no-op until someone populates `user_id`.
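As an illustration of the policy layout described above, here is a sketch for `ingestion_jobs` only. The policy names are hypothetical, and the roles `service_role` and `authenticated` are assumed to exist, as they do on Supabase.

```sql
ALTER TABLE ingestion_jobs ENABLE ROW LEVEL SECURITY;

-- The worker role keeps full access.
DROP POLICY IF EXISTS ingestion_jobs_service_all ON ingestion_jobs;
CREATE POLICY ingestion_jobs_service_all ON ingestion_jobs
  FOR ALL TO service_role
  USING (true) WITH CHECK (true);

DO $$
BEGIN
  -- to_regprocedure returns NULL when auth.uid() is not defined,
  -- so the owner-scoped policy is created only on Supabase.
  IF to_regprocedure('auth.uid()') IS NOT NULL THEN
    DROP POLICY IF EXISTS ingestion_jobs_owner_select ON ingestion_jobs;
    CREATE POLICY ingestion_jobs_owner_select ON ingestion_jobs
      FOR SELECT TO authenticated
      USING (user_id = auth.uid());
  END IF;
END
$$;
```

Because `NULL = auth.uid()` never evaluates to true, rows with an unpopulated `user_id` stay invisible to `authenticated` users.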
Add partial indexes keyed on `created_at` for rows in the active
lifecycle (`status = 'pending'` on jobs; `status IN ('pending','ready')`
on items). Both use `CREATE INDEX IF NOT EXISTS` so re-running the
migration is a no-op.
Why: the worker polls for the next pending job and for ready items
repeatedly. Without a partial index, every poll becomes a seq scan
against a table whose historical tail of completed rows grows forever.
Partial indexes stay tiny (only live queue rows) and shrink to near
zero when the queue drains.
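A sketch of the two partial indexes as described in this commit message. The index names are assumptions; the predicates and the `created_at` key come from the text above.

```sql
-- Covers only jobs still waiting to be claimed.
CREATE INDEX IF NOT EXISTS ingestion_jobs_pending_created_at_idx
  ON ingestion_jobs (created_at)
  WHERE status = 'pending';

-- Covers only items that are still in the active lifecycle.
CREATE INDEX IF NOT EXISTS ingestion_items_active_created_at_idx
  ON ingestion_items (created_at)
  WHERE status IN ('pending', 'ready');
```

The planner can use a partial index only when the query's `WHERE` clause implies the index predicate, so polling queries need to filter on the same `status` values.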
Add a Job Claim Semantics section to the README that states the contract explicitly: claim logic lives in the companion Edge Function (`integrations/smart-ingest/`), and any worker that claims a row MUST use `FOR UPDATE SKIP LOCKED`. Include a canonical UPDATE-with-sub-SELECT pattern that pairs with the new partial indexes. Also sync Expected Outcome with the new user_id columns, partial indexes, and RLS policies so the README matches the schema it describes.
Why: the schema file is deliberately minimal (no claim RPC), so without this note a downstream author could wire up a plain SELECT-then-UPDATE worker and silently double-process the queue. Putting the contract in the schema README, next to the tables it operates on, keeps the DB layer's requirements discoverable even when the companion Edge Function lives in a separate folder.
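For readers unfamiliar with the pattern, an UPDATE-with-sub-SELECT claim typically looks like the sketch below. It is not copied from the README: the `id` column and the `'processing'` status value are assumptions, while `status = 'pending'` and `created_at` come from the index description above.

```sql
-- Atomically claim the oldest pending job. Rows already locked by
-- another worker are skipped instead of blocking this one.
UPDATE ingestion_jobs
SET status = 'processing'            -- assumed status value
WHERE id = (
  SELECT id                          -- assumed primary-key column
  FROM ingestion_jobs
  WHERE status = 'pending'
  ORDER BY created_at
  LIMIT 1
  FOR UPDATE SKIP LOCKED
)
RETURNING id;
```

With the default READ COMMITTED isolation, two workers running this statement concurrently lock different rows, and a worker that finds no unlocked pending row updates nothing and gets an empty result.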
e280cbc to
f9cd161
Mergeable, no conflicts against |
Summary
Companion schema for the smart-ingest integration (separate PR): two tables, `ingestion_jobs` and `ingestion_items`, that back asynchronous document ingestion with dry-run/execute semantics.
Security posture:
Ops:
Replaces the content from the closed PR #98 bundle, with the evidence-append hardening fix (`f7b4fd7`) incorporated.
Why
Stock Open Brain has no primitive for "ingest a document, extract multiple candidate thoughts, let a user review before committing." This schema provides the persistence layer for that workflow. The companion Edge Function (separate PR) handles the extraction logic.
Job-claim semantics live in the Edge Function (`FOR UPDATE SKIP LOCKED` pattern) rather than in the schema. The schema is kept minimal so that operators who wire a custom worker can use their own claim approach.
Test plan