[schemas] Smart ingest pipeline tables #196

Open
alanshurafa wants to merge 8 commits into NateBJones-Projects:main from alanshurafa:contrib/alanshurafa/smart-ingest-schema

Conversation

@alanshurafa
Collaborator

Summary

Companion schema for the [smart-ingest integration] (separate PR) — two tables that back async document-ingestion with dry-run/execute semantics:

  • `ingestion_jobs` — one row per ingest request, tracks status, fingerprint, counts
  • `ingestion_items` — extracted thought candidates per job, with evidence appends tracked via the `append_thought_evidence` function

Security posture:

  • RLS enabled on both tables
  • `service_role ALL` policies (unconditional)
  • `authenticated SELECT WHERE user_id = auth.uid()` policies added conditionally if the `auth` schema exists (Supabase-only; no-op on vanilla Postgres)
  • `append_thought_evidence` is `SECURITY DEFINER` but restricted to `service_role` (no anon/authenticated execute grants)
  • `FOR UPDATE` row lock in evidence append to prevent lost-update races
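
A minimal sketch of how the last two bullets could fit together. This is not the PR's actual function body: the parameter names (`p_thought_id`, `p_evidence`), their types, and the `evidence` key inside `thoughts.metadata` are assumptions for illustration. Only the function name, `SECURITY DEFINER`, the `service_role`-only grant, and the `FOR UPDATE` lock come from the description above. It also assumes the Supabase `service_role` role exists.

```sql
-- Hypothetical signature and body; the real definition lives in schema.sql.
CREATE OR REPLACE FUNCTION append_thought_evidence(
  p_thought_id uuid,   -- assumed parameter name and type
  p_evidence   jsonb   -- assumed parameter name and type
)
RETURNS void
LANGUAGE plpgsql
SECURITY DEFINER
SET search_path = public  -- pinning search_path is a common hardening step for SECURITY DEFINER
AS $$
DECLARE
  v_metadata jsonb;
BEGIN
  -- Lock the row so a concurrent call blocks here until this transaction
  -- finishes, then reads the committed metadata rather than a stale copy.
  SELECT metadata INTO v_metadata
  FROM thoughts
  WHERE id = p_thought_id
  FOR UPDATE;

  IF NOT FOUND THEN
    RAISE EXCEPTION 'thought % not found', p_thought_id;
  END IF;

  -- Append to an assumed "evidence" array inside metadata.
  UPDATE thoughts
  SET metadata = jsonb_set(
        COALESCE(v_metadata, '{}'::jsonb),
        '{evidence}',
        COALESCE(v_metadata -> 'evidence', '[]'::jsonb) || jsonb_build_array(p_evidence),
        true)
  WHERE id = p_thought_id;
END;
$$;

-- Functions are executable by PUBLIC by default, so revoke before granting.
REVOKE ALL ON FUNCTION append_thought_evidence(uuid, jsonb) FROM PUBLIC;
GRANT EXECUTE ON FUNCTION append_thought_evidence(uuid, jsonb) TO service_role;
```

Without the `FOR UPDATE`, two callers could both read the same `metadata`, each append one entry, and the later write would overwrite the earlier one.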

Ops:

  • Partial indexes on `status='pending'` / `status IN ('pending','ready')` keep pending-queue lookups cheap as the tables grow
  • Nullable `user_id uuid` columns on both tables; FK to `auth.users(id) ON DELETE CASCADE` added conditionally if the auth schema exists
  • Fully idempotent (`IF NOT EXISTS` / `CREATE OR REPLACE` / `DROP POLICY IF EXISTS` + `CREATE POLICY`)

Replaces the content from the closed PR #98 bundle, with the evidence-append hardening fix (`f7b4fd7`) incorporated.

Why

Stock Open Brain has no primitive for "ingest a document, extract multiple candidate thoughts, let a user review before committing." This schema provides the persistence layer for that workflow. The companion Edge Function (separate PR) handles the extraction logic.

Job-claim semantics live in the Edge Function (`FOR UPDATE SKIP LOCKED` pattern) rather than the schema — kept minimal here so operators can use their own claim approach if they wire a custom worker.

Test plan

  • Apply `schema.sql` to a stock Supabase Open Brain project — verify tables, indexes, policies
  • Re-apply — verify no errors (idempotent)
  • Apply to vanilla Postgres (no auth schema) — verify `NOTICE` skip of auth.users FK and authenticated policy; RLS still enabled with service_role-only access
  • Call `append_thought_evidence` as service_role — succeeds
  • Call as authenticated — fails with `permission denied` (confirm grant restriction)
  • Two concurrent `append_thought_evidence` calls on same thought_id — verify both appends land without lost-update
  • Verify `metadata.json` passes the gate schema
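
The "verify tables, indexes, policies" steps can be checked from the system catalogs. The queries below are one possible approach, not part of the PR; they match on table and function names only, so they assume no other schema in the database reuses those names.

```sql
-- RLS flag on both tables (relrowsecurity should be true).
SELECT relname, relrowsecurity
FROM pg_class
WHERE relkind = 'r'
  AND relname IN ('ingestion_jobs', 'ingestion_items');

-- Policies on each table: expect a service_role ALL policy, plus an
-- authenticated SELECT policy on Supabase only.
SELECT tablename, policyname, roles, cmd, qual
FROM pg_policies
WHERE tablename IN ('ingestion_jobs', 'ingestion_items')
ORDER BY tablename, policyname;

-- Indexes: the partial ones show a WHERE clause in indexdef.
SELECT tablename, indexname, indexdef
FROM pg_indexes
WHERE tablename IN ('ingestion_jobs', 'ingestion_items')
ORDER BY tablename, indexname;

-- Execute grants on the function that are visible to the current role.
SELECT grantee, privilege_type
FROM information_schema.routine_privileges
WHERE routine_name = 'append_thought_evidence';
```

Running the same queries after the re-apply step gives a quick diff-style check that the second run changed nothing.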

github-actions bot added the `schema` label (Contribution: database extension) on Apr 18, 2026
github-actions bot added the `recipe` label (Contribution: step-by-step recipe) on Apr 22, 2026
alanshurafa and others added 8 commits May 18, 2026 20:08
SECURITY DEFINER function was granted to authenticated/anon, allowing
RLS bypass. Now restricted to service_role only. Added FOR UPDATE to
prevent concurrent evidence appends from losing writes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Drop the stale reference to `schemas/enhanced-thoughts/` (deleted on
this branch and not actually used by the SQL — the function only touches
`thoughts.id` and `thoughts.metadata`). Also update Expected Outcome to
reflect the service-role-only grant on `append_thought_evidence` so
users don't re-grant it to anon/authenticated by accident.

Why: README claimed a prerequisite that 404s on the repo and mis-stated
the RPC's trust boundary. Both were latent user-footguns.

Add nullable `user_id uuid` to `ingestion_jobs` and `ingestion_items`
via idempotent `ALTER TABLE ... ADD COLUMN IF NOT EXISTS`. A DO block
conditionally adds FKs to `auth.users(id) ON DELETE CASCADE` only when
Supabase's `auth` schema and `users` table exist, so the migration
stays safe on non-Supabase Postgres.

Why: without user_id, multi-tenant deployments leak ingestion history
across users. Nullable keeps single-tenant stock OB1 working with no
data migration and lets RLS policies (added separately) key off
auth.uid() = user_id once populated.
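
A sketch of the pattern this commit message describes, under stated assumptions: the constraint names are hypothetical, and the existence check via `to_regclass` is one way to detect `auth.users`, not necessarily the one used in `schema.sql`.

```sql
ALTER TABLE ingestion_jobs  ADD COLUMN IF NOT EXISTS user_id uuid;
ALTER TABLE ingestion_items ADD COLUMN IF NOT EXISTS user_id uuid;

DO $$
BEGIN
  -- to_regclass returns NULL instead of raising when the relation is missing.
  IF to_regclass('auth.users') IS NULL THEN
    RAISE NOTICE 'auth.users not found; skipping user_id foreign keys';
    RETURN;
  END IF;

  -- ADD CONSTRAINT has no IF NOT EXISTS form, so check the catalog first.
  IF NOT EXISTS (
    SELECT 1 FROM pg_constraint
    WHERE conname = 'ingestion_jobs_user_id_fkey'      -- assumed constraint name
      AND conrelid = 'ingestion_jobs'::regclass
  ) THEN
    ALTER TABLE ingestion_jobs
      ADD CONSTRAINT ingestion_jobs_user_id_fkey
      FOREIGN KEY (user_id) REFERENCES auth.users (id) ON DELETE CASCADE;
  END IF;

  IF NOT EXISTS (
    SELECT 1 FROM pg_constraint
    WHERE conname = 'ingestion_items_user_id_fkey'     -- assumed constraint name
      AND conrelid = 'ingestion_items'::regclass
  ) THEN
    ALTER TABLE ingestion_items
      ADD CONSTRAINT ingestion_items_user_id_fkey
      FOREIGN KEY (user_id) REFERENCES auth.users (id) ON DELETE CASCADE;
  END IF;
END
$$;
```

Because the column is nullable, existing rows need no backfill, and the foreign key accepts `NULL` values.
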

Turn on row level security for `ingestion_jobs` and `ingestion_items`,
add a `service_role ALL` policy on each (so worker writes still flow),
and — conditionally, only when `auth.uid()` exists — add an
`authenticated SELECT` policy scoped to `user_id = auth.uid()`.
Policies are wrapped in DROP POLICY IF EXISTS / CREATE so the file is
still idempotent on re-run.

Why: the grant block was already service-role-only, but without RLS
there was no backstop if Supabase's schema-level defaults quietly
granted `USAGE`/`SELECT` to `anon` or `authenticated`. RLS closes that
door. Giving authenticated users a SELECT scoped to their own rows
matches the pattern used by the rest of the Open Brain extensions and
is a no-op until someone populates `user_id`.
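
The policy setup described here could look roughly like the following for `ingestion_jobs` (the same shape would repeat for `ingestion_items`). The policy names are hypothetical, and the sketch assumes the Supabase roles `service_role` and `authenticated` already exist.

```sql
ALTER TABLE ingestion_jobs ENABLE ROW LEVEL SECURITY;

-- DROP + CREATE keeps the file re-runnable; CREATE POLICY has no OR REPLACE form.
DROP POLICY IF EXISTS ingestion_jobs_service_role_all ON ingestion_jobs;
CREATE POLICY ingestion_jobs_service_role_all
  ON ingestion_jobs
  FOR ALL
  TO service_role
  USING (true)
  WITH CHECK (true);

DO $$
BEGIN
  -- Only add the per-user policy when a function named auth.uid exists.
  IF NOT EXISTS (
    SELECT 1
    FROM pg_proc p
    JOIN pg_namespace n ON n.oid = p.pronamespace
    WHERE n.nspname = 'auth' AND p.proname = 'uid'
  ) THEN
    RAISE NOTICE 'auth.uid() not found; skipping authenticated SELECT policy';
    RETURN;
  END IF;

  DROP POLICY IF EXISTS ingestion_jobs_authenticated_select ON ingestion_jobs;

  -- EXECUTE keeps the auth.uid() reference inside a string, so it is only
  -- resolved when this branch actually runs.
  EXECUTE $policy$
    CREATE POLICY ingestion_jobs_authenticated_select
      ON ingestion_jobs
      FOR SELECT
      TO authenticated
      USING (user_id = auth.uid())
  $policy$;
END
$$;
```

Rows with a `NULL` `user_id` never satisfy `user_id = auth.uid()`, which is why the authenticated policy exposes nothing until the column is populated.
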

Add partial indexes keyed on `created_at` for rows in the active
lifecycle (`status = 'pending'` on jobs; `status IN ('pending','ready')`
on items). Both use `CREATE INDEX IF NOT EXISTS` so re-running the
migration is a no-op.

Why: the worker polls for the next pending job and for ready items
repeatedly. Without a partial index, every poll becomes a seq scan
against a table whose historical tail of completed rows grows forever.
Partial indexes stay tiny (only live queue rows) and shrink to near
zero when the queue drains.
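
As an illustration of the indexes this commit describes, with hypothetical index names and an assumed `id` column in the example query:

```sql
-- Only rows still waiting in the queue are indexed.
CREATE INDEX IF NOT EXISTS ingestion_jobs_pending_created_at_idx   -- assumed name
  ON ingestion_jobs (created_at)
  WHERE status = 'pending';

CREATE INDEX IF NOT EXISTS ingestion_items_active_created_at_idx   -- assumed name
  ON ingestion_items (created_at)
  WHERE status IN ('pending', 'ready');

-- The planner can use a partial index only when the query's WHERE clause
-- implies the index predicate, so the poll must filter on the same status.
EXPLAIN
SELECT id
FROM ingestion_jobs
WHERE status = 'pending'
ORDER BY created_at
LIMIT 1;
```

A poll that filters on a different status, or on no status at all, cannot use these indexes and falls back to whatever other access path exists.
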

Add a Job Claim Semantics section to the README that states the
contract explicitly: claim logic lives in the companion Edge Function
(`integrations/smart-ingest/`), and any worker that claims a row MUST
use `FOR UPDATE SKIP LOCKED`. Include a canonical UPDATE-with-sub-SELECT
pattern that pairs with the new partial indexes. Also sync Expected
Outcome with the new user_id columns, partial indexes, and RLS
policies so the README matches the schema it describes.

Why: the schema file is deliberately minimal (no claim RPC), so
without this note a downstream author could wire up a plain SELECT-
then-UPDATE worker and silently double-process the queue. Putting the
contract in the schema README — next to the tables it operates on —
keeps the DB layer's requirements discoverable even when the
companion Edge Function lives in a separate folder.
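
For readers without the README at hand, an UPDATE-with-sub-SELECT claim generally takes the shape below. This is a generic sketch and may differ from the README's canonical version: the `id` column and the `'processing'` status value are assumptions.

```sql
-- Claim the oldest pending job; rows locked by other workers are skipped
-- rather than waited on, so two workers never claim the same job.
UPDATE ingestion_jobs
SET status = 'processing'          -- assumed status value
WHERE id = (
  SELECT id                        -- assumed primary key column
  FROM ingestion_jobs
  WHERE status = 'pending'
  ORDER BY created_at
  LIMIT 1
  FOR UPDATE SKIP LOCKED
)
RETURNING id;
```

If the statement returns no row, the queue is either empty or every pending job is currently locked by another worker. The inner `WHERE status = 'pending' ORDER BY created_at` matches the partial index on pending jobs, which is what keeps each poll cheap.
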
alanshurafa force-pushed the `contrib/alanshurafa/smart-ingest-schema` branch from `e280cbc` to `f9cd161` on May 19, 2026 00:08
alanshurafa added the `area: schemas` (Review area: schemas/primitives/data model), `risk: schema` (Touches database schema, migration, or data model behavior), `review: ready-for-maintainer` (Community reviewer recommends maintainer review), and `alan-reviewed` (Reviewed by Alan Shurafa in Community Reviewer role) labels on May 20, 2026
@alanshurafa
Collaborator Author

Mergeable, no conflicts against main. No blockers from my side; ready whenever it reaches the queue.

Labels

  • `alan-reviewed`: Reviewed by Alan Shurafa in Community Reviewer role
  • `area: schemas`: Review area: schemas/primitives/data model
  • `recipe`: Contribution: step-by-step recipe
  • `review: ready-for-maintainer`: Community reviewer recommends maintainer review
  • `risk: schema`: Touches database schema, migration, or data model behavior
  • `schema`: Contribution: database extension