fix(ingest): validate stored index definitions before re-execution#113
Closed
sebastiondev wants to merge 2 commits into
Closed
fix(ingest): validate stored index definitions before re-execution#113sebastiondev wants to merge 2 commits into
sebastiondev wants to merge 2 commits into
Conversation
finalize_database() re-executes CREATE INDEX statements captured from pg_indexes during init_database(). The strings were passed straight to cur.execute() with no validation, so any unexpected catalog entry on the chunks table — for example one created by a database role with DDL access but lower privileges than the importer — would be executed verbatim by the ingestion job. That is a textbook second-order SQL injection sink (CWE-89). Add a strict allow-list validator that: * requires a canonical 'CREATE [UNIQUE] INDEX <name> ON [docs.]<table>' prefix as produced by pg_get_indexdef(), * pins the target table to the importer's own chunks_table, and * rejects any embedded ';' so stacked statements cannot be smuggled. Validation runs both when the definitions are captured and again immediately before execution as defence in depth. Includes unit tests covering the canonical shapes plus stacked-statement, wrong-table and non-CREATE-INDEX payloads.
|
|
Author
|
Thanks for the heads-up. The CLA needs to be signed manually by the contributor via the CLA assistant link — flagging this to the author to complete the signature so the PR can move forward. |
Author
|
Closing this as inactive — no maintainer response after 21 days. The security finding and fix remain valid. If this is still relevant, I'm happy to reopen, rebase, or re-submit against a different branch. Just drop a comment. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
DocumentImporter.finalize_database()re-executesCREATE INDEXstrings that were captured frompg_indexes.indexdefduringinit_database()and stashed onself._index_defs_to_create. Those strings are passed straight tocur.execute(index_def)with no validation. If the on-disk catalog entry for one of the chunks-table indexes is influenced by anyone other than the importer's role, the importer will faithfully run whatever DDL/DML it finds when it runs next — a textbook second-order SQL injection (CWE-89 / CWE-915 flavor).ingest/document_importer.pycur.execute(index_def)infinalize_database()SELECT indexdef FROM pg_indexes WHERE schemaname='docs' AND tablename = %sininit_database()docs.<chunks_table>by a role other than the importer (see preconditions below). Worth fixing because the importer is the most-privileged thing in this pipeline and the current code makes that trust completely implicit.Fix
A small allow-list validator,
_validate_index_def, applied in two places:init_database()before storing each captured definition.finalize_database()immediately beforecur.execute(...)as defence-in-depth, so a future refactor that bypasses the first call still can't smuggle SQL.The validator:
CREATE [UNIQUE] INDEX [IF NOT EXISTS] <name> ON [docs.]<table> …shape thatpg_get_indexdef()produces.chunks_table— anything else is rejected.;and rejects any other semicolon, blocking stacked statements.It returns the (trimmed) original string on success, so the executed SQL is unchanged for legitimate indexes.
Tests
ingest/tests/test_index_def_validation.pycovers:CREATE INDEX,CREATE UNIQUE INDEX,USING btree,USING bm25 (...) WITH (key_field=id), trailing semicolon.CREATE INDEX … ; DROP TABLE …), wrong target table (docs.other_table,docs.pg_class), non-CREATE INDEXDDL (DROP TABLE,ALTER TABLE,SELECT pg_sleep), empty string, and non-string input.No production behavior changes for valid catalog state — the validator returns the same string
pg_get_indexdef()rendered.Security analysis
Why this is reachable rather than purely theoretical:
init_database()andfinalize_database()run in two different transactions (with the swap-and-rename pattern in between), so anything that can write to the catalog between those two calls — or that influenced what got into the catalog beforeinit_database()ran — gets its SQL executed by the importer's role at finalize time. The importer typically runs withCREATE TABLE/DROP TABLErights indocs; that's strictly more than what an attacker needs to plant a malicious index definition.Preconditions for exploitation:
CREATE INDEXprivilege ondocs.<chunks_table>.pg_get_indexdef()'s reconstruction. PostgreSQL canonicalises identifiers heavily, so this is the narrow part — but "narrow" is not "none", especially across PG versions and extensions likepg_search(USING bm25 (...) WITH (...)), and the validator removes the need to reason about it at all.Mitigations the fix provides, even if you consider direct exploitation unlikely today:
Adversarial review
Before submitting, we tried to talk ourselves out of this one. The honest answer is that in a single-tenant deployment where only the importer's role can touch
docs.*, you'd never hit this —pg_get_indexdef()would always reproduce exactly what the importer itself created. But there's no access-control or framework protection in the current code path that enforces that assumption: the executed string is whatever the catalog hands back, and the importer runs with enough privilege to do real damage if that string is hostile. The fix is ~70 lines, has no behavior change for valid input, and converts an implicit deployment assumption into an enforced invariant, which seemed worth a small PR.cc @lewiswigmore