Skip to content

fix(ingest): validate stored index definitions before re-execution#113

Closed
sebastiondev wants to merge 2 commits into
timescale:mainfrom
sebastiondev:fix/cwe89-document-importer-stored-1556
Closed

fix(ingest): validate stored index definitions before re-execution#113
sebastiondev wants to merge 2 commits into
timescale:mainfrom
sebastiondev:fix/cwe89-document-importer-stored-1556

Conversation

@sebastiondev
Copy link
Copy Markdown

Summary

DocumentImporter.finalize_database() re-executes CREATE INDEX strings that were captured from pg_indexes.indexdef during init_database() and stashed on self._index_defs_to_create. Those strings are passed straight to cur.execute(index_def) with no validation. If the on-disk catalog entry for one of the chunks-table indexes is influenced by anyone other than the importer's role, the importer will faithfully run whatever DDL/DML it finds when it runs next — a textbook second-order SQL injection (CWE-89 / CWE-915 flavor).

  • File: ingest/document_importer.py
  • Sink: cur.execute(index_def) in finalize_database()
  • Source: SELECT indexdef FROM pg_indexes WHERE schemaname='docs' AND tablename = %s in init_database()
  • Severity: Medium — exploitation requires DDL access on docs.<chunks_table> by a role other than the importer (see preconditions below). Worth fixing because the importer is the most-privileged thing in this pipeline and the current code makes that trust completely implicit.

Fix

A small allow-list validator, _validate_index_def, applied in two places:

  1. In init_database() before storing each captured definition.
  2. Again in finalize_database() immediately before cur.execute(...) as defence-in-depth, so a future refactor that bypasses the first call still can't smuggle SQL.

The validator:

  • Requires the canonical CREATE [UNIQUE] INDEX [IF NOT EXISTS] <name> ON [docs.]<table> … shape that pg_get_indexdef() produces.
  • Pins the target table to the importer's own chunks_table — anything else is rejected.
  • Strips a single optional trailing ; and rejects any other semicolon, blocking stacked statements.
  • Rejects non-string input.

It returns the (trimmed) original string on success, so the executed SQL is unchanged for legitimate indexes.

Tests

ingest/tests/test_index_def_validation.py covers:

  • Canonical accepted forms: CREATE INDEX, CREATE UNIQUE INDEX, USING btree, USING bm25 (...) WITH (key_field=id), trailing semicolon.
  • Rejection of stacked statements (CREATE INDEX … ; DROP TABLE …), wrong target table (docs.other_table, docs.pg_class), non-CREATE INDEX DDL (DROP TABLE, ALTER TABLE, SELECT pg_sleep), empty string, and non-string input.
$ python -m pytest ingest/tests/test_index_def_validation.py -q
.............                                                            [100%]
13 passed in 0.32s

No production behavior changes for valid catalog state — the validator returns the same string pg_get_indexdef() rendered.

Security analysis

Why this is reachable rather than purely theoretical: init_database() and finalize_database() run in two different transactions (with the swap-and-rename pattern in between), so anything that can write to the catalog between those two calls — or that influenced what got into the catalog before init_database() ran — gets its SQL executed by the importer's role at finalize time. The importer typically runs with CREATE TABLE / DROP TABLE rights in docs; that's strictly more than what an attacker needs to plant a malicious index definition.

Preconditions for exploitation:

  1. Attacker has CREATE INDEX privilege on docs.<chunks_table>.
  2. The importer runs with higher DB privileges than that attacker (typical multi-tenant / least-privilege deployment).
  3. Attacker can get a non-canonical fragment through pg_get_indexdef()'s reconstruction. PostgreSQL canonicalises identifiers heavily, so this is the narrow part — but "narrow" is not "none", especially across PG versions and extensions like pg_search (USING bm25 (...) WITH (...)), and the validator removes the need to reason about it at all.

Mitigations the fix provides, even if you consider direct exploitation unlikely today:

  • Removes an implicit-trust boundary between the importer and the Postgres catalog.
  • Makes any future change to capture index defs from a wider scope (other schemas, other tables) fail loudly instead of silently executing.
  • Documents the expected shape so reviewers don't have to re-derive it.

Adversarial review

Before submitting, we tried to talk ourselves out of this one. The honest answer is that in a single-tenant deployment where only the importer's role can touch docs.*, you'd never hit this — pg_get_indexdef() would always reproduce exactly what the importer itself created. But there's no access-control or framework protection in the current code path that enforces that assumption: the executed string is whatever the catalog hands back, and the importer runs with enough privilege to do real damage if that string is hostile. The fix is ~70 lines, has no behavior change for valid input, and converts an implicit deployment assumption into an enforced invariant, which seemed worth a small PR.

cc @lewiswigmore

finalize_database() re-executes CREATE INDEX statements captured from
pg_indexes during init_database(). The strings were passed straight to
cur.execute() with no validation, so any unexpected catalog entry on
the chunks table — for example one created by a database role with
DDL access but lower privileges than the importer — would be executed
verbatim by the ingestion job. That is a textbook second-order SQL
injection sink (CWE-89).

Add a strict allow-list validator that:

* requires a canonical 'CREATE [UNIQUE] INDEX <name> ON [docs.]<table>'
  prefix as produced by pg_get_indexdef(),
* pins the target table to the importer's own chunks_table, and
* rejects any embedded ';' so stacked statements cannot be smuggled.

Validation runs both when the definitions are captured and again
immediately before execution as defence in depth. Includes unit tests
covering the canonical shapes plus stacked-statement, wrong-table and
non-CREATE-INDEX payloads.
@CLAassistant
Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@sebastiondev
Copy link
Copy Markdown
Author

Thanks for the heads-up. The CLA needs to be signed manually by the contributor via the CLA assistant link — flagging this to the author to complete the signature so the PR can move forward.

@sebastiondev
Copy link
Copy Markdown
Author

Closing this as inactive — no maintainer response after 21 days.

The security finding and fix remain valid. If this is still relevant, I'm happy to reopen, rebase, or re-submit against a different branch. Just drop a comment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants