fix(pgvector): make doc deletion query faster#289

Open
kyteinsky wants to merge 5 commits into master from fix/long-deletes

@kyteinsky kyteinsky commented Mar 20, 2026

Slow-query logging has also been enabled in CI, though I'm not sure we will actually see it in the CI output.

Sample output for the slow deletion query, where a missing index on the source_id foreign key in the access_list table was the culprit.
Calculated time (sum of the plan nodes' actual times): 3.495 + 0.310 + 0.129 = 3.934 ms
Actual time: 201177.123 ms (about 201 s)

Query Text: DELETE FROM docs WHERE docs.source_id IN ($1::VARCHAR, $2::VARCHAR, ..., $275::VARCHAR) RETURNING docs.chunks
Query Parameters: ...
Delete on docs  (cost=1126.32..2018.25 rows=275 width=6) (actual time=0.192..3.495 rows=218 loops=1)
  ->  Bitmap Heap Scan on docs  (cost=1126.32..2018.25 rows=275 width=6) (actual time=0.144..0.310 rows=218 loops=1)
        Recheck Cond: ((source_id)::text = ANY ('{"files__default: 20392","files__default: 23092", ... }'::text[]))
        Heap Blocks: exact=25
        ->  Bitmap Index Scan on docs_pkey  (cost=0.00..1125.56 rows=275 width=0) (actual time=0.129..0.129 rows=218 loops=1)
              Index Cond: ((source_id)::text = ANY ('{"files__default: 20392", ...
2026-03-19 11:28:59.760 UTC [6703] LOG:  duration: 201177.123 ms  execute <unnamed>: DELETE FROM docs WHERE docs.source_id IN ($1::VARCHAR, $2::VARCHAR, ..., $275::VARCHAR) RETURNING docs.chunks
2026-03-19 11:28:59.760 UTC [6703] DETAIL:  Parameters: $1 = 'files__default: 20392', $2 = ...
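
As a quick check of the arithmetic above, the following sketch sums the node times reported in the plan and compares them with the logged duration. The variable names are illustrative only; the numbers are copied from the output above.

```python
# Actual end times (ms) of the three plan nodes, copied from the EXPLAIN output.
plan_node_times_ms = {
    "Delete on docs": 3.495,
    "Bitmap Heap Scan on docs": 0.310,
    "Bitmap Index Scan on docs_pkey": 0.129,
}
# Total statement duration (ms) taken from the PostgreSQL log line.
logged_duration_ms = 201177.123

calculated_ms = sum(plan_node_times_ms.values())
unaccounted_ms = logged_duration_ms - calculated_ms

print(f"calculated:  {calculated_ms:.3f} ms")
print(f"unaccounted: {unaccounted_ms:.3f} ms "
      f"({unaccounted_ms / logged_duration_ms:.3%} of the statement)")
```

Nearly all of the statement time falls outside the plan nodes, which is consistent with the foreign-key handling on access_list being the bottleneck rather than the scan of docs itself.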

(The chunking part has been moved to a different PR.)

@kyteinsky kyteinsky requested a review from marcelklehr as a code owner March 20, 2026 12:56
@kyteinsky kyteinsky force-pushed the fix/long-deletes branch 2 times, most recently from 02b8435 to f03c10a Compare March 20, 2026 13:30
f'{DOCUMENTS_TABLE_NAME}.source_id',
ondelete='CASCADE',
),
index=True,

A DB migration is needed for this index to be created on existing installations.

@kyteinsky kyteinsky marked this pull request as draft March 20, 2026 13:31
kyteinsky added 2 commits May 20, 2026 19:06
index the source_id column in the access_list table

Signed-off-by: Anupam Kumar <kyteinsky@gmail.com>
Signed-off-by: Anupam Kumar <kyteinsky@gmail.com>
@kyteinsky
Contributor Author

Verified again that the index actually does improve things:
before:

ccb=# EXPLAIN ANALYZE DELETE FROM docs WHERE source_id = 'files__default: 100';
                                                            QUERY PLAN                                                             
-----------------------------------------------------------------------------------------------------------------------------------
 Delete on docs  (cost=0.42..8.44 rows=0 width=0) (actual time=1.434..1.435 rows=0 loops=1)
   ->  Index Scan using source_id_modified_idx on docs  (cost=0.42..8.44 rows=1 width=6) (actual time=0.841..0.844 rows=1 loops=1)
         Index Cond: ((source_id)::text = 'files__default: 100'::text)
 Planning Time: 0.069 ms
 Trigger for constraint access_list_source_id_fkey: time=923.137 calls=1
 Execution Time: 924.593 ms
(6 rows)

after:

ccb=# EXPLAIN ANALYZE DELETE FROM docs WHERE source_id = 'files__default: 101';
                                                            QUERY PLAN                                                             
-----------------------------------------------------------------------------------------------------------------------------------
 Delete on docs  (cost=0.42..8.44 rows=0 width=0) (actual time=0.098..0.099 rows=0 loops=1)
   ->  Index Scan using source_id_modified_idx on docs  (cost=0.42..8.44 rows=1 width=6) (actual time=0.062..0.065 rows=1 loops=1)
         Index Cond: ((source_id)::text = 'files__default: 101'::text)
 Planning Time: 0.205 ms
 Trigger for constraint access_list_source_id_fkey: time=0.414 calls=1
 Execution Time: 0.563 ms
(6 rows)

The difference to pay attention to is the "Trigger for constraint access_list_source_id_fkey" time:
without index: 923.137 ms
with index: 0.414 ms
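
To make the comparison repeatable, a small helper can pull the trigger times out of EXPLAIN ANALYZE text. This is only a sketch: the function name trigger_times and the regular expression are illustrative and not part of this PR, and the sample text is copied from the two plans above.

```python
import re

# Matches lines such as:
#   Trigger for constraint access_list_source_id_fkey: time=923.137 calls=1
TRIGGER_RE = re.compile(
    r"Trigger for constraint (?P<name>\S+): time=(?P<ms>[\d.]+) calls=(?P<calls>\d+)"
)


def trigger_times(plan_text: str) -> dict[str, float]:
    """Return {constraint name: trigger time in ms} found in an EXPLAIN ANALYZE plan."""
    return {m["name"]: float(m["ms"]) for m in TRIGGER_RE.finditer(plan_text)}


before = """
 Planning Time: 0.069 ms
 Trigger for constraint access_list_source_id_fkey: time=923.137 calls=1
 Execution Time: 924.593 ms
"""
after = """
 Planning Time: 0.205 ms
 Trigger for constraint access_list_source_id_fkey: time=0.414 calls=1
 Execution Time: 0.563 ms
"""

name = "access_list_source_id_fkey"
speedup = trigger_times(before)[name] / trigger_times(after)[name]
print(f"{name}: about {speedup:.0f}x faster with the index")
```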

The index can be created manually like so: CREATE INDEX idx_access_list_source_id ON access_list (source_id);
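
The effect of the index on the cascade lookup can be reproduced in miniature. The sketch below uses Python's built-in sqlite3 module as a stand-in for PostgreSQL; that is an assumption made purely to keep the example self-contained, and the planner and timings differ. The table definitions are simplified and hypothetical. It inspects the plan for the child-row lookup that a cascading delete has to perform on access_list, before and after creating the index, and then confirms that the cascade still removes the child rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
# Simplified, hypothetical schemas; the real tables have more columns.
conn.execute("CREATE TABLE docs (source_id TEXT PRIMARY KEY)")
conn.execute(
    "CREATE TABLE access_list ("
    " id INTEGER PRIMARY KEY,"
    " uid TEXT,"
    " source_id TEXT REFERENCES docs (source_id) ON DELETE CASCADE)"
)
conn.execute("INSERT INTO docs VALUES ('files__default: 100')")
conn.executemany(
    "INSERT INTO access_list (uid, source_id) VALUES (?, ?)",
    [(f"user{i}", "files__default: 100") for i in range(3)],
)


def lookup_plan() -> str:
    """Plan for the child-row lookup that a cascading delete needs."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT id FROM access_list WHERE source_id = ?",
        ("files__default: 100",),
    ).fetchall()
    return " ".join(row[3] for row in rows)  # column 3 holds the plan detail


plan_before = lookup_plan()  # full scan of access_list
conn.execute(
    "CREATE INDEX IF NOT EXISTS idx_access_list_source_id ON access_list (source_id)"
)
plan_after = lookup_plan()  # index search on source_id

conn.execute("DELETE FROM docs WHERE source_id = 'files__default: 100'")
remaining = conn.execute("SELECT COUNT(*) FROM access_list").fetchone()[0]

print("before:", plan_before)
print("after: ", plan_after)
print("access_list rows left after cascade:", remaining)
```

IF NOT EXISTS makes the statement safe to run more than once, which matters if the index was already created by hand on an existing installation.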

kyteinsky added 3 commits May 21, 2026 15:16
…ir fails

Signed-off-by: kyteinsky <kyteinsky@gmail.com>
Signed-off-by: kyteinsky <kyteinsky@gmail.com>
…onfig

Signed-off-by: kyteinsky <kyteinsky@gmail.com>
@kyteinsky kyteinsky marked this pull request as ready for review May 21, 2026 10:01
@kyteinsky kyteinsky changed the title from "fix(pgvector): make doc deletion query faster and use chunking" to "fix(pgvector): make doc deletion query faster" May 21, 2026
@kyteinsky kyteinsky requested a review from Copilot May 21, 2026 10:02

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.

Comment on lines +95 to +99
- name: Enable PostgreSQL slow query logging and auto_explain
run: |
PG_CONTAINER=$(docker ps -q --filter "ancestor=pgvector/pgvector:pg17")
docker exec $PG_CONTAINER bash -c "\
cat >> /var/lib/postgresql/data/postgresql.conf <<EOF
Comment on lines +112 to +117
# file-based logging
logging_collector = on
log_directory = '/var/log/pg_log'
log_filename = 'postgresql.log'
log_file_mode = 0644
EOF"
Comment on lines +420 to +421
PG_CONTAINER=$(docker ps -q --filter "ancestor=pgvector/pgvector:pg17")
docker exec $PG_CONTAINER cat /var/log/pg_log/postgresql.log
Comment on lines +28 to +31
f'Warning: could not read {version_info_path}, assuming no previous version was installed: {e}',
flush=True,
)
return (0, False)
except OSError as e:
print(f'Error: could not list repair directory to get all the eligible repairs: {e}', flush=True)
raise
repair_filenames = [f for f in all_filenames if f.startswith('repair') and f.endswith('.py')]
Comment on lines +25 to +28
conn.execute(sa.text(
'CREATE INDEX IF NOT EXISTS idx_access_list_source_id ON access_list (source_id)'
))
conn.commit()