fix(pgvector): make doc deletion query faster and use chunking#289

Draft
kyteinsky wants to merge 2 commits into master from fix/long-deletes

@kyteinsky kyteinsky commented Mar 20, 2026

Slow-query logging has also been enabled in CI, though I'm not sure it will actually show up in the CI output.

Sample output for the slow deletion query, where a missing index on the source_id foreign key in the access_list table was the culprit:
Calculated time (sum of the plan nodes' actual times): 3.495 + 0.310 + 0.129 = 3.934 ms
Actual time: 201177.123 ms (about 201 s)

        Query Text: DELETE FROM docs WHERE docs.source_id IN ($1::VARCHAR, $2::VARCHAR, ..., $275::VARCHAR) RETURNING docs.chunks
        Query Parameters: ...
        Delete on docs  (cost=1126.32..2018.25 rows=275 width=6) (actual time=0.192..3.495 rows=218 loops=1)
    ->  Bitmap Heap Scan on docs  (cost=1126.32..2018.25 rows=275 width=6) (actual time=0.144..0.310 rows=218 loops=1)
                Recheck Cond: ((source_id)::text = ANY ('{"files__default: 20392","files__default: 23092", ... }'::text[]))
                Heap Blocks: exact=25
                ->  Bitmap Index Scan on docs_pkey  (cost=0.00..1125.56 rows=275 width=0) (actual time=0.129..0.129 rows=218 loops=1)
                      Index Cond: ((source_id)::text = ANY ('{"files__default: 20392", ...
2026-03-19 11:28:59.760 UTC [6703] LOG:  duration: 201177.123 ms  execute <unnamed>: DELETE FROM docs WHERE docs.source_id IN ($1::VARCHAR, $2::VARCHAR, ..., $275::VARCHAR) RETURNING docs.chunks
2026-03-19 11:28:59.760 UTC [6703] DETAIL:  Parameters: $1 = 'files__default: 20392', $2 = ...

(put the chunking part in a different PR)
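
For reference, a minimal sketch of what chunked deletion could look like. This is not the code from this PR: the helper names, the default chunk size of 100, and the `%s` placeholder style (as used by psycopg) are all assumptions made for illustration. Only the `docs` table, the `source_id` column, and the `RETURNING chunks` clause come from the logged query above.

```python
from typing import Iterator, List, Sequence, Tuple


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_delete_statements(
    source_ids: Sequence[str], chunk_size: int = 100
) -> List[Tuple[str, List[str]]]:
    """Return one (sql, params) pair per chunk of source IDs.

    Hypothetical helper: the chunk size and placeholder style are assumptions.
    """
    statements = []
    for chunk in chunked(source_ids, chunk_size):
        # One placeholder per ID in this chunk, so each DELETE stays small.
        placeholders = ", ".join(["%s"] * len(chunk))
        sql = (
            f"DELETE FROM docs WHERE source_id IN ({placeholders}) "
            "RETURNING chunks"
        )
        statements.append((sql, list(chunk)))
    return statements


# The 275 IDs from the slow query would be split into 3 smaller deletes.
ids = [f"files__default: {i}" for i in range(275)]
stmts = build_delete_statements(ids, chunk_size=100)
```

Each pair could then be passed to a cursor's `execute(sql, params)`, ideally one transaction per chunk so that a slow cascade does not hold locks for the whole batch.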

@kyteinsky kyteinsky requested a review from marcelklehr as a code owner March 20, 2026 12:56
@kyteinsky kyteinsky force-pushed the fix/long-deletes branch 2 times, most recently from 02b8435 to f03c10a Compare March 20, 2026 13:30
f'{DOCUMENTS_TABLE_NAME}.source_id',
ondelete='CASCADE',
),
index=True,

A DB migration is needed for this index to be created on existing installations.

@kyteinsky kyteinsky marked this pull request as draft March 20, 2026 13:31
kyteinsky added 2 commits May 20, 2026 19:06
index the source_id column in the access_list table

Signed-off-by: Anupam Kumar <kyteinsky@gmail.com>
Signed-off-by: Anupam Kumar <kyteinsky@gmail.com>
@kyteinsky

verified again that the index actually does improve things:
before:

ccb=# EXPLAIN ANALYZE DELETE FROM docs WHERE source_id = 'files__default: 100';
                                                            QUERY PLAN                                                             
-----------------------------------------------------------------------------------------------------------------------------------
 Delete on docs  (cost=0.42..8.44 rows=0 width=0) (actual time=1.434..1.435 rows=0 loops=1)
   ->  Index Scan using source_id_modified_idx on docs  (cost=0.42..8.44 rows=1 width=6) (actual time=0.841..0.844 rows=1 loops=1)
         Index Cond: ((source_id)::text = 'files__default: 100'::text)
 Planning Time: 0.069 ms
 Trigger for constraint access_list_source_id_fkey: time=923.137 calls=1
 Execution Time: 924.593 ms
(6 rows)

after:

ccb=# EXPLAIN ANALYZE DELETE FROM docs WHERE source_id = 'files__default: 101';
                                                            QUERY PLAN                                                             
-----------------------------------------------------------------------------------------------------------------------------------
 Delete on docs  (cost=0.42..8.44 rows=0 width=0) (actual time=0.098..0.099 rows=0 loops=1)
   ->  Index Scan using source_id_modified_idx on docs  (cost=0.42..8.44 rows=1 width=6) (actual time=0.062..0.065 rows=1 loops=1)
         Index Cond: ((source_id)::text = 'files__default: 101'::text)
 Planning Time: 0.205 ms
 Trigger for constraint access_list_source_id_fkey: time=0.414 calls=1
 Execution Time: 0.563 ms
(6 rows)

the time difference to pay attention to is in the line "Trigger for constraint access_list_source_id_fkey: time", i.e. the cascade into access_list:
without index: 923.137 ms
with index: 0.414 ms
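
The comparison can also be pulled out of EXPLAIN ANALYZE output programmatically. The sketch below is only an illustration: `trigger_times` is a hypothetical helper, and the regular expression assumes the "Trigger for constraint ...: time=... calls=..." line format seen in the two plans above.

```python
import re

# Matches lines such as:
#   Trigger for constraint access_list_source_id_fkey: time=923.137 calls=1
TRIGGER_RE = re.compile(
    r"Trigger for constraint (?P<name>[^\s:]+): "
    r"time=(?P<ms>[0-9.]+) calls=(?P<calls>\d+)"
)


def trigger_times(plan_text: str) -> dict:
    """Map constraint name -> trigger time in ms for one EXPLAIN ANALYZE output."""
    return {m["name"]: float(m["ms"]) for m in TRIGGER_RE.finditer(plan_text)}


before = " Trigger for constraint access_list_source_id_fkey: time=923.137 calls=1"
after = " Trigger for constraint access_list_source_id_fkey: time=0.414 calls=1"

fkey = "access_list_source_id_fkey"
speedup = trigger_times(before)[fkey] / trigger_times(after)[fkey]
print(f"{fkey}: {speedup:.0f}x faster with the index")
```

With the two numbers above, the foreign-key trigger is roughly 2,230 times faster once the index exists.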

index can be manually created like so: CREATE INDEX idx_access_list_source_id ON access_list (source_id);
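
For existing installations, the same statement could be made safe to run more than once. A minimal sketch, assuming a connection object that exposes `execute` and `commit`; `ensure_source_id_index` is a hypothetical helper, not part of this PR, and the in-memory SQLite database is only a stand-in so the snippet runs without a PostgreSQL server. `CREATE INDEX IF NOT EXISTS` is accepted by both PostgreSQL (9.5 and later) and SQLite.

```python
import sqlite3

# Same index as the manual statement above, but idempotent.
CREATE_INDEX_SQL = (
    "CREATE INDEX IF NOT EXISTS idx_access_list_source_id "
    "ON access_list (source_id)"
)


def ensure_source_id_index(conn) -> None:
    """Create the access_list.source_id index if it does not exist yet."""
    conn.execute(CREATE_INDEX_SQL)
    conn.commit()


# Stand-in database: a bare access_list table without the index.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE access_list (id INTEGER PRIMARY KEY, source_id TEXT)")

ensure_source_id_index(conn)
ensure_source_id_index(conn)  # second call is a no-op, not an error

index_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'index' AND tbl_name = 'access_list'"
    )
]
```

Running it at startup or from a migration step would cover installations whose tables were created before `index=True` was added to the column.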
