I need to delete the majority (say, 90%) of a very large table (say, 5m rows). The other 10% of this table is frequently read, but not written to.
From "Best way to delete millions of rows by ID", I gather that I should remove any index on the 90% I'm deleting, to speed up the process (except an index I'm using to select the rows for deletion).
From "PostgreSQL locking mode", I see that this operation will acquire a ROW EXCLUSIVE
lock on the entire table. But since I'm only reading the other 10%, this ought not matter.
So, is it safe to delete everything in one command (i.e. DELETE FROM table WHERE delete_flag = 't')? I'm worried that if the deletion of one row fails, triggering an enormous rollback, it will affect my ability to read from the table. Would it be wiser to delete in batches?
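For example, I imagine each batch would look something like the following, repeated until it deletes zero rows (tbl stands in for my table, and the batch size is arbitrary; I'd use a ctid subquery because Postgres has no DELETE ... LIMIT):

DELETE FROM tbl
WHERE ctid IN (
    SELECT ctid
    FROM   tbl
    WHERE  delete_flag = 't'
    LIMIT  50000
);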
Indexes are useless for an operation that touches 90% of all rows; a sequential scan will be faster in any case.
If you need to allow concurrent reads, you cannot take an exclusive lock on the table, so you cannot drop any indexes in the same transaction as the DELETE, either.
You could drop each index in a separate, short transaction, so the brief exclusive lock that DROP INDEX takes is held only for a moment, and later use CREATE INDEX CONCURRENTLY to rebuild the index in the background without blocking reads.
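A minimal sketch of that sequence, with tbl and foo_idx as placeholder names:

DROP INDEX foo_idx;  -- its own short transaction: the exclusive lock is held only briefly

DELETE FROM tbl WHERE delete_flag = 't';  -- the big delete runs with one less index to maintain

CREATE INDEX CONCURRENTLY foo_idx ON tbl (some_id);  -- rebuilds in the background; note it cannot run inside a transaction block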
If you have a stable condition to identify the 10% of rows that stay, I would strongly suggest a partial index on just those rows, to get the best of both worlds:
- Reading queries can access the table quickly (using the partial index) at all times.
- The big DELETE is not going to modify the partial index at all, since none of its rows are involved in the DELETE.
CREATE INDEX foo ON tbl (some_id) WHERE delete_flag = FALSE;
Assuming delete_flag is boolean (and tbl stands for your table). You have to include the same predicate in your queries (even if it seems logically redundant) to make sure Postgres understands it can use the partial index.