I need to delete about 2 million rows from my PG database. I have a list of IDs that I need to delete. However, any way I try to do this is taking days.
I tried putting them in a table and deleting in batches of 100. Four days later, this is still running with only 297268 rows deleted. (I had to select 100 IDs from an ID table, delete WHERE IN that list, then delete those 100 from the IDs table.)
I tried:
DELETE FROM tbl WHERE id IN (select * from ids)
That's taking forever, too. It's hard to gauge how long it will take, since I can't see its progress until it's done, but the query was still running after 2 days.
I'm just looking for the most effective way to delete rows from a table when I know the specific IDs to delete, and there are millions of them.
First make sure you have an index on the ID fields, both in the table you want to delete from and the table you are using for deletion IDs.
100 at a time seems too small. Try 1000 or 10000.
There's no need to delete anything from the deletion-ID table. Add a new column for a batch number, assign batch 1 to the first 1000 IDs, batch 2 to the next 1000, and so on, and make sure the deletion query filters on the batch number.
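A minimal sketch of this batch-number scheme, assuming the deletion IDs live in a table called ids and the target table is called tbl (both names are placeholders, not from the original post):

```sql
-- Add a batch column and assign 1000 IDs per batch:
ALTER TABLE ids ADD COLUMN batch int;

UPDATE ids
SET    batch = t.batch
FROM  (SELECT id,
              (row_number() OVER (ORDER BY id) - 1) / 1000 + 1 AS batch
       FROM ids) t
WHERE  ids.id = t.id;

-- Then delete one batch at a time, no bookkeeping deletes needed:
DELETE FROM tbl WHERE id IN (SELECT id FROM ids WHERE batch = 1);
DELETE FROM tbl WHERE id IN (SELECT id FROM ids WHERE batch = 2);
-- ... and so on.
```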
It all depends ...
Delete all indexes (except the one on the ID, which you need for the delete)
Recreate them afterwards (= much faster than incremental updates to indexes)
Check whether you have triggers that can safely be deleted or disabled temporarily
Do foreign keys reference your table? Can they be dropped? Dropped temporarily?
Depending on your autovacuum settings it may help to run
VACUUM ANALYZE
before the operation. This assumes no concurrent write access to the involved tables, or you may have to lock tables exclusively, or this route may not be for you at all.
Some of the points listed in the related chapter of the manual, Populating a Database, may also be of use, depending on your setup.
If you delete large portions of the table and the rest fits into RAM, the fastest and easiest way is to copy the surviving rows into a temporary table, truncate the original table, and re-insert the survivors. This way you don't have to recreate views, foreign keys or other depending objects. Read about the temp_buffers setting in the manual. This method is fast as long as the table fits into memory, or at least most of it. Be aware that you can lose data if your server crashes in the middle of this operation. You can wrap all of it into a transaction to make it safer.

Run ANALYZE afterwards, or VACUUM ANALYZE if you did not go the truncate route, or VACUUM FULL ANALYZE if you want to bring the table to minimum size. For big tables, consider the alternatives CLUSTER / pg_repack.

For small tables, a simple DELETE instead of TRUNCATE is often faster. Read the Notes section for TRUNCATE in the manual, in particular the caveats Pedro also pointed out in his comment.
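A sketch of what the copy / truncate / re-insert route could look like, assuming the big table is called tbl and the delete list is a table del_list with an id column (both names are placeholders):

```sql
BEGIN;
SET temp_buffers = '1000MB';          -- big enough for the surviving rows;
                                      -- must be set before temp tables are used

-- Keep only rows NOT in the delete list:
CREATE TEMP TABLE tmp AS
SELECT t.*
FROM   tbl t
LEFT   JOIN del_list d USING (id)
WHERE  d.id IS NULL;

TRUNCATE tbl;                         -- fast, but read the manual's notes first
INSERT INTO tbl SELECT * FROM tmp;    -- re-insert the survivors

COMMIT;
ANALYZE tbl;                          -- refresh planner statistics
```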
Two possible answers:
Your table may have lots of constraints or triggers attached to it. When you try to delete a record, each delete incurs extra processor cycles and checks against other tables.
You may need to put this statement inside a transaction.
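If the triggers really are safe to skip, the two points above can be combined: wrap the delete in a transaction and disable triggers for its duration. A sketch (table names are placeholders; note that DISABLE TRIGGER ALL requires sufficient privileges and also disables foreign-key enforcement, so use it with care):

```sql
BEGIN;
-- Skip per-row trigger work during the bulk delete:
ALTER TABLE tbl DISABLE TRIGGER ALL;

DELETE FROM tbl WHERE id IN (SELECT id FROM ids);

ALTER TABLE tbl ENABLE TRIGGER ALL;
COMMIT;
```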
If the table you're deleting from is referenced by some_other_table (and you don't want to drop the foreign keys, even temporarily), make sure you have an index on the referencing column in some_other_table!

I had a similar problem and used auto_explain with auto_explain.log_nested_statements = true, which revealed that the DELETE was actually doing sequential scans on some_other_table. Apparently PostgreSQL tries to lock the referencing rows in the other table (they shouldn't exist, or the delete will fail). After I created indexes on the referencing tables, the delete was orders of magnitude faster.
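For illustration, the diagnostic settings and the fix might look like this (the referencing column some_other_table.tbl_id is hypothetical, not from the original post):

```sql
-- Enable auto_explain for the session and log nested statements, so the
-- FK-enforcement queries fired by the DELETE show up in the server log:
LOAD 'auto_explain';
SET auto_explain.log_min_duration = 0;
SET auto_explain.log_nested_statements = on;

-- The actual fix: index the column that references the table being deleted from.
CREATE INDEX ON some_other_table (tbl_id);
```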
The easiest way to do this would be to drop all your constraints and then do the delete.
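As a sketch, dropping a referencing foreign key for the duration of the delete and restoring it afterwards might look like this (constraint, table and column names are illustrative):

```sql
-- Drop the FK so the delete doesn't pay per-row referential checks:
ALTER TABLE some_other_table
  DROP CONSTRAINT some_other_table_tbl_id_fkey;

DELETE FROM tbl WHERE id IN (SELECT id FROM ids);

-- Restore the constraint; this revalidates existing rows in one pass:
ALTER TABLE some_other_table
  ADD CONSTRAINT some_other_table_tbl_id_fkey
  FOREIGN KEY (tbl_id) REFERENCES tbl (id);
```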
We know the UPDATE/DELETE performance of PostgreSQL is not as strong as Oracle's. When we need to delete millions or tens of millions of rows, it's really difficult and takes a long time.
However, we can still do this in production dbs. The following is my idea:
First, we should create a log table with 2 columns: id and flag (id is the ID you want to delete; flag can be Y or null, with Y signifying the record was successfully deleted).

Later, we create a function that does the delete task 10,000 rows at a time. You can see more details on my blog; though it's in Chinese, you can still get the info you want from the SQL code there.
Make sure the id column of both tables is indexed, as it will run faster.
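A sketch of the batched delete-and-flag function described above, assuming the target table is tbl and the log table is del_log (names are illustrative, not taken from the blog):

```sql
CREATE TABLE del_log (
    id   bigint PRIMARY KEY,   -- IDs to delete
    flag char(1)               -- 'Y' once the row is deleted, NULL otherwise
);

CREATE OR REPLACE FUNCTION delete_batch(batch_size int DEFAULT 10000)
RETURNS int AS $$
DECLARE
    n int;
BEGIN
    -- Pick a batch of not-yet-deleted IDs, delete them from the big
    -- table, and flag them in the log table, all in one statement:
    WITH batch AS (
        SELECT id FROM del_log WHERE flag IS NULL LIMIT batch_size
    ),
    deleted AS (
        DELETE FROM tbl
        WHERE  id IN (SELECT id FROM batch)
        RETURNING id
    )
    UPDATE del_log
    SET    flag = 'Y'
    WHERE  id IN (SELECT id FROM deleted);

    GET DIAGNOSTICS n = ROW_COUNT;
    RETURN n;                  -- rows flagged in this batch
END;
$$ LANGUAGE plpgsql;

-- Call repeatedly (e.g. from a shell loop) until it returns 0:
SELECT delete_batch();
```

Because each call is its own transaction, a crash mid-way loses at most one batch of work, and the flag column shows exactly how far the job has progressed.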