Deleting millions of rows in MySQL

2019-01-21 04:56发布

I recently found and fixed a bug in a site I was working on that resulted in millions of duplicate rows of data in a table that will be quite large even without them (still in the millions). I can easily find these duplicate rows and can run a single delete query to kill them all. The problem is that trying to delete this many rows in one shot locks up the table for a long time, which I would like to avoid if possible. The only ways I can see to get rid of these rows, without taking down the site (by locking up the table) are:

  1. Write a script that will execute thousands of smaller delete queries in a loop. This will theoretically get around the locked table issue because other queries will be able to make it into the queue and run in between the deletes. But it will still spike the load on the database quite a bit and will take a long time to run.
  2. Rename the table and recreate the existing table (it'll now be empty). Then do my cleanup on the renamed table. Rename the new table, name the old one back and merge the new rows into the renamed table. This is way takes considerably more steps, but should get the job done with minimal interruption. The only tricky part here is that the table in question is a reporting table, so once it's renamed out of the way and the empty one put in its place all historic reports go away until I put it back in place. Plus the merging process could be a bit of a pain because of the type of data being stored. Overall this is my likely choice right now.

I was just wondering if anyone else has had this problem before and, if so, how you dealt with it without taking down the site and, hopefully, with minimal if any interruption to the users? If I go with number 2, or a different, similar, approach, I can schedule the stuff to run late at night and do the merge early the next morning and just let the users know ahead of time, so that's not a huge deal. I'm just looking to see if anyone has any ideas for a better, or easier, way to do the cleanup.

9条回答
Luminary・发光体
2楼-- · 2019-01-21 05:00

According to the mysql documentation, TRUNCATE TABLE is a fast alternative to DELETE FROM. Try this:

TRUNCATE TABLE table_name

I tried this on 50M rows and it was done within two mins.

Note: Truncate operations are not transaction-safe; an error occurs when attempting one in the course of an active transaction or active table lock

查看更多
迷人小祖宗
3楼-- · 2019-01-21 05:05

I faced a similar problem. We had a really big table, about 500 GB in size with no partitioning and one only one index on the primary_key column. Our master was a hulk of a machine, 128 cores and 512 Gigs of RAM and we had multiple slaves too. We tried a few techniques to tackle the large-scale deletion of rows. I will list them all here from worst to best that we found-

  1. Fetching and Deleting one row at a time. This is the absolute worst that you could do. So, we did not even try this.
  2. Fetching first 'X' rows from the database using a limit query on the primary_key column, then checking the row ids to delete in the application and firing a single delete query with a list of primary_key ids. So, 2 queries per 'X' rows. Now, this approach was fine but doing this using a batch job deleted about 5 million rows in 10 minutes or so, due to which the slaves of our MySQL DB were lagged by 105 seconds. 105-second lag in 10-minute activity. So, we had to stop.
  3. In this technique, we introduced a 50 ms lag between our subsequent batch fetch and deletions of size 'X' each. This solved the lag problem but we were now deleting 1.2-1.3 million rows per 10 minutes as compared to 5 million in technique #2.
  4. Partitioning the database table and then deleting the entire partitions when not needed. This is the best solution we have but it requires a pre-partitioned table. We followed step 3 because we had a non-partitioned very old table with only indexing on the primary_key column. Creating a partition would have taken too much time and we were in a crisis mode. Here are some links related to partitioning that I found helpful- Official MySQL Reference, Oracle DB daily partitioning.

So, IMO, if you can afford to have the luxury of creating a partition in your table, go for the option #4, otherwise, you are stuck with option #3.

查看更多
SAY GOODBYE
4楼-- · 2019-01-21 05:08

I'd use mk-archiver from the excellent Maatkit utilities package (a bunch of Perl scripts for MySQL management) Maatkit is from Baron Schwartz, the author of the O'Reilly "High Performance MySQL" book.

The goal is a low-impact, forward-only job to nibble old data out of the table without impacting OLTP queries much. You can insert the data into another table, which need not be on the same server. You can also write it to a file in a format suitable for LOAD DATA INFILE. Or you can do neither, in which case it's just an incremental DELETE.

It's already built for archiving your unwanted rows in small batches and as a bonus, it can save the deleted rows to a file in case you screw up the query that selects the rows to remove.

No installation required, just grab http://www.maatkit.org/get/mk-archiver and run perldoc on it (or read the web site) for documentation.

查看更多
再贱就再见
5楼-- · 2019-01-21 05:12

I had a use case of deleting 1M+ rows in the 25M+ rows Table in the MySQL. Tried different approaches like batch deletes (described above).
I've found out that the fastest way (copy of required records to new table):

  1. Create Temporary Table that holds just ids.

CREATE TABLE id_temp_table ( temp_id int);

  1. Insert ids that should be removed:

insert into id_temp_table (temp_id) select.....

  1. Create New table table_new

  2. Insert all records from table to table_new without unnecessary rows that are in id_temp_table

insert into table_new .... where table_id NOT IN (select distinct(temp_id) from id_temp_table);

  1. Rename tables

The whole process took ~1hr. In my use case simple delete of batch on 100 records took 10 mins.

查看更多
贪生不怕死
6楼-- · 2019-01-21 05:13

Do it in batches of lets say 2000 rows at a time. Commit in-between. A million rows isn't that much and this will be fast, unless you have many indexes on the table.

查看更多
够拽才男人
7楼-- · 2019-01-21 05:16

I'd also recommend adding some constraints to your table to make sure that this doesn't happen to you again. A million rows, at 1000 per shot, will take 1000 repetitions of a script to complete. If the script runs once every 3.6 seconds you'll be done in an hour. No worries. Your clients are unlikely to notice.

查看更多
登录 后发表回答