How to efficiently delete expired rows from a larg

2019-06-01 04:33发布

问题:

I have an extremely large table from which I would like to delete old rows. Example of table:

 | customer_id | first_purchase_date | last_purchase_date |
 |<primary key>|                     |   <index>          |

** I am using this example table for argument's sake. The table in question is not a customer table. The real table has grown to 28 GB in size over the past 2 months and is used to calculate something that requires only 2 weeks of historical data.

What I would like to do is delete customers from this table that have not purchased anything in the past year. I.e. delete from table where last_purchase_date < now() - interval 1 year;

Simpy deleting like this would be too costly on the database. I know that a partition can be used to truncate the old rows but I am not sure how to implement it effectively.

In addition, if a customer were to purchase something then that row could potentially move to a different partition by updating the last_purchase_date. Would this not be expensive as well?

Thank you in advance for any direction!

回答1:

You are right in thinking that partitioning is the way forward, because:

Data that loses its usefulness can often be easily removed from a partitioned table by dropping the partition (or partitions) containing only that data. Conversely, the process of adding new data can in some cases be greatly facilitated by adding one or more new partitions for storing specifically that data.

And if this doesn't work for you, it's still possible to

In addition, MySQL 5.7 supports explicit partition selection for queries. For example, SELECT * FROM t PARTITION (p0,p1) WHERE c < 5 selects only those rows in partitions p0 and p1 that match the WHERE condition. In this case, MySQL does not check any other partitions of table t; this can greatly speed up queries when you already know which partition or partitions you wish to examine. Partition selection is also supported for the data modification statements DELETE, INSERT, REPLACE, UPDATE, and LOAD DATA, LOAD XML.

Since you want to delete stuff based on date and not the primary key, what you need is a RANGE partition scheme.

First find the oldest date and create partitions based on that

ALTER TABLE sales
    PARTITION BY RANGE( TO_DAYS(last_purchase_date)) (
    PARTITION p0 VALUES LESS THAN (TO_DAYS('2018-12-31')),
    PARTITION p1 VALUES LESS THAN (TO_DAYS('2017-12-31')),
    PARTITION p2 VALUES LESS THAN (TO_DAYS('2016-12-31')),
    PARTITION p3 VALUES LESS THAN (TO_DAYS('2015-12-31')),
    ..
    PARTITION p10 VALUES LESS THAN MAXVALUE));

Choose an appropriate number of partitions but don't worry too much because you can always change partitions later. When you partition, you might even find that the delete step isn't really needed after all.