I have quite a large table with 19 000 000 records, and I have a problem with duplicate rows. There are a lot of similar questions, even here on SO, but none of them seems to give me a satisfactory answer. Some points to consider:
- Row uniqueness is determined by two columns, `location_id` and `datetime`.
- I'd like to keep the execution time as short as possible (< 1 hour).
- Copying tables is not very feasible as the table is several gigabytes in size.
- No need to worry about relations.
As said, every `location_id` can have only one distinct `datetime`, and I would like to remove all the duplicate instances. It does not matter which one of them survives, as the data is identical.
Any ideas?
I think you can use this query to delete the duplicate records from the table
Before doing this, test it on some sample data first, and only then try it on the real table.
Note: On version 5.5, it works on MyISAM but not InnoDB.
This way you keep the row with the lower datetime. I'm not sure about performance; it depends on your table's columns, your server, etc.
This query works in every case I tested (engine: MyISAM, 2 million rows):
ALTER IGNORE TABLE table_name ADD UNIQUE (location_id, datetime)
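As far as I know, the `IGNORE` clause of `ALTER TABLE` was deprecated in MySQL 5.6.17 and removed in 5.7.4, so the statement above is rejected on newer servers. A rough sketch of a commonly used substitute is to fill a fresh table that already has the unique key, using `INSERT IGNORE`. The names `readings`, `readings_dedup`, `readings_old` and `uq_location_datetime` are placeholders I made up, and this does copy the table, which the question wanted to avoid:

```sql
-- Hypothetical table name `readings`; replace it with the real table.
-- 1. Empty copy with the same columns and indexes.
CREATE TABLE readings_dedup LIKE readings;

-- 2. Add the unique key while the copy is still empty.
ALTER TABLE readings_dedup
  ADD UNIQUE KEY uq_location_datetime (location_id, `datetime`);

-- 3. Copy the rows; INSERT IGNORE skips any row that would violate the key,
--    so one row per (location_id, datetime) pair survives.
INSERT IGNORE INTO readings_dedup
SELECT * FROM readings;

-- 4. Swap the tables in one statement, keeping the original as a backup.
RENAME TABLE readings TO readings_old,
             readings_dedup TO readings;

-- 5. Only after checking the result:
-- DROP TABLE readings_old;
```

Which of the duplicate rows survives is not controlled here, which should be fine when the duplicates are identical.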
You can delete duplicates using these steps:
1- Export the following query's results into a txt file:
2- Add this to the beginning of the above txt file and run the final query:
Please note that '...' is the contents of the txt file created in the first step.
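The queries for the two steps are not shown above. Purely as an illustration, and not necessarily what was meant, the idea could look like this, assuming the table has an auto-increment primary key `id` (the question does not say so) and using `readings` as a made-up table name:

```sql
-- Step 1 (assumed form): list the ids of the rows to remove, i.e. every row
-- that has a twin with the same (location_id, datetime) and a lower id.
-- Export this result as a comma-separated list.
SELECT DISTINCT t1.id
FROM readings AS t1
JOIN readings AS t2
  ON  t1.location_id = t2.location_id
  AND t1.`datetime`  = t2.`datetime`
  AND t1.id > t2.id;

-- Step 2 (assumed form): put the DELETE in front of the exported list.
-- '...' stands for the contents of the file from step 1 and is left as is.
DELETE FROM readings
WHERE id IN ( ... );
```

A non-unique index on `(location_id, datetime)` would probably help the self-join a lot on a table of this size, and a very long id list may need to be split into batches.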