I have a huge table - 36 million rows - in SQLite3.
In this very large table, there are two columns
- hash - text
- d - real
However, some of the rows are duplicates. That is, both hash and d have the same values.
Also, if two hashes are identical, so are the values of d, but two identical ds does not imply two identical hashes
Anyway, I want to delete the duplicate rows. I don't have a primary key column. What's the fastest way to do this?
EDIT: delete from dist where rowid not in (select max(rowid) from dist group by hash);
Appears to do the trick.
I guess the fastest would be to use the very database for it: add a new table with the same columns, but with proper constraints (a unique index on hash/real pair?), iterate through the original table and try to insert records in the new table, ignoring constraint violation errors (i.e. continue iterating when exceptions are raised).
Then delete the old table and rename the new to the old one.
You need a way to distinguish the rows. Based on your comment, you could use the special rowid column for that.
To delete duplicates by keeping the lowest
rowid
per(hash,d)
:If adding a primary key is not an option, then one approach would be to store the duplicates DISTINCT in a temp table, delete all of the duplicated records from the existing table, and then add the records back into the original table from the temp table.
For example (written for SQL Server 2008, but the technique is the same for any database):
I'm not sure if sqlite has a
ROW_NUMBER()
type function, but if it does you could also try some of the approaches listed here: Delete duplicate records from a SQL table without a primary key