What is the best way to check for duplicate TEXT fields?

Posted 2019-06-26 11:45

Question:

My code pulls ~1000 HTML files, extracts the relevant information, & stores it in a MySQL TEXT field (as it is usually quite long). I am looking for a way to prevent duplicate entries in the DB.

My first idea is to add a HASH field to the table (probably MD5), pull the hash list at the beginning of each run & check for duplicates before inserting into the DB.
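As a rough sketch of that first idea (the `pages` table, `hash` column, and query style here are made up; Kohana's actual query builder API will differ):

```php
<?php
// Pull the existing hashes once at the start of the run
// (hypothetical table/column names).
$known = array();
foreach ($db->query('SELECT hash FROM pages') as $row) {
    $known[$row['hash']] = true;
}

// ...then skip any extracted text whose MD5 is already present.
$hash = md5($text);
if (!isset($known[$hash])) {
    $db->query('INSERT INTO pages (hash, content) VALUES (?, ?)',
               array($hash, $text));
    $known[$hash] = true; // remember it for the rest of this run
}
```

With ~1000 files the whole hash list fits comfortably in a PHP array, so the per-file check is a constant-time `isset()`.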

My second idea is to store the file length (bytes or chars or whatever), index that, & check for duplicate file lengths, double-checking content if a duplicate length is found.

I have no idea which is the best solution performance-wise. Perhaps there is a better way?

If there is an efficient way to check if files are >95% similar that would be ideal, but I doubt there is?

Thanks for any help!

BTW I am using PHP5/Kohana


EDIT:

Just had an idea on checking for similarity: I could count all alphanumeric characters & log the occurrence of each

eg: 17aB... = 1a,7b,10c,27c,...

A potential problem would be the upper limit for a char count (around 61?)

I imagine false positives would still be rare...

good idea/bad idea?
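The character-count idea above can be sketched like this (a rough illustration, not a tested implementation; `char_signature` is a made-up helper name, and this counts bytes, so it assumes single-byte text):

```php
<?php
// Build a character-frequency "signature" for rough similarity checks.
// count_chars() with mode 1 returns, for each byte value that occurs
// at least once, how many times it occurs.
function char_signature($text) {
    $sig = array();
    foreach (count_chars($text, 1) as $byte => $count) {
        $char = chr($byte);
        if (ctype_alnum($char)) {   // keep only a-z, A-Z, 0-9
            $sig[$char] = $count;
        }
    }
    return $sig;
}
```

Two records could then be compared by summing the differences between their signatures. Note that two texts with the same character counts in a different order would still collide, which is where the false positives come from.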

Answer 1:

The hash idea is probably the best. You might have collisions, but they would be exceedingly rare.

Make the hash field a unique key for the table and catch the duplicate-key error code. Or use INSERT IGNORE or REPLACE.
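In MySQL terms, that could look like the following (table and column names are just placeholders):

```sql
-- Hypothetical schema: store the MD5 as fixed-width hex and enforce uniqueness.
CREATE TABLE pages (
    id      INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    hash    CHAR(32) NOT NULL,          -- MD5 digest as hex
    content TEXT NOT NULL,
    UNIQUE KEY uniq_hash (hash)
);

-- Duplicates are then silently skipped...
INSERT IGNORE INTO pages (hash, content) VALUES (MD5(?), ?);

-- ...or the existing row is replaced instead:
REPLACE INTO pages (hash, content) VALUES (MD5(?), ?);
```

The unique key makes the database itself the last line of defence, so the application-side hash list is just an optimization.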



Answer 2:

That sounds pretty good; I have implemented something similar. The hash field should be a unique key, since duplicates are not allowed.

If each text record is long, you could compute a fixed number of hashes per record (say 2), one per chunk of the text. Then if just one of them matches, that may be close enough. Obviously, the more hashes you have per record, the closer you get to comparing the full text.
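A sketch of that chunked-hash idea (the helper name and the chunk count of 2 are assumptions, not anything from the answer):

```php
<?php
// Split the text into $chunks roughly equal pieces and hash each one.
// Two records sharing any chunk hash can be flagged as near-duplicates.
function chunk_hashes($text, $chunks = 2) {
    $len = max(1, (int) ceil(strlen($text) / $chunks));
    $hashes = array();
    foreach (str_split($text, $len) as $chunk) {
        $hashes[] = md5($chunk);
    }
    return $hashes;
}
```

Note this only catches near-duplicates whose differences are confined to one chunk; an insertion early in the text shifts every later chunk boundary and defeats the comparison.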

MD5 hashes are 16 bytes each. How many hashes will accumulate over time? If that number stays reasonable, you should be okay doing the comparison in memory.