My code pulls ~1000 HTML files, extracts the relevant information, and then stores it in a MySQL TEXT field (it is usually quite long). I am looking for a way to prevent duplicate entries in the DB.
My first idea is to add a hash column to the table (probably MD5), pull the list of hashes at the beginning of each run, and check against it before inserting into the DB.
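Something like this is what I have in mind for the hash approach (just a sketch; the table/column names are made up, and I'm using plain PDO here rather than Kohana's query builder):

    $pdo = new PDO('mysql:host=localhost;dbname=scrape', 'user', 'pass');

    // load all existing hashes once at the start of the run
    $existing = $pdo->query('SELECT hash FROM pages')->fetchAll(PDO::FETCH_COLUMN);
    $existing = array_flip($existing);   // flip so isset() gives O(1) lookups

    // $content holds the extracted text for one HTML file
    $hash = md5($content);
    if (!isset($existing[$hash])) {
        $stmt = $pdo->prepare('INSERT INTO pages (hash, content) VALUES (?, ?)');
        $stmt->execute(array($hash, $content));
        $existing[$hash] = true;         // remember it for the rest of this run
    }

I suppose a CHAR(32) hash column with a UNIQUE index would also let MySQL reject duplicates on its own, even without pre-loading the list.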
My second idea is to store the file length (bytes or chars, whatever), index that column, and check for duplicate lengths, double-checking the content only when a duplicate length is found.
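For the length idea, roughly this (again, names are placeholders, reusing $pdo and $content from above):

    // content_length would be an indexed INT column holding strlen(content)
    $stmt = $pdo->prepare('SELECT content FROM pages WHERE content_length = ?');
    $stmt->execute(array(strlen($content)));

    $duplicate = false;
    foreach ($stmt->fetchAll(PDO::FETCH_COLUMN) as $candidate) {
        if ($candidate === $content) {   // full comparison only on a length collision
            $duplicate = true;
            break;
        }
    }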
I have no idea which solution is better performance-wise. Perhaps there is a better way?
If there were an efficient way to check whether two files are >95% similar, that would be ideal, but I doubt there is?
Thanks for any help!
BTW I am using PHP5/Kohana
EDIT:
Just had an idea for checking similarity: I could count all alphanumeric characters and log the occurrences of each,
eg: 17aB... = 1a,7b,10c,27c,...
A potential problem would be the upper limit on the number of distinct characters to count (62 alphanumerics?).
I imagine false positives would still be rare...
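Roughly what I'm picturing, as a sketch (count_chars() should do most of the work, assuming the content is mostly ASCII):

    // build a fingerprint like "1a,7b,10c,..." from per-character counts
    function fingerprint($content)
    {
        $counts = count_chars($content, 1);      // byte value => occurrences (only bytes present)
        $parts  = array();
        foreach ($counts as $byte => $n) {
            $char = chr($byte);
            if (ctype_alnum($char)) {            // keep only the 62 alphanumeric characters
                $parts[] = $n . $char;
            }
        }
        return implode(',', $parts);
    }

Two files could then be compared count-by-count to estimate how similar they are.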
good idea/bad idea?