What is the best way to check for duplicate TEXT f

My code pulls ~1000 HTML files, extracts the relevant information, and then stores it in a MySQL TEXT field (as it is usually quite long). I am looking for a way to prevent duplicate entries in the DB.

My first idea is to add a HASH field to the table (probably MD5), pull the hash list at the beginning of each run & check for duplicates before inserting into the DB.
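A minimal sketch of this first idea, written in Python rather than PHP purely to keep it short (PHP's md5() would take the place of hashlib.md5). The function names and the idea of loading the hash list into a set are assumptions for illustration, not part of any existing code.

```python
import hashlib


def md5_hex(text):
    """Return the 32-character hex MD5 digest of the UTF-8 encoded text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def filter_new(texts, known_hashes):
    """Return texts whose hash is not yet known; known_hashes is updated in place."""
    fresh = []
    for text in texts:
        digest = md5_hex(text)
        if digest in known_hashes:
            continue  # already stored, or repeated earlier in this run
        known_hashes.add(digest)
        fresh.append(text)
    return fresh


# In practice the set would be loaded once per run from the hash column.
known = {md5_hex("already stored page")}
print(filter_new(["already stored page", "new page", "new page"], known))
```

Set membership is a constant-time check on average, so the cost per file is dominated by hashing the text itself.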

My second idea is to store the file length (in bytes or characters), index that, and check for duplicate lengths, double-checking the content only when a duplicate length is found.
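The length check could look roughly like the sketch below (Python, in memory; the class name LengthIndex is made up for illustration). Against a real table, the "compare content" step would mean fetching the full TEXT of every row with the same length, which is the main cost of this approach.

```python
from collections import defaultdict


class LengthIndex:
    """Group stored texts by length; compare content only on a length match."""

    def __init__(self):
        self._by_length = defaultdict(list)

    def is_duplicate(self, text):
        # Only texts of exactly the same length can be identical.
        candidates = self._by_length.get(len(text), ())
        return any(text == other for other in candidates)

    def add(self, text):
        """Store the text unless it is a duplicate; return True if stored."""
        if self.is_duplicate(text):
            return False
        self._by_length[len(text)].append(text)
        return True


index = LengthIndex()
print(index.add("abc"), index.add("abd"), index.add("abc"))
```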

No idea what is the best solution performance-wise. Perhaps there is a better way?

If there is an efficient way to check if files are >95% similar that would be ideal, but I doubt there is?
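For reference, a direct (not especially efficient) way to express ">95% similar" is a pairwise ratio. The sketch below uses Python's standard difflib.SequenceMatcher; the 0.95 threshold comes from the question, while the function name is an assumption. Comparing every new file against every stored one this way grows quadratically, so it would only be practical on a small candidate set.

```python
import difflib


def is_near_duplicate(a, b, threshold=0.95):
    """True if the two texts have a similarity ratio of at least `threshold`."""
    # autojunk=False keeps the ratio predictable for long inputs.
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return matcher.ratio() >= threshold


original = "the quick brown fox jumps over the lazy dog"
edited = "the quick brown fox jumps over the hazy dog"  # one character changed
print(is_near_duplicate(original, edited))
print(is_near_duplicate(original, "something else entirely"))
```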

Thanks for any help!

BTW I am using PHP5/Kohana

EDIT:

Just had an idea on checking for similarity: I could count all alphanumeric characters and log the number of occurrences of each.

eg: 17aB... = 1a,7b,10c,27c,...

potential problem would be the upper limit for a char count (around 61?)

I imagine false positives would still be rare . . .

good idea/bad idea?
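A small sketch of the character-count idea, in Python for brevity. It assumes "alphanumeric" means a-z, A-Z and 0-9 (62 distinct symbols, case-sensitive); the function names are invented for the example. It also shows where false positives come from: any two texts that use the same characters in a different order get the same signature.

```python
import string
from collections import Counter

ALNUM = set(string.ascii_letters + string.digits)  # 62 symbols


def alnum_signature(text):
    """Count each ASCII alphanumeric character, ignoring everything else."""
    return Counter(ch for ch in text if ch in ALNUM)


def count_distance(a, b):
    """Sum of absolute per-character count differences between two texts."""
    ca, cb = alnum_signature(a), alnum_signature(b)
    # Counter returns 0 for characters missing from one side.
    return sum(abs(ca[ch] - cb[ch]) for ch in ca.keys() | cb.keys())


print(count_distance("listen", "silent"))  # same letters, different order
print(count_distance("abc", "abd"))
```

A distance of 0 therefore only means "same character counts", so a full content comparison would still be needed before treating two records as duplicates.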

Tags: php mysql hash duplicates

2 Answers

神经病院院长

#2 · 2019-06-26 11:33

That sounds pretty good; I have implemented something similar. The hash field should be a unique key, since duplicates are not allowed.

If each text record is long, you could compute a fixed number of hashes (say 2) per record. Then, if even one of them matches, that may be close enough. Obviously, the more hashes you have per record, the closer you get to comparing the full text.
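One possible reading of the multiple-hash idea, sketched in Python: split each text into a fixed number of equal-sized parts, hash each part, and call two records "close" if any part at the same position matches. The splitting scheme and function names are assumptions, not something the answer specifies.

```python
import hashlib


def chunk_hashes(text, parts=2):
    """Split the UTF-8 bytes into `parts` equal-sized chunks and MD5 each one."""
    data = text.encode("utf-8")
    size = max(1, -(-len(data) // parts))  # ceiling division, at least 1 byte
    chunks = [data[i * size:(i + 1) * size] for i in range(parts)]
    # Empty chunks (very short texts) get None so they never count as a match.
    return [hashlib.md5(c).hexdigest() if c else None for c in chunks]


def shares_chunk(a, b, parts=2):
    """True if any chunk at the same position has the same hash in both texts."""
    pairs = zip(chunk_hashes(a, parts), chunk_hashes(b, parts))
    return any(x is not None and x == y for x, y in pairs)


first = "A" * 100 + "B" * 100
second = "A" * 100 + "C" * 100  # same first half, different second half
print(shares_chunk(first, second))
```

Note that fixed-size chunks are sensitive to length changes: inserting a single character near the start shifts every chunk boundary, so this only catches texts that share an aligned section.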

MD5 digests are 16 bytes each (32 characters when stored as hex). How many hashes will there be over time? If this number stays reasonable, you should be okay doing the comparison in memory.
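As a rough back-of-the-envelope check (Python; the record counts are illustrative, not taken from the question), the raw digest data stays small even for large tables. Real in-memory structures such as PHP arrays add per-entry overhead on top of this figure.

```python
import hashlib

DIGEST_BYTES = hashlib.md5().digest_size  # 16 bytes for MD5


def raw_hash_bytes(records, hashes_per_record=1):
    """Raw size of all digests, excluding any container overhead."""
    return records * hashes_per_record * DIGEST_BYTES


print(raw_hash_bytes(1_000))         # one run of ~1000 files
print(raw_hash_bytes(1_000_000, 2))  # a million records, 2 hashes each
```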


女痞

#3 · 2019-06-26 11:40

The hash idea is probably the best. You might have collisions, but they would be exceedingly rare.

Make the hash field a unique key for the table, and catch the duplicate-key error. Alternatively, use INSERT IGNORE (which skips the duplicate row) or REPLACE (which overwrites it).
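A runnable sketch of the unique-key approach, using Python's built-in sqlite3 as a stand-in for MySQL so it works without a server. The table and column names are made up. The SQL dialect differs slightly: SQLite spells it INSERT OR IGNORE where MySQL uses INSERT IGNORE, and in MySQL the duplicate-key error to catch is code 1062.

```python
import hashlib
import sqlite3


def make_db():
    """Create an in-memory table with a UNIQUE constraint on the hash column."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pages ("
        "id INTEGER PRIMARY KEY, hash TEXT NOT NULL UNIQUE, body TEXT NOT NULL)"
    )
    return conn


def insert_page(conn, body):
    """Insert and catch the duplicate error; return True if a row was stored."""
    digest = hashlib.md5(body.encode("utf-8")).hexdigest()
    try:
        conn.execute("INSERT INTO pages (hash, body) VALUES (?, ?)", (digest, body))
        return True
    except sqlite3.IntegrityError:
        return False  # the unique key rejected a duplicate hash


def insert_page_ignore(conn, body):
    """Let the database silently skip duplicates instead of raising."""
    digest = hashlib.md5(body.encode("utf-8")).hexdigest()
    conn.execute(
        "INSERT OR IGNORE INTO pages (hash, body) VALUES (?, ?)", (digest, body)
    )


db = make_db()
print(insert_page(db, "hello"), insert_page(db, "hello"))
```

Either way the database enforces uniqueness, so there is no need to load the full hash list before each run.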
