I need to remove duplicated paragraphs in a text with many paragraphs.
I use functions from the class java.security.MessageDigest
to calculate each paragraph's MD5 hash value, and then add these hash value into a Set
.
If add()
'ed successfully, it means the latest paragraph is a duplicate one.
Is there any risk of this way?
Except String.equals()
, is there any other way to do it?
Before hashing you could normalize the paragraphs e.g. Removing punctuation, conversion to lower case and removing additional whitespace. After normalizing, paragraphs that only differ there would get the same hash.
There's no need to calculate the MD5 hash, just use a
HashSet
and try to put the strings itself into this set. This will use theString#hashCode()
method to compute a hash value for the String and check if it's already in the set.Using a
LinkedHashSet
even keeps the original order of the paragraphs.I think this is a good way. However, there are some things to keep in mind:
As others have suggested, you should be aware that minute differences in punctuation, white space, line breaks etc. may render your hashes different for paragraphs that are essentially the same.
Perhaps you should consider a less brittle metric, such as eg. the Cosine Similarity which is well suited for matching paragraphs.
Cheers,
If the MD5 hash is not yet in the set, it means the paragraph is unique. But the opposite is not true. So if you find that the hash is already in the set, you could potentially have a non-duplicate with the same hash value. This would be very unlikely, but you'll have to test that paragraph against all others to be sure. For that String.equals would do.
Moreover, you should very well consider what you call unique (regarding typo's, whitespaces, capitals, and so on), but that would be the case with any method.