I need to remove duplicated paragraphs in a text with many paragraphs.
I use functions from the class java.security.MessageDigest
to calculate each paragraph's MD5 hash value, and then add these hash value into a Set
.
If add()
'ed successfully, it means the latest paragraph is a duplicate one.
Is there any risk of this way?
Except String.equals()
, is there any other way to do it?
Before hashing you could normalize the paragraphs e.g. Removing punctuation, conversion to lower case and removing additional whitespace.
After normalizing, paragraphs that only differ there would get the same hash.
If the MD5 hash is not yet in the set, it means the paragraph is unique. But the opposite is not true. So if you find that the hash is already in the set, you could potentially have a non-duplicate with the same hash value. This would be very unlikely, but you'll have to test that paragraph against all others to be sure. For that String.equals would do.
Moreover, you should very well consider what you call unique (regarding typo's, whitespaces, capitals, and so on), but that would be the case with any method.
There's no need to calculate the MD5 hash, just use a HashSet
and try to put the strings itself into this set. This will use the String#hashCode()
method to compute a hash value for the String and check if it's already in the set.
public Set removeDuplicates(String[] paragraphs) {
Set<String> set = new LinkedHashSet<String>();
for (String p : paragraphs) {
set.add(p);
}
return set;
}
Using a LinkedHashSet
even keeps the original order of the paragraphs.
As others have suggested, you should be aware that minute differences in punctuation, white space, line breaks etc. may render your hashes different for paragraphs that are essentially the same.
Perhaps you should consider a less brittle metric, such as eg. the Cosine Similarity which is well suited for matching paragraphs.
Cheers,
I think this is a good way. However, there are some things to keep in mind:
- Please note that calculating a hash is a heavy operation. This could render your program slow, if you had to repeat it for millions of paragraphs.
- Even in this way, you could end up with slightly different paragraphs (with typos, for examplo) going undetecetd. If this is the case, you should normalize the paragraphs before calculaing the hash (putting it into lower case, removing extra-spaces and so on).