How to compare two paragraphs of text?

I need to remove duplicated paragraphs in a text with many paragraphs.

I use java.security.MessageDigest to calculate each paragraph's MD5 hash, and then add these hash values to a Set.

If add() returns false, the hash was already in the Set, which means the latest paragraph is a duplicate.

Is there any risk in this approach?

Apart from String.equals(), is there any other way to do it?

Tags: java string compare md5 paragraph

5 answers

叛逆

#2 · 2019-06-20 07:23

Before hashing, you could normalize the paragraphs, e.g. by removing punctuation, converting to lower case, and collapsing extra whitespace. After normalizing, paragraphs that differ only in those respects get the same hash.
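A minimal sketch of that normalization step, assuming ASCII punctuation and plain whitespace are the only differences you want to ignore. The class and method names (ParagraphNormalizer, normalize, removeDuplicates) are made up for illustration, and the exact rules are an assumption you would adjust to your own text.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class ParagraphNormalizer {

    // Hypothetical helper: lower-case, drop punctuation, collapse whitespace.
    static String normalize(String paragraph) {
        return paragraph
                .toLowerCase(Locale.ROOT)
                .replaceAll("\\p{Punct}", "") // removes ASCII punctuation only
                .replaceAll("\\s+", " ")      // collapses runs of whitespace
                .trim();
    }

    // Keeps the first paragraph seen for each normalized form, in original order.
    static List<String> removeDuplicates(List<String> paragraphs) {
        Set<String> seen = new HashSet<>();
        List<String> unique = new ArrayList<>();
        for (String p : paragraphs) {
            // add() returns false when the normalized form is already present
            if (seen.add(normalize(p))) {
                unique.add(p);
            }
        }
        return unique;
    }
}
```

Note that removing punctuation also joins hyphenated words ("well-known" becomes "wellknown"), which may or may not be what you want.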


唯我独甜

#3 · 2019-06-20 07:27

There's no need to calculate the MD5 hash; just use a HashSet and put the strings themselves into it. The set uses String#hashCode() to compute a hash value for each String and, when the hash matches, String#equals() to check whether the String is already in the set.

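import java.util.LinkedHashSet;
import java.util.Set;

// Returns the distinct paragraphs in first-seen order.
public Set<String> removeDuplicates(String[] paragraphs) {
    Set<String> set = new LinkedHashSet<>();
    for (String p : paragraphs) {
        set.add(p); // add() leaves the set unchanged if p is already present
    }
    return set;
}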

Using a LinkedHashSet also preserves the original order of the paragraphs.


戒情不戒烟

#4 · 2019-06-20 07:31

I think this is a good way. However, there are some things to keep in mind:

Please note that calculating a hash for every paragraph has a cost. This could slow your program down if you had to repeat it for millions of paragraphs.
Even with this approach, slightly different paragraphs (with typos, for example) could go undetected. If that matters, you should normalize the paragraphs before calculating the hash (converting to lower case, removing extra spaces, and so on).


来，给爷笑一个

#5 · 2019-06-20 07:33

As others have suggested, you should be aware that minute differences in punctuation, white space, line breaks etc. may render your hashes different for paragraphs that are essentially the same.

Perhaps you should consider a less brittle metric, such as cosine similarity, which is well suited to matching paragraphs.
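To make that concrete, here is a rough sketch of cosine similarity over simple word counts. The tokenization (lower-casing and splitting on anything that is not a letter or digit) and the names CosineSimilarity, termFrequencies and similarity are my own assumptions, not a standard API.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class CosineSimilarity {

    // Counts how often each word occurs; the tokenization rule is an assumption.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Euclidean length of a word-count vector.
    static double norm(Map<String, Integer> freq) {
        double sum = 0.0;
        for (int c : freq.values()) {
            sum += (double) c * c;
        }
        return Math.sqrt(sum);
    }

    // Returns a value in [0, 1]; 1 means the word counts are proportional.
    static double similarity(String a, String b) {
        Map<String, Integer> fa = termFrequencies(a);
        Map<String, Integer> fb = termFrequencies(b);
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : fa.entrySet()) {
            Integer other = fb.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        double na = norm(fa);
        double nb = norm(fb);
        if (na == 0.0 || nb == 0.0) {
            return 0.0; // at least one text has no words
        }
        return dot / (na * nb);
    }
}
```

You would then treat two paragraphs as duplicates when the similarity exceeds some threshold you pick yourself. Bear in mind that this needs a pairwise comparison, so it scales worse than a hash lookup.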

Cheers,


贼婆χ

#6 · 2019-06-20 07:36

If the MD5 hash is not yet in the set, it means the paragraph is unique. But the opposite is not true. So if you find that the hash is already in the set, you could potentially have a non-duplicate with the same hash value. This would be very unlikely, but to be sure you'll have to compare that paragraph with the earlier ones that produced the same hash. String.equals() would do for that.

Moreover, you should consider carefully what you call unique (regarding typos, whitespace, capitals, and so on), but that would be the case with any method.
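Something along these lines is what I mean, as a sketch only: group paragraphs by their MD5 hex digest and fall back to equals() within a group. The names Md5Deduplicator, md5Hex and removeDuplicates are hypothetical, and encoding the text as UTF-8 is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class Md5Deduplicator {

    // Hex-encoded MD5 digest of the paragraph's UTF-8 bytes.
    static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Not expected: every Java platform is required to provide MD5.
            throw new IllegalStateException(e);
        }
    }

    // Keeps the first occurrence of each paragraph, in original order.
    static List<String> removeDuplicates(List<String> paragraphs) {
        Map<String, List<String>> byHash = new HashMap<>();
        List<String> unique = new ArrayList<>();
        for (String p : paragraphs) {
            List<String> bucket = byHash.computeIfAbsent(md5Hex(p), k -> new ArrayList<>());
            // Same hash is not proof of equality, so confirm with equals()
            // (List.contains compares elements using equals()).
            if (!bucket.contains(p)) {
                bucket.add(p);
                unique.add(p);
            }
        }
        return unique;
    }
}
```

This is essentially what a HashSet<String> already does internally with hashCode() and equals(), so the MD5 step only pays off if you need to store or exchange the digests instead of the full text.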
