Clustering Strings Based on Similar Word Sequences

2019-03-01 00:58发布

问题:

I am looking for an efficient way to cluster about 10 million strings into clusters based on the appearance of similar word sequences.

Consider a list of strings like:

the fruit hut number one
the ice cre  am shop number one
jim's taco
ice cream shop in the corner
the ice cream shop
the fruit hut
jim's taco outlet number one
jim's t  aco in the corner
the fruit hut in the corner

After the algorithm runs on them I want them clustered as follows:

the ice cre  am shop number one
ice cream shop in the corner
the ice cream shop

jim's taco
jim's taco outlet number one
jim's t  aco in the corner

the fruit hut
fruit hut number one
the fruit hut in the corner

As it is obvious, the sequences that differentiate the clusters are:

ice cream shop, jim's taco and fruit hut

回答1:

I think you are looking for Near Duplicates Detection, with some unknown threshold you will use to cluster not only "near duplicates" - but also similar enough documents together.

One of the known solutions to it is to use Jaccard-Similarity for getting the difference between two documents.

Jaccard Similarity is basically - get sets of words from each document, let these sets be s1 and s2 - and the jaccard similarity is |s1 [intersection] s2|/|s1 [union] s2|.

Usually when facing near duplicates - the order of words has some importance however. In order to deal with it - when generating the sets s1 and s2 - you actually generate sets of k-shinglings (or k-grams), instead sets of only words.
In your example, with k=2, the sets will be: ice cream shop in the corner

s2 = { the ice, ice cre, cre am, am shop, shop number, number one }
s4 = {ice cream, cream shop, shop in, in the, the corner }
s5 = { the ice, ice cream, cream shop }

s4 [union] s5 = { ice cream, cream shop, shop in, in the, the corner, the ice } 
s4 [intersection] s5 = { ice cream, cream shop }

In the above, the jaccard-similarity will be 2/6.
In your case maybe the ordinary k-shinglings will perform worse than using a single word (1-shingling), but you will have to test these approaches.

This procedure can be scaled nicely to deal very efficiently with huge collections, without checking all pairs and creating huge numbers of sets. More details could be found in these lecture notes (I gave this lecture ~2 years ago, based on the author's notes).

After you are done with this procedure, you basically have a measure d(s1,s2) that measures the distance between every two sentences, and you can use any known clustering algorithm to cluster them.


Disclaimer: used my answer from this thread as basis for this after realizing near duplicates might fit here.



回答2:

Clustering is the wrong tool for you.

For any unsupervised algorithm, the following partitioning will be as good:

the fruit hut number one
the ice cre am shop number one
jim's taco outlet number one

the ice cream shop
the fruit hut
jim's taco

ice cream shop in the corner
jim's t aco in the corner
the fruit hut in the corner

Because to a clustering algorithm "number one" and "in the corner" are also shared phrases. The second cluster are the leftovers.

Use something supervised instead.