String clustering in Python

Posted 2019-07-14 04:48

Question:

I have a list of strings and I want to group them using clustering in Python.

strings = ['String1', 'String2', 'String3', ...]

I want to use Levenshtein distance, so I am using the jellyfish library. Given two strings, I know that their distance can be computed this way:

jellyfish.levenshtein_distance('string1', 'string2')

My problem is that I don't know how to use scipy.cluster.hierarchy to get a Python list of the strings in each cluster. I have also tried the linkage function:

linkage(y[, method, metric])

But I am not able to get the final list of clusters.

Any help?

Answer 1:

After running linkage to perform hierarchical clustering on your distances, use cluster.hierarchy.cut_tree to cut the resulting tree into the number of clusters you want. It returns one cluster label per input string. If you want two clusters:

cluster.hierarchy.cut_tree(linkage_output, 2).ravel()  # .ravel() flattens the (n, 1) result into a 1D array of cluster labels
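To turn those labels into an actual list of clusters, here is a minimal end-to-end sketch. A few things in it are my assumptions rather than anything from the question: the helper name cluster_strings, the sample words, and method="average" for the linkage (other linkage methods may suit your data better). It also falls back to a plain dynamic-programming edit distance if jellyfish is not installed, purely so that the snippet runs on its own. The pairwise distances are passed to linkage as a condensed vector, one entry per pair (i < j), which is the order itertools.combinations produces.

```python
from collections import defaultdict
from itertools import combinations

import numpy as np
from scipy.cluster.hierarchy import cut_tree, linkage

try:
    from jellyfish import levenshtein_distance
except ImportError:
    # Fallback (an assumption, only so the sketch runs without jellyfish):
    # classic dynamic-programming Levenshtein distance.
    def levenshtein_distance(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]


def cluster_strings(strings, n_clusters):
    """Hypothetical helper: group strings into n_clusters lists."""
    # Condensed distance vector: one value per pair (i < j), in the same
    # order scipy uses for condensed distance matrices.
    condensed = np.array(
        [levenshtein_distance(a, b) for a, b in combinations(strings, 2)],
        dtype=float,
    )
    # method="average" is an assumption; pick the linkage that fits your data.
    Z = linkage(condensed, method="average")
    # One integer label per input string, in input order.
    labels = cut_tree(Z, n_clusters=n_clusters).ravel()

    groups = defaultdict(list)
    for s, label in zip(strings, labels):
        groups[label].append(s)
    return list(groups.values())


# Made-up sample data for illustration.
words = ["apple", "appel", "apples", "banana", "bananas", "bananna"]
clusters = cluster_strings(words, 2)
print(clusters)
```

Each inner list in clusters holds the strings that ended up with the same label. If you would rather cut by a distance threshold than by a fixed number of clusters, cut_tree also accepts a height argument.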