Python: String clustering with scikit-learn's

I have been trying to cluster multiple datasets of URLs (around 1 million each), to find the original and the typos of each URL. I decided to use the levenshtein distance as a similarity metric, along with dbscan as the clustering algorithm as k-means algorithms won't work because I do not know the number of clusters.

I am facing some problems using Scikit-learn's implementation of dbscan.

This snippet below works on small datasets in the format I an using, but since it is precomputing the entire distance matrix, that takes O(n^2) space and time and is way too much for my large datasets. I have run this for many hours but it just ends up taking all the memory of my PC.

lev_similarity = -1*np.array([[distance.levenshtein(w1[0],w2[0]) for w1 in words] for w2 in words])
dbscan = sklearn.cluster.DBSCAN(eps = 7, min_samples = 1)
dbscan.fit(lev_similarity)

So I figured I needed some way to compute the similarity on the fly and hence tried this method.

dbscan = sklearn.cluster.DBSCAN(eps = 7, min_samples = 1, metric = distance.levenshtein)
dbscan.fit(words)

But this method ends up giving me an error:

ValueError: could not convert string to float: URL

Which I realize means that its trying to convert the inputs to the similarity function to floats. But I don't want it to do that. As far as I understand, it just needs a function that can take two arguments and return a float value that it can then compare to eps, which the levenshtein distance should do.

I am stuck at this point, as I do not know the implementation details of sklearn's dbscan to find why it is trying to convert it to float, and neither do I have any better idea on how to avoid the O(n^2) matrix computation.

Please let me know if there is any better or faster way to cluster these many strings that I may have overlooked.

标签： python machine-learning scikit-learn cluster-analysis levenshtein-distance

2条回答

够拽才男人

2楼-- · 2020-06-03 04:57

Try ELKI instead of sklearn.

It is the only tool I know that allows index accelerated DBSCAN with any metric.

It includes Levenshtein distance. You need to add an index to your database with -db.index. I always use the cover tree index (you need to choose the same distance for the index and for the algorithm, of course!)

You could use "pyfunc" distances and ball trees in sklearn, but performance was really bad because of the interpreter. Also, DBSCAN in sklearn is much more memory intensive.

0人赞添加讨论(0) 举报

啃猪蹄的小仙女

3楼-- · 2020-06-03 05:08

From the scikit-learn faq you can do this by making a custom metric:

from leven import levenshtein       
import numpy as np
from sklearn.cluster import dbscan
data = ["ACCTCCTAGAAG", "ACCTACTAGAAGTT", "GAATATTAGGCCGA"]
def lev_metric(x, y):
    i, j = int(x[0]), int(y[0])     # extract indices
    return levenshtein(data[i], data[j])

X = np.arange(len(data)).reshape(-1, 1)
dbscan(X, metric=lev_metric, eps=5, min_samples=2)

0人赞添加讨论(0) 举报

Python: String clustering with scikit-learn's

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间