I have been trying to cluster multiple datasets of URLs (around 1 million each) to find, for each URL, the original and its typo variants. I decided to use Levenshtein distance as the distance metric, with DBSCAN as the clustering algorithm, since k-means won't work because I do not know the number of clusters.
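For a sense of scale (a toy example with made-up URLs, not my actual data), a simple transposition typo is only two edits away from the original, so typo variants should fall well inside the eps = 7 I use below:

import distance  # the "distance" package, which provides distance.levenshtein

print(distance.levenshtein("http://example.com/page", "http://exmaple.com/page"))  # 2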
I am running into problems with scikit-learn's implementation of DBSCAN.
The snippet below works on small datasets in the format I am using, but since it precomputes the entire distance matrix, it takes O(n^2) space and time, which is far too much for my large datasets. I have let it run for many hours, but it just ends up consuming all the memory on my PC.
import numpy as np
import sklearn.cluster
import distance  # the "distance" package, which provides distance.levenshtein

# O(n^2) pairwise Levenshtein distance matrix
lev_distance = np.array([[distance.levenshtein(w1[0], w2[0]) for w1 in words] for w2 in words])
dbscan = sklearn.cluster.DBSCAN(eps=7, min_samples=1, metric='precomputed')
dbscan.fit(lev_distance)
So I figured I needed some way to compute the distances on the fly, and tried this approach instead:
dbscan = sklearn.cluster.DBSCAN(eps=7, min_samples=1, metric=distance.levenshtein)
dbscan.fit(words)
But this method ends up giving me an error:
ValueError: could not convert string to float: URL
I realize this means it is trying to convert the inputs to the metric function to floats, but I don't want it to do that. As far as I understand, DBSCAN just needs a function that takes two arguments and returns a float value it can compare against eps, which the Levenshtein distance does.
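To show what I mean (a toy numeric example, not my real data), a custom two-argument callable is accepted without complaint as long as the rows are numeric:

import numpy as np
import sklearn.cluster

X = np.array([[1.0], [2.0], [10.0]])
toy = sklearn.cluster.DBSCAN(eps=3, min_samples=1, metric=lambda a, b: abs(a[0] - b[0]))
print(toy.fit(X).labels_)  # [0 0 1]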
I am stuck at this point, as I do not know the implementation details of sklearn's DBSCAN well enough to see exactly where this conversion happens, and I don't have any better idea for avoiding the O(n^2) matrix computation either.
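My best guess (I have not traced the actual code path, so this may be wrong) is that it is sklearn's generic input validation rejecting the string array before my metric is ever called, along the lines of:

import numpy as np
from sklearn.utils import check_array

# raises ValueError: could not convert string to float
check_array(np.array([["http://example.com"], ["http://exmaple.com"]], dtype=object))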
Please let me know if there is a better or faster way to cluster this many strings that I may have overlooked.