
In scikit-learn, can DBSCAN use sparse matrix?

Posted 2020-07-06 05:56

Question:

I got a MemoryError when running scikit-learn's DBSCAN algorithm. My data is a binary matrix of about 20000 × 10000.

(Maybe DBSCAN is not suitable for such a matrix. I'm a beginner in machine learning. I just want a clustering method that doesn't need an initial number of clusters.)

Anyway, I found the sparse matrix and feature extraction documentation for SciPy and scikit-learn.

http://scikit-learn.org/dev/modules/feature_extraction.html http://docs.scipy.org/doc/scipy/reference/sparse.html

But I still have no idea how to use them. DBSCAN's documentation gives no indication that sparse matrices can be used. Are they not allowed?

If anyone knows how to use a sparse matrix with DBSCAN, please tell me. Alternatively, please suggest a more suitable clustering method.

Answer 1:

The scikit implementation of DBSCAN is, unfortunately, very naive. It needs to be rewritten to take indexing (ball trees etc.) into account.

As of now, it apparently insists on computing a complete distance matrix, which wastes a lot of memory.

May I suggest that you just reimplement DBSCAN yourself? It's fairly easy: there is good pseudocode, e.g. on Wikipedia and in the original publication. It should be just a few lines, and you can then easily take advantage of your data representation. For example, if you already have a similarity graph in a sparse representation, it's usually fairly trivial to do a "range query" (i.e. use only the edges that satisfy your distance threshold).
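To make the "range query on a sparse graph" idea concrete, here is a minimal sketch, not the scikit-learn implementation. It assumes the graph is a SciPy sparse matrix whose stored entries are distances between neighbouring points, that missing entries mean "too far apart", and that `min_samples` counts the point itself. The name `dbscan_from_graph` and the toy data are made up for illustration.

```python
import numpy as np
from scipy import sparse

NOISE = -1
UNVISITED = -2


def dbscan_from_graph(graph, eps, min_samples):
    """Label points given a sparse (n, n) matrix of neighbour distances."""
    graph = sparse.csr_matrix(graph)
    n = graph.shape[0]

    def range_query(i):
        # Stored edges of row i with distance <= eps, plus the point itself.
        start, end = graph.indptr[i], graph.indptr[i + 1]
        cols = graph.indices[start:end]
        dists = graph.data[start:end]
        return np.union1d(cols[dists <= eps], [i])

    labels = np.full(n, UNVISITED)
    cluster = 0
    for i in range(n):
        if labels[i] != UNVISITED:
            continue
        neighbours = range_query(i)
        if len(neighbours) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = list(neighbours)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] != UNVISITED:
                continue
            labels[j] = cluster
            j_neighbours = range_query(j)
            if len(j_neighbours) >= min_samples:  # j is a core point
                queue.extend(j_neighbours)
        cluster += 1
    return labels


# Toy 1-D data; the dense difference matrix is only used to build the example graph.
points = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0, 50.0])
diff = np.abs(points[:, None] - points[None, :])
rows, cols = np.nonzero((diff <= 1.5) & (diff > 0))
graph = sparse.csr_matrix((diff[rows, cols], (rows, cols)), shape=(7, 7))

labels = dbscan_from_graph(graph, eps=1.5, min_samples=2)
print(labels.tolist())  # [0, 0, 0, 1, 1, 1, -1]
```

Memory use here is proportional to the number of stored edges rather than to n², which is the point of working from the sparse graph. One caveat of this sketch: exact-zero distances between duplicate points are dropped when the example graph is built.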

There is an issue on the scikit-learn GitHub where they talk about improving the implementation. A user reports that his version using a ball tree is 50x faster (which doesn't surprise me; I've seen similar speedups with indexes before, and the gap will likely become more pronounced as the data set grows).

Update: the DBSCAN version in scikit-learn has received substantial improvements since this answer was written.



Answer 2:

You can pass a distance matrix to DBSCAN, so assuming X is your sample matrix, the following should work:

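from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import euclidean_distances

# Dense (n_samples, n_samples) matrix of pairwise Euclidean distances
D = euclidean_distances(X, X)
db = DBSCAN(metric="precomputed").fit(D)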

However, the matrix D will be even larger than X: n_samples² entries. With sparse matrices, k-means is probably the best option.
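As a rough back-of-the-envelope check for the 20000 × 10000 matrix from the question, the following sketch compares the sizes involved. It assumes dense storage with 8-byte float64 entries; the variable names are made up for illustration.

```python
n_samples, n_features = 20_000, 10_000
bytes_per_float64 = 8  # assumption: dense float64 storage

dense_X_bytes = n_samples * n_features * bytes_per_float64
distance_matrix_bytes = n_samples * n_samples * bytes_per_float64

print(f"dense X:         {dense_X_bytes / 1e9:.1f} GB")          # 1.6 GB
print(f"distance matrix: {distance_matrix_bytes / 1e9:.1f} GB")  # 3.2 GB
```

So the full distance matrix alone would need about twice the memory of a dense copy of X, and far more than a sparse copy.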

(DBSCAN may seem attractive because it doesn't need a pre-determined number of clusters, but it trades that for two parameters that you have to tune: the neighbourhood radius eps and min_samples. It's mostly applicable in settings where the samples are points in space and you know how close those points must be to belong to the same cluster, or when you have a black-box distance metric that scikit-learn doesn't support.)



Answer 3:

Yes, since version 0.16.1. Here's a commit for a test:

https://github.com/scikit-learn/scikit-learn/commit/494b8e574337e510bcb6fd0c941e390371ef1879
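For illustration, a minimal sketch of passing a SciPy CSR matrix straight to DBSCAN, assuming a scikit-learn release that accepts sparse input. The random binary data and the `eps` and `min_samples` values are made up for the example and would need tuning on real data.

```python
import numpy as np
from scipy import sparse
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Hypothetical stand-in for the binary data: 200 samples, 50 features, ~5% ones.
X_dense = (rng.random((200, 50)) < 0.05).astype(np.float64)
X = sparse.csr_matrix(X_dense)

# Fit directly on the sparse matrix; no dense distance matrix is passed in.
db = DBSCAN(eps=1.5, min_samples=5, metric="euclidean").fit(X)
labels = db.labels_  # one label per sample, -1 marks noise
print(labels.shape)
```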



Answer 4:

Sklearn's DBSCAN algorithm doesn't take sparse arrays. However, KMeans and spectral clustering do, so you can try those. More on scikit-learn's clustering methods: http://scikit-learn.org/stable/modules/clustering.html
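A minimal sketch of the KMeans route on sparse input, under the assumption that the data is already in a SciPy CSR matrix. The random binary data and the choice of 4 clusters are placeholders for the example; unlike DBSCAN, `n_clusters` has to be chosen up front.

```python
import numpy as np
from scipy import sparse
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical binary data: 200 samples, 50 features, ~5% ones, stored as CSR.
X = sparse.csr_matrix((rng.random((200, 50)) < 0.05).astype(np.float64))

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
labels = km.labels_            # cluster index per sample
centers = km.cluster_centers_  # dense array, shape (n_clusters, n_features)
print(labels.shape, centers.shape)
```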