I was trying to use the DBSCAN algorithm from the scikit-learn library with the cosine metric, but got stuck on an error.
The line of code is
db = DBSCAN(eps=1, min_samples=2, metric='cosine').fit(X)
where X is a csr_matrix. The error is the following:
Metric 'cosine' not valid for algorithm 'auto',
though the documentation says that it is possible to use this metric.
I tried the options algorithm='kd_tree' and algorithm='ball_tree' but got the same error. However, there is no error if I use the euclidean or, say, the l1 metric.
The matrix X is large, so I can't use a precomputed matrix of pairwise distances.
I use python 2.7.6 and scikit-learn 0.16.1.
My dataset doesn't have a full row of zeros, so cosine metric is well-defined.
The tree indexes in sklearn (kd_tree and ball_tree) probably cannot accelerate the cosine distance; this may change in newer versions.
Try algorithm='brute'.
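As a minimal sketch of that suggestion, the snippet below runs DBSCAN with the cosine metric and algorithm='brute' on a sparse matrix. The toy data and the eps value are made up for illustration and are not from the question.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import DBSCAN

# Hypothetical toy data: rows 0-1 point in one direction, rows 2-3 in another.
X = csr_matrix(np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1],
]))

# With metric='cosine', eps is a cosine distance (1 - cosine similarity).
# The value 0.1 is an arbitrary choice for this toy example.
db = DBSCAN(eps=0.1, min_samples=2, metric='cosine', algorithm='brute').fit(X)
labels = db.labels_
print(labels)
```

Rows pointing in nearly the same direction should end up with the same label, regardless of their length.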
For a list of metrics that your version of sklearn can accelerate, see the supported metrics of the ball tree:
from sklearn.neighbors import BallTree
print(BallTree.valid_metrics)
If you want a normalized distance like the cosine distance, you can also normalize your vectors first and then use the euclidean metric. Notice that for two normalized vectors u and v, the euclidean distance is equal to sqrt(2 - 2*cos(u, v)), where cos(u, v) is the cosine of the angle between them (see this discussion).
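The identity can be checked numerically. This is a small sketch with two random vectors (the seed and dimension are arbitrary), comparing the euclidean distance of the normalized vectors against sqrt(2 - 2*cos(u, v)).

```python
import numpy as np

rng = np.random.RandomState(0)  # arbitrary seed, for reproducibility only
u = rng.rand(5)
v = rng.rand(5)

# Normalize both vectors to unit length.
u = u / np.linalg.norm(u)
v = v / np.linalg.norm(v)

cos_uv = float(np.dot(u, v))              # cosine similarity of unit vectors
euclid = float(np.linalg.norm(u - v))     # euclidean distance
expected = float(np.sqrt(2 - 2 * cos_uv)) # value predicted by the identity

print(euclid, expected)
```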
You can hence do something like:
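import numpy as np
from sklearn.cluster import DBSCAN
# Assumes X is a dense array; for a sparse csr_matrix, sklearn.preprocessing.normalize(X) can be used instead.
Xnorm = np.linalg.norm(X, axis=1)
Xnormed = X / Xnorm[:, np.newaxis]
db = DBSCAN(eps=0.5, min_samples=2, metric='euclidean').fit(Xnormed)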
The euclidean distances between normalized vectors lie in [0, 2], so make sure you adjust your parameters (eps in particular) accordingly.
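For sparse input such as the csr_matrix from the question, one possible variant is sketched below. It assumes sklearn.preprocessing.normalize for the row normalization (it accepts sparse matrices and keeps them sparse) and converts a cosine-distance threshold d into the equivalent euclidean one via sqrt(2*d), which follows from the identity above. The toy data and the threshold are made up for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import normalize

# Hypothetical toy data, same shape of problem as a large sparse matrix.
X = csr_matrix(np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1],
]))

# L2-normalize each row; the result is still a sparse matrix.
Xnormed = normalize(X, norm='l2', axis=1)

# Translate a cosine-distance threshold into a euclidean one:
# ||u - v|| = sqrt(2 - 2*cos) = sqrt(2 * cosine_distance) for unit vectors.
eps_cosine = 0.1  # arbitrary threshold for this example
eps_euclid = np.sqrt(2 * eps_cosine)

db = DBSCAN(eps=eps_euclid, min_samples=2, metric='euclidean').fit(Xnormed)
labels = db.labels_
print(labels)
```

Because the conversion is monotonic, this should give the same neighborhoods as the cosine metric with eps_cosine.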