Using the class sklearn.cluster.SpectralClustering

Posted 2019-02-10 01:09

Question:

I'm having trouble understanding a specific use case of the sklearn.cluster.SpectralClustering class as outlined in the official documentation here. Say I want to use my own affinity matrix to perform clustering. I first instantiate an object of class SpectralClustering as follows:

from sklearn.cluster import SpectralClustering

cl = SpectralClustering(n_clusters=5, affinity='precomputed')

The documentation for the affinity parameter above is as follows:

affinity : string, array-like or callable, default ‘rbf’

If a string, this may be one of ‘nearest_neighbors’, ‘precomputed’, ‘rbf’ or one of the kernels supported by sklearn.metrics.pairwise_kernels. Only kernels that produce similarity scores (non-negative values that increase with similarity) should be used. This property is not checked by the clustering algorithm.

Now the object cl has a method fit for which the documentation about its sole parameter X is as follows:

X : array-like or sparse matrix, shape (n_samples, n_features)

OR, if affinity==precomputed, a precomputed affinity matrix of shape (n_samples, n_samples)

This is where it gets confusing. I am using my own affinity matrix, where a measure of 0 means two points are identical, with a higher number meaning two points are more dissimilar. However, the other choices for the parameter affinity actually take a data set and produce a similarity matrix, for which higher values are indicative of more similarity, and lower values indicate dissimilarity (such as the radial basis kernel).

So when using the fit method on my instance of SpectralClustering, do I actually need to transform my affinity matrix into a similarity matrix before passing it to the fit method call as the parameter X? The same documentation page makes a note on transforming distances into well-behaved similarities, but does not explicitly indicate where this step should be carried out, or via which method call.

Answer 1:

Straight from the docs:

If you have an affinity matrix, such as a distance matrix, for which 0 means identical elements, and high values means very dissimilar elements, it can be transformed in a similarity matrix that is well suited for the algorithm by applying the Gaussian (RBF, heat) kernel:

np.exp(- X ** 2 / (2. * delta ** 2))

This goes in your own code, and the result of this can be passed to fit. For the purpose of this algorithm, affinity means similarity, not distance.
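To make the order of operations concrete, here is a minimal sketch of the whole flow: build a distance matrix, convert it with the Gaussian kernel from the docs, then pass the result to the estimator. The toy data, the variable names and the bandwidth delta = 2.0 are assumptions chosen for illustration, not values from the question; delta in particular usually needs tuning for real data.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import pairwise_distances

# Toy data (illustrative assumption): two well-separated 2-D blobs.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(20, 2)),
])

# Distance matrix: 0 means identical, larger means more dissimilar.
distances = pairwise_distances(points, metric="euclidean")

# Gaussian (RBF) kernel: maps distances to similarities in (0, 1].
# delta is a hypothetical bandwidth; tune it for your own data.
delta = 2.0
similarities = np.exp(-distances ** 2 / (2.0 * delta ** 2))

# With affinity='precomputed', X must be the (n_samples, n_samples)
# similarity matrix, not the raw distances.
cl = SpectralClustering(n_clusters=2, affinity="precomputed", random_state=0)
labels = cl.fit_predict(similarities)
```

Passing the raw distances instead would not raise an error, since this property is not checked, but the clusters would be built from the wrong notion of closeness.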