I'm having trouble understanding a specific use case of the sklearn.cluster.SpectralClustering class as outlined in the official documentation here. Say I want to use my own affinity matrix to perform clustering. I first instantiate an object of class SpectralClustering as follows:
from sklearn.cluster import SpectralClustering
cl = SpectralClustering(n_clusters=5, affinity='precomputed')
The documentation for the affinity parameter above is as follows:
affinity : string, array-like or callable, default ‘rbf’
If a string, this may be one of ‘nearest_neighbors’, ‘precomputed’, ‘rbf’ or one of the kernels supported by sklearn.metrics.pairwise_kernels. Only kernels that produce similarity scores (non-negative values that increase with similarity) should be used. This property is not checked by the clustering algorithm.
Now the object cl has a method fit, for which the documentation about its sole parameter X is as follows:
X : array-like or sparse matrix, shape (n_samples, n_features)
OR, if affinity==precomputed, a precomputed affinity matrix of shape (n_samples, n_samples)
This is where it gets confusing. I am using my own affinity matrix, where a value of 0 means two points are identical and a higher value means two points are more dissimilar. However, the other choices for the parameter affinity actually take a data set and produce a similarity matrix, in which higher values indicate more similarity and lower values indicate dissimilarity (as with the radial basis function kernel).
So when using the fit method on my instance of SpectralClustering, do I actually need to transform my affinity matrix into a similarity matrix before passing it to the fit method call as the parameter X? The same documentation page makes a note on transforming distances to well-behaved similarities, but does not explicitly indicate where this step should be carried out, or via which method call.
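For concreteness, here is a minimal sketch of the transformation I have in mind, assuming the Gaussian kernel form mentioned in that documentation note is applied by hand before calling fit. The toy data, the variable names, and the choice of delta as the standard deviation of the distances are my own assumptions, not something the documentation prescribes.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
# Hypothetical toy data: two well-separated blobs in 2-D.
points = np.vstack([
    rng.normal(0.0, 0.3, size=(20, 2)),
    rng.normal(5.0, 0.3, size=(20, 2)),
])

# Pairwise Euclidean distances: 0 means identical, larger means more dissimilar.
diff = points[:, None, :] - points[None, :, :]
distances = np.sqrt((diff ** 2).sum(axis=-1))

# Gaussian transform from distance to similarity.
# delta is a free width parameter; using the std of the distances is an assumption.
delta = distances.std()
similarity = np.exp(-distances ** 2 / (2.0 * delta ** 2))

# Pass the similarity matrix (not the raw distances) as X.
cl = SpectralClustering(n_clusters=2, affinity='precomputed', random_state=0)
labels = cl.fit_predict(similarity)
```

If this is the intended usage, the conversion happens entirely outside of SpectralClustering, and the estimator only ever sees the (n_samples, n_samples) similarity matrix.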