Clustering given pairwise distances with unknown c

Posted 2020-02-17 08:07

I have a set of objects {obj1, obj2, obj3, ..., objn} and have calculated the distance between every possible pair. The distances are stored in an n*n matrix M, with Mij being the distance between obji and objj, so M is symmetric.

Now I wish to perform unsupervised clustering on these objects. After some searching, I found that Spectral Clustering may be a good candidate, since it handles such pairwise-distance cases.

However, after carefully reading its description, I found it unsuitable for my case, as it requires the number of clusters as input. I don't know the number of clusters before clustering; the algorithm has to determine it while clustering, as DBSCAN does.

Given this, please suggest some clustering methods that fit my case, where

  1. The pairwise distances are all available.
  2. The number of clusters is unknown.

7 answers
时光不老,我们不散
#2 · 2020-02-17 08:41

You can try hierarchical clustering. It comes in two types:

  • Agglomerative or "bottom up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
  • Divisive or "top down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
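
The agglomerative variant can be run directly on a precomputed distance matrix, and cutting the resulting tree at a distance threshold avoids fixing the number of clusters in advance. The following is a minimal sketch, assuming SciPy is available; the function name cluster_by_distance_threshold, the toy data, the average linkage, and the threshold value are illustrative assumptions, not part of the original answer.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def cluster_by_distance_threshold(M, threshold, method="average"):
    """Agglomerative clustering on a precomputed symmetric distance matrix M."""
    # linkage expects the condensed (upper-triangle) form of the matrix
    condensed = squareform(M, checks=True)
    Z = linkage(condensed, method=method)
    # cut the tree at `threshold`; the number of clusters follows from the cut
    return fcluster(Z, t=threshold, criterion="distance")


# toy data (illustrative only): two well-separated groups on a line
points = np.array([0.0, 0.1, 0.2, 5.0, 5.1])
M = np.abs(points[:, None] - points[None, :])
labels = cluster_by_distance_threshold(M, threshold=1.0)
```

The threshold plays the role that the cluster count plays elsewhere: it has to be chosen to match the scale of the distances, for example by inspecting the dendrogram.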
Evening l夕情丶
#3 · 2020-02-17 08:44

It's easy to do with the metric='precomputed' argument that several sklearn clustering algorithms (DBSCAN, for example) accept. You fit the model with the pairwise distance matrix rather than the original features.

The idea is as follows (including the case where you also need to create the pairwise distance matrix yourself):

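import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances

def my_metric(x, y):
    # implement your distance measure between x and y;
    # Euclidean distance is used here only as a placeholder
    return np.linalg.norm(x - y)

def create_pairwise_dist(X_data):
    # create a matrix of pairwise distances between all elements in X_data,
    # here with sklearn.metrics.pairwise_distances;
    # scipy.spatial.distance.pdist or your own code would also work
    return pairwise_distances(X_data, metric=my_metric)

# placeholder data: replace with your own data matrix of features
X_data = np.random.rand(100, 3)
X_dist = create_pairwise_dist(X_data)

# then you can use DBSCAN; eps must be tuned to the scale of your distances

dbscan = DBSCAN(eps=1.3, metric='precomputed')
dbscan.fit(X_dist)
labels = dbscan.labels_  # one cluster label per object, -1 marks noise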
一夜七次
#4 · 2020-02-17 08:52

Have you considered Correlation Clustering?
If you carefully read section 2.1 of that paper, you'll see a probabilistic interpretation of the recovered number of clusters.

The only modification your M matrix needs is a threshold that decides which distances count as "same" and which are too large and should count as "not-same".

Section 7.2 of the aforementioned paper deals with clustering a full matrix, where recovering the underlying number of clusters is an important part of the task at hand.
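
To make the thresholding step concrete, here is a rough sketch that turns M into "same" / "not-same" relations and then groups the objects. Note that the grouping step is a simple greedy pivot heuristic (often called KwikCluster) used as a stand-in; it is not the algorithm from the paper mentioned above. The function name, the seed parameter, the toy data, and the threshold value are all assumptions made for illustration.

```python
import numpy as np


def pivot_correlation_clustering(M, threshold, seed=0):
    """Threshold distances into same / not-same, then cluster greedily by pivots."""
    n = M.shape[0]
    same = M <= threshold            # True where a pair is judged "same"
    rng = np.random.default_rng(seed)
    labels = np.full(n, -1)          # -1 means not yet assigned
    next_label = 0
    for pivot in rng.permutation(n):
        if labels[pivot] != -1:
            continue
        # the pivot takes every still-unassigned object it is "same" with
        members = (labels == -1) & same[pivot]
        members[pivot] = True
        labels[members] = next_label
        next_label += 1
    return labels


# toy data (illustrative only): two well-separated groups on a line
points = np.array([0.0, 0.1, 0.2, 5.0, 5.1])
M = np.abs(points[:, None] - points[None, :])
labels = pivot_correlation_clustering(M, threshold=1.0)
```

The number of clusters is never passed in; it emerges from how the "same" and "not-same" relations partition the objects.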

贼婆χ
#5 · 2020-02-17 08:53

You can try multidimensional scaling (MDS). Once MDS has converted the distance data into points in a geometric space, you can apply common clustering methods (like k-means) to those points.
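
As a rough illustration of this pipeline, the sketch below embeds the distance matrix with classical (Torgerson) MDS written in plain NumPy, then runs k-means on the embedding. Because k-means still needs a cluster count, the sketch tries several candidates and keeps the one with the best silhouette score; that selection rule, the function names, the candidate range, and the toy data are assumptions added here, not part of the original answer.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def classical_mds(M, n_components=2):
    """Embed a distance matrix as points via classical (Torgerson) MDS."""
    n = M.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (M ** 2) @ J              # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)     # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:n_components]
    # negative eigenvalues (non-Euclidean distances) are clipped to zero
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale


def kmeans_with_silhouette(X, k_candidates):
    """Run k-means for each candidate k and keep the best silhouette score."""
    best = None
    for k in k_candidates:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        score = silhouette_score(X, labels)
        if best is None or score > best[0]:
            best = (score, k, labels)
    return best[1], best[2]


# toy data (illustrative only): three separated groups in the plane
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + 0.5 * rng.standard_normal((20, 2)) for c in centers])
M = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

X = classical_mds(M, n_components=2)
best_k, labels = kmeans_with_silhouette(X, range(2, 7))
```

Any clustering method that works on points can replace k-means here, including ones that do not need a cluster count at all.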

三岁会撩人
#6 · 2020-02-17 08:57

There are many possible clustering methods, and none of them can be considered "best". As always, everything depends on the data.

我只想做你的唯一
#7 · 2020-02-17 09:01

Another approach that nobody has suggested thus far, if you like probabilistic clustering, is Bayesian non-parametrics (Dirichlet process priors being the simplest case). You can use a multinomial likelihood for count-type data, or a multivariate Gaussian likelihood if your data are continuous.
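
For the continuous Gaussian case, one readily available approximation is scikit-learn's BayesianGaussianMixture with a Dirichlet process prior, sketched below. It works on feature vectors rather than on a distance matrix, so with only pairwise distances you would first need coordinates (an assumption of this sketch). The toy data and parameter values are illustrative; n_components acts only as an upper bound on the number of clusters.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# toy data (illustrative only): three separated groups in the plane
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + 0.5 * rng.standard_normal((50, 2)) for c in centers])

# n_components is a truncation level (upper bound), not the final cluster count
dpgmm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    max_iter=500,
    random_state=0,
)
labels = dpgmm.fit_predict(X)

# components that received no points are effectively switched off
n_used = len(np.unique(labels))
```

How many components end up in use depends on the data and on the concentration prior, so n_used should be read as an estimate rather than a guaranteed answer.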
