Use Absolute Pearson Correlation as Distance in K-means


Question:

I need to do some clustering using a correlation distance, but instead of the built-in 'Distance' option 'correlation', which is defined as d = 1 - r, I need the absolute Pearson distance, d = 1 - |r|. In my application, anti-correlated data should get the same cluster ID. Right now, when using the kmeans() function, I am getting centroids that are highly anti-correlated, which I would like to avoid by combining them. I am not that fluent in MATLAB yet and have some problems reading the kmeans function. Would it be possible to edit it for my purpose?

Example:

Rows 1 and 2 should get the same cluster ID when using this correlation distance as the metric.
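In code, this is roughly what I mean by the distance (a sketch with illustrative data; corr, pdist, and squareform are from the Statistics Toolbox):

% d = 1 - |r|, written as a pdist-style distance:
% ZI is 1-by-n, ZJ is m2-by-n, the result must be m2-by-1.
absCorrDist = @(ZI, ZJ) 1 - abs(corr(ZI', ZJ'))';

X = [ 1  2  3  4  5;      % row 1
     -1 -2 -3 -4 -5];     % row 2, perfectly anti-correlated with row 1
D = squareform(pdist(X, absCorrDist))
% D(1,2) is 0, i.e. rows 1 and 2 are identical under this metric,
% whereas the built-in 'correlation' distance would put them at d = 2.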

I made some attempts to edit the built-in MATLAB function (open kmeans -> line 775), but what is weird is that when I change the distance function, I get a valid distance matrix but wrong cluster indices, and I can't find the reason for it. Would love to get some tips! All best!

Answer 1:

This is a good example of why you should not use k-means with other distance functions.

k-means does not minimize distances. It minimizes the sum of squared 1-dimensional deviations (SSQ).

This is mathematically equivalent to minimizing squared Euclidean distance, so it does minimize Euclidean distances as a side effect. It does not minimize arbitrary other distances that are not equivalent to variance minimization.
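Written out (standard notation, added here only to make the two quantities explicit), the k-means objective is

SSQ = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert_2^2,
\qquad
\mu_j = \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i

and both the assignment step and the mean step must decrease this same SSQ for the algorithm to work.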

In your case, it's pretty nice to see why it fails; I have to remember this as a demo case.

As you may know, k-means (Lloyd's algorithm, that is) consists of two steps: assign each point by minimum squared deviation, then recompute the means.

Now the problem is, recomputing the mean is not consistent with absolute Pearson correlation.

Let's take two of your vectors, which are -1 correlated:

+1 +2 +3 +4 +5
-1 -2 -3 -4 -5

and compute the mean:

 0  0  0  0  0

Boom. They are not at all correlated to their mean. In fact, Pearson correlation is not even well-defined for this vector anymore, because it has zero variance...
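You can check this directly in MATLAB (a small check, assuming the Statistics Toolbox corr function):

a = [ 1  2  3  4  5];
b = [-1 -2 -3 -4 -5];
m = mean([a; b], 1)     % [0 0 0 0 0]
corr(a', m')            % NaN: m has zero variance, so r is undefined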

Why does this happen? Because you misinterpreted k-means as distance-based. It is actually just as much based on the arithmetic mean. The arithmetic mean is a least-squares (!!) estimator: it minimizes the sum of squared deviations. And that is why squared Euclidean distance works: it optimizes the same quantity as recomputing the mean. Optimizing the same objective in both steps makes the algorithm converge.
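If you want to see the least-squares property numerically, here is a tiny sketch (the data vector is arbitrary):

x = [2 3 5 7 11];
ssq   = @(c) sum((x - c).^2);           % sum of squared deviations from center c
cands = linspace(min(x), max(x), 1001); % brute-force candidate centers
[~, k] = min(arrayfun(ssq, cands));
cands(k)   % 5.6
mean(x)    % 5.6 -- the arithmetic mean is the minimizer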

See also this counter-example for the Earth mover's distance, where the mean step of k-means yields suboptimal results (although probably not as bad as with absolute Pearson).

Instead of using k-means, consider using k-medoids aka PAM, which does work for arbitrary distances. Or one of the many other clustering algorithms, including DBSCAN and OPTICS.
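For example, MATLAB's kmedoids (Statistics and Machine Learning Toolbox, available in newer releases) accepts a custom distance handle, so something along these lines should do what you want; treat it as a sketch, the data rows are just illustrative:

absCorrDist = @(ZI, ZJ) 1 - abs(corr(ZI', ZJ'))';   % d = 1 - |r|, pdist-style signature
X = [ 1  2  3  4  5;     % linearly increasing
     -1 -2 -3 -4 -5;     % anti-correlated with row 1
      1 -1  1 -1  1;     % alternating, uncorrelated with rows 1-2
     -1  1 -1  1 -1];    % anti-correlated with row 3
idx = kmedoids(X, 2, 'Distance', absCorrDist)
% Expected: rows 1 and 2 end up in one cluster, rows 3 and 4 in the other.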



Answer 2:

You can try to modify another version of kmeans. This version is also efficient, but much simpler (around 10 lines of code). Here you have the explanation of the code.