Most mutually distant k elements (clustering?)

Posted 2019-04-10 00:36

Question:

I have a simple machine learning question:

I have n (~110) elements, and a matrix of all the pairwise distances. I would like to choose the 10 elements that are farthest apart. That is, I want to

Maximize:
  Choose 10 different elements.
  Return min distance over (all pairings within the 10).
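In code, the quantity I want to maximize looks roughly like this (a minimal sketch, assuming the pairwise distances are in an n-by-n NumPy array `dist`; `subset_min_distance` is just a name I made up for illustration):

    import itertools

    def subset_min_distance(dist, subset):
        """Objective value of a candidate subset: the smallest pairwise
        distance among the chosen elements (indices into `dist`)."""
        return min(dist[i, j] for i, j in itertools.combinations(subset, 2))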

My distance metric is symmetric and respects the triangle inequality.

What kind of algorithm can I use? My first instinct is to do the following:

  1. Cluster the n elements into 20 clusters.
  2. Replace each cluster with just the element of that cluster that is furthest from the mean element of the original n.
  3. Use brute force to solve the problem on the remaining 20 candidates. Luckily, 20 choose 10 is only 184,756.

Edit: thanks to etarion's insightful comment, changed "Return sum of (distances)" to "Return min distance" in the optimization problem statement.
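Here is a rough sketch of the cluster-then-brute-force idea above (my own illustration, not a reference implementation: scikit-learn's AgglomerativeClustering on the precomputed distances is just one way to do step 1, and I read "the mean element" as the medoid, i.e. the element with the smallest total distance to all others, since only pairwise distances are available):

    import itertools
    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    def pick_far_apart(dist, k=10, n_clusters=20):
        # Step 1: cluster on the precomputed distance matrix.
        # (On scikit-learn < 1.2 the keyword is affinity= instead of metric=.)
        labels = AgglomerativeClustering(
            n_clusters=n_clusters, metric="precomputed", linkage="average"
        ).fit_predict(dist)

        # Step 2: reduce each cluster to the member farthest from the "mean
        # element", interpreted here as the medoid (smallest total distance).
        medoid = int(np.argmin(dist.sum(axis=1)))
        candidates = [
            max(np.flatnonzero(labels == c), key=lambda i: dist[i, medoid])
            for c in range(n_clusters)
        ]

        # Step 3: brute force over the C(20, 10) = 184,756 candidate subsets,
        # scoring each by its smallest pairwise distance.
        def score(subset):
            return min(dist[i, j] for i, j in itertools.combinations(subset, 2))

        return max(itertools.combinations(candidates, k), key=score)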

Answer 1:

Here's how you might approach this combinatorial optimization problem by taking the convex relaxation.

Let D be an upper triangular matrix with your distances on the upper triangle. I.e. where i < j, D_i,j is the distance between elements i and j. (Presumably, you'll have zeros on the diagonal, as well.)

Then your objective is to maximize x'*D*x, where x is binary valued with 10 elements set to 1 and the rest to 0. (Setting the ith entry in x to 1 is analogous to selecting the ith element as one of your 10 elements.)

The "standard" convex optimization thing to do with a combinatorial problem like this is to relax the constraints such that x need not be discrete valued. Doing so gives us the following problem:

maximize y'*D*y subject to: 0 <= y_i <= 1 for all i, 1'*y = 10

This is (morally) a quadratic program. (If we replace D with D + D', it'll become a bona fide quadratic program and the y you get out should be no different.) You can use an off-the-shelf QP solver, or just plug it in to the convex optimization solver of your choice (e.g. cvx).
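Concretely, here is one way to set up the relaxed problem, as a minimal sketch using scipy.optimize.minimize with SLSQP as a stand-in for an off-the-shelf solver (my choice, not prescribed above; `solve_relaxation` is a made-up name). Note that SLSQP is a local method, so for a non-concave objective it may only return a local optimum of the relaxation:

    import numpy as np
    from scipy.optimize import minimize

    def solve_relaxation(D, k=10):
        """Locally solve: maximize y'Dy  s.t.  0 <= y_i <= 1,  1'y = k.

        D is the upper triangular distance matrix as an n-by-n NumPy array.
        """
        n = D.shape[0]
        S = D + D.T  # symmetrize: y'(D + D')y = 2 * y'Dy, so the maximizer is unchanged

        objective = lambda y: -(y @ S @ y)   # minimize the negative to maximize
        gradient = lambda y: -2.0 * (S @ y)

        result = minimize(
            objective,
            x0=np.full(n, k / n),                                      # feasible start: sum = k
            jac=gradient,
            bounds=[(0.0, 1.0)] * n,                                   # 0 <= y_i <= 1
            constraints=[{"type": "eq", "fun": lambda y: y.sum() - k}],  # 1'y = k
            method="SLSQP",
        )
        return result.x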

The y you get out need not be (and probably won't be) a binary vector, but you can convert its fractional entries to binary ones in a bunch of ways. (The simplest is probably to let x be 1 in the 10 entries where y_i is highest, but you might need to do something a little more complicated.) In any case, y'*D*y with the y you get out does give you an upper bound for the optimal value of x'*D*x, so if the x you construct from y has x'*D*x very close to y'*D*y, you can be pretty happy with your approximation.
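A small sketch of that simplest rounding scheme (again a helper of my own, assuming `y` came out of the relaxation and `D` is the distance matrix): keep the 10 largest entries of y, then compare x'*D*x against y'*D*y as suggested above.

    import numpy as np

    def round_relaxed_solution(y, D, k=10):
        """Simple rounding: x is 1 at the k largest entries of y, 0 elsewhere."""
        x = np.zeros_like(y)
        x[np.argsort(y)[-k:]] = 1.0

        chosen = np.flatnonzero(x).tolist()  # indices of the k selected elements
        # Compare the rounded value to the relaxed value to gauge how much
        # was lost in rounding, per the argument above.
        return chosen, float(x @ D @ x), float(y @ D @ y)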

Let me know if any of this is unclear, notation or otherwise.



Answer 2:

Nice question.

I'm not sure if it can be solved exactly in an efficient manner, and your clustering-based solution seems reasonable. Another direction to look at would be local search methods such as simulated annealing and hill climbing.
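For example, a bare-bones hill-climbing sketch over the min-distance objective (my own illustration with made-up names; `dist` is the n-by-n distance array): start from a random size-10 subset and keep taking improving single-element swaps until none exist.

    import itertools
    import random

    def hill_climb(dist, k=10, seed=0):
        """Swap-based hill climbing on the objective min over pairs of dist[i, j]."""
        rng = random.Random(seed)
        n = dist.shape[0]

        def score(subset):
            return min(dist[i, j] for i, j in itertools.combinations(subset, 2))

        current = set(rng.sample(range(n), k))
        best = score(current)
        improved = True
        while improved:
            improved = False
            # Try swapping one chosen element out and one unchosen element in.
            for out_el, in_el in itertools.product(list(current), set(range(n)) - current):
                candidate = (current - {out_el}) | {in_el}
                value = score(candidate)
                if value > best:
                    current, best = candidate, value
                    improved = True
                    break  # take the first improving swap, then rescan
        return sorted(current), best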

Here's an obvious baseline I would compare any other solution against:

  1. Repeat 100 times (i.e. until only 10 elements remain):
  2. Greedily remove the datapoint whose removal decreases the objective function the least.
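A minimal sketch of that baseline (same assumptions as before: `dist` is an n-by-n NumPy array and the objective is the minimum pairwise distance among the surviving elements); as written it is roughly O(n^4), which is tolerable for n ~ 110:

    import itertools

    def greedy_elimination(dist, k=10):
        """Drop elements one at a time, each time removing the one whose
        removal hurts the objective (min pairwise distance) the least."""
        def score(subset):
            return min(dist[i, j] for i, j in itertools.combinations(subset, 2))

        remaining = set(range(dist.shape[0]))
        while len(remaining) > k:
            # Removing a point can only keep or raise the min distance, so pick
            # the removal that leaves the largest objective among the survivors.
            to_drop = max(remaining, key=lambda e: score(remaining - {e}))
            remaining.remove(to_drop)
        return sorted(remaining), score(remaining)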