I have a set S of n points in dimension d for which I can calculate all pairwise distances if need be. I need to select k points in this set so that the sum of their pairwise distances is maximal. In other, slightly more mathematical words, I want p1, ..., pk in S such that sum(dist(pi, pj) for 1 <= i < j <= k) is maximal.
I know this question is related to this one (which is basically the same as mine but for k=2) and maybe to this one (with 'farthest' instead of 'closest').
I am not too sure about this, but maybe all the possible solutions have all of their points on the convex hull?
Any reasonable approximation/heuristic is fine.
Virtual bonus point #1 for a solution which works for any function that gives a score out of the k points (one such function could be the square root of the sum of the squared distances).
Virtual bonus point #2 if the solution is easily implemented in python + numpy/scipy.
Your problem seemed similar to the weighted minimum vertex cover problem (which is NP-complete). Thanks to @Gareth Rees for the comments below clarifying that I had misunderstood the relationship between a vertex cover and the set you're looking for. But you may still want to investigate the vertex cover problem and its literature, because your problem might be discussed alongside it; they do share some features.
If you're willing to work with diameter instead of summed graph weight, you could use the approach for the minimal diameter set that you linked in your question. If your current distance measure is called d (the one for which you want the points furthest from each other), then just define d' = 1/d and solve the minimum-distance problem with d'.
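For instance, given a precomputed distance matrix, the inversion might look like this (a minimal sketch; d is assumed to be an (n, n) numpy array of pairwise distances with zeros on the diagonal):

import numpy as np

d = np.asarray(...)                        # your (n, n) pairwise distance matrix
d_prime = np.zeros_like(d)
off_diag = ~np.eye(d.shape[0], dtype=bool)
d_prime[off_diag] = 1.0 / d[off_diag]      # invert off-diagonal entries only,
                                           # leaving the zero diagonal alone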
There might also be a relationship between some form of graph cutting algorithm, like say normalized cut, and the subset you seek. If your distance measure is used as the graph weight or affinity between nodes, you might be able to modify an existing graph cutting objective function to match your objective function (looking for the group of k nodes that have maximum summed weight).
This seems like a combinatorially difficult problem. You might consider something simple like simulated annealing. The proposal function could just choose a point that's currently in the k-subset at random and replace it randomly with a point not currently in the k-subset.
You would need a good cooling schedule for the temperature term and may need to use reheating as a function of cost. But this sort of thing is really simple to program. As long as n is reasonably small, you can then just repeatedly select random k-subsets and anneal towards a k-subset with very large total distance.
This would only give you an approximation, but even deterministic methods will probably only solve this approximately.
Below is a first hack at what the simulated annealing code might be. Note that I'm not making guarantees about this. It could be an inefficient solution if calculating distance is too hard or the problem instance size grows too large. I'm using very naive geometric cooling with a fixed cooling rate, and you may also want to tinker with a fancier proposal than just randomly swapping around nodes.
import numpy as np

all_nodes = np.asarray(...)   # Set of n points
all_dists = np.asarray(...)   # (n, n) matrix of pairwise distances
N = len(all_nodes)
k = 10  # Or however many you want.

def calculate_distance(subset_indices, distances):
    # Sum of distances among a particular subset of nodes, identified
    # by their row/column indices into the distance matrix.
    sub = distances[np.ix_(subset_indices, subset_indices)]
    return sub.sum() / 2.0  # Each pair appears twice in the submatrix.

# Initial random subset of k elements. Track indices rather than the
# points themselves so we can index into the distance matrix.
# (np.random.shuffle shuffles in place and returns None, so use
# np.random.permutation instead.)
shuffled = np.random.permutation(N)
current_subset = shuffled[:k]
current_outsiders = shuffled[k:]

# Simulated annealing parameters.
temp = 100.0
cooling_rate = 0.95
num_iters = 10000

# Simulated annealing loop.
for ii in range(num_iters):
    proposed_subset = current_subset.copy()
    proposed_outsiders = current_outsiders.copy()

    # Propose swapping a random member of the subset with a random outsider.
    index_to_swap = np.random.randint(k)
    outsider_to_swap = np.random.randint(N - k)
    tmp = current_subset[index_to_swap]
    proposed_subset[index_to_swap] = current_outsiders[outsider_to_swap]
    proposed_outsiders[outsider_to_swap] = tmp

    # Metropolis acceptance for a maximization problem: always accept an
    # improvement; accept a worsening move with probability exp(delta/temp).
    current_score = calculate_distance(current_subset, all_dists)
    proposed_score = calculate_distance(proposed_subset, all_dists)
    acceptance_prob = np.exp((proposed_score - current_score) / temp)
    if proposed_score > current_score or acceptance_prob >= np.random.rand():
        current_subset = proposed_subset
        current_outsiders = proposed_outsiders

    temp = cooling_rate * temp

# The selected points are all_nodes[current_subset].
How about this greedy algorithm:
- add to the solution the two points in S with the greatest distance between them
- until you reach a solution of size k, add to the solution the point whose sum of distances to the points already in the solution is greatest (a sketch follows below).
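A minimal numpy sketch of this greedy procedure (the function name and the assumption of a precomputed (n, n) distance matrix dists are mine):

import numpy as np

def greedy_max_dispersion(dists, k):
    # dists is an (n, n) matrix of pairwise distances.
    n = dists.shape[0]
    # Seed with the pair realizing the greatest distance.
    i, j = np.unravel_index(np.argmax(dists), dists.shape)
    chosen = [i, j]
    remaining = set(range(n)) - {i, j}
    while len(chosen) < k:
        # Add the point with the largest summed distance to the chosen set.
        best = max(remaining, key=lambda p: dists[p, chosen].sum())
        chosen.append(best)
        remaining.remove(best)
    return chosen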
Let's call the greatest distance between any two points D. For each point we add after the first two, the sum of its distances to the points already chosen is at least D, by the triangle inequality (its distances to the two endpoints of the initial farthest pair already sum to at least D). So the greedy solution totals at least (k-1)*D. Meanwhile, any solution of k points has k(k-1)/2 pairwise distances, none of them exceeding D, so the optimum is at most (k(k-1)/2)*D. In the worst case, then, the optimum exceeds this greedy solution by a factor of at most k/2.
I'm not sure this is the tightest bound that can be proved for this heuristic.
Step 1: Precompute dist(pi, pj) for all pairs pi,pj
Step 2: Construct a complete graph V={p1,...,pn} with edge weights w_ij = dist(pi, pj)
Step 3: Solve the maximum edge-weighted clique (MEC) problem.
MEC is NP-hard (its decision version is NP-complete), and it is strongly related to quadratic programming. Thus, you might focus on heuristics if n is large (or even of moderate size).
Note that, by precomputing the edge weights, there are no restrictions on the distance function.
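Steps 1 and 2 are straightforward with scipy (a sketch under the assumption that points is an (n, d) array; the example data is mine):

import numpy as np
from scipy.spatial.distance import pdist, squareform

points = np.random.rand(100, 3)   # example data: n = 100 points in d = 3
w = squareform(pdist(points))     # w[i, j] = dist(p_i, p_j), the edge weights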
Here's a working (brute-force) implementation for small n, which if nothing else shows the expressiveness of generator comprehensions:
import itertools

# points is your set S and dist(p, q) is your pairwise distance function.
selection = max(
    itertools.combinations(points, k),
    key=lambda candidate: sum(
        dist(p, q) for p, q in itertools.combinations(candidate, 2)
    )
)
Although this ends up calling dist a lot:

(k! / (2! (k-2)!)) * (n! / (k! (n-k)!)) == n! / (2 (k-2)! (n-k)!)

that is, C(k, 2) * C(n, k) times.