Given a set of xy coordinates, how can I choose n points such that those n points are most distant from each other?
An inefficient method that probably wouldn't do too well with a big dataset would be the following (identify 20 points out of 1000 that are most distant):
xy <- cbind(rnorm(1000),rnorm(1000))
n <- 20
bestavg <- 0
bestSet <- NA
for (i in 1:1000){
subset <- xy[sample(1:nrow(xy),n),]
avg <- mean(dist(subset))
if (avg > bestavg) {
bestavg <- avg
bestSet <- subset
}
}
This code, based on Pascal's code, drops the point that has the largest row sum in the distance matrix.
Run on a Gaussian cloud, where
m1
is @pascal's function:See who did best by looking at the sum of the interpoint distances:
Method 2 wins! And compare with a random sample of 20 points:
which does pretty poorly as expected.
So what's going on? Let's plot:
Method 1 generates the red points, which are evenly spaced out over the space. Method 2 creates the blue points, which almost define the perimeter. I suspect the reason for this is easy to work out (and even easier in one dimension...).
Using a bimodal pattern of initial points also illustrates this:
and again method 2 produces much larger total sum distance than method 1, but both do better than random sampling:
Following @Spacedman's suggestion, I have written a function that drops a point from the closest pair, until the desired number of points remains. It seems to work well, however, it slows down pretty quickly as you add points.