The question is:
Given N points (in 2D) with x and y coordinates, find a point P (among the N given points) such that the sum of distances from the other (N-1) points to P is minimum.
This point is commonly known as the geometric median. Is there any efficient algorithm to solve this problem, other than the naive O(N^2) one?
As mentioned earlier, the type of algorithm to use depends on the way you measure distance. Since your question does not specify this measure, here are C implementations for both the Manhattan distance and the squared Euclidean distance. Use dim = 2 for 2D points. Complexity: O(n log n).

Manhattan distance
Short explanation: We can sum the distance per dimension, 2 in your case. Say we have N points and the values in one dimension are v_0, .., v_(N-1), with T = v_0 + .. + v_(N-1). Then for each value v_i we have S_i = v_0 + .. + v_(i-1). Now we can express the Manhattan distance for this value by summing the distances on the left side, i * v_i - S_i, and on the right side, T - S_i - (N - i) * v_i, which results in (2 * i - N) * v_i - 2 * S_i + T. Adding T to all elements does not change the order, so we leave that out, and S_i can be computed on the fly.
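A minimal C++ sketch of this scoring (not the answer's original C listing; identifiers are illustrative):

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Score each point with the sum over dimensions of (2*i - N) * v_i - 2 * S_i,
// where i is the point's rank in that dimension; the constant T is dropped
// since it does not affect the ordering.
int manhattanMedianIndex(const std::vector<std::vector<long long>>& pts) {
    int n = (int)pts.size(), dim = (int)pts[0].size();
    std::vector<long long> score(n, 0);
    for (int d = 0; d < dim; ++d) {
        std::vector<std::pair<long long, int>> v(n);   // (value, original index)
        for (int i = 0; i < n; ++i) v[i] = {pts[i][d], i};
        std::sort(v.begin(), v.end());
        long long S = 0;                               // S_i, built on the fly
        for (int i = 0; i < n; ++i) {
            score[v[i].second] += (2LL * i - n) * v[i].first - 2 * S;
            S += v[i].first;
        }
    }
    return (int)(std::min_element(score.begin(), score.end()) - score.begin());
}
```

Sorting dominates, so this runs in O(dim * n log n); with dim = 2 that matches the stated O(n log n).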
Squared Euclidean distance
Shorter explanation: pretty much the same approach as the previous one, but with a slightly more complicated derivation. With TT = v_0^2 + .. + v_(N-1)^2, the sum of squared distances for v_i is TT + N * v_i^2 - 2 * v_i * T. Again, TT is added to all elements, so it can be left out (a code sketch follows the next paragraph). More explanation on request.

You can also solve the problem as a convex program: although restricting P to the input points makes the problem discrete, the underlying objective, a sum of Euclidean norms, is convex. The convex program can be solved with an iterative method such as L-BFGS. The cost of each iteration is O(N), and usually the number of required iterations is not large. One important point for reducing the number of required iterations is that we know the optimum answer is one of the points in the input, so the optimization can be stopped when its answer becomes close to one of the input points.
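Returning to the squared-Euclidean scoring above, here is a minimal C++ sketch (illustrative names, not the answer's original C code). Note that this variant needs no sorting, so it is O(n) per dimension:

```cpp
#include <vector>

// For each point, sum over dimensions of N * v_i^2 - 2 * v_i * T (the
// constant TT is dropped, as it is the same for every point).
int squaredEuclideanMedianIndex(const std::vector<std::vector<long long>>& pts) {
    int n = (int)pts.size(), dim = (int)pts[0].size();
    std::vector<long long> score(n, 0);
    for (int d = 0; d < dim; ++d) {
        long long T = 0;                              // T = v_0 + .. + v_(N-1)
        for (int i = 0; i < n; ++i) T += pts[i][d];
        for (int i = 0; i < n; ++i)
            score[i] += (long long)n * pts[i][d] * pts[i][d] - 2 * pts[i][d] * T;
    }
    int best = 0;
    for (int i = 1; i < n; ++i)
        if (score[i] < score[best]) best = i;
    return best;
}
```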
It appears that the problem is difficult to solve in better than O(n^2) time when using Euclidean distances. However, the point that minimizes the sum of Manhattan distances to other points, or the point that minimizes the sum of squares of Euclidean distances to other points, can be found in O(n log n) time (assuming that multiplying two numbers is O(1)). Let me shamelessly reuse my solution for Manhattan distances from a recent post; we can follow a similar approach for computing the point that minimizes the sum of squares of Euclidean distances to other points. Let the sorted x-coordinates be x_1, x_2, x_3, ..., x_n. We scan this list from left to right, and for each point x_i we compute:
l_i = sum of distances to all the elements to the left of x_i = (x_i - x_1) + (x_i - x_2) + ... + (x_i - x_{i-1}), and
sl_i = sum of squares of distances to all the elements to the left of x_i = (x_i - x_1)^2 + (x_i - x_2)^2 + ... + (x_i - x_{i-1})^2
Note that, given l_i and sl_i, we can compute l_{i+1} and sl_{i+1} in O(1) time as follows. Let d = x_{i+1} - x_i. Then:

l_{i+1} = l_i + i * d and sl_{i+1} = sl_i + i * d^2 + 2 * l_i * d
Thus we can compute all the l_i and sl_i in linear time by scanning from left to right. Similarly, for every element we can compute r_i (the sum of distances to all elements to the right) and sr_i (the sum of squares of distances to all elements to the right) in linear time. Adding sr_i and sl_i for each i gives the sum of squares of horizontal distances to all the elements, in linear time. Similarly, compute the sum of squares of vertical distances to all the elements.
Then we can scan through the original points array and find the point that minimizes the sum of squares of vertical and horizontal distances as before.
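A minimal sketch of the left-to-right scan (illustrative names; the code is 0-indexed, so k = i + 1 counts the elements to the left of position i + 1):

```cpp
#include <vector>

// Given x sorted ascending, fill sl[i] = sum of squared distances from x[i]
// to every element on its left, maintaining l = sum of plain distances.
std::vector<long long> leftSquaredSums(const std::vector<long long>& x) {
    int n = (int)x.size();
    std::vector<long long> sl(n, 0);
    long long l = 0;
    for (int i = 0; i + 1 < n; ++i) {
        long long d = x[i + 1] - x[i];
        long long k = i + 1;                         // elements left of position i+1
        sl[i + 1] = sl[i] + k * d * d + 2 * l * d;   // sl_{i+1} = sl_i + k*d^2 + 2*l_i*d
        l += k * d;                                  // l_{i+1} = l_i + k*d
    }
    return sl;
}
// A symmetric right-to-left scan yields the sr values; summing sl and sr for
// the x and y coordinates gives each point's total sum of squared distances.
```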
I implemented the Weiszfeld method (I know it's not what you are looking for, but it may help to approximate your point). The complexity is O(N*M/k), where N is the number of points, M is the dimension of the points (in your case 2), and k is the desired error:
https://github.com/j05u3/weiszfeld-implementation
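The repository above holds the actual implementation; purely as an illustration, one 2D Weiszfeld iteration might be sketched like this (handling of points coinciding with the estimate is simplified):

```cpp
#include <cmath>
#include <utility>
#include <vector>

// Each iteration replaces the current estimate with an average of the input
// points weighted by 1/distance; eps controls the stopping accuracy.
std::pair<double, double> weiszfeld(const std::vector<std::pair<double, double>>& pts,
                                    double eps = 1e-7) {
    double x = 0, y = 0;                    // start from the centroid
    for (const auto& p : pts) { x += p.first; y += p.second; }
    x /= pts.size(); y /= pts.size();
    for (int it = 0; it < 1000; ++it) {     // cap the number of iterations
        double nx = 0, ny = 0, w = 0;
        for (const auto& p : pts) {
            double d = std::hypot(x - p.first, y - p.second);
            if (d < eps) continue;          // skip points coinciding with the estimate
            nx += p.first / d; ny += p.second / d; w += 1.0 / d;
        }
        if (w == 0) break;                  // estimate coincides with every point
        nx /= w; ny /= w;
        bool done = std::hypot(nx - x, ny - y) < eps;
        x = nx; y = ny;
        if (done) break;
    }
    return {x, y};
}
```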
I solved something similar for a local online judge once using simulated annealing. That was the official solution as well and the program got AC.
The only difference was that the point I had to find did not have to be one of the N given points. My solution was in C++, with N as large as 50000; the program executed in 0.1 s on a 2 GHz Pentium 4. I then think it's correct to pick the one from your list that is closest to the (x, y) returned by this algorithm.

This algorithm takes advantage of what the Wikipedia paragraph on the geometric median says: the function we are trying to optimize does not have any local minima, so you can greedily find the minimum by iteratively improving an estimate.
Think of this as a sort of binary search. First, you approximate the result; a good starting approximation is the centre of gravity, which my code computes while reading the input. Then, you check whether points adjacent to the current one give you a better solution, where a point is considered adjacent if it is at a distance of step from your current point. If one is better, it is fine to discard your current point, because, as I said, the nature of the function you are trying to minimize means this will not trap you in a local minimum. After this, you halve the step size, just like in binary search, and continue until you have what you consider to be a good enough approximation (controlled by the eps constant). The complexity of the algorithm therefore depends on how accurate you want the result to be.
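The original submission is not shown here, but a minimal sketch of the described search could look like the following (the initial step of 1e5 is an assumption about the coordinate range; eps plays the same role as in the description):

```cpp
#include <cmath>
#include <vector>

struct Pt { double x, y; };

// Start at the centroid, try the four axis-aligned neighbours at distance
// `step`, keep any improvement, and halve the step when none is found.
Pt approxGeometricMedian(const std::vector<Pt>& pts, double eps = 1e-6) {
    auto cost = [&](const Pt& p) {          // sum of Euclidean distances to all points
        double s = 0;
        for (const auto& q : pts) s += std::hypot(p.x - q.x, p.y - q.y);
        return s;
    };
    Pt cur{0, 0};                           // centre of gravity as the first approximation
    for (const auto& p : pts) { cur.x += p.x; cur.y += p.y; }
    cur.x /= pts.size(); cur.y /= pts.size();
    double best = cost(cur), step = 1e5;    // assumed to cover the coordinate range
    const double dx[] = {1, -1, 0, 0}, dy[] = {0, 0, 1, -1};
    while (step > eps) {
        bool improved = false;
        for (int k = 0; k < 4; ++k) {
            Pt nxt{cur.x + dx[k] * step, cur.y + dy[k] * step};
            double c = cost(nxt);
            if (c < best) { best = c; cur = nxt; improved = true; break; }
        }
        if (!improved) step /= 2;           // no adjacent point is better: refine
    }
    return cur;
}
```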
Step 1: Sort the points collection by x-dimension: O(n log n).
Step 2: Calculate the x-distance between each point and all points TO THE LEFT of it.
Step 3: Calculate the x-distance between each point and all points TO THE RIGHT of it.
Step 4: Sum both up and you get the total x-distance from each point to the other N-1 points; repeat Steps 1-4 on the y-coordinates to get the y-distances.
The point with the smallest sum of xDist and yDist is the answer. Total complexity: O(n log n).
Answer in C++
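The answer's code block is not reproduced here; the following is a minimal sketch of Steps 1-4 for one axis (the identifier axisDist is illustrative), built on the reuse idea explained below:

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// For one axis: Step 1 sorts the values, Step 2 scans left to right using
// ldist(i) = ldist(i-1) + i * (v[i] - v[i-1]), Step 3 mirrors it for the
// right side, and Step 4 accumulates both into dist[], indexed by the
// original point order.
std::vector<long long> axisDist(const std::vector<long long>& vals) {
    int n = (int)vals.size();
    std::vector<std::pair<long long, int>> v(n);       // (value, original index)
    for (int i = 0; i < n; ++i) v[i] = {vals[i], i};
    std::sort(v.begin(), v.end());                     // Step 1
    std::vector<long long> dist(n, 0);
    long long run = 0;
    for (int i = 1; i < n; ++i) {                      // Step 2: left distances
        run += (long long)i * (v[i].first - v[i - 1].first);
        dist[v[i].second] += run;
    }
    run = 0;
    for (int i = n - 2; i >= 0; --i) {                 // Step 3: right distances
        run += (long long)(n - 1 - i) * (v[i + 1].first - v[i].first);
        dist[v[i].second] += run;                      // Step 4: sum both sides
    }
    return dist;
}
// Usage: compute xDist = axisDist(xs) and yDist = axisDist(ys); the answer
// is the index i minimizing xDist[i] + yDist[i].
```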
Further explanation:
The idea is to reuse the already computed total distance of the previous point.
Say we have 4 points A, B, C, D sorted by x; the total left distance of D to the others before it is:
AD + BD + CD = (AC + CD) + (BC + CD) + CD = (AC + BC) + 3 * CD
in which (AC + BC) is the total left distance of C to the others before it. We take advantage of this, so we only need to compute ldist(D) = ldist(C) + 3 * CD.