Finding the farthest point in one set from another

Posted 2020-05-28 05:57

Question:

My goal is a more efficient implementation of the algorithm posed in this question.

Consider two sets of points in N-space (3-space for the example case of RGB colorspace; a solution for 1-space or 2-space differs only in the distance calculation). How do you find the point in the first set that is the farthest from its nearest neighbor in the second set?

In a 1-space example, given the sets A:{2,4,6,8} and B:{1,3,5}, the answer would be 8, as 8 is 3 units away from 5 (its nearest neighbor in B) while all other members of A are just 1 unit away from their nearest neighbor in B. edit: 1-space is overly simplified, as sorting is related to distance in a way that it is not in higher dimensions.

The solution in the source question involves a brute force comparison of every point in one set (all R,G,B where 512>=R+G+B>=256 and R%4=0 and G%4=0 and B%4=0) to every point in the other set (colorTable). Ignore, for the sake of this question, that the first set is elaborated programmatically instead of iterated over as a stored list like the second set.
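For reference, the brute-force comparison described above can be sketched as follows. This is a minimal illustration, not the code from the source question; the function name `farthest_from_nearest` and the use of tuples for points are assumptions made here.

```python
import math

def farthest_from_nearest(points_a, points_b):
    """Return (point, distance) for the point in points_a whose nearest
    neighbour in points_b is farthest away.

    Brute force: every point in A is compared with every point in B,
    so the cost is O(len(points_a) * len(points_b)).
    """
    best_point, best_dist = None, -1.0
    for p in points_a:
        # Distance from p to its nearest neighbour in B.
        nearest = min(math.dist(p, q) for q in points_b)
        if nearest > best_dist:
            best_point, best_dist = p, nearest
    return best_point, best_dist

# The 1-space example from the question, written as 1-tuples.
A = [(2,), (4,), (6,), (8,)]
B = [(1,), (3,), (5,)]
result = farthest_from_nearest(A, B)
```

Any faster approach has to return the same pair as this baseline, which makes it a convenient check.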

Answer 1:

First you need to find every element's nearest neighbor in the other set.

To do this efficiently you need a nearest neighbor algorithm. Personally I would implement a kd-tree just because I've done it in the past in my algorithm class and it was fairly straightforward. Another viable alternative is an R-tree.

Do this once for each element in the smaller set. (Take each element of the smaller set in turn and run the algorithm against the larger set to find its nearest neighbor.)

From this you should be able to get a list of nearest neighbors for each element.

While finding the pairs of nearest neighbors, keep them in a sorted data structure which has a fast addition method and a fast getMax method, such as a heap, sorted by Euclidean distance.

Then, once you're done simply ask the heap for the max.

The run time for this breaks down as follows:

N = size of smaller set
M = size of the larger set

  • N * O(log M + 1) for all the kd-tree nearest neighbor checks.
  • N * O(1) for calculating the Euclidean distance before adding it to the heap.
  • N * O(log N) for adding the pairs into the heap.
  • O(1) to get the final answer :D

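So in the end the whole algorithm is O(N*log M): because N <= M, the heap insertions cost no more than the kd-tree queries. This does not count the one-time cost of building the kd-tree over the larger set.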

If you don't care about the order of each pair you can save a bit of time and space by only keeping the max found so far.

*Disclaimer: This all assumes you won't be using an enormously high number of dimensions and that your elements follow a mostly random distribution.
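A minimal sketch of this approach, assuming a hand-rolled kd-tree stored as nested tuples and Python's `heapq` as the heap. All function names here are illustrative, and a library kd-tree could be substituted for `build_kdtree` and `nearest`.

```python
import heapq
import math

def build_kdtree(points, depth=0):
    """Build a kd-tree as nested tuples: (point, axis, left, right)."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid], axis,
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(node, target, best=None):
    """Return (squared_distance, point) for the tree point nearest to target."""
    if node is None:
        return best
    point, axis, left, right = node
    d = sq_dist(point, target)
    if best is None or d < best[0]:
        best = (d, point)
    diff = target[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, target, best)
    # Visit the far side only if the splitting plane is closer than the best hit.
    if diff * diff < best[0]:
        best = nearest(far, target, best)
    return best

def farthest_from_nearest(queries, reference):
    """Return (query_point, its_nearest_reference_point, distance)."""
    tree = build_kdtree(list(reference))
    heap = []  # heapq is a min-heap, so distances are negated to get a max-heap
    for p in queries:
        d2, q = nearest(tree, p)
        heapq.heappush(heap, (-math.sqrt(d2), p, q))
    neg_d, p, q = heap[0]
    return p, q, -neg_d
```

The heap keeps every pair, matching the description above; if only the single farthest point is needed, a running maximum would replace it.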



Answer 2:

The most obvious approach seems to me to be to build a tree structure on one set to allow you to search it relatively quickly. A kd-tree or similar would probably be appropriate for that.

Having done that, you walk over all the points in the other set and use the tree to find their nearest neighbour in the first set, keeping track of the maximum as you go.

It's O(n log n) to build the tree and O(log n) for one search, so the whole thing should run in O(n log n).



Answer 3:

To make things more efficient, consider using a Pigeonhole algorithm - group the points in your reference set (your colorTable) by their location in n-space. This allows you to efficiently find the nearest neighbour without having to iterate all the points.

For example, if you were working in 2-space, divide your plane into a 5 x 5 grid, giving 25 squares, with 25 groups of points.

In 3 space, divide your cube into a 5 x 5 x 5 grid, giving 125 cubes, each with a set of points.

Then, to test point n, find the square/cube/group that contains n and test distance to those points. You only need to test points from neighbouring groups if that group is empty or if point n is closer to the edge than to the nearest neighbour in the group.
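The grouping idea can be sketched as below. This is one possible reading, with assumptions: cells are keyed by integer coordinates, the cell width `cell` is a parameter chosen by the caller (rather than a fixed 5 x 5 split), and the search widens one ring of cells at a time until no unsearched cell could hold a closer point.

```python
import math
from collections import defaultdict
from itertools import count, product

def cell_of(point, cell):
    """Integer grid coordinates of the cell containing point."""
    return tuple(int(c // cell) for c in point)

def build_grid(points, cell):
    """Group the reference points by the grid cell they fall in."""
    grid = defaultdict(list)
    for p in points:
        grid[cell_of(p, cell)].append(p)
    return grid

def nearest_distance(grid, cell, target):
    """Distance from target to its nearest point stored in grid."""
    if not grid:
        raise ValueError("reference set is empty")
    home = cell_of(target, cell)
    best = math.inf
    for ring in count():
        # Visit only the cells exactly `ring` steps from the home cell.
        for offset in product(range(-ring, ring + 1), repeat=len(home)):
            if max(abs(o) for o in offset) != ring:
                continue  # inner cells were visited in an earlier ring
            key = tuple(h + o for h, o in zip(home, offset))
            for p in grid.get(key, ()):
                best = min(best, math.dist(p, target))
        # Every point outside the searched block is at least `margin` away.
        margin = min(min(t - (h - ring) * cell, (h + ring + 1) * cell - t)
                     for t, h in zip(target, home))
        if best <= margin:
            return best

def farthest_from_nearest(queries, reference, cell):
    """Return (distance, point) for the query farthest from its nearest reference."""
    grid = build_grid(reference, cell)
    return max((nearest_distance(grid, cell, p), p) for p in queries)
```

The cell width is a trade-off: small cells mean few points per cell but more empty rings to scan, while large cells approach the brute-force comparison.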



Answer 4:

For each point in set B, find the distance to its nearest neighbor in set A.

To find the distance to each nearest neighbor, you can use a kd-tree as long as the number of dimensions is reasonable, there aren't too many points, and you will be doing many queries - otherwise it will be too expensive to build the tree to be worthwhile.



Answer 5:

Maybe I'm misunderstanding the question, but wouldn't it be easiest to just reverse the sign on all the coordinates in one data set (i.e. multiply one set of coordinates by -1), then find the first nearest neighbour (which would be the farthest neighbour)? You can use your favourite knn algorithm with k=1.



Answer 6:

EDIT: I meant nlog(n) where n is the sum of the sizes of both sets.

In the 1-space case you could do something like this (pseudocode):

Use a structure like this

Struct Item {
    int value
    int setid
}

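(1) Max Distance = 0
(2) Read all the sets into Item structures
(3) Create an Array of pointers to all the Items
(4) Sort the array of pointers by Item->value field of the structure
(5) Walk the array from beginning to end, remembering the most recent Item seen from set B.
    For each Item from set A, record its distance to that Item.
(6) Walk the array from end to beginning doing the same, and keep the smaller of the two
    distances for each Item from set A. (Comparing only adjacent Items is not enough:
    in A:{2,4,6,8}, B:{1,3,5} the Item next to 8 is 6, which is in the same set.)
(7) For each Item from set A, check if its distance is greater than Max Distance;
    if so, set Max Distance to this distance

Return the max distance.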
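A runnable sketch of the 1-space sorted walk, under the assumption that set B is non-empty. It remembers the most recent set-B value in both walking directions, because a member of A can be separated from its nearest B value by other members of A (8 and 6 in the question's example). The function name and the "A"/"B" tags are illustrative.

```python
def farthest_from_nearest_1d(set_a, set_b):
    """Return (value, distance) for the member of set_a that is farthest
    from its nearest member of set_b. Assumes set_b is non-empty."""
    # Tag every value with its set id and sort by value.
    items = sorted([(v, "A") for v in set_a] + [(v, "B") for v in set_b])
    nearest = {}  # index into items -> distance to the nearest B value
    indices = range(len(items))
    for order in (indices, reversed(indices)):
        last_b = None  # most recent B value seen in this direction
        for i in order:
            value, setid = items[i]
            if setid == "B":
                last_b = value
            elif last_b is not None:
                d = abs(value - last_b)
                nearest[i] = min(nearest.get(i, d), d)
    i = max(nearest, key=nearest.get)
    return items[i][0], nearest[i]
```

The sort dominates the cost, giving the O(n log n) bound mentioned in the edit, with n the combined size of both sets.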