Largest distance (diameter) of a set with a distance function

Posted 2019-08-15 06:52

Question:

I have a set of elements with a distance function between elements satisfying the triangle inequality.

I want to find the pair of elements separated by the greatest distance.

Is there any known solution better than trying all pairs?

Answer 1:

If you measure the distance from point a to points b, c and d, and you find that |ab| + |ac| < |ad|, then by the triangle inequality |bc| <= |ab| + |ac| < |ad|, so |bc| is shorter than |ad| and there's no need to measure it. So not all pairs need to be checked to find the longest distance.

A possible algorithm would be:
Start by measuring the distance from point a to all other points and find the point n that is furthest from a. Then, for every pair b,x with |ab| + |ax| < |an|, record |ab| + |ax| as that pair's distance (it is the maximum distance the pair could possibly have).
Do the same for point b, measuring only those distances which haven't yet been set. Check whether you've found a new maximum, and then again record the distance |bc| + |bx| for every pair c,x with |bc| + |bx| < MAX.
Go on doing this for points c, d, ... (a Python sketch of this procedure is given below).
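Here is a minimal Python sketch of that procedure, under the assumption that the points are given as a list and the metric as a callable; the function name diameter_with_pruning and the bookkeeping with an "unresolved" pair set are illustrative choices, not part of the original answer:

    import math

    def diameter_with_pruning(points, dist):
        """Pick each point in turn as an anchor, measure its distance to every
        point it is still paired with, and discard all pairs whose
        triangle-inequality upper bound cannot beat the current maximum.
        Returns (best_pair, best_distance, number_of_measurements)."""
        n = len(points)
        # pairs (i, j) with i < j that are still candidates for the maximum
        unresolved = {(i, j) for i in range(n) for j in range(i + 1, n)}
        best_pair, best = None, -math.inf
        measurements = 0

        for a in range(n):
            d_a = {}  # distances measured from the current anchor a
            for x in range(n):
                if x == a:
                    continue
                pair = (min(a, x), max(a, x))
                if pair not in unresolved:
                    continue
                d = dist(points[a], points[x])
                measurements += 1
                d_a[x] = d
                unresolved.discard(pair)
                if d > best:
                    best, best_pair = d, (points[a], points[x])
            # Prune: |xy| <= |ax| + |ay|, so if that bound does not exceed the
            # current maximum, the pair (x, y) can never become the maximum.
            keys = sorted(d_a)
            for i, x in enumerate(keys):
                for y in keys[i + 1:]:
                    if (x, y) in unresolved and d_a[x] + d_a[y] <= best:
                        unresolved.discard((x, y))
            if not unresolved:
                break
        return best_pair, best, measurements

Note that a pruned pair is never the unique maximum: its distance is bounded by a value that some measured pair already reached, so the returned maximum is still correct.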

In the best-case scenario, you could find the longest distance in a set of N points after just N-1 measurements (if |an| is at least twice as long as every other distance from a, all remaining pairs are ruled out immediately). In the worst case, you would need to measure every single pair (if the shortest distance is more than half of the longest distance, or if you are unlucky in the order in which you run through the points).


If you want to reduce the number of distance measurements to the absolute minimum, then for every unknown distance x,y check every previously stored value |ax| + |ay|, |bx| + |by|, |cx| + |cy|, ... to see whether it is smaller than the current maximum and can therefore be used as a value for |xy|; this reduces the number of measurements substantially (a sketch of such a check is given below).
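A sketch of that tighter check, assuming the distances measured from each processed anchor are kept in a dictionary (anchor_dists is a name introduced here for illustration); it would be consulted just before each measurement in the sketch above, instead of the single per-anchor pruning pass:

    def can_rule_out(x, y, anchor_dists, best):
        """Return True if the pair (x, y) cannot beat the current maximum.
        anchor_dists maps each already-processed anchor to the dict of
        distances measured from it, so |ax| + |ay| is an upper bound on |xy|
        for every such anchor a."""
        for d_a in anchor_dists.values():
            if x in d_a and y in d_a and d_a[x] + d_a[y] <= best:
                return True
        return False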

Running this algorithm on 1000 random points in a square 2D space, which would normally require 499,500 measurements, returns the maximum distance with between 2,000 and 10,000 measurements (or between 0.4% and 2% of the total, with an average around 1%).
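An experiment along those lines might look like the following; it reuses the diameter_with_pruning sketch from above, and the exact counts will vary with the random points (the figures quoted here are the answer author's own results):

    import math
    import random

    random.seed(0)  # any seed; results vary from run to run
    pts = [(random.random(), random.random()) for _ in range(1000)]

    pair, best, m = diameter_with_pruning(pts, math.dist)
    total = 1000 * 999 // 2  # 499,500 pairs in total
    print(f"max distance {best:.4f} found with {m} measurements "
          f"({100 * m / total:.2f}% of {total})")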

This doesn't necessarily mean that the algorithm is much faster in practice than measuring every distance; that depends on how expensive the measuring is compared to the combination of loops, additions and comparisons required to avoid the measurements.


As @mcdowella pointed out, this method becomes less efficient as the number of dimensions of the space increases. The number of points also has a big impact. The table below shows the number of measurements that have to be carried out in relation to the total number of pairs. These are averages from a test with randomly distributed points in a "square" space (i.e. the coordinates in all dimensions are within the same range). As you can see, this method makes most sense for geometrical problems with many points in a 2D or 3D space. However, if your data is highly biased in some way, the results may be different.

       10 points (45 pairs)      100 points (4950 pairs)   1000 points (499500 pairs)
dim    measurem.   % of total    measurem.   % of total    measurem.   % of total

 1      16.6674      37.04         221.17       4.47        4877.97       0.98
 2      22.4645      49.92         346.77       7.01        5346.78       1.07
 3      27.5892      61.31         525.73      10.62        7437.16       1.49
 4      31.9398      70.98         731.83      14.78       12780.02       2.56
 5      35.3313      78.51         989.27      19.99       19457.84       3.90
 6      38.1420      84.76        1260.89      25.47       26360.16       5.28
 7      40.2296      89.40        1565.80      31.63       33221.32       6.65
 8      41.6864      92.64        1859.08      37.56       44073.42       8.82
 9      42.7149      94.92        2168.03      43.80       56374.36      11.29
10      43.4463      96.55        2490.69      50.32       73053.06      14.63
20      44.9789      99.95        4617.41      93.28      289978.20      58.05
30      44.9996      99.999       4936.68      99.73      460056.04      92.10
40                                4949.79      99.99      496893.10      99.48
50                                4949.99      99.9999    499285.80      99.96
60                                                        499499.60      99.9999

As expected, the results of the tests become more predictable in higher dimensions, with only a few percent difference between the extreme runs, whereas in 2D some test cases required 30 times more measurements than others.