可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

My current code is pretty quick, but I need to make it even faster so we can accommodate even more markers. Any suggestions?

Notes:

The code runs fastest when the SQL statement is ordered by marker name - which itself does a very partial job of clustering the markers (the names of markers in the same location are often, but not always similar).
I can't pre-cluster the markers, because they can be dynamically searched and filtered.
I've tried grid-based clustering - but the results often aren't very nice.
I know that the clusters are slightly skewed on a Mercator projection.
I'm not interested in a commercial clustering service.

The code:

$singleMarkers = array();
$clusterMarkers = array();

while (count($markers)) {
    $marker  = array_pop($markers);
    $cluster = array();

    // Compare marker against all remaining markers.
    foreach ($markers as $key => $compareMarker) {
        // This function returns the distance between two markers, at a defined zoom level.
        $pixels = pixelDistance($marker['lat'], $marker['lng'], $compareMarker['lat'], $compareMarker['lng'], $zoomLevel);
        // If two markers are closer than defined distance, remove compareMarker from array and add to cluster.
        if ($pixels < $distance) {
            unset($markers[$key]);
            $cluster[] = $compareMarker;
        }
    }

    // If a marker was added to cluster, also add the marker we were comparing to.
    if (count($cluster) > 0) {
        $cluster[] = $marker;
        $clusterMarkers[] = $cluster;
    } else {
        $singleMarkers[] = $marker;
    }
}

function pixelDistance($lat1, $lon1, $lat2, $lon2, $zoom) {
    $x1 = $lon1*10000000; //This is what I did to compensate for using lat/lon values instead of pixels.
    $y1 = $lat1*10000000;
    $x2 = $lon2*10000000;
    $y2 = $lat2*10000000;

    return sqrt(pow(($x1-$x2),2) + pow(($y1-$y2),2)) >> (21 - $zoom); //21 is the max zoom level
}

UPDATE

Here's the current code:

$singleMarkers = array();
$clusterMarkers = array();

// Minimum distance between markers to be included in a cluster, at diff. zoom levels
$DISTANCE = (10000000 >> $ZOOM) / 100000;

// Loop until all markers have been compared.
while (count($markers)) {
    $marker  = array_pop($markers);
    $cluster = array();

    // Compare against all markers which are left.
    foreach ($markers as $key => $target) {
        $pixels = abs($marker['lat']-$target['lat']) + abs($marker['lng']-$target['lng']);

        // If the two markers are closer than given distance remove target marker from array and add it to cluster.
        if ($pixels < $DISTANCE) {
            unset($markers[$key]);
            $cluster[] = $target;
        }
    }

    // If a marker has been added to cluster, add also the one we were comparing to.
    if (count($cluster) > 0) {
        $cluster[] = $marker;
        $clusterMarkers[] = $cluster;
    } else {
        $singleMarkers[] = $marker;
    }
}

回答1:

Do you actually need to calculate the Euclidean distance? If you are just comparing relative magnitudes of distances, you can probably get away with using the Manhattan distance, which will save you two calls to pow() and one to sqrt():

function pixelDistance($lat1, $lon1, $lat2, $lon2, $zoom) {
    $x1 = $lon1*10000000; //This is what I did to compensate for using lat/lon values instead of pixels.
    $y1 = $lat1*10000000;
    $x2 = $lon2*10000000;
    $y2 = $lat2*10000000;

    return ($x1-$x2) + ($y1-$y2) >> (21 - $zoom);
}

Not sure if you need the >> (21 - $zoom) bit for your calculations, so I left it in. But unless you actually need to use the calculated distance values elsewhere, you can probably get away with just using the latitude/longitude directly (no need to multiply by anything) and taking the Manhattan distance, assuming you pre-calculate $distance to fit in with that measure, which will be a lot cheaper in computational terms than coercing all the distances to fit in with the units and magnitude of $distance.

EDIT: When I was researching this problem last year, I found some useful stuff on Wikipedia - yes, it can happen ;-)

Cluster analysis
Hierarchical clustering

I can also highly recommend the book Programming Collective Intelligence: Building Smart Web 2.0 Applications which goes into clustering in great depth, as applied not only to geographical data but also to other areas of data analysis.

回答2:

Expanding on what John said, I think you should try inlining that function. Function calls in PHP are very slow, so you should get a decent speedup from that.

回答3:

So here's what I did - I added two extra columns to the markers(point) table with the mercator converted values for the latitude and longitude using the following functions:

public static $offset = 268435456;
public static $radius = 85445659.44705395; /* $offset / pi(); */

function LonToX($lon)
{
    return round(self::$offset + self::$radius * $lon * pi() / 180);        
}

function LatToY($lat)
{
    return round(self::$offset - self::$radius * log((1 + sin($lat * pi() / 180)) / (1 - sin($lat * pi() / 180))) / 2);
}

This way I could get accurately placed clusters. I'm still trying to work out how to avoid using array_pop and cycling through every time. So far its pretty efficient with sub-1000 markers. I'll be posting results for +5K and +10K markers later.

Avoiding the pixelDistance function and having it inline increases performance by almost half!

回答4:

It seems like speeding up the pixelDistance() function could be part of your solution, since it runs inside the loop. This is a good place to look first, but you didn't include that code, so I can't be sure.

回答5:

I can see two more possible improvements here:

Can you just loop through $markers with a for loop instead of popping them off the array? The array popping is completely unneeded - you should really only be using arrays as queues if you're adding and removing items to them at the same time (which you aren't; you're just processing them then throwing them away)
You should try calculating the count() of the arrays at the beginning, and then manually increase or decrease a $count variable. Recalculating the size of an array each loop is wasteful.

回答6:

The following are some ideas you can implement in case performance is of great issue:

Reduce the dimensionality of the data: you have 2d data of long/lat, perhaps you can try projecting it down to 1D using something like Multidimensional Scaling (MDS) which tries to reduce the number of dimensions while preserving the distances in the original space, thus the distance function will only have to deal with one feature instead of two. An alternative way is to use Principal component analysis (PCA)
Faster search: the step of computing the distance to each marker can be improved using KD-trees.

Both of these technique are applied in an offline setting, so they are usually computed once and then used many times..

回答7:

A simple optimization would be to take advantage that sqrt(x) < sqrt(y) is true iff x < y, so you can omit the sqrt in the pixelDistance and calculate $distance squared outside the loop. You can also calculate the 21 - $zoomLevel outside the loop, you'll have to multiply it by 2 if you're comparing the squared values. Another small optimization would be to save 2 multiplies by doing $x1-$x2 before scaling by 10000000. And for a tiny bit more, storing the delta in a var and multiplying it by itself is probably faster than the pow function. And for some more you can inline the pixeldistance function. This kind of optimization will only yield a constant factor speedup.

For a larger speedup you'll need some kind of acceleration datastructure. An easy one would be to bin markers into distance sized squares. Then you can run over the bins look for markers to cluster with in only the same bin and 3 others chosen depending on which side of the center of the bin the marker falls. This will result in linear time clustering that will beat any optimizations to the quadratic algorithm for larger resultsets.

回答8:

If you can, sort your markers by longitude on the initial search; then as soon as a marker is more than a marker width from the next marker in the sorted list, you definitely know that the remaining markers will not overlap, so you can then break the foreach loop and save yourself a ton of processing time. I have implemented this on my own site and it works very efficiently.

I have some source code in Python here.