Python - how to speed up calculation of distances

Posted 2019-03-29 12:25

Question:

I have 55249 cities in my database, each with latitude and longitude values. For every city I want to calculate the distances to every other city and store those cities that are no more than 30 km away. Here is my algorithm:

# distance function
from math import sin, cos, sqrt, atan2, radians

def distance(obj1, obj2):
    lat1 = radians(obj1.latitude)
    lon1 = radians(obj1.longitude)
    lat2 = radians(obj2.latitude)
    lon2 = radians(obj2.longitude)
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = (sin(dlat/2))**2 + cos(lat1) * cos(lat2) * (sin(dlon/2))**2
    c = 2 * atan2(sqrt(a), sqrt(1-a))
    return round(6373.0 * c, 2)

def distances():
    cities = City.objects.all()  # I am using Django ORM
    for city in cities:
        closest = list()
        for tested_city in cities:
            dist = distance(city, tested_city)
            if dist <= 30.0 and dist != 0.0:
                closest.append(tested_city)
        city.closest_cities.add(*closest)  # again, Django thing
        city.save()  # Django

This works, but it takes an awfully long time; it would take weeks to complete. Is there any way I could speed it up?

Answer 1:

You can't afford to compute the distance between every pair of cities. Instead, you need to put your cities in a space-partitioning data structure for which you can make fast nearest-neighbour queries. SciPy comes with a kd-tree implementation, scipy.spatial.KDTree, that is suitable for this application.

There are two difficulties here. First, scipy.spatial.KDTree uses Euclidean distance between points, but you want to use the great circle distance along the surface of the Earth. Second, longitude wraps around, so that nearest neighbours might have longitudes that differ by 360°. Both problems can be solved if you take the following approach:

  1. Convert your locations from geodetic coordinates (latitude, longitude) to ECEF (Earth-Centered, Earth-Fixed) coordinates (x, y, z).

  2. Put these ECEF coordinates into the scipy.spatial.KDTree.

  3. Convert your great circle distance (for example, 30 km) into a Euclidean distance.

  4. Call scipy.spatial.KDTree.query_ball_point to get the cities within range.

Here's some example code to illustrate this approach. The function geodetic2ecef is adapted from PySatel by David Parunakian and is licensed under the GPL.

from math import radians, cos, sin, sqrt

# Constants defined by the World Geodetic System 1984 (WGS84)
A = 6378.137
B = 6356.7523142
ESQ = 6.69437999014 * 0.001

def geodetic2ecef(lat, lon, alt=0):
    """Convert geodetic coordinates to ECEF."""
    lat, lon = radians(lat), radians(lon)
    xi = sqrt(1 - ESQ * sin(lat)**2)  # prime vertical radius factor
    x = (A / xi + alt) * cos(lat) * cos(lon)
    y = (A / xi + alt) * cos(lat) * sin(lon)
    z = (A / xi * (1 - ESQ) + alt) * sin(lat)
    return x, y, z

def euclidean_distance(distance):
    """Return the approximate Euclidean distance corresponding to the
    given great circle distance (in km).

    """
    return 2 * A * sin(distance / (2 * B))

Let's make up fifty thousand random city locations and convert them to ECEF coordinates:

>>> from random import uniform
>>> cities = [(uniform(-90, 90), uniform(0, 360)) for _ in range(50000)]
>>> ecef_cities = [geodetic2ecef(lat, lon) for lat, lon in cities]

Put them into a scipy.spatial.KDTree:

>>> import numpy
>>> from scipy.spatial import KDTree
>>> tree = KDTree(numpy.array(ecef_cities))

Find all cities within about 100 km of London:

>>> london = geodetic2ecef(51, 0)
>>> tree.query_ball_point([london], r=euclidean_distance(100))
array([[37810, 15755, 16276]], dtype=object)

This array contains, for each point you queried, an array of the cities within distance r of that point. Each neighbour is given as its index into the original array that you passed to KDTree. So there are three cities within about 100 km of London, namely the cities with indexes 37810, 15755, and 16276 in the original list:

>>> from pprint import pprint
>>> pprint([cities[i] for i in [37810, 15755, 16276]])
[(51.7186871990946, 359.8043453670437),
 (50.82734317063884, 1.1422052710187103),
 (50.95466110717763, 0.8956257749604779)]

Notes:

  1. You can see from the example output that neighbours with longitudes that differ by about 360° are correctly discovered.

  2. The approach seems fast enough. Here we find neighbours within 30 km for the first thousand cities, taking about 5 seconds:

    >>> from timeit import timeit
    >>> timeit(lambda:tree.query_ball_point(ecef_cities[:1000], r=euclidean_distance(30)), number=1)
    5.013611573027447
    

    Extrapolating, we expect to find neighbours within 30 km for all 50,000 cities in about four minutes.

  3. My euclidean_distance function overestimates the Euclidean distance corresponding to a given great circle distance (so as not to miss any cities). This might be good enough for some applications—after all, cities are not point objects—but if you need more accuracy than this, then you could filter the resulting points using, say, one of the great circle distance functions from geopy.
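That filtering step might look like the following sketch, assuming geopy is installed (great_circle accepts (latitude, longitude) pairs in degrees and exposes the result in kilometres as .km; within_limit is a name of my own):

from geopy.distance import great_circle

def within_limit(cities, centre, candidate_indexes, limit_km=30.0):
    """Keep only the KDTree candidates whose true great circle
    distance from centre is within limit_km."""
    return [i for i in candidate_indexes
            if great_circle(centre, cities[i]).km <= limit_km]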



Answer 2:

You can speed up your distance calculation by not entering the complicated trigonometric formulae when you already know that the cities are more than 30 km apart, because their difference in latitude alone corresponds to more than a 30 km arc. An arc of length a = 30 km corresponds to an angle of a/r = 0.00470736, hence:

def distance(obj1, obj2):
    lat1 = radians(obj1.latitude)
    lon1 = radians(obj1.longitude)
    lat2 = radians(obj2.latitude)
    lon2 = radians(obj2.longitude)
    dlon = lon2 - lon1
    dlat = lat2 - lat1

    if abs(dlat) > 0.00471:
        return 32

    a = (sin(dlat/2))**2 + cos(lat1) * cos(lat2) * (sin(dlon/2))**2
    c = 2 * atan2(sqrt(a), sqrt(1-a))
    return round(6373.0 * c, 2)

The radius 32 is just a dummy value indicating that the cities are more than 30 km apart. You should apply similar logic to the longitude difference, for which you have to consider the largest absolute latitude of the pair:

    if cos(lat1) * abs(dlon) > 0.00471 and cos(lat2) * abs(dlon) > 0.00471:
        return 32

If you know that your cities lie in a fixed range of latitudes, you can adjust the constant limit to the worst case. For example, if all your cities are in the contiguous United States, they are below latitude 49°N, and your limit is then 0.00471 / cos(49°) = 0.00718:

    if abs(dlon) > 0.00718:
        return 32

This simpler criterion means that you enter the exact calculation for too many cities in Texas or Florida. You can also chain these criteria: use the approximate limit first, then the exact limit based on the maximum absolute latitude, and finally calculate the exact distance for all remaining candidates.

You can calculate this limit beforehand from your maximum absolute latitude. This heuristic should also help you put cities into buckets of fixed longitude and latitude, as RemcoGerlich suggested; his method should speed up your process considerably by only considering reasonable pairs of cities in the first place.
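Chaining these tests might look like the following sketch; COS_MIN hard-codes the contiguous-US assumption from the example above, so adjust it to your own data set:

from math import sin, cos, sqrt, atan2, radians

LIMIT = 30.0 / 6373.0        # angle subtended by a 30 km arc, ~0.00471
COS_MIN = cos(radians(49))   # smallest cos(latitude) in the data set
                             # (assumes no city lies above 49 degrees)

def distance(obj1, obj2):
    lat1 = radians(obj1.latitude)
    lon1 = radians(obj1.longitude)
    lat2 = radians(obj2.latitude)
    lon2 = radians(obj2.longitude)
    dlat = abs(lat2 - lat1)
    dlon = abs(lon2 - lon1)

    if dlat > LIMIT:
        return 32            # latitude difference alone exceeds 30 km
    if COS_MIN * dlon > LIMIT:
        return 32            # longitude difference exceeds 30 km even at
                             # the most extreme latitude in the data set
    if cos(lat1) * dlon > LIMIT and cos(lat2) * dlon > LIMIT:
        return 32            # longitude difference exceeds 30 km at the
                             # latitudes of this particular pair
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    return round(6373.0 * 2 * atan2(sqrt(a), sqrt(1 - a)), 2)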

Edit: The real lesson here is that no matter how much you speed up the distance calculation itself, the real benefit for large data sets comes from choosing an intelligent search mechanism, like the bucket search or k-d trees other commenters suggested, possibly together with some memoization to weed out duplicate checks.



Answer 3:

I would first create "sectors", each bounded by two latitudes X km apart and two longitudes X km apart. X should be as big as possible, with one restriction: all cities within a sector must be no more than 30 km apart.

The sectors could be stored in an array:

Sector[][] sectors;

Within this array, it is straightforward to identify the sector containing specific coordinates, and it is also easy to identify the sectors adjacent to a given sector.

Then:

(1) Each city is assigned its sector. Each sector has a list of cities lying therein.

(2) For each city, find all cities in its sector. Those immediately meet the 30-km criterion.

(3) For each city C, find all cities C' in all 8 adjacent sectors. For each C', check the distance C-C' and output C-C' if it's < 30 km.

This algorithm is still O(n^2) in the worst case, but it should be much faster in practice, since for each city you only check a small subset of the whole set.
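One caveat: with sectors small enough that same-sector cities always qualify, a neighbour up to 30 km away can lie two sectors over, so the eight adjacent sectors alone can miss it. A common variant, sketched below in Python with cell sizes and helper names of my own, makes each sector at least 30 km on a side instead; the 3x3 block around a city is then guaranteed to contain every neighbour, and each candidate gets the exact distance test.

from collections import defaultdict
from math import cos, radians

# One degree of latitude is roughly 111 km, so a cell of 30/111 degrees of
# latitude is at least 30 km tall. Longitude degrees shrink with latitude,
# so the cell width is chosen for the largest absolute latitude in the data.
LAT_CELL = 30.0 / 111.0

def lon_cell(max_abs_lat_deg):
    """Cell width in degrees of longitude, wide enough at the extreme latitude."""
    return 30.0 / (111.0 * cos(radians(max_abs_lat_deg)))

def build_sectors(cities, lon_cell_deg):
    """Group city indexes by (row, col) sector; cities are (lat, lon) tuples."""
    sectors = defaultdict(list)
    for i, (lat, lon) in enumerate(cities):
        sectors[(int(lat // LAT_CELL), int(lon // lon_cell_deg))].append(i)
    return sectors

def candidates(sectors, lat, lon, lon_cell_deg):
    """Indexes of all cities in the sector containing (lat, lon) and in its
    eight adjacent sectors; each one still needs the exact 30 km test."""
    row, col = int(lat // LAT_CELL), int(lon // lon_cell_deg)
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            yield from sectors.get((row + dr, col + dc), ())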



Answer 4:

  1. Try different algorithms to speed up the individual distance calculation, as others have described.
  2. Iterate over combinations of cities only, so that no pair is calculated twice.
  3. Use the multiprocessing module to distribute the work across multiple cores.

Points 1 and 2 are straightforward. For point 3 I suggest using imap_unordered() to achieve maximum speed, with a workflow similar to this (a code sketch follows at the end of this answer):

  • generate all the combinations of cities
  • in the main process, load all city models in a single datastore call
  • distribute the distance calculations to the worker Pool
  • save the results either in the individual workers or in the main process. I suspect the workers would be better, but I don't know how Django handles concurrency in an offline script (if anyone knows, please add it in the comments so we can integrate it).
  • wherever you save, write big chunks of models at a time, using transactions.

But

You also need to change your models somewhat. For distributed processing you have to decouple the closest_cities variable, since different processes will be changing it. You can keep a dictionary of lists at the main-process level, mapping each city to its closest cities, and then store it on each model, either at the end of the loop or as you go.
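A minimal sketch of points 2 and 3, assuming the cities have first been pulled out of the ORM into plain (id, latitude, longitude) tuples in a single query (pair_distance and find_close_pairs are names of my own):

from collections import defaultdict
from itertools import combinations
from math import atan2, cos, radians, sin, sqrt
from multiprocessing import Pool

def pair_distance(pair):
    """Worker: return (id1, id2) if the two cities are within 30 km."""
    (id1, lat1, lon1), (id2, lat2, lon2) = pair
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2)**2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2)**2
    if 6373.0 * 2 * atan2(sqrt(a), sqrt(1 - a)) <= 30.0:
        return id1, id2
    return None

def find_close_pairs(cities):
    """cities: list of (id, latitude, longitude) tuples from one query."""
    closest = defaultdict(list)
    with Pool() as pool:  # run under `if __name__ == "__main__":`
        hits = pool.imap_unordered(pair_distance, combinations(cities, 2),
                                   chunksize=100000)
        for hit in hits:
            if hit:
                closest[hit[0]].append(hit[1])
                closest[hit[1]].append(hit[0])
    return closest

The resulting dictionary can then be written back to the models in big chunks inside transactions, as suggested above.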



Answer 5:

You're doing a whole lot of unnecessary work.

As others have suggested, you can limit the number of calculations by changing your looping structure. You have:

for city in cities:
    for tested_city in cities:

So not only will you compare every city with itself, you'll also compare city1 with city2 and later you'll compare city2 with city1.

I'm not a Python programmer, so I can't tell you what syntax to use here, but what you want is a nested loop structure similar to:

for (i = 0; i < cities.Length-1; ++i)
{
    for (j = i+1; j < cities.Length; ++j)
    {
        compare_cities(cities[i], cities[j]);
    }
}

That will halve the number of city comparisons you need to do. That reduces it from about 3 billion distance calculations to about 1.5 billion.

Others have also mentioned an early out potential by comparing the dlat and dlong prior to getting into the expensive trig functions.

You can also save some time by converting lat1 and lon1 to radians once, and also computing cos(lat1) once and passing those values to your distance calculation rather than computing them every time. For example:

for (i = 0; i < cities.Length-1; ++i)
{
    lat1 = radians(cities[i].latitude)
    lon1 = radians(cities[i].longitude)
    cos1 = cos(lat1)
    for (j = i+1; j < cities.Length; ++j)
    {
        compare_cities(lat1, lon1, cos1, cities[j]);
    }
}
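In Python, the asker's language, the same structure might look like this sketch (compare_cities stands for whatever per-pair work remains):

from math import cos, radians

for i, city in enumerate(cities):
    # Convert this city's coordinates, and compute its cosine, exactly once.
    lat1 = radians(city.latitude)
    lon1 = radians(city.longitude)
    cos1 = cos(lat1)
    for other in cities[i + 1:]:
        compare_cities(lat1, lon1, cos1, other)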

And you don't really need to convert c to kilometers. For example, you have:

return round(6373.0 * c, 2)

The result of that has to be <= 30.0. Why do the multiplication and rounding? You could just return c, and in your code compare the return value with 0.0047 (which is 30.0/6373).
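Combining these last two ideas, a sketch (central_angle is a name of my own):

from math import atan2, cos, sin, sqrt

LIMIT = 30.0 / 6373.0  # ~0.0047 radians

def central_angle(lat1, lon1, cos1, lat2, lon2):
    """Haversine central angle; lat/lon already in radians and
    cos1 = cos(lat1) precomputed by the caller."""
    a = sin((lat2 - lat1) / 2)**2 + cos1 * cos(lat2) * sin((lon2 - lon1) / 2)**2
    return 2 * atan2(sqrt(a), sqrt(1 - a))

# In the inner loop, compare the angle directly against LIMIT:
#     if 0 < central_angle(lat1, lon1, cos1, lat2, lon2) <= LIMIT:
#         closest.append(tested_city)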