I have 55,249 cities in my database. Every single one has latitude and longitude values. For every city I want to calculate the distances to every other city and store those that are no further than 30 km. Here is my algorithm:
```python
# distance function
from math import sin, cos, sqrt, atan2, radians

def distance(obj1, obj2):
    lat1 = radians(obj1.latitude)
    lon1 = radians(obj1.longitude)
    lat2 = radians(obj2.latitude)
    lon2 = radians(obj2.longitude)

    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    return round(6373.0 * c, 2)

def distances():
    cities = City.objects.all()  # I am using Django ORM
    for city in cities:
        closest = list()
        for tested_city in cities:
            dist = distance(city, tested_city)  # renamed: "distance" would shadow the function
            if dist <= 30.0 and dist != 0.0:
                closest.append(tested_city)
        city.closest_cities.add(*closest)  # again, Django thing
        city.save()  # Django
```
This works, but it takes an awful lot of time; it's going to take weeks to complete. Is there any way I could speed it up?
You're doing a whole lot of unnecessary work.
As others have suggested, you can limit the number of calculations by changing your looping structure. You have:
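```python
# the loop structure from the question: every city against every city
for city in cities:
    for tested_city in cities:
        ...
```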
So not only will you compare every city with itself, you'll also compare `city1` with `city2` and later compare `city2` with `city1`. I'm not a Python programmer, so I can't tell you what syntax to use here, but what you want is a nested loop structure similar to:
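(The original answer sketched this in pseudocode; here is one Python rendering:)

```python
cities = list(City.objects.all())
for i in range(len(cities) - 1):
    for j in range(i + 1, len(cities)):
        # each unordered pair of cities is considered exactly once
        city, tested_city = cities[i], cities[j]
        ...
```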
That will halve the number of city comparisons you need to do. That reduces it from about 3 billion distance calculations to about 1.5 billion.
Others have also mentioned an early-out potential by comparing `dlat` and `dlon` before getting into the expensive trig functions.

You can also save some time by converting `lat1` and `lon1` to radians once, and also computing `cos(lat1)` once, and passing those values to your distance calculation rather than computing them every time. For example:
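(A sketch of that refactoring; the `prepare` helper and tuple layout are illustrative, not the only way to do it:)

```python
from math import sin, cos, sqrt, atan2, radians

def prepare(city):
    """Convert a city's coordinates to radians, and take cos(lat), once."""
    lat = radians(city.latitude)
    return lat, radians(city.longitude), cos(lat)

def distance(p1, p2):
    """Distance in km between two cities already passed through prepare()."""
    lat1, lon1, cos_lat1 = p1
    lat2, lon2, cos_lat2 = p2
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos_lat1 * cos_lat2 * sin((lon2 - lon1) / 2) ** 2)
    return 6373.0 * 2 * atan2(sqrt(a), sqrt(1 - a))
```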
And you don't really need to convert `c` to kilometers. For example, you have `return round(6373.0 * c, 2)`, and the result of that has to be `<= 30.0`. Why do the multiplication and rounding? You could just `return c`, and in your code compare the return value with `0.0047` (which is `30.0/6373`).
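Continuing the sketch above, that amounts to dropping the final multiplication and comparing against a precomputed constant:

```python
ANGLE_LIMIT = 30.0 / 6373.0  # ~0.0047 radians

def central_angle(p1, p2):
    """Like distance() above, but returns the raw central angle c."""
    lat1, lon1, cos_lat1 = p1
    lat2, lon2, cos_lat2 = p2
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos_lat1 * cos_lat2 * sin((lon2 - lon1) / 2) ** 2)
    return 2 * atan2(sqrt(a), sqrt(1 - a))

# within 30 km, and not the same point:
# 0 < central_angle(p1, p2) <= ANGLE_LIMIT
```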
You can also use the `multiprocessing` module to distribute the work across multiple cores. The suggestions above are straightforward to apply; for the multiprocessing part I suggest using `imap_unordered()` to achieve maximum speed, with a workflow similar to this:
But you also need to change your models somewhat. For distributed processing you need to decouple the `closest_cities` variable, since different processes will be changing it. You can use a dictionary of lists at the main-process level, storing all the closest cities under each city as a key, and then write them to every model at the end of the loop, or as you go.

You can't afford to compute the distance between every pair of cities. Instead, you need to put your cities in a space-partitioning data structure for which you can make fast nearest-neighbour queries. SciPy comes with a kd-tree implementation, `scipy.spatial.KDTree`, that is suitable for this application.

There are two difficulties here. First, `scipy.spatial.KDTree` uses the Euclidean distance between points, but you want the great circle distance along the surface of the Earth. Second, longitude wraps around, so nearest neighbours might have longitudes that differ by 360°. Both problems can be solved if you take the following approach:

1. Convert your locations from geodetic coordinates (latitude, longitude) to ECEF (Earth-Centered, Earth-Fixed) coordinates (x, y, z).
2. Put these ECEF coordinates into the `scipy.spatial.KDTree`.

3. Convert your great circle distance (for example, 30 km) into a Euclidean distance.
4. Call `scipy.spatial.KDTree.query_ball_point` to get the cities within range.

Here's some example code to illustrate this approach. The function `geodetic2ecef` comes from PySatel by David Parunakian and is licensed under the GPL. Let's make up fifty thousand random city locations and convert them to ECEF coordinates:
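(A sketch: the `geodetic2ecef` body below follows the standard WGS84 formulas, and `euclidean_distance`, which the notes refer to, is included alongside it.)

```python
from math import radians, cos, sin, sqrt
import numpy as np

# WGS84 ellipsoid constants, in kilometres.
A = 6378.137              # semi-major axis
B = 6356.7523142          # semi-minor axis
ESQ = 6.69437999014e-3    # first eccentricity squared

def geodetic2ecef(lat, lon, alt=0):
    """Convert geodetic coordinates (in degrees) to ECEF (x, y, z) in km."""
    lat, lon = radians(lat), radians(lon)
    xi = sqrt(1 - ESQ * sin(lat) ** 2)
    x = (A / xi + alt) * cos(lat) * cos(lon)
    y = (A / xi + alt) * cos(lat) * sin(lon)
    z = (A / xi * (1 - ESQ) + alt) * sin(lat)
    return x, y, z

def euclidean_distance(distance):
    """Return a Euclidean (chord) distance at least as large as the given
    great circle distance (both in km), so that no nearby city is missed."""
    return 2 * A * sin(distance / (2 * B))

# Fifty thousand random (latitude, longitude) pairs, converted to ECEF.
np.random.seed(0)  # seeded only so the example is repeatable
cities = np.random.uniform([-90, -180], [90, 180], (50000, 2))
ecef = np.array([geodetic2ecef(lat, lon) for lat, lon in cities])
```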
Put them into a `scipy.spatial.KDTree`, then find all cities within about 100 km of London:
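(The London coordinates here are approximate, chosen for illustration.)

```python
import scipy.spatial

tree = scipy.spatial.KDTree(ecef)

# London is at roughly latitude 51.5 N, longitude 0.1 W.
london = geodetic2ecef(51.5, -0.1)
result = tree.query_ball_point([london], r=euclidean_distance(100))
print(result)
```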
This array contains, for each point that you queried, an array of the cities within the distance `r`. Each neighbour is given as its index in the original array that you passed to `KDTree`. So there are three cities within about 100 km of London, namely the cities with indexes 37810, 15755, and 16276 in the original list.

Notes:
You can see from the example output that neighbours with longitudes that differ by about 360° are correctly discovered.
The approach seems fast enough. Here we find neighbours within 30 km for the first thousand cities, taking about 5 seconds:
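For instance, continuing the example above (the exact timing harness here is an assumption):

```python
from timeit import timeit

# Query the first thousand cities against the whole tree, once.
seconds = timeit(lambda: tree.query_ball_point(ecef[:1000],
                                               r=euclidean_distance(30)),
                 number=1)
```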
Extrapolating, we expect to find neighbours within 30 km for all 50,000 cities in about four minutes.
My `euclidean_distance` function overestimates the Euclidean distance corresponding to a given great circle distance (so as not to miss any cities). This might be good enough for some applications (after all, cities are not point objects), but if you need more accuracy than this, then you could filter the resulting points using, say, one of the great circle distance functions from geopy.

You can speed up your distance calculation by not entering the complicated trigonometric formulae when you already know that the cities are further than 30 km apart, because their difference in latitude alone corresponds to more than a 30 km arc. An arc of length a = 30 km corresponds to an angle of a/r = 0.00470736, hence:
The radius 32 is just a dummy value to indicate that the cities are farther than 30 km apart. You should apply similar logic for the longitude, for which you have to consider the largest absolute latitude:
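(Sketched as two more lines inside the function above, once the longitudes have been converted:)

```python
    # Longitude early out: scale by the cosine of the largest absolute
    # latitude, where a degree of longitude covers the least distance.
    if abs(lon2 - lon1) * cos(max(abs(lat1), abs(lat2))) > LIMIT:
        return 32
```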
If you know that your cities are in a fixed range of latitudes, you can adjust the constant limit to the worst case. For example, if all your cities are in the contiguous United States, they should be below latitude 49°N, and your limit is then 0.00471 / cos(49°) = 0.00718.
This simpler criterion means that you will enter the exact calculation for too many cities in Texas or Florida. You could also chain these criteria: use the approximate limit first, then the exact limit based on the maximum absolute latitude next, then calculate the exact distance for all remaining candidates.
You can calculate this limit beforehand with your maximum absolute latitude. This heuristic should also help you to put cities into buckets of fixed longitude and latitude, as RemcoGerlich suggested. His method should speed up your process considerably by considering only reasonable pairs of cities beforehand.
Edit: I'm a bit ashamed to see that my code above did not originally check the absolute value for the limit. Anyway, the real lesson here is that no matter how much you speed up your distance calculation, the real benefit for large data sets comes from choosing an intelligent search mechanism like the bucket search or k-d trees other commenters suggested, possibly together with some memoization to weed out double checks.
I would first create "sectors", each bounded by two latitudes X km apart and two longitudes X km apart. X should be as big as possible, with one restriction: all cities within the same sector must be no more than 30 km apart.
The sectors could be stored in an array:
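(A sketch, keyed by integer grid coordinates; the sector side and the kilometres-per-degree constant are illustrative assumptions:)

```python
from collections import defaultdict
from math import cos, radians

SIDE_KM = 30 / 2 ** 0.5   # so a sector's diagonal is at most 30 km
KM_PER_DEG_LAT = 111.2    # kilometres per degree of latitude, roughly

def sector_of(lat, lon):
    """Return the (row, col) key of the sector containing (lat, lon)."""
    row = int(lat * KM_PER_DEG_LAT // SIDE_KM)
    # Degrees of longitude shrink by cos(latitude), so scale the column
    # index to keep sectors roughly SIDE_KM wide east-west.
    col = int(lon * KM_PER_DEG_LAT * cos(radians(lat)) // SIDE_KM)
    return row, col

sectors = defaultdict(list)   # (row, col) -> cities in that sector
for city in City.objects.all():
    sectors[sector_of(city.latitude, city.longitude)].append(city)
```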
Within this array, it is straightforward to identify the sector containing specific coordinates. It is also easy to identify the sectors adjacent to a specific sector.
Then:
(1) Each city is assigned its sector. Each sector has a list of cities lying therein.
(2) For each city, find all cities in its sector. Those immediately meet the 30-km criterion.
(3) For each city C, find all cities C' in all 8 adjacent sectors. For each C', check the distance C-C' and output C-C' if it's < 30 km.
This algorithm is still O(n^2) in the worst case, but it should be way faster in practice, since for each city you only check a small subset of the whole set.
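As a sketch, steps (2) and (3) on top of the `sectors` mapping from above might look like this (for simplicity the same distance check is applied to same-sector cities too):

```python
def candidates(city):
    """Cities in the same sector as `city` or in one of the 8 adjacent ones."""
    row, col = sector_of(city.latitude, city.longitude)
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            for other in sectors.get((row + dr, col + dc), []):
                if other is not city:
                    yield other

for city in City.objects.all():
    close = [other for other in candidates(city) if distance(city, other) <= 30.0]
```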