I am trying to calculate a distance matrix for a long list of locations identified by Latitude & Longitude using the Haversine formula that takes two tuples of coordinate pairs to produce the distance:
def haversine(point1, point2, miles=False):
""" Calculate the great-circle distance bewteen two points on the Earth surface.
:input: two 2-tuples, containing the latitude and longitude of each point
in decimal degrees.
Example: haversine((45.7597, 4.8422), (48.8567, 2.3508))
:output: Returns the distance bewteen the two points.
The default unit is kilometers. Miles can be returned
if the ``miles`` parameter is set to True.
"""
I can calculate the distance between all points using a nested for loop as follows:
data.head()
id coordinates
0 1 (16.3457688674, 6.30354512503)
1 2 (12.494749307, 28.6263955635)
2 3 (27.794615136, 60.0324947881)
3 4 (44.4269923769, 110.114216113)
4 5 (-69.8540884125, 87.9468778773)
using a simple function:
distance = {}
def haver_loop(df):
for i, point1 in df.iterrows():
distance[i] = []
for j, point2 in df.iterrows():
distance[i].append(haversine(point1.coordinates, point2.coordinates))
return pd.DataFrame.from_dict(distance, orient='index')
But this takes quite a while given the time complexity, running at around 20s for 500 points and I have a much longer list. This has me looking at vectorization, and I've come across numpy.vectorize
((docs), but can't figure out how to apply it in this context.
start by getting all combinations using
itertools.product
that said Im not sure how fast it will be this looks like it might be a duplicate of Python: speeding up geographic comparison
From
haversine's function definition
, it looked pretty parallelizable. So, using one of the best tools for vectorization with NumPy akabroadcasting
and replacing the math funcs with the NumPy equivalentsufuncs
, here's one vectorized solution -Runtime tests -
The other
np.vectorize based solution
has shown some positive promise on performance improvement over the original code, so this section would compare the posted broadcasting based approach against that one.Function definitions -
Timings -
You would provide your function as an argument to
np.vectorize()
, and could then use it as an argument topandas.groupby.apply
as illustrated below:For instance, with sample data as follows:
compare for 500 points: