可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I am trying to calculate a distance matrix for a long list of locations identified by Latitude & Longitude using the Haversine formula that takes two tuples of coordinate pairs to produce the distance:

def haversine(point1, point2, miles=False):
    """ Calculate the great-circle distance bewteen two points on the Earth surface.

    :input: two 2-tuples, containing the latitude and longitude of each point
    in decimal degrees.

    Example: haversine((45.7597, 4.8422), (48.8567, 2.3508))

    :output: Returns the distance bewteen the two points.
    The default unit is kilometers. Miles can be returned
    if the ``miles`` parameter is set to True.

    """

I can calculate the distance between all points using a nested for loop as follows:

data.head()

   id                      coordinates
0   1   (16.3457688674, 6.30354512503)
1   2    (12.494749307, 28.6263955635)
2   3    (27.794615136, 60.0324947881)
3   4   (44.4269923769, 110.114216113)
4   5  (-69.8540884125, 87.9468778773)

using a simple function:

distance = {}
def haver_loop(df):
    for i, point1 in df.iterrows():
        distance[i] = []
        for j, point2 in df.iterrows():
            distance[i].append(haversine(point1.coordinates, point2.coordinates))

    return pd.DataFrame.from_dict(distance, orient='index')

But this takes quite a while given the time complexity, running at around 20s for 500 points and I have a much longer list. This has me looking at vectorization, and I've come across numpy.vectorize ((docs), but can't figure out how to apply it in this context.

回答1:

You would provide your function as an argument to np.vectorize(), and could then use it as an argument to pandas.groupby.apply as illustrated below:

haver_vec = np.vectorize(haversine, otypes=[np.int16])
distance = df.groupby('id').apply(lambda x: pd.Series(haver_vec(df.coordinates, x.coordinates)))

For instance, with sample data as follows:

length = 500
df = pd.DataFrame({'id':np.arange(length), 'coordinates':tuple(zip(np.random.uniform(-90, 90, length), np.random.uniform(-180, 180, length)))})

compare for 500 points:

def haver_vect(data):
    distance = data.groupby('id').apply(lambda x: pd.Series(haver_vec(data.coordinates, x.coordinates)))
    return distance

%timeit haver_loop(df): 1 loops, best of 3: 35.5 s per loop

%timeit haver_vect(df): 1 loops, best of 3: 593 ms per loop

回答2:

From haversine's function definition, it looked pretty parallelizable. So, using one of the best tools for vectorization with NumPy aka broadcasting and replacing the math funcs with the NumPy equivalents ufuncs, here's one vectorized solution -

# Get data as a Nx2 shaped NumPy array
data = np.array(df['coordinates'].tolist())

# Convert to radians
data = np.deg2rad(data)                     

# Extract col-1 and 2 as latitudes and longitudes
lat = data[:,0]                     
lng = data[:,1]         

# Elementwise differentiations for lattitudes & longitudes
diff_lat = lat[:,None] - lat
diff_lng = lng[:,None] - lng

# Finally Calculate haversine
d = np.sin(diff_lat/2)**2 + np.cos(lat[:,None])*np.cos(lat) * np.sin(diff_lng/2)**2
return 2 * 6371 * np.arcsin(np.sqrt(d))

Runtime tests -

The other np.vectorize based solution has shown some positive promise on performance improvement over the original code, so this section would compare the posted broadcasting based approach against that one.

Function definitions -

def vectotized_based(df):
    haver_vec = np.vectorize(haversine, otypes=[np.int16])
    return df.groupby('id').apply(lambda x: pd.Series(haver_vec(df.coordinates, x.coordinates)))

def broadcasting_based(df):
    data = np.array(df['coordinates'].tolist())
    data = np.deg2rad(data)                     
    lat = data[:,0]                     
    lng = data[:,1]         
    diff_lat = lat[:,None] - lat
    diff_lng = lng[:,None] - lng
    d = np.sin(diff_lat/2)**2 + np.cos(lat[:,None])*np.cos(lat) * np.sin(diff_lng/2)**2
    return 2 * 6371 * np.arcsin(np.sqrt(d))

Timings -

In [123]: # Input
     ...: length = 500
     ...: d1 = np.random.uniform(-90, 90, length)
     ...: d2 = np.random.uniform(-180, 180, length)
     ...: coords = tuple(zip(d1, d2))
     ...: df = pd.DataFrame({'id':np.arange(length), 'coordinates':coords})
     ...: 

In [124]: %timeit vectotized_based(df)
1 loops, best of 3: 1.12 s per loop

In [125]: %timeit broadcasting_based(df)
10 loops, best of 3: 68.7 ms per loop

回答3:

start by getting all combinations using itertools.product

 results= [(p1,p2,haversine(p1,p2))for p1,p2 in itertools.product(points,repeat=2)]

that said Im not sure how fast it will be this looks like it might be a duplicate of Python: speeding up geographic comparison

Vectorizing Haversine distance calculation in Pyth

问题:

回答1:

回答2:

回答3:

收藏的人(0)

Vectorizing Haversine distance calculation in Pyth

问题:

回答1:

回答2:

回答3:

收藏的人(0)

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮