Pandas DataFrame rolling with two columns and two rows

Posted 2020-04-07 03:37

Question:

I have a DataFrame with two columns holding longitude and latitude coordinates:

import pandas as pd

values = {'Latitude': {0: 47.021503365600005,
  1: 47.021503365600005,
  2: 47.021503365600005,
  3: 47.021503365600005,
  4: 47.021503365600005,
  5: 47.021503365600005},
 'Longitude': {0: 15.481974060399999,
  1: 15.481974060399999,
  2: 15.481974060399999,
  3: 15.481974060399999,
  4: 15.481974060399999,
  5: 15.481974060399999}}

df = pd.DataFrame(values)
df.head()

Now I want to apply a rolling window function to the DataFrame that receives the Longitude and Latitude (two columns) of two consecutive rows (window size 2), so that I can calculate the haversine distance between them.

def haversine_distance(x):
    print(x)    # inspect what each window contains
    return 0.0  # rolling.apply expects a numeric return value

df.rolling(2, axis=1).apply(haversine_distance)

My problem is that I never get all four values Lng1, Lat1 (first row) and Lng2, Lat2 (second row). If I use axis=1, then I will get Lng1 and Lat1 of the first row. If I use axis=0, then I will get Lng1 and Lng2 of the first and second row, but Longitude only.

How can I apply a rolling window using two rows and two columns? Somewhat like this:

def haversine_distance(x):
    row1 = x[0]
    row2 = x[1]
    lng1, lat1 = row1['Longitude'], row1['Latitude']
    lng2, lat2 = row2['Longitude'], row2['Latitude']
    # do your stuff here
    return 1

Currently I do this calculation by joining the DataFrame with a copy of itself shifted by shift(-1), which puts all four coordinates in one row. But it should be possible with rolling as well. Another option is to combine Lng and Lat into one column and apply rolling with axis=0 to that. But there must be an easier way, right?
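For reference, the shift(-1) approach described above might look roughly like the sketch below. The helper name haversine_km, the Earth radius constant, and the sample coordinates are assumptions made for illustration; they are not from the original post.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant for this sketch


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km between points given in decimal degrees."""
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Illustrative coordinates (not the question's data)
df = pd.DataFrame({'Latitude': [0.0, 0.0, 1.0],
                   'Longitude': [0.0, 1.0, 1.0]})

# shift(-1) aligns each row with the row that follows it
nxt = df.shift(-1)
df['Distance'] = haversine_km(df['Latitude'], df['Longitude'],
                              nxt['Latitude'], nxt['Longitude'])
print(df)
```

Each row holds the distance to the next row, so the last row is NaN because it has no successor. The computation is vectorized, which is usually faster than calling a Python function once per window.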

Answer 1:

Since pandas v0.23, Rolling.apply() can pass a Series instead of an ndarray to the applied function. Just set raw=False. From the documentation:

raw : bool, default None

False : passes each row or column as a Series to the function.

True or None : the passed function will receive ndarray objects instead. If you are just applying a NumPy reduction function this will achieve much better performance. The raw parameter is required and will show a FutureWarning if not passed. In the future raw will default to False.

New in version 0.23.0.
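A quick way to see the difference is to record the type of object the function receives under each setting. This is a minimal sketch; record_type and the sample Series are hypothetical names and data used only for the demonstration.

```python
import pandas as pd

seen = []


def record_type(window):
    # Remember what kind of object rolling.apply handed us
    seen.append(type(window).__name__)
    return 0.0  # apply needs a numeric result


s = pd.Series([1.0, 2.0, 3.0])
s.rolling(2).apply(record_type, raw=True)   # windows arrive as NumPy arrays
s.rolling(2).apply(record_type, raw=False)  # windows arrive as pandas Series

print(sorted(set(seen)))
```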

So building on your example, you could move the latitude into the index and pass the whole longitude Series, index included, to your function:

df = df.set_index('Latitude')
df['Distance'] = df['Longitude'].rolling(2).apply(haversine_distance, raw=False)
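A complete haversine_distance for this approach could be sketched as follows. It assumes each window is a Series whose values are longitudes and whose index holds the matching latitudes, as set up above. The Earth radius constant and the sample coordinates are assumptions for illustration.

```python
import math

import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant for this sketch


def haversine_distance(window):
    # With raw=False, window is a Series: values are longitudes,
    # and the index carries the corresponding latitudes.
    lng1, lng2 = window.iloc[0], window.iloc[1]
    lat1, lat2 = window.index[0], window.index[1]
    lat1, lng1, lat2, lng2 = map(math.radians, (lat1, lng1, lat2, lng2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Illustrative coordinates (not the question's data)
df = pd.DataFrame({'Latitude': [0.0, 1.0, 2.0],
                   'Longitude': [0.0, 0.0, 0.0]})

df = df.set_index('Latitude')
df['Distance'] = df['Longitude'].rolling(2).apply(haversine_distance, raw=False)
print(df)
```

Note that here each row holds the distance from the previous row, so the first row is NaN. This is the opposite alignment from a shift(-1) join, where the last row is the empty one.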