Question:

I want to do a rolling window calculation in pandas that needs to work with two columns at the same time. Here is a simple example to make the problem clear:

import pandas as pd

df = pd.DataFrame({
    'x': [1, 2, 3, 2, 1, 5, 4, 6, 7, 9],
    'y': [4, 3, 4, 6, 5, 9, 1, 3, 1, 2]
})

windowSize = 4
result = []

for i in range(1, len(df)+1):
    if i < windowSize:
        result.append(None)
    else:
        x = df.x.iloc[i-windowSize:i]
        y = df.y.iloc[i-windowSize:i]
        m = y.mean()
        r = sum(x[y > m]) / sum(x[y <= m])
        result.append(r)

print(result)

Is there a way to solve this in pandas without a for loop? Any help is appreciated.

Answer 1:

Here's one vectorized approach using NumPy tools -

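import numpy as np

windowSize = 4
a = df.values
# Sliding windows over each column, one window per row
X = strided_app(a[:,0],windowSize,1)
Y = strided_app(a[:,1],windowSize,1)
M = Y.mean(1)                        # per-window mean of y
mask = Y>M[:,None]                   # True where y exceeds its window mean
sums = np.einsum('ij,ij->i',X,mask)  # per-window sum of x where mask is True
rest_sums = X.sum(1) - sums          # per-window sum of x where y <= mean
out = sums/rest_sums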

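strided_app is a helper that returns the sliding windows of a 1D array as the rows of a 2D array; it is taken from a separate post.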
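The helper's definition is not reproduced in this post. As a rough sketch of what it could look like, assuming it takes a 1D array, a window length L and a step S (the parameter names and the body are assumptions, not the original code):

```python
import numpy as np

def strided_app(a, L, S):
    """Return overlapping windows of the 1D array `a` as rows of a 2D view.

    L is the window length and S the step between window starts
    (both names are assumptions made for this sketch).
    """
    nrows = (a.size - L) // S + 1
    n = a.strides[0]  # byte step between consecutive elements of `a`
    return np.lib.stride_tricks.as_strided(a, shape=(nrows, L), strides=(S * n, n))

y = np.array([4, 3, 4, 6, 5, 9, 1, 3, 1, 2])
Y = strided_app(y, 4, 1)
print(Y.shape)   # one row per window
print(Y[:2])     # first two windows
```

The result is a view on the input rather than a copy, so it should be treated as read-only.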

Runtime test -

Approaches -

# @kazemakase's solution
def rolling_window_sum(df, windowSize=4):
    rw = rolling_window(df.values.T, windowSize)
    m = np.mean(rw[1], axis=-1, keepdims=True)
    a = np.sum(rw[0] * (rw[1] > m), axis=-1)
    b = np.sum(rw[0] * (rw[1] <= m), axis=-1)
    result = a / b
    return result    

# Proposed in this post    
def strided_einsum(df, windowSize=4):
    a = df.values
    X = strided_app(a[:,0],windowSize,1)
    Y = strided_app(a[:,1],windowSize,1)
    M = Y.mean(1)
    mask = Y>M[:,None]
    sums = np.einsum('ij,ij->i',X,mask)
    rest_sums = X.sum(1) - sums
    out = sums/rest_sums
    return out

Timings -

In [46]: df = pd.DataFrame(np.random.randint(0,9,(1000000,2)))

In [47]: %timeit rolling_window_sum(df)
10 loops, best of 3: 90.4 ms per loop

In [48]: %timeit strided_einsum(df)
10 loops, best of 3: 62.2 ms per loop

To squeeze out more performance, we can compute the Y.mean(1) part, which is basically a windowed average, with SciPy's 1D uniform filter. Thus, for windowSize=4, M could alternatively be computed as -

from scipy.ndimage import uniform_filter1d as unif1d

M = unif1d(a[:,1].astype(float),windowSize)[2:-1]

The gain is noticeable, roughly 20% in this test -

In [65]: %timeit strided_einsum(df)
10 loops, best of 3: 61.5 ms per loop

In [66]: %timeit strided_einsum_unif_filter(df)
10 loops, best of 3: 49.4 ms per loop
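The timed function strided_einsum_unif_filter is not listed above. A possible reconstruction, assuming it is simply strided_einsum with M replaced by the uniform-filter version, might look like this (the body, the strided_app helper and the generalised slice bounds are all assumptions of this sketch):

```python
import numpy as np
from scipy.ndimage import uniform_filter1d

def strided_app(a, L, S):
    # Assumed helper: windows of length L, step S, as rows of a 2D view
    nrows = (a.size - L) // S + 1
    n = a.strides[0]
    return np.lib.stride_tricks.as_strided(a, shape=(nrows, L), strides=(S * n, n))

def strided_einsum_unif_filter(df, windowSize=4):
    a = np.asarray(df)  # accepts a two-column DataFrame or 2D array
    n = a.shape[0]
    X = strided_app(a[:, 0], windowSize, 1)
    Y = strided_app(a[:, 1], windowSize, 1)
    # The filter output at position i averages input[i - w//2 : i - w//2 + w],
    # so only positions lo..hi-1 correspond to windows fully inside the data.
    # For windowSize=4 this is the same as the [2:-1] slice.
    lo = windowSize // 2
    hi = n - (windowSize - 1 - lo)
    M = uniform_filter1d(a[:, 1].astype(float), windowSize)[lo:hi]
    mask = Y > M[:, None]
    sums = np.einsum('ij,ij->i', X, mask)
    rest_sums = X.sum(1) - sums
    return sums / rest_sums

data = np.array([[1, 4], [2, 3], [3, 4], [2, 6], [1, 5],
                 [5, 9], [4, 1], [6, 3], [7, 1], [9, 2]])
out = strided_einsum_unif_filter(data)
print(out)
```

With non-integer data the filtered mean may differ from Y.mean(1) in the last bits, which could flip the comparison for values that sit exactly at the mean.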

Answer 2:

You can use the rolling window trick for numpy arrays and apply it to the array underlying the DataFrame.

import pandas as pd
import numpy as np

def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

df = pd.DataFrame({
    'x': [1, 2, 3, 2, 1, 5, 4, 6, 7, 9],
    'y': [4, 3, 4, 6, 5, 9, 1, 3, 1, 2]
})

windowSize = 4    

rw = rolling_window(df.values.T, windowSize)
m = np.mean(rw[1], axis=-1, keepdims=True)
a = np.sum(rw[0] * (rw[1] > m), axis=-1)
b = np.sum(rw[0] * (rw[1] <= m), axis=-1)
result = a / b

The result lacks the leading None values, but they should be easy to prepend (in the form of np.nan, or as None after converting the result to a list).
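One way to add the leading values back, as a small sketch: the result array below is a stand-in holding the seven ratios that the example data produces for windowSize = 4, and the names padded and as_list are made up for illustration.

```python
import numpy as np

windowSize = 4
# Stand-in for the ratios computed above on the 10-row example
result = np.array([1/3, 0.6, 5/6, 1.4, 0.6, 5/17, 15/11])

# Prepend windowSize - 1 NaNs so the output lines up with the DataFrame rows
padded = np.concatenate([np.full(windowSize - 1, np.nan), result])

# Or, as a plain list with None in place of NaN, like the loop version
as_list = [None if np.isnan(v) else float(v) for v in padded]
print(as_list)
```

The padded array has the same length as the DataFrame, so it could be assigned directly as a new column.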

Since you are working with pandas, this is probably not quite what you were looking for, but it gets the job done without loops.
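As a side note, as_strided does no bounds checking, so a wrong shape or stride can read arbitrary memory. NumPy 1.20 and later ship sliding_window_view, which builds the same kind of windowed view with checks. A sketch of the same calculation with it, using plain arrays in place of the DataFrame columns:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

x = np.array([1, 2, 3, 2, 1, 5, 4, 6, 7, 9])  # e.g. df['x'].to_numpy()
y = np.array([4, 3, 4, 6, 5, 9, 1, 3, 1, 2])  # e.g. df['y'].to_numpy()
windowSize = 4

X = sliding_window_view(x, windowSize)  # shape (7, 4), read-only view
Y = sliding_window_view(y, windowSize)
m = Y.mean(axis=-1, keepdims=True)      # per-window mean of y
a = np.sum(X * (Y > m), axis=-1)        # sum of x where y is above the mean
b = np.sum(X * (Y <= m), axis=-1)       # sum of x where y is at or below it
result = a / b
print(result)
```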