pandas groupby and rolling_apply ignoring NaNs

Posted 2020-07-07 05:58

Question:

I have a pandas dataframe and I want to calculate the rolling mean of a column (after a groupby clause). However, I want to exclude NaNs.

For instance, if the rolling window contains [2, NaN, 1], the result should be 1.5, whereas currently it returns NaN.
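
In other words, for a single window the behaviour I'm after is what numpy's nanmean gives:

import numpy as np

np.nanmean([2, np.nan, 1])   # 1.5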

I've tried the following but it doesn't seem to work:

df.groupby(by=['var1'])['value'].apply(pd.rolling_apply, 3,  lambda x: np.mean([i for i in x if i is not np.nan and i!='NaN']))

Even if I try this:

df.groupby(by=['var1'])['value'].apply(pd.rolling_apply, 3,  lambda x: 1)

I'm getting NaN in the output so it must be something to do with how pandas works in the background.

Any ideas?

EDIT: Here is a code sample with what I'm trying to do:

import pandas as pd
import numpy as np

df = pd.DataFrame({'var1' : ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b'], 'value' : [1, 2, 3, np.nan, 2, 3, 4, 1] })
print df.groupby(by=['var1'])['value'].apply(pd.rolling_apply, 2,  lambda x: np.mean([i for i in x if i is not np.nan and i!='NaN']))

The result is:

0    NaN
1    NaN
2    2.0
3    NaN
4    2.5
5    NaN
6    3.0
7    2.0

while I wanted to have the following:

0    NaN
1    NaN
2    2.0
3    2.0
4    2.5
5    3.0
6    3.0
7    2.0

Answer 1:

As always in pandas, sticking to vectorized methods (i.e. avoiding apply) is essential for performance and scalability.

The operation you want to do is a little fiddly, as rolling operations on groupby objects are not currently NaN-aware (as of pandas 0.18.1). As such, we'll need a few short lines of code:

g1 = df.groupby(['var1'])['value']              # group values  
g2 = df.fillna(0).groupby(['var1'])['value']    # fillna, then group values

s = g2.rolling(2).sum() / g1.rolling(2).count() # the actual computation

s.reset_index(level=0, drop=True).sort_index()  # drop/sort index

The idea is to sum the values in the window (using sum), count the non-NaN values (using count) and then divide to find the mean. This code gives the following output, which matches your desired output:

0    NaN
1    NaN
2    2.0
3    2.0
4    2.5
5    3.0
6    3.0
7    2.0
Name: value, dtype: float64

Testing this on a larger DataFrame (around 100,000 rows), the run-time was under 100ms, significantly faster than any apply-based methods I tried.

It may be worth testing the different approaches on your actual data as timings may be influenced by other factors such as the number of groups. It's fairly certain that vectorized computations will win out, though.
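
If you want to reproduce a rough benchmark of this approach at that scale, something like the following should work in IPython (the df_big name and the replication factor are just for illustration; your timings will differ):

>>> df_big = pd.concat([df] * 12500, ignore_index=True)   # ~100,000 rows, illustrative sizing
>>> %timeit df_big.fillna(0).groupby('var1')['value'].rolling(2).sum() \
        / df_big.groupby('var1')['value'].rolling(2).count()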


The approach shown above works well for simple calculations, such as the rolling mean. It will work for more complicated calculations (such as rolling standard deviation), although the implementation is more involved.

The general idea is to look at each simple routine that is fast in pandas (e.g. sum) and then fill any null values with an identity element (e.g. 0). You can then use groupby and perform the rolling operation (e.g. .rolling(2).sum()). The output is then combined with the output(s) of other operations.
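
Applied to the rolling mean already computed above, the recipe could be packaged roughly like this (the name rolling_nanmean and the hard-coded 'var1'/'value' columns are just for illustration, mirroring the snippet earlier in this answer):

def rolling_nanmean(df, window):
    """
    Group df by 'var1' values and then calculate the rolling mean of
    'value', skipping NaNs, via the sum/count trick described above.
    """
    g1 = df.groupby(['var1'])['value']              # original values, for counting non-NaN
    g2 = df.fillna(0).groupby(['var1'])['value']    # NaN -> 0, the identity element for sum
    s = g2.rolling(window).sum() / g1.rolling(window).count()
    return s.reset_index(level=0, drop=True).sort_index()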

For example, to implement groupby NaN-aware rolling variance (of which standard deviation is the square-root) we must find "the mean of the squares minus the square of the mean". Here's a sketch of what this could look like:

def rolling_nanvar(df, window):
    """
    Group df by 'var1' values and then calculate rolling variance,
    adjusting for the number of NaN values in the window.

    Note: user may wish to edit this function to control degrees of
    freedom (n), depending on their overall aim.
    """
    g1 = df.groupby(['var1'])['value']
    g2 = df.fillna(0).groupby(['var1'])['value']
    # fill missing values with 0, square values and groupby
    g3 = df['value'].fillna(0).pow(2).groupby(df['var1'])

    n = g1.rolling(window).count()

    mean_of_squares = g3.rolling(window).sum() / n
    square_of_mean = (g2.rolling(window).sum() / n)**2
    variance = mean_of_squares - square_of_mean
    return variance.reset_index(level=0, drop=True).sort_index()
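
As a quick sanity check, calling this on the sample df from the question with a window of 2 should give roughly the following (values worked out by hand, assuming population variance, i.e. n in the denominator, and the pandas behaviour discussed in this answer):

>>> rolling_nanvar(df, 2)   # expected values below are hand-computed
0     NaN
1     NaN
2    1.00
3    0.00
4    0.25
5    0.00
6    1.00
7    1.00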

Note that this function may not be numerically stable (the mean-of-squares minus square-of-mean formula can suffer from catastrophic cancellation, and squaring can overflow for very large values). pandas uses Welford's algorithm internally to mitigate this issue.

Anyway, this function, although it uses several operations, is still very fast. Here's a comparison with the more concise apply-based method suggested by Yakym Pirozhenko:

>>> df2 = pd.concat([df]*10000, ignore_index=True) # 80000 rows
>>> %timeit df2.groupby('var1')['value'].apply(\
         lambda gp: gp.rolling(7, min_periods=1).apply(np.nanvar))
1 loops, best of 3: 11 s per loop

>>> %timeit rolling_nanvar(df2, 7)
10 loops, best of 3: 110 ms per loop

Vectorization is roughly 100 times faster in this case. Of course, depending on how much data you have, you may wish to stick with apply, since it gives you generality and brevity at the expense of performance.



Answer 2:

Does this result match your expectations? I changed your solution slightly, using the min_periods parameter and the right filter for NaN. np.isnan(i) is the correct test; the identity check i is not np.nan never matches the floats taken out of the window, so nothing was actually being filtered. min_periods=1 also matters: with the default (min_periods equal to the window size), pd.rolling_apply returns NaN for any window without enough non-NaN values, without ever calling your function, which is why even lambda x: 1 gave NaN.

In [164]: df.groupby(by=['var1'])['value'].apply(pd.rolling_apply, 2,  lambda x: np.mean([i for i in x if not np.isnan(i)]), min_periods=1)
Out[164]: 
0    1.0
1    2.0
2    2.0
3    2.0
4    2.5
5    3.0
6    3.0
7    2.0
dtype: float64


Answer 3:

Here is an alternative implementation without the list comprehension, but like the answer above it fails to leave the first entry of each group as np.nan (the desired output in the question has NaN for rows 0 and 1):

means = df.groupby('var1')['value'].apply(
    lambda gp: gp.rolling(2, min_periods=1).apply(np.nanmean))
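
One possible way (an untested sketch) to restore those leading NaNs would be to mask rows that do not yet have a full window within their group; this assumes means keeps the original row index, as it does for this sample:

window = 2
pos = df.groupby('var1').cumcount()   # position of each row within its group
# Untested sketch: rows earlier than (window - 1) inside their group cannot have
# a full window yet; blank them out to match the desired output from the question.
means = means.where(pos >= window - 1)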