I have a Pandas
data frame like this:
test = pd.DataFrame({ 'Date' : ['2016-04-01','2016-04-01','2016-04-02',
'2016-04-02','2016-04-03','2016-04-04',
'2016-04-05','2016-04-06','2016-04-06'],
'User' : ['Mike','John','Mike','John','Mike','Mike',
'Mike','Mike','John'],
'Value' : [1,2,1,3,4.5,1,2,3,6]
})
As you can see below, the data set does not have observations for every day necessarily:
Date User Value
0 2016-04-01 Mike 1.0
1 2016-04-01 John 2.0
2 2016-04-02 Mike 1.0
3 2016-04-02 John 3.0
4 2016-04-03 Mike 4.5
5 2016-04-04 Mike 1.0
6 2016-04-05 Mike 2.0
7 2016-04-06 Mike 3.0
8 2016-04-06 John 6.0
I'd like to add a new column which shows the average value for each user for the past n days (in this case n = 2) if at least one day is available, else it would have nan
value. For example, on 2016-04-06
John gets a nan
because he has no data for 2016-04-05
and 2016-04-04
. So the result will be something like this:
Date User Value Value_Average_Past_2_days
0 2016-04-01 Mike 1.0 NaN
1 2016-04-01 John 2.0 NaN
2 2016-04-02 Mike 1.0 1.00
3 2016-04-02 John 3.0 2.00
4 2016-04-03 Mike 4.5 1.00
5 2016-04-04 Mike 1.0 2.75
6 2016-04-05 Mike 2.0 2.75
7 2016-04-06 Mike 3.0 1.50
8 2016-04-06 John 6.0 NaN
It seems that I should a combination of group_by
and customized rolling_mean
after reading several posts in the forum, but I couldn't quite figure out how to do it.
Summary
I think you can use first convert column
Date
to_datetime
, then find missingDays
bygroupby
withresample
and lastapply
rolling