I was motivated to use pandas' `rolling` feature to perform a rolling multi-factor regression (this question is NOT about rolling multi-factor regression). I expected that I'd be able to use `apply` after a `df.rolling(2)`, take the resulting `pd.DataFrame`, extract the ndarray with `.values`, and perform the requisite matrix multiplication. It didn't work out that way.
Here is what I found:
import pandas as pd
import numpy as np
np.random.seed([3,1415])
df = pd.DataFrame(np.random.rand(5, 2).round(2), columns=['A', 'B'])
X = np.random.rand(2, 1).round(2)
What the objects look like:
print "\ndf = \n", df
print "\nX = \n", X
print "\ndf.shape =", df.shape, ", X.shape =", X.shape
df =
A B
0 0.44 0.41
1 0.46 0.47
2 0.46 0.02
3 0.85 0.82
4 0.78 0.76
X =
[[ 0.93]
[ 0.83]]
df.shape = (5, 2) , X.shape = (2L, 1L)
Matrix multiplication behaves normally:
df.values.dot(X)
array([[ 0.7495],
[ 0.8179],
[ 0.4444],
[ 1.4711],
[ 1.3562]])
Using apply to perform row by row dot product behaves as expected:
df.apply(lambda x: x.values.dot(X)[0], axis=1)
0 0.7495
1 0.8179
2 0.4444
3 1.4711
4 1.3562
dtype: float64
Groupby -> Apply behaves as I'd expect:
df.groupby(level=0).apply(lambda x: x.values.dot(X)[0, 0])
0 0.7495
1 0.8179
2 0.4444
3 1.4711
4 1.3562
dtype: float64
But when I run:
df.rolling(1).apply(lambda x: x.values.dot(X))
I get:
AttributeError: 'numpy.ndarray' object has no attribute 'values'
OK, so pandas is using a plain `ndarray` within its `rolling` implementation. I can handle that. Instead of using `.values` to get the `ndarray`, let's try:
df.rolling(1).apply(lambda x: x.dot(X))
shapes (1,) and (2,1) not aligned: 1 (dim 0) != 2 (dim 0)
Wait! What?!
So I created a custom function to look at what `rolling` is doing.
def print_type_sum(x):
print type(x), x.shape
return x.sum()
Then ran:
print df.rolling(1).apply(print_type_sum)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
<type 'numpy.ndarray'> (1L,)
A B
0 0.44 0.41
1 0.46 0.47
2 0.46 0.02
3 0.85 0.82
4 0.78 0.76
My resulting `pd.DataFrame` is the same, that's good. But it printed out 10 single-dimensional `ndarray` objects. What about `rolling(2)`?
print df.rolling(2).apply(print_type_sum)
<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
<type 'numpy.ndarray'> (2L,)
A B
0 NaN NaN
1 0.90 0.88
2 0.92 0.49
3 1.31 0.84
4 1.63 1.58
Same thing: I get the expected output, but it printed 8 `ndarray` objects. `rolling` is producing a single-dimensional `ndarray` of length `window` for each column, as opposed to what I expected, which was an `ndarray` of shape `(window, len(df.columns))`.
The question is: why?
I now don't have a way to easily run a rolling multi-factor regression.
Using the strided-views concept on the dataframe, here's a vectorized approach:
Runtime test:
Verify results:
Huge improvement there, which I am hoping would stay noticeable on larger arrays!
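For readers who want to try the strided-view idea, here is a minimal sketch (not the original code from this answer). It assumes Python 3 and NumPy 1.20 or later, where `numpy.lib.stride_tricks.sliding_window_view` is available. The sample data and the names `windows` and `products` are illustrative only.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view  # requires NumPy >= 1.20

# Illustrative data with the same shapes as in the question.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((5, 2)).round(2), columns=["A", "B"])
X = rng.random((2, 1)).round(2)
window = 2

# sliding_window_view appends the window axis last, giving
# (n_windows, n_columns, window); transpose to (n_windows, window, n_columns)
# so that each windows[i] is the 2-D block df.iloc[i:i + window].
windows = sliding_window_view(df.to_numpy(), window, axis=0).transpose(0, 2, 1)

# Multiply every window by X in a single call: (n_windows, window, 1).
products = windows @ X
```

Each `windows[i]` is a read-only view on the original data rather than a copy, which is where the speedup over a Python-level loop would be expected to come from.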
I wanted to share what I've done to work around this problem.
Given a `pd.DataFrame` and a window, I generate a stacked `ndarray` using `np.dstack` (see answer). I then convert it to a `pd.Panel` and, using `pd.Panel.to_frame`, convert it to a `pd.DataFrame`. At this point, I have a `pd.DataFrame` that has an additional level on its index relative to the original `pd.DataFrame`, and the new level contains information about each rolled period. For example, if the roll window is 3, the new index level will contain `[0, 1, 2]`: an item for each period. I can now `groupby` `level=0` and return the groupby object. This now gives me an object that I can much more intuitively manipulate.
Roll Function
Demonstration
Let's `sum`
To peek under the hood, we can see the structure:
But what about the purpose for which I built this, rolling multi-factor regression? I'll settle for matrix multiplication for now.
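`pd.Panel` has been removed from recent pandas releases, so the approach described above will not run there as written. As a rough sketch of the same idea (an extra index level per rolled period, then a groupby on the outer level), one could stack the windows with `pd.concat` instead. The helper `roll` below is a hypothetical stand-in, not the original roll function, and it assumes Python 3 with a recent pandas.

```python
import numpy as np
import pandas as pd

def roll(df, window):
    """Hypothetical helper: group every full window of `df` under the
    index label of the window's last row."""
    pieces = {
        # Inner index is reset to 0..window-1, one entry per rolled period.
        df.index[end - 1]: df.iloc[end - window:end].reset_index(drop=True)
        for end in range(window, len(df) + 1)
    }
    stacked = pd.concat(pieces)  # MultiIndex: (window end label, period)
    return stacked.groupby(level=0)

# Same values as the df and X printed in the question.
df = pd.DataFrame(
    [[0.44, 0.41], [0.46, 0.47], [0.46, 0.02], [0.85, 0.82], [0.78, 0.76]],
    columns=["A", "B"],
)
X = np.array([[0.93], [0.83]])

# Rolling sum over whole windows; matches df.rolling(2).sum() without the NaN row.
rolled_sum = roll(df, 2).sum()

# Matrix multiplication on each 2-D window, keyed by the window's end label.
rolled_dot = {end: w.to_numpy() @ X for end, w in roll(df, 2)}
```

Iterating over the groupby object hands each window over as a full `(window, len(df.columns))` frame, which is the shape the question was hoping to get from `rolling(...).apply`.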
I made the following modifications to the above answer, since I needed to return the entire rolling window as is done in `pd.DataFrame.rolling()`:
Since pandas v0.23 it is now possible to pass a `Series` instead of an `ndarray` to `Rolling.apply()`. Just set `raw=False`.
As noted, if you only need a single dimension, passing it raw is obviously more efficient. This is probably the answer to your question: `Rolling.apply()` was initially built to pass an `ndarray` only, because this is the most efficient.
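A small sketch of the difference, assuming Python 3 and pandas 0.23 or later. The recording functions and list names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    [[0.44, 0.41], [0.46, 0.47], [0.46, 0.02], [0.85, 0.82], [0.78, 0.76]],
    columns=["A", "B"],
)

series_types, array_types = [], []

def sum_recording_series(x):
    series_types.append(type(x).__name__)  # record what pandas passed in
    return x.sum()

def sum_recording_array(x):
    array_types.append(type(x).__name__)
    return x.sum()

# raw=False: each window arrives as a pd.Series (index labels available).
as_series = df.rolling(2).apply(sum_recording_series, raw=False)

# raw=True: each window arrives as a plain 1-D ndarray.
as_array = df.rolling(2).apply(sum_recording_array, raw=True)
```

Note that even with `raw=False` the function still receives one column's window at a time, so this does not by itself give a `(window, len(df.columns))` block.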