I have a 16x10 panda dataframe with 1x35000 arrays (or NaN) in each cell. I want to take the element-wise mean over rows for each column.
1 2 3 ... 10
1 1x35000 1x35000 1x35000 1x35000
2 1x35000 NaN 1x35000 1x35000
3 1x35000 NaN 1x35000 NaN
...
16 1x35000 1x35000 NaN 1x35000
To avoid misunderstandings: take the first element of each array in the first column and take the mean. Then take the second element of each array in the first column and take the mean again. In the end I want to have a 1x10 dataframe with one 1x35000 array each per column. The array should be the element-wise mean of my former arrays.
1 2 3 ... 10
1 1x35000 1x35000 1x35000 1x35000
Do you have an idea to get there elegantly preferably without for-loops?
Setup
np.random.seed([3,14159])
df = pd.DataFrame(
np.random.randint(10, size=(3, 3, 5)).tolist(),
list('XYZ'), list('ABC')
).applymap(np.array)
df.loc['X', 'B'] = np.nan
df.loc['Z', 'A'] = np.nan
df
A B C
X [4, 8, 1, 1, 9] NaN [8, 2, 8, 4, 9]
Y [4, 3, 4, 1, 5] [1, 2, 6, 2, 7] [7, 1, 1, 7, 8]
Z NaN [9, 3, 8, 7, 7] [2, 6, 3, 1, 9]
Solution
g = df.stack().groupby(level=1)
g.apply(np.sum, axis=0) / g.size()
A [4.0, 5.5, 2.5, 1.0, 7.0]
B [5.0, 2.5, 7.0, 4.5, 7.0]
C [5.66666666667, 3.0, 4.0, 4.0, 8.66666666667]
dtype: object
If you insist on the shape you presented
g = df.stack().groupby(level=1)
(g.apply(np.sum, axis=0) / g.size()).to_frame().T
A B C
0 [4.0, 5.5, 2.5, 1.0, 7.0] [5.0, 2.5, 7.0, 4.5, 7.0] [5.66666666667, 3.0, 4.0, 4.0, 8.66666666667]
Approach #1 : Loopy
Given the mixed dtype input data, we might want to loop through for performance efficiency. So, looping with explicit loops or under-the-hood usages of .apply/.applymap
would be the solutions that could be suggested.
Here's one way looping through columns -
mask = ~df.isnull().values
n = df.shape[1]
out = np.empty((1,n),dtype=object)
for i in range(n):
out[0,i] = df.iloc[mask[:,i],i].mean()
df_out = pd.DataFrame(out)
Sample input, output -
In [326]: df
Out[326]:
0 1 2
0 [4, 0, 1, 6] [4, 2, 2, 2] [5, 3, 5, 4]
1 NaN [0, 5, 6, 8] NaN
2 NaN NaN NaN
3 NaN NaN NaN
In [327]: df_out
Out[327]:
0 1 2
0 [4.0, 0.0, 1.0, 6.0] [2.0, 3.5, 4.0, 5.0] [5.0, 3.0, 5.0, 4.0]
Approach #2 : Vectorized
If you have to vectorize, here's one way using matrix-multiplication
to replace the mean-reductions
and that could bring about improvements for large data -
mask = ~df.isnull().values
v = np.vstack(df.values[mask])
r,c = np.where(mask)
n = df.shape[1]
pos_mask = c == np.arange(n)[:,None]
out = pos_mask.dot(v)/np.bincount(c).astype(float)[:,None]
df_out1 = pd.DataFrame(out)
Sample output -
In [328]: df_out1
Out[328]:
0 1 2 3
0 4.0 0.0 1.0 6.0
1 2.0 3.5 4.0 5.0
2 5.0 3.0 5.0 4.0
Approach #3 : Vectorized one more
Making use of np.add.reduceat
to get those mean-reductions
-
mask = ~df.T.isnull().values
v = np.vstack(df.values.T[mask])
count = mask.sum(1)
out0 = np.add.reduceat(v, np.r_[0,count.cumsum()[:-1]])
out = out0/count[:,None].astype(float)
df_out2 = pd.DataFrame(out)