I was wondering if someone could help explain the below behaviour using agg()
import numpy as np
import pandas as pd
import string
Initialise Data Frame
df = pd.DataFrame(data=[list(string.ascii_lowercase)[0:5]*2,list(range(1,11)),list(range(11,21))]).T
df.columns = ['g','c1','c2']
df.sort_values(['g']).head(5)
   g  c1  c2
0  a   1  11
5  a   6  16
1  b   2  12
6  b   7  17
2  c   3  13
As an example I am summing and averaging across c1 and c2 while doing a group by g
No data error scenario:
f = { 'c1' : lambda g: df.loc[g.index].c2.sum() + g.sum(), 'c2' : lambda g: (df.loc[g.index].c1.sum() + g.sum())/(g.count()+df.loc[g.index].c1.count())}
df = df.groupby('g',as_index=False).agg(f)
Error with data type:
rnm_cols = dict(sum='Sum', mean='Mean') #, std='Std')
df = df.set_index(['g']).stack().groupby('g').agg(rnm_cols.keys()).rename(columns=rnm_cols)
This raises DataError: No numeric types to aggregate
I know if I initialise my data frame using the below I can avoid this issue:
df[['c1','c2']] = df[['c1','c2']].apply(lambda x: pd.to_numeric(x, errors='coerce'))
However, I'm trying to understand why aggregating with the mean function raises this error, while sum does not.
This is due to the way GroupBy objects handle the different aggregation methods. In fact, sum and mean are handled differently (see below for more details). But the bottom line is that mean only works for numeric types, which are not present in your data frame: all of its columns have object dtype. By applying pd.to_numeric you convert them to a numeric type and the agg works.
But let's take a closer look:
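As a quick check before going into the internals, the sketch below rebuilds the data frame from the question and prints its dtypes before and after conversion. The names numeric and means are mine, introduced only for illustration. I am also assuming that converting column by column is an acceptable substitute for the apply call in the question. How mean behaves on the unconverted object columns depends on the pandas version, so the sketch only computes the mean on the converted copy.

```python
import string

import pandas as pd

# Rebuild the frame from the question: transposing rows of mixed str/int
# values leaves every column with object dtype.
df = pd.DataFrame(
    data=[list(string.ascii_lowercase)[0:5] * 2, list(range(1, 11)), list(range(11, 21))]
).T
df.columns = ["g", "c1", "c2"]
print(df.dtypes)  # all three columns report object

# Convert the value columns on a copy (hypothetical name: numeric).
numeric = df.copy()
for col in ["c1", "c2"]:
    numeric[col] = pd.to_numeric(numeric[col], errors="coerce")
print(numeric.dtypes)  # c1 and c2 are now numeric, g is still object

# With numeric columns the group-wise mean is computed without complaint.
means = numeric.groupby("g")[["c1", "c2"]].mean()
print(means)
```

For group a the values are 1 and 6 in c1 and 11 and 16 in c2, so the means come out as 3.5 and 13.5.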
GroupBy.mean
This function call dispatches to self._cython_agg_general, which checks for numeric types and, in case it doesn't find any (which is the case for your example), raises a DataError. Though the call to self._cython_agg_general is wrapped in try/except, in case of a GroupByError it just re-raises, and DataError inherits from GroupByError. Thus the exception.
GroupBy.sum
This function is defined in a different way, namely through a wrapper function. The wrapper function similarly dispatches to self._cython_agg_general, wrapped in try/except, but it doesn't add a specific clause for GroupByErrors (no idea why though; maybe that's a good question for the developers, so they can unify the behavior of GroupBy objects). Because self._cython_agg_general again raises the DataError, it will enter the except Exception clause, in which it falls back to self.aggregate. From here you can trace it down through a dozen additional function calls, but in the end it will simply add up the single items of the series (which are stored as objects, but adding them in Python is no problem since they are in fact ints).
Summary
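To make the contrast concrete, here is a toy model of the two code paths. It is not the pandas source: the classes GroupByError and DataError are local stand-ins with the same inheritance relationship, and fast_numeric_agg, python_fallback, mean_like and sum_like are hypothetical names I made up. The dtype is passed as a plain string, which is a simplification of the real dtype check.

```python
# Toy model only: every name here is a hypothetical stand-in, not pandas code.


class GroupByError(Exception):
    """Stand-in for the base groupby exception."""


class DataError(GroupByError):
    """Stand-in for DataError, which inherits from GroupByError."""


def fast_numeric_agg(values, dtype, how):
    """Plays the role of the fast path: it refuses non-numeric dtypes."""
    if dtype == "object":
        raise DataError("No numeric types to aggregate")
    total = sum(values)
    return total if how == "sum" else total / len(values)


def python_fallback(values, how):
    """Plays the role of the slow path: plain Python addition of the items."""
    total = 0
    for value in values:
        total = total + value
    return total if how == "sum" else total / len(values)


def mean_like(values, dtype):
    try:
        return fast_numeric_agg(values, dtype, "mean")
    except GroupByError:
        raise  # DataError is a GroupByError, so it propagates to the caller
    except Exception:
        return python_fallback(values, "mean")


def sum_like(values, dtype):
    try:
        return fast_numeric_agg(values, dtype, "sum")
    except Exception:
        # No dedicated GroupByError clause, so DataError lands here too.
        return python_fallback(values, "sum")


group = [1, 6]  # Python ints that would sit in an object column
print(sum_like(group, "object"))  # falls back and adds the items
try:
    mean_like(group, "object")
except DataError as exc:
    print("mean raised:", exc)
```

In this model sum_like returns 7 for the object-typed group, while mean_like lets the DataError escape, which mirrors the behaviour seen in the question.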
So it all comes down to the different ways exceptions are handled by the two aggregation functions: mean re-raises on DataError but sum doesn't. The "why" still remains an open question to me as well.
See also