I was wondering if someone could help explain the behaviour below when using agg().
import numpy as np
import pandas as pd
import string
# Initialise the data frame
df = pd.DataFrame(data=[list(string.ascii_lowercase)[0:5]*2, list(range(1,11)), list(range(11,21))]).T
df.columns = ['g', 'c1', 'c2']
df.sort_values(['g']).head(5)
   g c1 c2
0  a  1 11
5  a  6 16
1  b  2 12
6  b  7 17
2  c  3 13
As an example, I am summing and averaging across c1 and c2 while grouping by g.
Scenario without a DataError:
f = { 'c1' : lambda g: df.loc[g.index].c2.sum() + g.sum(), 'c2' : lambda g: (df.loc[g.index].c1.sum() + g.sum())/(g.count()+df.loc[g.index].c1.count())}
df = df.groupby('g',as_index=False).agg(f)
Error with data type:
rnm_cols = dict(sum='Sum', mean='Mean') #, std='Std')
df = df.set_index(['g']).stack().groupby('g').agg(rnm_cols.keys()).rename(columns=rnm_cols)
I get: DataError: No numeric types to aggregate
I know I can avoid this issue by converting the columns to numeric types after initialising my data frame:
df[['c1','c2']] = df[['c1','c2']].apply(lambda x: pd.to_numeric(x, errors='coerce'))
However, I'm trying to understand why aggregating with the mean function raises this error.
This is due to the way GroupBy objects handle the different aggregation methods. In fact, sum and mean are handled differently (see below for more details). But the bottom line is that mean only works for numeric types, which are not present in your data frame:
>>> df.dtypes
g object
c1 object
c2 object
dtype: object
By applying pd.to_numeric you convert them to a numeric type and the agg call works.
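As a minimal sketch of that conversion, the snippet below rebuilds the data frame from the question, confirms that c1 and c2 start out with object dtype, and then aggregates a converted copy. The names numeric and result are only illustrative, and the conversion uses assign instead of the in-place assignment from the question.

```python
import string

import pandas as pd

# Rebuild the frame from the question. Each original row mixes strings and
# ints, so after the transpose every column ends up with object dtype.
df = pd.DataFrame(
    data=[list(string.ascii_lowercase)[0:5] * 2, list(range(1, 11)), list(range(11, 21))]
).T
df.columns = ["g", "c1", "c2"]

# Convert the value columns to a numeric dtype on a copy (illustrative name).
numeric = df.assign(c1=pd.to_numeric(df["c1"]), c2=pd.to_numeric(df["c2"]))

# With numeric dtypes, both sum and mean aggregate per group.
result = numeric.groupby("g")[["c1", "c2"]].agg(["sum", "mean"])
print(df.dtypes)
print(result)
```

Whether mean fails on the unconverted frame depends on the pandas version, so the sketch only exercises the converted copy.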
But let's take a closer look:
GroupBy.mean
This function call dispatches to self._cython_agg_general, which checks for numeric types and raises a DataError if it doesn't find any (which is the case in your example). Although the call to self._cython_agg_general is wrapped in try/except, a GroupByError is simply re-raised, and DataError inherits from GroupByError. Hence the exception.
GroupBy.sum
This function is defined in a different way, namely through a wrapper function. The wrapper similarly dispatches to self._cython_agg_general, wrapped in try/except, but it doesn't add a specific clause for GroupByErrors (no idea why, though; maybe that's a good question for the developers, so they can unify the behavior of GroupBy objects). Because self._cython_agg_general again raises the DataError, it enters the except Exception clause, in which it falls back to self.aggregate. From here you can trace it down through a dozen additional function calls, but in the end it simply adds the individual items of the series (which are stored as objects, but adding them in Python is no problem since they are in fact ints).
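A small sketch of that last point, using the two c1 values of group a: integers stored in an object-dtype series are still ordinary Python ints, so adding them one by one at the Python level works. This only illustrates the idea and is not the code path pandas actually takes.

```python
import pandas as pd

# The c1 values of group "a", stored with object dtype as in the question.
s = pd.Series([1, 6], dtype=object)

# The elements are plain Python ints, even though the dtype is object.
element_types = {type(v) for v in s}

# Python-level addition of the individual items.
total = sum(s)
print(s.dtype, element_types, total)
```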
Summary
So it all comes down to the different ways the two aggregation functions handle exceptions: mean re-raises on DataError but sum doesn't. The "why" remains an open question to me as well.
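The difference in exception handling can be mimicked with a toy model in plain Python. Everything here is hypothetical: the exception classes are stand-ins defined locally, fast_aggregate plays the role of the numeric-only fast path, and toy_mean and toy_sum only imitate the control flow described above, not the actual pandas source. The generic fallback in toy_mean is an assumption added for illustration.

```python
class GroupByError(Exception):
    """Local stand-in for a generic group-by failure (not the pandas class)."""


class DataError(GroupByError):
    """Local stand-in for the 'No numeric types to aggregate' error."""


def fast_aggregate(values, dtype, how):
    # Hypothetical fast path: refuses anything that is not a numeric dtype.
    if dtype == "object":
        raise DataError("No numeric types to aggregate")
    total = sum(values)
    return total if how == "sum" else total / len(values)


def toy_mean(values, dtype):
    try:
        return fast_aggregate(values, dtype, "mean")
    except GroupByError:
        # Re-raised as is, so the DataError reaches the caller.
        raise
    except Exception:
        # Assumed generic fallback for any other failure.
        return sum(values) / len(values)


def toy_sum(values, dtype):
    try:
        return fast_aggregate(values, dtype, "sum")
    except Exception:
        # No special clause for GroupByError: fall back to adding the
        # items one by one at the Python level.
        total = values[0]
        for value in values[1:]:
            total = total + value
        return total


print(toy_sum([1, 6], "object"))
try:
    toy_mean([1, 6], "object")
except DataError as exc:
    print("mean failed:", exc)
```

In this model the only difference between the two functions is the except GroupByError clause, which is enough to make one fail and the other succeed on the same input.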
See also
- Inconsistencies in groupby aggregation with non-numeric types
- SeriesGroupby.cumsum raises on object dtype