DataError: No numeric types using mean aggregate f

Posted 2019-07-02 01:02

Question:

I was wondering if someone could help explain the behaviour below when using agg().

import numpy as np
import pandas as pd
import string

Initialise Data Frame

df = pd.DataFrame(data=[list(string.ascii_lowercase)[0:5]*2,list(range(1,11)),list(range(11,21))]).T
df.columns = ['g','c1','c2']

df.sort_values(['g']).head(5)

   g c1  c2
0  a  1  11
5  a  6  16
1  b  2  12
6  b  7  17
2  c  3  13

As an example, I am summing and averaging across c1 and c2 while grouping by g.

No data error scenario:

f = {
    'c1': lambda g: df.loc[g.index].c2.sum() + g.sum(),
    'c2': lambda g: (df.loc[g.index].c1.sum() + g.sum()) / (g.count() + df.loc[g.index].c1.count()),
}
df = df.groupby('g',as_index=False).agg(f)

Error with data type:

rnm_cols = dict(sum='Sum', mean='Mean') #, std='Std')
df = df.set_index(['g']).stack().groupby('g').agg(rnm_cols.keys()).rename(columns=rnm_cols)

This raises: DataError: No numeric types to aggregate

I know if I initialise my data frame using the below I can avoid this issue:

df[['c1','c2']] = df[['c1','c2']].apply(lambda x: pd.to_numeric(x, errors='coerce'))

However, I'm trying to understand why aggregating with the mean function raises this error.

Answer 1:

This is due to the way GroupBy objects handle the different aggregation methods. In fact sum and mean are handled differently (see below for more details).

But the bottom line is that mean only works for numeric types, and your data frame contains none:

>>> df.dtypes
g     object
c1    object
c2    object
dtype: object

By applying pd.to_numeric you convert them to numeric type and the agg works.
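
A minimal sketch of that conversion, assuming a reasonably recent pandas. It rebuilds the frame from the question, records the dtypes before and after pd.to_numeric, and then takes the group means. The names dtypes_before, dtypes_after and means are only illustrative and do not appear in the question.

```python
import string

import pandas as pd

# Rebuild the question's frame. Each source row mixes letters and integers
# once transposed, so every column ends up with object dtype.
df = pd.DataFrame(
    data=[list(string.ascii_lowercase)[0:5] * 2, list(range(1, 11)), list(range(11, 21))]
).T
df.columns = ["g", "c1", "c2"]
dtypes_before = df.dtypes.astype(str).to_dict()

# Convert the value columns to a numeric dtype, as suggested in the question.
df[["c1", "c2"]] = df[["c1", "c2"]].apply(pd.to_numeric, errors="coerce")
dtypes_after = df.dtypes.astype(str).to_dict()

# With numeric columns the mean aggregation has something to work on.
means = df.groupby("g")[["c1", "c2"]].mean()
print(dtypes_before)
print(dtypes_after)
print(means)
```

Group a holds the values 1 and 6 in c1 and 11 and 16 in c2, so its means should come out as 3.5 and 13.5.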

But let's take a closer look:

GroupBy.mean

This function call dispatches to self._cython_agg_general, which checks for numeric types and raises a DataError if it finds none (as in your example). The call to self._cython_agg_general is wrapped in try/except, but on a GroupByError it simply re-raises, and DataError inherits from GroupByError. Hence the exception.

GroupBy.sum

This function is defined in a different way, through a generic wrapper function. The wrapper similarly dispatches to self._cython_agg_general inside a try/except, but it has no specific clause for GroupByError (no idea why; maybe that's a good question for the developers, so they can unify the behavior of GroupBy objects). Because self._cython_agg_general again raises the DataError, execution enters the except Exception clause, which falls back to self.aggregate. From there you can trace it through a dozen additional function calls, but in the end it simply adds up the individual items of the series. They are stored as objects, but adding them in Python is no problem since they are in fact ints.
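
The last point can be checked in isolation. The following is a small sketch, assuming pandas is installed: integers placed in an object-dtype Series keep behaving like Python ints, so adding them works. The Series s is a made-up example, not data from the question.

```python
import pandas as pd

# Integers stored in an object-dtype Series (hypothetical sample values).
s = pd.Series([1, 6], dtype=object)

# Summing an object Series adds the stored Python objects together.
total = s.sum()

# The same addition done by hand on the individual items.
by_hand = s.iloc[0] + s.iloc[1]
print(s.dtype, total, by_hand)
```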

Summary

So it all comes down to the different ways exceptions are handled by the two aggregation functions; mean re-raises on DataError but sum doesn't. The "why" still remains an open question to me as well.
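
The difference can be sketched in plain Python. This is not the pandas source, only a simplified model of the two try/except patterns described above. The exception classes are local stand-ins named after the pandas ones, and fast_path, mean_like and sum_like are hypothetical helpers.

```python
class GroupByError(Exception):
    """Local stand-in for pandas' GroupByError."""


class DataError(GroupByError):
    """Local stand-in for pandas' DataError, a subclass of GroupByError."""


def fast_path(values, dtype, func):
    # Models the numeric-only check: object dtype is rejected outright.
    if dtype == "object":
        raise DataError("No numeric types to aggregate")
    return func(values)


def mean_like(values, dtype):
    # Pattern described for mean: a GroupByError is re-raised unchanged.
    try:
        return fast_path(values, dtype, lambda v: sum(v) / len(v))
    except GroupByError:
        raise
    except Exception:
        return sum(values) / len(values)


def sum_like(values, dtype):
    # Pattern described for sum: no GroupByError clause, so the DataError
    # is swallowed and the items are added one by one in Python.
    try:
        return fast_path(values, dtype, sum)
    except Exception:
        return sum(values)


print(sum_like([1, 6], "object"))
try:
    mean_like([1, 6], "object")
except DataError as exc:
    print("mean_like raised:", exc)
```

In this model the same DataError is raised in both cases; only the handler around it decides whether the caller ever sees it.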

See also

  • Inconsistencies in groupby aggregation with non-numeric types
  • SeriesGroupby.cumsum raises on object dtype