Pandas: aggregate when column contains numpy array

I'm using a pandas DataFrame in which one column contains numpy arrays. When trying to sum that column via aggregation I get an error stating 'Must produce aggregated value'.

For example:

import pandas as pd
import numpy as np

DF = pd.DataFrame([[1,np.array([10,20,30])],
               [1,np.array([40,50,60])], 
               [2,np.array([20,30,40])],], columns=['category','arraydata'])

This works the way I would expect it to:

DF.groupby('category').agg(sum)

output:

             arraydata
category 1   [50 70 90]
         2   [20 30 40]

However, since my real data frame has multiple numeric columns, arraydata is not chosen as the default column to aggregate on, and I have to select it manually. Here is one approach I tried:

g=DF.groupby('category')
g.agg({'arraydata':sum})

Here is another:

g=DF.groupby('category')
g['arraydata'].agg(sum)

Both give the same output:

Exception: must produce aggregated value

However, if I have a column that holds numeric rather than array data, it works fine. I can work around this, but it's confusing, and I'm wondering whether this is a bug or whether I'm doing something wrong. I feel like the use of arrays here might be a bit of an edge case, and indeed I wasn't sure if they were supported. Ideas?

Thanks

Tags: python numpy pandas aggregation

2 Answers

甜甜的少女心

Answer #2 · 2019-02-06 20:56

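Pandas works much more efficiently if you don't do this (e.g. by using numeric data, as you suggest). Another alternative is to use a Panel object for this kind of multidimensional data (note that Panel was removed from pandas in version 0.25; a MultiIndex DataFrame fills that role now).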
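As a rough sketch of the "use numeric data" suggestion, assuming every array in the column has the same length: the arrays can be stacked into ordinary numeric columns, after which a plain groupby sum applies. The names `stacked`, `wide` and `sums` are made up for illustration.

```python
import numpy as np
import pandas as pd

DF = pd.DataFrame(
    [[1, np.array([10, 20, 30])],
     [1, np.array([40, 50, 60])],
     [2, np.array([20, 30, 40])]],
    columns=["category", "arraydata"],
)

# Assumption: all arrays share one length, so they stack into a 2-D block.
stacked = np.stack(DF["arraydata"].tolist())

# One numeric column per array element, aligned with the original rows.
wide = pd.DataFrame(stacked, index=DF.index)

# Ordinary numeric aggregation, grouped by the original category column.
sums = wide.groupby(DF["category"]).sum()
```

Each row of `sums` then holds the element-wise total for one category, and `sums.to_numpy()` gives the totals back as a single 2-D array if that is more convenient.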

That said, this looks like a bug. The exception is raised purely because the result is an array:

Exception: Must produce aggregated value

In [11]: %debug
> /Users/234BroadWalk/pandas/pandas/core/groupby.py(1511)_aggregate_named()
   1510             if isinstance(output, np.ndarray):
-> 1511                 raise Exception('Must produce aggregated value')
   1512             result[name] = self._try_cast(output, group)

ipdb> output
array([50, 70, 90])

If you were to recklessly remove these two lines from the source code, it would work as expected:

In [99]: g.agg(sum)
Out[99]:
             arraydata
category
1         [50, 70, 90]
2         [20, 30, 40]

Note: They're almost certainly in there for a reason...


Diving into the Internals

The problem here is that pandas explicitly checks that the output is not an ndarray, because it wants to intelligently reshape your result. You can see this in the following snippet from _aggregate_named, where the error occurs:

def _aggregate_named(self, func, *args, **kwargs):
    result = {}

    for name, group in self:
        group.name = name
        output = func(group, *args, **kwargs)
        if isinstance(output, np.ndarray):
            raise Exception('Must produce aggregated value')
        result[name] = self._try_cast(output, group)

    return result

My guess is that this happens because groupby is explicitly set up to try to intelligently put back together a DataFrame with the same indexes and everything aligned nicely. Since it's rare to have nested arrays in a DataFrame like that, it checks for ndarrays to make sure that you are actually using an aggregate function. In my gut, this feels like a job for Panel, but I'm not sure how to transform it perfectly.

As an aside, you can sidestep this problem by converting your output to a list, like this:

DF.groupby("category").agg({"arraydata": lambda x: list(x.sum())})

Pandas doesn't complain, because the function now returns a list of Python objects rather than an ndarray (though this is really just cheating around the type check). If you want to convert back to arrays, just apply np.array to the result:

result = DF.groupby("category").agg({"arraydata": lambda x: list(x.sum())})
result["arraydata"] = result["arraydata"].apply(np.array)
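To see in isolation why the list conversion gets past the check, here is a standalone sketch that mimics the isinstance test from the snippet above. It is not pandas code: `check_aggregated` is a hypothetical stand-in, and current pandas versions may implement the check differently.

```python
import numpy as np

def check_aggregated(output):
    """Hypothetical stand-in for the isinstance check in _aggregate_named."""
    if isinstance(output, np.ndarray):
        raise Exception("Must produce aggregated value")
    return output

arrays = [np.array([10, 20, 30]), np.array([40, 50, 60])]
total = sum(arrays)  # element-wise sum, still an ndarray

# The ndarray result is rejected by the check.
try:
    check_aggregated(total)
    rejected = False
except Exception:
    rejected = True

# The same values wrapped in a plain list are accepted.
as_list = check_aggregated(list(total))
```

The values are identical in both cases; only the container type differs, which is why the workaround feels like cheating.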

How you want to resolve this issue really depends on why you have columns of ndarrays and whether you want to aggregate anything else at the same time. That said, you can always iterate over the GroupBy yourself, the same way the `for name, group in self` loop in the snippet above does.
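A minimal sketch of iterating over the GroupBy by hand, which avoids agg entirely. It assumes the arrays within each group have equal length, and it collects the totals in a plain dict (the name `sums` is illustrative) rather than rebuilding a DataFrame.

```python
import numpy as np
import pandas as pd

DF = pd.DataFrame(
    [[1, np.array([10, 20, 30])],
     [1, np.array([40, 50, 60])],
     [2, np.array([20, 30, 40])]],
    columns=["category", "arraydata"],
)

# Iterating a grouped column yields (group key, Series of that group's rows).
sums = {
    name: np.sum(group.tolist(), axis=0)  # element-wise sum across the group
    for name, group in DF.groupby("category")["arraydata"]
}
```

Here `sums[1]` is the element-wise total for category 1. Because no pandas type check is involved, each value stays an ndarray.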
