Group dataframe by multiple columns and append the

This is similar to "Attach a calculated column to an existing dataframe"; however, that solution doesn't work when grouping by more than one column in pandas v0.14.

For example:

$ df = pd.DataFrame([
    [1, 1, 1],
    [1, 2, 1],
    [1, 2, 2],
    [1, 3, 1],
    [2, 1, 1]],
    columns=['id', 'country', 'source'])

The following calculation works:

$ df.groupby(['id','country'])['source'].apply(lambda x: x.unique().tolist())


0       [1]
1    [1, 2]
2    [1, 2]
3       [1]
4       [1]
Name: source, dtype: object

But assigning the output to a new column results in an error:

df['source_list'] = df.groupby(['id','country'])['source'].apply(
                               lambda x: x.unique().tolist())

TypeError: incompatible index of inserted column with frame index
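
The error appears to come from index alignment: the grouped result has one entry per (id, country) group and is indexed by those keys, while df has one row per original record. A minimal sketch to inspect the mismatch, using the frame from above and assuming a reasonably recent pandas (details may differ in v0.14):

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 1, 1], [1, 2, 1], [1, 2, 2], [1, 3, 1], [2, 1, 1]],
    columns=['id', 'country', 'source'])

grouped = df.groupby(['id', 'country'])['source'].apply(
    lambda x: x.unique().tolist())

# The result is indexed by the group keys, not by df's row labels.
print(grouped.index.names)
# One entry per group versus one row per record, so the lengths differ too.
print(len(grouped), len(df))
```

Because neither the labels nor the lengths line up, pandas cannot place the grouped values onto the rows of df directly.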

Tags: pandas pandas-groupby

3 Answers

smile是对你的礼貌

Reply #2 · 2019-05-29 21:55

An alternative method that avoids the post-facto merge is providing the index in the function applied to each group, e.g.

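def calculate_on_group(x):
    # x is the 'source' Series for one (id, country) group
    fill_val = x.unique().tolist()
    # repeat the list for every row, keeping the group's original index
    return pd.Series([fill_val] * x.size, index=x.index)

# group_keys=False keeps the original row index so the result aligns with df
df['source_list'] = df.groupby(['id','country'], group_keys=False)['source'].apply(calculate_on_group)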


三岁会撩人

Reply #3 · 2019-05-29 21:56

This can be achieved without the merge by reassigning the result of the groupby.apply to the original dataframe.

df = df.groupby(['id', 'country']).apply(lambda group: _add_sourcelist_col(group))

with your _add_sourcelist_col function being,

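def _add_sourcelist_col(group):
    # group is a DataFrame, so take the 'source' column before calling tolist();
    # repeat the list once per row, since a bare list would be spread across rows
    group['source_list'] = [list(set(group['source'].tolist()))] * len(group)
    return group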

Note that additional columns can also be added in the function you define. Simply add them to each group dataframe, and be sure to return the group at the end of the function.
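
As a rough sketch of adding more than one column per group: the helper name add_group_cols and the extra column n_sources below are hypothetical, chosen only for illustration. The sketch selects the source column explicitly and joins the new columns back onto df, an assumption made so that it behaves the same across pandas versions:

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 1, 1], [1, 2, 1], [1, 2, 2], [1, 3, 1], [2, 1, 1]],
    columns=['id', 'country', 'source'])

def add_group_cols(group):
    # Hypothetical helper: work on a copy so the object pandas passes in is not mutated.
    group = group.copy()
    unique_sources = sorted(set(group['source'].tolist()))
    # Repeat the list once per row so every row holds the full list.
    group['source_list'] = [unique_sources] * len(group)
    # A second, illustrative column: the number of distinct sources in the group.
    group['n_sources'] = len(unique_sources)
    return group

# group_keys=False keeps the original row index on the combined result.
result = (df.groupby(['id', 'country'], group_keys=False)[['source']]
            .apply(add_group_cols))

# Join on the shared row index to attach the new columns to the original frame.
df = df.join(result[['source_list', 'n_sources']])
print(df)
```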

Edit: I'll leave the info above as it might still be useful, but I misinterpreted part of the original question. What the OP was trying to accomplish can be done using,

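def addsource(x):
    # work on a copy so the group passed in by pandas is not mutated
    x = x.copy()
    # repeat the list once per row; a bare list would be spread across rows
    x['source_list'] = [list(set(x.source.tolist()))] * len(x)
    return x

# group_keys=False keeps the original row index on the combined result
df = df.groupby(['id', 'country'], group_keys=False).apply(addsource)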


Fickle 薄情

Reply #4 · 2019-05-29 21:59

Merge grouped result with the initial DataFrame:

>>> df1 = df.groupby(['id','country'])['source'].apply(
             lambda x: x.tolist()).reset_index()

>>> df1
  id  country      source
0  1        1       [1.0]
1  1        2  [1.0, 2.0]
2  1        3       [1.0]
3  2        1       [1.0]

>>> df2 = df[['id', 'country']]
>>> df2
  id  country
1  1        1
2  1        2
3  1        2
4  1        3
5  2        1

>>> pd.merge(df1, df2, on=['id', 'country'])
  id  country      source
0  1        1       [1.0]
1  1        2  [1.0, 2.0]
2  1        2  [1.0, 2.0]
3  1        3       [1.0]
4  2        1       [1.0]
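
To land the lists on the original frame under the name source_list, as the question asks, the merge can also be run from df's side. A sketch under that assumption, using the integer frame from the question and unique() as in the question rather than the plain tolist() used above:

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 1, 1], [1, 2, 1], [1, 2, 2], [1, 3, 1], [2, 1, 1]],
    columns=['id', 'country', 'source'])

# One list of unique sources per (id, country) group, as a flat lookup table.
lists = (df.groupby(['id', 'country'])['source']
           .apply(lambda x: x.unique().tolist())
           .rename('source_list')
           .reset_index())

# A left merge keeps every row of df and attaches the matching group's list.
merged = df.merge(lists, on=['id', 'country'], how='left')
print(merged)
```

Merging from df keeps the original source column alongside the new source_list column.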
