Pivot a pandas DataFrame to be the correct format:

2019-01-28 11:15发布

Here is a pandas DataFrame I would like to manipulate:

import pandas as pd

data = {"grouping": ["item1", "item1", "item1", "item2", "item2", "item2", "item2", ...],
        "labels": ["A", "B", "C", "A", "B", "C", "D", ...],
        "count": [5, 1, 8, 3, 731, 189, 9, ...]}

df = pd.DataFrame(data)

print(df)
>>>   grouping            labels       count
0        item1             A            5
1        item1             B            1
2        item1             C            8
3        item2             A            3
4        item2             B          731
5        item2             C          189
6        item2             D            9
7        ...               ...         ....

I would like to "unfold" this dataframe into the following format:

grouping    A    B    C    D
item1       5    1    8    3
item2       3    731  189  9
....        ........

How would one do this? I would think that this would work:

pd.pivot_table(df,index=["grouping", "labels"]

but I get the following error:

DataError: No numeric types to aggregate

4条回答
倾城 Initia
2楼-- · 2019-01-28 11:47

You put labels in the index, but you want it in the columns:

>>> df.pivot_table(index='grouping', columns='labels')
         count                   
labels       A      B      C    D
grouping                         
item1      5.0    1.0    8.0  NaN
item2      3.0  731.0  189.0  9.0

Note that this makes the columns a MultiIndex. If you don't want that, explicitly pass values: df.pivot_table(index='grouping', columns='labels', values='count').

Also, note that the kind of reshape you seem to be looking for will only be possible if each combination of grouping and label has exactly one or zero values. If any combination occurs more than once, you need to decide how to aggregate them (e.g., by summing the matching values).

查看更多
乱世女痞
3楼-- · 2019-01-28 11:57

Use set_index and unstack:

df = df.set_index(['grouping','labels']).unstack().rename_axis(None)
df.columns = df.columns.droplevel()
print(df)

Output:

labels  A    B    C     D
item1   5    1    8  None
item2   3  731  189     9
查看更多
够拽才男人
4楼-- · 2019-01-28 12:04

Try:

In [1]: import pandas as pd
   ...: 
   ...: data = {"grouping": ["item1", "item1", "item1", "item2", "item2", "item2", "item2"],
   ...:         "labels": ["A", "B", "C", "A", "B", "C", "D"],
   ...:         "count": [5, 1, 8, 3, 731, 189, 9]}
   ...: 
In [2]: df = pd.DataFrame(data)
In [3]: df.pivot_table(index="grouping",columns="labels")

Out[3]: 
             count              
    labels       A    B    C   D
    grouping                    
    item1        5    1    8 NaN
    item2        3  731  189   9
查看更多
一夜七次
5楼-- · 2019-01-28 12:05

There are four idiomatic pandas ways to do this.

  • No duplicates among grouping columns. Does not require aggregation
    • pivot
    • set_index
  • Duplicates among grouping columns. Does require aggregation
    • pivot_table
    • groupby

pivot

df.pivot('grouping', 'labels', 'count')

set_index

df.set_index(['grouping', 'labels'])['count'].unstack()

pivot_table

df.pivot_table('count', 'grouping', 'labels')

groupby

df.groupby(['grouping', 'labels'])['count'].sum().unstack()

All yield

labels      A      B      C    D
grouping                        
item1     5.0    1.0    8.0  NaN
item2     3.0  731.0  189.0  9.0

timing

enter image description here

With the groupby, set_index, or pivot_table approach, you can easily fill in missing values with fill_value=0

df.pivot_table('count', 'grouping', 'labels', fill_value=0)

df.groupby(['grouping', 'labels'])['count'].sum().unstack(fill_value=0)

df.set_index(['grouping', 'labels'])['count'].sum().unstack(fill_value=0)

All yield

labels    A    B    C  D
grouping                
item1     5    1    8  0
item2     3  731  189  9

Additional thoughts on groupby

Because we don't require any aggregation. If we wanted to use groupby, we can minimize the impact of the implicit aggregation by utilizing a less impactful aggregator.

df.groupby(['grouping', 'labels'])['count'].max().unstack()

or

df.groupby(['grouping', 'labels'])['count'].first().unstack()

timing groupby

enter image description here

查看更多
登录 后发表回答