Have Pandas column containing lists, how to pivot

Posted 2019-03-25 14:14

I wrote a web scraper to pull information from a table of products and build a dataframe. The data table has a Description column which contains a comma separated string of attributes describing the product. I want to create a column in the dataframe for every unique attribute and populate the row in that column with the attribute's substring. Example df below.

PRODUCTS     DATE        DESCRIPTION
Product A    2016-9-12   Steel, Red, High Hardness
Product B    2016-9-11   Blue, Lightweight, Steel
Product C    2016-9-12   Red
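
For reference, this example frame can be reconstructed with a few lines (a sketch, assuming a standard pandas import):

import pandas as pd

# sample data copied from the table above
df = pd.DataFrame({
    'PRODUCTS': ['Product A', 'Product B', 'Product C'],
    'DATE': ['2016-9-12', '2016-9-11', '2016-9-12'],
    'DESCRIPTION': ['Steel, Red, High Hardness',
                    'Blue, Lightweight, Steel',
                    'Red'],
})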

I figure the first step is to split the description into a list.

In: df2 = df['DESCRIPTION'].str.split(',\s*')

Out:
DESCRIPTION
['Steel', 'Red', 'High Hardness']
['Blue', 'Lightweight', 'Steel']
['Red']

My desired output looks like the table below. The column names are not particularly important.

PRODUCTS     DATE        STEEL_COL  RED_COL    HIGH HARDNESS_COL  BLUE_COL   LIGHTWEIGHT_COL
Product A    2016-9-12   Steel      Red        High Hardness
Product B    2016-9-11   Steel                                    Blue       Lightweight
Product C    2016-9-12              Red

I believe the columns can be set up using a pivot, but I'm not sure of the most Pythonic way to populate them once they're established. Any help is appreciated.

UPDATE

Thank you very much for the answers. I selected @MaxU's response as correct since it seems slightly more flexible, but @piRSquared's gets a very similar result and may even be considered the more Pythonic approach. I tested both versions and both do what I needed. Thanks!

5 Answers
劫难
#2 · 2019-03-25 14:53

Here is my crack at a solution extended from a problem I was already working on.

def group_agg_pivot_df(df, group_cols, agg_func='count', agg_col=None):

    if agg_col is None:
        agg_col = group_cols[0]

    grouped = df.groupby(group_cols).agg({agg_col: agg_func}) \
        .unstack().fillna(0)
    # drop aggregation column name from hierarchical column names
    grouped.columns = grouped.columns.droplevel()

    # promote the remaining index levels (here PRODUCTS and DATE) back to columns
    pivot_df = grouped.reset_index()
    pivot_df.columns = [s.replace(' ', '_').lower() for s in pivot_df.columns]
    return pivot_df

def split_stack_df(df, id_cols, split_col, new_col_name):
    # id_cols are the columns we want to pair with the values
    # from the split column

    # split on commas plus any following whitespace; one stacked row per attribute
    stacked = df.set_index(id_cols)[split_col].str.split(r',\s*', expand=True) \
        .stack().reset_index(level=id_cols)
    stacked.columns = id_cols + [new_col_name]
    return stacked

stacked = split_stack_df(df, ['PRODUCTS', 'DATE'], 'DESCRIPTION', 'desc')
final_df = group_agg_pivot_df(stacked, ['PRODUCTS', 'DATE', 'desc'])

I also benchmarked @MaxU's, @piRSquared's, and my solutions on a pandas data frame with 11592 rows and a column containing lists with 2681 unique values. The column names differ in the testing data frame, but I have kept them the same as in the question here.

Here are the benchmarks for each method

In [277]: %timeit pd.get_dummies(df.set_index(['PRODUCTS', 'DATE']) \
 ...:                        .DESCRIPTION.str.split(',', expand=True) \
 ...:                        .stack()) \
 ...:     .groupby(['PRODUCTS', 'DATE']).sum()
 ...: 

1 loop, best of 3: 1.14 s per loop

In [278]: %timeit df.set_index(['PRODUCTS', 'DATE']) \
 ...:     .DESCRIPTION.str.split(',', expand=True) \
 ...:     .stack() \
 ...:     .reset_index() \
 ...:     .pivot_table(index=['PRODUCTS', 'DATE'], columns=0, fill_value=0, aggfunc='size')

1 loop, best of 3: 612 ms per loop

In [286]: %timeit stacked = split_stack_df(df, ['PRODUCTS', 'DATE'], 'DESCRIPTION', 'desc'); \
 ...:     final_df = group_agg_pivot_df(stacked, ['PRODUCTS', 'DATE', 'desc'])

1 loop, best of 3: 62.7 ms per loop

My guess is that aggregation and unstacking are faster than either pivot_table() or pd.get_dummies().

虎瘦雄心在
#3 · 2019-03-25 14:57

You can build up a sparse matrix:

In [27]: df
Out[27]:
    PRODUCTS       DATE                DESCRIPTION
0  Product A  2016-9-12  Steel, Red, High Hardness
1  Product B  2016-9-11   Blue, Lightweight, Steel
2  Product C  2016-9-12                        Red

In [28]: (df.set_index(['PRODUCTS','DATE'])
   ....:    .DESCRIPTION.str.split(',\s*', expand=True)
   ....:    .stack()
   ....:    .reset_index()
   ....:    .pivot_table(index=['PRODUCTS','DATE'], columns=0, fill_value=0, aggfunc='size')
   ....: )
Out[28]:
0                    Blue  High Hardness  Lightweight  Red  Steel
PRODUCTS  DATE
Product A 2016-9-12     0              1            0    1      1
Product B 2016-9-11     1              0            1    0      1
Product C 2016-9-12     0              0            0    1      0

In [29]: (df.set_index(['PRODUCTS','DATE'])
   ....:    .DESCRIPTION.str.split(',\s*', expand=True)
   ....:    .stack()
   ....:    .reset_index()
   ....:    .pivot_table(index=['PRODUCTS','DATE'], columns=0, fill_value='', aggfunc='size')
   ....: )
Out[29]:
0                   Blue High Hardness Lightweight Red Steel
PRODUCTS  DATE
Product A 2016-9-12                  1               1     1
Product B 2016-9-11    1                         1         1
Product C 2016-9-12                                  1
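
If you want PRODUCTS and DATE back as ordinary columns, matching the layout in the question, a small follow-up should do it (a sketch; the leftover 0 label on the columns comes from the stacked level):

result = (df.set_index(['PRODUCTS', 'DATE'])
            .DESCRIPTION.str.split(',\s*', expand=True)
            .stack()
            .reset_index()
            .pivot_table(index=['PRODUCTS', 'DATE'], columns=0,
                         fill_value='', aggfunc='size')
            .reset_index())
result.columns.name = None  # drop the leftover '0' columns label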
叼着烟拽天下
#4 · 2019-03-25 14:58

The answers posted by @piRSquared and @MaxU work very well, but only when the data has no NaN values.

The data I was working with was very sparse: it had around 1M rows, which shrank to only about 100 rows after applying the above method, because it dropped every row with a NaN in any of the columns. It took me more than a day to figure out the fix, so I'm sharing the slightly modified code to save others the time.

Supposing you have the df DataFrame as above:

  • First replace all NaN occurrences with a placeholder that cannot appear in any of the other columns, since you will have to swap it back to NaN later.

    cols = ['PRODUCTS', 'DATE']
    col = "DESCRIPTION"
    df.loc[:, cols] = df.loc[:, cols].fillna("SOME_UNIQ_NAN_REPLACEMENT")
    

    This is needed because groupby drops every row that has a NaN in any of the grouping columns. :/

  • Then we run what is suggested in the other answers, with one minor modification: stack(dropna=False). By default, dropna=True.

    df = pd.get_dummies(df.set_index(cols)[col]
            .str.split(r",\s*", expand=True).stack(dropna=False), prefix=col) \
            .groupby(cols, sort=False).sum().astype(int).reset_index()
    
  • Then put the NaNs back into df so the data in the other columns isn't altered; a consolidated sketch of all three steps follows this list.

    df.replace("SOME_UNIQ_NAN_REPLACEMENT", np.nan, inplace=True)
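
Putting it together, here is a consolidated sketch of the three steps as a single helper (the function name is my own; it assumes pandas/numpy imports and the df layout from the question):

import numpy as np
import pandas as pd

def dummies_with_nan_keys(df, cols, col, placeholder='SOME_UNIQ_NAN_REPLACEMENT'):
    # 1. Protect the grouping keys: groupby silently drops rows whose keys are NaN
    df.loc[:, cols] = df.loc[:, cols].fillna(placeholder)
    # 2. Split, stack (keeping NaNs), dummy-encode, then re-aggregate per key
    out = pd.get_dummies(df.set_index(cols)[col]
              .str.split(r',\s*', expand=True).stack(dropna=False), prefix=col) \
              .groupby(cols, sort=False).sum().astype(int).reset_index()
    # 3. Restore the original NaNs in the key columns
    out.replace(placeholder, np.nan, inplace=True)
    return out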
    

Hope this saves hours of frustration for someone.

迷人小祖宗
#5 · 2019-03-25 15:06

How about something that places an 'X' in the feature column if the product has that feature?

The code below creates a list of unique features ('Steel', 'Red', etc.), then adds a column to the original df for each feature. We then iterate through each row and, for each of the product's features, place an 'X' in the corresponding cell. (This assumes DESCRIPTION already holds lists, as after the split step in the question.)

# flatten every description list into one list of attributes
ml = [item for l in df.DESCRIPTION for item in l]

unique_list_of_attributes = list(set(ml))  # unique features list

# place empty columns in the original df for each feature
df = pd.concat([df, pd.DataFrame(columns=unique_list_of_attributes)]).fillna(value='')

# add an 'X' in the column if the product has that feature
for idx, row in df.iterrows():
    for attribute in row['DESCRIPTION']:
        df.loc[idx, attribute] = 'X'

Updated with example output:

    PRODUCTS       DATE                 DESCRIPTION Blue HighHardness  \
0  Product A  2016-9-12  [Steel, Red, HighHardness]                 X   
1  Product B  2016-9-11  [Blue, Lightweight, Steel]    X                
2  Product C  2016-9-12                       [Red]                     

  Lightweight Red Steel  
0               X     X  
1           X         X  
2               X       
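
For larger frames, the same 'X' marking can also be done without the Python-level loop. A sketch (my own variation, not part of the original answer), again assuming DESCRIPTION holds lists:

# one-hot encode the lists, then map 1 -> 'X' and 0 -> ''
dummies = df['DESCRIPTION'].str.join('|').str.get_dummies()
df = df.join(dummies.where(dummies == 0, 'X').replace(0, ''))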
看我几分像从前
#6 · 2019-03-25 15:14

Use pd.get_dummies

cols = ['PRODUCTS', 'DATE']
pd.get_dummies(
    df.set_index(cols).DESCRIPTION \
      .str.split(',\s*', expand=True).stack()
).groupby(level=cols).sum().astype(int)

The result (posted as an image in the original answer) is the same 0/1 table as Out[28] above:

                     Blue  High Hardness  Lightweight  Red  Steel
PRODUCTS  DATE
Product A 2016-9-12     0              1            0    1      1
Product B 2016-9-11     1              0            1    0      1
Product C 2016-9-12     0              0            0    1      0
