Iterate over rows and expand pandas dataframe

Posted 2020-02-09 14:05

I have a pandas dataframe with a column that contains either single values or lists of values (of unequal length). I want to 'expand' the rows so that each value in a list gets its own row in that column. An example says it all:

dfIn = pd.DataFrame({u'name': ['Tom', 'Jim', 'Claus'],
 u'location': ['Amsterdam', ['Berlin','Paris'], ['Antwerp','Barcelona','Pisa'] ]})

                     location   name
0                   Amsterdam    Tom
1             [Berlin, Paris]    Jim
2  [Antwerp, Barcelona, Pisa]  Claus

I want to turn it into:

dfOut = pd.DataFrame({u'name': ['Tom', 'Jim', 'Jim', 'Claus','Claus','Claus'],
u'location': ['Amsterdam', 'Berlin','Paris', 'Antwerp','Barcelona','Pisa']})

    location   name
0  Amsterdam    Tom
1     Berlin    Jim
2      Paris    Jim
3    Antwerp  Claus
4  Barcelona  Claus
5       Pisa  Claus

I first tried using apply, but as far as I know it cannot return multiple Series. iterrows seems to be the way to go, but the code below gives me an empty dataframe:

def duplicator(series):
    if type(series['location']) == list:
        for location in series['location']:
            subSeries = series
            subSeries['location'] = location
            dfOut.append(subSeries)
    else:
        dfOut.append(series)

for index, row in dfIn.iterrows():
    duplicator(row)
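
A note on why this loop likely accumulates nothing: DataFrame.append returned a new frame instead of modifying dfOut in place, so its result is discarded here (the method was later removed in pandas 2.0). In addition, subSeries = series only gives the same row a second name rather than making a copy. Below is a minimal sketch of the same row-by-row idea that collects plain dicts and builds the frame once at the end. expand_rows is a hypothetical helper name used for illustration, not part of pandas.

```python
import pandas as pd

dfIn = pd.DataFrame({'name': ['Tom', 'Jim', 'Claus'],
                     'location': ['Amsterdam', ['Berlin', 'Paris'],
                                  ['Antwerp', 'Barcelona', 'Pisa']]})

def expand_rows(df, column):
    """Hypothetical helper: yield one dict per value found in `column`."""
    for _, row in df.iterrows():
        # Treat a scalar entry as a one-element list
        values = row[column] if isinstance(row[column], list) else [row[column]]
        for value in values:
            record = row.to_dict()  # fresh dict, so records never alias each other
            record[column] = value
            yield record

# Build the result once, instead of appending inside the loop
dfOut = pd.DataFrame(list(expand_rows(dfIn, 'location')), columns=dfIn.columns)
print(dfOut)
```

Building the frame from a list in one go also avoids the repeated copying that appending row by row would cause.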

2 Answers
叛逆
#2 · 2020-02-09 14:57

This is not particularly interesting or fancy pandas usage, but it works:

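import numpy as np
import pandas as pd

# Wrap scalar entries so that every location is a 1-d array
dfIn['location'] = dfIn['location'].apply(np.atleast_1d)
# Flatten all locations and repeat each name once per location
all_locations = np.hstack(dfIn['location'].tolist())
all_names = np.hstack([[n] * len(locs) for n, locs in dfIn[['name', 'location']].values])
dfOut = pd.DataFrame({'location': all_locations, 'name': all_names})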

It's about 40x faster than the apply/stack/reset_index approach. As far as I can tell, that ratio holds at pretty much all dataframe sizes (I didn't test how it scales with the length of the lists in each row). If you can guarantee that all location entries are already iterables, you can remove the atleast_1d call, which gives roughly another 20% speedup.
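
As a sketch of the variant without atleast_1d, assuming every location entry is already a list (note that 'Amsterdam' is wrapped in a list here, unlike in the question), the names can be repeated with np.repeat instead of a list comprehension. The variable name lengths is just for illustration, and no timing is claimed for this version.

```python
import numpy as np
import pandas as pd

# Assumption: every entry in 'location' is already a list
dfIn = pd.DataFrame({'name': ['Tom', 'Jim', 'Claus'],
                     'location': [['Amsterdam'], ['Berlin', 'Paris'],
                                  ['Antwerp', 'Barcelona', 'Pisa']]})

# Number of locations per row
lengths = dfIn['location'].map(len).to_numpy()

dfOut = pd.DataFrame({
    # Flatten the lists into one 1-d array
    'location': np.concatenate(dfIn['location'].tolist()),
    # Repeat each name once per location in its row
    'name': np.repeat(dfIn['name'].to_numpy(), lengths),
})
print(dfOut)
```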

[This account has been banned]
#3 · 2020-02-09 14:59

If you return a series whose index is a list of locations, then dfIn.apply will collate those series into a table:

import pandas as pd
dfIn = pd.DataFrame({u'name': ['Tom', 'Jim', 'Claus'],
                     u'location': ['Amsterdam', ['Berlin','Paris'],
                                   ['Antwerp','Barcelona','Pisa'] ]})

def expand(row):
    locations = row['location'] if isinstance(row['location'], list) else [row['location']]
    s = pd.Series(row['name'], index=list(set(locations)))
    return s

In [156]: dfIn.apply(expand, axis=1)
Out[156]: 
  Amsterdam Antwerp Barcelona Berlin Paris   Pisa
0       Tom     NaN       NaN    NaN   NaN    NaN
1       NaN     NaN       NaN    Jim   Jim    NaN
2       NaN   Claus     Claus    NaN   NaN  Claus

You can then stack this DataFrame to obtain:

In [157]: dfIn.apply(expand, axis=1).stack()
Out[157]: 
0  Amsterdam      Tom
1  Berlin         Jim
   Paris          Jim
2  Antwerp      Claus
   Barcelona    Claus
   Pisa         Claus
dtype: object

This is a Series, while you want a DataFrame. A little massaging with reset_index gives you the desired result:

dfOut = dfIn.apply(expand, axis=1).stack()
dfOut = dfOut.to_frame().reset_index(level=1, drop=False)
dfOut.columns = ['location', 'name']
dfOut.reset_index(drop=True, inplace=True)
print(dfOut)

yields

    location   name
0  Amsterdam    Tom
1     Berlin    Jim
2      Paris    Jim
3    Antwerp  Claus
4  Barcelona  Claus
5       Pisa  Claus
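
For readers on a newer pandas: assuming version 0.25 or later, DataFrame.explode does this expansion directly and leaves scalar entries such as 'Amsterdam' untouched. This is a different technique from the apply/stack approach above, shown only as a minimal sketch:

```python
import pandas as pd

dfIn = pd.DataFrame({'name': ['Tom', 'Jim', 'Claus'],
                     'location': ['Amsterdam', ['Berlin', 'Paris'],
                                  ['Antwerp', 'Barcelona', 'Pisa']]})

# explode gives each list element its own row and repeats the other columns;
# reset_index(drop=True) replaces the repeated original index with 0..n-1
dfOut = dfIn.explode('location').reset_index(drop=True)
print(dfOut)
```

Unlike the set-based expand function, this keeps the order of the values in each list and does not drop duplicates within a row.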