I have pandas dataframe with a column containing values or lists of values (of unequal length). I want to 'expand' the rows, so each value in the list becomes single value in column. An example says it all:
dfIn = pd.DataFrame({u'name': ['Tom', 'Jim', 'Claus'],
u'location': ['Amsterdam', ['Berlin','Paris'], ['Antwerp','Barcelona','Pisa'] ]})
location name
0 Amsterdam Tom
1 [Berlin, Paris] Jim
2 [Antwerp, Barcelona, Pisa] Claus
I want to turn into:
dfOut = pd.DataFrame({u'name': ['Tom', 'Jim', 'Jim', 'Claus','Claus','Claus'],
u'location': ['Amsterdam', 'Berlin','Paris', 'Antwerp','Barcelona','Pisa']})
location name
0 Amsterdam Tom
1 Berlin Jim
2 Paris Jim
3 Antwerp Claus
4 Barcelona Claus
5 Pisa Claus
I first tried using apply but it's not possible to return multiple Series as far as I know. iterrows seems to be the trick. But the code below gives me an empty dataframe...
def duplicator(series):
if type(series['location']) == list:
for location in series['location']:
subSeries = series
subSeries['location'] = location
dfOut.append(subSeries)
else:
dfOut.append(series)
for index, row in dfIn.iterrows():
duplicator(row)
Not as much interesting/fancy pandas usage, but this works:
It's about 40x faster than the apply/stack/reindex approach. As far as I can tell, that ratio holds at pretty much all dataframe sizes (didn't test how it scales with the size of the lists in each row). If you can guarantee that all
location
entries are already iterables, you can remove theatleast_1d
call, which gives about another 20% speedup.If you return a series whose
index
is a list of locations, thendfIn.apply
will collate those series into a table:You can then stack this DataFrame to obtain:
This is a Series, while you want a DataFrame. A little massaging with
reset_index
gives you the desired result:yields