Question:
I'm looking to turn a pandas cell containing a list into rows for each of those values.
So, take this:
![\"enter](\"https://i.stack.imgur.com/j7lFk.png\")
If I\'d like to unpack and stack the values in the \'nearest_neighbors\" column so that each value would be a row within each \'opponent\' index, how would I best go about this? Are there pandas methods that are meant for operations like this? I\'m just not aware.
Thanks in advance, guys.
Answer 1:
In the code below, I first reset the index to make the row iteration easier.
I create a list of lists where each element of the outer list is a row of the target DataFrame and each element of the inner list is one of the columns. This nested list will ultimately be concatenated to create the desired DataFrame.
I use a lambda function together with list iteration to create a row for each element of nearest_neighbors paired with the relevant name and opponent.
Finally, I create a new DataFrame from this list (using the original column names and setting the index back to name and opponent).
import pandas as pd

df = (pd.DataFrame({'name': ['A.J. Price'] * 3,
                    'opponent': ['76ers', 'blazers', 'bobcats'],
                    'nearest_neighbors': [['Zach LaVine', 'Jeremy Lin', 'Nate Robinson', 'Isaia']] * 3})
      .set_index(['name', 'opponent']))
>>> df
nearest_neighbors
name opponent
A.J. Price 76ers [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
blazers [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
bobcats [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
df.reset_index(inplace=True)
rows = []
_ = df.apply(lambda row: [rows.append([row['name'], row['opponent'], nn])
                          for nn in row.nearest_neighbors], axis=1)
df_new = pd.DataFrame(rows, columns=df.columns).set_index(['name', 'opponent'])
>>> df_new
nearest_neighbors
name opponent
A.J. Price 76ers Zach LaVine
76ers Jeremy Lin
76ers Nate Robinson
76ers Isaia
blazers Zach LaVine
blazers Jeremy Lin
blazers Nate Robinson
blazers Isaia
bobcats Zach LaVine
bobcats Jeremy Lin
bobcats Nate Robinson
bobcats Isaia
EDIT JUNE 2017
An alternative method is as follows:
>>> (pd.melt(df.nearest_neighbors.apply(pd.Series).reset_index(),
             id_vars=['name', 'opponent'],
             value_name='nearest_neighbors')
       .set_index(['name', 'opponent'])
       .drop('variable', axis=1)
       .dropna()
       .sort_index()
     )
Answer 2:
Use apply(pd.Series) and stack, then reset_index and to_frame:
In [1803]: (df.nearest_neighbors.apply(pd.Series)
                .stack()
                .reset_index(level=2, drop=True)
                .to_frame('nearest_neighbors'))
Out[1803]:
nearest_neighbors
name opponent
A.J. Price 76ers Zach LaVine
76ers Jeremy Lin
76ers Nate Robinson
76ers Isaia
blazers Zach LaVine
blazers Jeremy Lin
blazers Nate Robinson
blazers Isaia
bobcats Zach LaVine
bobcats Jeremy Lin
bobcats Nate Robinson
bobcats Isaia
Details
In [1804]: df
Out[1804]:
nearest_neighbors
name opponent
A.J. Price 76ers [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
blazers [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
bobcats [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
Answer 3:
I think this is a really good question; in Hive you would use EXPLODE, and I think there is a case to be made that Pandas should include this functionality by default. I would probably explode the list column with a nested generator comprehension like this:
pd.DataFrame({
    "name": i[0],
    "opponent": i[1],
    "nearest_neighbor": neighbour
}
    for i, row in df.iterrows() for neighbour in row.nearest_neighbors
).set_index(["name", "opponent"])
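As an aside, recent pandas versions do include this by default: DataFrame.explode was added in pandas 0.25, so on a current install a one-liner gives the same result (df here being the MultiIndexed frame from the first answer):
# Built-in since pandas 0.25
df.explode('nearest_neighbors')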
Answer 4:
A nicer alternative solution with apply(pd.Series):
df = pd.DataFrame({'listcol': [[1, 2, 3], [4, 5, 6]]})
# expand df.listcol into its own dataframe
tags = df['listcol'].apply(pd.Series)
# rename each column with a 'listcol_' prefix
tags = tags.rename(columns=lambda x: 'listcol_' + str(x))
# join the tags dataframe back to the original dataframe
df = pd.concat([df[:], tags[:]], axis=1)
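Note that this yields one new column per list position (wide form) rather than extra rows; the result should look roughly like this:
>>> df
     listcol  listcol_0  listcol_1  listcol_2
0  [1, 2, 3]          1          2          3
1  [4, 5, 6]          4          5          6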
Answer 5:
The fastest method I found so far is extending the DataFrame with .iloc
and assigning back the flattened target column.
Given the usual input (replicated a bit):
df = (pd.DataFrame({'name': ['A.J. Price'] * 3,
                    'opponent': ['76ers', 'blazers', 'bobcats'],
                    'nearest_neighbors': [['Zach LaVine', 'Jeremy Lin', 'Nate Robinson', 'Isaia']] * 3})
      .set_index(['name', 'opponent']))
df = pd.concat([df]*10)
df
Out[3]:
nearest_neighbors
name opponent
A.J. Price 76ers [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
blazers [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
bobcats [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
76ers [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
blazers [Zach LaVine, Jeremy Lin, Nate Robinson, Isaia]
...
Given the following suggested alternatives:
import numpy as np

col_target = 'nearest_neighbors'
def extend_iloc():
    # Flatten columns of lists
    col_flat = [item for sublist in df[col_target] for item in sublist]
    # Row numbers to repeat
    lens = df[col_target].apply(len)
    vals = range(df.shape[0])
    ilocations = np.repeat(vals, lens)
    # Replicate rows and add flattened column of lists
    cols = [i for i, c in enumerate(df.columns) if c != col_target]
    new_df = df.iloc[ilocations, cols].copy()
    new_df[col_target] = col_flat
    return new_df
def melt():
    return (pd.melt(df[col_target].apply(pd.Series).reset_index(),
                    id_vars=['name', 'opponent'],
                    value_name=col_target)
              .set_index(['name', 'opponent'])
              .drop('variable', axis=1)
              .dropna()
              .sort_index())
def stack_unstack():
    return (df[col_target].apply(pd.Series)
              .stack()
              .reset_index(level=2, drop=True)
              .to_frame(col_target))
I find that extend_iloc() is the fastest:
%timeit extend_iloc()
3.11 ms ± 544 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit melt()
22.5 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit stack_unstack()
11.5 ms ± 410 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Answer 6:
Similar to Hive's EXPLODE functionality:
import copy
import pandas
def pandas_explode(df, column_to_explode):
    """
    Similar to Hive's EXPLODE function: take a column with iterable elements, and flatten the iterable to one element
    per observation in the output table.

    :param df: A dataframe to explode
    :type df: pandas.DataFrame
    :param column_to_explode: Name of the column whose iterable values should be flattened
    :type column_to_explode: str
    :return: An exploded data frame
    :rtype: pandas.DataFrame
    """
    # Create a list of new observations
    new_observations = list()

    # Iterate through existing observations
    for row in df.to_dict(orient='records'):

        # Take out the exploding iterable
        explode_values = row[column_to_explode]
        del row[column_to_explode]

        # Create a new observation for every entry in the exploding iterable & add all of the other columns
        for explode_value in explode_values:

            # Deep copy existing observation
            new_observation = copy.deepcopy(row)

            # Add one (newly flattened) value from exploding iterable
            new_observation[column_to_explode] = explode_value

            # Add to the list of new observations
            new_observations.append(new_observation)

    # Create a DataFrame
    return_df = pandas.DataFrame(new_observations)

    # Return
    return return_df
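A minimal usage sketch on the question's data, assuming df is the MultiIndexed frame from the earlier answers (its index is reset first, since the helper walks plain records):
# Sketch only: reset the index so 'name' and 'opponent' become ordinary columns
exploded = pandas_explode(df.reset_index(), 'nearest_neighbors')
exploded = exploded.set_index(['name', 'opponent'])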
Answer 7:
Here is a potential optimization for larger dataframes. This runs faster when there are several equal values in the "exploding" field. (The larger the dataframe is compared to the unique value count in the field, the better this code will perform.)
def lateral_explode(dataframe, fieldname):
    temp_fieldname = fieldname + '_made_tuple_'
    dataframe[temp_fieldname] = dataframe[fieldname].apply(tuple)
    list_of_dataframes = []
    for values in dataframe[temp_fieldname].unique().tolist():
        list_of_dataframes.append(pd.DataFrame({
            temp_fieldname: [values] * len(values),
            fieldname: list(values),
        }))
    dataframe = dataframe[list(set(dataframe.columns) - set([fieldname]))]\
        .merge(pd.concat(list_of_dataframes), how='left', on=temp_fieldname)
    del dataframe[temp_fieldname]
    return dataframe
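Again a brief usage sketch, assuming the same df as above with its index reset so that 'nearest_neighbors' is an ordinary column:
# Sketch only: the helper expects the list field as a regular column
exploded = lateral_explode(df.reset_index(), 'nearest_neighbors')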
Answer 8:
Extending Oleg's .iloc answer to automatically flatten all list-columns:
import numpy as np

def extend_iloc(df):
    cols_to_flatten = [colname for colname in df.columns if
                       isinstance(df.iloc[0][colname], list)]
    # Row numbers to repeat
    lens = df[cols_to_flatten[0]].apply(len)
    vals = range(df.shape[0])
    ilocations = np.repeat(vals, lens)
    # Replicate rows, keeping only the non-list columns for now
    with_idxs = [(i, c) for (i, c) in enumerate(df.columns) if c not in cols_to_flatten]
    col_idxs = [i for i, c in with_idxs]
    new_df = df.iloc[ilocations, col_idxs].copy()
    # Flatten columns of lists
    for col_target in cols_to_flatten:
        col_flat = [item for sublist in df[col_target] for item in sublist]
        new_df[col_target] = col_flat
    return new_df
This assumes that all list-columns have the same list length within each row.
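A small usage sketch with a hypothetical frame containing two list-columns whose per-row lengths match:
demo = pd.DataFrame({'id': [1, 2],
                     'letters': [['a', 'b'], ['c']],
                     'numbers': [[10, 20], [30]]})
extend_iloc(demo)   # three rows: 'id' repeated once per list element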
Answer 9:
So all of these answers are good, but I wanted something really simple, so here's my contribution:
def explode(series):
    return pd.Series([x for _list in series for x in _list])
That's it... just use this when you want a new Series where the lists are 'exploded'. Here's an example where we do value_counts() on taco choices :)
In [1]: my_df = pd.DataFrame(pd.Series([['a', 'b', 'c'], ['b', 'c'], ['c']]), columns=['tacos'])
In [2]: my_df.head()
Out[2]:
tacos
0 [a, b, c]
1 [b, c]
2 [c]
In [3]: explode(my_df['tacos']).value_counts()
Out[3]:
c 3
b 2
a 1