I wrote a web scraper to pull information from a table of products and build a dataframe. The data table has a Description column which contains a comma separated string of attributes describing the product. I want to create a column in the dataframe for every unique attribute and populate the row in that column with the attribute's substring. Example df below.
PRODUCTS   DATE       DESCRIPTION
Product A  2016-9-12  Steel, Red, High Hardness
Product B  2016-9-11  Blue, Lightweight, Steel
Product C  2016-9-12  Red
I figure the first step is to split the description into a list.
In: df2 = df['DESCRIPTION'].str.split(',')
Out:
DESCRIPTION
['Steel', 'Red', 'High Hardness']
['Blue', 'Lightweight', 'Steel']
['Red']
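For reference, the sample df can be built as below. Note that splitting on ', ' (comma plus space) rather than ',' keeps leading whitespace out of the attributes, which is what the output above assumes.

    import pandas as pd

    df = pd.DataFrame({
        'PRODUCTS': ['Product A', 'Product B', 'Product C'],
        'DATE': ['2016-9-12', '2016-9-11', '2016-9-12'],
        'DESCRIPTION': ['Steel, Red, High Hardness',
                        'Blue, Lightweight, Steel',
                        'Red'],
    })

    # Split on ", " so each attribute comes out without a leading space.
    df2 = df['DESCRIPTION'].str.split(', ')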
My desired output looks like the table below. The column names are not particularly important.
PRODUCTS   DATE       STEEL_COL  RED_COL  HIGH HARDNESS_COL  BLUE_COL  LIGHTWEIGHT_COL
Product A  2016-9-12  Steel      Red      High Hardness
Product B  2016-9-11  Steel                                  Blue      Lightweight
Product C  2016-9-12             Red
I believe the columns can be set up using a pivot, but I'm not sure of the most Pythonic way to populate them once they are established. Any help is appreciated.
UPDATE
Thank you very much for the answers. I selected @MaxU's response as correct since it seems slightly more flexible, but @piRSquared's gets a very similar result and may even be considered the more Pythonic approach. I tested both versions and both do what I needed. Thanks!
Here is my crack at a solution extended from a problem I was already working on.
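The exact code isn't reproduced here, but the core of it is an aggregate-and-unstack pipeline along these lines (a sketch, assuming the df from the question):

    # Reshape to long form, one attribute per row, then spread each
    # unique attribute into its own column.
    tall = (df.set_index(['PRODUCTS', 'DATE'])['DESCRIPTION']
              .str.split(', ', expand=True)   # one attribute per column
              .stack()                        # long form: one attribute per row
              .reset_index(name='attr'))

    result = (tall.groupby(['PRODUCTS', 'DATE', 'attr'])['attr']
                  .first()                    # aggregate duplicates, keep the string
                  .unstack()                  # one column per unique attribute
                  .reset_index())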
I also benchmarked @MaxU's, @piRSquared's, and my solutions on a pandas data frame with 11592 rows, and a column containing lists with 2681 unique values. Obviously the column names are different in the testing data frame but I have kept them the same as in the question.
Here are the benchmarks for each method:
1 loop, best of 3: 1.14 s per loop
1 loop, best of 3: 612 ms per loop
1 loop, best of 3: 62.7 ms per loop
My guess is that aggregation and unstacking is faster than either pivot_table() or pd.get_dummies().
You can build up a sparse matrix:
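The original code isn't shown here; one way to build such a matrix, assuming scikit-learn is available, is with MultiLabelBinarizer (a sketch of the idea, not necessarily the answer's exact code):

    import pandas as pd
    from sklearn.preprocessing import MultiLabelBinarizer

    # One sparse indicator column per unique attribute, 1 where present.
    mlb = MultiLabelBinarizer(sparse_output=True)
    matrix = mlb.fit_transform(df['DESCRIPTION'].str.split(', '))

    # Requires pandas >= 0.25 for DataFrame.sparse.from_spmatrix.
    result = df[['PRODUCTS', 'DATE']].join(
        pd.DataFrame.sparse.from_spmatrix(matrix, index=df.index,
                                          columns=mlb.classes_))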
The answers posted by @piRSquared and @MaxU work very well, but only when the data doesn't have any NaN values. The data I was working with was very sparse: it had around 1M rows, which got reduced to only some 100 rows after applying the above method, since it dropped all the rows with NaNs in any of the columns. It took me more than a day to figure out the fixes, so I'm sharing the slightly modified code to save others the time.

Supposing you have the df DataFrame as mentioned above, first replace all NaN occurrences with something that is not expected in any of the other columns, as you will have to change it back to NaN later. This is needed because groupby drops all rows with NaN values.

Then run what is suggested in the other answers with a minor modification: stack(dropna=False). By default, dropna=True.

Finally, put the NaNs back into df so as not to alter the data in the other columns. A sketch of the full pipeline with these fixes follows below. Hope this saves hours of frustration for someone.
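A sketch, assuming the df from the question; PLACEHOLDER is an arbitrary sentinel assumed not to occur anywhere in the real data:

    import numpy as np

    PLACEHOLDER = '<<NA>>'  # assumed absent from the real data

    # 1. Mask NaNs first so groupby doesn't silently drop those rows.
    df = df.fillna(PLACEHOLDER)

    # 2. Same reshape as the other answers, but with stack(dropna=False).
    tall = (df.set_index(['PRODUCTS', 'DATE'])['DESCRIPTION']
              .str.split(', ', expand=True)
              .stack(dropna=False)
              .reset_index(name='attr'))

    result = (tall.groupby(['PRODUCTS', 'DATE', 'attr'])['attr']
                  .first()
                  .unstack()
                  .reset_index())

    # 3. Put the NaNs back so the other columns are unchanged.
    result = result.replace(PLACEHOLDER, np.nan)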
How about something that places an 'X' in the feature column if the product has that feature?

The below creates a list of unique features ('Steel', 'Red', etc.), then creates a column for each feature in the original df. We then iterate through each row and, for each of the product's features, place an 'X' in the cell.

Updated with example output:
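The original code and output aren't reproduced here; a minimal sketch of the approach described above:

    # Collect every unique feature across all products.
    features = sorted({f.strip()
                       for desc in df['DESCRIPTION']
                       for f in desc.split(',')})

    # Create an empty column per feature, then mark each product's features.
    for feature in features:
        df[feature] = ''

    for idx, row in df.iterrows():
        for f in row['DESCRIPTION'].split(','):
            df.loc[idx, f.strip()] = 'X'

On the sample data this puts an 'X' under Steel, Red, and High Hardness for Product A; under Blue, Lightweight, and Steel for Product B; and under Red for Product C.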
Use pd.get_dummies:
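A sketch using the string-accessor variant of the same idea: Series.str.get_dummies splits the column on the given separator and returns one 0/1 indicator column per unique attribute.

    dummies = df['DESCRIPTION'].str.get_dummies(sep=', ')
    result = df[['PRODUCTS', 'DATE']].join(dummies)

    #     PRODUCTS       DATE  Blue  High Hardness  Lightweight  Red  Steel
    # 0  Product A  2016-9-12     0              1            0    1      1
    # 1  Product B  2016-9-11     1              0            1    0      1
    # 2  Product C  2016-9-12     0              0            0    1      0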