I have a dataframe where I would like to concatenate certain columns.
My issue is that the text in these columns may or may not contain duplicate information. I would like to strip out the duplicates in order to retain only the relevant information.
For example, if I had a data frame such as:
import pandas as pd
animals = pd.read_csv("animal.csv")
  animal1      animal2      label
1 cat          dog dolphin  19
2 dog cat      cat          72
3 pilchard 26  koala        26
4 newt bat 81  bat          81
I want to combine the columns but retain only unique information from each of the strings.
You can see that in row 2, 'cat' appears in both the 'animal1' and 'animal2' columns. In row 3, the number 26 appears in both 'animal1' and 'label'. In row 4, the information in 'animal2' and 'label' is already contained, in order, in 'animal1'.
I combine the columns by doing the following:
animals["detail"] = animals["animal1"].map(str) + " " + animals["animal2"].map(str) + " " + animals["label"].map(str)
  animal1      animal2      label  detail
1 cat          dog dolphin  19     cat dog dolphin 19
2 dog cat      cat          72     dog cat cat 72
3 pilchard 26  koala        26     pilchard 26 koala 26
4 newt bat 81  bat          81     newt bat 81 bat 81
Row 1 is fine, but the other rows, of course, contain duplicates as described above.
The output I would desire is:
  animal1      animal2      label  detail
1 cat          dog dolphin  19     cat dog dolphin 19
2 dog cat      cat          72     dog cat 72
3 pilchard 26  koala        26     pilchard koala 26
4 newt bat 81  bat          81     newt bat 81
or, if I could retain only the first unique instance of each word/number per row in the detail column, this would also be suitable, i.e.:
detail
1 cat dog dolphin 19
2 dog cat 72
3 pilchard koala 26
4 newt bat 81
I've had a look at doing this for a string in Python (e.g. "How can I remove duplicate words in a string with Python?", "How to get all the unique words in the data frame?", "show distinct column values in pyspark dataframe: python"), but I can't figure out how to apply this to individual rows within the detail column. I've looked at splitting the text after I've combined the columns and then using apply and lambda, but haven't got this working yet. Or is there perhaps a way to do it when combining the columns?
I have the solution in R but want to recode it in Python.
Would greatly appreciate any help or advice. I'm currently using Spyder (Python 3.5).
I'd suggest removing the duplicates at the end of the process by using a Python set. Here is an example function to do so:
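(A minimal sketch; the name remove_duplicates is just illustrative.)

def remove_duplicates(value):
    # turn the cell into a string, split on whitespace, and keep a single
    # copy of each word; note that set does not preserve the word order
    return ' '.join(set(str(value).split()))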
That works like this:
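(Example call; the exact word order may vary because set is unordered.)

remove_duplicates('dog cat cat 72')
# -> 'cat dog 72', with the words in arbitrary order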
In case you want the detail ordered, you can use OrderedDict from collections or pd.unique instead of set. Then just apply it (similar to map) to your detail column for the desired result:
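(A sketch of the ordered variant, assuming the animals dataframe and the detail column built in the question.)

from collections import OrderedDict

def remove_duplicates_ordered(value):
    # OrderedDict.fromkeys keeps only the first occurrence of each word
    # and preserves the order in which the words appear
    return ' '.join(OrderedDict.fromkeys(str(value).split()))

animals['detail'] = animals['detail'].apply(remove_duplicates_ordered)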
You can add a custom function which first splits by whitespace, then gets the unique values with pandas.unique and finally joins them back into a string:
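(Roughly like this, assuming the column names animal1, animal2 and label from the question.)

import pandas as pd

def unique_words(text):
    # split on whitespace, drop duplicates with pd.unique (which keeps the
    # first occurrence of each word, in order), then join back to a string
    return ' '.join(pd.unique(text.split()))

animals['detail'] = (animals['animal1'].astype(str) + ' ' +
                     animals['animal2'].astype(str) + ' ' +
                     animals['label'].astype(str)).apply(unique_words)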
It is also possible to join the values inside apply:
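(For instance, row-wise with axis=1; again the column names are those from the question.)

import pandas as pd

cols = ['animal1', 'animal2', 'label']
animals['detail'] = animals[cols].astype(str).apply(
    lambda row: ' '.join(pd.unique(' '.join(row).split())), axis=1)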
Solution with set, but it changes the order:
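(Along these lines, with the same columns as above.)

cols = ['animal1', 'animal2', 'label']
animals['detail'] = animals[cols].astype(str).apply(
    lambda row: ' '.join(set(' '.join(row).split())), axis=1)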
If you want to keep the order of appearance of the words, you can first split the words in each column, merge them, remove the duplicates and finally concat them together into a new column.
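(A sketch of that approach; the helper name merge_unique is my own.)

def merge_unique(row):
    # split every cell into words and merge them in column order
    words = []
    for cell in row.astype(str):
        words.extend(cell.split())
    # drop duplicates while keeping the first appearance of each word
    seen = []
    for word in words:
        if word not in seen:
            seen.append(word)
    return ' '.join(seen)

animals['detail'] = animals[['animal1', 'animal2', 'label']].apply(merge_unique, axis=1)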