Pandas: combine columns without duplicates / find unique values

Posted 2020-07-28 11:42

Question:

I have a dataframe where I would like to concatenate certain columns.

My issue is that the text in these columns may or may not contain duplicate information. I would like to strip out the duplicates in order to retain only the relevant information.

For example, if I had a data frame such as:

animals = pd.read_csv("animal.csv")

  animal1         animal2        label  
1 cat dog         dolphin        19
2 dog cat         cat            72
3 pilchard 26     koala          26
4 newt bat 81     bat            81

I want to combine the columns but retain only unique information from each of the strings.

You can see that in row 2, 'cat' is contained in both columns 'animal1' and 'animal2'. In row 3, the number 26 appears in both 'animal1' and 'label'. In row 4, the information in columns 'animal2' and 'label' is already contained, in order, in 'animal1'.

I combine the columns by doing the following

animals["detail"] = animals["animal1"].map(str) + animals["animal2"].map(str) + animals["label"].map(str)

  animal1         animal2        label        detail  
1 cat dog         dolphin        19           cat dog dolphin 19
2 dog cat         cat            72           dog cat cat 72
3 pilchard 26     koala          26           pilchard 26 koala 26
4 newt bat 81     bat            81           newt bat 81 bat 81

Row 1 is fine, but the other rows, of course, contain duplicates as described above.

The output I would desire is:

  animal1         animal2        label        detail  
1 cat dog         dolphin        19           cat dog dolphin 19
2 dog cat         cat            72           dog cat 72
3 pilchard 26     koala          26           pilchard koala 26
4 newt bat 81     bat            81           newt bat 81

or, if I could retain only the first unique instance of each word/number per row in the detail column, that would also be suitable, i.e.:

  detail 
1 cat dog dolphin 19
2 dog cat 72
3 pilchard koala 26
4 newt bat 81

I've had a look at doing this for a string in Python, e.g. How can I remove duplicate words in a string with Python?, How to get all the unique words in the data frame?, and show distinct column values in pyspark dataframe: python, but can't figure out how to apply this to individual rows within the detail column. I've also looked at splitting the text after I've combined the columns and then using apply and lambda, but haven't got this to work yet. Or is there perhaps a way to do it when combining the columns?

I have a solution in R but want to recode it in Python.

Would greatly appreciate any help or advice. I'm currently using Spyder (Python 3.5).

Answer 1:

You can use a custom function: first split by whitespace, then get the unique values with pandas.unique, and finally join back to a string:

animals["detail"] = animals["animal1"].map(str) + ' ' + 
                    animals["animal2"].map(str) + ' ' +
                    animals["label"].map(str)

animals["detail"] = animals["detail"].apply(lambda x: ' '.join(pd.unique(x.split())))
print (animals)
       animal1  animal2  label              detail
1      cat dog  dolphin     19  cat dog dolphin 19
2      dog cat      cat     72          dog cat 72
3  pilchard 26    koala     26   pilchard 26 koala
4  newt bat 81      bat     81         newt bat 81

It is also possible to join the values inside apply:

animals["detail"] = animals.astype(str)
                           .apply(lambda x: ' '.join(pd.unique(' '.join(x).split())),axis=1)
print (animals)
       animal1  animal2  label              detail
1      cat dog  dolphin     19  cat dog dolphin 19
2      dog cat      cat     72          dog cat 72
3  pilchard 26    koala     26   pilchard 26 koala
4  newt bat 81      bat     81         newt bat 81

A solution with set is also possible, but it changes the order:

animals["detail"] = animals.astype(str)
                           .apply(lambda x: ' '.join(set(' '.join(x).split())), axis=1)
print (animals)
       animal1  animal2  label              detail
1      cat dog  dolphin     19  cat dolphin 19 dog
2      dog cat      cat     72          cat dog 72
3  pilchard 26    koala     26   26 pilchard koala
4  newt bat 81      bat     81         bat 81 newt


Answer 2:

If you want to keep the order in which the words appear, you can first split the words in each column, merge them, remove duplicates, and finally concatenate them into a new column.

df['detail'] = (df.astype(str).T.apply(lambda x: x.str.split())
                  .apply(lambda x: ' '.join(pd.Series(sum(x, [])).drop_duplicates())))

df
Out[46]: 
         animal1   animal2   label                 detail
0      1 cat dog   dolphin       19  1 cat dog dolphin 19
1      2 dog cat       cat       72          2 dog cat 72
2  3 pilchard 26     koala       26   3 pilchard 26 koala
3  4 newt bat 81       bat       81         4 newt bat 81


Answer 3:

I'd suggest removing the duplicates at the end of the process by using a Python set.

Here is an example function to do so:

def dedup(value):
    words = set(value.split(' '))
    return ' '.join(words)

That works like this:

val = 'dog cat cat 81'
print(dedup(val))

81 dog cat

In case you want the detail ordered, you can use OrderedDict from collections or pd.unique instead of set.
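
For example, a minimal order-preserving sketch (the dedup_ordered helper name is just an illustration, not part of the original answer):

from collections import OrderedDict

def dedup_ordered(value):
    # OrderedDict.fromkeys keeps only the first occurrence of each word,
    # preserving the order in which the words first appear
    words = OrderedDict.fromkeys(value.split())
    return ' '.join(words)

print(dedup_ordered('dog cat cat 72'))

dog cat 72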

Then just apply the function (similar to map) on your detail column for the desired result:

animals.detail = animals.detail.apply(dedup)