Remove duplicate method for Python Pandas doesn't w

Posted 2020-03-30 05:08

I'm trying to remove duplicate rows based on the values in column 'new'. I have tried two methods, but df.shape is the same before and after, which means the duplicates are not being removed.

import pandas
import numpy as np
import random

df = pandas.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))

df['new'] = [1, 1, 3, 4, 5, 1, 7, 8, 1, 10]
df['new2'] = [1, 1, 2, 4, 5, 3, 7, 8, 9, 5]

print(df.shape)

# keep='first' is the newer spelling of take_last=False
df.drop_duplicates('new', keep='first')
df.groupby('new').max()

print(df.shape)

# output
(10, 6)
(10, 6)
[Finished in 1.0s]

1 answer
chillily
#2 · 2020-03-30 05:39

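You need to assign the result of drop_duplicates. By default inplace=False, so the method returns a modified copy of df and leaves the original untouched. Because you don't pass inplace=True and don't assign the result, your original df is unmodified. The same applies to groupby('new').max(): it returns a new object and never changes df.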

In [106]:

df = df.drop_duplicates('new', keep='first')
df.groupby('new').max()
Out[106]:
            A         B         C         D  new2
new                                              
1   -1.698741 -0.550839 -0.073692  0.618410     1
3    0.519596  1.686003  1.395585  1.298783     2
4    1.557550  1.249577  0.214546 -0.077569     4
5   -0.183454 -0.789351 -0.374092 -1.824240     5
7   -1.176468  0.546904  0.666383 -0.315945     7
8   -1.224640 -0.650131 -0.394125  0.765916     8
10  -1.045131  0.726485 -0.194906 -0.558927     5
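The assignment fix can be checked with a deterministic frame, so the shapes are reproducible. This is a minimal sketch, assuming a pandas release that accepts `keep='first'` (the replacement for the older `take_last=False`); the filler values in columns A to D and the name `deduped` are illustrative, not from the original post.

```python
import numpy as np
import pandas as pd

# Same 'new' column as in the question; A-D hold filler values (assumption).
df = pd.DataFrame(np.arange(40).reshape(10, 4), columns=list('ABCD'))
df['new'] = [1, 1, 3, 4, 5, 1, 7, 8, 1, 10]

# drop_duplicates returns a new DataFrame; df itself is left alone.
deduped = df.drop_duplicates('new', keep='first')

print(df.shape)       # (10, 5): the original still has every row
print(deduped.shape)  # (7, 5): one row per distinct value in 'new'
```

Keeping the result under a separate name makes it easy to compare before and after; reassigning to `df`, as in the answer, works the same way.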

Alternatively, pass inplace=True to modify df directly, with no assignment needed:

In [108]:

df.drop_duplicates('new', keep='first', inplace=True)
df.groupby('new').max()
Out[108]:
            A         B         C         D  new2
new                                              
1    0.334352 -0.355528  0.098418 -0.464126     1
3   -0.394350  0.662889 -1.012554 -0.004122     2
4   -0.288626  0.839906  1.335405  0.701339     4
5    0.973462 -0.818985  1.020348 -0.306149     5
7   -0.710495  0.580081  0.251572 -0.855066     7
8   -1.524862 -0.323492 -0.292751  1.395512     8
10  -1.164393  0.455825 -0.483537  1.357744     5
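One pitfall when combining the two approaches: as far as I know, in current pandas releases a call with inplace=True returns None, so assigning its result back to df would discard the frame. A small sketch under that assumption (the name `result` is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'new': [1, 1, 3, 4, 5, 1, 7, 8, 1, 10]})

# With inplace=True the method mutates df and (assumed here) returns None,
# so the return value must not be assigned back to df.
result = df.drop_duplicates('new', inplace=True)

print(result)    # None
print(df.shape)  # (7, 1)
```

In short, use either `df = df.drop_duplicates(...)` or `df.drop_duplicates(..., inplace=True)`, but not both together.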