numpy unique could not filter out groups with the

2019-07-11 07:10发布

问题:

I tried to groupby a df and then select groups who do not have the same value on a specific column and whose group size > 1,

df.groupby(['account_no', 'ext_id', 'amount']).filter(lambda x: (len(x) > 1) & (np.unique(x.int_id).size != 1))

the df looks like, note that some account_no strings only have a single space, ext_id and int_id are also strings, amount is float;

account_no    ext_id    amount        int_id
              2665057   439.504062     D000192
              2665057   439.504062     D000192
              353724    2758.92        952
              353724    2758.92        952

the code supposed to return an empty df, since none of the rows in the sample satisfy the conditions here, but the rows with int_id = 292 remained, so how to fix the issue here.

ps. numpy 1.14.3, pandas 0.22.0, python 3.5.2

回答1:

In my opinion there is problem some traling whitespace or similar.

You can check it:

df = pd.DataFrame({'account_no': ['a', 'a', 'a', 'a'], 
                   'ext_id': [2665057, 2665057, 353724, 353724], 
                   'amount': [439.50406200000003, 439.50406200000003, 2758.92, 2758.92], 
                   'int_id': ['D000192', 'D000192', ' 952', '952']})
print (df)
  account_no       amount   ext_id   int_id
0          a   439.504062  2665057  D000192
1          a   439.504062  2665057  D000192
2          a  2758.920000   353724      952
3          a  2758.920000   353724      952

df1 = df.groupby(['account_no', 'ext_id', 'amount']).filter(lambda x: (len(x) > 1) & (np.unique(x.int_id).size != 1))
print (df1)
  account_no   amount  ext_id int_id
2          a  2758.92  353724    952
3          a  2758.92  353724    952

print (df1['int_id'].tolist())
[' 952', '952']

And then remove it by str.strip:

df['int_id'] = df['int_id'].str.strip()
df1 = df.groupby(['account_no', 'ext_id', 'amount']).filter(lambda x: (len(x) > 1) & (np.unique(x.int_id).size != 1))
print (df1)
Empty DataFrame
Columns: [account_no, amount, ext_id, int_id]
Index: []