I tried to groupby
a df
and then select groups who do not have the same value on a specific column and whose group size > 1,
df.groupby(['account_no', 'ext_id', 'amount']).filter(lambda x: (len(x) > 1) & (np.unique(x.int_id).size != 1))
the df
looks like, note that some account_no
strings only have a single space, ext_id
and int_id
are also strings, amount
is float
;
account_no ext_id amount int_id
2665057 439.504062 D000192
2665057 439.504062 D000192
353724 2758.92 952
353724 2758.92 952
the code supposed to return an empty df
, since none of the rows in the sample satisfy the conditions here, but the rows with int_id = 292
remained, so how to fix the issue here.
ps. numpy 1.14.3
, pandas 0.22.0
, python 3.5.2
In my opinion there is problem some traling whitespace or similar.
You can check it:
df = pd.DataFrame({'account_no': ['a', 'a', 'a', 'a'],
'ext_id': [2665057, 2665057, 353724, 353724],
'amount': [439.50406200000003, 439.50406200000003, 2758.92, 2758.92],
'int_id': ['D000192', 'D000192', ' 952', '952']})
print (df)
account_no amount ext_id int_id
0 a 439.504062 2665057 D000192
1 a 439.504062 2665057 D000192
2 a 2758.920000 353724 952
3 a 2758.920000 353724 952
df1 = df.groupby(['account_no', 'ext_id', 'amount']).filter(lambda x: (len(x) > 1) & (np.unique(x.int_id).size != 1))
print (df1)
account_no amount ext_id int_id
2 a 2758.92 353724 952
3 a 2758.92 353724 952
print (df1['int_id'].tolist())
[' 952', '952']
And then remove it by str.strip
:
df['int_id'] = df['int_id'].str.strip()
df1 = df.groupby(['account_no', 'ext_id', 'amount']).filter(lambda x: (len(x) > 1) & (np.unique(x.int_id).size != 1))
print (df1)
Empty DataFrame
Columns: [account_no, amount, ext_id, int_id]
Index: []