How to remove rows in a dataframe where the order of values does not matter

Posted 2020-07-13 08:29

I have a dataframe like this:

source   target   weight
     1       2         5
     2       1         5
     1       2         5
     1       2         7
     3       1         6
     1       1         6
     1       3         6

My goal is to remove the duplicate rows, where the order of the source and target values does not matter: for example, (2, 1, 5) is a duplicate of (1, 2, 5) and should be removed. In this case, the expected result would be

source   target   weight
     1       2         5
     1       2         7
     3       1         6
     1       1         6

Is there any way to do this without loops?

2 Answers
狗以群分
#2 · 2020-07-13 08:35

Should be fairly easy.

import pandas as pd

data = [[1, 2, 5],
        [2, 1, 5],
        [1, 2, 5],
        [1, 2, 7],
        [3, 1, 6],
        [1, 1, 6],
        [1, 3, 6]]
df = pd.DataFrame(data, columns=['source', 'target', 'weight'])

You can drop the duplicates using drop_duplicates

print(df.drop_duplicates(keep=False))

would result in:

   source  target  weight
1       2       1       5
3       1       2       7
4       3       1       6
5       1       1       6
6       1       3       6
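Note that keep=False drops every copy of a duplicated row; the default keep='first' keeps one copy instead, which is closer to the expected output:

# keep='first' (the default) keeps one copy of each exact duplicate,
# so only the second (1, 2, 5) row (index 2) is dropped.
print(df.drop_duplicates())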

Notice that (2, 1, 5) survives, because drop_duplicates compares rows exactly. To handle the unordered source/target issue, sort each pair so that equivalent rows become identical:

def pair(row):
    # Sort the (source, target) pair so that e.g. (2, 1) becomes (1, 2),
    # making unordered duplicates identical row-for-row.
    sorted_pair = sorted([row['source'], row['target']])
    row['source'] = sorted_pair[0]
    row['target'] = sorted_pair[1]
    return row

df = df.apply(pair, axis=1)

and then you can use df.drop_duplicates(), which gives:

   source  target  weight
0       1       2       5
3       1       2       7
4       1       3       6
5       1       1       6
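As an aside, the same pair-sorting can be done without the row-wise apply. A minimal vectorized sketch, assuming numpy is available and df still holds the original data:

import numpy as np

# np.sort with axis=1 sorts each (source, target) pair, replacing the
# row-wise apply above; exact duplicates can then be dropped as before.
df[['source', 'target']] = np.sort(df[['source', 'target']].to_numpy(), axis=1)
print(df.drop_duplicates())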
萌系小妹纸
#3 · 2020-07-13 08:47

Use frozenset and duplicated

df[~df[['source', 'target']].apply(frozenset, axis=1).duplicated()]

   source  target  weight
0       1       2       5
4       3       1       6
5       1       1       6
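This works because a frozenset is hashable and order-insensitive, so duplicated can compare the pairs directly. A quick illustration:

# Order does not matter for frozensets, so (1, 2) and (2, 1) compare equal
print(frozenset([1, 2]) == frozenset([2, 1]))  # True
# A self-loop like (1, 1) collapses to the single-element frozenset({1})
print(frozenset([1, 1]))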

Note that this also drops the (1, 2, 7) row, because its pair {1, 2} has already been seen. If you want to account for the unordered source/target pair and the weight

df[~df[['weight']].assign(A=df[['source', 'target']].apply(frozenset, axis=1)).duplicated()]

   source  target  weight
0       1       2       5
3       1       2       7
4       3       1       6
5       1       1       6

However, you can be more explicit with more readable code:

# Create series where values are frozensets and therefore hashable.
# With hashable things, we can determine duplicity.
# Note that I also set the index and name to set up for a convenient `join`
s = pd.Series(list(map(frozenset, zip(df.source, df.target))), df.index, name='mixed')

# Use `drop` to focus on just those columns leaving whatever else is there.
# This is more general and accommodates more than just a `weight` column.
mask = df.drop(['source', 'target'], axis=1).join(s).duplicated()

df[~mask]

   source  target  weight
0       1       2       5
3       1       2       7
4       3       1       6
5       1       1       6
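If you need this in several places, the approach above folds naturally into a small helper. The function below is my own sketch, not part of pandas:

def drop_unordered_duplicates(frame, pair_cols=('source', 'target')):
    # Hypothetical helper: treat the columns in pair_cols as an unordered
    # pair and deduplicate on that pair plus all remaining columns.
    key = frame[list(pair_cols)].apply(frozenset, axis=1)
    mask = frame.drop(list(pair_cols), axis=1).assign(_pair=key).duplicated()
    return frame[~mask]

print(drop_unordered_duplicates(df))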