How to remove rows in a dataframe where the order of values does not matter

Posted 2020-07-13 08:29

I have a dataframe like this:

source   target   weight
     1       2         5
     2       1         5
     1       2         5
     1       2         7
     3       1         6
     1       1         6
     1       3         6

My goal is to remove the duplicate rows, where the order of the source and target values does not matter: for example, (2, 1, 5) is a duplicate of (1, 2, 5) and should be removed. In this case, the expected result would be

source   target   weight
     1       2         5
     1       2         7
     3       1         6
     1       1         6

Is there any way to do this without loops?

2 Answers
狗以群分
#2 · 2020-07-13 08:35

Should be fairly easy.

import pandas as pd

data = [[1, 2, 5],
        [2, 1, 5],
        [1, 2, 5],
        [1, 2, 7],
        [3, 1, 6],
        [1, 1, 6],
        [1, 3, 6]]
df = pd.DataFrame(data, columns=['source', 'target', 'weight'])

You can drop the duplicates using drop_duplicates

print(df.drop_duplicates(keep=False))

would result in:

   source  target  weight
1       2       1       5
3       1       2       7
4       3       1       6
5       1       1       6
6       1       3       6
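Note that keep=False drops every copy of a duplicated row; the default keep='first' keeps one copy instead, which is closer to the expected output:

# keep='first' (the default) keeps one copy of each exact duplicate,
# so only the second (1, 2, 5) row (index 2) is dropped.
print(df.drop_duplicates())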

Notice that (2, 1, 5) survives, because drop_duplicates compares rows exactly. To handle the unordered source/target issue, sort each pair so that equivalent rows become identical:

def pair(row):
    # Sort the (source, target) pair so that e.g. (2, 1) becomes (1, 2),
    # making unordered duplicates identical row-for-row.
    sorted_pair = sorted([row['source'], row['target']])
    row['source'] = sorted_pair[0]
    row['target'] = sorted_pair[1]
    return row

df = df.apply(pair, axis=1)

and then you can use df.drop_duplicates(), which gives:

   source  target  weight
0       1       2       5
3       1       2       7
4       1       3       6
5       1       1       6
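As an aside, the same pair-sorting can be done without the row-wise apply. A minimal vectorized sketch, assuming numpy is available and df still holds the original data:

import numpy as np

# np.sort with axis=1 sorts each (source, target) pair, replacing the
# row-wise apply above; exact duplicates can then be dropped as before.
df[['source', 'target']] = np.sort(df[['source', 'target']].to_numpy(), axis=1)
print(df.drop_duplicates())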
萌系小妹纸
#3 · 2020-07-13 08:47

Use frozenset and duplicated

df[~df[['source', 'target']].apply(frozenset, axis=1).duplicated()]

   source  target  weight
0       1       2       5
4       3       1       6
5       1       1       6
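This works because a frozenset is hashable and order-insensitive, so duplicated can compare the pairs directly. A quick illustration:

# Order does not matter for frozensets, so (1, 2) and (2, 1) compare equal
print(frozenset([1, 2]) == frozenset([2, 1]))  # True
# A self-loop like (1, 1) collapses to the single-element frozenset({1})
print(frozenset([1, 1]))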

Note that this also drops the (1, 2, 7) row, because its pair {1, 2} has already been seen. If you want to account for the unordered source/target pair and the weight

df[~df[['weight']].assign(A=df[['source', 'target']].apply(frozenset, axis=1)).duplicated()]

   source  target  weight
0       1       2       5
3       1       2       7
4       3       1       6
5       1       1       6

However, you can be more explicit with more readable code:

# Create series where values are frozensets and therefore hashable.
# With hashable things, we can determine duplicity.
# Note that I also set the index and name to set up for a convenient `join`
s = pd.Series(list(map(frozenset, zip(df.source, df.target))), df.index, name='mixed')

# Use `drop` to focus on just those columns leaving whatever else is there.
# This is more general and accommodates more than just a `weight` column.
mask = df.drop(['source', 'target'], axis=1).join(s).duplicated()

df[~mask]

   source  target  weight
0       1       2       5
3       1       2       7
4       3       1       6
5       1       1       6
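If you need this in several places, the approach above folds naturally into a small helper. The function below is my own sketch, not part of pandas:

def drop_unordered_duplicates(frame, pair_cols=('source', 'target')):
    # Hypothetical helper: treat the columns in pair_cols as an unordered
    # pair and deduplicate on that pair plus all remaining columns.
    key = frame[list(pair_cols)].apply(frozenset, axis=1)
    mask = frame.drop(list(pair_cols), axis=1).assign(_pair=key).duplicated()
    return frame[~mask]

print(drop_unordered_duplicates(df))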