Question:
This question already has an answer here: Select equivalent rows [A-B & B-A] [duplicate] (1 answer)
This seems like a simple problem, but I can't seem to figure it out. I'd like to remove duplicates from a data frame (df) when two columns hold the same pair of values, even if those values appear in reverse order. For example, say you have the following data frame:
a <- c(rep("A", 3), rep("B", 3), rep("C", 2))
b <- c('A', 'B', 'B', 'C', 'A', 'A', 'B', 'B')
df <- data.frame(a, b)
a b
1 A A
2 A B
3 A B
4 B C
5 B A
6 B A
7 C B
8 C B
If I now remove duplicates, I get the following data frame:
df[duplicated(df),]
a b
3 A B
6 B A
8 C B
However, I would also like to remove row 6 in this data frame, since "A", "B" is the same as "B", "A". How can I do this automatically?
Ideally I could specify which two columns to compare since the data frames could have varying columns and can be quite large.
Thanks!
Answer 1:
One solution is to first sort each row of df:
for (i in 1:nrow(df))
{
df[i, ] = sort(df[i, ])
}
df
a b
1 A A
2 A B
3 A B
4 B C
5 A B
6 A B
7 B C
8 B C
At that point it's just a matter of removing the duplicated elements:
df = df[!duplicated(df),]
df
a b
1 A A
2 A B
4 B C
As thelatemail mentioned in the comments, your code actually keeps the duplicates. You need to use !duplicated to remove them.
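To make the difference concrete, here is a minimal sketch using the question's data. The variable names dup, repeats and firsts are only illustrative, and stringsAsFactors = FALSE is set explicitly so the result should not depend on the R version:

```r
# Same example data as in the question.
df <- data.frame(a = c(rep("A", 3), rep("B", 3), rep("C", 2)),
                 b = c("A", "B", "B", "C", "A", "A", "B", "B"),
                 stringsAsFactors = FALSE)

dup <- duplicated(df)   # TRUE for every row that repeats an earlier row

repeats <- df[dup, ]    # only the repeated rows (what the question's code returns)
firsts  <- df[!dup, ]   # the first occurrence of each distinct row

rownames(repeats)  # "3" "6" "8"
rownames(firsts)   # "1" "2" "4" "5" "7"
```

Note that rows 2 and 5 both survive here because no row has been sorted yet, which is why the sorting step comes first.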
Answer 2:
Extending Ari's answer, you can specify which columns to check when the data frame also contains other columns:
a <- c(rep("A", 3), rep("B", 3), rep("C", 2))
b <- c('A', 'B', 'B', 'C', 'A', 'A', 'B', 'B')
df <- data.frame(a, b)
df$c = sample(1:10,8)
df$d = sample(LETTERS,8)
df
a b c d
1 A A 10 B
2 A B 8 S
3 A B 7 J
4 B C 3 Q
5 B A 2 I
6 B A 6 U
7 C B 4 L
8 C B 5 V
cols = c(1,2)
newdf = df[,cols]
for (i in 1:nrow(df)){
newdf[i, ] = sort(df[i,cols])
}
df[!duplicated(newdf),]
a b c d
1 A A 10 B
2 A B 8 S
4 B C 3 Q
Answer 3:
The other answers use a for loop to assign a value for each and every row. While this is not an issue if you have 100 rows, or even a thousand, you're going to be waiting a while if you have large data on the order of 1M rows.
Borrowing from the other linked answer, which uses data.table, you could try something like:
df[!duplicated(data.frame(list(do.call(pmin,df),do.call(pmax,df)))),]
A comparison benchmark with a larger dataset (df2):
df2 <- df[sample(1:nrow(df),50000,replace=TRUE),]
system.time(
df2[!duplicated(data.frame(list(do.call(pmin,df2),do.call(pmax,df2)))),]
)
# user system elapsed
# 0.07 0.00 0.06
system.time({
for (i in 1:nrow(df2))
{
df2[i, ] = sort(df2[i, ])
}
df2[!duplicated(df2),]
}
)
# user system elapsed
# 42.07 0.02 42.09
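Since the question asks for a way to pick the two columns to compare, the pmin/pmax idea can be wrapped up roughly as follows. This is only a sketch: dedupe_unordered is a made-up helper name, the extra column c is invented for the example, and it assumes the two columns can sensibly be compared as character strings:

```r
# Hypothetical helper: drop rows whose (col1, col2) pair repeats an earlier
# row, treating (x, y) and (y, x) as the same pair.
dedupe_unordered <- function(data, col1, col2) {
  # as.character() so that pmin()/pmax() also work when the columns are factors
  x <- as.character(data[[col1]])
  y <- as.character(data[[col2]])
  # Order-independent key: smaller value first, larger value second
  key <- data.frame(lo = pmin(x, y), hi = pmax(x, y), stringsAsFactors = FALSE)
  data[!duplicated(key), , drop = FALSE]
}

# Example data from the question plus one extra (invented) column.
df <- data.frame(a = c(rep("A", 3), rep("B", 3), rep("C", 2)),
                 b = c("A", "B", "B", "C", "A", "A", "B", "B"),
                 c = 1:8,
                 stringsAsFactors = FALSE)

res <- dedupe_unordered(df, "a", "b")
rownames(res)  # "1" "2" "4"
```

Only the two named columns go into the key, so any other columns are carried along untouched and do not affect which rows count as duplicates.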
Answer 4:
Using apply is a better option than writing explicit loops.
newDf <- data.frame(t(apply(df,1,sort)))
All you need to do now is remove duplicates.
newDf <- newDf[!duplicated(newDf),]
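One thing to keep in mind is that newDf holds the sorted values rather than the original rows. If you want to keep the original rows (and any other columns), a possible variation is to use the sorted matrix only as a lookup key. This is a sketch under a few assumptions: the names key, cols and result and the extra column c are illustrative, and the compared columns contain no missing values:

```r
# Example data from the question plus one extra (invented) column.
df <- data.frame(a = c(rep("A", 3), rep("B", 3), rep("C", 2)),
                 b = c("A", "B", "B", "C", "A", "A", "B", "B"),
                 c = 1:8,
                 stringsAsFactors = FALSE)

cols <- c("a", "b")  # the two columns to compare

# apply() sorts the values within each row; t() turns the result back into
# one row per original row. Assumes no NA values, since sort() drops them.
key <- t(apply(df[, cols], 1, sort))

# duplicated() on a matrix compares whole rows, so it can index df directly.
result <- df[!duplicated(key), ]
rownames(result)  # "1" "2" "4"
```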