Subsetting a data frame to the rows not appearing

2019-02-27 20:30发布

问题:

I have a data frame A with observations

    Var1   Var2  Var3
     1       3    4
     2       5    6
     4       5    7
     4       5    8
     6       7    9

and data frame B with observations

    Var1   Var2  Var3
     1       3    4
     2       5    6

which is basically a subset of A. Now I want to select observations in A NOT in B, i.e, the data frame C with observations

    Var1   Var2  Var3
     4       5    7
     4       5    8
     6       7    9

Is there a way I can do this in R? The data frames I've used are just arbitrary data.

回答1:

dplyr has a nice anti_join function that does exactly that:

> library(dplyr)
> anti_join(A, B)
Joining by: c("Var1", "Var2", "Var3")
  Var1 Var2 Var3
1    6    7    9
2    4    5    8
3    4    5    7


回答2:

Using sqldf is an option.

require(sqldf)

C <- sqldf('SELECT * FROM A EXCEPT SELECT * FROM B')


回答3:

One approach could be to paste all the columns of A and B together, limiting to the rows in A whose pasted representation doesn't appear in the pasted representation of B:

A[!(do.call(paste, A) %in% do.call(paste, B)),]
#   Var1 Var2 Var3
# 3    4    5    7
# 4    4    5    8
# 5    6    7    9

One obvious downside of this approach is that it assumes two rows with the same pasted representation are in fact identical. Here is a slightly more clunky approach that doesn't have this limitation:

combined <- rbind(B, A)
combined[!duplicated(combined) & seq_len(nrow(combined)) > length(B),]
#   Var1 Var2 Var3
# 5    4    5    7
# 6    4    5    8
# 7    6    7    9

Basically I used rbind to append A below B and then limited to rows that are both non-duplicated and that are not originally from B.



回答4:

Another option:

C <- rbind(A, B)
C[!(duplicated(C) | duplicated(C, fromLast = TRUE)), ]

Output:

  Var1 Var2 Var3
3    4    5    7
4    4    5    8
5    6    7    9


回答5:

Using data.table you could do an anti-join as follows:

library(data.table)
setDT(df1)[!df2, on = names(df1)]

which gives the desired result:

   Var1 Var2 Var3
1:    4    5    7
2:    4    5    8
3:    6    7    9