I have a data frame A with observations
Var1 Var2 Var3
1 3 4
2 5 6
4 5 7
4 5 8
6 7 9
and data frame B with observations
Var1 Var2 Var3
1 3 4
2 5 6
which is basically a subset of A.
Now I want to select observations in A NOT in B, i.e, the data frame C with observations
Var1 Var2 Var3
4 5 7
4 5 8
6 7 9
Is there a way I can do this in R? The data frames I've used are just arbitrary data.
dplyr
has a nice anti_join
function that does exactly that:
> library(dplyr)
> anti_join(A, B)
Joining by: c("Var1", "Var2", "Var3")
Var1 Var2 Var3
1 6 7 9
2 4 5 8
3 4 5 7
Using sqldf is an option.
require(sqldf)
C <- sqldf('SELECT * FROM A EXCEPT SELECT * FROM B')
One approach could be to paste all the columns of A
and B
together, limiting to the rows in A
whose pasted representation doesn't appear in the pasted representation of B
:
A[!(do.call(paste, A) %in% do.call(paste, B)),]
# Var1 Var2 Var3
# 3 4 5 7
# 4 4 5 8
# 5 6 7 9
One obvious downside of this approach is that it assumes two rows with the same pasted representation are in fact identical. Here is a slightly more clunky approach that doesn't have this limitation:
combined <- rbind(B, A)
combined[!duplicated(combined) & seq_len(nrow(combined)) > length(B),]
# Var1 Var2 Var3
# 5 4 5 7
# 6 4 5 8
# 7 6 7 9
Basically I used rbind
to append A
below B
and then limited to rows that are both non-duplicated and that are not originally from B
.
Another option:
C <- rbind(A, B)
C[!(duplicated(C) | duplicated(C, fromLast = TRUE)), ]
Output:
Var1 Var2 Var3
3 4 5 7
4 4 5 8
5 6 7 9
Using data.table you could do an anti-join as follows:
library(data.table)
setDT(df1)[!df2, on = names(df1)]
which gives the desired result:
Var1 Var2 Var3
1: 4 5 7
2: 4 5 8
3: 6 7 9