Minus operation of data frames

I have 2 data frames df1 and df2.

df1 <- data.frame(c1=c("a","b","c","d"),c2=c(1,2,3,4) )
df2 <- data.frame(c1=c("c","d","e","f"),c2=c(3,4,5,6) )

> df1
  c1 c2
1  a  1
2  b  2
3  c  3
4  d  4

> df2
  c1 c2
1  c  3
2  d  4
3  e  5
4  f  6

I need to perform set operation of these 2 data frames. I used merge(df1,df2,all=TRUE) and merge(df1,df2,all=FALSE) method to get the union and intersection of these data frames and got the required output. What is the function to get the minus of these data frames,that is all the positions existing on one data frame but not the other? I need the following output.

 c1 c2
1  a  1
2  b  2

标签： r set dataframe

7条回答

做个烂人

2楼-- · 2019-01-23 14:20

One issue with https://stackoverflow.com/a/16144262/2055486 is it assumes neither data frame already has duplicated rows. The following function removes that limitation and also works with arbitrary user defined columns in x or y.

The implementation uses a similar idea to the implementation of duplicated.data.frame in concatenating the columns together with a separator. duplicated.data.frame uses "\r", which can cause collisions if the entries have embedded "\r" characters. This uses the ASCII record separator "\30" which will have a much lower chance of appearing in input data.

setdiff.data.frame <- function(x, y,
    by = intersect(names(x), names(y)),
    by.x = by, by.y = by) {
  stopifnot(
    is.data.frame(x),
    is.data.frame(y),
    length(by.x) == length(by.y))

  !do.call(paste, c(x[by.x], sep = "\30")) %in% do.call(paste, c(y[by.y], sep = "\30"))
}

# Example usage
# remove all 4 or 6 cylinder 4 gear cars or 8 cylinder 3 gear rows
to_remove <- data.frame(cyl = c(4, 6, 8), gear = c(4, 4, 3))
mtcars[setdiff.data.frame(mtcars, to_remove), ]
#>                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
#> Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
#> Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
#> Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
#> Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
#> Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
#> Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
#> Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8

# with differing column names
to_remove2 <- data.frame(a = c(4, 6, 8), b = c(4, 4, 3))
mtcars[setdiff.data.frame(mtcars, to_remove2, by.x = c("cyl", "gear"), by.y = c("a", "b")), ]
#>                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
#> Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
#> Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
#> Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
#> Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
#> Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
#> Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
#> Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8

0人赞添加讨论(0) 举报

霸刀☆藐视天下

3楼-- · 2019-01-23 14:21

I remember coming across this exact issue quite a few months back. Managed to sift through my Evernote one-liners.

Note: This is not my solution. Credit goes to whoever wrote it (whom I can't seem to find at the moment).

If you don't worry about rownames then you can do:

df1[!duplicated(rbind(df2, df1))[-seq_len(nrow(df2))], ]
#   c1 c2
# 1  a  1
# 2  b  2

Edit: A data.table solution:

dt1 <- data.table(df1, key="c1")
dt2 <- data.table(df2)
dt1[!dt2]

or better one-liner (from v1.9.6+):

setDT(df1)[!df2, on="c1"]

This returns all rows in df1 where df2$c1 doesn't have a match with df1$c1.

0人赞添加讨论(0) 举报

等我变得足够好

4楼-- · 2019-01-23 14:21

You can create identifier columnas then subset:

e.g.

df1 <- data.frame(c1=c("a","b","c","d"),c2=c(1,2,3,4), indf1 = rep("Y",4) )
df2 <- data.frame(c1=c("c","d","e","f"),c2=c(3,4,5,6),indf2 = rep("Y",4) )
merge(df1,df2)
#  c1 c2 indf1 indf2
#1  c  3     Y     Y
#2  d  4     Y     Y

bigdf <- merge(df1,df2,all=TRUE)
#  c1 c2 indf1 indf2
#1  a  1     Y  <NA>
#2  b  2     Y  <NA>
#3  c  3     Y     Y
#4  d  4     Y     Y
#5  e  5  <NA>     Y
#6  f  6  <NA>     Y

Then subset how you wish:

 bigdf[is.na(bigdf$indf1) ,]
#  c1 c2 indf1 indf2
#5  e  5  <NA>     Y
#6  f  6  <NA>     Y

 bigdf[is.na(bigdf$indf2) ,]  #<- output you requested those not in df2
#  c1 c2 indf1 indf2
#1  a  1     Y  <NA>
#2  b  2     Y  <NA>

0人赞添加讨论(0) 举报

傲

5楼-- · 2019-01-23 14:25

If you're not planning on using any of the actual data in d2, then you don't need merge at all:

df1[!(df1$c1 %in% df2$c1), ]

0人赞添加讨论(0) 举报

▲ chillily

6楼-- · 2019-01-23 14:28

You can check the values in both columns and subset like this (just adding another solution):

na.omit( df1[ sapply( 1:ncol(df1) , function(x) ! df1[,x] %in% df2[,x] ) , ] )
#  c1 c2
#1  a  1
#2  b  2

0人赞添加讨论(0) 举报

叛逆

7楼-- · 2019-01-23 14:36

I think the simplest solution is with dplyr (tidyverse).

require(tidyverse)
anti_join(df1, df2)

0人赞添加讨论(0) 举报

1 2 下一页

Minus operation of data frames

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间