R: How to efficiently find out whether data.frame

Posted 2019-07-01 20:01

Question:

In order to find out whether data frame df.a is a subset of data frame df.b I did the following:

df.a <- data.frame( x=1:5, y=6:10 )
df.b <- data.frame( x=1:7, y=6:12 )
inds.x <- as.integer( lapply( df.a$x, function(x) which(df.b$x == x) ))
inds.y <- as.integer( lapply( df.a$y, function(y) which(df.b$y == y) ))
identical( inds.x, inds.y )

The last line gave TRUE, hence df.a is contained in df.b.

Now I wonder whether there is a more elegant - and possibly more efficient - way to answer this question?

This task is also easily extended to finding the intersection of two given data frames, possibly based on only a subset of columns.

Help will be much appreciated.

Answer 1:

I am going to hazard a guess at an answer.

I think semi_join from dplyr will do what you want, even taking into account duplicated rows.

First note the helpfile ?semi_join:

return all rows from x where there are matching values in y, keeping just columns from x.

A semi join differs from an inner join because an inner join will return one row of x for each matching row of y, where a semi join will never duplicate rows of x.
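To make the difference described in the help text concrete, here is a minimal base R sketch that needs no dplyr. It assumes that merge() is an acceptable stand-in for an inner join and that filtering with %in% on a single key column is an acceptable stand-in for a semi join. The data frames x and y and the column k are made up for illustration.

```r
# Toy data (hypothetical): key 1 appears twice in y, key 2 does not appear at all
x <- data.frame(k = 1:2)
y <- data.frame(k = c(1L, 1L, 3L))

# Inner join: one output row per matching pair, so key 1 is duplicated
inner <- merge(x, y, by = "k")

# Semi-join stand-in: keep each row of x at most once if its key occurs in y
semi <- x[x$k %in% y$k, , drop = FALSE]

nrow(inner)  # 2
nrow(semi)   # 1
```

The inner join returns the row of x with k = 1 twice because y matches it twice, while the semi-join stand-in returns it once.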

Ok, this suggests that the following should correctly fail:

df.a <- data.frame( x=c(1:5,1), y=c(6:10,6) )
df.b <- data.frame( x=1:7, y=6:12 )
identical(semi_join(df.b, df.a),  semi_join(df.a, df.a))

which gives FALSE, as expected: semi_join(df.b, df.a) returns only five rows, whereas df.a has six because of the duplicated row:

> semi_join(df.b, df.a)
Joining by: c("x", "y")
  x  y
1 1  6
2 2  7
3 3  8
4 4  9
5 5 10

However, the following should pass:

df.c <- data.frame( x=c(1:7, 1), y= c(6:12, 6) )
identical(semi_join(df.c, df.a), semi_join(df.a, df.a))

and it does, giving TRUE.

The second semi_join(df.a, df.a) is required to get the canonical sorting on df.a.
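If dplyr is not available, the same duplicate-aware check can be sketched in base R by counting rows. This is not the method from the answer above, just one possible alternative. The helper name is_row_subset and its cols argument are hypothetical, and the sketch assumes that pasting the compared columns together yields a unique key per distinct row, which may not hold for floating-point columns or for values containing the separator.

```r
# Hypothetical helper: TRUE if every row of `a` (restricted to `cols`) occurs
# in `b` at least as often as it occurs in `a`.
is_row_subset <- function(a, b, cols = intersect(names(a), names(b))) {
  # Collapse each row into one key string over the compared columns
  key <- function(d) do.call(paste, c(unname(as.list(d[cols])), sep = "\r"))
  ca <- table(key(a))  # row counts in a
  cb <- table(key(b))  # row counts in b
  all(names(ca) %in% names(cb)) &&
    all(as.vector(ca) <= as.vector(cb[names(ca)]))
}

# Same data as in the answer above
df.a <- data.frame(x = c(1:5, 1), y = c(6:10, 6))
df.b <- data.frame(x = 1:7, y = 6:12)
df.c <- data.frame(x = c(1:7, 1), y = c(6:12, 6))

is_row_subset(df.a, df.b)  # FALSE: the row (1, 6) occurs twice in df.a, once in df.b
is_row_subset(df.a, df.c)  # TRUE
```

Passing a shorter cols vector, for example cols = "x", restricts the comparison to a subset of columns, which is the extension asked about in the question.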