Search and find values row-wise a dataframe

2020-03-04 03:40发布

My dataframe looks like this:

x1 <- c("a", "c", "f", "j")
x2 <- c("b", "c", "g", "k")
x3 <- c("b", "d", "h", NA)
x4 <- c("a", "e", "i", NA)
df <- data.frame(x1, x2, x3, x4, stringsAsFactors=F)

df

x1 x2   x3   x4
1  a  b    b    a
2  c  c    d    e
3  f  g    h    i
4  j  k <NA> <NA>

Now I have an arbitrary vector:

vec <- c("a", "i", "s", "t", "z")

I would like to compare the vector values with each row in the dataframe and create an additional column that indicates whether at least one of the vector values was found or not.

The resulting dataframe should look like this:

  x1 x2   x3   x4 valueFound
1  a  b    b    a          1
2  c  c    d    e          0
3  f  g    h    i          1
4  j  k <NA> <NA>          0

I would like to do it without looping. Thank you very much for your support!

Rami

标签: r dataframe
4条回答
Summer. ? 凉城
2楼-- · 2020-03-04 03:58

This would be faster than an apply based solution (despite it's cryptic construction):

as.numeric(rowSums(`dim<-`(as.matrix(df) %in% vec, dim(df))) >= 1)
[1] 1 0 1 0

Update -- Some benchmarks

Here, we can make up some bigger data to test on.... These benchmarks are on 100k rows.

set.seed(1)
nrow <- 100000
ncol <- 10
vec <- c("a", "i", "s", "t", "z")
df <- data.frame(matrix(sample(c(letters, NA), nrow * ncol, TRUE),
                        nrow = nrow, ncol = ncol), stringsAsFactors = FALSE)

Here are the approaches we have so far:

AM <- function() as.numeric(rowSums(`dim<-`(as.matrix(df) %in% vec, dim(df))) >= 1)
NR1 <- function() {
  apply(df,1,function(x){
    if(any(x %in% vec)){ 
      1 
    } else {
      0
    }
  })
}
NR2 <- function() apply(df, 1, function(x) any(x %in% vec) + 0)
NR3 <- function() apply(df, 1, function(x) as.numeric(any(x %in% vec)))
NR4 <- function() apply(df, 1, function(x) any(x %in% vec) %/% TRUE)
NR5 <- function() apply(df, 1, function(x) cumprod(any(x %in% vec)))
RS1 <- function() as.numeric(grepl(paste(vec, collapse="|"), do.call(paste, df)))
RS2 <- function() as.numeric(seq(nrow(df)) %in% row(df)[unlist(df) %in% vec])

I'm suspecting the NR functions will be a little slower:

system.time(NR1()) # Other NR functions are about the same
#    user  system elapsed 
#   1.172   0.000   1.196 

And, similarly, Richard's second approach:

system.time(RS2())
#    user  system elapsed 
#   0.918   0.000   0.932 

The grepl and this rowSum function are left for the benchmarks:

library(microbenchmark)
microbenchmark(AM(), RS1())
# Unit: milliseconds
#   expr       min       lq      mean    median       uq      max neval
#   AM()  65.75296  67.2527  92.03043  84.58111 102.3199 234.6114   100
#  RS1() 253.57360 256.6148 266.89640 260.18038 264.1531 385.6525   100
查看更多
够拽才男人
3楼-- · 2020-03-04 04:04

As another idea, trying to preserve and operate on the "list" structure of a "data.frame" and not converting it to atomic (i.e. sapply, as.matrix, do.call(_bind, ...) etc.) could be efficient. In this case we could use something like:

as.numeric(Reduce("|", lapply(df, function(x) x %in% vec)))
#[1] 1 0 1 0

And to compare with -the fastest so far- Ananda Mahto's apporach (using the larger "df"):

AL = function() as.numeric(Reduce("|", lapply(df, function(x) x %in% vec)))
AM = function() as.numeric(rowSums(`dim<-`(as.matrix(df) %in% vec, dim(df))) >= 1)
identical(AM(), AL())
#[1] TRUE
microbenchmark::microbenchmark(AM(), AL(), times = 50)
#Unit: milliseconds
# expr      min       lq   median       uq      max neval
# AM() 49.20072 53.53789 58.03740 66.76898 86.04280    50
# AL() 45.24706 49.34271 51.43577 55.05866 74.79533    50

There does not appear any significant efficiency gain, but, I guess, it's worth noting that the 2 loops (in Reduce and lapply) didn't prove to be as slow as -probably- would be expected.

查看更多
smile是对你的礼貌
4楼-- · 2020-03-04 04:05

Since you don't want a loop, you could get creative and paste the columns together by row, and then use grepl to compare it with vec

> as.numeric(grepl(paste(vec, collapse="|"), do.call(paste, df)))
[1] 1 0 1 0 

Here's a second option that compares the rows to the unlisted data frame

> as.numeric(seq(nrow(df)) %in% row(df)[unlist(df) %in% vec])
[1] 1 0 1 0
查看更多
劳资没心,怎么记你
5楼-- · 2020-03-04 04:17

Here's one way to do this:

df$valueFound <- apply(df,1,function(x){
  if(any(x %in% vec)){ 
    1 
  } else {
    0
  }
})
##
> df
  x1 x2   x3   x4 valueFound
1  a  b    b    a          1
2  c  c    d    e          0
3  f  g    h    i          1
4  j  k <NA> <NA>          0

Thanks to @David Arenburg and @CathG, a couple of more concise approaches:

  • apply(df, 1, function(x) any(x %in% vec) + 0)
  • apply(df, 1, function(x) as.numeric(any(x %in% vec)))

Just for fun, a couple of other interesting variants:

  • apply(df, 1, function(x) any(x %in% vec) %/% TRUE)
  • apply(df, 1, function(x) cumprod(any(x %in% vec)))
查看更多
登录 后发表回答