Search and find values row-wise a dataframe

My dataframe looks like this:

x1 <- c("a", "c", "f", "j")
x2 <- c("b", "c", "g", "k")
x3 <- c("b", "d", "h", NA)
x4 <- c("a", "e", "i", NA)
df <- data.frame(x1, x2, x3, x4, stringsAsFactors=F)

df

x1 x2   x3   x4
1  a  b    b    a
2  c  c    d    e
3  f  g    h    i
4  j  k <NA> <NA>

Now I have an arbitrary vector:

vec <- c("a", "i", "s", "t", "z")

I would like to compare the vector values with each row in the dataframe and create an additional column that indicates whether at least one of the vector values was found or not.

The resulting dataframe should look like this:

  x1 x2   x3   x4 valueFound
1  a  b    b    a          1
2  c  c    d    e          0
3  f  g    h    i          1
4  j  k <NA> <NA>          0

I would like to do it without looping. Thank you very much for your support!

Rami

标签： r dataframe

4条回答

Summer. ? 凉城

2楼-- · 2020-03-04 03:58

This would be faster than an apply based solution (despite it's cryptic construction):

as.numeric(rowSums(`dim<-`(as.matrix(df) %in% vec, dim(df))) >= 1)
[1] 1 0 1 0

Update -- Some benchmarks

Here, we can make up some bigger data to test on.... These benchmarks are on 100k rows.

set.seed(1)
nrow <- 100000
ncol <- 10
vec <- c("a", "i", "s", "t", "z")
df <- data.frame(matrix(sample(c(letters, NA), nrow * ncol, TRUE),
                        nrow = nrow, ncol = ncol), stringsAsFactors = FALSE)

Here are the approaches we have so far:

AM <- function() as.numeric(rowSums(`dim<-`(as.matrix(df) %in% vec, dim(df))) >= 1)
NR1 <- function() {
  apply(df,1,function(x){
    if(any(x %in% vec)){ 
      1 
    } else {
      0
    }
  })
}
NR2 <- function() apply(df, 1, function(x) any(x %in% vec) + 0)
NR3 <- function() apply(df, 1, function(x) as.numeric(any(x %in% vec)))
NR4 <- function() apply(df, 1, function(x) any(x %in% vec) %/% TRUE)
NR5 <- function() apply(df, 1, function(x) cumprod(any(x %in% vec)))
RS1 <- function() as.numeric(grepl(paste(vec, collapse="|"), do.call(paste, df)))
RS2 <- function() as.numeric(seq(nrow(df)) %in% row(df)[unlist(df) %in% vec])

I'm suspecting the NR functions will be a little slower:

system.time(NR1()) # Other NR functions are about the same
#    user  system elapsed 
#   1.172   0.000   1.196

And, similarly, Richard's second approach:

system.time(RS2())
#    user  system elapsed 
#   0.918   0.000   0.932

The grepl and this rowSum function are left for the benchmarks:

library(microbenchmark)
microbenchmark(AM(), RS1())
# Unit: milliseconds
#   expr       min       lq      mean    median       uq      max neval
#   AM()  65.75296  67.2527  92.03043  84.58111 102.3199 234.6114   100
#  RS1() 253.57360 256.6148 266.89640 260.18038 264.1531 385.6525   100

0人赞添加讨论(0) 举报

够拽才男人

3楼-- · 2020-03-04 04:04

As another idea, trying to preserve and operate on the "list" structure of a "data.frame" and not converting it to atomic (i.e. sapply, as.matrix, do.call(_bind, ...) etc.) could be efficient. In this case we could use something like:

as.numeric(Reduce("|", lapply(df, function(x) x %in% vec)))
#[1] 1 0 1 0

And to compare with -the fastest so far- Ananda Mahto's apporach (using the larger "df"):

AL = function() as.numeric(Reduce("|", lapply(df, function(x) x %in% vec)))
AM = function() as.numeric(rowSums(`dim<-`(as.matrix(df) %in% vec, dim(df))) >= 1)
identical(AM(), AL())
#[1] TRUE
microbenchmark::microbenchmark(AM(), AL(), times = 50)
#Unit: milliseconds
# expr      min       lq   median       uq      max neval
# AM() 49.20072 53.53789 58.03740 66.76898 86.04280    50
# AL() 45.24706 49.34271 51.43577 55.05866 74.79533    50

There does not appear any significant efficiency gain, but, I guess, it's worth noting that the 2 loops (in Reduce and lapply) didn't prove to be as slow as -probably- would be expected.

0人赞添加讨论(0) 举报

smile是对你的礼貌

4楼-- · 2020-03-04 04:05

Since you don't want a loop, you could get creative and paste the columns together by row, and then use grepl to compare it with vec

> as.numeric(grepl(paste(vec, collapse="|"), do.call(paste, df)))
[1] 1 0 1 0

Here's a second option that compares the rows to the unlisted data frame

> as.numeric(seq(nrow(df)) %in% row(df)[unlist(df) %in% vec])
[1] 1 0 1 0

0人赞添加讨论(0) 举报

劳资没心，怎么记你

5楼-- · 2020-03-04 04:17

Here's one way to do this:

df$valueFound <- apply(df,1,function(x){
  if(any(x %in% vec)){ 
    1 
  } else {
    0
  }
})
##
> df
  x1 x2   x3   x4 valueFound
1  a  b    b    a          1
2  c  c    d    e          0
3  f  g    h    i          1
4  j  k <NA> <NA>          0

Thanks to @David Arenburg and @CathG, a couple of more concise approaches:

apply(df, 1, function(x) any(x %in% vec) + 0)
apply(df, 1, function(x) as.numeric(any(x %in% vec)))

Just for fun, a couple of other interesting variants:

apply(df, 1, function(x) any(x %in% vec) %/% TRUE)
apply(df, 1, function(x) cumprod(any(x %in% vec)))

0人赞添加讨论(0) 举报

Search and find values row-wise a dataframe

Update -- Some benchmarks

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间