My dataframe looks like this:
x1 <- c("a", "c", "f", "j")
x2 <- c("b", "c", "g", "k")
x3 <- c("b", "d", "h", NA)
x4 <- c("a", "e", "i", NA)
df <- data.frame(x1, x2, x3, x4, stringsAsFactors = FALSE)
df
  x1 x2   x3   x4
1  a  b    b    a
2  c  c    d    e
3  f  g    h    i
4  j  k <NA> <NA>
Now I have an arbitrary vector:
vec <- c("a", "i", "s", "t", "z")
I would like to compare the vector values with each row in the dataframe and create an additional column that indicates whether at least one of the vector values was found or not.
The resulting dataframe should look like this:
  x1 x2   x3   x4 valueFound
1  a  b    b    a          1
2  c  c    d    e          0
3  f  g    h    i          1
4  j  k <NA> <NA>          0
I would like to do it without looping. Thank you very much for your support!
Rami
This would be faster than an apply-based solution (despite its cryptic construction):
Update -- Some benchmarks
Here, we can make up some bigger data to test on. These benchmarks are on 100k rows.
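As a rough sketch of how such test data could be made up: the snippet below draws 100k rows of random lowercase letters. The name df_big, the seed, and the use of sample() over letters are assumptions for illustration, not necessarily how the data for these benchmarks was built.

```r
# Hypothetical test data: 100k rows, four character columns of random letters.
set.seed(1)   # arbitrary seed, only so the sketch is reproducible
n <- 1e5

df_big <- data.frame(
  x1 = sample(letters, n, replace = TRUE),
  x2 = sample(letters, n, replace = TRUE),
  x3 = sample(letters, n, replace = TRUE),
  x4 = sample(letters, n, replace = TRUE),
  stringsAsFactors = FALSE
)

vec <- c("a", "i", "s", "t", "z")

dim(df_big)   # rows and columns of the generated data
```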
Here are the approaches we have so far:
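As a rough sketch, two of the approaches from this thread could be wrapped as functions like this. The function names fun_rowsums and fun_apply are made up for illustration and may differ from the ones actually benchmarked; the rowSums version is one possible way to write a vectorized membership test, not necessarily the exact construction referred to above.

```r
# Example data from the question.
df <- data.frame(
  x1 = c("a", "c", "f", "j"),
  x2 = c("b", "c", "g", "k"),
  x3 = c("b", "d", "h", NA),
  x4 = c("a", "e", "i", NA),
  stringsAsFactors = FALSE
)
vec <- c("a", "i", "s", "t", "z")

# Hypothetical name: vectorized test on the whole matrix at once.
fun_rowsums <- function(df, vec) {
  m <- as.matrix(df)
  hits <- m %in% vec        # %in% returns a plain vector, dimensions are lost
  dim(hits) <- dim(m)       # restore the row/column shape
  as.numeric(rowSums(hits) >= 1)
}

# Hypothetical name: row-wise test with apply().
fun_apply <- function(df, vec) {
  unname(apply(df, 1, function(x) as.numeric(any(x %in% vec))))
}

fun_rowsums(df, vec)
fun_apply(df, vec)
```

Both return one value per row, so either result can be assigned directly to df$valueFound.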
I suspect the NR functions will be a little slower:
And, similarly, Richard's second approach:
The grepl and this rowSum function are left for the benchmarks:

As another idea, trying to preserve and operate on the "list" structure of a "data.frame", rather than converting it to an atomic structure (i.e. sapply, as.matrix, do.call(_bind, ...) etc.), could be efficient. In this case we could use something like:
And to compare with the fastest so far, Ananda Mahto's approach (using the larger "df"):
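A minimal sketch of the list-preserving idea, together with a way to time it against a matrix-based rowSums version, might look like the following. The names fun_list, fun_matrix and big are hypothetical, the random test data is an assumption, and no timings are claimed here; run it to see the numbers on your own machine.

```r
# Example data from the question.
df <- data.frame(
  x1 = c("a", "c", "f", "j"),
  x2 = c("b", "c", "g", "k"),
  x3 = c("b", "d", "h", NA),
  x4 = c("a", "e", "i", NA),
  stringsAsFactors = FALSE
)
vec <- c("a", "i", "s", "t", "z")

# Hypothetical name: work column by column, never building a matrix.
fun_list <- function(df, vec) {
  hits <- lapply(df, function(col) col %in% vec)  # one logical vector per column
  as.numeric(Reduce(`|`, hits))                   # element-wise OR across columns
}

# Hypothetical name: matrix-based version used as the comparison point.
fun_matrix <- function(df, vec) {
  m <- as.matrix(df)
  hits <- m %in% vec
  dim(hits) <- dim(m)
  as.numeric(rowSums(hits) >= 1)
}

# Assumed larger data set: 100k rows of random letters.
set.seed(1)
big <- as.data.frame(
  matrix(sample(letters, 4e5, replace = TRUE), ncol = 4),
  stringsAsFactors = FALSE
)

fun_list(df, vec)

# Check that both versions agree before timing them.
identical(fun_list(big, vec), fun_matrix(big, vec))

system.time(fun_list(big, vec))
system.time(fun_matrix(big, vec))
```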
There does not appear to be any significant efficiency gain but, I guess, it's worth noting that the two loops (in Reduce and lapply) didn't prove to be as slow as would probably be expected.

Since you don't want a loop, you could get creative and paste the columns together by row, and then use grepl to compare it with vec.
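The paste-and-grepl idea could be sketched as below. Note the assumptions: this is a substring match, so it is only safe here because every value is a single lowercase character; with longer strings a value like "a" would also match inside "bat", and missing values are pasted as the text "NA".

```r
# Example data from the question.
df <- data.frame(
  x1 = c("a", "c", "f", "j"),
  x2 = c("b", "c", "g", "k"),
  x3 = c("b", "d", "h", NA),
  x4 = c("a", "e", "i", NA),
  stringsAsFactors = FALSE
)
vec <- c("a", "i", "s", "t", "z")

# Paste each row into a single string, e.g. "a b b a".
rows_as_text <- do.call(paste, df)

# Build one regular expression that matches any value of vec: "a|i|s|t|z".
pattern <- paste(vec, collapse = "|")

# grepl() is vectorized over the pasted rows; convert TRUE/FALSE to 1/0.
df$valueFound <- as.integer(grepl(pattern, rows_as_text))
df
```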
Here's a second option that compares the rows to the unlisted data frame
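One way this could look, as a sketch rather than the exact code of this answer: unlist() flattens the data frame column by column, so the membership test can be run once and the result reshaped back into a matrix with one row per original row. It assumes character columns (not factors), as in the question.

```r
# Example data from the question.
df <- data.frame(
  x1 = c("a", "c", "f", "j"),
  x2 = c("b", "c", "g", "k"),
  x3 = c("b", "d", "h", NA),
  x4 = c("a", "e", "i", NA),
  stringsAsFactors = FALSE
)
vec <- c("a", "i", "s", "t", "z")

# unlist() goes column by column, which matches matrix()'s column-major fill.
hits <- matrix(unlist(df) %in% vec, nrow = nrow(df))

# A row counts as "found" if at least one of its cells is in vec.
df$valueFound <- as.integer(rowSums(hits) > 0)
df
```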
Here's one way to do this:
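A plain, spelled-out version with apply() might look like this sketch (written for illustration, it is not guaranteed to match the code originally posted here). Each row is tested with %in%, and an explicit if/else turns the result into 1 or 0.

```r
# Example data from the question.
df <- data.frame(
  x1 = c("a", "c", "f", "j"),
  x2 = c("b", "c", "g", "k"),
  x3 = c("b", "d", "h", NA),
  x4 = c("a", "e", "i", NA),
  stringsAsFactors = FALSE
)
vec <- c("a", "i", "s", "t", "z")

# Go through the rows (MARGIN = 1) and flag those containing any value of vec.
df$valueFound <- apply(df, 1, function(row) {
  if (any(row %in% vec)) 1 else 0
})
df
```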
Thanks to @David Arenburg and @CathG, a couple of more concise approaches:
apply(df, 1, function(x) any(x %in% vec) + 0)
apply(df, 1, function(x) as.numeric(any(x %in% vec)))
Just for fun, a couple of other interesting variants:
apply(df, 1, function(x) any(x %in% vec) %/% TRUE)
apply(df, 1, function(x) cumprod(any(x %in% vec)))