library(dplyr)
I have the following data set:
set.seed(123)
n <- 1e6
d <- data.frame(
  a = letters[sample(5, n, replace = TRUE)],
  b = letters[sample(5, n, replace = TRUE)],
  c = letters[sample(5, n, replace = TRUE)],
  d = letters[sample(5, n, replace = TRUE)]
)
I would like to count the number of distinct letters in each row. To do this I use:
sapply(as.data.frame(t(d)), function(x) n_distinct(x))
However, because this approach loops over every row, it is slow. Do you have any suggestions on how to speed this up?
My laptop is a piece of junk so...
system.time(sapply(as.data.frame(t(d)), function(x) n_distinct(x)))
user system elapsed
185.78 0.86 208.08
Here are some options that are faster (on my machine) than the OP's method; the methods from the other posts are included for comparison.
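One vectorized option, as a minimal sketch: a column contributes a new distinct value to a row only if it differs from every column to its left, so the count can be built from whole-column comparisons instead of a per-row function call. The function name `row_distinct_pairwise` is made up for illustration, and the sketch assumes at least one column and no `NA` values.

```r
# Hypothetical helper: count distinct values per row using only
# column-wise comparisons (no per-row function call).
row_distinct_pairwise <- function(df) {
  m <- as.matrix(df)                 # character matrix, one row per observation
  out <- rep(1L, nrow(m))            # the first column always adds one value
  for (j in seq_len(ncol(m))[-1]) {
    # column j is "new" in a row if it differs from all earlier columns
    is_new <- rep(TRUE, nrow(m))
    for (i in seq_len(j - 1)) {
      is_new <- is_new & (m[, j] != m[, i])
    }
    out <- out + is_new
  }
  out
}

# Small made-up example
d_small <- data.frame(
  a = c("a", "a", "a"),
  b = c("a", "b", "b"),
  c = c("a", "a", "c"),
  d = c("a", "b", "d"),
  stringsAsFactors = FALSE
)
row_distinct_pairwise(d_small)
```

The number of comparisons grows with the square of the number of columns, so this is likely to pay off only when there are few columns and many rows, as in the question.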
You can try:
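For instance, a minimal base R sketch: convert the data frame to a character matrix once and count unique values per row with `apply()`. It still loops over rows in R, but it avoids `as.data.frame(t(d))`, which builds a data frame with one column per original row. The function name `row_distinct_apply` is made up for illustration.

```r
# Hypothetical helper: per-row distinct count via apply() on a matrix.
row_distinct_apply <- function(df) {
  m <- as.matrix(df)  # one conversion up front instead of transposing
  unname(apply(m, 1, function(x) length(unique(x))))
}

# Small made-up example
d_small <- data.frame(
  a = c("a", "a", "a"),
  b = c("a", "b", "b"),
  c = c("a", "a", "c"),
  d = c("a", "b", "d"),
  stringsAsFactors = FALSE
)
row_distinct_apply(d_small)
```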
If the different values are not so many, you can try:
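A minimal sketch of that idea, assuming no `NA` values: loop over the distinct values rather than over the rows, flag the rows in which each value occurs, and add up the flags. With five letters that is five vectorized passes over the data. The function name `row_distinct_by_value` is made up for illustration.

```r
# Hypothetical helper: loop over the (few) distinct values, not the rows.
row_distinct_by_value <- function(df) {
  m <- as.matrix(df)                 # character matrix
  vals <- unique(as.vector(m))       # every value that occurs anywhere
  # for each value, a logical vector: does the value occur in this row?
  present <- lapply(vals, function(v) rowSums(m == v) > 0)
  Reduce("+", present)               # number of values present per row
}

# Small made-up example
d_small <- data.frame(
  a = c("a", "a", "a"),
  b = c("a", "b", "b"),
  c = c("a", "a", "c"),
  d = c("a", "b", "d"),
  stringsAsFactors = FALSE
)
row_distinct_by_value(d_small)
```

The cost grows with the number of distinct values, so this only makes sense when that number is small compared with the number of rows.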
For the example you provided, this is much faster than the other solutions and yields the same result.