Efficient Means of Identifying Number of Distinct

library(dplyr)

I have the following data set:

set.seed(123)
n <- 1e6
d <- data.frame(a = letters[sample(5, n, replace = TRUE)],
                b = letters[sample(5, n, replace = TRUE)],
                c = letters[sample(5, n, replace = TRUE)],
                d = letters[sample(5, n, replace = TRUE)])

I would like to count the number of distinct letters in each row. To do this I use:

sapply(as.data.frame(t(d)), function(x) n_distinct(x))

However, because this approach loops over every row, it is slow. Do you have any suggestions on how to speed this up?

My laptop is a piece of junk so...

system.time(sapply(as.data.frame(t(d)), function(x) n_distinct(x)))
  user  system elapsed 
185.78    0.86  208.08

Tags: r performance loops vectorization

3 Answers

干净又极端

Reply #2 · 2019-05-23 16:58

Here are some options that are faster (on my machine) than the OP's method. The methods from the other answers are included for comparison.

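system.time({ # @nicola's function
  # Note: this assignment persists, so d is a character matrix in the calls below
  d <- as.matrix(d)
  uniqueValues <- unique(as.vector(d))
  # For each value, flag the rows that contain it, then add up the flags
  Reduce("+", lapply(uniqueValues, function(x) rowSums(d == x) > 0))
})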
#   user  system elapsed 
#  0.61    0.00    0.61 

system.time(colSums(apply(d, 1, function(i) !duplicated(i)))) #@Sotos function
#   user  system elapsed 
#  8.16    0.00    8.18 


system.time(apply(d, 1, function(x) sum(!duplicated(x))))
#  user  system elapsed 
#  8.19    0.01    8.25 



system.time(apply(d, 1, uniqueN)) #uniqueN from `data.table`
#   user  system elapsed 
#  15.59    0.03   15.74 


system.time(apply(d, 1, n_distinct)) #n_distinct from `dplyr`
#  user  system elapsed 
# 31.50    0.04   53.82 

system.time(sapply(as.data.frame(t(d)), function(x) n_distinct(x)))
#   user  system elapsed 
# 70.12    0.36   72.03
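
Before comparing timings, it can help to confirm that the faster variants agree with each other. The following is a minimal sketch using only base R on a smaller sample; the size `n <- 1000` and the names `m`, `res_reduce`, `res_colsums` and `res_apply` are assumptions chosen for illustration, not part of the original benchmark.

```r
set.seed(123)
n <- 1000  # assumed smaller size so the check runs quickly
m <- matrix(letters[sample(5, 4 * n, replace = TRUE)], ncol = 4)

# @nicola's approach: one pass over the matrix per distinct value
res_reduce <- Reduce("+", lapply(unique(as.vector(m)),
                                 function(x) rowSums(m == x) > 0))

# @Sotos's approach: flag first occurrences per row, then sum the flags
res_colsums <- colSums(apply(m, 1, function(i) !duplicated(i)))

# Row-wise apply returning the count directly
res_apply <- apply(m, 1, function(x) sum(!duplicated(x)))

# All three should give the same per-row counts
stopifnot(all(res_reduce == res_colsums), all(res_colsums == res_apply))
```

If `stopifnot` returns silently, the three methods produced identical counts on this sample.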


在下西门庆

Reply #3 · 2019-05-23 17:00

You can try:

system.time(colSums(apply(d, 1, function(i) !duplicated(i))))
#user  system elapsed 
#6.50    0.02    6.53
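
To see why `colSums` rather than `rowSums` is used here, note that `apply` over rows returns its results as columns. A small sketch on a hypothetical 3 x 3 matrix (the data and the names `small` and `flags` are made up for illustration):

```r
# Hypothetical toy input: each row is one observation
small <- matrix(c("a", "a", "b",
                  "b", "c", "d",
                  "c", "c", "c"), nrow = 3, byrow = TRUE)

# !duplicated() marks the first occurrence of each value within a row
flags <- apply(small, 1, function(i) !duplicated(i))

# apply() stores the result for input row j in column j of flags,
# so the per-row distinct counts are the column sums
distinct_per_row <- colSums(flags)
distinct_per_row
```

Here the rows contain 2, 3 and 1 distinct letters respectively.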


Juvenile、少年°

Reply #4 · 2019-05-23 17:17

If the number of distinct values is small, you can try:

d<-as.matrix(d)
uniqueValues<-unique(as.vector(d))
Reduce("+",lapply(uniqueValues,function(x) rowSums(d==x)>0))

For the example you provided, this is much faster than the other solutions and yields the same result.
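
As a rough illustration of how this works, the sketch below runs the same steps on a tiny hypothetical data frame (`small`, `m`, `presence` and `counts` are names assumed for this example). Each distinct value costs one vectorized pass over the whole matrix, which is why the approach suits data with few distinct values.

```r
# Hypothetical toy data, not the 1e6-row example from the question
small <- data.frame(a = c("a", "b", "c"),
                    b = c("a", "c", "c"),
                    c = c("b", "d", "c"),
                    stringsAsFactors = FALSE)

m <- as.matrix(small)
uniqueValues <- unique(as.vector(m))

# One logical vector per distinct value: TRUE where the row contains it
presence <- lapply(uniqueValues, function(x) rowSums(m == x) > 0)

# Adding the logical vectors counts how many distinct values each row contains
counts <- Reduce("+", presence)

# Cross-check against a straightforward row-wise count
reference <- apply(m, 1, function(x) length(unique(x)))
stopifnot(all(counts == reference))
```

The number of passes equals `length(uniqueValues)`, so with thousands of distinct values the row-wise alternatives may become competitive.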
