How to test whether a vector contains repetitive e

2020-07-09 08:09发布

问题:

how do you test whether a vector contains repetitive elements in R?

回答1:

I think I found the answer. Use duplicated() function:

a=c(3,5,7,2,7,9)
b=1:10
any(duplicated(a)) #True
any(duplicated(b)) #False


回答2:

Also try rle(x) to find the lengths of runs of identical values in x.



回答3:

If you're looking for consecutive repeats you can use diff.

a <- 1:10
b <- c(1:5, 5, 7, 8, 9, 10)
diff(a)
diff(b)

Or anywhere in the vector:

length(a) == length(unique(a))
length(b) == length(unique(b))


回答4:

check this:

> all(diff(c(1,2,3)))
[1] TRUE
Warning message:
In all(diff(c(1, 2, 3))) : coercing argument of type 'double' to logical
> all(diff(c(1,2,2,3)))
[1] FALSE
Warning message:
In all(diff(sort(c(1, 2, 4, 2, 3)))) : coercing argument of type 'double' to logical

You can add some casting to get rid of warnings.



回答5:

As mentioned in the comment section by Hadley:

anyDuplicated will be a bit faster for very long vectors - it can terminate when it finds the first duplicate.

Example

a=c(3,5,7,2,7,9)
b=1:10
anyDuplicated(b) != 0L # TRUE
anyDuplicated(b) != 0L # FALSE

Benchmark with 1 million observations:

set.seed(2011)
x <- sample(1e7, size = 1e6, replace = TRUE)
bench::mark(
  ZNN = any(duplicated(x)),
  RL = length(x) != length(unique(x)),
  BUA = !all(diff(sort(x))),
  AD = anyDuplicated(x) != 0L
)

# A tibble: 4 x 13
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result    memory            time     gc                
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>    <list>            <list>   <list>            
1 ZNN         64.62ms  70.04ms      11.5    11.8MB     0        8     0      693ms <lgl [1]> <df[,3] [2 x 3]>  <bch:tm> <tibble [8 x 3]>  
2 RL          66.95ms  70.67ms      12.5    15.4MB     0        7     0      561ms <lgl [1]> <df[,3] [3 x 3]>  <bch:tm> <tibble [7 x 3]>  
3 BUA         84.66ms  87.79ms      10.6      42MB     3.54     3     1      283ms <lgl [1]> <df[,3] [11 x 3]> <bch:tm> <tibble [4 x 3]>  
4 AD           2.45ms   2.87ms     314.        8MB     5.98   105     2      335ms <lgl [1]> <df[,3] [1 x 3]>  <bch:tm> <tibble [107 x 3]>

Benchmark with 100 observations

set.seed(2011)
x <- sample(1e7, size = 100, replace = TRUE)

bench::mark(
  ZNN = any(duplicated(x)),
  RL = length(x) != length(unique(x)),
  BUA = !all(diff(sort(x))),
  AD = anyDuplicated(x) != 0L
)

# A tibble: 4 x 13
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result    memory            time     gc                   
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>    <list>            <list>   <list>               
1 ZNN          7.14us   8.93us    60429.    1.48KB     6.04  9999     1    165.5ms <lgl [1]> <df[,3] [2 x 3]>  <bch:tm> <tibble [10,000 x 3]>
2 RL           8.03us   9.37us    83754.    1.92KB     0    10000     0    119.4ms <lgl [1]> <df[,3] [3 x 3]>  <bch:tm> <tibble [10,000 x 3]>
3 BUA         54.89us  61.58us     8317.    4.83KB     6.74  3701     3      445ms <lgl [1]> <df[,3] [11 x 3]> <bch:tm> <tibble [3,704 x 3]> 
4 AD            5.8us   6.69us   123838.    1.05KB     0    10000     0     80.8ms <lgl [1]> <df[,3] [1 x 3]>  <bch:tm> <tibble [10,000 x 3]>


标签: r vector