Permuting elements of a vector 10,000 times - efficiently

Published 2019-05-31 15:14

Question:

This question is quite straightforward. However, the solutions that I have found to it are extremely memory and time inefficient. I am wondering if this can be done in R without grinding one's machine into dust.

Take a vector:

x <- c("A", "B", "B", "E", "C", "C", "D", "E", "A", "C")

This one has 10 elements, of which five are unique. Importantly, some elements are repeated, and any permutation must contain the same count of each element. I wish to permute this vector 10,000 times, with each permutation randomly generated and unique. With my real data, the vectors could have up to 1000 elements, which is very hard to do efficiently.

To get one permutation, you can just do:

sample(x)

or, from the gtools package:

permute(x)

I could write some code to do that 10,000 times, but I would likely get duplicates. Is there a way of doing this and dropping duplicates until 10,000 unique permutations are reached?

Other similar questions on Stack Overflow and Cross Validated have asked about generating all the unique permutations of a sequence. These questions are here:

Shuffling a vector - all possible outcomes of sample()?

Generating all distinct permutations of a list in R

https://stats.stackexchange.com/questions/24300/how-to-resample-in-r-without-repeating-permutations

These are good, and the suggestions for generating all the unique permutations work well; it would certainly be easy to run them and then draw 10,000 random rows from the result. However, beyond about 10 elements in a vector, holding the full set in memory becomes very expensive.
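For context (my own illustration, not from the original question): the number of distinct permutations of a vector with repeated elements is a multinomial coefficient, which is easy to compute in R:

x <- c("A", "B", "B", "E", "C", "C", "D", "E", "A", "C")
# Distinct permutations = n! / (product of factorials of the element counts)
factorial(length(x)) / prod(factorial(table(x)))
# [1] 75600

So even this 10-element example has 75,600 distinct permutations, and the count grows astronomically with vector length, which is why enumerating them all is infeasible for 1000 elements.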

Any comments on how to do this efficiently for up to 1000 elements are appreciated. This has me getting very dizzy.

Answer 1:

I don't think the computations need to be as expensive as you fear. For small "x" vectors, you might want to overshoot a little (here, I've rather overdone it), then check for duplicates using duplicated. If dropping the duplicated rows leaves you short of your desired 10,000, repeat the process to fill the difference, using rbind to append the new draws to the matrix you get from replicate. This can be implemented in a while loop (a sketch follows the example below).

x <- c("A", "B", "B", "E", "C", "C", "D", "E", "A", "C")
set.seed(1)
N <- t(replicate(15000, sample(x)))
sum(duplicated(N))
# [1] 1389
out <- N[!(duplicated(N)), ][1:10000, ]
head(out)
#      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,] "B"  "E"  "C"  "D"  "B"  "E"  "A"  "C"  "C"  "A"  
# [2,] "B"  "B"  "C"  "C"  "C"  "D"  "E"  "E"  "A"  "A"  
# [3,] "C"  "B"  "C"  "A"  "A"  "E"  "D"  "C"  "B"  "E"  
# [4,] "C"  "C"  "E"  "B"  "C"  "E"  "A"  "A"  "D"  "B"  
# [5,] "A"  "C"  "D"  "E"  "E"  "C"  "A"  "B"  "B"  "C"  
# [6,] "C"  "E"  "E"  "B"  "A"  "C"  "D"  "A"  "B"  "C"
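For completeness, here is a minimal sketch of that while loop (the 20% overshoot factor is my own choice, not from the answer):

x <- c("A", "B", "B", "E", "C", "C", "D", "E", "A", "C")
n_target <- 10000
set.seed(1)
# Overshoot, drop duplicates, then top up until we have enough unique rows
out <- t(replicate(ceiling(n_target * 1.2), sample(x)))
out <- out[!duplicated(out), , drop = FALSE]
while (nrow(out) < n_target) {
  extra <- t(replicate(n_target - nrow(out), sample(x)))
  out <- rbind(out, extra)
  out <- out[!duplicated(out), , drop = FALSE]
}
out <- out[1:n_target, ]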

The duplicated step is actually the most expensive, as far as I can see:

y <- sample(500, 1000, TRUE)
system.time(N <- t(replicate(12000, sample(y))))
# user  system elapsed 
# 2.35    0.08    2.43 
system.time(D <- sum(duplicated(N)))
#  user  system elapsed 
# 14.82    0.01   14.84 
D
# [1] 0

There are no duplicates among the 12,000 samples above: with 1000 elements, the number of possible permutations is so astronomically large that random collisions essentially never happen, so the duplicated check is almost a formality.
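If that duplicated check on a wide matrix ever becomes the bottleneck, one common workaround (my suggestion, not part of the original answer) is to collapse each row to a single string and run duplicated on the resulting character vector, which is typically much faster:

key <- apply(N, 1, paste, collapse = "\r")  # "\r" is an unlikely separator, avoids false matches
sum(duplicated(key))                        # same count as sum(duplicated(N))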



Answer 2:

In case you are only interested in the first 10000 permutations (in dictionary order), you can make use of the iterpc library.

library(iterpc)
x <- c("A", "B", "B", "E", "C", "C", "D", "E", "A", "C")
I <- iterpc(table(x), ordered=TRUE)
# first 10000 permutations
result <- getnext(I, d=10000)

And retrieving them is very fast:

system.time(getnext(I, d=10000))
#    user  system elapsed 
#   0.004   0.000   0.005 


Answer 3:

Here's an idea. This is not necessarily an answer but it's too big for a comment.

Get the permutations in an orderly way, and add them to a collection. For example, if the elements are A, B, C, and D:

A B C D
A B D C
A C B D
... and so on

Once you have the required number of permutations (10,000 in your case), permute that collection once.

If the cost of randomization is the bottleneck, this approach should solve it.
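For concreteness, here is a minimal sketch of this idea, reusing the iterpc approach from Answer 2 (with the caveat that the set you end up with is still the lexicographically first 10,000 permutations, only presented in random order):

library(iterpc)
x <- c("A", "B", "B", "E", "C", "C", "D", "E", "A", "C")
I <- iterpc(table(x), ordered = TRUE)
perms <- getnext(I, d = 10000)          # generate in dictionary order
perms <- perms[sample(nrow(perms)), ]   # permute the collection once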