I have a vector:
seq1<-c('a','b','c','b','a','b','c','b','a','b','c')
I wish to permute the elements of this vector to create multiple (ideally up to 5000) vectors with the condition that the permuted vectors cannot have repeated elements within the vector in consecutive elements. e.g. "abbca...." is not allowed as 'b-b' is a repeat.
I realize that for this small example there probably are not 5000 solutions. I am typically dealing with much larger vectors. I am also willing to consider sampling with replacement, though currently I'm working on solutions without replacement.
I am looking for better solutions than my current thinking.
Option 1. - brute force.
Here, I just repeatedly sample and check if any successive elements are duplicates.
set.seed(18)
seq1b <- sample(seq1a)
seq1b
#[1] "b" "b" "a" "a" "c" "b" "b" "c" "a" "c" "b"
sum(seq1b[-length(seq1b)]==seq1b[-1]) #3
This is not a solution as there are 3 duplicated consecutive elements. I also realize that lag
is probably a better way to check for duplicated elements but for some reason it is being finicky (I think it is being masked by another package I have loaded).
set.seed(1000)
res<-NULL
for (i in 1:10000){res[[i]]<-sample(seq1a)}
res1 <- lapply(res, function(x) sum(x[-length(x)]==x[-1]))
sum(unlist(res1)==0) #228
This produces 228 options out of 10000 iterations. But let's see how many unique ones:
res2 <- res[which(unlist(res1)==0)]
unique(unlist(lapply(res2, paste0, collapse=""))) #134
Out of 10000 attempts we only get 134 unique ones from this short example vector.
Here are 3 of the 134 example sequences produced:
# "bcbabcbabca" "cbabababcbc" "bcbcababacb"
In fact, if I try over 500,000 samples, I can only get 212 unique sequences that match my non-repeating criteria. This is probably close to the upper limit of possible ones.
Option 2. - iteratively
A second idea I had is to be more iterative about the approach.
seq1a
table(seq1a)
#a b c
#3 5 3
We could sample one of these letters as our starting point. Then sample another from the remaining ones, check if it is the same as the previously chosen one and if not, add it to the end. And so on and so forth...
set.seed(10)
newseq <- sample(seq1a,1) #b
newseq #[1] "b"
remaining <-seq1a[!seq1a %in% newseq | duplicated(seq1a)]
table(remaining)
#a b c
#3 4 3
set.seed(10)
newone <- sample(remaining,1) #c
#check if newone is same as previous one.
newone==newseq[length(newseq)] #FALSE
newseq <- c(newseq, newone) #update newseq
newseq #[1] "b" "c"
remaining <-seq1a[!seq1a %in% newseq | duplicated(seq1a)] #update remaining
remaining
table(remaining)
#a b c
#3 4 2
This might work, but I can also see it running into lots of issues - e.g. we could go:
# "a" "c" "a" "c" "a" "b" ...
and then be left with 3 more 'b's that cannot go at the end as they'd be duplicates.
Of course, this would be a lot easier if I allowed sampling with replacement, but for now I'm trying to do this without replacement.
The speed of your actual job will depend on a lot of factors (e.g. how many possible passing combinations exist), but I think you can accomplish this relatively quickly by using 2 loops (similarly to how you outlined, but possibly quicker):
In the following example, you set two values to control the searching process:
nsuccess
- Desired number of many unique permutations;nmax
- Maximum number of permutations (sets upper limit on computation time)Example
Here another solution. Please see the comments in the code for an explanation of the algorithm. In a way, it's similar to your second (iterative) approach, but it includes
while
loop that ensures that the next element is validThe algorithm is also quite efficient with longer
seq1
vectors as given in one of your comments. But I guess it's performance will degrade if you have more unique elements inseq1
.Here the code: First a few definitions
Now generate the combinations
Now let's check the results
You can use the
iterpc
package to work with combinations and iterations. I hadn't heard of it until trying to answer this question so there might also be more effective ways to use the same package.Here I've used
iterpc
to set up an iterator, andgetall
to find all combinations of the vector based on that iterator. This seems to just report unique combinations, making it a bit nicer than finding all combinations withexpand.grid
.The
rle
function tells us about consecutive values equal to each other in a vector. Thelengths
component of the output tells us how many times each element ofvalues
is repeated:The length of
values
orlengths
will be equal to the length of the original vector only for combinations which have no consecutive repeats.You can therefore apply
rle
to each row, calculate the length ofvalues
orlengths
and keep rows fromall_seqs
where the calculated value is the same as the length ofseqs1
.all_seqs_good
has annrow
of 212, suggesting that you did indeed find all possible combinations for your example vector.Technically this is still brute forcing (except that it doesn't calculate every possible combination - only unique ones), but is fairly quick for your example. I'm not sure how well it will cope with larger vectors yet...
Edit: this does seem to fail for larger vectors. One solution would be to break larger vectors into smaller chunks, then process those chunks as above and combine them - keeping only the combinations which meet your criteria.
For example, breaking a vector of length 24 into two vectors of length 12, then combining the results can give you 200,000+ combinations which meet your critera and is pretty quick (around 1 minute for me):
You might need to re-order the starting vector for best results. For example, if
seq1
in the above example had started with "a" eight times in a row, there would be no passing solutions. For example, try the splitting up solution withseq1 <- c(rep("a", 8), rep("b", 8), rep("c", 8))
and you get no solutions back, even though there are really the same number of solutions for the random sequence.It doesn't look like you need to find every possible passing combination, but if you do then for larger vectors you'll probably need to iterate through
I
using thegetnext
function fromiterpc
, and check each one in a loop which would be very slow.