What parallel algorithms could I use to generate random permutations from a given set? Especially proposals or links to papers suitable for CUDA would be helpful.
A sequential version of this would be the Fisher-Yates shuffle.
Example:
Let S={1, 2, ..., 7} be the set of source indices. The goal is to generate n random permutations in parallel. Each of the n permutations contains each of the source indices exactly once, e.g. {7, 6, ..., 1}.
For large sets, using a sort primitive on a vector of randomized keys might be efficient enough for your needs. First, setup some vectors:
Then, each time you want to shuffle the d_cards call the pair of:
The random keys are generated from a functor that uses a seed (evaluated in host-code and copied to the kernel at launch-time) and a key number (which tabulate passes in at thread-creation time):
I have found that performance could be improved (by probably 30%) if I could figure out how to cache the allocations that thrust::sort_by_key does internally.
Any corrections or suggestions welcome.
If the length of s = s_L, a very crude way of doing this could be implemented in thrust:
http://thrust.github.com.
First, create a vector val of length s_L x n that repeats s n times.
Create a vector val_keys associate n unique keys repeated s_L times with each element of val, e.g.,
Now the fun part. create a vector of length s_L x n uniformly distributed random variables
then you can do zip iterator over val,val_keys and sort them according to U:
http://codeyarns.com/2011/04/04/thrust-zip_iterator/
both val, val_keys will be all over the place, so you have to put them back together again using thrust::stable_sort_by_key() to make sure that if val[i] and val[j] both belong to key[k] and val[i] precedes val[j] following the random sort, then in the final version val[i] should still precede val[j]. If all goes according to plan, val_keys should look just as before, but val should reflect the shuffling.
hope this link will help you http://www.codeproject.com/Articles/380399/Permutations-with-CUDA-and-OpenCL
Fisher-Yates shuffle could be parallelized. For example, 4 concurrent workers need only 3 iterations to shuffle vector of 8 elements. On first iteration they swap 0<->1, 2<->3, 4<->5, 6<->7; on second iteration 0<->2, 1<->3, 4<->5, 6<->7; and on last iteration 0<->4, 1<->5, 2<->6, 3<->7.
This could be easily implemented as CUDA
__device__
code (inspired by standard min/max reduction):Here the curand initialization code is omitted, and method
swap(int *p, int i, int j)
exchanges valuesp[i]
andp[j]
.Note that the code above has the following assumptions:
__shared__
memory of CUDA deviceTo generate more than one permutation I would suggest to utilize different CUDA blocks. If the goal is to make permutation of 7 elements (as it was mentioned in the original question) then I believe it will be faster to do it in single thread.