I have a large array with 10,000,000+ rows, and I need to shuffle each of those rows individually. For example:
[[1,2,3]
[1,2,3]
[1,2,3]
...
[1,2,3]]
to
[[3,1,2]
[2,1,3]
[1,3,2]
...
[1,2,3]]
I'm currently using
map(numpy.random.shuffle, array)
But it's a Python (not NumPy) loop, and it's taking 99% of my execution time. Sadly, the PyPy JIT doesn't implement numpypy.random, so I'm out of luck. Is there any faster way? I'm willing to use any library (pandas, scikit-learn, scipy, theano, etc.), as long as it uses a NumPy ndarray or a derivative.
If not, I suppose I'll resort to Cython or C++.
Here are some ideas:
If there are only a few columns, then the number of possible permutations is much smaller than the number of rows in the array (in this case, with only 3 columns, there are just 3! = 6 possible permutations). A way to make it faster is to generate all the permutations up front and then rearrange each row by randomly picking one of them.
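A minimal sketch of this idea (variable names are mine):

import itertools
import numpy as np

array = np.tile([1, 2, 3], (10, 1))  # stand-in for the real 10,000,000-row array

# All ncols! column permutations, built once (3! = 6 of them for 3 columns).
perms = np.array(list(itertools.permutations(range(array.shape[1]))))
# One randomly chosen permutation index per row.
choice = np.random.randint(len(perms), size=array.shape[0])
# Fancy indexing: row i is reordered by perms[choice[i]].
shuffled = array[np.arange(array.shape[0])[:, None], perms[choice]]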
It still appears to be about 10 times faster, even with larger dimensions.
If the permutations of the columns are enumerable, then you could do this:
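A sketch along these lines, assuming the permutations are enumerated with itertools (function and variable names are mine); it loops over the few permutations rather than the many rows:

import itertools
import numpy as np

def using_perms(arr):
    nrows, ncols = arr.shape
    perms = np.array(list(itertools.permutations(range(ncols))))
    # Assign each row one of the ncols! permutations at random.
    choice = np.random.randint(len(perms), size=nrows)
    result = np.empty_like(arr)
    # Loop over the (few) permutations instead of the (many) rows.
    for i, p in enumerate(perms):
        mask = choice == i
        result[mask] = arr[mask][:, p]
    return result

print(using_perms(np.tile([1, 2, 3], (5, 1))))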
yields (something like)
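(the output is random; this is just one illustrative run)

[[2 1 3]
 [3 1 2]
 [1 2 3]
 [2 3 1]
 [1 3 2]]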
Here is a benchmark comparing it to map(numpy.random.shuffle, array):
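One way such a benchmark could be set up (the array size and repeat count are my choices; list() forces Python 3's lazy map, and using_perms is the sketch from above):

import timeit
import numpy as np

array = np.tile([1, 2, 3], (100000, 1))

# np.random.shuffle works in place, one row view at a time.
t_map = timeit.timeit(lambda: list(map(np.random.shuffle, array)), number=3)
t_perms = timeit.timeit(lambda: using_perms(array), number=3)
print(f"map(shuffle): {t_map:.3f}s   using_perms: {t_perms:.3f}s")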
Edit: CT Zhu's method is faster than mine.
Here is a slight variation of Zhu's method which may be even a bit faster:
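One plausible form of such a variation indexes into the flattened array instead of using a 2-D fancy index (names are mine):

import itertools
import numpy as np

def zhus_method_variation(arr):
    nrows, ncols = arr.shape
    perms = np.array(list(itertools.permutations(range(ncols))))
    b = perms[np.random.randint(len(perms), size=nrows)]
    # Element (r, c) of the result lives at flat position r*ncols + b[r, c].
    return arr.flat[b + ncols * np.arange(nrows)[:, None]]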
I believe I have an alternate, equivalent strategy, building upon the previous answers:
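A sketch of one such equivalent formulation; np.take_along_axis is my choice here and requires NumPy 1.15+:

import itertools
import numpy as np

array = np.tile([1, 2, 3], (10, 1))
perms = np.array(list(itertools.permutations(range(array.shape[1]))))
# One randomly chosen column order per row.
order = perms[np.random.randint(len(perms), size=array.shape[0])]
# Gather each row's elements in its chosen order.
shuffled = np.take_along_axis(array, order, axis=1)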
I'm not sure how this compares performance-wise, but I like it for its readability.
You can also try the apply function in pandas.
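A sketch of that approach (result_type='expand' is my addition so that recent pandas returns a DataFrame rather than a Series of arrays):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.tile([1, 2, 3], (10, 1)))
# np.random.permutation returns a shuffled copy of each row.
shuffled_df = df.apply(np.random.permutation, axis=1, result_type='expand')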
And then extract the NumPy array from the DataFrame:
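Continuing from the hypothetical shuffled_df above:

shuffled = shuffled_df.values  # or shuffled_df.to_numpy() on modern pandas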