Question:
I have a large array with 10,000,000+ rows, and I need to shuffle each row individually. For example:
[[1,2,3]
[1,2,3]
[1,2,3]
...
[1,2,3]]
to
[[3,1,2]
[2,1,3]
[1,3,2]
...
[1,2,3]]
I'm currently using
map(numpy.random.shuffle, array)
But it's a Python (not NumPy) loop and it's taking 99% of my execution time. Sadly, the PyPy JIT doesn't implement numpypy.random, so I'm out of luck. Is there any faster way? I'm willing to use any library (pandas, scikit-learn, scipy, theano, etc.), as long as it uses a NumPy ndarray or a derivative.
If not, I suppose I'll resort to Cython or C++.
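For reference, the map call above does the same work as this plain Python loop, shuffling each row in place (a sketch with a small stand-in array):

import numpy as np

array = np.tile(np.arange(1, 4), (1000, 1))  # small stand-in for the 10,000,000-row array
for row in array:             # one Python-level call per row -- this is the bottleneck
    np.random.shuffle(row)    # shuffles the row in place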
Answer 1:
Here are some ideas:
In [9]: import numpy as np
In [10]: a=np.zeros(shape=(1000,3))
In [12]: a[:,0]=1
In [13]: a[:,1]=2
In [14]: a[:,2]=3
In [17]: %timeit map(np.random.shuffle, a)
100 loops, best of 3: 4.65 ms per loop
In [20]: import itertools
In [21]: all_perm=np.array(list(itertools.permutations([0,1,2])))
In [22]: b=all_perm[np.random.randint(0,6,size=1000)]
In [25]: %timeit (a.flatten()[(b+3*np.arange(1000)[...,np.newaxis]).flatten()]).reshape(a.shape)
1000 loops, best of 3: 393 us per loop
If there are only a few columns, then the number of possible permutations is much smaller than the number of rows in the array (in this case, with only 3 columns, there are only 6 possible permutations). A way to make it faster is to build all the permutations at once first and then rearrange each row by randomly picking one permutation from the precomputed table.
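To see why the flat-index arithmetic works, here is a tiny worked sketch (a hypothetical 2-row example; the permutation choices [1, 4] are arbitrary):

import itertools
import numpy as np

a = np.array([[10, 20, 30],
              [40, 50, 60]])
all_perm = np.array(list(itertools.permutations([0, 1, 2])))
b = all_perm[[1, 4]]   # row 0 gets permutation (0,2,1), row 1 gets (2,0,1)
# adding 3*row_number turns per-row column indices into flat indices
flat_idx = b + 3*np.arange(2)[..., np.newaxis]   # [[0,2,1], [5,3,4]]
print(a.flatten()[flat_idx.flatten()].reshape(a.shape))
# [[10 30 20]
#  [60 40 50]]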
It still appears to be about 10 times faster even at larger sizes:
# adjust a accordingly (now 1,000,000 rows)
In [32]: b=all_perm[np.random.randint(0,6,size=1000000)]
In [33]: %timeit (a.flatten()[(b+3*np.arange(1000000)[...,np.newaxis]).flatten()]).reshape(a.shape)
1 loops, best of 3: 348 ms per loop
In [34]: %timeit map(np.random.shuffle, a)
1 loops, best of 3: 4.64 s per loop
Answer 2:
If the permutations of the columns are enumerable, then you could do this:
import itertools as IT
import numpy as np
def using_perms(array):
    nrows, ncols = array.shape
    # table of all possible column permutations
    perms = np.array(list(IT.permutations(range(ncols))))
    # pick one permutation index per row
    choices = np.random.randint(len(perms), size=nrows)
    i = np.arange(nrows).reshape(-1, 1)
    # fancy indexing: row i reordered by its chosen permutation
    return array[i, perms[choices]]

N = 10**7
array = np.tile(np.arange(1, 4), (N, 1))
print(using_perms(array))
yields (something like)
[[3 2 1]
[3 1 2]
[2 3 1]
[1 2 3]
[3 1 2]
...
[1 3 2]
[3 1 2]
[3 2 1]
[2 1 3]
[1 3 2]]
Here is a benchmark comparing it to

def using_shuffle(array):
    # the original approach: a Python-level loop over the rows
    map(np.random.shuffle, array)
    return array
In [151]: %timeit using_shuffle(array)
1 loops, best of 3: 7.17 s per loop
In [152]: %timeit using_perms(array)
1 loops, best of 3: 2.78 s per loop
Edit: CT Zhu's method is faster than mine:
def using_Zhu(array):
    nrows, ncols = array.shape
    all_perm = np.array(list(IT.permutations(range(ncols))))
    b = all_perm[np.random.randint(0, all_perm.shape[0], size=nrows)]
    # offset each row's permuted column indices into the flattened array
    return (array.flatten()[(b + ncols*np.arange(nrows)[..., np.newaxis]).flatten()]
            ).reshape(array.shape)
In [177]: %timeit using_Zhu(array)
1 loops, best of 3: 1.7 s per loop
Here is a slight variation of Zhu's method which may be even a bit faster:
def using_Zhu2(array):
    nrows, ncols = array.shape
    all_perm = np.array(list(IT.permutations(range(ncols))))
    b = all_perm[np.random.randint(0, all_perm.shape[0], size=nrows)]
    # take() with a flat index array avoids some fancy-indexing overhead
    return array.take((b + ncols*np.arange(nrows)[..., np.newaxis]).ravel()).reshape(array.shape)
In [201]: %timeit using_Zhu2(array)
1 loops, best of 3: 1.46 s per loop
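As a quick sanity check (my addition, not part of the original answer), one can verify that each variant only reorders values within rows:

arr = np.tile(np.arange(1, 4), (1000, 1))
out = using_Zhu2(arr)
# every output row, once sorted, should match the original [1, 2, 3] row
assert np.array_equal(np.sort(out, axis=1), arr)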
Answer 3:
You can also try the apply function in pandas:

import numpy as np
import pandas as pd

df = pd.DataFrame(array)
# np.random.shuffle shuffles each row in place and returns None,
# so `or x` makes the lambda return the (now shuffled) row
df = df.apply(lambda x: np.random.shuffle(x) or x, axis=1)
And then extract the NumPy array from the DataFrame:

print(df.values)
Answer 4:
I believe I have an alternate, equivalent strategy, building upon the previous answers:
import itertools
import numpy as np

# original sequence
a0 = np.arange(3) + 1
# length of the original sequence
L = a0.shape[0]
# number of random samples/shuffles (randint's size must be an integer)
N_samp = 10**4
# from above
all_perm = np.array(list(itertools.permutations(np.arange(L))))
b = all_perm[np.random.randint(0, len(all_perm), size=N_samp)]
# index a0 with b for each row of b and collapse down to the expected dimension
a_samp = a0[np.newaxis, b][0]
I'm not sure how this compares performance-wise, but I like it for its readability.
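For what it's worth, a quick sanity check of the result (my addition, not part of the original answer):

print(a_samp.shape)  # (10000, 3)
# every sampled row should be a permutation of a0
assert np.array_equal(np.sort(a_samp, axis=1), np.tile(a0, (N_samp, 1)))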