Fast column shuffle of each row in NumPy

Posted 2020-07-03 03:38

I have a large array with 10,000,000+ rows, and I need to shuffle each row individually. For example:

[[1,2,3]
 [1,2,3]
 [1,2,3]
 ...
 [1,2,3]]

to

[[3,1,2]
 [2,1,3]
 [1,3,2]
 ...
 [1,2,3]]

I'm currently using

map(numpy.random.shuffle, array)

But that is a Python (not NumPy) loop, and it's taking 99% of my execution time. Sadly, the PyPy JIT doesn't implement numpypy.random, so I'm out of luck. Is there a faster way? I'm willing to use any library (pandas, scikit-learn, scipy, theano, etc.) as long as it uses a NumPy ndarray or a derivative.

If not, I suppose I'll resort to Cython or C++.
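
For reference, here is a minimal sketch of what that map call amounts to (an illustrative addition, not part of the original question). Note that on Python 3, map is lazy, so it has to be forced with list() or an explicit loop; also, NumPy 1.20+ ships Generator.permuted, which shuffles each row independently in compiled code, though it postdates this question:

import numpy as np

# Small stand-in for the 10,000,000-row array.
array = np.tile(np.arange(1, 4), (1000, 1))

# Explicit per-row loop: equivalent to map(np.random.shuffle, array),
# but behaves the same on Python 2 and 3 (map is lazy on Python 3).
for row in array:
    np.random.shuffle(row)

# NumPy >= 1.20 only: shuffle along axis=1, each row independently, in C.
rng = np.random.default_rng()
shuffled = rng.permuted(array, axis=1)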

4 Answers
等我变得足够好 · 2020-07-03 03:56

Here are some ideas:

In [10]: a=np.zeros(shape=(1000,3))

In [12]: a[:,0]=1

In [13]: a[:,1]=2

In [14]: a[:,2]=3

In [17]: %timeit map(np.random.shuffle, a)
100 loops, best of 3: 4.65 ms per loop

In [21]: all_perm=np.array((list(itertools.permutations([0,1,2]))))

In [22]: b=all_perm[np.random.randint(0,6,size=1000)]

In [25]: %timeit (a.flatten()[(b+3*np.arange(1000)[...,np.newaxis]).flatten()]).reshape(a.shape)
1000 loops, best of 3: 393 us per loop

If there are only a few columns, the number of possible column permutations is much smaller than the number of rows in the array (here, with only 3 columns, there are just 6 possible permutations). So one way to make it faster is to enumerate all the permutations up front and then rearrange each row by randomly picking one of them (a packaged version of this idea follows the timings below).

It still appears to be about 10 times faster even at a larger size:

# adjust a accordingly (re-create a with 1,000,000 rows)
In [32]: b=all_perm[np.random.randint(0,6,size=1000000)]

In [33]: %timeit (a.flatten()[(b+3*np.arange(1000000)[...,np.newaxis]).flatten()]).reshape(a.shape)
1 loops, best of 3: 348 ms per loop

In [34]: %timeit map(np.random.shuffle, a)
1 loops, best of 3: 4.64 s per loop
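
Packaged as a reusable function (a sketch added here, not part of the original answer), the same index arithmetic looks like this, with the hard-coded 3 replaced by ncols:

import itertools
import numpy as np

def shuffle_rows_by_lookup(array):
    # Shuffle each row by picking a random precomputed column permutation.
    nrows, ncols = array.shape
    # All ncols! possible column orders, computed once.
    all_perm = np.array(list(itertools.permutations(range(ncols))))
    # One randomly chosen permutation per row.
    b = all_perm[np.random.randint(0, len(all_perm), size=nrows)]
    # Turn (row, permuted column) pairs into flat row-major indices and gather.
    flat_idx = (b + ncols * np.arange(nrows)[:, np.newaxis]).ravel()
    return array.ravel()[flat_idx].reshape(array.shape)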
迷人小祖宗 · 2020-07-03 04:12

If the number of column permutations is small enough to enumerate, you could do this:

import itertools as IT
import numpy as np

def using_perms(array):
    nrows, ncols = array.shape
    perms = np.array(list(IT.permutations(range(ncols))))
    choices = np.random.randint(len(perms), size=nrows)
    i = np.arange(nrows).reshape(-1, 1)
    return array[i, perms[choices]]

N = 10**7
array = np.tile(np.arange(1,4), (N,1))
print(using_perms(array))

yields (something like)

[[3 2 1]
 [3 1 2]
 [2 3 1]
 [1 2 3]
 [3 1 2]
 ...
 [1 3 2]
 [3 1 2]
 [3 2 1]
 [2 1 3]
 [1 3 2]]

Here is a benchmark comparing it to

def using_shuffle(array):
    # list() forces evaluation; on Python 3, a bare map() is lazy and would do nothing
    list(map(np.random.shuffle, array))
    return array

In [151]: %timeit using_shuffle(array)
1 loops, best of 3: 7.17 s per loop

In [152]: %timeit using_perms(array)
1 loops, best of 3: 2.78 s per loop

Edit: CT Zhu's method is faster than mine:

def using_Zhu(array):
    nrows, ncols = array.shape
    all_perm = np.array(list(IT.permutations(range(ncols))))
    b = all_perm[np.random.randint(0, all_perm.shape[0], size=nrows)]
    return (array.flatten()[(b + ncols*np.arange(nrows)[..., np.newaxis]).flatten()]
            ).reshape(array.shape)

In [177]: %timeit using_Zhu(array)
1 loops, best of 3: 1.7 s per loop

Here is a slight variation of Zhu's method which may be even a bit faster:

def using_Zhu2(array):
    nrows, ncols = array.shape
    all_perm = np.array(list(IT.permutations(range(ncols))))
    b = all_perm[np.random.randint(0, all_perm.shape[0], size=nrows)]
    return array.take((b + ncols*np.arange(nrows)[..., np.newaxis]).ravel()).reshape(array.shape)

In [201]: %timeit using_Zhu2(array)
1 loops, best of 3: 1.46 s per loop
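
As a quick sanity check (an addition, not from the answer), each output row should be a permutation of the corresponding input row:

import numpy as np

small = np.arange(30).reshape(10, 3)   # 10 rows, each already sorted
shuffled = using_Zhu2(small)
# Sorting each shuffled row should recover the original (sorted) rows.
assert np.array_equal(np.sort(shuffled, axis=1), small)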
爷的心禁止访问 · 2020-07-03 04:14

I believe I have an alternate, equivalent strategy, building upon the previous answers:

import itertools
import numpy as np

# original sequence
a0 = np.arange(3) + 1

# length of original sequence
L = a0.shape[0]

# number of random samples/shuffles (an int, as required by size=)
N_samp = 10000

# from above: all L! permutations, then one random pick per sample
all_perm = np.array(list(itertools.permutations(np.arange(L))))
b = all_perm[np.random.randint(0, len(all_perm), size=N_samp)]

# index a0 with b for each row of b and collapse down to the expected dimension
a_samp = a0[np.newaxis, b][0]

I'm not sure how this compares performance-wise, but I like it for its readability.
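
One small aside (an addition, not from the answer): the np.newaxis indexing is equivalent to plain fancy indexing with b, which reads a bit more directly:

# Continuing from the snippet above: a0[b] gathers one permuted copy of a0
# per row of b, exactly what a0[np.newaxis, b][0] produces.
a_samp_direct = a0[b]
assert np.array_equal(a_samp, a_samp_direct)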

叛逆 · 2020-07-03 04:16

You can also try the apply function in pandas

import numpy as np
import pandas as pd

df = pd.DataFrame(array)
# np.random.shuffle returns None, so "or x" hands the shuffled row back to apply
df = df.apply(lambda x: np.random.shuffle(x) or x, axis=1)

And then extract the NumPy array from the DataFrame:

print(df.values)