Combinations without repeat and ordering matters o

2019-04-03 04:38发布

问题:

For a 1D NumPy array, I am looking to get the combinations without the same elements being repeated in a combination. The order is important. So, [a,b] and [b,a] would be two distinct combinations. Since we don't want repeats, [a,a] and [b,b] aren't valid combinations. For simplicity, let's keep it to two elements per combination. Thus, the output would be a 2D NumPy array with 2 columns.

The desired result would be essentially same as itertools.product output except that we need to mask out the combinations that are repeated. As such, we can solve it for a sample case, like so -

In [510]: import numpy as np

In [511]: a = np.array([4,2,9,1,3])

In [512]: from itertools import product

In [513]: np.array(list(product(a,repeat=2)))[~np.eye(len(a),dtype=bool).ravel()]
Out[513]: 
array([[4, 2],
       [4, 9],
       [4, 1],
       [4, 3],
       [2, 4],
       [2, 9],
       [2, 1],
       [2, 3],
       [9, 4],
       [9, 2],
       [9, 1],
       [9, 3],
       [1, 4],
       [1, 2],
       [1, 9],
       [1, 3],
       [3, 4],
       [3, 2],
       [3, 9],
       [3, 1]])

But, creating that huge array and then masking out and hence not using some elements, doesn't look too efficient to me.

That got me thinking if numpy.ndarray.strides could be leveraged here. I have one solution with that idea in mind, which I will be posting as an answer post, but would love to see other efficient ones.

In terms of usage - We come across these cases with adjacency matrices among others and I thought it would be good to solve such a problem. For easier and efficient plug-n-play into other problems, it would be nice to have the final output that's not a view of some intermediate array.

回答1:

Seems like np.lib.stride_tricks.as_strided could be used to maximize the efficiency of views and we delay the copying until the final stage, where we assign into an initialized array. The implementation would be in two steps, with some work needed for the second column (as shown in the sample case in the question), which we are calling as one-cold (fancy name that denotes one element missing per sequence / is cold in a each interval of len(input_array) - 1)

def onecold(a):
    n = len(a)
    s = a.strides[0]
    strided = np.lib.stride_tricks.as_strided
    b = np.concatenate((a,a[:-1]))
    return strided(b[1:], shape=(n-1,n), strides=(s,s))

To showcase, onecold with a sample case -

In [563]: a
Out[563]: array([4, 2, 9, 1, 3])

In [564]: onecold(a).reshape(len(a),-1)
Out[564]: 
array([[2, 9, 1, 3],
       [4, 9, 1, 3],
       [4, 2, 1, 3],
       [4, 2, 9, 3],
       [4, 2, 9, 1]])

To solve the original problem, we will use it like so -

def combinations_without_repeat(a):
    n = len(a)
    out = np.empty((n,n-1,2),dtype=a.dtype)
    out[:,:,0] = np.broadcast_to(a[:,None], (n, n-1))
    out.shape = (n-1,n,2)
    out[:,:,1] = onecold(a)
    out.shape = (-1,2)
    return out  

Sample run -

In [574]: a
Out[574]: array([4, 2, 9, 1, 3])

In [575]: combinations_without_repeat(a)
Out[575]: 
array([[4, 2],
       [4, 9],
       [4, 1],
       [4, 3],
       [2, 4],
       [2, 9],
       [2, 1],
       [2, 3],
       [9, 4],
       [9, 2],
       [9, 1],
       [9, 3],
       [1, 4],
       [1, 2],
       [1, 9],
       [1, 3],
       [3, 4],
       [3, 2],
       [3, 9],
       [3, 1]])

Seems quite efficient for a 1000 elements array of ints -

In [578]: a = np.random.randint(0,9,(1000))

In [579]: %timeit combinations_without_repeat(a)
100 loops, best of 3: 2.35 ms per loop

Would love to see others!



回答2:

"It would be essentially same as itertools.product output, expect that we need to mask out the combinations that are repeated." Actually, what you want is itertools.permutations:

In [7]: import numpy as np

In [8]: from itertools import permutations

In [9]: a = np.array([4,2,9,1,3])

In [10]: list(permutations(a, 2))
Out[10]: 
[(4, 2),
 (4, 9),
 (4, 1),
 (4, 3),
 (2, 4),
 (2, 9),
 (2, 1),
 (2, 3),
 (9, 4),
 (9, 2),
 (9, 1),
 (9, 3),
 (1, 4),
 (1, 2),
 (1, 9),
 (1, 3),
 (3, 4),
 (3, 2),
 (3, 9),
 (3, 1)]


回答3:

Benchmarking Post

Posting the performance numbers/figures for the proposed approaches thus far in this wiki-post.

System config :

NumPy version         : 1.13.3
Python version        : 2.7.12 (GCC 5.4.0)

Operating System: Ubuntu 16.04
RAM: 16GB
CPU Model: Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz (# Cores=4, # Threads=8)

The benchmarking setup would be :

import numpy as np
import perfplot
from itertools import permutations

# https://stackoverflow.com/a/48234170/ @Divakar
def onecold(a):
    n = len(a)
    s = a.strides[0]
    strided = np.lib.stride_tricks.as_strided
    b = np.concatenate((a,a[:-1]))
    return strided(b[1:], shape=(n-1,n), strides=(s,s))

# https://stackoverflow.com/a/48234170/ @Divakar
def combinations_without_repeat(a):
    n = len(a)
    out = np.empty((n,n-1,2),dtype=a.dtype)
    out[:,:,0] = np.broadcast_to(a[:,None], (n, n-1))
    out.shape = (n-1,n,2)
    out[:,:,1] = onecold(a)
    out.shape = (-1,2)
    return out

# https://stackoverflow.com/a/48234349/ @Warren Weckesser
def itertools_permutations(a):
    return np.array(list(permutations(a, 2)))

perfplot.show(
        setup=lambda n: np.random.rand(n),
        n_range=[10,20,50,100,200,500,1000], # dataset sizes
        kernels=[combinations_without_repeat, itertools_permutations],
        logx=True,
        logy=True,
        )

The performance figure :