Find indexes of matching rows in two 2-D arrays

2020-06-11 10:08发布

问题:

Suppose that I have two 2-D arrays as follows:

array([[3, 3, 1, 0],
       [2, 3, 1, 3],
       [0, 2, 3, 1],
       [1, 0, 2, 3],
       [3, 1, 0, 2]], dtype=int8)

array([[0, 3, 3, 1],
       [0, 2, 3, 1],
       [1, 0, 2, 3],
       [3, 1, 0, 2],
       [3, 3, 1, 0]], dtype=int8)

Some rows in each array have a corresponding row that matches by value (but not necessarily by index) in the other array, and some don't.

I would like to find an efficient way to return pairs of indexes in the two arrays that correspond to matching rows. If they were to be tuples I would expect to return

(0,4)
(2,1)
(3,2)
(4,3)

回答1:

This is an all numpy solution - not that is necessarily better than an iterative Python one. It still has to look at all combinations.

In [53]: np.array(np.all((x[:,None,:]==y[None,:,:]),axis=-1).nonzero()).T.tolist()
Out[53]: [[0, 4], [2, 1], [3, 2], [4, 3]]

The intermediate array is (5,5,4). The np.all reduces it to:

array([[False, False, False, False,  True],
       [False, False, False, False, False],
       [False,  True, False, False, False],
       [False, False,  True, False, False],
       [False, False, False,  True, False]], dtype=bool)

The rest is just extracting the indices where this is True

In crude tests, this times at 47.8 us; the other answer with the L1 dictionary at 38.3 us; and a third with a double loop at 496 us.



回答2:

You can use the void data type trick to use 1D functions on the rows of your two arrays. a_view and b_view are 1D vectors, each entry representing a full row. I have then chosen to sort an array and use np.searchsorted to find the items of the other array in that one. If the array we sort has length m and the other has length n, sorting takes time m * log(m), and the binary searching that np.searchsorted does takes time n * log(m), for a total of (n + m) * log(m). You therefore want to sort the shortest of the two arrays:

def find_rows(a, b):
    dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))

    a_view = np.ascontiguousarray(a).view(dt).ravel()
    b_view = np.ascontiguousarray(b).view(dt).ravel()

    sort_b = np.argsort(b_view)
    where_in_b = np.searchsorted(b_view, a_view,
                                 sorter=sort_b)
    where_in_b = np.take(sort_b, where_in_b)
    which_in_a = np.take(b_view, where_in_b) == a_view
    where_in_b = where_in_b[which_in_a]
    which_in_a = np.nonzero(which_in_a)[0]
    return np.column_stack((which_in_a, where_in_b))

With a and b your two sample arrays:

In [14]: find_rows(a, b)
Out[14]: 
array([[0, 4],
       [2, 1],
       [3, 2],
       [4, 3]], dtype=int64)

In [15]: %timeit find_rows(a, b)
10000 loops, best of 3: 29.7 us per loop

On my system the dictionary approach clocks faster at about 22 us for your test data, but with arrays of 1000x4, this numpy approach is about 6x faster than the pure Python one (483 us vs 2.54 ms).



回答3:

I can't think of a numpy specific way to do it, but here's what I would do with regular lists:

>>> L1= [[3, 3, 1, 0],
...        [2, 3, 1, 3],
...        [0, 2, 3, 1],
...        [1, 0, 2, 3],
...        [3, 1, 0, 2]]
>>> L2 = [[0, 3, 3, 1],
...        [0, 2, 3, 1],
...        [1, 0, 2, 3],
...        [3, 1, 0, 2],
...        [3, 3, 1, 0]]
>>> L1 = {tuple(row):i for i,row in enumerate(L1)}
>>> answer = []
>>> for i,row in enumerate(L2):
...   if tuple(row) in L1:
...     answer.append((L1[tuple(row)], i))
... 
>>> answer
[(2, 1), (3, 2), (4, 3), (0, 4)]


标签: python numpy