How to check if all elements of a numpy array are in another numpy array

Posted on 2019-06-26 06:30

Question:

I have two 2D numpy arrays, for example:

A = numpy.array([[1, 2, 4, 8], [16, 32, 32, 8], [64, 32, 16, 8]])

and

B = numpy.array([[1, 2], [32, 32]])

I want all rows of A that contain every element of at least one row of B. If a row of B contains the same element twice, a matching row of A must contain it at least twice as well. For my example, I want to get this:

A_filtered = [[1, 2, 4, 8], [16, 32, 32, 8]]

I have control over the value representation, so I chose numbers whose binary representation has a single 1 bit (for example 0b00000001, 0b00000010, etc.). This way I can easily check whether all types of values from a row of B are present in a row of A using np.logical_or.reduce(), but I cannot check whether a row of A contains at least as many occurrences of each repeated element. I was really hoping to avoid a plain for loop and deep copies of the arrays, as performance is very important to me.
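For illustration, here is a minimal sketch of that per-row presence check, using np.bitwise_or.reduce on the bit-encoded values (the arrays are the ones from above; it also shows the limitation with duplicates):

import numpy as np

A = np.array([[1, 2, 4, 8], [16, 32, 32, 8], [64, 32, 16, 8]])
hand = np.array([32, 32])                    # one row of B

row_masks = np.bitwise_or.reduce(A, axis=1)  # which bits occur at least once per row
hand_mask = np.bitwise_or.reduce(hand)       # bits required by the hand

print((row_masks & hand_mask) == hand_mask)  # [False  True  True]
# The last row "matches" even though it contains only one 32 -- the duplicate is lost.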

How can I do that in numpy in an efficient way?


Update:

A solution from here might work, but performance is a big concern for me: A can be really big (>300,000 rows) and B moderate (>30 rows):

[set(row).issuperset(hand) for row in A.tolist() for hand in B.tolist()]

Update 2:

The set() solution does not work, since set() drops all duplicate values.
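For example, with the third row of A from above, which should not match [32, 32]:

row = A.tolist()[2]    # [64, 32, 16, 8] -- contains 32 only once
hand = [32, 32]        # requires 32 twice
print(set(row).issuperset(hand))   # True, because the hand collapses to {32}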

Answer 1:

I hope I understood your question correctly; at least it works for the problem you described. If the order of the output should stay the same as in the input, replace the in-place sort with a sort on a copy.

The code looks quite ugly, but it should perform well and shouldn't be too hard to understand.

Code

import time
import numba as nb
import numpy as np

@nb.njit(fastmath=True, parallel=True)
def filter(A, B):
  # Boolean mask: True for rows of A that contain at least one row of B as a multiset.
  iFilter = np.zeros(A.shape[0], dtype=nb.bool_)

  for i in nb.prange(A.shape[0]):
    break_loop = False

    for j in range(B.shape[0]):
      # Both A[i] and B[j] are sorted, so a single forward pass over A[i]
      # can match the elements of B[j] in order, duplicates included.
      ind_to_B = 0
      for k in range(A.shape[1]):
        if A[i, k] == B[j, ind_to_B]:
          ind_to_B += 1

        if ind_to_B == B.shape[1]:
          # Every element of B[j] was found in A[i] -> keep the row.
          iFilter[i] = True
          break_loop = True
          break

      if break_loop:
        break

  return A[iFilter, :]

Measuring performance

####First call has some compilation overhead####
A=np.random.randint(low=0, high=60, size=300_000*4).reshape(300_000,4)
B=np.random.randint(low=0, high=60, size=30*2).reshape(30,2)

t1=time.time()
#At first sort the arrays
A.sort()
B.sort()
A_filtered=filter(A,B)
print(time.time()-t1)

####Let's measure the second call too####
A=np.random.randint(low=0, high=60, size=300_000*4).reshape(300_000,4)
B=np.random.randint(low=0, high=60, size=30*2).reshape(30,2)

t1=time.time()
#At first sort the arrays
A.sort()
B.sort()
A_filtered=filter(A,B)
print(time.time()-t1)

Results

46 ms after the first run on a dual-core notebook (sorting included)
32 ms (sorting excluded)
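For reference, a quick sanity check on the example arrays from the question (note that the in-place sort reorders the elements within each row):

A = np.array([[1, 2, 4, 8], [16, 32, 32, 8], [64, 32, 16, 8]])
B = np.array([[1, 2], [32, 32]])

A.sort()   # the matching loop assumes sorted rows
B.sort()
print(filter(A, B))
# [[ 1  2  4  8]
#  [ 8 16 32 32]]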


Answer 2:

I think this should work:

First, encode the data as follows (this assumes a limited number of 'tokens', as your binary scheme also seems to imply):

Make A of shape [n_rows, n_tokens] and dtype int8, where each element counts how often that token occurs in the row. Encode B in the same way, with shape [n_hands, n_tokens].

This allows for a single vectorized expression of your output: matches = (A[None, :, :] >= B[:, None, :]).all(axis=-1). (Exactly how to map this matches array to your desired output format is left as an exercise to the reader, since the question leaves it undefined for multiple matches.)
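As a rough sketch of that encoding and comparison (the count_encode helper and the token mapping via np.unique/np.searchsorted are illustrative, not part of the answer):

import numpy as np

A = np.array([[1, 2, 4, 8], [16, 32, 32, 8], [64, 32, 16, 8]])
B = np.array([[1, 2], [32, 32]])

# Map the raw values to token indices 0..n_tokens-1 (assumes a small, known vocabulary).
tokens = np.unique(np.concatenate([A.ravel(), B.ravel()]))

def count_encode(X):
    # For every row, count how often each token occurs -> shape [n_rows, n_tokens].
    idx = np.searchsorted(tokens, X)
    counts = np.zeros((X.shape[0], tokens.size), dtype=np.int8)
    np.add.at(counts, (np.arange(X.shape[0])[:, None], idx), 1)
    return counts

A_counts = count_encode(A)   # [n_rows, n_tokens]
B_counts = count_encode(B)   # [n_hands, n_tokens]

# Row i of A satisfies hand j iff it has at least as many of every token.
matches = (A_counts[None, :, :] >= B_counts[:, None, :]).all(axis=-1)   # [n_hands, n_rows]

A_filtered = A[matches.any(axis=0)]
print(A_filtered)   # [[ 1  2  4  8], [16 32 32  8]]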

But we are talking about >10 MB of memory per token here. Even with 32 tokens that should not be unthinkable; still, in a situation like this it tends to be better not to vectorize the loop over n_tokens or n_hands, or both. For loops are fine for small n, or when there is enough work in the body that the looping overhead is insignificant.
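Using the same A_counts / B_counts encoding as in the sketch above, the variant that keeps the Python loop over n_hands (and thus only ever holds one [n_rows, n_tokens] comparison in memory at a time) might look like this:

mask = np.zeros(A_counts.shape[0], dtype=bool)
for hand in B_counts:                        # n_hands is moderate, so a Python loop is fine
    mask |= (A_counts >= hand).all(axis=-1)
A_filtered = A[mask]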

As long as n_tokens and n_hands remain moderate, I think this will be the fastest solution if you stay within the realm of pure Python and numpy.