I have two 2D numpy arrays, for example:
A = numpy.array([[1, 2, 4, 8], [16, 32, 32, 8], [64, 32, 16, 8]])
and
B = numpy.array([[1, 2], [32, 32]])
I want to have all lines from A
where I can find all elements from any of the lines of B
. Where there are 2 of the same element in a row of B
, lines from A
must contain at least 2 as well. In case of my example, I want to achieve this:
A_filtered = [[1, 2, 4, 8], [16, 32, 32, 8]]
I have control over the values representation so I chose numbers where the binary representation has only one place with 1
(example: 0b00000001
and 0b00000010
, etc...) This way I can easily check if all type of values are in the row by using np.logical_or.reduce()
function, but I cannot check that the number of the same elements are bigger or equal in a row of A
. I was really hoping that I could avoid simple for
loop and deep copies of the arrays as the performance is a very important aspect for me.
How can I do that in numpy in an efficient way?
Update:
A solution from here may work, but I think the performance is a big concern for me, the A
can be really big (>300000 rows) and B
can be moderate (>30):
[set(row).issuperset(hand) for row in A.tolist() for hand in B.tolist()]
Update 2:
The set()
solution is not working since the set()
drops all duplicated values.
I hope I got your question right. At least it works with the problem you described in your question. If the order of the output should stay the same as the input, change the inplace-sort.
The code looks quite ugly, but should perform well and shouldn't be to hard to understand.
Code
Measuring performance
Results
I think this should work:
First, encode the data as follows (this assumes a limited number of 'tokens', as your binary scheme also seems to imply):
Make A shape [n_rows, n_tokens], dtype int8, where each element counts the number of tokens. Encode B in the same way, with shape [n_hands, n_tokens]
This allows for a single vectorized expression of your output; matches = (A[None, :, :] >= B[:, None, :]).all(axis=-1). (exactly how to map this matches array to your desired output format is left as an excerise to the reader since the question leaves it undefined for multiple matches).
But we are talking > 10Mbyte of memory per token here. Even with 32 tokens that should not unthinkable; but in a situation like this it tends to be better to not vectorize the loop over n_tokens or n_hands, or both; for loops are fine for small n, or if there is sufficient work to be done in the body, such that the looping overhead is insignificant.
As long as n_tokens and n_hands remain moderate, I think this will be the fastest solution, if staying within the realm of pure python and numpy.