I have two structured 2D numpy arrays which are equal in principle, meaning

A = numpy.array([[a1,b1,c1],
                 [a2,b2,c2],
                 [a3,b3,c3],
                 [a4,b4,c4]])

B = numpy.array([[a2,b2,c2],
                 [a4,b4,c4],
                 [a3,b3,c3],
                 [a1,b1,c1]])
Not in the sense of
numpy.array_equal(A,B) # False
numpy.array_equiv(A,B) # False
numpy.equal(A,B) # ndarray of True and False
But in the sense that one array (A) is the original, and in the other one (B) the data has been shuffled along one axis (could be along the rows or the columns). What is an efficient way to sort/shuffle B to match A, or alternatively to sort A to match B? An equality check is not actually important, as long as both arrays are shuffled to match each other. A, and hence B, have unique rows.
I tried the view method to sort both the arrays, like so

def sort2d(A):
    A_view = np.ascontiguousarray(A).view(np.dtype((np.void,
                 A.dtype.itemsize * A.shape[1])))
    A_view.sort()
    return A_view.view(A.dtype).reshape(-1, A.shape[1])
but that doesn't work here, apparently. This operation needs to be performed on really large arrays, so performance and scalability are critical.
Since either one of the arrays could be shuffled to match the other, nothing stops us from re-arranging both. Using Jaime's answer, we can vstack both arrays and find the unique rows. The inverse indices returned by np.unique then essentially give the desired mapping (as the arrays contain no duplicate rows). Let's first define a unique2d function for convenience; we can then define our mapping as follows:
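The original code blocks were not preserved here; a sketch of how unique2d and the mapping might look (the void-view trick is the same one used elsewhere in this thread, and the sample data is illustrative):

```python
import numpy as np

def unique2d(a, return_inverse=False):
    # view each row as one opaque np.void element so np.unique works row-wise
    a = np.ascontiguousarray(a)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    a_void = a.view(void_dt).ravel()
    if return_inverse:
        u, inv = np.unique(a_void, return_inverse=True)
        return u.view(a.dtype).reshape(-1, a.shape[1]), inv
    return np.unique(a_void).view(a.dtype).reshape(-1, a.shape[1])

# illustrative data: B is A with its rows shuffled
A = np.arange(12).reshape(4, 3)
B = A[[2, 0, 3, 1]]

# inverse indices of the stacked array give a mapping for A and one for B
_, inv = unique2d(np.vstack((A, B)), return_inverse=True)
mapA, mapB = inv[:len(A)], inv[len(A):]
```

Re-arranging A by np.argsort(mapA) and B by np.argsort(mapB) puts both arrays into the same (unique-row) order.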
Verify:
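The verification code was also stripped; a self-contained sketch that re-derives the mapping inline and checks it (data is illustrative):

```python
import numpy as np

# illustrative data: B is A with its rows shuffled
A = np.arange(12).reshape(4, 3)
B = A[[2, 0, 3, 1]]

# same vstack + unique trick, written out inline
stacked = np.ascontiguousarray(np.vstack((A, B)))
void_dt = np.dtype((np.void, stacked.dtype.itemsize * stacked.shape[1]))
_, inv = np.unique(stacked.view(void_dt).ravel(), return_inverse=True)
mapA, mapB = inv[:len(A)], inv[len(A):]

# both arrays, re-arranged by their mappings, should now be identical
print(np.array_equal(A[np.argsort(mapA)], B[np.argsort(mapB)]))  # → True
```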
Benchmark against @ali_m's solution:
Although there may be room for improvement, the current solution is 2-3 times slower than @ali_m's, and perhaps a little messier as well, since both arrays need to be re-arranged. Just thought this could be an alternate solution.
Based on your example, it seems that you have shuffled all of the columns simultaneously, such that there is a vector of row indices that maps A→B. Here's a toy example:
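The toy example from the original answer was not preserved; a minimal stand-in (values are illustrative):

```python
import numpy as np

# A has unique rows; B is A with its rows shuffled by a hidden permutation
A = np.arange(12).reshape(4, 3)
shuffle = np.array([2, 0, 3, 1])
B = A[shuffle]
```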
We want to recover a set of indices, idx, such that A[idx] == B. This will be a unique mapping if and only if A and B contain no repeated rows.

One efficient* approach would be to find the indices that would lexically sort the rows in A, then find where each row in B would fall within the sorted version of A. A useful trick is to view A and B as 1D arrays using an np.void dtype that treats each row as a single element. We can then use np.searchsorted to perform a binary search for where each row in B would fall within the sorted version of A, and the mapping from A→B can be expressed as the composite A→As→B:
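The original code was lost in extraction; a sketch of the whole chain (void view, sort of A, binary search for B's rows, then the composite mapping), with helper names of my own:

```python
import numpy as np

def as_void_rows(x):
    # view each row of a 2D array as a single opaque np.void element
    x = np.ascontiguousarray(x)
    void_dt = np.dtype((np.void, x.dtype.itemsize * x.shape[1]))
    return x.view(void_dt).ravel()

A = np.arange(12).reshape(4, 3)
B = A[[2, 0, 3, 1]]                           # A with its rows shuffled

A_void, B_void = as_void_rows(A), as_void_rows(B)
order = np.argsort(A_void)                    # A -> As (rows of A in sorted order)
pos = np.searchsorted(A_void[order], B_void)  # where each B row falls within As
idx = order[pos]                              # composite mapping A -> As -> B

print((A[idx] == B).all())  # → True
```

Note that the np.void comparison is bytewise rather than numeric, but any consistent ordering is enough for the searchsorted lookup to work.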
If A and B contain no repeated rows, the inverse mapping from B→A can also be obtained by inverting idx:
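A sketch of that inversion (the argsort-based permutation inverse is my phrasing; data is illustrative):

```python
import numpy as np

A = np.arange(12).reshape(4, 3)
idx = np.array([2, 0, 3, 1])    # forward mapping: A[idx] == B
B = A[idx]

rev_idx = np.argsort(idx)       # inverse permutation: B[rev_idx] == A
print((B[rev_idx] == A).all())  # → True
```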
As a single function:
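The function itself was not preserved; a self-contained sketch under the stated assumptions (the name find_row_mapping is my own):

```python
import numpy as np

def find_row_mapping(A, B):
    """Return idx such that A[idx] == B, assuming B's rows are a
    permutation of A's (unique) rows."""
    A = np.ascontiguousarray(A)
    B = np.ascontiguousarray(B)
    void_dt = np.dtype((np.void, A.dtype.itemsize * A.shape[1]))
    A_void = A.view(void_dt).ravel()
    B_void = B.view(void_dt).ravel()
    order = np.argsort(A_void)                    # sort rows of A
    pos = np.searchsorted(A_void[order], B_void)  # locate each B row in sorted A
    return order[pos]

A = np.arange(30).reshape(10, 3)
B = A[::-1]                       # reversed row order
idx = find_row_mapping(A, B)
print((A[idx] == B).all())  # → True
```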
Benchmark:
*The most costly step would be the quicksort over rows which is O(n log n) on average. I'm not sure it's possible to do any better than this.