I need to find unique rows in a numpy.array
.
For example:
>>> a # I have
array([[1, 1, 1, 0, 0, 0],
[0, 1, 1, 1, 0, 0],
[0, 1, 1, 1, 0, 0],
[1, 1, 1, 0, 0, 0],
[1, 1, 1, 1, 1, 0]])
>>> new_a # I want to get to
array([[1, 1, 1, 0, 0, 0],
[0, 1, 1, 1, 0, 0],
[1, 1, 1, 1, 1, 0]])
I know that i can create a set and loop over the array, but I am looking for an efficient pure numpy
solution. I believe that there is a way to set data type to void and then I could just use numpy.unique
, but I couldn't figure out how to make it work.
Why not use
drop_duplicates
from pandas:np.unique works by sorting a flattened array, then looking at whether each item is equal to the previous. This can be done manually without flattening:
This method does not use tuples, and should be much faster and simpler than other methods given here.
NOTE: A previous version of this did not have the ind right after a[, which mean that the wrong indices were used. Also, Joe Kington makes a good point that this does make a variety of intermediate copies. The following method makes fewer, by making a sorted copy and then using views of it:
This is faster and uses less memory.
Also, if you want to find unique rows in an ndarray regardless of how many dimensions are in the array, the following will work:
An interesting remaining issue would be if you wanted to sort/unique along an arbitrary axis of an arbitrary-dimension array, something that would be more difficult.
Edit:
To demonstrate the speed differences, I ran a few tests in ipython of the three different methods described in the answers. With your exact a, there isn't too much of a difference, though this version is a bit faster:
With a larger a, however, this version ends up being much, much faster:
If you want to avoid the memory expense of converting to a series of tuples or another similar data structure, you can exploit numpy's structured arrays.
The trick is to view your original array as a structured array where each item corresponds to a row of the original array. This doesn't make a copy, and is quite efficient.
As a quick example:
To understand what's going on, have a look at the intermediary results.
Once we view things as a structured array, each element in the array is a row in your original array. (Basically, it's a similar data structure to a list of tuples.)
Once we run
numpy.unique
, we'll get a structured array back:That we then need to view as a "normal" array (
_
stores the result of the last calculation inipython
, which is why you're seeing_.view...
):And then reshape back into a 2D array (
-1
is a placeholder that tells numpy to calculate the correct number of rows, give the number of columns):Obviously, if you wanted to be more concise, you could write it as:
Which results in:
I didn’t like any of these answers because none handle floating-point arrays in a linear algebra or vector space sense, where two rows being “equal” means “within some
The most straightforward solution is to make the rows a single item by making them strings. Each row then can be compared as a whole for its uniqueness using numpy. This solution is generalize-able you just need to reshape and transpose your array for other combinations. Here is the solution for the problem provided.
Will Give:
Send my nobel prize in the mail