I need to find unique rows in a numpy.array.
For example:
>>> a # I have
array([[1, 1, 1, 0, 0, 0],
       [0, 1, 1, 1, 0, 0],
       [0, 1, 1, 1, 0, 0],
       [1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0]])
>>> new_a # I want to get to
array([[1, 1, 1, 0, 0, 0],
       [0, 1, 1, 1, 0, 0],
       [1, 1, 1, 1, 1, 0]])
I know that I can create a set and loop over the array, but I am looking for an efficient pure numpy solution. I believe there is a way to set the data type to void and then I could just use numpy.unique, but I couldn't figure out how to make it work.
Another option to the use of structured arrays is using a view of a void type that joins the whole row into a single item:
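Something along these lines (a sketch; the variable names are mine):

import numpy as np

a = np.array([[1, 1, 1, 0, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 1, 1, 0]])

# View each row as one opaque np.void item so that np.unique compares
# whole rows (byte-wise) instead of individual elements.
b = np.ascontiguousarray(a).view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
_, idx = np.unique(b, return_index=True)

unique_a = a[idx]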
EDIT: Added np.ascontiguousarray following @seberg's recommendation. This will slow the method down if the array is not already contiguous.

EDIT: The above can be slightly sped up, perhaps at the cost of clarity, by doing:
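For example (reusing a and b from the sketch above; again illustrative rather than definitive):

# Deduplicate the void view directly and reinterpret the unique void items
# as rows, skipping the extra indexing pass through a.
unique_a = np.unique(b).view(a.dtype).reshape(-1, a.shape[1])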
Also, at least on my system, performance-wise it is on par with, or even better than, the lexsort method.
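For reference, the lexsort-based method being compared against looks roughly like this (my own sketch of that approach):

import numpy as np

a = np.array([[1, 1, 1, 0, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 1, 1, 0]])

# Sort the rows lexicographically, then keep every row that differs from
# its predecessor; identical rows end up adjacent after the sort.
order = np.lexsort(a.T)
a_sorted = a[order]
keep = np.concatenate(([True], np.any(a_sorted[1:] != a_sorted[:-1], axis=1)))
unique_a = a_sorted[keep]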
np.unique, when I run it on np.random.random(100).reshape(10,10), returns all the unique individual elements, but you want the unique rows, so first you need to put them into tuples (see the sketch below). That is the only way I see to change the types to do what you want, and I am not sure if the list iteration to change to tuples is okay with your "not looping through" requirement.
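A minimal sketch of that tuple conversion (the deduplication step here uses an order-preserving dict, which is my own choice; on recent NumPy versions, passing the tuple list straight to np.unique flattens it back into individual elements):

import numpy as np

a = np.array([[1, 1, 1, 0, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 1, 1, 0]])

# Turn each row into a hashable tuple, drop duplicates while keeping the
# order of first appearance (dict keys preserve insertion order), and
# rebuild the array from the surviving rows.
row_tuples = [tuple(row) for row in a]
new_a = np.array(list(dict.fromkeys(row_tuples)))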
I've compared the suggested alternatives for speed and found that, surprisingly, the void-view unique solution is even a bit faster than numpy's native unique with the axis argument. If you're looking for speed, you'll want the void view.

Code to reproduce the plot:
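A rough benchmarking sketch along these lines, using timeit and matplotlib (the helper names, the row counts, and the random test data are my own choices, so treat the resulting curves as indicative only):

import timeit

import matplotlib.pyplot as plt
import numpy as np


def unique_void_view(arr):
    # Collapse each row into one opaque void item, deduplicate, view back.
    arr = np.ascontiguousarray(arr)
    void_dt = np.dtype((np.void, arr.dtype.itemsize * arr.shape[1]))
    b = arr.view(void_dt).ravel()
    return np.unique(b).view(arr.dtype).reshape(-1, arr.shape[1])


def unique_axis(arr):
    # NumPy's built-in row-wise unique (available since NumPy 1.13).
    return np.unique(arr, axis=0)


sizes = [100, 1_000, 10_000, 100_000]
timings = {"void view": [], "np.unique(axis=0)": []}

rng = np.random.default_rng(0)
for n in sizes:
    data = rng.integers(0, 2, size=(n, 6))
    for label, func in [("void view", unique_void_view),
                        ("np.unique(axis=0)", unique_axis)]:
        timings[label].append(
            min(timeit.repeat(lambda: func(data), number=10, repeat=3))
        )

for label, times in timings.items():
    plt.loglog(sizes, times, marker="o", label=label)
plt.xlabel("number of rows")
plt.ylabel("time for 10 calls [s]")
plt.legend()
plt.show()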
Yet another possible solution: np.unique works when given a list of tuples.
With a list of lists it raises a TypeError: unhashable type: 'list'.
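Since the difference comes down to hashability (tuples can go into a set, lists cannot), an equivalent sketch that does not depend on how np.unique happens to treat non-array sequences is to deduplicate the row tuples with a plain set (my own variant, not necessarily the exact code intended here):

import numpy as np

a = np.array([[1, 1, 1, 0, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 1, 1, 0]])

# Tuples are hashable, so a set can deduplicate them; rows kept as lists
# would raise the same "unhashable type: 'list'" error mentioned above.
new_a = np.array(sorted({tuple(row) for row in a}))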
Let's get the entire numpy matrix as a list, then drop duplicates from this list, and finally return our unique list back into a numpy matrix:
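A minimal sketch of that (assuming the order of first appearance should be preserved; the quadratic membership test is fine for small arrays):

import numpy as np

a = np.array([[1, 1, 1, 0, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0, 0],
              [1, 1, 1, 1, 1, 0]])

# Work on plain Python lists: keep each row only the first time it is seen,
# then turn the deduplicated list back into a numpy array.
rows = a.tolist()
unique_rows = []
for row in rows:
    if row not in unique_rows:
        unique_rows.append(row)
new_a = np.array(unique_rows)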