I have an (N, 3) numpy array of values:
>>> vals = numpy.array([[1,2,3],[4,5,6],[7,8,7],[0,4,5],[2,2,1],[0,0,0],[5,4,3]])
>>> vals
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 7],
       [0, 4, 5],
       [2, 2, 1],
       [0, 0, 0],
       [5, 4, 3]])
I'd like to remove any row that contains a duplicate value (i.e. the same value appears more than once within the row). For example, the result for the above array should be:
>>> duplicates_removed
array([[1, 2, 3],
       [4, 5, 6],
       [0, 4, 5],
       [5, 4, 3]])
I'm not sure how to do this efficiently with numpy without looping (the array could be quite large). Anyone know how I could do this?
Here's an approach that handles a generic number of columns and is still vectorized -
Steps:

1. Sort along each row.
2. Look at the differences between consecutive elements in each sorted row. Any row with at least one zero difference contains a duplicate element. Use this to get a mask of valid rows.
3. Select the valid rows from the input array using the mask.
Sample run -
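The sample-run code isn't included in this excerpt; a minimal sketch of the steps above, applied to the question's vals array (variable names are my own), could look like this:

>>> import numpy
>>> vals = numpy.array([[1,2,3],[4,5,6],[7,8,7],[0,4,5],[2,2,1],[0,0,0],[5,4,3]])
>>> # Step 1: sort along each row.
>>> sorted_rows = numpy.sort(vals, axis=1)
>>> # Step 2: a row whose consecutive sorted elements are all different has no duplicates.
>>> mask = (numpy.diff(sorted_rows, axis=1) != 0).all(axis=1)
>>> # Step 3: select the valid rows with the mask.
>>> vals[mask]
array([[1, 2, 3],
       [4, 5, 6],
       [0, 4, 5],
       [5, 4, 3]])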
It's six years on, but this question helped me, so I ran a speed comparison of the answers given by Divakar, Benjamin, Marcelo Cantos and Curtis Patrick.
Results:
It seems that using set beats numpy.unique. In my case I needed to do this over a much larger array:

The methods without list comprehensions are much faster. However, the number of rows is hard-coded and difficult to extend to more than three columns, so in my case at least the list comprehension with the set is the best answer.
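The larger test array, the timing code, and the results themselves aren't reproduced in this excerpt; the set-based list comprehension being compared would look roughly like this sketch on the question's small vals array:

>>> import numpy
>>> vals = numpy.array([[1,2,3],[4,5,6],[7,8,7],[0,4,5],[2,2,1],[0,0,0],[5,4,3]])
>>> # A row containing a repeated value collapses to a set shorter than the row.
>>> numpy.array([v for v in vals if len(set(v)) == len(v)])
array([[1, 2, 3],
       [4, 5, 6],
       [0, 4, 5],
       [5, 4, 3]])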
EDITED because I confused rows and columns in bigvals.
This is an option:
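The code that went with this answer isn't included in the excerpt. Judging from the comparison above, which mentions a method with hard-coded indices that is difficult to extend past three columns, one sketch consistent with that description (not necessarily this answer's exact code) is:

>>> import numpy
>>> vals = numpy.array([[1,2,3],[4,5,6],[7,8,7],[0,4,5],[2,2,1],[0,0,0],[5,4,3]])
>>> # Compare each pair of the three columns; keep rows where every pair differs.
>>> # The column indices are hard-coded, so this doesn't generalize easily.
>>> mask = (vals[:,0] != vals[:,1]) & (vals[:,1] != vals[:,2]) & (vals[:,0] != vals[:,2])
>>> vals[mask]
array([[1, 2, 3],
       [4, 5, 6],
       [0, 4, 5],
       [5, 4, 3]])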
Identical to Marcelo, but I think using numpy.unique() instead of set() may get across exactly what you are shooting for.

Mind you, this still loops behind the scenes. You can't avoid that. But it should work fine even for millions of rows.
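This answer's own code isn't shown above; a sketch of the numpy.unique() variant of the list comprehension it describes would be:

>>> import numpy
>>> vals = numpy.array([[1,2,3],[4,5,6],[7,8,7],[0,4,5],[2,2,1],[0,0,0],[5,4,3]])
>>> # numpy.unique drops repeated values, so a shorter result flags a duplicate in the row.
>>> numpy.array([v for v in vals if len(numpy.unique(v)) == len(v)])
array([[1, 2, 3],
       [4, 5, 6],
       [0, 4, 5],
       [5, 4, 3]])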