Set duplicate elements to zero

Posted 2020-07-24 05:14

Question:

How can I convert the duplicate elements in an array 'data' into 0? It has to be done row-wise.

data = np.array([[1,8,3,3,4],
                 [1,8,9,9,4]])

The answer should be as follows:

ans = array([[1,8,3,0,4],
             [1,8,9,0,4]])

Answer 1:

Approach #1

One approach with np.unique -

# Find the unique elements and the positions of their first occurrence
unq_data, idx = np.unique(data,return_index=True)

# All remaining positions hold duplicates
dup_idx = np.setdiff1d(np.arange(data.size),idx)

# Set those duplicate-positioned elements to 0s
data[dup_idx] = 0

Sample run -

In [46]: data
Out[46]: array([1, 8, 3, 3, 4, 1, 3, 3, 9, 4])

In [47]: unq_data, idx = np.unique(data,return_index=True)
    ...: dup_idx = np.setdiff1d(np.arange(data.size),idx)
    ...: data[dup_idx] = 0
    ...: 

In [48]: data
Out[48]: array([1, 8, 3, 0, 4, 0, 0, 0, 9, 0])
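Approach #1 operates on a flat array; to answer the row-wise question directly, a minimal sketch of my own (the wrapper name is hypothetical, not part of the original answer) simply applies the same steps to each row:

import numpy as np

def zero_dups_per_row(data):
    # Zero duplicates within each row using np.unique, in place
    for row in data:                                        # each row is a view into data
        _, idx = np.unique(row, return_index=True)          # first occurrences
        dup_idx = np.setdiff1d(np.arange(row.size), idx)    # everything else is a duplicate
        row[dup_idx] = 0
    return data

zero_dups_per_row(np.array([[1, 8, 3, 3, 4],
                            [1, 8, 9, 9, 4]]))
# -> array([[1, 8, 3, 0, 4],
#           [1, 8, 9, 0, 4]])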

Approach #2

You can also use sorting and differencing (np.diff) as a faster approach -

# Get the indices that would sort data
sort_idx = np.argsort(data)

# Get duplicate indices and set those elements in data to 0s
dup_idx = sort_idx[1::][np.diff(np.sort(data))==0]
data[dup_idx] = 0
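As a quick sanity check on the same 1-D sample as above (note that kind='stable' is my addition; it guarantees the first occurrence of each value is the one kept, which the default sort does not promise for ties):

import numpy as np

data = np.array([1, 8, 3, 3, 4, 1, 3, 3, 9, 4])

sort_idx = np.argsort(data, kind='stable')           # stable sort: first occurrences come first
dup_idx = sort_idx[1:][np.diff(np.sort(data)) == 0]  # positions of repeated values
data[dup_idx] = 0

print(data)   # [1 8 3 0 4 0 0 0 9 0], same result as Approach #1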

Runtime tests -

In [110]: data = np.random.randint(0,100,(10000))
     ...: data1 = data.copy()
     ...: data2 = data.copy()
     ...: 

In [111]: def func1(data):
     ...:     unq_data, idx = np.unique(data,return_index=True)
     ...:     dup_idx = np.setdiff1d(np.arange(data.size),idx)
     ...:     data[dup_idx] = 0
     ...: 
     ...: def func2(data):
     ...:     sort_idx = np.argsort(data)
     ...:     dup_idx = sort_idx[1::][np.diff(np.sort(data))==0]
     ...:     data[dup_idx] = 0
     ...:     

In [112]: %timeit func1(data1)
1000 loops, best of 3: 1.36 ms per loop

In [113]: %timeit func2(data2)
1000 loops, best of 3: 467 µs per loop

Extending to a 2D case:

Approach #2 can be extended to work on a 2D array row-wise without any loop, like so -

# Get the indices that would sort each row of data
sort_idx = np.argsort(data,axis=1)

# Get sorted linear indices
row_offset = data.shape[1]*np.arange(data.shape[0])[:,None]
sort_lin_idx = sort_idx[:,1::] + row_offset

# Get duplicate linear indices and set those elements in data to 0s
dup_lin_idx = sort_lin_idx[np.diff(np.sort(data,axis=1),axis=1)==0]
data.ravel()[dup_lin_idx] = 0

Sample run -

In [6]: data
Out[6]: 
array([[1, 8, 3, 3, 4, 0, 3, 3],
       [1, 8, 9, 9, 4, 8, 7, 9],
       [1, 8, 9, 9, 4, 8, 7, 3]])

In [7]: sort_idx = np.argsort(data,axis=1)
   ...: row_offset = data.shape[1]*np.arange(data.shape[0])[:,None]
   ...: sort_lin_idx = sort_idx[:,1::] + row_offset
   ...: dup_lin_idx = sort_lin_idx[np.diff(np.sort(data,axis=1),axis=1)==0]
   ...: data.ravel()[dup_lin_idx] = 0
   ...: 

In [8]: data
Out[8]: 
array([[1, 8, 3, 0, 4, 0, 0, 0],
       [1, 8, 9, 0, 4, 0, 7, 0],
       [1, 8, 9, 0, 4, 0, 7, 3]])
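For convenience, those steps can be wrapped into a single in-place function (the name zero_row_dups is mine; the body is just the code above):

import numpy as np

def zero_row_dups(data):
    # Set duplicates within each row of a 2D array to 0, in place.
    # Assumes data is C-contiguous so that ravel() returns a view, not a copy.
    sort_idx = np.argsort(data, axis=1)
    row_offset = data.shape[1] * np.arange(data.shape[0])[:, None]
    sort_lin_idx = sort_idx[:, 1:] + row_offset
    dup_lin_idx = sort_lin_idx[np.diff(np.sort(data, axis=1), axis=1) == 0]
    data.ravel()[dup_lin_idx] = 0
    return data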


Answer 2:

Here's a simple pure-Python way to do it:

seen = set()
for i, x in enumerate(data):
    if x in seen:
        data[i] = 0
    else:
        seen.add(x)
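As written this treats data as a flat sequence; for the 2-D, row-wise case in the question, a minimal variation of my own resets the seen set for every row:

import numpy as np

data = np.array([[1, 8, 3, 3, 4],
                 [1, 8, 9, 9, 4]])

for row in data:                 # each row is a view, so assignments modify data
    seen = set()
    for i, x in enumerate(row):
        if x in seen:
            row[i] = 0
        else:
            seen.add(x)

print(data)
# [[1 8 3 0 4]
#  [1 8 9 0 4]]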


Answer 3:

You could use a nested for loop that compares each element of the array to every later element to check for duplicates. The syntax might be a bit off, as I am not really familiar with numpy.

for x in range(0, len(data)):
    for y in range(x + 1, len(data)):
        if data[x] == data[y]:
            data[y] = 0  # zero the later occurrence, keeping the first


Answer 4:

@Divakar has it almost right, but there are a few things that can be further optimized that don't really fit in a comment. To begin:

rows, cols = data.shape

The first operation is to sort the array to identify the duplicates. Since we will want to undo the sorting, we need to use np.argsort, but if you want to make sure that it is the first occurrence of each repeated value that is kept, you need to use a stable sorting algorithm:

sort_idx = data.argsort(axis=1, kind='mergesort')

Once we have the indices that sort data, it is faster to use them to get a sorted copy of the array than to re-sort it:

sorted_data = data[np.arange(rows)[:, None], sort_idx]

While the principle is similar to using np.diff, it is typically faster to use boolean comparisons. We want a mask that is False where the first occurrence of each value sits, and True where the duplicates are:

sorted_mask = np.concatenate((np.zeros((rows, 1), dtype=bool),
                              sorted_data[:, :-1] == sorted_data[:, 1:]),
                             axis=1)
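For instance, on the first sample row from the question the mask flags only the repeated 3 (a small illustration of my own, not part of the original answer):

import numpy as np

row = np.array([[1, 8, 3, 3, 4]])
srt = np.sort(row, axis=1)                      # [[1 3 3 4 8]]
mask = np.concatenate((np.zeros((1, 1), dtype=bool),
                       srt[:, :-1] == srt[:, 1:]), axis=1)
print(mask)                                     # [[False False  True False False]]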

We now use that mask to set all the duplicates to zero:

sorted_data[sorted_mask] = 0

And we finally undo the sorting. To invert a permutation you can argsort the indices that define it, i.e. you could do:

invert_idx = sort_idx.argsort(axis=1, kind='mergesort')
ans = sorted_data[np.arange(rows)[:, None], invert_idx]

But it is more efficient to use assignment, i.e.:

ans = np.empty_like(data)
ans[np.arange(rows)[:, None], sort_idx] = sorted_data

Putting it all together:

def zero_dups(data):
    rows, cols = data.shape
    sort_idx = data.argsort(axis=1, kind='mergesort')
    sorted_data = data[np.arange(rows)[:, None], sort_idx]
    sorted_mask = np.concatenate((np.zeros((rows, 1), dtype=bool),
                                  sorted_data[:, :-1] == sorted_data[:, 1:]),
                                 axis=1)
    sorted_data[sorted_mask] = 0
    ans = np.empty_like(data)
    ans[np.arange(rows)[:, None], sort_idx] = sorted_data

    return ans
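A quick check against the sample data from the question (this usage example is mine):

import numpy as np

data = np.array([[1, 8, 3, 3, 4],
                 [1, 8, 9, 9, 4]])

print(zero_dups(data))
# [[1 8 3 0 4]
#  [1 8 9 0 4]]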


Tags: python numpy