Numpy : Grouping/ binning values based on associat

2019-08-01 17:29发布

Forgive me for a vague title. I honestly don't know which title will suit this question. If you have a better title, let's change it so that it will be apt for the problem at hand.

The problem.

Let's say result is a 2D array and values is a 1D array. values holds some values associated with each element in result. The mapping of an element in values to result is stored in x_mapping and y_mapping. A position in result can be associated with different values. Now, I have to find the sum of the values grouped by associations.

An example for better clarification.

result array:

[[0, 0],
[0, 0],
[0, 0],
[0, 0]]

values array:

[ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.]

Note: Here result and values have the same number of elements. But it might not be the case. There is no relation between the sizes at all.

x_mapping and y_mapping have mappings from 1D values to 2D result. The sizes of x_mapping, y_mapping and values will be the same.

x_mapping - [0, 1, 0, 0, 0, 0, 0, 0]

y_mapping - [0, 3, 2, 2, 0, 3, 2, 1]

Here, 1st value(values[0]) have x as 0 and y as 0(x_mapping[0] and y_mappping[0]) and hence associated with result[0, 0]. If we are counting the number of associations, then element value at result[0,0] will be 2 as 1st value and 5th value are associated with result[0, 0]. If we are taking the sum, the result[0, 0] = value[0] + value[4] which is 6.

Current solution

# Initialisation. No connection with the solution.
result = np.zeros([4,2], dtype=np.int16)

values =  np.linspace(start=1, stop=8, num=8)
y_mapping = np.random.randint(low=0, high=values.shape[0], size=values.shape[0])
x_mapping = np.random.randint(low=0, high=values.shape[1], size=values.shape[0])
# Summing the values associated with x,y (current solution.)
for i in range(values.size):
    x = x_mapping[i]
    y = y_mapping[i]
    result[-y, x] = result[-y, x] + values[i]

The result,

[[6, 0],
[ 6, 2],
[14, 0],
[ 8, 0]]

Failed solution; But why?

test_result = np.zeros_like(result)
test_result[-y_mapping, x_mapping] = test_result[-y_mapping, x_mapping] + values # solution

To my surprise elements are overwritten in test_result. Values at test_result,

[[5, 0],
[6, 2],
[7, 0],
[8, 0]]

Question

1. Why, in the second solution, every element is overwritten?

As @Divakar has pointed out in the comment in his answer - NumPy doesn't assign accumulated/summed values when the indices are repeated in test_result[-y_mapping, x_mapping] =. It randomly assigns from one of the instances.

2. Is there any Numpy way to do this? That is without looping? I'm looking for some speed optimization.

Approach #2 in @Divakar's answer gives me good results. For 23315 associations, for loop took 50 ms while Approach #1 took 1.85 ms. Beating all these, Approach #2 took 668 µs.

Side note

I'm using Numpy version 1.14.3 with Python 3.5.2 on an i7 processor.

2条回答
Bombasti
2楼-- · 2019-08-01 17:58

Approach #1

Most intutive one would be with np.add.at for those repeated indices -

np.add.at(result, [-y_mapping, x_mapping], values)

Approach #2

We need to perform binned summations owing to the possible repeated nature of x,y indices. Hence, another way could be to use NumPy's binned summation func : np.bincount and have an implementation like so -

# Get linear index equivalents off the x and y indices into result array
m,n = result.shape
out_dtype = result.dtype
lidx = ((-y_mapping)%m)*n + x_mapping

# Get binned summations off values based on linear index as bins
binned_sums = np.bincount(lidx, values, minlength=m*n)

# Finally add into result array
result += binned_sums.astype(result.dtype).reshape(m,n)

If you are always starting off with a zeros array for result, the last step could be made more performant with -

result = binned_sums.astype(out_dtype).reshape(m,n)
查看更多
可以哭但决不认输i
3楼-- · 2019-08-01 18:06

I guess you were to write

y_mapping = np.random.randint(low=0, high=result.shape[0], size=values.shape[0])
x_mapping = np.random.randint(low=0, high=result.shape[1], size=values.shape[0])

With that correction, the code works for me as expected.

查看更多
登录 后发表回答