Let's say I have an array like:
arr = np.array([[1,20,5],
[1,20,8],
[3,10,4],
[2,30,6],
[3,10,5]])
and I would like to form a dictionary of the sum of the third column for each row that matches each value in the first column, i.e. return {1: 13, 2: 6, 3: 9}
. To make matters more challenging, there's 1 billion rows in my array and 100k unique elements in the first column.
Approach 1: Naively, I can invoke np.unique()
then iterate through each item in the unique array with a combination of np.where()
and np.sum()
in a one-liner dictionary enclosing a list comprehension. This would be reasonably fast if I have a small number of unique elements, but at 100k unique elements, I will incur a lot of wasted page fetches making 100k I/O passes of the entire array.
Approach 2: I could make a single I/O pass of the last column (because having to hash column 1 at each row will probably be cheaper than the excessive page fetches) too, but I lose the advantage of numpy's C inner loop vectorization here.
Is there a fast way to implement Approach 2 without resorting to a pure Python loop?
Here's a NumPy based approach using
np.add.reduceat
-If you would like to get the keys and values in a 2-column array -
Sample run -
The proper way of doing this is with NumPy is using
np.bincount
. If your unique first column labels are already small contiguous integers you could simply do:where the
cum_sum != 0
is an attempt to skip missing first column labels, which may be grossly wrong if your third column includes negative numbers.Alternatively you can do things properly and call
np.unique
first and do:This is a typical grouping problem, which the numpy_indexed package solves efficiently and elegantly (if I may say so myself; I am its author)
Its a much more lightweight solution than the pandas package, and I think the syntax is cleaner, as there is no need to create a special datastructure just to perform this type of elementary operation. Performance should be identical to the solution posed by Divakar, since it follows the same steps; just with a nice and tested interface on top.
numpy approach:
pandas approach: